squirrel.driver
Submodules
Package Contents
Classes
CsvDriver | Drives the access to a data source.
DataFrameDriver | Drives the access to a data source.
Driver | Drives the access to a data source.
FileDriver | Drives the access to a data source.
IterDriver | A Driver that allows iteration over the items in the data source.
JsonlDriver | A StoreDriver that by default uses SquirrelStore with jsonl serialization.
MapDriver | A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.
MessagepackDriver | A StoreDriver that by default uses SquirrelStore with messagepack serialization.
SourceCombiner | A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.
StoreDriver | A MapDriver implementation, which uses an AbstractStore instance to retrieve its items.
ZarrDriver | A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.
-
class squirrel.driver.CsvDriver(path: str, **kwargs)
Bases: squirrel.driver.file_driver.FileDriver, squirrel.driver.driver.DataFrameDriver
Drives the access to a data source.
Initializes CsvDriver.
- Parameters
path (str) – Path to a .csv file.
**kwargs – Keyword arguments passed to the super class initializer.
-
name = csv
-
get_df(self, **kwargs) → dask.dataframe.DataFrame
Returns the data in the .csv file as a Dask DataFrame.
- Parameters
**kwargs – Keyword arguments passed to dask.dataframe.read_csv() to read the .csv file.
- Returns
(dask.dataframe.DataFrame) Dask DataFrame constructed from the .csv file.
-
get_iter(self, itertuples_kwargs: Optional[Dict] = None, read_csv_kwargs: Optional[Dict] = None) → squirrel.iterstream.Composable
Returns an iterator over rows.
Note that the csv file is first read into a DataFrame and then df.itertuples() is called.
- Parameters
itertuples_kwargs – Keyword arguments to be passed to dask.dataframe.DataFrame.itertuples().
read_csv_kwargs – Keyword arguments to be passed to dask.dataframe.read_csv().
- Returns
(squirrel.iterstream.Composable) Iterable over the rows of the data frame as namedtuples.
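The row-wise iteration described above can be pictured with a small stand-in that uses only the standard library. This is a sketch, not CsvDriver itself: it does not use squirrel or dask, the helper name iter_rows_as_namedtuples and the sample data are hypothetical, and unlike df.itertuples() it adds no index field and leaves every value as a string.

```python
import csv
import io
from collections import namedtuple
from typing import Iterator


def iter_rows_as_namedtuples(csv_text: str, name: str = "Row") -> Iterator[tuple]:
    """Yield each CSV row as a namedtuple whose fields come from the header line."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first line holds the column names
    row_type = namedtuple(name, header)
    for values in reader:
        yield row_type(*values)


# Hypothetical in-memory data standing in for a .csv file on disk.
SAMPLE_CSV = "city,population\nBerlin,3645000\nHamburg,1841000\n"

rows = list(iter_rows_as_namedtuples(SAMPLE_CSV))
print(rows[0].city, rows[0].population)
```

Fields are accessed by attribute name, which is what makes namedtuple rows convenient in a downstream stream of samples.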
-
class squirrel.driver.DataFrameDriver(catalog: Optional[squirrel.catalog.Catalog] = None, **kwargs)
Bases: Driver
Drives the access to a data source.
Initializes driver with a catalog and arbitrary kwargs.
-
abstract get_df(self, **kwargs) → pandas.DataFrame
Returns a dataframe of the data.
-
-
class squirrel.driver.Driver(catalog: Optional[squirrel.catalog.Catalog] = None, **kwargs)
Bases: abc.ABC
Drives the access to a data source.
Initializes driver with a catalog and arbitrary kwargs.
-
name: str
-
-
class squirrel.driver.FileDriver(path: str, **kwargs)
Bases: squirrel.driver.driver.Driver
Drives the access to a data source.
Initializes FileDriver.
- Parameters
path (str) – Path to a file.
**kwargs – Keyword arguments passed to the super class initializer.
-
name = file
-
open(self, mode: str = 'r', create_if_not_exists: bool = False, **kwargs) → IO
Returns a handler for the file.
Uses squirrel.fsspec.fs.get_fs_from_url() to get a filesystem object corresponding to self.path. Simply returns the handler returned from the open() method of the filesystem.
- Parameters
mode (str) – IO mode to use when opening the file. Will be forwarded to filesystem.open() method. Defaults to “r”.
create_if_not_exists (bool) – If True, the file will be created if it does not exist (along with the parent directories). This is achieved by providing auto_mkdir=create_if_not_exists as a storage option to the filesystem. Defaults to False.
**kwargs – Keyword arguments that are passed to the filesystem.open() method.
- Returns
(IO) File handler for the file at self.path.
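To illustrate what create_if_not_exists is meant to achieve, here is a minimal local-filesystem stand-in. It is a sketch under assumptions: it uses os and the built-in open() rather than squirrel or fsspec, the function name open_file is hypothetical, and it only creates missing parent directories (the file itself appears when it is opened in a writing mode).

```python
import os
import tempfile
from typing import IO


def open_file(path: str, mode: str = "r", create_if_not_exists: bool = False, **kwargs) -> IO:
    """Open a local file, optionally creating missing parent directories first."""
    if create_if_not_exists:
        parent = os.path.dirname(path)
        if parent:
            # Rough analogue of passing auto_mkdir=True as a storage option.
            os.makedirs(parent, exist_ok=True)
    return open(path, mode, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "nested", "dir", "notes.txt")
    # The parent directories do not exist yet, so they are created on demand.
    with open_file(target, mode="w", create_if_not_exists=True) as f:
        f.write("hello")
    with open_file(target) as f:
        content = f.read()
```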
-
class squirrel.driver.IterDriver(catalog: Optional[squirrel.catalog.Catalog] = None, **kwargs)
Bases: Driver
A Driver that allows iteration over the items in the data source.
Items can be iterated over using the get_iter() method.
Initializes driver with a catalog and arbitrary kwargs.
-
abstract get_iter(self, **kwargs) → squirrel.iterstream.Composable
Returns an iterable of items in the form of a Composable, which allows various stream manipulation functionalities.
The order of the items in the iterable may or may not be randomized, depending on the implementation and kwargs.
-
-
class squirrel.driver.JsonlDriver(url: str, deser_hook: Optional[Callable] = None, **kwargs)
Bases: squirrel.driver.store_driver.StoreDriver
A StoreDriver that by default uses SquirrelStore with jsonl serialization.
Initializes JsonlDriver with default serializer.
- Parameters
url (str) – Path to the root directory. If this path does not exist, it will be created.
deser_hook (Callable) – Callable that is passed as object_hook to JsonDecoder during json deserialization. Defaults to None.
**kwargs – Keyword arguments passed to the super class initializer.
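As an illustration of what an object_hook-style deser_hook can do, the sketch below decodes JSON lines with the standard library json module. It does not use JsonlDriver; the names tuple_hook and read_jsonl and the sample records are hypothetical, and the assumption is that the hook is called with every decoded JSON object (dict), as json.loads does for its object_hook argument.

```python
import json
from typing import Any, Callable, Dict, Iterator, Optional


def tuple_hook(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical hook: turn the 'shape' list of a decoded object back into a tuple."""
    if "shape" in obj:
        obj["shape"] = tuple(obj["shape"])
    return obj


def read_jsonl(text: str, deser_hook: Optional[Callable] = None) -> Iterator[Any]:
    """Decode one JSON document per non-empty line, applying deser_hook to each object."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line, object_hook=deser_hook)


# Hypothetical jsonl content: one sample per line.
JSONL = '{"id": 1, "shape": [2, 3]}\n{"id": 2, "shape": [4, 5]}\n'

samples = list(read_jsonl(JSONL, deser_hook=tuple_hook))
print(samples[0]["shape"])
```

JSON has no tuple type, so a hook like this is one way to restore Python-side types that were flattened during serialization.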
-
name = jsonl
-
get_iter(self, get_kwargs: Optional[Dict] = None, **kwargs) → squirrel.iterstream.Composable
Returns an iterable of samples as specified by fetcher_func.
- Parameters
get_kwargs (Dict) – Keyword arguments that will be passed as get_kwargs to MapDriver.get_iter(). get_kwargs will always have compression=”gzip”. Defaults to None.
**kwargs – Other keyword arguments that will be passed to MapDriver.get_iter().
- Returns
(Composable) Iterable over the items in the store.
-
class squirrel.driver.MapDriver(catalog: Optional[squirrel.catalog.Catalog] = None, **kwargs)
Bases: IterDriver
A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.
Initializes driver with a catalog and arbitrary kwargs.
-
abstract get(self, key: Any, **kwargs) → Any
Returns an iterable over the items corresponding to key.
Note that it is possible to implement this method according to your needs. There is no restriction on the type or number of items. For example, a key might correspond to a single item that holds a single sample or to a single item that contains one shard of multiple samples.
If the method returns a single sample, then the get_iter() method should be called with flatten=False since the stream does not need to be flattened. Otherwise, e.g. if the method returns an iterable of samples, then the get_iter() method should be called with flatten=True if it is desirable to have individual samples in the iterstream.
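The interplay between get() and the flatten flag can be sketched with a simplified stand-in. This is not squirrel's MapDriver: SketchMapDriver and ShardDriver are hypothetical classes, get_iter() here returns a plain iterator rather than a Composable, and all other get_iter() parameters are omitted.

```python
from abc import ABC, abstractmethod
from itertools import chain
from typing import Any, Dict, Iterable, Iterator, List


class SketchMapDriver(ABC):
    """Hypothetical, simplified key-based driver used only for illustration."""

    @abstractmethod
    def keys(self) -> Iterable:
        """Return the keys that can be passed to get()."""

    @abstractmethod
    def get(self, key: Any) -> Any:
        """Return the item (a sample or a shard of samples) stored under key."""

    def get_iter(self, flatten: bool = False) -> Iterator:
        items = (self.get(key) for key in self.keys())
        # With flatten=True each fetched item is itself iterated over.
        return chain.from_iterable(items) if flatten else items


class ShardDriver(SketchMapDriver):
    """Each key maps to one shard, i.e. a list of samples."""

    def __init__(self, shards: Dict[str, List[int]]) -> None:
        self._shards = shards

    def keys(self) -> Iterable:
        return self._shards.keys()

    def get(self, key: Any) -> List[int]:
        return self._shards[key]


driver = ShardDriver({"shard_0": [1, 2], "shard_1": [3]})
unflattened = list(driver.get_iter(flatten=False))  # one entry per shard
flattened = list(driver.get_iter(flatten=True))  # one entry per sample
```

When get() returns shards, flatten=True yields individual samples; when get() already returns single samples, flattening is unnecessary.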
-
get_iter(self, keys_iterable: Optional[Iterable] = None, shuffle_key_buffer: int = 1, key_hooks: Optional[Iterable[Union[Callable, Type[squirrel.iterstream.Composable], functools.partial]]] = None, max_workers: Optional[int] = None, prefetch_buffer: int = 10, shuffle_item_buffer: int = 1, flatten: bool = False, keys_kwargs: Optional[Dict] = None, get_kwargs: Optional[Dict] = None, key_shuffle_kwargs: Optional[Dict] = None, item_shuffle_kwargs: Optional[Dict] = None) → squirrel.iterstream.Composable
Returns an iterable of items in the form of a squirrel.iterstream.Composable, which allows various stream manipulation functionalities.
Items are fetched using the get() method. The returned Composable iterates over the items in the order of the keys returned by the keys() method.
- Parameters
keys_iterable (Iterable, optional) – If provided, only the keys in keys_iterable will be used to fetch items. If not provided, all keys in the store are used.
shuffle_key_buffer (int) – Size of the buffer used to shuffle keys.
key_hooks (Iterable[Union[Callable, Type[Composable], functools.partial]], optional) – Hooks to apply to keys before fetching the items. It is an Iterable of any of these objects:
1) A subclass of Composable: in this case, .compose(hook, **kw) will be applied to the stream.
2) A Callable: .to(hook, **kw) will be applied to the stream.
3) A partial function: the three attributes args, keywords and func will be retrieved, and depending on whether func is a subclass of Composable or a Callable, one of the above cases will happen, with the only difference that the arguments are passed too. This is useful for passing arguments.
max_workers (int, optional) – If larger than 1 or None, async_map() is called to fetch multiple items simultaneously and max_workers refers to the maximum number of workers in the ThreadPoolExecutor used by async_map. Otherwise, map() is called and max_workers is not used. Defaults to None.
prefetch_buffer (int) – Size of the buffer used for prefetching items if async_map is used. See max_workers for more details. Please be aware of the memory footprint when setting this parameter.
shuffle_item_buffer (int) – Size of the buffer used to shuffle items after being fetched. Please be aware of the memory footprint when setting this parameter.
flatten (bool) – Whether to flatten the returned iterable. Defaults to False.
keys_kwargs (Dict, optional) – Keyword arguments passed to keys() when getting the keys in the store. Not used if keys_iterable is provided. Defaults to None.
get_kwargs (Dict, optional) – Keyword arguments passed to get() when fetching items. Defaults to None.
key_shuffle_kwargs (Dict, optional) – Keyword arguments passed to shuffle() when shuffling keys. Defaults to None. Can be useful to e.g. set the seed.
item_shuffle_kwargs (Dict, optional) – Keyword arguments passed to shuffle() when shuffling items. Defaults to None. Can be useful to e.g. set the seed.
- Returns
(squirrel.iterstream.Composable) Iterable over the items in the store.
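The memory note on the shuffle buffers can be made concrete with a generic buffered shuffle. This is a sketch of the general technique, not squirrel's shuffle(): the function name buffered_shuffle is hypothetical, and treating a buffer size below 2 as "no shuffling" is an assumption of this sketch. At most bufsize items are held in memory at any time, which is why larger buffers give better mixing at a higher memory cost.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], bufsize: int, rng: Optional[random.Random] = None) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most bufsize items in memory."""
    if rng is None:
        rng = random.Random()
    if bufsize < 2:
        # Assumption of this sketch: a buffer of size 1 leaves the order unchanged.
        yield from items
        return
    buffer: List[T] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= bufsize:
            # Emit a randomly chosen element once the buffer is full.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), bufsize=4, rng=random.Random(0)))
```

Passing a seeded random.Random makes the order reproducible, which is the kind of setting key_shuffle_kwargs and item_shuffle_kwargs are described as forwarding.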
-
abstract keys(self, **kwargs) → Iterable
Returns an iterable of the keys for the objects that are obtainable through the driver.
-
-
class squirrel.driver.MessagepackDriver(url: str, **kwargs)
Bases: squirrel.driver.store_driver.StoreDriver
A StoreDriver that by default uses SquirrelStore with messagepack serialization.
Initializes MessagepackDriver with default serializer.
-
name = messagepack
-
-
class squirrel.driver.SourceCombiner(subsets: Dict[str, squirrel.catalog.CatalogKey], catalog: squirrel.catalog.Catalog, **kwargs)
Bases: squirrel.driver.driver.MapDriver, squirrel.driver.csv_driver.DataFrameDriver
A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.
Initializes SourceCombiner.
- Parameters
subsets (Dict[str, CatalogKey]) – Keys define the names of the subsets, values are tuples of the corresponding (catalog entry, version) combinations.
catalog (Catalog) – The parent catalog which the subset sources are part of.
**kwargs – Keyword arguments to be passed to the super class.
-
name = source_combiner
-
get(self, subset: str, key: Any, **kwargs) → Iterable
Routes to the get() method of the appropriate subset driver.
-
get_df(self, subset: str, **kwargs) → dask.dataframe.DataFrame
Routes to the get_df() method of the appropriate subset driver.
- Parameters
subset (str) – Id of the subset in this source definition.
**kwargs – Keyword arguments passed to the subset driver.
- Returns
(DataFrame) Data of the subset driver subset as a Dask DataFrame.
-
get_iter(self, subset: Optional[str] = None, **kwargs) → squirrel.iterstream.Composable
Routes to the get_iter() method of the appropriate subset driver.
- Parameters
subset (str) – Id of the subset in this source definition. If None, interleaves iterables obtained from all subset drivers.
**kwargs – Keyword arguments passed to the subset driver.
- Returns
(Composable) Iterable over the items of subset driver(s) in the form of a Composable.
-
get_iter_sampler(self, probs: Optional[List[float]] = None, rng: Optional[random.Random] = None, seed: Optional[int] = None, **kwargs) → squirrel.iterstream.Composable
Returns an iterstream that samples from the subsets of this source.
- Parameters
rng (random.Random) – A random number generator.
probs (List[float]) – List of probabilities to sample from the subsets. If None, sample uniformly.
**kwargs – Keyword arguments passed to the get_iter() method of each subset driver.
- Returns
(Composable) Iterable over samples randomly sampled from subsets.
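Probability-weighted sampling across subsets can be sketched as follows. This is a plain-Python illustration, not the SourceCombiner implementation: the function name sample_from_subsets is hypothetical, the subsets are ordinary iterables instead of subset drivers, and dropping an exhausted subset while continuing with the rest is an assumption of this sketch.

```python
import random
from typing import Dict, Iterable, Iterator, List, Optional


def sample_from_subsets(
    subsets: Dict[str, Iterable],
    probs: Optional[List[float]] = None,
    rng: Optional[random.Random] = None,
) -> Iterator:
    """Yield items by repeatedly picking a subset at random and taking its next item."""
    if rng is None:
        rng = random.Random()
    iterators = [iter(it) for it in subsets.values()]
    # Uniform weights when no probabilities are given.
    weights = list(probs) if probs is not None else [1.0] * len(iterators)
    while iterators:
        idx = rng.choices(range(len(iterators)), weights=weights, k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            # Assumption: an exhausted subset is removed and sampling continues.
            del iterators[idx]
            del weights[idx]


mixed = list(
    sample_from_subsets(
        {"train": [1, 2, 3], "val": [10, 20]},
        probs=[0.7, 0.3],
        rng=random.Random(42),
    )
)
```

Within each subset the original order is preserved; only the interleaving between subsets is random.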
-
get_source(self, subset: str) → squirrel.catalog.source.Source
Returns subset source based on subset id.
-
get_store(self, subset: str) → squirrel.store.AbstractStore
Returns the store of the appropriate subset driver.
- Parameters
subset (str) – Id of the subset in this source definition.
- Returns
(AbstractStore) Store of the subset driver subset.
-
keys(self, subset: str, **kwargs) → Iterable
Routes to the keys() method of the appropriate subset driver.
- Parameters
subset (str) – Id of the subset in this source definition.
**kwargs – Keyword arguments passed to the subset driver.
- Returns
(Iterable) Iterable over the keys for subset driver subset.
-
class squirrel.driver.StoreDriver(url: str, serializer: squirrel.serialization.SquirrelSerializer, **kwargs)
Bases: squirrel.driver.driver.MapDriver
A MapDriver implementation, which uses an AbstractStore instance to retrieve its items.
The store used by the driver can be accessed via the store property.
Initializes StoreDriver.
- Parameters
url (str) – the url of the store
serializer (SquirrelSerializer) – serializer to be passed to SquirrelStore
**kwargs – Keyword arguments to pass to the super class initializer.
-
name = store_driver
-
get(self, key: Any, **kwargs) → Iterable
Returns an iterable over the items corresponding to key using the store instance.
Calls and returns the result of self.store.get(). Subclasses might filter or manipulate the iterable over items returned from the store.
- Parameters
key (Any) – Key with which the items will be retrieved. Must be of a type and format that is supported by the store instance.
**kwargs – Keyword arguments passed to the self.store.get() method.
- Returns
(Iterable) Iterable over the items corresponding to key, as returned from the store.
-
get_iter(self, flatten: bool = True, **kwargs) → squirrel.iterstream.Composable
Returns an iterable of items in the form of a squirrel.iterstream.Composable, which allows various stream manipulation functionalities.
Items are fetched using the get() method. The returned Composable iterates over the items in the order of the keys returned by the keys() method.
- Parameters
flatten (bool) – Whether to flatten the returned iterable. Defaults to True.
**kwargs – Other keyword arguments passed to super().get_iter(). For details, see squirrel.driver.MapDriver.get_iter().
- Returns
(squirrel.iterstream.Composable) Iterable over the items in the store.
-
keys(self, **kwargs) → Iterable
Returns an iterable over all keys to the items that are obtainable through the driver.
Calls and returns the result of self.store.keys(). Subclasses might filter or manipulate the iterable over keys returned from the store.
- Parameters
**kwargs – Keyword arguments passed to the self.store.keys() method.
- Returns
(Iterable) Iterable over all keys in the store, as returned from the store.
-
property store(self) → squirrel.store.store.AbstractStore
Store that is used by the driver.
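The delegation pattern described for StoreDriver, including a subclass that filters what the store returns, might look roughly like the sketch below. None of these classes come from squirrel: InMemoryStore, SketchStoreDriver and EvenOnlyDriver are hypothetical stand-ins, and get_iter() is reduced to the flatten flag.

```python
from typing import Any, Dict, Iterable, Iterator, List


class InMemoryStore:
    """Hypothetical stand-in for a store: maps keys to shards (lists of samples)."""

    def __init__(self, shards: Dict[str, List[int]]) -> None:
        self._shards = dict(shards)

    def get(self, key: str, **kwargs: Any) -> List[int]:
        return self._shards[key]

    def keys(self, **kwargs: Any) -> Iterator[str]:
        return iter(self._shards)


class SketchStoreDriver:
    """Driver that forwards keys() and get() to its store."""

    def __init__(self, store: InMemoryStore) -> None:
        self._store = store

    @property
    def store(self) -> InMemoryStore:
        return self._store

    def keys(self, **kwargs: Any) -> Iterable:
        return self.store.keys(**kwargs)

    def get(self, key: Any, **kwargs: Any) -> Iterable:
        return self.store.get(key, **kwargs)

    def get_iter(self, flatten: bool = True, **kwargs: Any) -> Iterator:
        for key in self.keys():
            item = self.get(key, **kwargs)
            if flatten:
                yield from item  # emit individual samples
            else:
                yield item  # emit whole shards


class EvenOnlyDriver(SketchStoreDriver):
    """Subclass that filters the items returned from the store."""

    def get(self, key: Any, **kwargs: Any) -> Iterable:
        return [x for x in super().get(key, **kwargs) if x % 2 == 0]


store = InMemoryStore({"a": [1, 2], "b": [3, 4]})
all_samples = list(SketchStoreDriver(store).get_iter())
even_samples = list(EvenOnlyDriver(store).get_iter())
```

Because get_iter() goes through self.get(), overriding get() in a subclass is enough to change what the stream contains.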
-
class squirrel.driver.ZarrDriver(url: str, **kwargs)
Bases: squirrel.driver.driver.MapDriver
A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.
Initializes ZarrDriver.
-
name = zarr
-
get(self, key: str, fetcher_func: Callable[[zarr.hierarchy.Group, str], Any], **storage_options) → Iterator
Given key, returns a sample defined by fetcher_func.
-
get_iter(self, fetcher_func: Callable[[zarr.hierarchy.Group, str], Any] = fetch, storage_options: Optional[Dict] = None, flatten: bool = True, **kwargs) → squirrel.iterstream.Composable
Returns an iterable of samples as specified by fetcher_func.
- Parameters
fetcher_func – A function with two arguments, a zarr.hierarchy.Group object and a key of type string. This function is used to fetch required fields and attributes of a sample. Defaults to squirrel.driver.zarr.fetch().
storage_options – Keyword arguments passed to squirrel.zarr.convenience.get_group(), which will be called to retrieve the store that will be provided to fetcher_func.
flatten – Whether to flatten the returned iterable. Defaults to True.
**kwargs – Keyword arguments that will be passed to MapDriver.get_iter().
- Returns
(Composable) Iterable over the items in the store.
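The two-argument shape of fetcher_func can be illustrated without zarr by letting a nested dict play the role of the group. Everything here is a stand-in: FAKE_GROUP, fetch_image_and_label and iter_samples are hypothetical names, and a real fetcher would receive a zarr.hierarchy.Group rather than a dict.

```python
from typing import Any, Callable, Dict, Iterator, Mapping


def fetch_image_and_label(group: Mapping[str, Any], key: str) -> Dict[str, Any]:
    """Hypothetical fetcher: pick only the fields of interest for the sample under key."""
    sample = group[key]
    return {"key": key, "image": sample["image"], "label": sample["label"]}


def iter_samples(
    group: Mapping[str, Any],
    fetcher_func: Callable[[Mapping[str, Any], str], Any],
) -> Iterator[Any]:
    """Apply fetcher_func to every key of the group, one sample at a time."""
    for key in group:
        yield fetcher_func(group, key)


# A nested dict standing in for a group hierarchy.
FAKE_GROUP = {
    "sample_0": {"image": [0, 1], "label": 0, "meta": "unused"},
    "sample_1": {"image": [1, 0], "label": 1, "meta": "unused"},
}

fetched = list(iter_samples(FAKE_GROUP, fetch_image_and_label))
```

Keeping the selection logic in the fetcher means fields that are not needed (here "meta") are never copied into the sample.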
-
get_root_group(self, mode: str = 'r', **storage_options) → squirrel.zarr.group.SquirrelGroup
Returns the root zarr group, i.e. zarr group at self.url.
- Parameters
mode (str) – IO mode (e.g. “r”, “w”, “a”). Defaults to “r”. mode affects the store of the returned group. See squirrel.zarr.convenience.get_group() for more information.
**storage_options – Keyword arguments passed to squirrel.zarr.convenience.get_group().
-