`squirrel.driver`¶

Submodules¶

Package Contents¶

Classes¶

`CsvDriver`	Drives the access to a data source.
`DataFrameDriver`	Drives the access to a data source.
`Driver`	Drives the access to a data source.
`ExcelDriver`	Drives the access to a data source.
`FeatherDriver`	Drives the access to a data source.
`FileDriver`	Drives the access to a data source.
`IterDriver`	A Driver that allows iteration over the items in the data source.
`JsonlDriver`	A StoreDriver that by default uses SquirrelStore with jsonl serialization. Please see the parent class for additional configuration.
`MapDriver`	A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.
`MessagepackDriver`	A StoreDriver that by default uses SquirrelStore with messagepack serialization.
`ParquetDriver`	Drives the access to a data source.
`SourceCombiner`	A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.
`StoreDriver`	A MapDriver implementation, which uses an AbstractStore instance to retrieve its items.
`ZarrDriver`	A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.

class squirrel.driver.CsvDriver(url: squirrel.constants.URL, *args, **kwargs)¶

Bases: squirrel.driver.data_frame.DataFrameDriver

Drives the access to a data source.

Driver to read CSV files into a DataFrame.

Parameters

url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. For a full list of accepted types, refer to pandas.read_csv() or dask.dataframe.read_csv().
*args – See DataFrameDriver.
**kwargs – See DataFrameDriver.

name = csv¶

class squirrel.driver.DataFrameDriver(url: squirrel.constants.URL, storage_options: dict[str, Any] | None = None, engine: ENGINE = 'pandas', df_hooks: Iterable[Callable] | None = None, read_kwargs: dict | None = None, **kwargs)¶

Bases: squirrel.driver.file.FileDriver

Drives the access to a data source.

Abstract DataFrameDriver.

This defines a common interface for all drivers that read a dataframe from a file format such as .csv, .xls, or .parquet. Derived drivers only have to specify the read() method.

Parameters

url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. Data type may depend on the derived class.
storage_options (Optional[Dict[str, Any]]) – A dict with keyword arguments passed to the file system initializer. Example of storage_options if you want to enable fsspec caching: storage_options={"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}
engine (ENGINE) – Which engine to use for DataFrame loading. Currently, all drivers support “pandas” to use Pandas and some support “dask” to asynchronously load DataFrames using Dask.
df_hooks (Iterable[Callable], optional) – Preprocessing hooks to execute on the dataframe. The first hook must accept a dask.dataframe.DataFrame or pandas.DataFrame, depending on the engine used.
read_kwargs – Arguments passed to all read methods of the derived driver.
**kwargs – Keyword arguments passed to the Driver class initializer.

get_df(**read_kwargs) → DataFrame | pd.DataFrame¶

Returns the data as a DataFrame.

Parameters: **read_kwargs – Keyword arguments to be passed to read(). Takes precedence over arguments specified at class initialization.
Returns: (dask.dataframe.DataFrame | pandas.DataFrame) Dask or Pandas DataFrame constructed from the file.
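The precedence rule for read_kwargs can be pictured as a dictionary merge in which call-time values override those given at initialization. The following is a minimal sketch of that idea, not squirrel's code: the helper name merge_read_kwargs and the example keys are hypothetical.

```python
from typing import Optional


def merge_read_kwargs(init_kwargs: Optional[dict], call_kwargs: dict) -> dict:
    """Hypothetical helper: combine read arguments, call-time values win."""
    merged = dict(init_kwargs or {})  # copy so the initial dict is not mutated
    merged.update(call_kwargs)  # keys given at call time override earlier ones
    return merged


# Assumed example values; any keyword accepted by the read method would do.
init_kwargs = {"sep": ";", "nrows": 100}
effective = merge_read_kwargs(init_kwargs, {"nrows": 5})
print(effective)
```

Arguments that are only set at initialization (here sep) are kept, while those repeated in the call (here nrows) are replaced.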

get_iter(itertuples_kwargs: dict | None = None, read_kwargs: dict | None = None) → squirrel.iterstream.Composable ¶

Returns an iterator over DataFrame rows.

Note that first the file is read into a DataFrame and then df.itertuples() is called.

Parameters

itertuples_kwargs – Keyword arguments to be passed to dask.dataframe.DataFrame.itertuples() or pandas.DataFrame.itertuples().
read_kwargs – Keyword arguments to be passed to read(). Takes precedence over arguments specified at class initialization.

Returns

(squirrel.iterstream.Composable) Iterable over the rows of the data frame as namedtuples.

class squirrel.driver.Driver(catalog: Catalog | None = None, **kwargs)¶

Bases: abc.ABC

Drives the access to a data source.

Initializes driver with a catalog and arbitrary kwargs.

name :str¶

class squirrel.driver.ExcelDriver(url: squirrel.constants.URL, **kwargs)¶

Bases: squirrel.driver.data_frame.DataFrameDriver

Drives the access to a data source.

Driver to read Excel files into a DataFrame.

Parameters

url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. For a full list of accepted types, refer to pandas.read_excel().
**kwargs – See DataFrameDriver.

name = excel¶

class squirrel.driver.FeatherDriver(url: squirrel.constants.URL, *args, **kwargs)¶

Bases: squirrel.driver.data_frame.DataFrameDriver

Drives the access to a data source.

Driver to read Feather files into a DataFrame.

Parameters

url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. For a full list of accepted types, refer to pandas.read_feather().
*args – See DataFrameDriver.
**kwargs – See DataFrameDriver.

name = feather¶

class squirrel.driver.FileDriver(url: squirrel.constants.URL, storage_options: dict[str, Any] | None = None, **kwargs)¶

Bases: squirrel.driver.driver.Driver

Drives the access to a data source.

Initializes FileDriver.

Parameters

url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. For a full list of supported types, refer to fsspec.open().
storage_options (dict[str, Any] | None) – A dict with keyword arguments passed to the file system initializer. Example of storage_options if you want to enable fsspec caching: storage_options={"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}
**kwargs – Keyword arguments passed to the super class initializer.

name = file¶

open(mode: str = 'r', create_if_not_exists: bool = False, **kwargs) → IO¶

Returns a handler for the file.

Uses squirrel.fsspec.fs.get_fs_from_url() to get a filesystem object corresponding to self.url. Simply returns the handler returned from the open() method of the filesystem.

Parameters

mode (str) – IO mode to use when opening the file. Will be forwarded to filesystem.open() method. Defaults to “r”.
create_if_not_exists (bool) – If True, the file will be created if it does not exist (along with the parent directories). This is achieved by providing auto_mkdir=create_if_not_exists as a storage option to the filesystem. No matter what you set in the FileDriver’s storage_options, create_if_not_exists will override the key auto_mkdir. Defaults to False.
**kwargs – Keyword arguments that are passed to the filesystem.open() method.

Returns

(IO) File handler for the file at self.path.
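The interaction between create_if_not_exists and a user-supplied auto_mkdir entry can be sketched as follows. This is an illustration of the documented override only, not the library's implementation: effective_storage_options is a hypothetical helper, and the anon key is just an example of an unrelated option.

```python
from typing import Optional


def effective_storage_options(
    storage_options: Optional[dict], create_if_not_exists: bool
) -> dict:
    """Hypothetical helper mirroring the documented override of auto_mkdir."""
    options = dict(storage_options or {})  # copy so the caller's dict is untouched
    options["auto_mkdir"] = create_if_not_exists  # always wins over the user's value
    return options


user_options = {"auto_mkdir": True, "anon": False}
read_options = effective_storage_options(user_options, create_if_not_exists=False)
write_options = effective_storage_options(None, create_if_not_exists=True)
print(read_options, write_options)
```

Under this reading, a user who sets auto_mkdir=True in storage_options but calls open() with the default create_if_not_exists=False still ends up with auto_mkdir=False.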

class squirrel.driver.IterDriver(catalog: Catalog | None = None, **kwargs)¶

Bases: Driver

A Driver that allows iteration over the items in the data source.

Items can be iterated over using the get_iter() method.

Initializes driver with a catalog and arbitrary kwargs.

abstract get_iter(**kwargs) → squirrel.iterstream.Composable ¶

Returns an iterable of items in the form of a Composable, which allows various stream manipulation functionalities.

The order of the items in the iterable may or may not be randomized, depending on the implementation and kwargs.

class squirrel.driver.JsonlDriver(url: str, deser_hook: Optional[Callable] = None, storage_options: dict[str, t.Any] | None = None, **kwargs)¶

Bases: squirrel.driver.store.StoreDriver

A StoreDriver that by default uses SquirrelStore with jsonl serialization. Please see the parent class for additional configuration.

Initializes JsonlDriver with default serializer.

Parameters

url (str) – Path to the root directory. If this path does not exist, it will be created.
deser_hook (Callable) – Callable that is passed as object_hook to JsonDecoder during json deserialization. Defaults to None.
storage_options (Dict) – A dictionary containing storage_options to be passed to fsspec. Example of storage_options if you want to enable fsspec caching: storage_options={"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}
**kwargs – Keyword arguments passed to the super class initializer.

name = jsonl¶

get_iter(get_kwargs: Optional[Dict] = None, **kwargs) → squirrel.iterstream.Composable ¶

Returns an iterable of samples as specified by fetcher_func.

Parameters

get_kwargs (Dict) – Keyword arguments that will be passed as get_kwargs to MapDriver.get_iter(). get_kwargs will always have compression="gzip". Defaults to None.
**kwargs – Other keyword arguments that will be passed to MapDriver.get_iter().

Returns

(Composable) Iterable over the items in the store.

class squirrel.driver.MapDriver(catalog: Catalog | None = None, **kwargs)¶

Bases: IterDriver

A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.

Initializes driver with a catalog and arbitrary kwargs.

abstract get(key: Any, **kwargs) → Any¶

Returns an iterable over the items corresponding to key.

Note that it is possible to implement this method according to your needs. There is no restriction on the type or number of items. For example, a key might be corresponding to a single item that holds a single sample or to a single item that contains one shard of multiple samples.

If the method returns a single sample, then the get_iter() method should be called with flatten=False since the stream does not need to be flattened. Otherwise, e.g. if the method returns an iterable of samples, then the get_iter() method should be called with flatten=True if it is desirable to have individual samples in the iterstream.
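The effect of flatten on a get() that returns several samples per key can be illustrated with plain Python iterables. This is a minimal sketch under assumed data, not squirrel code: shards, get, and iterate are hypothetical stand-ins.

```python
from itertools import chain

# Hypothetical shards: each key maps to a list of samples.
shards = {"shard_0": [1, 2], "shard_1": [3]}


def get(key):
    """Stand-in for a get() that returns an iterable of samples per key."""
    return shards[key]


def iterate(keys, flatten):
    """Yield whole items, or individual samples when flatten is True."""
    items = map(get, keys)
    return chain.from_iterable(items) if flatten else items


unflattened = list(iterate(shards, flatten=False))
flattened = list(iterate(shards, flatten=True))
print(unflattened, flattened)
```

Without flattening the stream carries one item per key (here a list per shard). With flattening it carries the individual samples.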

get_iter(keys_iterable: Iterable | None = None, shuffle_key_buffer: int = 1, key_hooks: Iterable[Callable | type[Composable] | functools.partial] | None = None, max_workers: int | None = None, prefetch_buffer: int = 10, shuffle_item_buffer: int = 1, flatten: bool = False, keys_kwargs: dict | None = None, get_kwargs: dict | None = None, key_shuffle_kwargs: dict | None = None, item_shuffle_kwargs: dict | None = None) → squirrel.iterstream.Composable ¶

Returns an iterable of items in the form of a squirrel.iterstream.Composable, which allows various stream manipulation functionalities.

Items are fetched using the get() method. The returned Composable iterates over the items in the order of the keys returned by the keys() method.

Parameters

keys_iterable (Iterable, optional) – If provided, only the keys in keys_iterable will be used to fetch items. If not provided, all keys in the store are used.
shuffle_key_buffer (int) – Size of the buffer used to shuffle keys.
key_hooks (Iterable[Iterable[Union[Callable, Type[Composable], functools.partial]]], optional) –
Hooks to apply to keys before fetching the items. It is an Iterable of any of these objects:
1. subclass of Composable(): in this case, .compose(hook, **kw) will be applied to the stream
2. A Callable: .to(hook, **kw) will be applied to the stream
3. A partial function: the three attributes args, keywords and func will be retrieved, and depending on whether func is a subclass of Composable() or a Callable, one of the above cases will happen, with the only difference that arguments are passed too. This is useful for passing arguments.
max_workers (int, optional) – If max_workers is equal to 0 or 1, map() is called to fetch the items iteratively. If max_workers is bigger than 1 or equal to None, async_map() is called to fetch multiple items simultaneously. In this case, the max_workers argument refers to the maximum number of workers in the ThreadPoolExecutor Pool of async_map(). None has a special meaning in this context and uses an internal heuristic for the number of workers. The exact number of workers with max_workers=None depends on the specific Python version. See ThreadPoolExecutor for details. Defaults to None.
prefetch_buffer (int) – Size of the buffer used for prefetching items if async_map is used. See max_workers for more details. Please be aware of the memory footprint when setting this parameter.
shuffle_item_buffer (int) – Size of the buffer used to shuffle items after being fetched. Please be aware of the memory footprint when setting this parameter.
flatten (bool) – Whether to flatten the returned iterable. Defaults to False.
keys_kwargs (Dict, optional) – Keyword arguments passed to keys() when getting the keys in the store. Not used if keys_iterable is provided. Defaults to None.
get_kwargs (Dict, optional) – Keyword arguments passed to get() when fetching items. Defaults to None.
key_shuffle_kwargs (Dict, optional) – Keyword arguments passed to shuffle() when shuffling keys. Defaults to None. Can be useful to e.g. set the seed etc.
item_shuffle_kwargs (Dict, optional) – Keyword arguments passed to shuffle() when shuffling items. Defaults to None. Can be useful to e.g. set the seed etc.

Returns

(squirrel.iterstream.Composable) Iterable over the items in the store.
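The documented dispatch on max_workers (sequential fetching for 0 or 1, a thread pool otherwise) can be sketched with the standard library. This is an assumption-laden illustration rather than the library's code: fetch_items is a hypothetical function, and unlike async_map() it does not model prefetch_buffer, because Executor.map submits all keys up front.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, Iterator, Optional


def fetch_items(
    get: Callable[[Any], Any],
    keys: Iterable,
    max_workers: Optional[int] = None,
) -> Iterator:
    """Hypothetical sketch of the documented max_workers dispatch."""
    if max_workers is not None and max_workers <= 1:
        # 0 or 1: fetch items one after another in the calling thread.
        yield from map(get, keys)
        return
    # None or > 1: fetch concurrently; None lets the executor pick a pool size.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the input keys.
        yield from pool.map(get, keys)


data = {"a": 1, "b": 2, "c": 3}
sequential = list(fetch_items(data.get, data, max_workers=1))
threaded = list(fetch_items(data.get, data, max_workers=None))
print(sequential, threaded)
```

Both paths return the items in key order in this sketch; the threaded path only changes how many fetches are in flight at once.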

abstract keys(**kwargs) → Iterable¶: Returns an iterable of the keys for the objects that are obtainable through the driver.

class squirrel.driver.MessagepackDriver(url: str, storage_options: dict[str, Any] | None = None, **kwargs)¶

Bases: squirrel.driver.store.StoreDriver

A StoreDriver that by default uses SquirrelStore with messagepack serialization.

Initializes MessagepackDriver with default serializer. See parent class for more options.

Parameters

url (str) – Path to the root directory. If this path does not exist, it will be created.
storage_options (Dict) – A dictionary containing storage_options to be passed to fsspec. Example of storage_options if you want to enable fsspec caching: storage_options={"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}
**kwargs – Keyword arguments passed to the super class initializer.

name = messagepack¶

class squirrel.driver.ParquetDriver(url: squirrel.constants.URL, *args, **kwargs)¶

Bases: squirrel.driver.data_frame.DataFrameDriver

Drives the access to a data source.

Driver to read Parquet files into a DataFrame.

Parameters

url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. For a full list of accepted types, refer to pandas.read_parquet() or dask.dataframe.read_parquet().
*args – See DataFrameDriver.
**kwargs – See DataFrameDriver.

name = parquet¶

class squirrel.driver.SourceCombiner(subsets: dict[str, squirrel.catalog.CatalogKey], catalog: squirrel.catalog.Catalog, **kwargs)¶

Bases: squirrel.driver.driver.MapDriver

A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.

Initializes SourceCombiner.

Parameters

subsets (Dict[str, CatalogKey]) – Keys define the names of the subsets, values are tuples of the corresponding (catalog entry, version) combinations.
catalog (Catalog) – The parent catalog which the subset sources are part of.
**kwargs – Keyword arguments to be passed to the super class.

name = source_combiner¶

get(subset: str, key: Any, **kwargs) → Iterable¶

Routes to the get() method of the appropriate subset driver.

Parameters

subset (str) – Id of the subset in this source definition.
key (str) – Key of the item to get.
**kwargs – Keyword arguments passed to the subset driver.

Returns

(Iterable) Iterable over the items corresponding to key for subset driver subset.

get_df(subset: str, **kwargs) → dask.dataframe.DataFrame ¶

Routes to the get_df() method of the appropriate subset driver.

Parameters

subset (str) – Id of the subset in this source definition.
**kwargs – Keyword arguments passed to the subset driver.

Returns

(DataFrame) Data of the subset driver subset as a Dask or Pandas DataFrame.

get_iter(subset: str | None = None, **kwargs) → squirrel.iterstream.Composable ¶

Routes to the get_iter() method of the appropriate subset driver.

Parameters

subset (str) – Id of the subset in this source definition. If None, interleaves iterables obtained from all subset drivers.
**kwargs – Keyword arguments passed to the subset driver.

Returns

(Composable) Iterable over the items of subset driver(s) in the form of a Composable.

get_iter_sampler(probs: list[float] | None = None, rng: random.Random | None = None, seed: int | None = None, **kwargs) → squirrel.iterstream.Composable ¶

Returns an iterstream that samples from the subsets of this source.

Parameters

rng (random.Random) – A random number generator.
probs (List[float]) – List of probabilities to sample from the subsets. If None, sample uniform.
**kwargs – Keyword arguments passed to the get_iter() method of each subset driver.

Returns

(Composable) Iterable over samples randomly sampled from subsets.
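One way to picture probability-weighted sampling from several subsets is the generator below. It is a sketch under stated assumptions, not the library's algorithm: sample_from_subsets is hypothetical, and dropping a subset once it is exhausted is an assumption that the documentation above does not confirm.

```python
import random
from typing import Iterable, Iterator, Optional, Sequence


def sample_from_subsets(
    subsets: Sequence[Iterable],
    probs: Optional[Sequence[float]] = None,
    rng: Optional[random.Random] = None,
) -> Iterator:
    """Hypothetical sketch: draw the next item from a randomly chosen subset."""
    rng = rng or random.Random()
    iterators = [iter(s) for s in subsets]
    # Uniform weights when no probabilities are given.
    weights = list(probs) if probs is not None else [1.0] * len(iterators)
    while iterators and sum(weights) > 0:
        idx = rng.choices(range(len(iterators)), weights=weights, k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            # Assumption: an exhausted subset is dropped and sampling continues.
            del iterators[idx]
            del weights[idx]


stream = sample_from_subsets(
    [["a1", "a2"], ["b1", "b2", "b3"]], probs=[0.5, 0.5], rng=random.Random(0)
)
samples = list(stream)
print(samples)
```

Items from the same subset keep their relative order; only the interleaving between subsets is random. Passing a seeded random.Random makes the interleaving reproducible.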

get_source(subset: str) → squirrel.catalog.source.Source ¶

Returns subset source based on subset id.

Parameters: subset (str) – Id of subset in this source definition.
Returns: Subset source.
Return type: (Source)

get_store(subset: str) → squirrel.store.AbstractStore ¶

Returns the store of the appropriate subset driver.

Parameters: subset (str) – Id of the subset in this source definition.
Returns: (AbstractStore) Store of the subset driver subset.

keys(subset: str, **kwargs) → Iterable¶

Routes to the keys() method of the appropriate subset driver.

Parameters

subset (str) – Id of the subset in this source definition.
**kwargs – Keyword arguments passed to the subset driver.

Returns

(Iterable) Iterable over the keys for subset driver subset.

property subsets → list[str]¶: Ids of all subsets defined by this source.

class squirrel.driver.StoreDriver(url: str, serializer: squirrel.serialization.SquirrelSerializer, storage_options: dict[str, Any] | None = None, **kwargs)¶

Bases: squirrel.driver.driver.MapDriver

A MapDriver implementation, which uses an AbstractStore instance to retrieve its items.

The store used by the driver can be accessed via the store property.

Initializes StoreDriver.

Parameters

url (str) – the url of the store
serializer (SquirrelSerializer) – serializer to be passed to SquirrelStore
storage_options (Optional[Dict[str, Any]]) – A dict with keyword arguments to be passed to the store initializer. Example of storage_options if you want to enable fsspec caching: storage_options={"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}
**kwargs – Keyword arguments to pass to the super class initializer.

name = store_driver¶

get(key: Any, **kwargs) → Iterable¶

Returns an iterable over the items corresponding to key using the store instance.

Calls and returns the result of self.store.get(). Subclasses might filter or manipulate the iterable over items returned from the store.

Parameters

key (Any) – Key with which the items will be retrieved. Must be of type and format that is supported by the store instance.
**kwargs – Keyword arguments passed to the self.store.get() method.

Returns

(Iterable) Iterable over the items corresponding to key, as returned from the store.

get_iter(flatten: bool = True, **kwargs) → squirrel.iterstream.Composable ¶

Returns an iterable of items in the form of a squirrel.iterstream.Composable, which allows various stream manipulation functionalities.

Items are fetched using the get() method. The returned Composable iterates over the items in the order of the keys returned by the keys() method.

Parameters

flatten (bool) – Whether to flatten the returned iterable. Defaults to True.
**kwargs – Other keyword arguments passed to super().get_iter(). For details, see squirrel.driver.MapDriver.get_iter().

Returns

(squirrel.iterstream.Composable) Iterable over the items in the store.

keys(**kwargs) → Iterable¶

Returns an iterable over all keys to the items that are obtainable through the driver.

Calls and returns the result of self.store.keys(). Subclasses might filter or manipulate the iterable over keys returned from the store.

Parameters: **kwargs – Keyword arguments passed to the self.store.keys() method.
Returns: (Iterable) Iterable over all keys in the store, as returned from the store.

property store → squirrel.store.store.AbstractStore ¶: Store that is used by the driver.
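The delegation pattern described for get() and keys(), including a subclass that filters the keys returned from the store, might look like the sketch below. None of these classes are part of squirrel: InMemoryStore, ToyStoreDriver, and PrefixFilteredDriver are hypothetical stand-ins that only mirror the documented method names.

```python
from typing import Any, Iterable, Iterator


class InMemoryStore:
    """Hypothetical stand-in for a store: maps keys to lists of items."""

    def __init__(self, data: dict) -> None:
        self._data = data

    def get(self, key: Any, **kwargs) -> Iterable:
        return self._data[key]

    def keys(self, **kwargs) -> Iterator:
        return iter(self._data)


class ToyStoreDriver:
    """Hypothetical driver that forwards get() and keys() to its store."""

    def __init__(self, store: InMemoryStore) -> None:
        self._store = store

    @property
    def store(self) -> InMemoryStore:
        return self._store

    def get(self, key: Any, **kwargs) -> Iterable:
        return self.store.get(key, **kwargs)

    def keys(self, **kwargs) -> Iterable:
        return self.store.keys(**kwargs)


class PrefixFilteredDriver(ToyStoreDriver):
    """Subclass that only exposes keys starting with a given prefix."""

    def __init__(self, store: InMemoryStore, prefix: str) -> None:
        super().__init__(store)
        self._prefix = prefix

    def keys(self, **kwargs) -> Iterable:
        # Filter the iterable of keys returned from the store.
        return (k for k in super().keys(**kwargs) if k.startswith(self._prefix))


store = InMemoryStore({"train_0": ["x"], "train_1": ["y"], "test_0": ["z"]})
plain = ToyStoreDriver(store)
filtered = PrefixFilteredDriver(store, prefix="train_")
print(list(plain.keys()), list(filtered.keys()))
```

The base driver exposes every key in the store, while the subclass narrows the key set without touching how items are fetched.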

class squirrel.driver.ZarrDriver(url: str, **kwargs)¶

Bases: squirrel.driver.driver.MapDriver

A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.

Initializes ZarrDriver.

name = zarr¶

get(key: str, fetcher_func: Callable[[zarr.hierarchy.Group, str], Any], **storage_options) → Iterator¶: Given key, returns a sample defined by self.fetcher_func.

get_iter(fetcher_func: Callable[[zarr.hierarchy.Group, str], Any] = fetch, storage_options: Optional[Dict] = None, flatten: bool = True, **kwargs) → squirrel.iterstream.Composable ¶

Returns an iterable of samples as specified by fetcher_func.

Parameters

fetcher_func – A function with two arguments, a zarr.hierarchy.Group object and a key of type string. This function is used to fetch required fields and attributes of a sample. Defaults to squirrel.driver.zarr.fetch().
storage_options – Keyword arguments passed to squirrel.zarr.convenience.get_group(), which will be called to retrieve the store that will be provided to fetcher_func.
flatten – Whether to flatten the returned iterable. Defaults to True.
**kwargs – Keyword arguments that will be passed to MapDriver.get_iter().

Returns

(Composable) Iterable over the items in the store.
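A custom fetcher_func takes a group and a string key and returns whatever should represent one sample. The sketch below shows only that two-argument shape: nested dicts stand in for a zarr group so the example runs without zarr, and the function name and the field names image and label are hypothetical. A real zarr group exposes arrays and attributes rather than plain dict values.

```python
from typing import Any, Mapping


def fetch_image_and_label(group: Mapping[str, Any], key: str) -> dict:
    """Hypothetical fetcher: select two fields of the sample stored under key."""
    sample = group[key]
    return {"image": sample["image"], "label": sample["label"]}


# Nested dicts stand in for a zarr group hierarchy in this sketch.
fake_root = {
    "sample_0": {"image": [0, 1], "label": 3, "meta": "unused"},
}
result = fetch_image_and_label(fake_root, "sample_0")
print(result)
```

Restricting the fetcher to the fields that are actually needed keeps unused data (here meta) out of the stream.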

get_root_group(mode: str = 'r', **storage_options) → squirrel.zarr.group.SquirrelGroup ¶

Returns the root zarr group, i.e. zarr group at self.url.

Parameters

mode (str) – IO mode (e.g. “r”, “w”, “a”). Defaults to “r”. mode affects the store of the returned group. See squirrel.zarr.convenience.get_group() for more information.
**storage_options – Keyword arguments passed to squirrel.zarr.convenience.get_group().

keys() → Iterator[str]¶: Returns the keys of the root zarr group.
