squirrel.driver

Package Contents

Classes

CsvDriver

Drives the access to a data source.

DataFrameDriver

Drives the access to a data source.

Driver

Drives the access to a data source.

ExcelDriver

Drives the access to a data source.

FeatherDriver

Drives the access to a data source.

FileDriver

Drives the access to a data source.

IterDriver

A Driver that allows iteration over the items in the data source.

JsonlDriver

A StoreDriver that by default uses SquirrelStore with jsonl serialization. Please see the parent class for additional configuration.

MapDriver

A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.

MessagepackDriver

A StoreDriver that by default uses SquirrelStore with messagepack serialization.

ParquetDriver

Drives the access to a data source.

SourceCombiner

A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.

StoreDriver

A MapDriver implementation, which uses an AbstractStore instance to retrieve its items.

ZarrDriver

A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.

class squirrel.driver.CsvDriver(url: squirrel.constants.URL, *args, **kwargs)

Bases: squirrel.driver.data_frame.DataFrameDriver

Drives the access to a data source.

Driver to read CSV files into a DataFrame.

Parameters
  • url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. For a full list of accepted types, refer to pandas.read_csv() or dask.dataframe.read_csv().

  • *args – See DataFrameDriver.

  • **kwargs – See DataFrameDriver.

name = csv
class squirrel.driver.DataFrameDriver(url: squirrel.constants.URL, storage_options: dict[str, Any] | None = None, engine: ENGINE = 'pandas', df_hooks: Iterable[Callable] | None = None, read_kwargs: dict | None = None, **kwargs)

Bases: squirrel.driver.file.FileDriver

Drives the access to a data source.

Abstract DataFrameDriver.

This defines a common interface for all drivers that use different read methods to read a dataframe, such as from .csv, .xls, .parquet etc. Derived drivers only have to specify the read() method.

Parameters
  • url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. Data type may depend on the derived class.

  • storage_options (Optional[Dict[str, Any]]) – A dict with keyword arguments passed to the file system initializer. Example of storage_options if you want to enable fsspec caching: storage_options={"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}

  • engine (ENGINE) – Which engine to use for DataFrame loading. Currently, all drivers support “pandas” to use Pandas and some support “dask” to asynchronously load DataFrames using Dask.

  • df_hooks (Iterable[Callable], optional) – Preprocessing hooks to execute on the dataframe. The first hook must accept a dask.dataframe.DataFrame or pandas.DataFrame, depending on the engine used.

  • read_kwargs – Arguments passed to all read methods of the derived driver.

  • **kwargs – Keyword arguments passed to the Driver class initializer.

get_df(**read_kwargs) → DataFrame | pd.DataFrame

Returns the data as a DataFrame.

Parameters

**read_kwargs – Keyword arguments to be passed to read(). Takes precedence over arguments specified at class initialization.

Returns

(dask.dataframe.DataFrame | pandas.DataFrame) Dask or Pandas DataFrame constructed from the file.
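The precedence rule for read_kwargs can be illustrated with plain dictionaries. This is a minimal sketch that does not import squirrel; merge_read_kwargs is a hypothetical helper, and the keys sep and nrows are only example arguments of the kind a read method might accept.

```python
from typing import Any, Dict, Optional


def merge_read_kwargs(init_kwargs: Optional[Dict[str, Any]],
                      call_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Combine init-time and call-time read arguments (hypothetical helper)."""
    # Later entries win, so call-time values override init-time ones.
    return {**(init_kwargs or {}), **call_kwargs}


init_time = {"sep": ";", "nrows": 100}  # as if given as read_kwargs at initialization
call_time = {"nrows": 5}                # as if given to get_df(nrows=5)
effective = merge_read_kwargs(init_time, call_time)
print(effective)
```

Arguments that are only set at initialization (sep here) are kept, while arguments given in both places take the call-time value.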

get_iter(itertuples_kwargs: dict | None = None, read_kwargs: dict | None = None) → squirrel.iterstream.Composable

Returns an iterator over DataFrame rows.

Note that first the file is read into a DataFrame and then df.itertuples() is called.

Parameters
  • itertuples_kwargs – Keyword arguments to be passed to dask.dataframe.DataFrame.itertuples() or pandas.DataFrame.itertuples().

  • read_kwargs – Keyword arguments to be passed to read(). Takes precedence over arguments specified at class initialization.

Returns

(squirrel.iterstream.Composable) Iterable over the rows of the data frame as namedtuples.

class squirrel.driver.Driver(catalog: Catalog | None = None, **kwargs)

Bases: abc.ABC

Drives the access to a data source.

Initializes driver with a catalog and arbitrary kwargs.

name: str
class squirrel.driver.ExcelDriver(url: squirrel.constants.URL, **kwargs)

Bases: squirrel.driver.data_frame.DataFrameDriver

Drives the access to a data source.

Driver to read Excel files into a DataFrame.

Parameters
  • url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. For a full list of accepted types, refer to pandas.read_excel().

  • **kwargs – See DataFrameDriver.

name = excel
class squirrel.driver.FeatherDriver(url: squirrel.constants.URL, *args, **kwargs)

Bases: squirrel.driver.data_frame.DataFrameDriver

Drives the access to a data source.

Driver to read Feather files into a DataFrame.

Parameters
  • url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. For a full list of accepted types, refer to pandas.read_feather().

  • *args – See DataFrameDriver.

  • **kwargs – See DataFrameDriver.

name = feather
class squirrel.driver.FileDriver(url: squirrel.constants.URL, storage_options: dict[str, Any] | None = None, **kwargs)

Bases: squirrel.driver.driver.Driver

Drives the access to a data source.

Initializes FileDriver.

Parameters
  • url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. For a full list of supported types, refer to fsspec.open().

  • storage_options (dict[str, Any] | None) – A dict with keyword arguments passed to the file system initializer. Example of storage_options if you want to enable fsspec caching: storage_options={"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}

  • **kwargs – Keyword arguments passed to the super class initializer.

name = file
open(mode: str = 'r', create_if_not_exists: bool = False, **kwargs) → IO

Returns a handler for the file.

Uses squirrel.fsspec.fs.get_fs_from_url() to get a filesystem object corresponding to self.url. Simply returns the handler returned from the open() method of the filesystem.

Parameters
  • mode (str) – IO mode to use when opening the file. Will be forwarded to filesystem.open() method. Defaults to “r”.

  • create_if_not_exists (bool) – If True, the file will be created if it does not exist (along with the parent directories). This is achieved by providing auto_mkdir=create_if_not_exists as a storage option to the filesystem. No matter what you set in the FileDriver’s storage_options, create_if_not_exists will override the key auto_mkdir. Defaults to False.

  • **kwargs – Keyword arguments that are passed to the filesystem.open() method.

Returns

(IO) File handler for the file at self.path.
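The override of auto_mkdir by create_if_not_exists can be pictured as a dictionary update. The following is a minimal sketch under that assumption; effective_storage_options is a hypothetical helper, not part of squirrel, and some_option is a placeholder key.

```python
from typing import Any, Dict, Optional


def effective_storage_options(storage_options: Optional[Dict[str, Any]],
                              create_if_not_exists: bool = False) -> Dict[str, Any]:
    """Hypothetical helper mirroring the documented override of auto_mkdir."""
    options = dict(storage_options or {})         # copy, caller's dict stays untouched
    options["auto_mkdir"] = create_if_not_exists  # always replaces the user's value
    return options


user_options = {"auto_mkdir": True, "some_option": "value"}
opts = effective_storage_options(user_options, create_if_not_exists=False)
print(opts)
```

Even though the user asked for auto_mkdir=True in storage_options, the value seen by the filesystem follows create_if_not_exists.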

class squirrel.driver.IterDriver(catalog: Catalog | None = None, **kwargs)

Bases: Driver

A Driver that allows iteration over the items in the data source.

Items can be iterated over using the get_iter() method.

Initializes driver with a catalog and arbitrary kwargs.

abstract get_iter(**kwargs) → squirrel.iterstream.Composable

Returns an iterable of items in the form of a Composable, which allows various stream manipulation functionalities.

The order of the items in the iterable may or may not be randomized, depending on the implementation and kwargs.

class squirrel.driver.JsonlDriver(url: str, deser_hook: Optional[Callable] = None, storage_options: dict[str, t.Any] | None = None, **kwargs)

Bases: squirrel.driver.store.StoreDriver

A StoreDriver that by default uses SquirrelStore with jsonl serialization. Please see the parent class for additional configuration.

Initializes JsonlDriver with default serializer.

Parameters
  • url (str) – Path to the root directory. If this path does not exist, it will be created.

  • deser_hook (Callable) – Callable that is passed as object_hook to JsonDecoder during json deserialization. Defaults to None.

  • storage_options (Dict) – A dictionary containing storage_options to be passed to fsspec. Example of storage_options if you want to enable fsspec caching: storage_options={"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}

  • **kwargs – Keyword arguments passed to the super class initializer.

name = jsonl
get_iter(get_kwargs: Optional[Dict] = None, **kwargs) → squirrel.iterstream.Composable

Returns an iterable of samples as specified by fetcher_func.

Parameters
  • get_kwargs (Dict) – Keyword arguments that will be passed as get_kwargs to MapDriver.get_iter(). get_kwargs will always have compression="gzip". Defaults to None.

  • **kwargs – Other keyword arguments that will be passed to MapDriver.get_iter().

Returns

(Composable) Iterable over the items in the store.
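To make the roles of deser_hook and compression="gzip" concrete, here is a stand-alone sketch that decodes a gzip-compressed JSON Lines payload with the standard library only. It assumes the hook behaves like the object_hook argument of the standard json module; read_gzipped_jsonl and tag_sample are hypothetical names, and the sketch does not reproduce how squirrel stores shards.

```python
import gzip
import json
from typing import Any, Callable, Dict, Iterator, Optional


def read_gzipped_jsonl(payload: bytes,
                       deser_hook: Optional[Callable[[Dict[str, Any]], Any]] = None
                       ) -> Iterator[Any]:
    """Yield one decoded object per line of a gzip-compressed JSON Lines payload."""
    text = gzip.decompress(payload).decode("utf-8")
    for line in text.splitlines():
        if line.strip():
            # object_hook is called with every decoded JSON object (a dict).
            yield json.loads(line, object_hook=deser_hook)


def tag_sample(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical hook: add a marker field to each decoded dict."""
    return {**obj, "decoded": True}


records = [{"id": 0, "label": "a"}, {"id": 1, "label": "b"}]
shard = gzip.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))
samples = list(read_gzipped_jsonl(shard, deser_hook=tag_sample))
print(samples)
```

Without a hook, the same payload decodes to plain dictionaries.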

class squirrel.driver.MapDriver(catalog: Catalog | None = None, **kwargs)

Bases: IterDriver

A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.

Initializes driver with a catalog and arbitrary kwargs.

abstract get(key: Any, **kwargs) → Any

Returns an iterable over the items corresponding to key.

Note that it is possible to implement this method according to your needs. There is no restriction on the type or number of items. For example, a key might correspond to a single item that holds a single sample, or to a single item that contains one shard of multiple samples.

If the method returns a single sample, then the get_iter() method should be called with flatten=False since the stream does not need to be flattened. Otherwise, e.g. if the method returns an iterable of samples, then the get_iter() method should be called with flatten=True if it is desirable to have individual samples in the iterstream.
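The effect of flatten can be sketched without squirrel. In the example below, iterate is a hypothetical stand-in for get_iter() that maps keys through a get callable and optionally flattens one level; the dictionaries play the role of a data source.

```python
from itertools import chain
from typing import Any, Callable, Iterable, Iterator


def iterate(keys: Iterable[Any], get: Callable[[Any], Any], flatten: bool) -> Iterator[Any]:
    """Map keys through `get`, optionally flattening one level."""
    items = (get(key) for key in keys)
    return chain.from_iterable(items) if flatten else items


shards = {"shard_0": [1, 2], "shard_1": [3]}  # each key holds a shard of samples
single = {"a": 10, "b": 20}                   # each key holds exactly one sample

per_sample = list(iterate(shards, shards.__getitem__, flatten=True))
per_shard = list(iterate(shards, shards.__getitem__, flatten=False))
unflattened = list(iterate(single, single.__getitem__, flatten=False))
print(per_sample, per_shard, unflattened)
```

With sharded items, flatten=True yields individual samples and flatten=False yields whole shards; with one sample per key, no flattening is needed.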

get_iter(keys_iterable: Iterable | None = None, shuffle_key_buffer: int = 1, key_hooks: Iterable[Callable | type[Composable] | functools.partial] | None = None, max_workers: int | None = None, prefetch_buffer: int = 10, shuffle_item_buffer: int = 1, flatten: bool = False, keys_kwargs: dict | None = None, get_kwargs: dict | None = None, key_shuffle_kwargs: dict | None = None, item_shuffle_kwargs: dict | None = None) → squirrel.iterstream.Composable

Returns an iterable of items in the form of a squirrel.iterstream.Composable, which allows various stream manipulation functionalities.

Items are fetched using the get() method. The returned Composable iterates over the items in the order of the keys returned by the keys() method.

Parameters
  • keys_iterable (Iterable, optional) – If provided, only the keys in keys_iterable will be used to fetch items. If not provided, all keys in the store are used.

  • shuffle_key_buffer (int) – Size of the buffer used to shuffle keys.

  • key_hooks (Iterable[Union[Callable, Type[Composable], functools.partial]], optional) –

    Hooks to apply to keys before fetching the items. It is an Iterable of any of the following objects:

    1. A subclass of Composable: .compose(hook, **kw) will be applied to the stream

    2. A Callable: .to(hook, **kw) will be applied to the stream

    3. A partial function: the three attributes args, keywords and func will be retrieved, and depending on whether func is a subclass of Composable() or a Callable, one of the above cases will happen, with the only difference that arguments are passed too. This is useful for passing arguments.

  • max_workers (int, optional) – If max_workers is equal to 0 or 1, map() is called to fetch the items iteratively. If max_workers is bigger than 1 or equal to None, async_map() is called to fetch multiple items simultaneously. In this case, the max_workers argument refers to the maximum number of workers in the ThreadPoolExecutor Pool of async_map(). None has a special meaning in this context and uses an internal heuristic for the number of workers. The exact number of workers with max_workers=None depends on the specific Python version. See ThreadPoolExecutor for details. Defaults to None.

  • prefetch_buffer (int) – Size of the buffer used for prefetching items if async_map is used. See max_workers for more details. Please be aware of the memory footprint when setting this parameter.

  • shuffle_item_buffer (int) – Size of the buffer used to shuffle items after being fetched. Please be aware of the memory footprint when setting this parameter.

  • flatten (bool) – Whether to flatten the returned iterable. Defaults to False.

  • keys_kwargs (Dict, optional) – Keyword arguments passed to keys() when getting the keys in the store. Not used if keys_iterable is provided. Defaults to None.

  • get_kwargs (Dict, optional) – Keyword arguments passed to get() when fetching items. Defaults to None.

  • key_shuffle_kwargs (Dict, optional) – Keyword arguments passed to shuffle() when shuffling keys. Defaults to None. Can be useful, e.g., to set the seed.

  • item_shuffle_kwargs (Dict, optional) – Keyword arguments passed to shuffle() when shuffling items. Defaults to None. Can be useful, e.g., to set the seed.

Returns

(squirrel.iterstream.Composable) Iterable over the items in the store.
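The handling of key_hooks and max_workers can be sketched with the standard library. This is a simplified illustration, not squirrel's implementation: DropFirst, take_every, apply_key_hooks and fetch_items are hypothetical names, a stage class and a plain callable are both simply called with the stream as first argument, and items are fetched eagerly rather than through a prefetch buffer.

```python
import functools
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, Iterator, List, Optional


class DropFirst:
    """Hypothetical stage class (stand-in for a Composable subclass)."""

    def __init__(self, source: Iterable[Any], count: int = 1) -> None:
        self.source = source
        self.count = count

    def __iter__(self) -> Iterator[Any]:
        # Skip the first `count` keys of the wrapped stream.
        return itertools.islice(self.source, self.count, None)


def take_every(keys: Iterable[Any], n: int) -> List[Any]:
    """Hypothetical callable hook: keep every n-th key."""
    return [key for i, key in enumerate(keys) if i % n == 0]


def apply_key_hooks(keys: Iterable[Any], hooks: Iterable[Any]) -> Iterable[Any]:
    for hook in hooks:
        args: tuple = ()
        kwargs: dict = {}
        if isinstance(hook, functools.partial):
            # A partial is unpacked into its function and stored arguments.
            hook, args, kwargs = hook.func, hook.args, hook.keywords
        # Simplification: classes and callables both receive the stream first.
        keys = hook(keys, *args, **kwargs)
    return keys


def fetch_items(keys: Iterable[Any], get: Callable[[Any], Any],
                max_workers: Optional[int] = None) -> List[Any]:
    if max_workers in (0, 1):
        return list(map(get, keys))       # fetch items one after another
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(get, keys))  # fetch concurrently, order preserved


keys = ["k0", "k1", "k2", "k3", "k4"]
hooks = [DropFirst, functools.partial(take_every, n=2)]
selected = list(apply_key_hooks(keys, hooks))
items = fetch_items(selected, str.upper, max_workers=None)
print(selected, items)
```

Passing max_workers=None lets ThreadPoolExecutor choose its default pool size, which matches the documented special meaning of None.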

abstract keys(**kwargs) → Iterable

Returns an iterable of the keys for the objects that are obtainable through the driver.

class squirrel.driver.MessagepackDriver(url: str, storage_options: dict[str, Any] | None = None, **kwargs)

Bases: squirrel.driver.store.StoreDriver

A StoreDriver that by default uses SquirrelStore with messagepack serialization.

Initializes MessagepackDriver with default serializer. See parent class for more options.

Parameters
  • url (str) – Path to the root directory. If this path does not exist, it will be created.

  • storage_options (Dict) – A dictionary containing storage_options to be passed to fsspec. Example of storage_options if you want to enable fsspec caching: storage_options={"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}

  • **kwargs – Keyword arguments passed to the super class initializer.

name = messagepack
class squirrel.driver.ParquetDriver(url: squirrel.constants.URL, *args, **kwargs)

Bases: squirrel.driver.data_frame.DataFrameDriver

Drives the access to a data source.

Driver to read Parquet files into a DataFrame.

Parameters
  • url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. For a full list of accepted types, refer to pandas.read_parquet() or dask.dataframe.read_parquet().

  • *args – See DataFrameDriver.

  • **kwargs – See DataFrameDriver.

name = parquet
class squirrel.driver.SourceCombiner(subsets: dict[str, squirrel.catalog.CatalogKey], catalog: squirrel.catalog.Catalog, **kwargs)

Bases: squirrel.driver.driver.MapDriver

A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.

Initializes SourceCombiner.

Parameters
  • subsets (Dict[str, CatalogKey]) – Keys define the names of the subsets, values are tuples of the corresponding (catalog entry, version) combinations.

  • catalog (Catalog) – The parent catalog which the subset sources are part of.

  • **kwargs – Keyword arguments to be passed to the super class.

name = source_combiner
get(subset: str, key: Any, **kwargs) → Iterable

Routes to the get() method of the appropriate subset driver.

Parameters
  • subset (str) – Id of the subset in this source definition.

  • key (str) – Key of the item to get.

  • **kwargs – Keyword arguments passed to the subset driver.

Returns

(Iterable) Iterable over the items corresponding to key for subset driver subset.
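Routing by subset id can be pictured as a dictionary lookup followed by a delegated call. The sketch below uses hypothetical stand-ins (ListDriver, SketchCombiner) and does not involve a catalog, so it only illustrates the delegation pattern described here.

```python
from typing import Any, Dict, Iterator, List


class ListDriver:
    """Hypothetical stand-in for a subset driver backed by a dict."""

    def __init__(self, items: Dict[str, Any]) -> None:
        self._items = dict(items)

    def get(self, key: str, **kwargs: Any) -> Iterator[Any]:
        return iter([self._items[key]])

    def keys(self, **kwargs: Any) -> Iterator[str]:
        return iter(self._items)


class SketchCombiner:
    """Hypothetical stand-in that routes calls to the driver of a subset."""

    def __init__(self, drivers: Dict[str, ListDriver]) -> None:
        self._drivers = dict(drivers)

    @property
    def subsets(self) -> List[str]:
        return list(self._drivers)

    def get(self, subset: str, key: str, **kwargs: Any) -> Iterator[Any]:
        # Look up the subset driver, then delegate the call unchanged.
        return self._drivers[subset].get(key, **kwargs)

    def keys(self, subset: str, **kwargs: Any) -> Iterator[str]:
        return self._drivers[subset].keys(**kwargs)


combiner = SketchCombiner({
    "train": ListDriver({"t0": "cat", "t1": "dog"}),
    "test": ListDriver({"x0": "bird"}),
})
train_keys = list(combiner.keys("train"))
test_item = list(combiner.get("test", "x0"))
print(combiner.subsets, train_keys, test_item)
```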

get_df(subset: str, **kwargs) → dask.dataframe.DataFrame

Routes to the get_df() method of the appropriate subset driver.

Parameters
  • subset (str) – Id of the subset in this source definition.

  • **kwargs – Keyword arguments passed to the subset driver.

Returns

(DataFrame) Data of the subset driver subset as a Dask or Pandas DataFrame.

get_iter(subset: str | None = None, **kwargs) → squirrel.iterstream.Composable

Routes to the get_iter() method of the appropriate subset driver.

Parameters
  • subset (str) – Id of the subset in this source definition. If None, interleaves iterables obtained from all subset drivers.

  • **kwargs – Keyword arguments passed to the subset driver.

Returns

(Composable) Iterable over the items of subset driver(s) in the form of a Composable.

get_iter_sampler(probs: list[float] | None = None, rng: random.Random | None = None, seed: int | None = None, **kwargs) → squirrel.iterstream.Composable

Returns an iterstream that samples from the subsets of this source.

Parameters
  • rng (random.Random) – A random number generator.

  • probs (List[float]) – List of probabilities with which to sample from the subsets. If None, subsets are sampled uniformly.

  • **kwargs – Keyword arguments passed to the get_iter() method of each subset driver.

Returns

(Composable) Iterable over samples randomly sampled from subsets.
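Sampling from several subsets with given probabilities can be sketched as follows. This is not squirrel's implementation: sample_from_subsets is a hypothetical function, the use of seed to build a random.Random is an assumption, and so is the choice to drop a subset once it is exhausted and keep sampling from the rest.

```python
import random
from typing import Any, Dict, Iterable, Iterator, List, Optional


def sample_from_subsets(subsets: Dict[str, Iterable[Any]],
                        probs: Optional[List[float]] = None,
                        rng: Optional[random.Random] = None,
                        seed: Optional[int] = None) -> Iterator[Any]:
    """Yield items by repeatedly picking a subset at random (hypothetical sketch)."""
    rng = rng if rng is not None else random.Random(seed)
    iterators = [iter(items) for items in subsets.values()]
    # Uniform weights when no probabilities are given.
    weights = list(probs) if probs is not None else [1.0] * len(iterators)
    while iterators and sum(weights) > 0:
        idx = rng.choices(range(len(iterators)), weights=weights, k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            # Assumption: an exhausted subset is removed and sampling continues.
            del iterators[idx]
            del weights[idx]


data = {"train": [1, 2, 3], "extra": ["a", "b"]}
mixed = list(sample_from_subsets(data, probs=[0.7, 0.3], seed=0))
print(mixed)
```

Items keep their original order within each subset; only the interleaving between subsets is random, and a fixed seed makes it reproducible.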

get_source(subset: str) → squirrel.catalog.source.Source

Returns subset source based on subset id.

Parameters

subset (str) – Id of subset in this source definition.

Returns

Subset source.

Return type

(Source)

get_store(subset: str) → squirrel.store.AbstractStore

Returns the store of the appropriate subset driver.

Parameters

subset (str) – Id of the subset in this source definition.

Returns

(AbstractStore) Store of the subset driver subset.

keys(subset: str, **kwargs) → Iterable

Routes to the keys() method of the appropriate subset driver.

Parameters
  • subset (str) – Id of the subset in this source definition.

  • **kwargs – Keyword arguments passed to the subset driver.

Returns

(Iterable) Iterable over the keys for subset driver subset.

property subsets: list[str]

Ids of all subsets defined by this source.

class squirrel.driver.StoreDriver(url: str, serializer: squirrel.serialization.SquirrelSerializer, storage_options: dict[str, Any] | None = None, **kwargs)

Bases: squirrel.driver.driver.MapDriver

A MapDriver implementation, which uses an AbstractStore instance to retrieve its items.

The store used by the driver can be accessed via the store property.

Initializes StoreDriver.

Parameters
  • url (str) – The URL of the store.

  • serializer (SquirrelSerializer) – Serializer to be passed to SquirrelStore.

  • storage_options (Optional[Dict[str, Any]]) – A dict with keyword arguments to be passed to the store initializer. Example of storage_options if you want to enable fsspec caching: storage_options={"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}

  • **kwargs – Keyword arguments to pass to the super class initializer.

name = store_driver
get(key: Any, **kwargs) → Iterable

Returns an iterable over the items corresponding to key using the store instance.

Calls and returns the result of self.store.get(). Subclasses might filter or manipulate the iterable over items returned from the store.

Parameters
  • key (Any) – Key with which the items will be retrieved. Must be of a type and format that is supported by the store instance.

  • **kwargs – Keyword arguments passed to the self.store.get() method.

Returns

(Iterable) Iterable over the items corresponding to key, as returned from the store.

get_iter(flatten: bool = True, **kwargs) → squirrel.iterstream.Composable

Returns an iterable of items in the form of a squirrel.iterstream.Composable, which allows various stream manipulation functionalities.

Items are fetched using the get() method. The returned Composable iterates over the items in the order of the keys returned by the keys() method.

Parameters
  • flatten (bool) – Whether to flatten the returned iterable. Defaults to True.

  • **kwargs – Other keyword arguments passed to super().get_iter(). For details, see squirrel.driver.MapDriver.get_iter().

Returns

(squirrel.iterstream.Composable) Iterable over the items in the store.

keys(**kwargs) → Iterable

Returns an iterable over all keys to the items that are obtainable through the driver.

Calls and returns the result of self.store.keys(). Subclasses might filter or manipulate the iterable over keys returned from the store.

Parameters

**kwargs – Keyword arguments passed to the self.store.keys() method.

Returns

(Iterable) Iterable over all keys in the store, as returned from the store.
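The remark that subclasses might filter the keys returned from the store can be sketched with stand-in classes. InMemoryStore, SketchStoreDriver and TrainOnlyDriver are hypothetical and do not inherit from squirrel classes; they only mirror the delegation to self.store described here.

```python
from typing import Any, Dict, Iterator, List


class InMemoryStore:
    """Hypothetical stand-in for an AbstractStore holding shards in a dict."""

    def __init__(self, shards: Dict[str, List[Any]]) -> None:
        self._shards = dict(shards)

    def get(self, key: str, **kwargs: Any) -> Iterator[Any]:
        return iter(self._shards[key])

    def keys(self, **kwargs: Any) -> Iterator[str]:
        return iter(self._shards)


class SketchStoreDriver:
    """Stand-in driver that forwards get() and keys() to its store."""

    def __init__(self, store: InMemoryStore) -> None:
        self._store = store

    @property
    def store(self) -> InMemoryStore:
        return self._store

    def get(self, key: str, **kwargs: Any) -> Iterator[Any]:
        return self.store.get(key, **kwargs)

    def keys(self, **kwargs: Any) -> Iterator[str]:
        return self.store.keys(**kwargs)


class TrainOnlyDriver(SketchStoreDriver):
    """Subclass that filters the keys coming back from the store."""

    def keys(self, **kwargs: Any) -> Iterator[str]:
        return (k for k in super().keys(**kwargs) if k.startswith("train"))


store = InMemoryStore({"train_0": [1, 2], "val_0": [3], "train_1": [4]})
driver = TrainOnlyDriver(store)
visible_keys = list(driver.keys())
first_shard = list(driver.get("train_0"))
print(visible_keys, first_shard)
```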

property store: squirrel.store.store.AbstractStore

Store that is used by the driver.

class squirrel.driver.ZarrDriver(url: str, **kwargs)

Bases: squirrel.driver.driver.MapDriver

A Driver that allows retrieval of items using keys, in addition to allowing iteration over the items.

Initializes ZarrDriver.

name = zarr
get(key: str, fetcher_func: Callable[[zarr.hierarchy.Group, str], Any], **storage_options) → Iterator

Given key, returns a sample defined by self.fetcher_func.

get_iter(fetcher_func: Callable[[zarr.hierarchy.Group, str], Any] = fetch, storage_options: Optional[Dict] = None, flatten: bool = True, **kwargs) → squirrel.iterstream.Composable

Returns an iterable of samples as specified by fetcher_func.

Parameters
  • fetcher_func – A function with two arguments, a zarr.hierarchy.Group object and a key of type string. This function is used to fetch required fields and attributes of a sample. Defaults to squirrel.driver.zarr.fetch().

  • storage_options – Keyword arguments passed to squirrel.zarr.convenience.get_group(), which will be called to retrieve the store that will be provided to fetcher_func.

  • flatten – Whether to flatten the returned iterable. Defaults to True.

  • **kwargs – Keyword arguments that will be passed to MapDriver.get_iter().

Returns

(Composable) Iterable over the items in the store.
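The shape of a custom fetcher_func, a function of a group and a string key, can be sketched without zarr. Here a nested dictionary stands in for the zarr.hierarchy.Group, and fetch_image_and_label and iter_samples are hypothetical names; the field names image and label are illustrative only.

```python
from typing import Any, Callable, Dict, Iterator, Mapping

# Stand-in for a zarr group: maps a key to the fields of one sample.
GroupLike = Mapping[str, Mapping[str, Any]]


def fetch_image_and_label(group: GroupLike, key: str) -> Dict[str, Any]:
    """Hypothetical fetcher: pick the required fields of the sample at `key`."""
    sample = group[key]
    return {"key": key, "image": sample["image"], "label": sample["label"]}


def iter_samples(group: GroupLike,
                 fetcher_func: Callable[[GroupLike, str], Any]) -> Iterator[Any]:
    """Call the fetcher once per key of the group."""
    for key in group:
        yield fetcher_func(group, key)


root = {
    "sample_0": {"image": [[0, 1], [1, 0]], "label": 3, "unused": "meta"},
    "sample_1": {"image": [[1, 1], [0, 0]], "label": 7, "unused": "meta"},
}
fetched = list(iter_samples(root, fetch_image_and_label))
print(fetched)
```

The fetcher decides which fields of a sample are loaded; fields it does not touch (unused here) never reach the stream.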

get_root_group(mode: str = 'r', **storage_options) → squirrel.zarr.group.SquirrelGroup

Returns the root zarr group, i.e. zarr group at self.url.

Parameters
  • mode (str) – IO mode (e.g. “r”, “w”, “a”). Defaults to “r”. mode affects the store of the returned group. See squirrel.zarr.convenience.get_group() for more information.

  • **storage_options – Keyword arguments passed to squirrel.zarr.convenience.get_group().

keys() → Iterator[str]

Returns the keys of the root zarr group.