squirrel.driver.data_frame

Module Contents

Classes

DataFrameDriver

Drives the access to a data source.

Attributes

ENGINE

squirrel.driver.data_frame.ENGINE
class squirrel.driver.data_frame.DataFrameDriver(url: squirrel.constants.URL, storage_options: dict[str, Any] | None = None, engine: ENGINE = 'pandas', df_hooks: Iterable[Callable] | None = None, read_kwargs: dict | None = None, **kwargs)

Bases: squirrel.driver.file.FileDriver

Drives the access to a data source.

Abstract DataFrameDriver.

This defines a common interface for all drivers that use different read methods to load a dataframe, for example from .csv, .xls, or .parquet files. Derived drivers only have to implement the read() method.

Parameters
  • url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. Data type may depend on the derived class.

  • storage_options (Optional[Dict[str, Any]]) – A dict of keyword arguments passed to the file system initializer. Example of storage_options to enable fsspec caching: storage_options={"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}

  • engine (ENGINE) – Which engine to use for DataFrame loading. Currently, all drivers support "pandas" to use Pandas and some support "dask" to asynchronously load DataFrames using Dask.

  • df_hooks (Iterable[Callable], optional) – Preprocessing hooks to execute on the dataframe. The first hook must accept a dask.dataframe.DataFrame or pandas.DataFrame, depending on the engine used.

  • read_kwargs – Arguments passed to all read methods of the derived driver.

  • **kwargs – Keyword arguments passed to the Driver class initializer.
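The df_hooks description suggests that hooks run as a chain, with the first hook receiving the loaded dataframe. The following is a minimal sketch of that idea, assuming each later hook receives the previous hook's return value. It is not squirrel's implementation: apply_hooks, drop_incomplete and add_total are hypothetical names, and a list of dicts stands in for the dataframe so the sketch runs without pandas or dask.

```python
from functools import reduce
from typing import Any, Callable, Iterable, Optional


def apply_hooks(df: Any, hooks: Optional[Iterable[Callable[[Any], Any]]] = None) -> Any:
    """Pass df through each hook in order (hypothetical helper).

    Assumption: every hook returns the object that the next hook receives.
    """
    return reduce(lambda current, hook: hook(current), hooks or [], df)


# Stand-in "dataframe": a list of row dicts instead of a pandas/dask DataFrame.
rows = [{"a": 1, "b": None}, {"a": 2, "b": 5}]


def drop_incomplete(data):
    # Keep only rows in which every value is present.
    return [row for row in data if all(v is not None for v in row.values())]


def add_total(data):
    # Add a derived column; relies on drop_incomplete having run first.
    return [{**row, "total": row["a"] + row["b"]} for row in data]


result = apply_hooks(rows, [drop_incomplete, add_total])
```

Hook order matters in this sketch: add_total would fail on the row with a missing value if it ran before drop_incomplete.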

get_df(**read_kwargs) → DataFrame | pd.DataFrame

Returns the data as a DataFrame.

Parameters

**read_kwargs – Keyword arguments to be passed to read(). Takes precedence over arguments specified at class initialization.

Returns

(dask.dataframe.DataFrame | pandas.DataFrame) Dask or Pandas DataFrame constructed from the file.
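The precedence rule for read_kwargs can be illustrated with a plain dictionary merge. This is a sketch of the documented behaviour only, assuming a simple key-by-key override. resolve_read_kwargs is a hypothetical helper, not part of squirrel, and the keys "sep" and "nrows" are example values.

```python
from typing import Any, Dict, Optional


def resolve_read_kwargs(
    init_kwargs: Optional[Dict[str, Any]], call_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge two sets of read arguments (hypothetical helper).

    Values given at call time override values given at initialization.
    """
    merged = dict(init_kwargs or {})  # copy, so the init-time dict is not mutated
    merged.update(call_kwargs)        # call-time values win on key conflicts
    return merged


# Example values only: arguments stored at initialization vs. passed per call.
init_read_kwargs = {"sep": ";", "nrows": 100}
effective = resolve_read_kwargs(init_read_kwargs, {"nrows": 10})
```

Here effective keeps "sep" from initialization and takes "nrows" from the call, while init_read_kwargs itself is left unchanged.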

get_iter(itertuples_kwargs: dict | None = None, read_kwargs: dict | None = None) → squirrel.iterstream.Composable

Returns an iterator over DataFrame rows.

Note that first the file is read into a DataFrame and then df.itertuples() is called.

Parameters
  • itertuples_kwargs – Keyword arguments to be passed to dask.dataframe.DataFrame.itertuples() or pandas.DataFrame.itertuples().

  • read_kwargs – Keyword arguments to be passed to read(). Takes precedence over arguments specified at class initialization.

Returns

(squirrel.iterstream.Composable) Iterable over the rows of the data frame as namedtuples.
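To show what iterating over rows as namedtuples looks like, the sketch below mimics the shape of itertuples() output using only the standard library. It is a stand-in, not the driver's code path: iter_rows is a hypothetical function, the CSV text is made-up sample data, and all values stay strings because no dataframe engine parses them.

```python
import csv
import io
from collections import namedtuple
from typing import Iterator


def iter_rows(csv_text: str, name: str = "Row") -> Iterator[tuple]:
    """Yield one namedtuple per CSV row (hypothetical stand-in for itertuples)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # the first line holds the column names
    # Like itertuples() with its default index=True, the index comes first.
    Row = namedtuple(name, ["Index", *header])
    for index, values in enumerate(reader):
        yield Row(index, *values)


# Made-up sample data for illustration.
sample = "name,score\nada,3\ngrace,5\n"
rows = list(iter_rows(sample))
first = rows[0]
```

Fields are then available by attribute, for example first.name and first.score, which is the access pattern the returned namedtuples allow.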