squirrel.driver.data_frame
Module Contents
Classes
DataFrameDriver | Drives the access to a data source.
Attributes
- squirrel.driver.data_frame.ENGINE
- class squirrel.driver.data_frame.DataFrameDriver(url: squirrel.constants.URL, storage_options: dict[str, Any] | None = None, engine: ENGINE = 'pandas', df_hooks: Iterable[Callable] | None = None, read_kwargs: dict | None = None, **kwargs)
Bases: squirrel.driver.file.FileDriver
Drives the access to a data source.
Abstract DataFrameDriver.
This defines a common interface for all drivers that read a dataframe from a file format such as .csv, .xls, or .parquet. Derived drivers only have to implement the read() method.
- Parameters
url (URL) – URL to file. Prefix with a protocol like s3:// or gs:// to read from other filesystems. Data type may depend on the derived class.
storage_options (Optional[Dict[str, Any]]) – A dict with keyword arguments passed to the file system initializer. Example of storage_options to enable fsspec caching: storage_options={"protocol": "simplecache", "target_protocol": "gs", "cache_storage": "path/to/cache"}
engine (ENGINE) – Which engine to use for DataFrame loading. Currently, all drivers support "pandas" to use Pandas and some support "dask" to asynchronously load DataFrames using Dask.
df_hooks (Iterable[Callable], optional) – Preprocessing hooks to execute on the dataframe. The first hook must accept a dask.dataframe.DataFrame or pandas.DataFrame, depending on the engine used.
read_kwargs – Arguments passed to all read methods of the derived driver.
**kwargs – Keyword arguments passed to the Driver class initializer.
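The division of labour described above (a derived driver supplies read(), and df_hooks then preprocess the result) can be sketched with the standard library alone. This is a minimal illustration, not squirrel's implementation: SketchFrameDriver, InlineCsvDriver, the text argument, and both hooks are hypothetical names, a list of dicts stands in for a DataFrame, and it is an assumption that hooks run in order with each one receiving the previous result.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Iterable, Optional

Rows = list  # hypothetical stand-in for a DataFrame: a list of dict rows


class SketchFrameDriver(ABC):
    """Hypothetical stand-in for the abstract base; not squirrel's code."""

    def __init__(
        self,
        text: str,
        df_hooks: Optional[Iterable[Callable]] = None,
        read_kwargs: Optional[dict] = None,
    ) -> None:
        self.text = text  # plays the role of the file behind `url`
        self.df_hooks = list(df_hooks or [])
        self.read_kwargs = dict(read_kwargs or {})

    @abstractmethod
    def read(self, **kwargs: Any) -> Rows:
        """The only method a derived driver has to implement."""

    def get_df(self) -> Rows:
        df = self.read(**self.read_kwargs)
        # Assumption: hooks are applied in order, each on the previous result.
        for hook in self.df_hooks:
            df = hook(df)
        return df


class InlineCsvDriver(SketchFrameDriver):
    """Hypothetical derived driver that parses delimited text."""

    def read(self, sep: str = ",", **kwargs: Any) -> Rows:
        header, *lines = self.text.strip().splitlines()
        names = header.split(sep)
        return [dict(zip(names, line.split(sep))) for line in lines]


def drop_incomplete(rows: Rows) -> Rows:
    # First hook: receives exactly what read() returned.
    return [row for row in rows if all(value != "" for value in row.values())]


def cast_age(rows: Rows) -> Rows:
    # Second hook: receives the output of the first hook.
    return [{**row, "age": int(row["age"])} for row in rows]


driver = InlineCsvDriver(
    "name;age\nada;36\n;12\n",
    df_hooks=[drop_incomplete, cast_age],
    read_kwargs={"sep": ";"},
)
result = driver.get_df()
print(result)
```

With the real driver, the first hook would receive a pandas or Dask DataFrame rather than a list, depending on the engine argument.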
- get_df(**read_kwargs) → DataFrame | pd.DataFrame
Returns the data as a DataFrame.
- Parameters
**read_kwargs – Keyword arguments to be passed to read(). Takes precedence over arguments specified at class initialization.
- Returns
(dask.dataframe.DataFrame | pandas.DataFrame) Dask or Pandas DataFrame constructed from the file.
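The precedence rule for read_kwargs can be pictured as a dictionary merge in which call-time values overwrite those given at initialization. The helper below is a hypothetical sketch of that rule under this assumption, not the library's code; merge_read_kwargs and the example keys are illustrative names only.

```python
from typing import Optional


def merge_read_kwargs(init_kwargs: Optional[dict], call_kwargs: Optional[dict]) -> dict:
    """Combine both sources; on a key clash the call-time value wins."""
    merged = dict(init_kwargs or {})   # copy, so the stored defaults stay intact
    merged.update(call_kwargs or {})   # later update overrides earlier keys
    return merged


init_kwargs = {"sep": ";", "nrows": 100}              # given at class initialization
effective = merge_read_kwargs(init_kwargs, {"nrows": 5})  # given to get_df()
print(effective)
```

Copying before updating keeps the initialization-time defaults unchanged, so a later call without overrides still sees the original values.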
- get_iter(itertuples_kwargs: dict | None = None, read_kwargs: dict | None = None) → squirrel.iterstream.Composable
Returns an iterator over DataFrame rows.
Note that the file is first read into a DataFrame and then df.itertuples() is called.
- Parameters
itertuples_kwargs – Keyword arguments to be passed to dask.dataframe.DataFrame.itertuples() or pandas.DataFrame.itertuples().
read_kwargs – Keyword arguments to be passed to read(). Takes precedence over arguments specified at class initialization.
- Returns
(squirrel.iterstream.Composable) Iterable over the rows of the data frame as namedtuples.
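To show what iterating rows as namedtuples looks like, here is a standard-library sketch that imitates the shape of the items. iter_rows is a hypothetical helper, not part of squirrel; it mirrors the index and name keywords that pandas.DataFrame.itertuples() accepts, and takes the column names and row values directly instead of reading a file.

```python
from collections import namedtuple
from typing import Any, Iterator, Sequence


def iter_rows(
    columns: Sequence[str],
    rows: Sequence[Sequence[Any]],
    index: bool = True,
    name: str = "Row",
) -> Iterator[tuple]:
    """Yield one namedtuple per row, optionally led by its position."""
    fields = (["Index"] if index else []) + list(columns)
    row_type = namedtuple(name, fields)
    for position, values in enumerate(rows):
        if index:
            yield row_type(position, *values)
        else:
            yield row_type(*values)


columns = ["city", "population"]
data = [("Oslo", 700000), ("Bergen", 290000)]

with_index = list(iter_rows(columns, data))
without_index = list(iter_rows(columns, data, index=False))

for row in with_index:
    # Fields are reachable by attribute name or by position.
    print(row.Index, row.city, row.population)
```

As the note on get_iter says, the real method reads the whole file into a DataFrame before iterating, so the iterator does not avoid loading the data.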