squirrel.iterstream.iterators

Module Contents

Functions

batched_(→ Iterator[List])

Yield batches of the given size.

dask_delayed_(→ Iterator)

Convert each item in the iterable into a dask.delayed object by applying callback.

filter_(→ Iterator)

Filter items in the iterable by the predicate callable.

flatten_(→ Iterator)

Iterate over iterables in the stream and yield their items.

get_random_range(→ random.Random)

Return a random.Random instance seeded according to the input rng and seed.

getsize(→ int)

Return the estimated size (in bytes) of a Python object. Currently uses a numpy method to calculate the size.

map_(→ Iterator)

Apply the callback to each item in the iterable and yield the item.

monitor_(→ Iterator)

Iterate through an iterable and calculate the metrics based on a rolling window.

shuffle_(→ Iterator)

Shuffle the data in the stream.

take_(→ Iterator)

Yield the first n elements from the iterable.

tqdm_(→ Iterator)

Iterate while using tqdm.

squirrel.iterstream.iterators.batched_(iterable: Iterable, batchsize: int = 20, collation_fn: Optional[Callable] = None, drop_last_if_not_full: bool = True) → Iterator[List]

Yield batches of the given size.

Parameters
  • iterable (Iterable) – Iterable to be batched.

  • batchsize (int, optional) – Target batch size. Defaults to 20.

  • collation_fn (Callable, optional) – Collation function. Defaults to None.

  • drop_last_if_not_full (bool, optional) – If the length of the last batch is less than batchsize, drop it. Defaults to True.

Yields

Batches (i.e. lists) of samples.
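The parameters above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: `batched_sketch` is a hypothetical stand-in, and applying `collation_fn` to each batch before it is yielded is an assumption about how the collation function is used.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


def batched_sketch(
    iterable: Iterable,
    batchsize: int = 20,
    collation_fn: Optional[Callable] = None,
    drop_last_if_not_full: bool = True,
) -> Iterator[Any]:
    """Hypothetical stand-in that mimics the documented batching behaviour."""
    batch: List[Any] = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batchsize:
            # Assumption: collation_fn, when given, is applied to each batch.
            yield collation_fn(batch) if collation_fn is not None else batch
            batch = []
    # A trailing partial batch is emitted only when dropping is disabled.
    if batch and not drop_last_if_not_full:
        yield collation_fn(batch) if collation_fn is not None else batch


full_only = list(batched_sketch(range(7), batchsize=3))
with_remainder = list(
    batched_sketch(range(7), batchsize=3, drop_last_if_not_full=False)
)
summed = list(batched_sketch(range(6), batchsize=3, collation_fn=sum))
```

With seven items and a batch size of three, the default setting discards the single leftover item, while `drop_last_if_not_full=False` yields it as a final short batch.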

squirrel.iterstream.iterators.dask_delayed_(iterable: Iterable, callback: Callable) → Iterator

Convert each item in the iterable into a dask.delayed object by applying callback.

squirrel.iterstream.iterators.filter_(iterable: Iterable, predicate: Callable) → Iterator

Filter items in the iterable by the predicate callable.
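As a rough illustration, filtering a stream can be sketched as a generator. `filter_sketch` is a hypothetical name, and keeping the items for which the predicate returns a truthy value is an assumption modelled on Python's built-in `filter`.

```python
from typing import Callable, Iterable, Iterator


def filter_sketch(iterable: Iterable, predicate: Callable) -> Iterator:
    """Hypothetical stand-in: lazily keep items that satisfy the predicate."""
    for item in iterable:
        # Assumption: truthy predicate results keep the item.
        if predicate(item):
            yield item


evens = list(filter_sketch(range(6), lambda x: x % 2 == 0))
```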

squirrel.iterstream.iterators.flatten_(iterables: Iterable[Iterable]) → Iterator

Iterate over iterables in the stream and yield their items.
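The flattening described here removes one level of nesting. A minimal sketch, with `flatten_sketch` as a hypothetical stand-in rather than the library function:

```python
from typing import Iterable, Iterator


def flatten_sketch(iterables: Iterable[Iterable]) -> Iterator:
    """Hypothetical stand-in: yield the items of each inner iterable in order."""
    for inner in iterables:
        yield from inner


flat = list(flatten_sketch([[1, 2], (3,), [4, 5]]))
```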

squirrel.iterstream.iterators.get_random_range(rng: Optional[random.Random] = None, seed: squirrel.constants.SeedType = None) → random.Random

Return a random.Random instance seeded according to the input rng and seed.

Parameters
  • rng (random.Random, optional) – Either the random module or a random.Random instance. If None, a random.Random() is used.

  • seed (Union[int, float, str, bytes, bytearray, None]) – An int or any other type accepted by random.seed(). Used to seed rng. If None, a unique identifier is used as the seed.
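The seeding rules above can be sketched with the standard library alone. `seeded_rng_sketch` is a hypothetical name, and the process-id-plus-timestamp string used when `seed` is None is only one assumed way to obtain a "unique identifier".

```python
import os
import random
import time
from typing import Optional, Union

SeedLike = Union[int, float, str, bytes, bytearray, None]


def seeded_rng_sketch(
    rng: Optional[random.Random] = None, seed: SeedLike = None
) -> random.Random:
    """Hypothetical stand-in: return rng (or a fresh one) after seeding it."""
    if rng is None:
        rng = random.Random()
    if seed is None:
        # Assumption: any identifier that differs per process and call works.
        seed = f"{os.getpid()}-{time.time_ns()}"
    rng.seed(seed)
    return rng


first = seeded_rng_sketch(seed=42)
second = seeded_rng_sketch(seed=42)
same_draw = first.random() == second.random()

existing = random.Random()
returned = seeded_rng_sketch(existing, seed="abc")
```

Two generators seeded with the same value produce the same sequence, which is what makes a fixed seed useful for reproducible shuffling.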

squirrel.iterstream.iterators.getsize(item: Any) → int

Return the estimated size (in bytes) of a Python object. Where possible, the size is calculated with a numpy method; otherwise, it is estimated from the object's pickled size. This is considered more performant than zarr.storage.getsize().
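The two-step estimate can be sketched without numpy by duck-typing on an `nbytes` attribute, which numpy arrays expose. `estimate_size_sketch` is a hypothetical name, and measuring the pickled size with `len(pickle.dumps(...))` is an assumption about how that fallback is computed.

```python
import pickle
from typing import Any


def estimate_size_sketch(item: Any) -> int:
    """Hypothetical stand-in: prefer a buffer-style byte count, else pickle."""
    # Objects such as numpy arrays and memoryviews report their size directly.
    nbytes = getattr(item, "nbytes", None)
    if isinstance(nbytes, int):
        return nbytes
    # Assumption: the pickled length approximates the size of other objects.
    return len(pickle.dumps(item))


buffer_size = estimate_size_sketch(memoryview(b"abcd"))
list_size = estimate_size_sketch([1, 2, 3])
```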

squirrel.iterstream.iterators.map_(iterable: Iterable, callback: Callable) → Iterator

Apply the callback to each item in the iterable and yield the item.

squirrel.iterstream.iterators.monitor_(iterable: Iterable, callback: Callable, prefix: Optional[str] = None, metrics_conf: squirrel.iterstream.metrics.MetricsConf = MetricsConf, *, window_size: int = 5, **kwargs) → Iterator

Iterate through an iterable and calculate the metrics based on a rolling window. Note that you can configure the metrics to report only IOPS, only throughput, or neither. By default, all metrics are turned on and calculated. If only one metric is turned on, the calculation of the other is skipped and a dummy value of 0 is reported instead. When all metrics are turned off, this method has no effect.

Parameters
  • iterable (Iterable) – Any Iterable-like object.

  • callback (Callable) – wandb.log, mlflow.log_metrics or other metrics logger.

  • prefix (str) – If not None, this is added as a prefix to the metric names. Can be used to monitor the same metric at different points of an iterstream in one run. Spaces are allowed.

  • metrics_conf (MetricsConf) – A config dataclass that controls which metrics are calculated. For details, see squirrel.iterstream.metrics.MetricsConf.

  • window_size (int) – Number of items over which the metrics are averaged. Since each item passes in a very short time window, a rolling-window calculation is more accurate than a per-item one. Must be greater than 0.

  • **kwargs – arguments to pass to your callback function.
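To make the rolling-window idea concrete, here is a minimal sketch that reports only an items-per-second figure. It is not the library's implementation: `monitor_sketch`, the metric name `iops`, and the choice to skip a report when no time has elapsed are all assumptions, and `metrics_conf` and throughput are left out for brevity.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, Dict, Iterable, Iterator, Optional


def monitor_sketch(
    iterable: Iterable,
    callback: Callable[[Dict[str, float]], Any],
    prefix: Optional[str] = None,
    window_size: int = 5,
) -> Iterator:
    """Hypothetical stand-in: report items per second over a rolling window."""
    if window_size <= 0:
        raise ValueError("window_size must be bigger than 0")
    # Keep window_size + 1 timestamps so the window spans window_size items.
    stamps: Deque[float] = deque(maxlen=window_size + 1)
    stamps.append(time.perf_counter())
    # Assumption: the prefix is prepended verbatim to the metric name.
    name = f"{prefix}iops" if prefix is not None else "iops"
    for item in iterable:
        stamps.append(time.perf_counter())
        elapsed = stamps[-1] - stamps[0]
        items_in_window = len(stamps) - 1
        if elapsed > 0:
            callback({name: items_in_window / elapsed})
        yield item


logged = []
passed_through = list(
    monitor_sketch(range(10), logged.append, prefix="train ", window_size=3)
)
```

The items pass through unchanged; the callback receives one dictionary per item, which is the shape that loggers such as `wandb.log` accept.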

squirrel.iterstream.iterators.shuffle_(iterable: Iterable, bufsize: int = 1000, initial: int = 100, rng: Optional[random.Random] = None, seed: squirrel.constants.SeedType = None) → Iterator

Shuffle the data in the stream.

Uses a buffer of size bufsize. Shuffling at startup is less random; this is traded off against yielding samples quickly.

Parameters
  • iterable (Iterable) – Iterable to shuffle.

  • bufsize (int, optional) – Buffer size for shuffling. Defaults to 1000.

  • initial (int, optional) – Minimum number of elements in the buffer before yielding the first element. Must be less than or equal to bufsize; otherwise it is set to bufsize. Defaults to 100.

  • rng (random.Random, optional) – Either the random module or a random.Random instance. If None, a random.Random() is used.

  • seed (Union[int, float, str, bytes, bytearray, None]) – Any value accepted by random.seed().

Yields

Any – Shuffled items of iterable.
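The trade-off between startup latency and randomness can be seen in a buffer-based sketch. This is one assumed way to implement such a shuffle, not the library's algorithm; `shuffle_sketch` and the rule of pulling one extra item per step until the buffer is full are hypothetical.

```python
import random
from typing import Any, Iterable, Iterator, List, Optional

_MISSING = object()  # sentinel marking an exhausted source


def shuffle_sketch(
    iterable: Iterable,
    bufsize: int = 1000,
    initial: int = 100,
    rng: Optional[random.Random] = None,
    seed: Any = None,
) -> Iterator:
    """Hypothetical stand-in: approximate shuffle using a bounded buffer."""
    rng = rng if rng is not None else random.Random()
    if seed is not None:
        rng.seed(seed)
    initial = min(initial, bufsize)  # initial may not exceed bufsize
    source = iter(iterable)
    buf: List[Any] = []
    for item in source:
        buf.append(item)
        if len(buf) < bufsize:
            # Assumption: grow the buffer by pulling one extra item per step.
            extra = next(source, _MISSING)
            if extra is not _MISSING:
                buf.append(extra)
        if len(buf) < initial:
            continue  # still warming up, nothing is yielded yet
        # Swap a random element to the end and yield it.
        k = rng.randrange(len(buf))
        buf[k], buf[-1] = buf[-1], buf[k]
        yield buf.pop()
    # The source is exhausted: emit the remaining items in random order.
    rng.shuffle(buf)
    yield from buf


run_a = list(shuffle_sketch(range(50), bufsize=10, initial=4, seed=0))
run_b = list(shuffle_sketch(range(50), bufsize=10, initial=4, seed=0))
```

In this sketch the first items are drawn from a buffer holding only about `initial` elements, which is why early output is less random than output drawn once the buffer has reached `bufsize`.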

squirrel.iterstream.iterators.take_(iterable: Iterable, n: int) → Iterator

Yield the first n elements from the iterable.

Parameters
  • iterable (Iterable) – Iterable to take from.

  • n (int) – Number of samples to take.

Yields

Any – First n elements of iterable. Fewer elements can be yielded if the iterable does not have enough elements.
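The behaviour for short iterables matches what `itertools.islice` does, so a sketch can lean on it. `take_sketch` is a hypothetical stand-in, not the library function.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_sketch(iterable: Iterable, n: int) -> Iterator:
    """Hypothetical stand-in: yield at most the first n items."""
    yield from islice(iterable, n)


first_three = list(take_sketch(range(10), 3))
short_source = list(take_sketch(range(2), 5))
```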

squirrel.iterstream.iterators.tqdm_(iterable: Iterable, **kwargs) → Iterator

Iterate while using tqdm.