squirrel.iterstream.iterators
Module Contents
Functions
| batched_ | Yield batches of the given size. |
| dask_delayed_ | Convert each item in the iterable into a dask.delayed object by applying callback. |
| filter_ | Filter items in the iterable by the predicate callable. |
| flatten_ | Iterate over iterables in the stream and yield their items. |
| get_random_range | Return a random.Random instance, seeded based on the input rng and seed. |
| getsize | Return the estimated size, in bytes, of a Python object. |
| map_ | Apply the callback to each item in the iterable and yield the result. |
| monitor_ | Iterate through an iterable and calculate the metrics based on a rolling window. |
| shuffle_ | Shuffle the data in the stream. |
| take_ | Yield the first n elements from the iterable. |
| tqdm_ | Iterate while using tqdm. |
-
squirrel.iterstream.iterators.batched_(iterable: Iterable, batchsize: int = 20, collation_fn: Optional[Callable] = None, drop_last_if_not_full: bool = True) → Iterator[List]
Yield batches of the given size.
- Parameters
iterable (Iterable) – Iterable to be batched.
batchsize (int, optional) – Target batch size. Defaults to 20.
collation_fn (Callable, optional) – Collation function. Defaults to None.
drop_last_if_not_full (bool, optional) – If the length of the last batch is less than batchsize, drop it. Defaults to True.
- Yields
Batches (i.e. lists) of samples.
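The effect of drop_last_if_not_full and collation_fn can be illustrated with a small pure-Python stand-in. This is a minimal sketch of the documented behaviour, not the library's implementation: batched_sketch is a hypothetical name, and applying collation_fn to each batch list is an assumption.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def batched_sketch(
    iterable: Iterable,
    batchsize: int = 20,
    collation_fn: Optional[Callable] = None,
    drop_last_if_not_full: bool = True,
) -> Iterator[List]:
    """Hypothetical stand-in mirroring the documented parameters of batched_."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batchsize:
            # Assumption: the collation function receives the whole batch list.
            yield collation_fn(batch) if collation_fn is not None else batch
            batch = []
    # A trailing partial batch is only emitted when explicitly requested.
    if batch and not drop_last_if_not_full:
        yield collation_fn(batch) if collation_fn is not None else batch


full_only = list(batched_sketch(range(7), batchsize=3))
with_tail = list(batched_sketch(range(7), batchsize=3, drop_last_if_not_full=False))
summed = list(batched_sketch(range(6), batchsize=3, collation_fn=sum))
```

With the default drop_last_if_not_full=True, the seventh element is silently dropped, which matters when every sample must be seen (for example during evaluation).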
-
squirrel.iterstream.iterators.dask_delayed_(iterable: Iterable, callback: Callable) → Iterator
Convert each item in the iterable into a dask.delayed object by applying callback.
-
squirrel.iterstream.iterators.filter_(iterable: Iterable, predicate: Callable) → Iterator
Filter items in the iterable by the predicate callable.
-
squirrel.iterstream.iterators.flatten_(iterables: Iterable[Iterable]) → Iterator
Iterate over iterables in the stream and yield their items.
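As an illustration of what flattening a stream of iterables means, here is a minimal sketch with a hypothetical stand-in, flatten_sketch. It assumes that only one level of nesting is removed, which is how the description above reads; it is not the library's code.

```python
from itertools import chain
from typing import Iterable, Iterator


def flatten_sketch(iterables: Iterable[Iterable]) -> Iterator:
    """Hypothetical stand-in: yield every item of every inner iterable, in order."""
    for inner in iterables:
        yield from inner


nested = [[1, 2], (3,), [], "ab"]
flat = list(flatten_sketch(nested))
# Under the one-level assumption this matches itertools.chain.from_iterable.
same_as_chain = flat == list(chain.from_iterable(nested))
```

Note that strings are iterables too, so "ab" contributes two separate characters to the output.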
-
squirrel.iterstream.iterators.get_random_range(rng: Optional[random.Random] = None, seed: squirrel.constants.SeedType = None) → random.Random
Return a random.Random instance, seeded based on the input rng and seed.
- Parameters
rng (random.Random, optional) – Either the random module or a random.Random instance. If None, a random.Random() is used.
seed (Union[int, float, str, bytes, bytearray, None]) – An int or any other type accepted by random.seed(). Will be used to seed rng. If None, a unique identifier will be used to seed.
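The interplay of rng and seed described above can be sketched as follows. seeded_rng_sketch is a hypothetical stand-in, and using a UUID as the "unique identifier" is an assumption made for illustration only.

```python
import random
import uuid
from typing import Optional, Union

SeedType = Union[int, float, str, bytes, bytearray, None]


def seeded_rng_sketch(rng: Optional[random.Random] = None, seed: SeedType = None) -> random.Random:
    """Hypothetical stand-in: return a generator seeded as the parameters describe."""
    if rng is None:
        rng = random.Random()
    if seed is None:
        # Assumption: any unique value will do; the library may pick a different identifier.
        seed = uuid.uuid4().hex
    rng.seed(seed)
    return rng


a = seeded_rng_sketch(seed=42)
b = seeded_rng_sketch(seed=42)
# Two generators seeded with the same value produce the same stream of numbers.
same_stream = [a.random() for _ in range(3)] == [b.random() for _ in range(3)]
```

Passing an explicit seed is what makes downstream steps such as shuffling reproducible across runs.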
-
squirrel.iterstream.iterators.getsize(item: Any) → int
Return the estimated size, in bytes, of a Python object. Currently a NumPy method is used to calculate the size where applicable; otherwise, the size is estimated from the object's pickled size. This is considered more performant than zarr.storage.getsize().
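The two-path estimate described above might look roughly like this. It is a sketch under stated assumptions: getsize_sketch is a hypothetical name, and reading an nbytes attribute (which NumPy arrays expose) stands in for the unspecified "NumPy method". The example avoids importing NumPy by using memoryview, which also exposes nbytes.

```python
import pickle
from typing import Any


def getsize_sketch(item: Any) -> int:
    """Hypothetical stand-in: prefer a buffer's own byte count, else the pickled size."""
    # Assumption: array-like objects report their size through an integer nbytes attribute.
    nbytes = getattr(item, "nbytes", None)
    if isinstance(nbytes, int):
        return nbytes
    # Fallback: the length of the serialized object is used as a rough estimate.
    return len(pickle.dumps(item))


payload = {"a": [1, 2, 3]}
pickled_estimate = getsize_sketch(payload)
buffer_estimate = getsize_sketch(memoryview(b"abcd"))
```

The pickled size is only an estimate: it includes serialization overhead and does not equal the object's in-memory footprint.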
-
squirrel.iterstream.iterators.map_(iterable: Iterable, callback: Callable) → Iterator
Apply the callback to each item in the iterable and yield the result.
-
squirrel.iterstream.iterators.monitor_(iterable: Iterable, callback: Callable, prefix: Optional[str] = None, metrics_conf: squirrel.iterstream.metrics.MetricsConf = MetricsConf, *, window_size: int = 5, **kwargs) → Iterator
Iterate through an iterable and calculate the metrics based on a rolling window. Note that you can configure the metrics to output only IOPS, only throughput, or neither. All metrics are turned on and calculated by default. If only one metric is turned on, the calculation of the other metric is skipped and a dummy value of 0 is reported instead. When all metrics are turned off, this function has no effect.
- Parameters
iterable (Iterable) – Any Iterable-like object.
callback (Callable) – wandb.log, mlflow.log_metrics, or another metrics logger.
prefix (str) – If not None, this is added as a prefix to the metric names. Can be used to monitor the same metric at different points of an iterstream in one run. Spaces are allowed.
metrics_conf (MetricsConf) – A config dataclass that controls which metrics are calculated. For details, see squirrel.iterstream.metrics.MetricsConf.
window_size (int) – Number of items to average over when calculating the metrics. Since each item passes by in a very small time window, a rolling-window calculation is more accurate. Must be greater than 0.
**kwargs – Arguments to pass to your callback function.
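To make the rolling-window idea concrete, the following is a minimal sketch rather than the library's implementation. monitor_sketch, the sizeof and clock parameters, and the metric names "iops" and "throughput" are all hypothetical. Throughput is reported here in bytes per second, and the prefix is simply prepended to each name; both are assumptions.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Iterator, Optional, Tuple


def monitor_sketch(
    iterable: Iterable,
    callback: Callable[[Dict[str, float]], None],
    prefix: Optional[str] = None,
    *,
    window_size: int = 5,
    sizeof: Callable[[object], int] = len,  # hypothetical size estimator
    clock: Callable[[], float] = time.perf_counter,  # injectable so the sketch is testable
) -> Iterator:
    """Hypothetical stand-in: report items/s and bytes/s over the last window_size items."""
    if window_size <= 0:
        raise ValueError("window_size must be bigger than 0")
    # Each entry holds (seconds since the previous item, size of the item).
    window: Deque[Tuple[float, int]] = deque(maxlen=window_size)
    last = clock()
    for item in iterable:
        now = clock()
        window.append((now - last, sizeof(item)))
        last = now
        elapsed = sum(dt for dt, _ in window)
        if elapsed > 0:
            metrics = {
                "iops": len(window) / elapsed,
                "throughput": sum(size for _, size in window) / elapsed,
            }
            if prefix is not None:
                metrics = {f"{prefix}{name}": value for name, value in metrics.items()}
            callback(metrics)
        yield item


# A fake clock that advances one second per item keeps the numbers deterministic.
ticks = iter([0.0, 1.0, 2.0, 3.0])
logged = []
passed_through = list(
    monitor_sketch(
        [b"ab", b"cdef", b"gh"],
        logged.append,
        prefix="train/",
        window_size=2,
        clock=lambda: next(ticks),
    )
)
```

Averaging over a window smooths out per-item timing noise: with one item per second, the reported rate stays at 1 item/s while the throughput reflects the mean size of the last two items.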
-
squirrel.iterstream.iterators.shuffle_(iterable: Iterable, bufsize: int = 1000, initial: int = 100, rng: Optional[random.Random] = None, seed: squirrel.constants.SeedType = None) → Iterator
Shuffle the data in the stream.
Uses a buffer of size bufsize. Shuffling at startup is less random; this is traded off against yielding samples quickly.
- Parameters
iterable (Iterable) – Iterable to shuffle.
bufsize (int, optional) – Buffer size for shuffling. Defaults to 1000.
initial (int, optional) – Minimum number of elements in the buffer before yielding the first element. Must be less than or equal to bufsize; otherwise it is set to bufsize. Defaults to 100.
rng (random.Random, optional) – Either the random module or a random.Random instance. If None, a random.Random() is used.
seed (Union[int, float, str, bytes, bytearray, None]) – A value that can be passed to random.seed().
- Yields
Any – Shuffled items of iterable.
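The trade-off between bufsize and initial can be seen in a generic buffer-based shuffle. The sketch below is one common way to implement such a shuffle and is not claimed to be what shuffle_ does internally; shuffle_sketch is a hypothetical name, and pulling one extra item per step while the buffer is below capacity is an assumption.

```python
import random
from typing import Iterable, Iterator, Optional


def shuffle_sketch(
    iterable: Iterable,
    bufsize: int = 1000,
    initial: int = 100,
    rng: Optional[random.Random] = None,
    seed=None,
) -> Iterator:
    """Hypothetical stand-in: shuffle a stream through a bounded buffer."""
    if rng is None:
        rng = random.Random()
    if seed is not None:
        rng.seed(seed)
    initial = min(initial, bufsize)  # documented rule: initial is capped at bufsize
    it = iter(iterable)
    buf = []
    started = False
    for item in it:
        buf.append(item)
        if len(buf) < bufsize:
            # Below capacity: pull one extra item so the buffer keeps growing while yielding.
            try:
                buf.append(next(it))
            except StopIteration:
                pass
        if not started and len(buf) < initial:
            continue  # still warming up; nothing is yielded yet
        started = True
        # Swap a random element to the end and pop it, which is O(1).
        k = rng.randrange(len(buf))
        buf[k], buf[-1] = buf[-1], buf[k]
        yield buf.pop()
    # The source is exhausted: drain whatever is left in random order.
    rng.shuffle(buf)
    yield from buf


shuffled = list(shuffle_sketch(range(20), bufsize=8, initial=4, seed=0))
shuffled_again = list(shuffle_sketch(range(20), bufsize=8, initial=4, seed=0))
```

A small initial value yields the first sample sooner, but early samples are drawn from only a few candidates, which is why shuffling at startup is less random.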
-
squirrel.iterstream.iterators.take_(iterable: Iterable, n: int) → Iterator
Yield the first n elements from the iterable.
- Parameters
iterable (Iterable) – Iterable to take from.
n (int) – Number of samples to take.
- Yields
Any – First n elements of iterable. Fewer elements are yielded if the iterable does not have enough elements.
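The documented behaviour, including the short-iterable case, corresponds to what itertools.islice does in the standard library. A minimal sketch with a hypothetical stand-in, take_sketch:

```python
from itertools import islice
from typing import Iterable, Iterator


def take_sketch(iterable: Iterable, n: int) -> Iterator:
    """Hypothetical stand-in: yield at most the first n items."""
    yield from islice(iterable, n)


first_three = list(take_sketch(range(10), 3))
# Asking for more items than exist does not raise; the stream simply ends early.
short = list(take_sketch("ab", 5))
```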
-
squirrel.iterstream.iterators.tqdm_(iterable: Iterable, **kwargs) → Iterator
Iterate while using tqdm.