Store¶

Store manages the storage and retrieval of data and serve as an abstraction layer under StoreDriver to ease the implementation of custom drivers.

Squirrel store API defines three methods:

Store.set(): Used to store a value with a key.
Store.get(): Used to retrieve a previously stored value.
Store.keys(): Returns all the keys for which the store has a value.

Note

A Store permits persisting of value via the set() method whereas a Driver can only read from a data source and cannot write to it.

SquirrelStore¶

SquirrelStore is the recommended store to use with squirrel. It comes with several optimizations to improve read/write speed and reduce storage size.

With SquirrelStore, it is possible to:

Save shards (i.e. a collection of samples) in the store and retrieve them fast (see Performance Benchmark)
Serialize shards using a SquirrelSerializer instance

A Store can be initialized as below:

import tempfile

from squirrel.serialization import MessagepackSerializer
from squirrel.store import SquirrelStore

tmpdir = tempfile.TemporaryDirectory()
msg_store = SquirrelStore(url=tmpdir.name, serializer=MessagepackSerializer())

You can get an instance of a store from driver too. This is the recommended approach, unless low-level control is needed.

from squirrel.driver import MessagepackDriver

driver = MessagepackDriver(tmpdir.name)
store = driver.store

Writing samples as shards using SquirrelStore¶

Approach 1: Write/read shards sequentially¶

import numpy as np


def get_sample(i):
    return {
        "image": np.random.random((3, 3, 3)),
        "label": np.random.choice([1, 2]),
        "metadata": {"key": i},
    }


N_SAMPLES, N_SHARDS = 100, 10
samples = [get_sample(i) for i in range(N_SAMPLES)]
shards = [samples[i : i + 10] for i in range(N_SHARDS)]

Shards can be saved by using the set() method.

for i, shard in enumerate(shards):
    store.set(
        shard,
        key=f"shard_{i}",  # dont need to set key, if omitted, a random key will be used
    )

assert len(list(store.keys())) == N_SHARDS

Let’s check out a sample:

for key in store.keys():
    shard = store.get(key)
    for sample in shard:
        print(sample)
        break
    break

# Clean up
tmpdir.cleanup()

Approach 2: Write/read shards asynchronously using iterstream¶

SquirrelStore does not buffer any data, as soon as set() is called, the data is written to the store. Because of this, writing to the store can be easily parallelized. In the following example, we use async_map from Iterstream module to write shards to the store in parallel and also read from the store in parallel.

from squirrel.iterstream import IterableSource

tmpdir = tempfile.TemporaryDirectory()
store = MessagepackDriver(tmpdir.name).store

# note that we are not providing keys for the shards here, random keys will be used
IterableSource(shards).async_map(store.set).join()
assert len(list(store.keys())) == 10

samples = IterableSource(store.keys()).async_map(store.get).flatten().collect()
assert len(samples) == 100

# Clean up
tmpdir.cleanup()