Store types and the backend layer¶
Terms¶
- Storage backend
A pluggable adapter that exposes a byte-level key/value interface to a concrete storage system (local file system, S3, GCS, Azure, HTTP). Defined by the
StorageBackendprotocol inzarr_vectors/core/backends/base.py; every concrete backend implements both the sync and async variants.- Backend name
A short string that selects a backend:
"local","obstore", or"fsspec". Passed as thebackend=kwarg to public store entry points or set globally via theZARR_VECTORS_BACKENDenvironment variable.- URL scheme
The leading
scheme://of the store URL (e.g.s3://bucket/foo). Used for backend auto-detection when no explicitbackend=is given. Bare paths (no scheme) andfile://resolve to the local backend.- ZIP store
A Zarr store that packages all chunk files into a single ZIP archive. Read-only via
zarr.storage.ZipStore; useful for archiving or transferring a complete store as a single file. Not produced byzarr-vectorswriters directly.- In-memory store
Backed by a Python dict. Useful for unit tests and intermediate processing pipelines that do not need disk persistence.
- Consolidated metadata
An optional
.zmetadatafile at the store root that caches everyzarr.json/.zattrsfile in the hierarchy. Avoids many small metadata reads on high-latency stores (S3/GCS).
Introduction¶
ZV stores are backend-agnostic: the same create_store / open_store /
open_zv calls work whether the data lives on a local SSD, a ZIP
archive, an in-memory dict, or a cloud object store. The backing store
type affects performance characteristics (latency, throughput, cost per
request) but not the data model or the semantics of any operation.
The library ships with three built-in backends:
Backend name |
Class |
Implementation |
Optional dep |
|---|---|---|---|
|
|
none (always available) |
|
|
|
|
|
|
|
|
For most users the relevant choice is between local (development,
HPC analysis) and cloud (sharing, Neuroglancer serving). The two
cloud backends are interchangeable from the caller’s perspective — the
library picks obstore automatically when it’s installed and falls
back to fsspec otherwise.
Technical reference¶
Public entry points¶
All three entry points accept backend= and **backend_kwargs:
from zarr_vectors.core.store import create_store, open_store
from zarr_vectors.lazy import open_zv
create_store(path, *, bounds=None, chunk_shape=None, axes=None,
geometry_types=None, ..., backend=None, **backend_kwargs) -> Group
open_store(path, mode="r", *, backend=None, **backend_kwargs) -> Group
open_zv(path, *, backend=None, **backend_kwargs) -> ZVStore
backend is one of "local" / "obstore" / "fsspec" or None for
auto-detect. Extra **backend_kwargs are forwarded to the backend
constructor (credentials, region, etc.).
URL scheme dispatch¶
When backend=None, the resolution order in
backends.resolve_backend_name
is:
Explicit
backend=kwarg — always wins.ZARR_VECTORS_BACKENDenvironment variable — second.URL-scheme auto-detect:
Scheme(s) |
Resolved backend |
|---|---|
(no scheme), |
|
|
|
SCHEMES_LOCAL and SCHEMES_OBJECT_STORE in
backends/__init__.py
are the canonical tables.
Local file system store¶
The default. Pass any local path as a string or pathlib.Path:
from zarr_vectors.types.points import write_points, read_points
write_points("scan.zarrvectors", positions, chunk_shape=(200., 200., 200.))
result = read_points("scan.zarrvectors")
Internally the local backend uses zarr.storage.LocalStore. Each chunk
is one file; directories correspond to group and array paths. A typical
tree:
scan.zarrvectors/
├── zarr.json
├── .zattrs
├── 0/
│ ├── zarr.json
│ ├── .zattrs
│ ├── vertices/
│ │ ├── zarr.json
│ │ └── c/
│ │ ├── 0/0/0
│ │ └── …
│ └── …
└── …
Performance notes:
File creation overhead dominates write performance on file systems with high inode metadata costs (many NFS mounts, HDD-backed systems). Use larger
chunk_shapevalues to reduce file count.Linux
ext4andxfshandle large directories well;FAT32and some network shares have per-directory file count limits. For more than ~100 000 chunks per array, use the sharding codec.On HPC systems with Lustre or GPFS, stripe the
.zarrvectorsdirectory for write parallelism:lfs setstripe -c 8 scan.zarrvectors.
Cloud object stores¶
The preferred cloud backend is obstore (Rust bindings, parallel
range reads). It’s enabled automatically when installed:
pip install "zarr-vectors[obstore]" # preferred
# OR
pip install "zarr-vectors[cloud]" # fsspec + scheme drivers (fallback)
S3 (read):
from zarr_vectors.types.points import read_points
result = read_points(
"s3://open-neuro-data/datasets/synchrotron.zarrvectors",
level=2,
)
If obstore is installed, the URL scheme auto-routes to it; otherwise
fsspec is used. Force a specific backend with backend=:
read_points(
"s3://my-bucket/scan.zarrvectors",
backend="obstore",
region="us-west-2",
)
GCS / Azure (read):
read_points("gs://my-bucket/scan.zarrvectors") # obstore or fsspec
read_points("az://account/container/scan.zarrvectors")
Authenticated writes: pass credentials via **backend_kwargs (they
are forwarded to the backend constructor) or rely on the standard
ambient credentials (~/.aws/credentials, gcloud auth application-default, environment variables).
from zarr_vectors.types.points import write_points
write_points(
"s3://my-bucket/scan.zarrvectors",
positions,
chunk_shape=(500., 500., 500.), # larger chunks = fewer S3 objects
backend="obstore",
region="us-east-1",
)
Performance notes for object stores:
Each ZV chunk is one object on cloud. Minimise the per-chunk count by increasing
chunk_shaperelative to your typical query bbox (target ≥ 100 KB compressed per chunk).Enable consolidated metadata (below) to collapse store-open latency on stores with many resolution levels / attribute arrays.
For Neuroglancer serving, use
zv-ngtoolswhich implements its own request batching on top of the same backend layer.
Selecting the backend explicitly¶
Use the backend= kwarg when you need to:
Override the auto-detect (e.g. force
fsspeceven thoughobstoreis installed, for compatibility with a custom filesystem):open_store("s3://bucket/scan.zarrvectors", backend="fsspec")
Use a backend with a non-cloud URL (e.g. an
fsspecdriver pointed at SFTP):open_store("sftp://host/path/scan.zarrvectors", backend="fsspec")
Or set ZARR_VECTORS_BACKEND=fsspec once in the environment to apply
the override for the whole process.
ZIP store¶
ZIP stores are read-only at the Zarr layer and not produced by
zarr-vectors writers directly. To distribute a store as a single
file, write to a local store and then archive:
zip -r scan.zarrvectors.zip scan.zarrvectors/
Open the archive via zarr.storage.ZipStore directly (not through
open_store):
import zarr
store = zarr.storage.ZipStore("scan.zarrvectors.zip", mode="r")
root = zarr.open_group(store, mode="r")
In-memory store¶
Useful for tests and one-off pipelines. Pass a zarr.storage.MemoryStore
through to the Zarr layer directly:
import zarr
from zarr_vectors.types.points import write_points, read_points
store = zarr.storage.MemoryStore()
write_points(store, positions, chunk_shape=(200., 200., 200.))
result = read_points(store)
In-memory stores are discarded when the Python process exits.
Consolidated metadata¶
On stores with many groups (large pyramids, many attribute arrays), opening the store issues one metadata request per Zarr group and array. On S3 with ~50 ms per request, this adds noticeable latency.
Consolidate after writing:
import zarr
from zarr_vectors.core.store import open_store
root = open_store("scan.zarrvectors", mode="r+")
zarr.consolidate_metadata(root.zarr_group.store)
This writes a single .zmetadata file at the store root; subsequent
open_store() calls read from it instead of issuing individual
metadata requests. Regenerate after any structural change (e.g. adding
a resolution level).
Choosing a store type¶
Scenario |
Recommended setup |
|---|---|
Local analysis / development |
Local backend (default) |
HPC batch processing |
Local backend on Lustre/GPFS with striping |
Archiving a completed dataset |
Local write, then |
Cloud sharing / Neuroglancer serving |
|
Unit tests |
In-memory store via |
Very large stores with many small chunks |
Local or cloud + the sharding codec |
Capability tokens¶
The backend layer is independent of the
format capability tokens stamped on
RootMetadata.format_capabilities — backends carry data bytes, not
format semantics. See the capability list for CAP_MULTISCALE_LINKS,
CAP_PRESERVED_OBJECT_IDS, CAP_SHARED_VERTEX_GROUPS.