Cloud stores¶
ZV stores on Amazon S3, Google Cloud Storage, Azure Blob Storage, and public HTTP are accessed through the backend layer described in the store types spec page. The read/write API is identical to local stores — only the URL changes.
Installation¶
The library ships with two cloud-capable backends; install whichever
matches your needs. obstore is preferred (Rust-based, faster parallel
reads), and the library auto-prefers it when both are installed.
pip install "zarr-vectors[obstore]" # preferred
# OR
pip install "zarr-vectors[cloud]" # fsspec + s3fs/gcsfs/adlfs (fallback)
You can also install both — the library will pick obstore and fall back
to fsspec for any URL scheme it can’t handle.
Backend resolution at a glance¶
When you pass a cloud URL to any read_* / write_* / open_store /
open_zv call, the backend is chosen in this order:
Explicit
backend=kwarg — e.g.backend="fsspec"forces fsspec even if obstore is installed.ZARR_VECTORS_BACKENDenvironment variable — e.g.export ZARR_VECTORS_BACKEND=obstore.URL-scheme auto-detect —
s3://,gs://,gcs://,az://,azure://,abfs://,http(s)://→obstoreif installed elsefsspec.
If neither cloud backend is installed for a cloud URL, the call raises a
StoreError with an install hint.
See zarr_vectors/core/backends/__init__.py
for the canonical scheme table and
tests/test_backends.py for the
test matrix.
Amazon S3¶
Anonymous (public) read access¶
Many open neuroscience datasets on S3 allow anonymous access. The
backend layer handles it transparently — there’s no anon=True kwarg
to pass:
from zarr_vectors.types.points import read_points
result = read_points(
"s3://open-neuro-data/datasets/synchrotron.zarrvectors",
level=2, # coarse level — fast
)
print(result["vertex_count"])
Authentication is opt-in: if the bucket allows anonymous reads, the default backend config will use it. If you need to force anonymous access in a credentialed environment, pass it through:
read_points(
"s3://open-neuro-data/scan.zarrvectors",
backend="obstore",
skip_signature=True, # obstore-specific kwarg
)
Authenticated access¶
Ambient credentials work without configuration — obstore and fsspec
both read ~/.aws/credentials, environment variables, and IAM roles:
result = read_points("s3://my-bucket/scan.zarrvectors")
To pass credentials explicitly, forward them through **backend_kwargs:
import os
from zarr_vectors.types.points import read_points
result = read_points(
"s3://my-bucket/scan.zarrvectors",
backend="obstore",
aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
region="us-east-1",
)
Keyword names match the active backend (obstore uses aws_*;
fsspec/s3fs uses key / secret). Prefer ambient credentials when
possible.
Writing to S3¶
import numpy as np
from zarr_vectors.types.points import write_points
rng = np.random.default_rng(0)
positions = rng.uniform(0, 1000, (100_000, 3)).astype(np.float32)
write_points(
"s3://my-bucket/datasets/scan.zarrvectors",
positions,
chunk_shape=(500., 500., 500.), # larger chunks = fewer S3 objects
bin_shape=(100., 100., 100.),
backend="obstore",
region="us-east-1",
)
Chunk size guidance for S3. Each ZV spatial chunk becomes one S3
object. S3 charges per PUT (write) and GET (read) request. To minimise
cost and request count, use chunk_shape values that produce chunks of
at least 100 KB compressed. For typical synchrotron point clouds at
~100 000 vertices per chunk (float32, Blosc-compressed), this is
roughly 200–500 µm per axis.
S3 bucket configuration for Neuroglancer serving¶
To serve a ZV store from S3 to zv-ngtools or the Neuroglancer web
app, configure CORS on the bucket:
[
{
"AllowedHeaders": ["*"],
"AllowedMethods": ["GET", "HEAD"],
"AllowedOrigins": ["*"],
"ExposeHeaders": ["ETag", "Content-Length"],
"MaxAgeSeconds": 3600
}
]
Apply with the AWS CLI:
aws s3api put-bucket-cors \
--bucket my-bucket \
--cors-configuration file://cors.json
Google Cloud Storage¶
from zarr_vectors.types.polylines import read_polylines, write_polylines
# Read — uses Application Default Credentials
result = read_polylines("gs://my-bucket/tracts.zarrvectors", level=1)
print(result["polyline_count"])
# Write
write_polylines(
"gs://my-bucket/tracts.zarrvectors",
streamlines,
chunk_shape=(100., 100., 100.),
bin_shape=(25., 25., 25.),
geometry_type="streamline",
)
To pass GCS credentials explicitly:
read_polylines(
"gs://my-bucket/tracts.zarrvectors",
backend="fsspec", # gcsfs route
token="/path/to/service-account.json",
)
GCS CORS:
gsutil cors set cors.json gs://my-bucket
Azure Blob Storage¶
from zarr_vectors.types.points import read_points
result = read_points("az://account/container/scan.zarrvectors")
# or: "abfs://container@account.dfs.core.windows.net/scan.zarrvectors"
Ambient credentials follow the standard DefaultAzureCredential chain
(env vars, managed identity, Azure CLI session).
Building a pyramid on a remote store¶
The pyramid builder takes a path or URL the same way the writers do:
from zarr_vectors.multiresolution.coarsen import build_pyramid
build_pyramid(
"s3://my-bucket/scan.zarrvectors",
factors=[(2.0, 1.0), (2.0, 1.0)], # coarsen 2× per level, no sparsity
cross_level_depth=1, # ±1 cross-level edges per pair
cross_level_storage="explicit", # write both +1 and -1
)
For very large datasets on cloud, the pyramid build is I/O bound. Run
on a cloud VM in the same region as the bucket — building a pyramid
from within AWS us-east-1 against a bucket in the same region is
~10× faster than from a laptop.
See docs/tutorials/multiscale/building_pyramids.md
for the full pyramid API, and examples/07_multiscale_links.ipynb
for the cross-level link layout that build_pyramid produces.
Consolidated metadata¶
On stores with many resolution levels and attribute arrays, opening the store requires one metadata request per Zarr group and array. On S3 with ~50 ms per request, this adds noticeable latency.
Consolidated metadata packs all .zattrs and zarr.json files into a
single .zmetadata key, reducing store-open latency to one request:
import zarr
from zarr_vectors.core.store import open_store
root = open_store("s3://my-bucket/scan.zarrvectors", mode="r+")
zarr.consolidate_metadata(root.zarr_group.store)
After consolidation, subsequent opens are dramatically faster. Regenerate consolidated metadata after any structural change (adding a resolution level, writing new attributes).
Reading from cloud with the lazy API¶
import numpy as np
from zarr_vectors.lazy import open_zv
store = open_zv("s3://open-neuro/scan.zarrvectors")
print(store.levels) # metadata only — no chunk I/O
print(store[2].vertex_count) # one metadata request
# Coarse overview — a handful of chunk requests
coarse = store[store.levels[-1]].vertices.compute()
# Detail in a small region — N chunk requests
from zarr_vectors.types.points import read_points
detail = read_points(
"s3://open-neuro/scan.zarrvectors",
bbox=(np.array([500., 500., 500.]),
np.array([700., 700., 700.])),
)
open_zv accepts the same backend= / **backend_kwargs as
open_store.
Forcing a specific backend¶
Pass backend="obstore" or backend="fsspec" to override
auto-detection:
# Force fsspec even though obstore is installed
read_points("s3://my-bucket/scan.zarrvectors", backend="fsspec")
# Use fsspec for a non-cloud URL (e.g. SFTP)
read_points("sftp://host/path/scan.zarrvectors", backend="fsspec")
Or set it globally for the process:
export ZARR_VECTORS_BACKEND=fsspec
Estimating cloud storage cost¶
A quick estimate for an S3-hosted point cloud store:
import zarr
from zarr_vectors.core.store import open_store
root = open_store("s3://my-bucket/scan.zarrvectors", mode="r")
zg = root.zarr_group
# Walk every array and sum stored bytes / chunk counts.
total_bytes = sum(a.nbytes_stored for _, a in zg.arrays(recurse=True))
total_chunks = sum(a.nchunks_initialized for _, a in zg.arrays(recurse=True))
print(f"Total compressed size: {total_bytes / 1e9:.2f} GB")
print(f"Total S3 objects: {total_chunks:,}")
print(f"Monthly S3 storage: ~${total_bytes / 1e9 * 0.023:.2f}"
f" (us-east-1 standard)")
print(f"Cost per 1M GETs: ~${total_chunks / 1e6 * 0.40:.4f}")
This counts every chunk across all resolution levels and every array
family — including the new links/<delta>/, cross_chunk_links/<delta>/,
and cross_chunk_link_attributes/<name>/<delta>/ arrays produced by
the multiscale-links layout.