Lazy loading

The ZVF read functions (read_points, read_polylines, etc.) are eager: they fetch and return all requested data immediately. For large stores or remote datasets, an eager read of the full store is impractical.

The lazy API provides a open_zv object that opens the store metadata without reading any array data. Array slices are fetched on demand — only when accessed. This is the recommended access pattern for:

  • Stores too large to fit in memory.

  • Remote stores (S3, GCS) where each array fetch is a network request.

  • Interactive viewers that need the coarsest level first and finer levels on demand.

  • Analysis pipelines that filter by metadata before deciding which data to load.


Opening a store lazily

from zarr_vectors.lazy import open_zv

# Opens metadata only — no vertex data fetched
store = open_zv("synchrotron.zarrvectors")

print(store.geometry_type)           # "point_cloud"
print(store.spatial_dims)            # 3
print(store.chunk_shape)             # (200.0, 200.0, 200.0)
print(store.levels)                  # [0, 1, 2, 3]
print(store.vertex_count(level=0))   # 500000 (from metadata, no data read)
print(store.vertex_count(level=2))   # 8022
print(store.bounding_box)            # (array([0,0,0]), array([2000,2000,2000]))

Opening a remote store is identical — pass an fsspec URL:

import s3fs
from zarr_vectors.lazy import open_zv

# 0.4+: backend layer auto-routes cloud URLs via obstore (or fsspec).
# Public access works without explicit anon=True.
"s3://open-neuro/synchrotron.zarrvectors"
)
print(store.vertex_count(level=0))   # metadata only — one S3 LIST request

Level-of-detail reads

Automatic level selection

auto_level selects the coarsest level whose bin size is smaller than a given target resolution:

# Load the coarsest level adequate for a 200 µm resolution viewport
result = store.read(target_resolution=200.0)
print(result["level"])           # 2 (bin_shape = [200, 200, 200] at level 2)
print(result["vertex_count"])    # 8022

# Load for a detailed 50 µm view
result = store.read(target_resolution=50.0)
print(result["level"])           # 0 (finest level; bin_shape = [50, 50, 50])

target_resolution is compared against bin_shape at each level. The selected level is the highest level N such that max(bin_shape[N]) target_resolution.

Bbox + level-of-detail

Combine spatial restriction with level selection for viewport-driven reads:

# Viewport: 500³ µm region, medium detail
result = store.read(
    bbox=(np.array([800., 800., 800.]), np.array([1300., 1300., 1300.])),
    target_resolution=100.0,
)
print(result["level"])           # level where bin_shape ≤ 100 µm
print(result["vertex_count"])

Explicit level override

result = store.read(level=1, bbox=(lo, hi))

Streaming large stores chunk by chunk

For datasets that do not fit in memory, iterate over chunks instead of loading the full store at once:

Iterate over all chunks at a level

for chunk_coord, chunk_data in store.iter_chunks(level=0):
    positions  = chunk_data["positions"]    # (N_chunk, 3) float32
    attributes = chunk_data["attributes"]
    # Process this chunk — e.g. compute statistics, write to database
    yield process_chunk(positions, attributes)

Iterate over chunks in a bounding box

lo = np.array([0., 0., 0.])
hi = np.array([500., 500., 500.])

for chunk_coord, chunk_data in store.iter_chunks(level=0, bbox=(lo, hi)):
    print(chunk_coord, chunk_data["positions"].shape)

This yields only the chunks that overlap the bbox, in row-major order. Each chunk is fetched and decompressed exactly once.

Streaming statistics over a large point cloud

total_count = 0
intensity_sum = 0.0

for _, chunk_data in store.iter_chunks(level=0):
    n = len(chunk_data["positions"])
    total_count   += n
    intensity_sum += chunk_data["attributes"]["intensity"].sum()

mean_intensity = intensity_sum / total_count
print(f"Mean intensity over {total_count} points: {mean_intensity:.4f}")

Lazy array access

The open_zv exposes each array as a lazy zarr.Array that can be sliced directly:

# Access the raw vertices array for level 0 without reading it
verts_array = store.raw_array("vertices", level=0)
print(verts_array.shape)    # (Cx, Cy, Cz, N_max, 3)
print(verts_array.dtype)    # float32

# Read one specific chunk (fetches only that chunk from disk/S3)
chunk_verts = verts_array[2, 3, 1]   # chunk at grid coord (2,3,1)
print(chunk_verts.shape)    # (N_max, 3) — may include fill-value rows

# Access the fragment index for one chunk
from zarr_vectors.core.arrays import read_fragment_index

fidx = read_fragment_index(store.level_group(0), "vertex_fragments", (2, 3, 1))
print(fidx.num_fragments)              # F: number of fragments in chunk (2,3,1)
print(fidx.num_range_fragments)        # R: number of range fragments
start, count = fidx.range(0)           # first fragment's row range (if it's a range)
rows = fidx.indices(0)                 # row indices into vertices/<2.3.1>

See Fragment-index arrays for the byte layout and the full FragmentIndex API.

Accessing object attributes without loading vertices

# Read all per-streamline FA values without fetching vertex data
fa_array = store.raw_array("object_attributes/mean_fa", level=0)
fa_values = fa_array[:]   # shape (n_objects,) — one request

# Filter objects by FA
high_fa_ids = np.where(fa_values > 0.5)[0]
print(f"{len(high_fa_ids)} high-FA streamlines")

Remote stores (S3 / GCS)

S3 with credentials

import s3fs
from zarr_vectors.lazy import open_zv

store = open_zv("s3://my-bucket/dataset/tracts.zarrvectors")

GCS

import gcsfs
from zarr_vectors.lazy import open_zv

store = open_zv("gs://my-bucket/tracts.zarrvectors")

Performance on object stores

Remote stores have per-request latency (~50–200 ms for S3). The lazy API minimises requests by:

  1. Reading consolidated metadata in a single request (if available).

  2. Batching chunk reads for spatial queries (multiple chunks fetched in parallel if n_workers > 1).

  3. Caching decompressed chunks in an LRU cache (configurable size).

store = open_zv(
    "s3://my-bucket/tracts.zarrvectors",
    cache_size=256,     # cache up to 256 decompressed chunks in memory
    n_workers=8,        # fetch up to 8 chunks in parallel
)

Prefetching for sequential access

When iterating over chunks sequentially, enable prefetch to overlap decompression of future chunks with processing of the current one:

for chunk_coord, chunk_data in store.iter_chunks(level=1, prefetch=4):
    # Process current chunk while next 4 are fetching in the background
    yield analyse(chunk_data)

open_zv API summary

from zarr_vectors.lazy import open_zv

store = open_zv(path_or_store)

# Metadata (no data I/O)
store.geometry_type           # str
store.spatial_dims            # int
store.chunk_shape             # tuple
store.bin_shape               # tuple (at level 0)
store.levels                  # list[int]
store.n_objects               # int (for discrete-object types)
store.bounding_box            # (lo, hi) arrays

# Per-level metadata
store.vertex_count(level)     # int
store.bin_shape_at(level)     # tuple
store.bin_ratio_at(level)     # tuple
store.object_count_at(level)  # int (discrete-object types)

# Data reads
store.read(level, bbox, target_resolution, attributes)
store.iter_chunks(level, bbox, prefetch, n_workers)
store.raw_array(array_path, level)

# Utilities
store.close()
store.__enter__() / store.__exit__()   # context manager

Using as a context manager

with open_zv("scan.zarrvectors") as store:
    result = store.read(level=2)
# Store is closed and cache is freed on exit

Common patterns

Thumbnail generation

Load the coarsest level for a quick full-volume thumbnail:

store    = open_zv("scan.zarrvectors")
coarsest = store.levels[-1]
result   = store.read(level=coarsest)
# Use result["positions"] to render a low-density overview

Memory-bounded streaming

Process a store in chunks, keeping peak memory under a target:

MEMORY_LIMIT_BYTES = 1 * 1024**3   # 1 GB

chunk_bytes   = store.vertex_count(level=0) / len(list(store.iter_chunks(level=0))) * 12
chunks_at_once = int(MEMORY_LIMIT_BYTES / chunk_bytes)

batch = []
for i, (coord, data) in enumerate(store.iter_chunks(level=0)):
    batch.append(data["positions"])
    if len(batch) >= chunks_at_once:
        process_batch(np.concatenate(batch))
        batch.clear()

if batch:
    process_batch(np.concatenate(batch))