Memory-efficient writes¶
zarr-vectors-py write functions buffer each chunk in memory before
writing it to disk. For large datasets, all vertices in a chunk must be
in memory simultaneously. This guide describes strategies for writing
datasets larger than available RAM.
Why writes are memory-bound¶
The fragment index requires all vertices in a chunk to be sorted by bin coordinate before writing. This means a full chunk worth of vertices (positions + attributes) must be in memory at once.
For a chunk with 100 000 vertices and 5 float32 attributes:
100 000 × (3 + 5) × 4 bytes ≈ 3.2 MB per chunk. For 125 chunks this
is 400 MB — manageable. For 10 000 chunks with 1M vertices each, this
approach requires partitioning.
Strategy 1: Write in spatial partitions¶
The most memory-efficient approach for very large datasets. Divide the
data into spatial partitions, each covering one or more full chunks, and
write one partition at a time using write_points_partition:
import numpy as np
from zarr_vectors.types.points import write_points_partition
# Large dataset: 500M vertices in a 10000³ µm volume
# Partition by z-slabs, each covering 500 µm (one chunk deep in z)
chunk_shape = (500., 500., 500.)
slab_depth = 500.0 # one chunk deep in z per partition
# Process one z-slab at a time
for z_idx in range(20): # 20 slabs × 500 µm = 10 000 µm
z_min = z_idx * slab_depth
z_max = z_min + slab_depth
# Load only this z-slab from disk/database
slab_positions = load_positions_in_range(z_min, z_max)
slab_intensity = load_intensity_in_range(z_min, z_max)
write_points_partition(
"large_scan.zarrvectors",
slab_positions,
chunk_shape=chunk_shape,
bin_shape=(125., 125., 125.),
attributes={"intensity": slab_intensity},
z_range=(z_min, z_max), # restricts which chunks are written
)
del slab_positions, slab_intensity # free memory
# Finalise metadata and build object index
from zarr_vectors.core.store import open_store
from zarr_vectors.core.multiscale import write_multiscale_metadata
root = open_store("large_scan.zarrvectors", mode="r+")
write_multiscale_metadata(root)
Chunk-aligned partitions¶
For maximum efficiency, align partitions to chunk boundaries so that each partition writes complete chunks without partial-chunk merging:
# For chunk_shape=(500, 500, 500), write one x-y tile at a time
for x_idx in range(n_x_chunks):
for y_idx in range(n_y_chunks):
x_lo, x_hi = x_idx * 500, (x_idx + 1) * 500
y_lo, y_hi = y_idx * 500, (y_idx + 1) * 500
tile = load_tile(x_lo, x_hi, y_lo, y_hi)
write_points_partition(
"scan.zarrvectors", tile.positions,
chunk_shape=(500., 500., 500.),
attributes=tile.attributes,
xy_tile=(x_lo, y_lo),
)
Strategy 2: Use the streaming writer¶
The StreamingPointWriter class accumulates vertices in a rolling
per-chunk buffer, flushing completed chunks to disk as they fill up.
Peak memory usage is bounded by the number of chunks in the internal
buffer multiplied by the expected vertices per chunk:
from zarr_vectors.types.points import StreamingPointWriter
# Peak memory ≈ buffer_chunks × avg_vertices/chunk × bytes_per_vertex
# At buffer_chunks=8, 50k vertices/chunk, 32 bytes/vertex: ~13 MB
writer = StreamingPointWriter(
"scan.zarrvectors",
chunk_shape=(500., 500., 500.),
bin_shape=(125., 125., 125.),
attribute_names=["intensity", "label"],
buffer_chunks=8, # keep at most 8 in-flight chunks in memory
)
# Feed data in arbitrary-sized batches
for batch in data_source: # any iterable yielding (positions, attributes)
writer.write_batch(batch["positions"], batch["attributes"])
# Flush all remaining buffered chunks and write metadata
writer.close()
The streaming writer is slower than the bulk writer (because it cannot exploit batch-level sorting optimisations) but has bounded memory usage regardless of dataset size.
Strategy 3: Generator-based ingest¶
For formats read by external libraries (LAS, TRX, etc.), the companion
package zarr-vectors-tools provides streaming converters (e.g.
ingest_las_streaming) that yield vertex batches without loading the
full file. The streaming converters call the same chunked-writer
primitives shown in Strategy 1 above.
Strategy 4: Rechunk after initial write¶
If you already have data in a format that can be read block-by-block (e.g. a Zarr array, HDF5, or numpy memmap), write it with large chunk_shape first (entire dataset in one chunk per axis if needed), then rechunk to the target shape using the streaming rechunker:
from zarr_vectors.core.rechunk import rechunk_store
# Initial write: no spatial chunking (entire volume is one big chunk)
write_points("scan_temp.zarrvectors", huge_array, chunk_shape=(10000., 10000., 10000.))
# Rechunk to target: streaming, bounded memory
rechunk_store(
"scan_temp.zarrvectors",
"scan.zarrvectors",
chunk_shape=(500., 500., 500.),
bin_shape=(125., 125., 125.),
streaming=True,
buffer_chunks=16,
)
This is the least memory-efficient approach but requires the least modification to existing pipelines.
Monitoring memory usage¶
Use tracemalloc or memory_profiler to identify peak usage:
import tracemalloc
tracemalloc.start()
write_points("scan.zarrvectors", positions, chunk_shape=(500., 500., 500.))
current, peak = tracemalloc.get_traced_memory()
print(f"Peak memory: {peak / 1e6:.1f} MB")
tracemalloc.stop()