Rechunking

Terms

Rechunking

The process of rewriting an existing ZVF store with a different chunk_shape (and optionally a different bin_shape). Rechunking produces a new store; the original is not modified in place unless explicitly requested.

In-place rechunking

Rechunking that overwrites the source store. Supported by zarr_vectors.rechunk.rechunk(path, spec) when output is left as None: the engine writes to a sibling <name>.rechunked directory, renames the source to <name>.backup, moves the new store into place, and deletes the backup. The source path identifier is preserved across the swap.
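The swap sequence described above can be sketched with the standard library. This is only an illustration of the rename order, not the zarr_vectors implementation: swap_in_place and the write_new_store callback are hypothetical names introduced here.

```python
import shutil
from pathlib import Path
from typing import Callable


def swap_in_place(source: Path, write_new_store: Callable[[Path], None]) -> None:
    """Replace `source` with a rewritten copy staged in a sibling directory.

    Hypothetical helper: `write_new_store(dest)` stands in for whatever
    actually writes the rechunked store to `dest`.
    """
    staged = source.with_name(source.name + ".rechunked")
    backup = source.with_name(source.name + ".backup")

    write_new_store(staged)   # 1. write the new store next to the source
    source.rename(backup)     # 2. move the original out of the way
    staged.rename(source)     # 3. move the new store into place
    shutil.rmtree(backup)     # 4. delete the backup
```

In this sketch the backup is deleted only after the new store occupies the original path, so a failure before step 4 still leaves a complete copy of the original data on disk.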

Out-of-place rechunking

Rechunking that writes to a new destination path. The original store is preserved. This is the default behaviour.

Chunk boundary crossing

When rechunking from a fine chunk_shape to a coarser one, vertices that were in separate source chunks may end up in the same destination chunk. When rechunking from a coarse chunk_shape to a finer one, vertices that were in the same source chunk may end up in different destination chunks.
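A small worked example may make this concrete. It assumes chunk coordinates are obtained by floor-dividing each position component by the chunk extent; the chunk_coord helper and the sample positions are made up for illustration.

```python
import math


def chunk_coord(p, chunk_shape):
    """Chunk grid coordinate of position p (floor division per axis)."""
    return tuple(math.floor(x / s) for x, s in zip(p, chunk_shape))


a = (150.0, 10.0, 10.0)
b = (250.0, 10.0, 10.0)

fine = (200.0, 200.0, 200.0)
coarse = (500.0, 500.0, 500.0)

fine_a, fine_b = chunk_coord(a, fine), chunk_coord(b, fine)          # (0, 0, 0) and (1, 0, 0)
coarse_a, coarse_b = chunk_coord(a, coarse), chunk_coord(b, coarse)  # both (0, 0, 0)
```

The two vertices sit in neighbouring chunks under the fine grid but share a chunk under the coarse grid; reading the example in the other direction shows the coarse-to-fine case.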

Cross-chunk link invalidation

When rechunking a polyline or streamline store, all cross_chunk_links must be recomputed, because the chunk boundaries change.


Introduction

Because chunk_shape controls the physical file layout, it cannot be modified without rewriting the underlying data. Rechunking is the operation that achieves this. It reads all vertices from the source store (chunk by chunk), re-assigns each vertex to its chunk in the destination layout, and writes the result.

Rechunking is relatively rare in practice: most users choose chunk_shape once before writing and do not change it. The most common scenario is discovering, after writing a large dataset, that the chosen chunk_shape is poorly matched to the actual query access pattern.

Rechunking is also necessary when importing data from a different format (e.g. TRK or LAS) that has its own chunking scheme, and when building a ZVF store that will be served from cloud storage where chunk size requirements differ from those for local access.


Technical reference

Rechunking API

from zarr_vectors.core.rechunk import rechunk_store

rechunk_store(
    source="scan.zarrvectors",
    dest="scan_rechunked.zarrvectors",
    chunk_shape=(500.0, 500.0, 500.0),   # new chunk shape
    bin_shape=(100.0, 100.0, 100.0),     # new bin shape (optional)
)

If bin_shape is omitted, the new bin shape defaults to chunk_shape / 4 per axis (isotropic), or the closest valid divisor.

What rechunking does

Rechunking proceeds in three phases:

Phase 1 — Vertex repartitioning. For each vertex in the source store, compute its destination chunk coordinate in the new chunk grid:

import math

# p: vertex position; new_chunk_shape: destination chunk extent; D: number of axes
new_chunk_coord = tuple(
    int(math.floor(p[d] / new_chunk_shape[d])) for d in range(D)
)

Vertices are buffered per destination chunk. Once all vertices for a destination chunk have been collected, they are sorted into fragment order (by new bin coordinate) and the chunk is written.
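The buffering and sorting step can be sketched in plain Python as follows. This is a minimal in-memory sketch, not the library's code: repartition is a hypothetical name, and it assumes that fragment order means lexicographic order of the bin coordinate.

```python
import math
from collections import defaultdict


def repartition(points, chunk_shape, bin_shape):
    """Group points by destination chunk, sorted by bin coordinate within each chunk."""
    buffers = defaultdict(list)
    for p in points:
        chunk = tuple(math.floor(x / c) for x, c in zip(p, chunk_shape))
        buffers[chunk].append(p)

    def bin_coord(p):
        # Assumption: fragments are ordered lexicographically by bin coordinate.
        return tuple(math.floor(x / b) for x, b in zip(p, bin_shape))

    return {chunk: sorted(pts, key=bin_coord) for chunk, pts in buffers.items()}


points = [(450.0, 0.0, 0.0), (50.0, 0.0, 0.0), (600.0, 0.0, 0.0), (150.0, 0.0, 0.0)]
chunks = repartition(points, chunk_shape=(500.0,) * 3, bin_shape=(100.0,) * 3)
```

Here three of the four points land in chunk (0, 0, 0), ordered by bin, and the fourth lands in chunk (1, 0, 0). A real implementation would write each buffer out as soon as it is complete rather than holding every chunk in a dictionary.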

Phase 2 — Attribute and link re-assignment. Per-vertex attributes are re-ordered to match the new vertex ordering. For polyline and streamline stores, links/<delta> arrays are also recomputed: within-chunk edges are reconstructed from the new vertex ordering; cross-chunk edges are identified and written to cross_chunk_links.
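To illustrate why links have to be recomputed, the sketch below classifies edges as within-chunk or cross-chunk for a given chunk grid. It is a simplification under stated assumptions: split_edges is a hypothetical helper, edges are given as pairs of global vertex ids, and the conversion to chunk-local indices that a real store would need is omitted.

```python
import math


def split_edges(positions, edges, chunk_shape):
    """Split edges into within-chunk and cross-chunk lists.

    positions: list of coordinate tuples, indexed by global vertex id.
    edges: iterable of (u, v) global vertex-id pairs.
    """
    def chunk_of(i):
        return tuple(math.floor(x / c) for x, c in zip(positions[i], chunk_shape))

    within, cross = [], []
    for u, v in edges:
        cu, cv = chunk_of(u), chunk_of(v)
        if cu == cv:
            within.append((cu, u, v))          # both endpoints in one chunk
        else:
            cross.append(((cu, u), (cv, v)))   # endpoints in different chunks
    return within, cross


positions = [(100.0, 0.0, 0.0), (300.0, 0.0, 0.0), (600.0, 0.0, 0.0)]
edges = [(0, 1), (1, 2)]

within_500, cross_500 = split_edges(positions, edges, (500.0,) * 3)
within_200, cross_200 = split_edges(positions, edges, (200.0,) * 3)
```

With 500-unit chunks the first edge is internal to chunk (0, 0, 0) and the second crosses a boundary; with 200-unit chunks both edges cross boundaries. The same polyline therefore yields different within-chunk and cross-chunk sets under each layout.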

Phase 3 — Object index rebuild. For stores with an object_index, the index is rebuilt from scratch against the new chunk layout.

Rechunking and resolution levels

Rechunking operates on a single resolution level. To rechunk a multi-resolution store, rechunk each level separately:

from zarr_vectors.core.rechunk import rechunk_store

for level in [0, 1, 2]:
    rechunk_store(
        source=f"scan.zarrvectors/resolution_{level}",
        dest=f"scan_rechunked.zarrvectors/resolution_{level}",
        chunk_shape=(500.0, 500.0, 500.0),
        bin_shape=(100.0, 100.0, 100.0),
        level=level,
        source_root="scan.zarrvectors",       # for metadata
        dest_root="scan_rechunked.zarrvectors",
    )

Alternatively, rechunk only the base level and rebuild coarser levels:

from zarr_vectors.core.rechunk import rechunk_store
from zarr_vectors.multiresolution.coarsen import build_pyramid

rechunk_store("scan.zarrvectors", "scan_rechunked.zarrvectors",
              chunk_shape=(500.0, 500.0, 500.0), levels=[0])

build_pyramid("scan_rechunked.zarrvectors", factors=[(2.0, 1.00), (4.0, 2.00)])

Memory usage

Rechunking requires buffering all vertices that will land in a single destination chunk. For a store with uniform vertex density and isotropic rechunking from chunk_shape=(200,…) to chunk_shape=(500,…), each destination chunk receives approximately (500/200)³ ≈ 15.6× as many vertices as the average source chunk. Peak memory usage scales with (new_chunk_volume / old_chunk_volume) × avg_vertices_per_source_chunk.
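The volume ratio is easy to check numerically. The helper below is a hypothetical back-of-the-envelope estimator, not part of the library, and it assumes uniform vertex density as in the paragraph above.

```python
import math


def buffer_ratio(old_chunk_shape, new_chunk_shape):
    """Destination-to-source chunk volume ratio."""
    return math.prod(new_chunk_shape) / math.prod(old_chunk_shape)


ratio = buffer_ratio((200.0,) * 3, (500.0,) * 3)        # 15.625
large_ratio = buffer_ratio((200.0,) * 3, (2000.0,) * 3)  # 1000.0, i.e. 10x per axis
```

Multiplying the ratio by the average vertex count per source chunk gives a rough estimate of how many vertices must be buffered for one destination chunk.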

For very large rechunking ratios (e.g. 10× per axis), use the streaming rechunker, which writes destination chunks as they fill rather than buffering all source data:

rechunk_store(
    source="scan.zarrvectors",
    dest="scan_rechunked.zarrvectors",
    chunk_shape=(2000.0, 2000.0, 2000.0),
    streaming=True,          # lower memory; slower due to multiple source passes
)

CLI usage

The zarr-vectors rechunk CLI subcommand lives in the companion package zarr-vectors-tools.

What is preserved vs recomputed

| Array | Preserved? | Notes |
|---|---|---|
| vertices/ values | Yes | Same positions, different chunk assignment |
| vertex_fragments/ | Recomputed | Fragment partition changes with new bin grid |
| link_fragments/ | Recomputed | Parallel to vertex_fragments/ for links/0/ |
| links/<delta>/ | Recomputed | Vertex indices are local to chunks |
| cross_chunk_links/ | Recomputed | Chunk boundaries change |
| attributes/ values | Yes | Reordered to match new vertex ordering |
| object_index/ | Recomputed | Chunk coordinates and fragment indices change |
| object_attributes/ | Preserved | Per-object, not per-chunk |
| groupings/ | Preserved | Per-group, not per-chunk |
| Root .zattrs | Updated | chunk_shape and base_bin_shape updated |
| Per-level .zattrs | Updated | bin_shape updated; bin_ratio unchanged |

Rechunking only bin_shape

If only bin_shape changes (i.e. chunk_shape stays the same), rechunking can be performed more cheaply because every vertex keeps its chunk assignment. Only the fragment index needs to be recomputed:

from zarr_vectors.core.rechunk import rebin_store

rebin_store(
    "scan.zarrvectors",
    new_bin_shape=(25.0, 25.0, 25.0),   # finer bins within the same chunks
)

rebin_store is an in-place operation (it rewrites vertex_fragments/ and re-sorts vertices within each chunk). Because the fragment indices within a chunk change, the per-object manifests in object_index/ must also be rewritten to reference the new fragment numbering. Chunk file paths and cross_chunk_links/ are unaffected.
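The per-chunk work can be pictured with the sketch below, which re-sorts one chunk's vertices by their new bin coordinate and derives the fragment boundaries. It is an illustration under assumptions, not the rebin_store implementation: rebin_chunk is a hypothetical name, and representing fragments as (bin_coord, start, stop) slices is an assumed layout rather than the on-disk format of vertex_fragments/.

```python
import math
from itertools import groupby


def rebin_chunk(points, new_bin_shape):
    """Re-sort one chunk's vertices by bin and return (sorted_points, fragments).

    fragments is a list of (bin_coord, start, stop) slices into sorted_points.
    """
    def bin_coord(p):
        return tuple(math.floor(x / b) for x, b in zip(p, new_bin_shape))

    ordered = sorted(points, key=bin_coord)   # stable sort keeps order within a bin
    fragments, start = [], 0
    for coord, group in groupby(ordered, key=bin_coord):
        count = sum(1 for _ in group)
        fragments.append((coord, start, start + count))
        start += count
    return ordered, fragments


chunk_points = [(30.0, 0.0, 0.0), (10.0, 0.0, 0.0), (60.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
ordered, fragments = rebin_chunk(chunk_points, new_bin_shape=(25.0, 25.0, 25.0))
```

Because the fragment numbering follows from the sorted order, anything that refers to fragments by index has to be rewritten whenever the bin grid changes.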

Validation after rechunking

Run the full L5 validator after rechunking to confirm the new store is consistent:

from zarr_vectors.validate import validate
result = validate("scan_rechunked.zarrvectors", level=5)

Pay particular attention to L3 checks (fragment offset consistency) and L4 checks (cross-chunk link validity), as these are most likely to expose bugs in the rechunking implementation.