Validation and repair¶
The zarr-vectors validator checks ZVF stores for conformance at five
levels of increasing thoroughness. This tutorial covers running validation,
interpreting results, and repairing the most common failure modes.
For the complete check catalogue by level, see Validation overview, L1 structural, L2 metadata, and L3 consistency.
Running validation¶
The zarr-vectors CLI (with validate and info subcommands) lives in
the companion package zarr-vectors-tools. The Python API shown
below is part of this core package.
Python API¶
from zarr_vectors.validate import validate
result = validate("scan.zarrvectors", level=5)
# One-line status
print(result.summary())
# Level 5 validation: PASS — 54 passed, 0 warnings, 0 errors
# Full report (all checks listed)
print(result.report())
# Programmatic access
print(result.is_valid) # bool
print(len(result.errors)) # int
print(len(result.warnings)) # int
for err in result.errors:
    print(f"[L{err.level}] {err.check}: {err.message}")
    print(f" at: {err.path}")
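When a report contains many errors, it can help to group them by level so that structural (L1) problems are fixed before consistency (L3) problems. The sketch below assumes only the four attributes used above (level, check, message, path). The Issue class is a hypothetical stand-in for the validator's error entries, and the sample values are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Issue:
    """Hypothetical stand-in for one validator error entry."""
    level: int
    check: str
    message: str
    path: str


def group_by_level(issues):
    """Return {level: [issue, ...]} with levels in ascending order."""
    grouped = defaultdict(list)
    for issue in issues:
        grouped[issue.level].append(issue)
    return dict(sorted(grouped.items()))


# Illustrative entries only; real entries come from result.errors.
issues = [
    Issue(3, "frag_range_in_bounds", "range fragment out of bounds", "0/vertex_fragments"),
    Issue(1, "object_index", "missing", "0/object_index/"),
    Issue(3, "attr_length_matches", "length mismatch", "0/attributes/intensity"),
]
for level, group in group_by_level(issues).items():
    print(f"L{level}: {', '.join(issue.check for issue in group)}")
```

The same function should work on result.errors directly, provided the entries expose a level attribute as shown in the loop above.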
Choosing a validation level¶
| Situation | Recommended level |
|---|---|
| Quick structural check (CI, file open) | 1 |
| After writing a new store | 3 |
| After ingest from external format | 3 |
| After rechunking | 3 |
| Before publishing / sharing a dataset | 5 |
| Nightly CI on reference fixtures | 5 |
| Large store (> 100 GB), quick sanity check | 2 with |
Level 3 reads all array data and is the minimum recommended for any store that will be shared or used in analysis. Level 5 additionally checks multi-resolution pyramid correctness.
Interpreting common errors¶
L1 errors¶
cross_chunk_links missing
ERROR [L1] cross_chunk_links 0/cross_chunk_links/ missing;
required for streamline type
The store’s geometry type requires cross_chunk_links/ but the array
is absent. This typically means the store was written with an older version
of zarr-vectors-py that did not generate cross-chunk links, or was
written by a third-party tool that omitted the array.
Repair: regenerate cross-chunk links from the existing vertex data:
from zarr_vectors.repair import rebuild_cross_chunk_links
rebuild_cross_chunk_links("tracts.zarrvectors", level=0)
object_index missing
ERROR [L1] object_index 0/object_index/ missing
Repair: rebuild the object index from vertices and edges:
from zarr_vectors.repair import rebuild_object_index
rebuild_object_index("tracts.zarrvectors", level=0)
L2 errors¶
divisibility
ERROR [L2] divisibility [d=2] chunk_shape[2]=200.0, bin_shape[2]=60.0
200.0 % 60.0 = 20.0 ≠ 0
The bin_shape does not evenly divide chunk_shape. This cannot be
repaired in-place — the store must be rechunked with a valid bin_shape:
from zarr_vectors.core.rechunk import rebin_store
# Change bin_shape to something that divides chunk_shape
rebin_store("scan.zarrvectors", new_bin_shape=(50., 50., 50.))
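Before rechunking, a candidate bin_shape can be checked against chunk_shape with plain arithmetic. This is a minimal sketch of the divisibility rule, not the validator's own implementation. The function name and the tolerance are assumptions, and only axis 2 of the shapes comes from the error message above.

```python
import math


def divisibility_failures(chunk_shape, bin_shape, tol=1e-9):
    """Return [(axis, remainder), ...] for axes where bin_shape does not divide chunk_shape.

    Assumes all sizes are positive.
    """
    failures = []
    for axis, (chunk, bin_size) in enumerate(zip(chunk_shape, bin_shape)):
        remainder = math.fmod(chunk, bin_size)
        # A remainder within tol of 0 or of bin_size counts as exact division,
        # which absorbs floating-point noise such as fmod(1.0, 0.1).
        if min(remainder, bin_size - remainder) > tol:
            failures.append((axis, remainder))
    return failures


chunk_shape = (200.0, 200.0, 200.0)  # axes 0 and 1 are assumed for illustration
print(divisibility_failures(chunk_shape, (50.0, 50.0, 60.0)))  # axis 2 fails
print(divisibility_failures(chunk_shape, (50.0, 50.0, 50.0)))  # no failures
```

An empty list means the candidate shape is safe to pass as new_bin_shape.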
bin_shape_inconsistent
ERROR [L2] bin_shape_inconsistent [level=1]
bin_shape [100,100,80] ≠ base [50,50,50] × ratio [2,2,2] = [100,100,100]
The bin_shape declared in the per-level .zattrs does not match
base_bin_shape × bin_ratio. Usually caused by a manual edit to .zattrs.
Repair: recompute and overwrite the per-level bin_shape:
from zarr_vectors.core.store import open_store
import numpy as np
root = open_store("scan.zarrvectors", mode="r+")
base = np.array(root.attrs["base_bin_shape"])
for level_group in root.values():
    if hasattr(level_group, "attrs") and "bin_ratio" in level_group.attrs:
        ratio = np.array(level_group.attrs["bin_ratio"])
        level_group.attrs["bin_shape"] = (base * ratio).tolist()
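The rule behind this check is an element-wise product. The sketch below reproduces the arithmetic from the error message without touching a store. The function names are hypothetical.

```python
def expected_bin_shape(base_bin_shape, bin_ratio):
    """Element-wise product base_bin_shape x bin_ratio."""
    return [base * ratio for base, ratio in zip(base_bin_shape, bin_ratio)]


def mismatched_axes(declared, base_bin_shape, bin_ratio):
    """Return the axes on which the declared bin_shape differs from the expected one."""
    expected = expected_bin_shape(base_bin_shape, bin_ratio)
    return [axis for axis, (d, e) in enumerate(zip(declared, expected)) if d != e]


print(expected_bin_shape([50, 50, 50], [2, 2, 2]))
print(mismatched_axes([100, 100, 80], [50, 50, 50], [2, 2, 2]))
```

For the values in the error above, only axis 2 is reported as mismatched (80 declared, 100 expected).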
levels_match_groups / level_0_present
ERROR [L2] levels_match_groups multiscales entry for 2
references non-existent group
The multiscales metadata references a level group that does not exist.
Repair: regenerate the multiscale metadata:
from zarr_vectors.core.multiscale import write_multiscale_metadata
from zarr_vectors.core.store import open_store
root = open_store("scan.zarrvectors", mode="r+")
write_multiscale_metadata(root)
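Both checks reduce to comparing two lists of names: the levels the metadata references and the groups that actually exist. The sketch below assumes level groups are named by their level number as strings ("0", "1", ...), which matches the paths in the error messages on this page. The function names are hypothetical.

```python
def missing_level_groups(referenced_levels, existing_groups):
    """Return referenced level names that have no matching group, in metadata order."""
    existing = set(existing_groups)
    return [name for name in referenced_levels if name not in existing]


def level_0_present(existing_groups):
    """True if the full-resolution level group exists."""
    return "0" in set(existing_groups)


referenced = ["0", "1", "2"]   # levels listed in the multiscales metadata
existing = ["0", "1"]          # groups found in the store
print(missing_level_groups(referenced, existing))
print(level_0_present(existing))
```

If the missing level is supposed to exist, regenerating the metadata only hides the problem. In that case the level itself has to be rebuilt.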
L3 errors¶
frag_range_in_bounds
ERROR [L3] frag_range_in_bounds [chunk (2,3,1)]
range fragment 7: start+count = 4200 > vertex_count = 4092
A range fragment’s [start, start + count) extends past the chunk’s
vertex count. This indicates a bug in the writer — the most common cause
is reordering vertices without re-encoding the fragment index.
Repair: rebuild the fragment index by re-sorting vertices and recomputing fragments:
from zarr_vectors.repair import rebuild_vg_index
rebuild_vg_index("scan.zarrvectors", level=0)
# Reads vertices, re-sorts into bin order, rewrites vertex_fragments
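The bounds rule itself is simple to state in code. The following sketch assumes range fragments are available as (start, count) pairs for one chunk. That layout is an assumption made for illustration, not the on-disk encoding.

```python
def out_of_bounds_fragments(fragments, vertex_count):
    """Return [(fragment_index, end), ...] for fragments outside [0, vertex_count]."""
    bad = []
    for index, (start, count) in enumerate(fragments):
        end = start + count
        if start < 0 or count < 0 or end > vertex_count:
            bad.append((index, end))
    return bad


# Eight fragments of 525 vertices each; the last one ends at 4200.
fragments = [(i * 525, 525) for i in range(8)]
print(out_of_bounds_fragments(fragments, vertex_count=4092))
```

With vertex_count=4092 only fragment 7 is reported, mirroring the error message above.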
ccl_different_chunks
ERROR [L3] ccl_different_chunks 2 cross-chunk links found where
src chunk == dst chunk (rows 14502, 87331)
Both endpoints of these cross-chunk links lie in the same chunk, so they
should have been stored as intra-chunk edges in links/<delta>/. This is
caused by link generation logic that triggers on bin boundaries instead
of chunk boundaries.
Repair: regenerate all cross-chunk links from scratch:
from zarr_vectors.repair import rebuild_cross_chunk_links
rebuild_cross_chunk_links("tracts.zarrvectors", level=0)
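The difference between a bin boundary and a chunk boundary can be shown with two pairs of points. This sketch assumes a grid anchored at the origin and a cell index computed as floor(position / cell_shape). Both are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def cell_index(position, cell_shape):
    """Integer grid cell containing position, for a grid anchored at the origin."""
    return tuple(math.floor(p / s) for p, s in zip(position, cell_shape))


def is_cross_chunk(src, dst, chunk_shape):
    """True only if the two endpoints fall in different chunks."""
    return cell_index(src, chunk_shape) != cell_index(dst, chunk_shape)


chunk_shape = (200.0, 200.0, 200.0)
bin_shape = (50.0, 50.0, 50.0)

a, b = (40.0, 10.0, 10.0), (60.0, 10.0, 10.0)    # cross a bin boundary at x=50
c, d = (190.0, 10.0, 10.0), (210.0, 10.0, 10.0)  # cross a chunk boundary at x=200

print(is_cross_chunk(a, b, chunk_shape))  # same chunk: intra-chunk edge
print(is_cross_chunk(c, d, chunk_shape))  # different chunks: cross-chunk link
```

A writer that compares bin indices instead of chunk indices would classify the edge between a and b as a cross-chunk link, which is exactly the failure this check reports.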
attr_length_matches
ERROR [L3] attr_length_matches [chunk (1,0,2), attr "intensity"]
attr_length=3800 ≠ vertex_count=4200
A per-vertex attribute array has the wrong length in a specific chunk. This usually means the writer did not keep the attribute array in step with the vertices when they were sorted into VG order, which is a writer bug.
Repair: re-ingest the data from the original source, or use the repair function if vertex order can be recovered:
from zarr_vectors.repair import realign_attribute
# Re-sort the attribute array to match the current vertex VG order
realign_attribute("scan.zarrvectors", attribute_name="intensity", level=0)
# WARNING: This assumes vertices are already in correct VG order.
# If vertex order is also wrong, rebuild_vg_index must run first.
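A writer avoids this class of error by computing one permutation and applying it to the vertices and to every attribute. The sketch below shows that pattern on plain lists. It is not what realign_attribute does internally, and the function name and sort keys are hypothetical.

```python
def sort_with_attributes(keys, vertices, attributes):
    """Sort vertices by key and apply the same permutation to every attribute."""
    if len(keys) != len(vertices):
        raise ValueError("one sort key is required per vertex")
    for name, values in attributes.items():
        if len(values) != len(vertices):
            raise ValueError(
                f"attribute {name!r}: length {len(values)} != vertex count {len(vertices)}"
            )
    # A single permutation, reused for vertices and attributes alike.
    order = sorted(range(len(vertices)), key=keys.__getitem__)
    sorted_vertices = [vertices[i] for i in order]
    sorted_attributes = {
        name: [values[i] for i in order] for name, values in attributes.items()
    }
    return sorted_vertices, sorted_attributes


verts, attrs = sort_with_attributes(
    keys=[2, 0, 1],                      # e.g. a bin index per vertex
    vertices=["a", "b", "c"],
    attributes={"intensity": [10, 20, 30]},
)
print(verts, attrs)
```

Checking the lengths before sorting turns a silent misalignment into an immediate error at write time.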
obj_index_nonempty_vg
ERROR [L3] obj_index_nonempty_vg object 1042 primary VG at
(chunk=8843, bin=12) has count=0 (empty VG)
The object index points to an empty VG. This usually means the object’s vertices were moved by a rechunking operation that did not update the object index.
Repair: rebuild the object index:
from zarr_vectors.repair import rebuild_object_index
rebuild_object_index("tracts.zarrvectors", level=0)
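Conceptually, the check looks up each object's primary VG and confirms that it holds at least one vertex. The sketch below models the object index and the VG counts as plain dictionaries keyed by (chunk, bin). This layout is hypothetical and chosen only to illustrate the rule.

```python
def objects_with_empty_vg(object_index, vg_counts):
    """Return object ids whose primary VG is missing or has a count of 0."""
    return [
        object_id
        for object_id, vg in sorted(object_index.items())
        if vg_counts.get(vg, 0) == 0
    ]


object_index = {1041: (8843, 11), 1042: (8843, 12)}   # object id -> (chunk, bin)
vg_counts = {(8843, 11): 37, (8843, 12): 0}           # (chunk, bin) -> vertex count
print(objects_with_empty_vg(object_index, vg_counts))
```

With these values only object 1042 is reported, as in the error message above.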
Validation after repair¶
Always re-run the validator at the same or higher level after any repair:
from zarr_vectors.validate import validate
result = validate("tracts.zarrvectors", level=3)
assert result.is_valid, result.report()
print("Store is valid after repair.")
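One way to make the re-validation step hard to forget is to wrap repair and validation in a single helper. The sketch below takes the repair functions and the validator as arguments so that it runs without a store. The helper and both stand-ins are hypothetical, and only is_valid and report() are assumed on the result object.

```python
from types import SimpleNamespace


def repair_then_validate(path, repairs, validate_fn, level):
    """Run each repair on the store, then validate; raise if it is still invalid."""
    for repair in repairs:
        repair(path)
    result = validate_fn(path, level=level)
    if not result.is_valid:
        raise RuntimeError(result.report())
    return result


# Stand-ins so the sketch runs without a store; both are hypothetical.
applied = []


def fake_repair(path):
    applied.append(path)


def fake_validate(path, level):
    # Reports success only once the repair has been applied to this path.
    ok = path in applied
    return SimpleNamespace(
        is_valid=ok,
        report=lambda: f"{path}: still invalid at level {level}",
    )


result = repair_then_validate("tracts.zarrvectors", [fake_repair], fake_validate, level=3)
print(result.is_valid)
```

With the real API, repairs could be a list such as [lambda p: rebuild_object_index(p, level=0)] and validate_fn the validate function imported above.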
Sampled validation for large stores¶
Full L3 validation on stores > 100 GB can take tens of minutes. For routine health checks, sample a fraction of chunks:
from zarr_vectors.validate import validate
result = validate(
    "large_scan.zarrvectors",
    level=3,
    sample_fraction=0.05,  # validate 5% of chunks, chosen randomly
    seed=42,
)
print(result.summary())
# Level 3 validation (sampled 5%): PASS — 38 passed, 0 warnings, 0 errors
# NOTE: sampled validation may miss errors in unsampled chunks
Sampled validation is never a substitute for full validation before publishing a dataset. Use it for fast incremental checks during development.
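To make the trade-off concrete, here is a minimal sketch of seeded chunk sampling. How the validator actually selects chunks is not documented on this page, so the function below is an assumption that only illustrates the idea of a reproducible random subset.

```python
import random


def sample_chunks(chunk_keys, fraction, seed=None):
    """Return a reproducible random subset containing about `fraction` of the chunks."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keys = sorted(chunk_keys)  # fixed order so the seed fully determines the sample
    if not keys:
        return []
    count = max(1, round(fraction * len(keys)))
    return sorted(random.Random(seed).sample(keys, count))


chunk_keys = [(x, y, z) for x in range(10) for y in range(10) for z in range(10)]
picked = sample_chunks(chunk_keys, fraction=0.05, seed=42)
print(len(picked), "of", len(chunk_keys), "chunks selected")
```

Under this scheme a single corrupt chunk lands in the sample with probability equal to the sampled fraction, so a 5% sample misses it about 95% of the time. Sampling is good at catching systematic writer bugs that affect many chunks, and poor at catching isolated corruption.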
Automated validation in CI¶
Use the Python API in a small driver script if you want a CI step, or
invoke zarr-vectors validate from zarr-vectors-tools if it is
already installed in the CI environment.
As a parametrized pytest test over the reference fixture stores:
from pathlib import Path

import pytest
from zarr_vectors.validate import validate

FIXTURES = list((Path("tests") / "fixtures").glob("*/store.zarrvectors"))

@pytest.mark.parametrize("store_path", FIXTURES, ids=lambda p: p.parent.name)
@pytest.mark.slow
def test_fixture_passes_l5(store_path):
    result = validate(str(store_path), level=5)
    assert result.is_valid, result.report()
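For a CI step outside pytest, the driver script mentioned above mostly needs to turn the validation result into a process exit code. The sketch below shows one possible shape. The function name is hypothetical, the validator is passed in as an argument so the example runs on its own, and only summary(), report() and is_valid are assumed on the result.

```python
from types import SimpleNamespace


def run_validation(validate_fn, path, level=3):
    """Validate one store and return a process exit code (0 = valid, 1 = invalid)."""
    result = validate_fn(path, level=level)
    print(result.summary())
    if not result.is_valid:
        # Print the full report only on failure to keep CI logs short.
        print(result.report())
        return 1
    return 0


def stub_validate(path, level):
    # Hypothetical stand-in: treats any path containing "bad" as invalid.
    ok = "bad" not in path
    return SimpleNamespace(
        is_valid=ok,
        summary=lambda: f"Level {level} validation: {'PASS' if ok else 'FAIL'}",
        report=lambda: f"{path}: 1 error",
    )


print(run_validation(stub_validate, "good.zarrvectors", level=5))
print(run_validation(stub_validate, "bad.zarrvectors", level=5))
```

In a real driver, validate_fn would be validate from zarr_vectors.validate, and the return value would be passed to sys.exit so the CI job fails when the store is invalid.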