Codec pipeline¶
Terms¶
- Codec
A single transformation applied to chunk data. A codec takes bytes (or an array) as input and produces bytes as output. Codecs are composable; an ordered sequence of codecs forms a pipeline.
- Array-to-bytes codec
The first codec in a Zarr v3 pipeline, which converts a typed N-dimensional array into a flat byte sequence. Zarr v3 requires exactly one such codec. The standard array-to-bytes codec is
bytes, which serialises the array in C-order with a declared byte order.- Byte-to-byte codec
A codec that transforms one byte sequence into another — typically compression or encryption. Multiple byte-to-byte codecs may be stacked. Common codecs:
blosc,zstd,gzip,lz4.- Blosc
A high-performance meta-compressor that internally uses one of several compression algorithms (zstd, lz4, blosclz, zlib). Blosc also performs byte-shuffling before compression, which improves compression ratios on typed numeric arrays. It is the default byte-to-byte codec in
zarr-vectors-py.- Sharding codec (
sharding_indexed) A Zarr v3 codec that packs multiple logical chunks into a single storage key. Reduces file count and cloud storage request overhead. Discussed in detail in Sharding.
- Draco
Google’s geometry compression library, optimised for 3-D mesh data.
zarr-vectors-pyregisters a custom Zarr codec (draco) that applies Draco encoding to thelinks/<delta>andvertices/arrays of mesh-type stores. Requireszarr-vectors[draco].- Fill value
The value written to a chunk position that is not explicitly stored. In Zarr v3, the fill value is declared per array in
zarr.json. ZVF uses0.0for floating-point position arrays and-1for integer index arrays.
Introduction¶
A Zarr v3 codec pipeline controls how chunk data is serialised to bytes and
compressed before storage. ZVF does not mandate a specific codec; any Zarr
v3-compatible pipeline is valid. However, zarr-vectors-py ships with
sensible defaults and provides helper functions to configure specialised
pipelines for specific geometry types (e.g. Draco for meshes).
Understanding the codec pipeline matters for contributors adding new geometry types and for users who want to tune compression ratios or compatibility with downstream tools.
Technical reference¶
Default pipeline¶
The default codec pipeline used by zarr-vectors-py for all position and
attribute arrays is:
[
{ "name": "bytes", "configuration": { "endian": "little" } },
{ "name": "blosc", "configuration": { "cname": "zstd", "clevel": 5,
"shuffle": "bitshuffle",
"blocksize": 0, "nthreads": 1 } }
]
bytes(array-to-bytes): serialises the array in C-order, little-endian.bloscwithzstdat level 5: good compression ratio with fast decode.bitshuffleis used instead ofbyteshufflefor floating-point arrays; it typically achieves better compression on 32-bit floats.
To override the default pipeline, pass codec_config to the write function:
import numcodecs
from zarr_vectors.types.points import write_points
write_points(
"scan.zarrvectors",
positions,
chunk_shape=(200., 200., 200.),
codec_config={
"positions": [
{"name": "bytes", "configuration": {"endian": "little"}},
{"name": "zstd", "configuration": {"level": 3}},
]
},
)
Per-array codec configuration¶
zarr-vectors-py writes every chunk array with the same codec
pipeline, set per-write via the compressor= kwarg on the public
write functions (write_points, write_graph, write_polylines,
write_lines, write_mesh). The default is BytesCodec-only — no
compression. This keeps the fast async PUT path active, which is
the original cloud-write design point of the batched writer
(compression CPU is typically 10×+ smaller than cloud PUT latency, so
the wins are negligible against object stores). Opt into compression
explicitly when disk size matters more than throughput.
Accepted compressor= values:
Value |
Resulting |
|---|---|
|
|
|
|
|
Blosc(Zstd, BitShuffle, l5) — what this spec historically recommended for |
|
Caller-supplied Zarr V3 codec specs, passed through verbatim. |
Per-array-type codec selection (different codecs for vertices/
vs vertex_fragments/ vs links/) is future work — today every
chunk array in a single write_* call gets the same codec.
Note
vertex_fragments/ and link_fragments/ are pre-packed binary
layouts (header + bitmap + dense range table + uint32/int64 CSR)
with little redundancy for a general-purpose compressor to exploit.
The chunk arrays still go through whatever compressor= was passed,
but in practice these payloads gain little from compression and
suffer the per-chunk codec overhead on hot decode paths. The arrays’
group metadata carries encoding: "fragment_index_v1" to identify
the on-disk format independently of any outer compression.
object_index/manifests uses the vlen-bytes codec (variable-
length byte array). Each entry is one object’s encoded manifest blob;
per-object random access reads only the zarr chunk containing the
requested OID (~16K objects per chunk).
The Zarr V3 specification for variable-length byte arrays is still
in development (tracked at
zarr-extensions);
ZVF 0.x stores written with vlen-bytes may need to be re-encoded if
the eventual spec lands incompatibly.
Legacy stores written before this layout change have
object_index/data and object_index/offsets byte-blob children
instead of object_index/manifests; readers in zarr_vectors
auto-detect and handle both layouts.
See Fragment-index arrays for the fragment-index byte layout and Object manifest for the manifest-blob format.
Draco codec (mesh geometry)¶
When zarr-vectors[draco] is installed, the links/<delta> and vertices/
arrays of mesh-type stores can use Draco compression. Enable it by
passing use_draco=True and draco_quantization=<bits> to write_mesh()
(or to the OBJ/STL/PLY converters in zarr-vectors-tools).
The custom codec is registered with Zarr as:
{
"name": "draco",
"configuration": {
"quantization_bits": 11,
"compression_level": 7
}
}
Draco-compressed stores are not readable by tools that do not have the
draco codec registered. zarr-vectors-py will raise ValueError when
attempting to read a Draco-compressed store without the extra installed.
Quantisation and precision¶
Draco uses integer quantisation of vertex positions. The quantisation level controls the trade-off between compression ratio and positional precision:
|
Precision (relative to bounding box) |
Typical ratio |
|---|---|---|
8 |
1 / 256 of bbox per axis |
10–15× |
11 |
1 / 2048 of bbox per axis |
6–10× |
14 |
1 / 16384 of bbox per axis |
3–6× |
For neuroscience mesh data (brain surfaces, cell boundaries) quantised to nanometre scale, 11 bits is a common choice that preserves sub-micrometre precision while achieving 6–10× compression over float32.
Choosing a codec¶
Scenario |
Recommended codec |
|---|---|
Default / general purpose |
|
Fastest possible read (RAM-limited) |
|
Maximum compression (archival) |
|
Mesh geometry with size budget |
|
Compatibility with non-Zarr tools |
|
Cloud with many small chunks |
Consider sharding codec first |
Codec availability and compatibility¶
The following codecs are available without extra dependencies:
bytes(array-to-bytes)gzip(standard library)zstd(vianumcodecs)blosc(vianumcodecs)lz4(vianumcodecs)
Codecs requiring extras:
draco— requireszarr-vectors[draco]sharding_indexed— built into Zarr v3 ≥ 2.18; no extra required
When distributing ZVF stores, prefer the default Blosc pipeline for
maximum compatibility. Document any use of non-default codecs prominently
(e.g. in the store’s .zattrs "notes" field) so consumers know what
is required.