Codec pipeline¶
Terms¶
- Codec
A single transformation applied to chunk data. A codec takes bytes (or an array) as input and produces bytes as output. Codecs are composable; an ordered sequence of codecs forms a pipeline.
- Array-to-bytes codec
The first codec in a Zarr v3 pipeline, which converts a typed N-dimensional array into a flat byte sequence. Zarr v3 requires exactly one such codec. The standard array-to-bytes codec is
bytes, which serialises the array in C-order with a declared byte order.- Byte-to-byte codec
A codec that transforms one byte sequence into another — typically compression or encryption. Multiple byte-to-byte codecs may be stacked. Common codecs:
blosc,zstd,gzip,lz4.- Blosc
A high-performance meta-compressor that internally uses one of several compression algorithms (zstd, lz4, blosclz, zlib). Blosc also performs byte-shuffling before compression, which improves compression ratios on typed numeric arrays. It is the default byte-to-byte codec in
zarr-vectors-py.- Sharding codec (
sharding_indexed) A Zarr v3 codec that packs multiple logical chunks into a single storage key. Reduces file count and cloud storage request overhead. Discussed in detail in Sharding.
- Draco
Google’s geometry compression library, optimised for 3-D mesh data.
zarr-vectors-pyregisters a custom Zarr codec (draco) that applies Draco encoding to thelinks/<delta>andvertices/arrays of mesh-type stores. Requireszarr-vectors[draco].- Fill value
The value written to a chunk position that is not explicitly stored. In Zarr v3, the fill value is declared per array in
zarr.json. ZVF uses0.0for floating-point position arrays and-1for integer index arrays.
Introduction¶
A Zarr v3 codec pipeline controls how chunk data is serialised to bytes and
compressed before storage. ZVF does not mandate a specific codec; any Zarr
v3-compatible pipeline is valid. However, zarr-vectors-py ships with
sensible defaults and provides helper functions to configure specialised
pipelines for specific geometry types (e.g. Draco for meshes).
Understanding the codec pipeline matters for contributors adding new geometry types and for users who want to tune compression ratios or compatibility with downstream tools.
Technical reference¶
Default pipeline¶
The default codec pipeline used by zarr-vectors-py for all position and
attribute arrays is:
[
{ "name": "bytes", "configuration": { "endian": "little" } },
{ "name": "blosc", "configuration": { "cname": "zstd", "clevel": 5,
"shuffle": "bitshuffle",
"blocksize": 0, "nthreads": 1 } }
]
bytes(array-to-bytes): serialises the array in C-order, little-endian.bloscwithzstdat level 5: good compression ratio with fast decode.bitshuffleis used instead ofbyteshufflefor floating-point arrays; it typically achieves better compression on 32-bit floats.
To override the default pipeline, pass codec_config to the write function:
import numcodecs
from zarr_vectors.types.points import write_points
write_points(
"scan.zarrvectors",
positions,
chunk_shape=(200., 200., 200.),
codec_config={
"positions": [
{"name": "bytes", "configuration": {"endian": "little"}},
{"name": "zstd", "configuration": {"level": 3}},
]
},
)
Per-array codec configuration¶
Different arrays in a ZVF store may use different codec pipelines. The
zarr-vectors-py defaults by array type are:
Array |
Default codec |
Rationale |
|---|---|---|
|
|
Float32 positions compress well with bitshuffle |
|
none (opaque |
Pre-packed fragment-index format; see note below |
|
none (opaque |
Parallel to |
|
|
Int32/Int64 index pairs; same codec for |
|
|
Varies by dtype; parallel to |
|
|
Typically small flat blob |
|
|
Parallel to |
|
|
Varies by dtype |
|
|
One chunk per ~16K objects; random-access read fetches only the chunk holding the requested OID |
Note
vertex_fragments/ and link_fragments/ bypass the Zarr codec
pipeline. Each chunk is written as opaque uint8 bytes via the
FsGroup write_bytes path. The fragment-index format is already a
packed binary layout (header + bitmap + dense range table +
uint32/int64 CSR) with little redundancy for a general-purpose
compressor to exploit, and adding a compression step introduces
latency on hot decode paths. The arrays’ group metadata carries
encoding: "fragment_index_v1" to identify the on-disk format
independently of any outer compression a backend might apply to the
bytes themselves.
object_index/manifests uses the vlen-bytes codec. Each entry
is one object’s encoded manifest blob; per-object random access reads
only the zarr chunk containing the requested OID (~16K objects per
chunk), not the whole index. The manifest-blob wire format itself is
unchanged from earlier ZVF versions — only the on-disk container
changed from a data + offsets byte-blob pair to a single ragged
zarr array.
The Zarr V3 specification for variable-length byte arrays is still
in development (tracked at
zarr-extensions);
ZVF 0.x stores written with vlen-bytes may need to be re-encoded if
the eventual spec lands incompatibly.
Legacy stores written before this layout change have
object_index/data and object_index/offsets byte-blob children
instead of object_index/manifests; readers in zarr_vectors
auto-detect and handle both layouts.
See Fragment-index arrays for the fragment-index byte layout and Object manifest for the manifest-blob format.
Draco codec (mesh geometry)¶
When zarr-vectors[draco] is installed, the links/<delta> and vertices/
arrays of mesh-type stores can use Draco compression. Enable it by
passing use_draco=True and draco_quantization=<bits> to write_mesh()
(or to the OBJ/STL/PLY converters in zarr-vectors-tools).
The custom codec is registered with Zarr as:
{
"name": "draco",
"configuration": {
"quantization_bits": 11,
"compression_level": 7
}
}
Draco-compressed stores are not readable by tools that do not have the
draco codec registered. zarr-vectors-py will raise ValueError when
attempting to read a Draco-compressed store without the extra installed.
Quantisation and precision¶
Draco uses integer quantisation of vertex positions. The quantisation level controls the trade-off between compression ratio and positional precision:
|
Precision (relative to bounding box) |
Typical ratio |
|---|---|---|
8 |
1 / 256 of bbox per axis |
10–15× |
11 |
1 / 2048 of bbox per axis |
6–10× |
14 |
1 / 16384 of bbox per axis |
3–6× |
For neuroscience mesh data (brain surfaces, cell boundaries) quantised to nanometre scale, 11 bits is a common choice that preserves sub-micrometre precision while achieving 6–10× compression over float32.
Choosing a codec¶
Scenario |
Recommended codec |
|---|---|
Default / general purpose |
|
Fastest possible read (RAM-limited) |
|
Maximum compression (archival) |
|
Mesh geometry with size budget |
|
Compatibility with non-Zarr tools |
|
Cloud with many small chunks |
Consider sharding codec first |
Codec availability and compatibility¶
The following codecs are available without extra dependencies:
bytes(array-to-bytes)gzip(standard library)zstd(vianumcodecs)blosc(vianumcodecs)lz4(vianumcodecs)
Codecs requiring extras:
draco— requireszarr-vectors[draco]sharding_indexed— built into Zarr v3 ≥ 2.18; no extra required
When distributing ZVF stores, prefer the default Blosc pipeline for
maximum compatibility. Document any use of non-default codecs prominently
(e.g. in the store’s .zattrs "notes" field) so consumers know what
is required.