Choosing chunk shape and bin shape

chunk_shape and bin_shape are the two most important parameters when writing a ZVF store. This guide gives practical heuristics and worked examples for choosing them well.


The quick version

If you are unsure, start here and tune later:

# Rule of thumb for 3-D biological data
# chunk_shape ≈ L such that expected vertices per chunk ≈ 50 000
# bin_shape   ≈ chunk_shape / 4 per axis

# Point cloud, 10M vertices in a 4000³ µm volume:
# expected/chunk at chunk=500 ≈ 10M × (500³/4000³) ≈ 19 500 (inside the 10 000–100 000 target) ✓
write_points(..., chunk_shape=(500., 500., 500.), bin_shape=(125., 125., 125.))

# Streamlines, 100k tracts in a 200³ mm volume:
# target 200–500 streamlines per chunk
write_polylines(..., chunk_shape=(50., 50., 50.), bin_shape=(10., 10., 10.))

Chunk shape guidance

Primary consideration: I/O unit size

Each chunk is one file on disk or one object in cloud storage. Optimise chunk_shape so that:

  • Typical queries load 1–8 chunks. If a query always hits exactly one chunk, chunk_shape is well-matched to the query size.

  • Each chunk is 50 KB–50 MB compressed. Chunks smaller than ~50 KB have disproportionate per-request overhead; chunks larger than ~50 MB load more data than needed for small queries.
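To check the "1–8 chunks per query" guideline for a candidate chunk_shape, it can help to count how many chunks a query box overlaps. The following is a minimal sketch, not part of the library: the helper name chunks_touched is hypothetical, and it assumes the chunk grid starts at the origin and that the upper query bound is exclusive.

```python
import math

def chunks_touched(query_min, query_max, chunk_shape):
    """Count chunks overlapped by an axis-aligned query box.

    Assumes the chunk grid starts at the origin and query_max is exclusive.
    """
    count = 1
    for lo, hi, edge in zip(query_min, query_max, chunk_shape):
        first = math.floor(lo / edge)                 # index of first chunk hit
        last = max(first, math.ceil(hi / edge) - 1)   # index of last chunk hit
        count *= last - first + 1
    return count

# A 100³ query aligned inside a 200³ chunk touches one chunk ...
print(chunks_touched((0, 0, 0), (100, 100, 100), (200, 200, 200)))        # 1
# ... while the same-sized query straddling a chunk corner touches 2 x 2 x 2 = 8.
print(chunks_touched((150, 150, 150), (250, 250, 250), (200, 200, 200)))  # 8
```

A query half the chunk edge therefore falls in the 1–8 chunk range regardless of where it lands on the grid.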

Estimating expected vertices per chunk

def estimate_chunk_vertices(total_vertices, total_volume, chunk_shape):
    """Expected vertices per chunk, assuming uniform vertex density."""
    chunk_volume = chunk_shape[0] * chunk_shape[1] * chunk_shape[2]
    return total_vertices * chunk_volume / total_volume

# 10M vertices, 4000³ µm volume, 500³ µm chunks
n = estimate_chunk_vertices(10_000_000, 4000**3, (500, 500, 500))
# n ≈ 19 531, inside the 10 000–100 000 target

Target 10 000–100 000 vertices per chunk for point clouds. For sparser geometry types (streamlines, skeletons), target 100–500 objects per chunk.
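The estimate can also be inverted to go from a target vertex count to a chunk edge length. This is a sketch under the same uniform-density assumption; suggest_chunk_edge is a hypothetical helper, not a library function, and it only considers cubic chunks.

```python
def suggest_chunk_edge(total_vertices, total_volume, target_per_chunk=50_000):
    """Edge length of a cubic chunk expected to hold ~target_per_chunk vertices.

    Assumes vertices are spread uniformly through the volume.
    """
    density = total_vertices / total_volume          # vertices per unit volume
    return (target_per_chunk / density) ** (1.0 / 3.0)

# 10M vertices in a 4000³ µm volume, aiming for ~50 000 vertices per chunk
edge = suggest_chunk_edge(10_000_000, 4000**3)
print(round(edge))  # 684
```

Real data is rarely uniform, so treat the result as a starting point and round it to a convenient value before profiling.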

Chunk shape by use case

Use case                    Suggested chunk_shape      Rationale
Interactive local viewer    100–200 physical units     Small enough for fast partial loads
Cloud serving (S3/GCS)      300–500 physical units     Fewer objects → lower request cost
HPC batch analysis          500–2000 physical units    Large chunks reduce file count overhead
Neuroglancer (fine mesh)    10–50 physical units       Meshes are dense; small regions needed
Synchrotron tractography    50–100 mm                  Typical white-matter query region

Anisotropic data

For data with anisotropic voxels (e.g. 1 µm × 1 µm × 4 µm), choose chunk_shape so that each chunk covers roughly equal physical extents in all dimensions:

# 1×1×4 µm voxels: make chunks ~200 µm in x,y and ~200 µm in z
chunk_shape = (200., 200., 200.)   # 200×200×50 voxels
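To see how a physical chunk_shape maps onto an anisotropic voxel grid, divide each axis by the voxel size. A minimal sketch; chunk_shape_in_voxels is a hypothetical helper used only for illustration.

```python
def chunk_shape_in_voxels(chunk_shape, voxel_size):
    """Convert a physical chunk shape to voxel counts per axis (rounded)."""
    return tuple(round(c / v) for c, v in zip(chunk_shape, voxel_size))

# 200 µm cubic chunks on a 1 x 1 x 4 µm voxel grid
print(chunk_shape_in_voxels((200., 200., 200.), (1., 1., 4.)))  # (200, 200, 50)
```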

Bin shape guidance

Primary consideration: query granularity

bin_shape controls the finest spatial resolution of a bounding-box query. A query that requests a 50³ µm region will load only the bins that overlap it — not the full chunk.

Rule of thumb: set bin_shape so that a typical query spans 2–8 bins per axis.

# Typical query size 50³ µm, chunk_shape 200³ µm:
# 2 ≤ query_size / bin_shape ≤ 8
# → 6.25 µm ≤ bin_shape ≤ 25 µm
# → bin_shape = 25 µm (query spans 2 bins per axis; 8 bins per chunk axis) ✓
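The 2–8 bins-per-axis rule of thumb can be applied mechanically. The sketch below searches for the coarsest bin edge that satisfies it, under the assumption (not stated in this guide) that the bin edge should divide the chunk edge evenly; pick_bin_edge is a hypothetical helper, not a library function.

```python
def pick_bin_edge(chunk_edge, query_edge, min_bins=2, max_bins=8, max_ratio=64):
    """Return (bin_edge, chunk/bin ratio) for the coarsest bin such that a
    typical query spans between min_bins and max_bins bins per axis.

    Assumes the bin edge divides the chunk edge evenly. Returns None if no
    ratio up to max_ratio satisfies the rule.
    """
    smallest = query_edge / max_bins   # finer bins than this exceed max_bins
    largest = query_edge / min_bins    # coarser bins than this miss min_bins
    for ratio in range(1, max_ratio + 1):
        bin_edge = chunk_edge / ratio
        if smallest <= bin_edge <= largest:
            return bin_edge, ratio
    return None

# 200 µm chunks, typical query ~100 µm per axis
print(pick_bin_edge(200., 100.))  # (50.0, 4)
```

Preferring the coarsest admissible bin keeps the per-chunk index as small as the rule allows.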

Bins per chunk and index overhead

The VG index is B_per_chunk × 16 bytes per chunk. For bin_shape = chunk_shape / 4 in 3-D: B_per_chunk = 64, index size = 1 KB per chunk. This is negligible compared to vertex data.

chunk/bin ratio per axis    B_per_chunk (3-D)    Index size per chunk
2                           8                    128 bytes
4                           64                   1 KB
8                           512                  8 KB
16                          4096                 64 KB

Ratios above 8 per axis are rarely needed and add non-trivial index overhead.
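The figures above follow directly from the 16 bytes per bin stated earlier. A short sketch for checking other shapes; index_overhead is a hypothetical helper, and rounding partial bins up with ceil is an assumption about how non-integer ratios would be handled.

```python
import math

def index_overhead(chunk_shape, bin_shape, bytes_per_bin=16):
    """Return (bins per chunk, index bytes per chunk).

    Partial bins along an axis are counted as whole bins (assumption).
    """
    bins = 1
    for chunk_edge, bin_edge in zip(chunk_shape, bin_shape):
        bins *= math.ceil(chunk_edge / bin_edge)
    return bins, bins * bytes_per_bin

print(index_overhead((200., 200., 200.), (50., 50., 50.)))  # (64, 1024)
print(index_overhead((50., 50., 50.), (10., 10., 10.)))     # (125, 2000)
```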

One bin per chunk (no sub-chunk indexing)

Omit bin_shape or set it equal to chunk_shape when:

  • All queries read entire chunks.

  • The store is written for sequential batch processing only.

  • You need maximum compatibility with tools that do not understand bins.

write_points(..., chunk_shape=(200., 200., 200.))
# bin_shape defaults to chunk_shape → 1 bin per chunk

Worked examples

Synchrotron point cloud (HiP-CT)

Dataset:     1B vertices in 8000³ µm HiP-CT scan
Query size:  ~100³ µm (interactive viewport at high zoom)
Platform:    S3, Neuroglancer serving
# 500³ chunk → 1B × (500³/8000³) ≈ 244k vertices/chunk (dense dataset, above the 100 000 target)
# Reduce to 200³ → 1B × (200³/8000³) ≈ 15.6k vertices/chunk, inside the target range
# bin_shape = 50³ → 4³ = 64 bins/chunk, query granularity = 50 µm ✓

write_points(
    store,
    positions,
    chunk_shape=(200., 200., 200.),
    bin_shape=(50., 50., 50.),
)

DWI tractography (100k streamlines, 50-vertex average)

Dataset:     100k streamlines × 50 vertices = 5M vertices
             Streamlines span a 180³ mm MRI volume
Query size:  Typically one white-matter bundle (~30×30×80 mm)
Platform:    Local analysis + Neuroglancer
# At chunk_shape=50³ mm: 50³/180³ × 100k ≈ 2143 streamlines/chunk ✓
# (target 200–500; above target but acceptable)
# bin_shape = 10³ mm → 125 bins/chunk; query granularity = 10 mm ✓

write_polylines(
    store,
    streamlines,
    chunk_shape=(50., 50., 50.),
    bin_shape=(10., 10., 10.),
)

EM connectome skeletons (10 000 neurons)

Dataset:     10 000 skeletons, avg 5000 nodes/neuron = 50M nodes
             Data in a 1000³ µm EM volume
Query:       Single neuron retrieval by ID (object_index lookup)
             Spatial query by region (~100³ µm)
Platform:    Local analysis
# chunk_shape = 100³ µm → 100³/1000³ × 50M ≈ 50k vertices/chunk ✓
# bin_shape = 25³ µm → 4³ = 64 bins/chunk, 25 µm resolution ✓

write_graph(
    store,
    positions,
    edges,
    chunk_shape=(100., 100., 100.),
    bin_shape=(25., 25., 25.),
    geometry_type="skeleton",
)

After writing: profile and tune

Open the store and inspect the distribution of vertices per chunk (the zarr-vectors info CLI in zarr-vectors-tools prints this summary directly):

0:  100000 vertices, 125 chunks
  vertices/chunk:  min=124  median=812  max=1843  p95=1602

If median << 10000, your chunk_shape is too small relative to the data density — increase it. If p95 >> 100000, chunks may be too large for interactive use — decrease it.
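If you have the per-chunk vertex counts in hand, the same check can be scripted. This is a rough sketch using only the standard library: tuning_advice is a hypothetical helper, the counts are made-up illustration values, and the "<<" and ">>" comparisons are simplified to plain thresholds.

```python
import statistics

def tuning_advice(counts, low=10_000, high=100_000):
    """Suggest a chunk_shape adjustment from per-chunk vertex counts.

    Simplification: compares the median and 95th percentile directly
    against the 10 000 / 100 000 targets.
    """
    median = statistics.median(counts)
    p95 = statistics.quantiles(counts, n=20)[-1]  # last of 19 cut points = p95
    if median < low:
        return "increase chunk_shape"
    if p95 > high:
        return "decrease chunk_shape"
    return "keep chunk_shape"

# Made-up counts resembling the summary above (median well below 10 000)
print(tuning_advice([124, 640, 812, 1100, 1843]))  # increase chunk_shape
```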