Benchmarks

The benchmark notebooks under benchmarks/ each follow a tight setup → sweep → table → plot shape and run in ~5 minutes on a laptop. Numbers below come from 01_size_scaling.ipynb on a typical SSD-backed workstation; treat them as order-of-magnitude references — wall times depend on disk, filesystem, and OS cache state.

Notebook

Axis swept

Fixed

01_size_scaling.ipynb

vertex count N {10³, 10⁴, 10⁵, 10⁶}

point cloud, local backend

02_data_types.ipynb

geometry type (all seven)

N = 50 000, local backend

03_backends.ipynb

storage backend (local, obstore, fsspec)

point cloud, N = 100 000

Vertex scaling — 01_size_scaling

A point cloud with N random vertices in [0, 1000)³ is written and read four ways against a pandas / CSV baseline. Each N is averaged over 10 runs; bands are 95% Student’s-t confidence intervals (df=9).

Vertex scaling: write, read all, read one, disk size — zarr-vectorsvs CSV

zarr-vectors pays a fixed ~0.4 s setup cost (zarr metadata, fragment-index sidecars, one array per spatial chunk) regardless of N, then scales sublinearly. CSV scales linearly from the first point — to_csv formats and writes every row sequentially, read_csv scans the byte stream once, and a “random row” read is read_csv(..., skiprows=range(1, row_idx + 1), nrows=1) which still parses every preceding line before the one it wants. Crossover points:

Operation

CSV wins when…

zarr-vectors wins when…

Write

N < 10⁵

N 10⁵

Read all

N 10⁶

N > 10⁶ (gap closes fast)

Read one

N < 10⁵

N 10⁵

Disk size

N < 10⁴

N 10⁴

Dtype and on-disk encoding

Both sides start from the same float32 input (numpy.random.default_rng().uniform(...).astype(np.float32)).

Stage

zarr-vectors

pandas / CSV

On-disk encoding

packed float32 → Blosc(Zstd, BitShuffle, level=5)

decimal text, one row per line

Bytes per (x, y, z) row

12 (pre-compression)

~30–50 (8–12 chars per float + delimiters)

Read result dtype

float32 ndarray

float64 DataFrame

Two dtype asymmetries drive the plot. CSV renders each float as a decimal string ({:.12g} by default), so a 16-byte (x, y, z, intensity) row blows up to ~40 bytes of text — and text does not compress with the density Blosc-BitShuffle gives binary float32. On read, pandas silently widens every column to float64 unless dtype= is set; zarr-vectors round-trips the exact float32 it stored.

The constant ~0.1 s “Read one” floor for zarr-vectors at every N is the lazy reader (zarr_vectors.lazy.open_zv) opening the store, listing chunks once, and decoding a single Blosc-compressed vertices/<i.j.k> chunk — no scan, no offset table to walk.

Running locally

pip install zarr-vectors pandas matplotlib
jupyter lab benchmarks/

Expected runtime on a laptop:

  • 01_size_scaling: a few minutes (the 1 M-vertex case dominates).

  • 02_data_types: ~30 seconds.

  • 03_backends: ~10 seconds without cloud, longer with.

Cloud-backend mode

Notebook 03_backends.ipynb benchmarks the obstore and fsspec cloud backends only when the ZV_BENCH_S3_URL env var is set:

export ZV_BENCH_S3_URL="s3://my-bucket/zv-bench/"
pip install "zarr-vectors[obstore]"   # preferred
# or
pip install "zarr-vectors[cloud]"     # fsspec fallback

Without the env var the cloud rows are skipped and a one-row local-only result is reported.

Caveats

  • These notebooks measure disk bytes and wall time only — no memory profiling.

  • No CI gating and no pytest-benchmark integration: regressions are not caught automatically.

  • Different geometry types do genuinely different work; do not cross-compare rows of 02_data_types.

  • Numbers are not directly comparable across hardware, file systems, or cloud regions.

  • The CSV baseline uses pandas’ default read_csv behaviour (float64 output, no chunked reads). A tuned baseline with dtype=np.float32 and engine='c' would narrow the read gap somewhat but not change the random-access scaling.

To regenerate the notebooks from the source recipe:

python benchmarks/_build.py