Below is a potential Claude prompt to build a library for heterogeneous codec support in a Zarr-backed Array API library. Before spending time/tokens/water/energy on it, I wanted to share the idea with the team. Also, I don't really have time to guide Claude through this project anyway 😅
This could fit into #347 or #346.
Use this prompt to kick off the project
I want to build a small Python library that exposes a single Zarr-like array whose data is physically stored as multiple real Zarr arrays, each with its own codec pipeline, presented through one coherent array interface. It should sit on top of zarr-python and use icechunk for atomic, versioned, multi-array coherence. Start with the brainstorming skill to pin down the manifest schema and the read/write contract before writing any code. Treat the design notes below as the starting point, not a finished spec.
Why this shape (the key insight)
A single spec-compliant Zarr array declares one codec pipeline in zarr.json; every chunk is decoded through it. So "different codecs in different regions of one array" cannot be a single spec array without a Zarr Enhancement Proposal (e.g. a codec-aware shard index).
This library sidesteps that entirely: the logical array is assembled in the library, not by a spec-compliant reader. We keep N real, fully spec-compliant Zarr arrays (each with a homogeneous codec pipeline) and do the routing + stitching ourselves. No format change, no must_understand, no cross-language buy-in. Backing data stays readable by TensorStore / zarrs / GDAL; only the assembly convention is ours.
This is different from VirtualiZarr / Kerchunk: those produce one virtual Zarr array via a chunk manifest, which forces a single codec pipeline across all referenced chunks — so they cannot mix codecs within one array. This library lands at a different point in the design space: many arrays, assembled, not one manifest array.
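To make the "one pipeline per array" point concrete, here is a rough sketch of what a Zarr v3 array's zarr.json contains, written as a Python dict. The shape, chunk shape, and codec choices are illustrative values only, not taken from any real dataset; the thing to notice is that `codecs` is a single array-wide list with no per-chunk or per-region slot.

```python
import json

# Illustrative Zarr v3 array metadata (the content of a zarr.json file).
# All concrete values here are assumptions chosen for the example.
array_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [1000, 1000],
    "data_type": "float32",
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [100, 100]}},
    "chunk_key_encoding": {"name": "default", "configuration": {"separator": "/"}},
    "fill_value": 0.0,
    # One array-wide pipeline: there is no per-chunk or per-region override.
    "codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "gzip", "configuration": {"level": 5}},
    ],
}

zarr_json = json.dumps(array_metadata, indent=2)
```

Each backing member in the proposed library would carry its own document of this form, differing only in `shape` and `codecs`.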
Architecture (three small pieces)
- Manifest: a library-private metadata document (stored in the Icechunk group's attrs) describing the logical array (shape, dtype, chunk grid) and a map from region → backing member array. Example:

    logical: shape=(1000, 1000), dtype=float32, chunks=(100, 100)
    regions:
      - bounds: [(0, 500), (0, 1000)] -> member "part_a" # blosc-zstd
      - bounds: [(500, 1000), (0, 1000)] -> member "part_b" # gzip

Each member is a normal zarr.Array under the same group, with its own codecs.
- LogicalArray duck type: NOT a subclass of zarr.Array (avoids inheriting eager-NumPy semantics and single-pipeline metadata). Exposes shape, dtype, __getitem__, optionally __setitem__. Read path:

    def __getitem__(self, selection):
        out = np.empty(output_shape(selection), self.dtype)
        for region in self.manifest.regions_overlapping(selection):
            member = self.group[region.name]  # a real zarr.Array
            local_sel = region.to_local(selection)  # translate coords
            out[region.to_output(selection)] = member[local_sel]  # zarr decodes its own codecs
        return out

zarr does all codec work per member; our code is pure coordinate bookkeeping + assembly.
- Icechunk as the coherence layer: this is what makes the assembly a "coherent unit":
  - Atomic multi-array commits: the manifest and all members live in one repo, so appending a new region with a different codec and updating the manifest is a single transaction. Readers never see a half-updated logical array.
  - Versioning / time-travel of the whole assembly at once.
  - Virtual chunk references: members can point at pre-existing external data without copying.
Open members via the normal Zarr API on the Icechunk-backed store: zarr.open_group(store=session.store). Icechunk sits below zarr; no direct integration beyond getting a store and committing.
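The read path and its coordinate bookkeeping could look roughly like the following. This is a minimal sketch under several assumptions: only step-1 slice selections are handled, the regions tile the logical array completely, and plain NumPy arrays stand in for the member zarr.Array objects (both accept tuple-of-slice indexing). `Region`, `overlap`, `to_local`, `to_output`, and `read` are hypothetical names, and their signatures differ slightly from the pseudocode above because they operate on a precomputed overlap box.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class Region:
    """Hypothetical manifest entry: member name plus half-open logical bounds."""
    name: str
    bounds: tuple  # ((start, stop), ...) per axis, logical coordinates

    def overlap(self, sel):
        """Per-axis (lo, hi) intersection with `sel`, or None if disjoint."""
        box = []
        for (r_lo, r_hi), (s_lo, s_hi) in zip(self.bounds, sel):
            lo, hi = max(r_lo, s_lo), min(r_hi, s_hi)
            if lo >= hi:
                return None
            box.append((lo, hi))
        return box

    def to_local(self, box):
        """Shift a logical box into the member array's own coordinates."""
        return tuple(slice(lo - r_lo, hi - r_lo)
                     for (lo, hi), (r_lo, _) in zip(box, self.bounds))


def to_output(box, sel):
    """Shift a logical box into the coordinates of the output buffer."""
    return tuple(slice(lo - s_lo, hi - s_lo)
                 for (lo, hi), (s_lo, _) in zip(box, sel))


def read(regions, members, shape, dtype, selection):
    """Assemble a step-1 slice selection from whichever members it touches."""
    sel = []
    for s, n in zip(selection, shape):
        lo, hi, step = s.indices(n)
        if step != 1:
            raise NotImplementedError("only step-1 slices in this sketch")
        sel.append((lo, max(lo, hi)))
    # Assumes the regions tile the logical array, so every cell gets written.
    out = np.empty([hi - lo for lo, hi in sel], dtype=dtype)
    for region in regions:
        box = region.overlap(sel)
        if box is not None:
            out[to_output(box, sel)] = members[region.name][region.to_local(box)]
    return out


# Tiny stand-in for the 1000x1000 example: two members split along axis 0.
logical = np.arange(80, dtype="float32").reshape(10, 8)
regions = [Region("part_a", ((0, 5), (0, 8))), Region("part_b", ((5, 10), (0, 8)))]
members = {"part_a": logical[:5], "part_b": logical[5:]}  # would be zarr.Array objects
window = read(regions, members, logical.shape, logical.dtype, (slice(3, 7), slice(2, 6)))
```

Here `window` spans both members, and each member contributes one slice-copy. In the real library the `members[...]` lookup would be the zarr read that triggers that member's own codec pipeline.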
Decisions that keep v1 minimal
- Align region boundaries to the logical chunk grid — so every logical chunk lives wholly in one member. Assembly becomes trivial slice-copies (no partial-chunk stitching across members), and each member is itself a valid sub-grid. Strongly worth the constraint for v1.
- One dtype, vary only codecs — every member decodes to the same dtype, so assembly is a plain copy. Different dtypes per region is strictly harder (needs a casting policy to present one logical dtype) — scope it OUT of minimal.
- Reject cross-region writes in v1 — allow per-region writes plus whole-new-region appends (the natural place to choose a codec for incoming data). Splitting a write across a region boundary is a later feature.
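Two of these v1 constraints reduce to small checks on half-open bounds. The sketch below is one possible shape for them, assuming regions are stored as a name → bounds mapping; `is_chunk_aligned` and `owning_region` are hypothetical helper names, not part of zarr-python or Icechunk.

```python
def is_chunk_aligned(bounds, chunks, shape):
    """True if every region edge falls on a chunk boundary (or the array edge)."""
    for (lo, hi), chunk, size in zip(bounds, chunks, shape):
        if not 0 <= lo < hi <= size:
            return False
        if lo % chunk != 0:
            return False
        # A region may stop at the array edge even if the last chunk is partial.
        if hi % chunk != 0 and hi != size:
            return False
    return True


def owning_region(regions, write_bounds):
    """Return the name of the single region containing the write, else raise."""
    for name, bounds in regions.items():
        if all(r_lo <= w_lo and w_hi <= r_hi
               for (r_lo, r_hi), (w_lo, w_hi) in zip(bounds, write_bounds)):
            return name
    raise ValueError(f"write {write_bounds} is not contained in a single region")


# The manifest example from the architecture notes.
chunks, shape = (100, 100), (1000, 1000)
regions = {
    "part_a": ((0, 500), (0, 1000)),
    "part_b": ((500, 1000), (0, 1000)),
}
aligned = all(is_chunk_aligned(b, chunks, shape) for b in regions.values())
target = owning_region(regions, ((100, 200), (0, 50)))
```

A write to rows 450 to 550 would raise here, which is the "reject cross-region writes" behavior; splitting such a write across members is the deferred feature.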
Known edges / deferred
- A selection spanning many members issues many reads with no coalescing. Fine for v1; range-coalescing / a query planner (cf. zarr-python lazy-indexing direction) is a later optimization.
- The manifest is library-private — nothing outside the library knows the logical array exists. Making it readable by other tools would mean returning to the spec conversation (a polymorphic shard index ZEP).
Open questions to resolve in brainstorming
- Manifest schema: exact serialization, region representation (bounds vs. chunk-block coordinates), versioning of the manifest format itself.
- The read/write contract: materialization semantics, whether __getitem__ is eager or lazy, how __setitem__ validates region containment.
- Region algebra, the actual novel code: overlap resolution, coordinate translation (to_local / to_output), boundary validation against the chunk grid.
- Package boundaries / naming; dependency on zarr-python and icechunk versions.
- Test strategy: round-trip per-member codecs, cross-region reads, append-with-new-codec under Icechunk transactions.
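As one possible starting point for the manifest-schema question, here is a sketch that serializes the manifest to JSON (which is what group attrs can hold) with an explicit format version. Every field name, the `format_version` value, and the choice of half-open element bounds rather than chunk-block coordinates are assumptions to be settled in brainstorming, not a decided schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RegionEntry:
    """Hypothetical region record: which member backs which logical bounds."""
    member: str
    bounds: tuple  # ((start, stop), ...) half-open, logical coordinates


@dataclass(frozen=True)
class Manifest:
    """Hypothetical manifest document for one logical array."""
    shape: tuple
    dtype: str
    chunks: tuple
    regions: tuple  # tuple of RegionEntry
    format_version: int = 1

    def to_json(self):
        # asdict recurses into the nested RegionEntry dataclasses.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        raw = json.loads(text)
        if raw["format_version"] != 1:
            raise ValueError(f"unsupported manifest version {raw['format_version']}")
        # JSON has no tuples, so lists are converted back on load.
        regions = tuple(
            RegionEntry(r["member"], tuple(tuple(b) for b in r["bounds"]))
            for r in raw["regions"]
        )
        return cls(tuple(raw["shape"]), raw["dtype"], tuple(raw["chunks"]),
                   regions, raw["format_version"])


manifest = Manifest(
    shape=(1000, 1000),
    dtype="float32",
    chunks=(100, 100),
    regions=(
        RegionEntry("part_a", ((0, 500), (0, 1000))),
        RegionEntry("part_b", ((500, 1000), (0, 1000))),
    ),
)
restored = Manifest.from_json(manifest.to_json())
```

Checking the version on load gives a clear failure mode when the manifest format itself evolves, which is one of the open questions listed above.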
Context
This idea came out of planning discussions in the zarr-python-planning repo (d-v-b/zarr-python-planning) — specifically the lazy-indexing and codecs proposals, which establish that per-chunk heterogeneous codecs in a single spec array require a ZEP. This library is the no-spec-change alternative.