Potential prompt: minimal library for a Zarr-like array backed by multiple codec-heterogeneous Zarr arrays

Below is a potential Claude prompt to build a library for heterogeneous codec support in a Zarr-backed Array API library. Before spending time/tokens/water/energy on it, I wanted to share the idea with the team. Also, I don't really have time to guide Claude through this project anyway 😅

This could fit into #347 or #346.

---

## Use this prompt to kick off the project

> I want to build a small Python library that exposes a single Zarr-*like* array whose data is physically stored as **multiple real Zarr arrays, each with its own codec pipeline**, presented through one coherent array interface. It should sit on top of `zarr-python` and use `icechunk` for atomic, versioned, multi-array coherence. Start with the brainstorming skill to pin down the manifest schema and the read/write contract before writing any code. Treat the design notes below as the starting point, not a finished spec.

## Why this shape (the key insight)

A single spec-compliant Zarr array declares **one** codec pipeline in `zarr.json`; every chunk is decoded through it. So "different codecs in different regions of one array" cannot be a single spec array without a Zarr Enhancement Proposal (e.g. a codec-aware shard index).
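
To make the constraint concrete, here is a rough sketch of Zarr v3 array metadata written as a Python dict. The shape, chunk size, and gzip level are illustrative values only, and `pipeline_of` is a hypothetical helper; the point is that `codecs` is one array-wide list with no per-chunk or per-region slot.

```python
# Illustrative Zarr v3 array metadata (the contents of a zarr.json), as a dict.
# Values are made up for the example; only the overall structure matters here.
array_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [1000, 1000],
    "data_type": "float32",
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [100, 100]}},
    "chunk_key_encoding": {"name": "default", "configuration": {"separator": "/"}},
    "fill_value": 0.0,
    # One pipeline for the whole array: every chunk is decoded through this list.
    "codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "gzip", "configuration": {"level": 5}},
    ],
}


def pipeline_of(metadata):
    """Hypothetical helper: names of the codecs in the array-wide pipeline."""
    return [codec["name"] for codec in metadata["codecs"]]


pipeline = pipeline_of(array_metadata)
```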

This library sidesteps that entirely: the logical array is assembled **in the library**, not by a spec-compliant reader. We keep N real, fully spec-compliant Zarr arrays (each with a homogeneous codec pipeline) and do the routing + stitching ourselves. No format change, no `must_understand`, no cross-language buy-in. Backing data stays readable by TensorStore / zarrs / GDAL; only the *assembly convention* is ours.

This is **different from VirtualiZarr / Kerchunk**: those produce one virtual Zarr array via a chunk manifest, which forces a single codec pipeline across all referenced chunks — so they *cannot* mix codecs within one array. This library lands at a different point in the design space: many arrays, assembled, not one manifest array.

## Architecture (three small pieces)

1. **Manifest** — a library-private metadata document (stored in the Icechunk group's `attrs`) describing the logical array (shape, dtype, chunk grid) and a map from region → backing member array. Example:
   ```
   logical: shape=(1000, 1000), dtype=float32, chunks=(100, 100)
   regions:
     - bounds: [(0, 500),   (0, 1000)] -> member "part_a"   # blosc-zstd
     - bounds: [(500, 1000), (0, 1000)] -> member "part_b"   # gzip
   ```
   Each member is a normal `zarr.Array` under the same group, with its own codecs.

2. **`LogicalArray` duck type** — NOT a subclass of `zarr.Array` (avoid inheriting eager-NumPy semantics and single-pipeline metadata). Exposes `shape`, `dtype`, `__getitem__`, optionally `__setitem__`. Read path:
   ```python
   def __getitem__(self, selection):
       out = np.empty(output_shape(selection), self.dtype)
       for region in self.manifest.regions_overlapping(selection):
           member = self.group[region.name]              # a real zarr.Array
           local_sel = region.to_local(selection)         # translate coords
           out[region.to_output(selection)] = member[local_sel]  # zarr decodes its own codecs
       return out
   ```
   zarr does all codec work per member; our code is pure coordinate bookkeeping + assembly.

3. **Icechunk as the coherence layer.** This is what makes the assembly a single coherent unit:
   - **Atomic multi-array commits**: manifest + all members in one repo; appending a new region with a *different* codec and updating the manifest is a single transaction — readers never see a half-updated logical array.
   - **Versioning / time-travel** of the whole assembly at once.
   - **Virtual chunk references**: members can point at pre-existing external data without copying.

   Open members via the normal Zarr API on the Icechunk-backed store: `zarr.open_group(store=session.store)`. Icechunk sits *below* zarr; no direct integration beyond getting a store and committing.
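
The coordinate bookkeeping behind the read path could look roughly like the sketch below. It is a minimal illustration, not a settled design: it assumes selections are already normalized to half-open `(start, stop)` pairs per axis with step 1, and the names `Region`, `intersect`, and `plan_read` are placeholders (only `to_local` / `to_output` echo the names used above). It needs nothing beyond the standard library because it only computes *which* slices to copy.

```python
from __future__ import annotations

from dataclasses import dataclass

# One half-open (start, stop) pair per axis, in logical coordinates.
Bounds = tuple


@dataclass(frozen=True)
class Region:
    name: str       # name of the backing member array in the group
    bounds: Bounds  # extent of this region in logical coordinates

    def intersect(self, sel: Bounds) -> Bounds | None:
        """Overlap of this region with a logical selection, or None if disjoint."""
        hit = tuple(
            (max(a, lo), min(b, hi)) for (a, b), (lo, hi) in zip(self.bounds, sel)
        )
        return hit if all(lo < hi for lo, hi in hit) else None

    def to_local(self, hit: Bounds) -> tuple[slice, ...]:
        """Translate logical coordinates into the member array's own coordinates."""
        return tuple(
            slice(lo - a, hi - a) for (a, _), (lo, hi) in zip(self.bounds, hit)
        )


def to_output(sel: Bounds, hit: Bounds) -> tuple[slice, ...]:
    """Translate logical coordinates into positions in the output buffer."""
    return tuple(slice(lo - s, hi - s) for (s, _), (lo, hi) in zip(sel, hit))


def plan_read(regions, sel):
    """List (member name, local slices, output slices) for each overlapping region."""
    plan = []
    for region in regions:
        hit = region.intersect(sel)
        if hit is not None:
            plan.append((region.name, region.to_local(hit), to_output(sel, hit)))
    return plan


# The two regions from the manifest example, and a read that straddles both.
regions = [
    Region("part_a", ((0, 500), (0, 1000))),
    Region("part_b", ((500, 1000), (0, 1000))),
]
plan = plan_read(regions, ((450, 550), (0, 100)))
```

Each entry of `plan` maps directly onto one `out[out_sel] = member[local_sel]` copy in `__getitem__`, with zarr decoding each member through that member's own codecs.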

## Decisions that keep v1 minimal

- **Align region boundaries to the logical chunk grid** — so every logical chunk lives wholly in one member. Assembly becomes trivial slice-copies (no partial-chunk stitching across members), and each member is itself a valid sub-grid. Strongly worth the constraint for v1.
- **One dtype, vary only codecs**: every member decodes to the same dtype, so assembly is a plain copy. *Different dtypes* per region is strictly harder (it needs a casting policy to present one logical dtype), so keep it out of scope for v1.
- **Reject cross-region writes in v1** — allow per-region writes plus whole-new-region appends (the natural place to *choose* a codec for incoming data). Splitting a write across a region boundary is a later feature.
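
Two of these constraints reduce to small checks. The sketch below is one possible shape for them, assuming regions are stored as half-open `(start, stop)` bounds per axis keyed by member name; `is_grid_aligned` and `containing_region` are hypothetical names, not part of any existing API.

```python
def is_grid_aligned(bounds, chunks, shape):
    """True if every region edge falls on a chunk boundary (or on the array edge)."""
    return all(
        lo % c == 0 and (hi % c == 0 or hi == n)
        for (lo, hi), c, n in zip(bounds, chunks, shape)
    )


def containing_region(regions, sel):
    """Return the one member whose region fully contains `sel`.

    Raises ValueError when the selection crosses a region boundary,
    which is the case rejected for writes in v1.
    """
    for name, bounds in regions.items():
        if all(lo <= s and e <= hi for (lo, hi), (s, e) in zip(bounds, sel)):
            return name
    raise ValueError("write crosses a region boundary; not supported in v1")


shape, chunks = (1000, 1000), (100, 100)
regions = {
    "part_a": ((0, 500), (0, 1000)),
    "part_b": ((500, 1000), (0, 1000)),
}

aligned = all(is_grid_aligned(b, chunks, shape) for b in regions.values())
misaligned = is_grid_aligned(((0, 450), (0, 1000)), chunks, shape)
target = containing_region(regions, ((100, 200), (0, 50)))
```

A `__setitem__` built on this would call `containing_region` first and only then forward the write to the chosen member.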

## Known edges / deferred

- A selection spanning many members issues many reads with no coalescing. Fine for v1; range-coalescing / a query planner (cf. zarr-python lazy-indexing direction) is a later optimization.
- The manifest is library-private — nothing outside the library knows the logical array exists. Making it readable by other tools would mean returning to the spec conversation (a polymorphic shard index ZEP).

## Open questions to resolve in brainstorming

- Manifest schema: exact serialization, region representation (bounds vs. chunk-block coordinates), versioning of the manifest format itself.
- The read/write contract: materialization semantics, whether `__getitem__` is eager or lazy, how `__setitem__` validates region containment.
- Region algebra: the actual novel code — overlap resolution, coordinate translation (`to_local` / `to_output`), boundary validation against the chunk grid.
- Package boundaries / naming; dependency on zarr-python and icechunk versions.
- Test strategy: round-trip per-member codecs, cross-region reads, append-with-new-codec under Icechunk transactions.
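
As a starting point for the manifest-schema question, one option is a plain JSON document with an explicit format version. Everything in the sketch below is an assumption to be revisited in brainstorming: the field names, the `format_version` value, and the choice of bounds over chunk-block coordinates.

```python
import json

FORMAT_VERSION = 1  # hypothetical; would be bumped on breaking schema changes


def encode_manifest(shape, dtype, chunks, regions):
    """Build a JSON-serializable manifest; `regions` maps member name -> bounds."""
    return {
        "format_version": FORMAT_VERSION,
        "logical": {"shape": list(shape), "dtype": dtype, "chunks": list(chunks)},
        "regions": [
            {"member": name, "bounds": [list(axis) for axis in bounds]}
            for name, bounds in regions.items()
        ],
    }


def decode_manifest(doc):
    """Check the version, then restore the tuple-based representation."""
    version = doc.get("format_version")
    if version != FORMAT_VERSION:
        raise ValueError(f"unsupported manifest version: {version!r}")
    logical = doc["logical"]
    regions = {
        r["member"]: tuple(tuple(axis) for axis in r["bounds"])
        for r in doc["regions"]
    }
    return tuple(logical["shape"]), logical["dtype"], tuple(logical["chunks"]), regions


regions = {
    "part_a": ((0, 500), (0, 1000)),
    "part_b": ((500, 1000), (0, 1000)),
}
doc = encode_manifest((1000, 1000), "float32", (100, 100), regions)

# Round-trip through JSON text, as would happen when stored in group attributes.
restored = decode_manifest(json.loads(json.dumps(doc)))
```

Because the encoded form is plain JSON, it can live in the group's attributes alongside the members, and a round-trip test like the last line is a cheap first entry in the test suite.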

## Context

This idea came out of planning discussions in the `zarr-python-planning` repo (`d-v-b/zarr-python-planning`) — specifically the lazy-indexing and codecs proposals, which establish that per-chunk heterogeneous codecs in a single spec array require a ZEP. This library is the no-spec-change alternative.

