From 99d07472c2718edf7ffcee14b17fe7a4fa31d74b Mon Sep 17 00:00:00 2001 From: Davide Angelocola Date: Sat, 20 Jun 2026 14:22:51 +0200 Subject: [PATCH] docs: compact CLAUDE.md, remove redundancy MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Merge the duplicated testing/random-data sections and the split Javadoc rules, fold the per-side encoding/extension checklists and pluggability prose into tighter form, and trim restated prose. Every rule, command, and the perf code examples (arena allocation, hot-loop branch-split) are preserved. 429 → 231 lines. Co-Authored-By: Claude Opus 4.8 --- CLAUDE.md | 493 ++++++++++++++++-------------------------------------- 1 file changed, 148 insertions(+), 345 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 4d193d93..b2120776 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,428 +1,231 @@ # CLAUDE.md -This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. +Guidance for Claude Code working in this repository. ## What it is -Java 25 native implementation of the [Vortex](https://github.com/vortex-data/vortex) columnar file format. Uses FFM ( -`MemorySegment`/`Arena`) instead of JNI or `sun.misc.Unsafe`. +Java 25 native implementation of the [Vortex](https://github.com/vortex-data/vortex) columnar +file format. Uses FFM (`MemorySegment`/`Arena`) — never JNI or `sun.misc.Unsafe`. ## Module structure ``` core — DType, PType, VortexException, VortexFormat + generated fbs/proto - encoding: EncodingId, TimeUnit, PTypeIO - extension: ExtensionId + encoding: EncodingId, TimeUnit, PTypeIO extension: ExtensionId reader — VortexReader, VortexHttpReader, VortexHandle, ReadRegistry, ExtensionDecoder, - Chunk, ArrayStats, ScanOptions, RowFilter - reader.array: Array and all subtypes (decode outputs) - reader.decode: EncodingDecoder, DecodeContext, ArrayNode + all *EncodingDecoder impls - reader.extension: DateExtensionDecoder, TimeExtensionDecoder, - TimestampExtensionDecoder, UuidExtensionDecoder - (file-structure internals: Footer, Layout, SegmentSpec, Trailer, PostscriptParser, …) + Chunk, ArrayStats, ScanOptions, RowFilter; file internals (Footer, Layout, Trailer, + PostscriptParser, …) + reader.array — Array + all subtypes (decode outputs) + reader.decode — EncodingDecoder, DecodeContext, ArrayNode + *EncodingDecoder impls + reader.extension — Date/Time/Timestamp/Uuid ExtensionDecoder writer — VortexWriter, WriteRegistry, WriteOptions, ExtensionEncoder - writer.encode: EncodingEncoder, EncodeContext, NullableData + all *EncodingEncoder impls, - extension encoders (DateExtensionEncoder, TimeExtensionEncoder, - TimestampExtensionEncoder, UuidExtensionEncoder) + writer.encode — EncodingEncoder, EncodeContext, NullableData + *EncodingEncoder impls, + extension encoders ``` -Dependency rule: `writer → core`, `reader → core`. Writer never depends on reader. -`Array` and its subtypes are decode outputs — they live in `reader.array`, not `core`. +Dependency rule: `writer → core`, `reader → core`. **Writer never depends on reader.** +`Array` and subtypes are decode outputs — they live in `reader.array`, not `core`. -## Branching strategy +## Branching -Trunk-based development. PRs are fine but always squash or rebase — no merge commits. -Keep commits small and `main` always green. +Trunk-based. PRs fine but always squash or rebase — no merge commits. Keep commits small, +`main` always green. ## Commands -Never use `mvn install` or `./mvwn install`. - -Generated sources (`fbs`/`proto` → Java) are committed under `core/src/main/java`. -Normal builds need no external tools. - -Proto-to-Java generation is in-process via the `proto-gen` module (no `protoc` needed). -The generator emits one record per message with a {@code decode(MemorySegment, long, long)} static -factory and an {@code encode()} method that operate directly on a memory segment — no `byte[]` -copy, no `protobuf-java` runtime, no `sun.misc.Unsafe`. - -To regenerate after editing `.fbs` or `.proto` schemas: +**Never `mvn install` / `./mvnw install`.** Normal builds need no external tools; generated +`fbs`/`proto` sources are committed under `core/src/main/java`. ```bash -brew install flatbuffers # only needed for .fbs edits -./mvnw compile -pl proto-gen # build the proto generator (only on .proto edits) -./mvnw generate-sources -pl core -P regenerate-sources -# then commit the updated files +./mvnw verify # build all +./mvnw verify -DskipTests # build, no tests +./mvnw test # unit only (excludes *IntegrationTest) +./mvnw test -pl reader # one module +./mvnw test -pl reader -Dtest=MyTest # one class +./mvnw test -pl reader -Dtest=MyTest#m # one method +./mvnw verify -pl integration -am # integration (failsafe, NOT surefire) +./mvnw verify -pl integration -am -Dit.test="RustWritesJavaReadsIntegrationTest#method" +./bench RustVsJavaReadBenchmark.javaReadVolume # benchmark — always ClassName.methodName filter ``` -Any `flatc` version works — the profile strips the version guard automatically. -`flatc` runs every time the profile is active; if you only changed `.proto` files, revert any -spurious `fbs/` diffs with `git checkout -- core/src/main/java/io/github/dfa1/vortex/fbs/`. +Regenerate after editing `.fbs`/`.proto`: ```bash -# Build all modules -./mvnw verify - -# Build without tests -./mvnw verify -DskipTests - -# Run all tests (unit only — excludes *IntegrationTest) -./mvnw test - -# Run tests in one module -./mvnw test -pl reader - -# Run a single test class -./mvnw test -pl reader -Dtest=MyTest - -# Run a single test method -./mvnw test -pl reader -Dtest=MyTest#myMethod - -# Run integration tests (use verify + failsafe, NOT test + surefire) -./mvnw verify -pl integration -am - -# Run a single integration test class -./mvnw verify -pl integration -am -Dit.test=RustWritesJavaReadsIntegrationTest - -# Run a single integration test method -./mvnw verify -pl integration -am -Dit.test="RustWritesJavaReadsIntegrationTest#s3_pcoVortex_javaDecodeMatchesJni" - -# Run all benchmarks -./bench - -# Run specific benchmark class or method (always use ClassName.methodName filter) -./bench RustVsJavaReadBenchmark -./bench RustVsJavaReadBenchmark.javaReadVolume +brew install flatbuffers # only for .fbs edits (any flatc version; guard auto-stripped) +./mvnw compile -pl proto-gen # only on .proto edits +./mvnw generate-sources -pl core -P regenerate-sources # then commit ``` -### Javadoc - -Every public method must have complete Javadoc. The build enforces this via -`failOnError=true` + `failOnWarnings=true` in the `maven-javadoc-plugin`. - -Rules: - -- Every public method needs a main description, `@param` for each parameter, and `@return` (unless `void`). -- Every public record needs `@param` entries on the class-level doc (one per component). -- Cross-references use `[ClassName#method(ParamType)]` — verify the target exists before writing it. Wrong references - are **errors**, not warnings. -- `@see`-only Javadoc counts as "no main description" — always add a prose sentence. - -**Check:** `./mvnw javadoc:javadoc -pl core` — must produce zero output. +`flatc` runs whenever the profile is active; if you only changed `.proto`, revert spurious +`fbs/` diffs: `git checkout -- core/src/main/java/io/github/dfa1/vortex/fbs/`. Proto-to-Java +is in-process via `proto-gen` (no `protoc`/`protobuf-java`): one record per message with static +`decode(MemorySegment, long, long)` + `encode()` operating directly on a segment. ### Releasing ```bash ./mvnw --batch-mode release:clean release:prepare \ - -DreleaseVersion= \ - -DdevelopmentVersion=-SNAPSHOT -git push && git push --tags -``` - -GitHub Actions picks up the tag and deploys to Maven Central. - -### File format - -8-byte trailer at EOF: `version(u16 LE) | postscriptLen(u16 LE) | magic(VTXF)`. The postscript is a FlatBuffer blob -immediately before the trailer; it points (offset+length) to the Footer (FlatBuffer), DType (Protobuf), and Layout ( -FlatBuffer) blobs elsewhere in the file. - -### Layout tree - -``` -Struct → Zoned(Stats) → Chunked → [Flat, Flat, ...] + -DreleaseVersion= -DdevelopmentVersion=-SNAPSHOT +git push && git push --tags # GitHub Actions deploys the tag to Maven Central ``` -- **Flat**: single encoded segment on disk -- **Chunked**: sequence of Flat children -- **Zoned** (`vortex.stats`): wraps a child with per-chunk min/max statistics used for zone-map pruning -- **Struct**: one child per column +## File format -Encoding IDs are strings (e.g. `"vortex.flat"`, `"fastlanes.bitpacked"`). `ReadRegistry` maps IDs → `EncodingDecoder` -impls via `ServiceLoader`. The registry is immutable after construction — register custom decoders on the builder: -`ReadRegistry.builder().registerServiceLoaded().register(myDecoder).build()`. +8-byte trailer at EOF: `version(u16 LE) | postscriptLen(u16 LE) | magic(VTXF)`. The postscript +(FlatBuffer, immediately before the trailer) points (offset+length) to the Footer (FlatBuffer), +DType (Protobuf), and Layout (FlatBuffer) blobs elsewhere in the file. -**Adding a new encoding:** touch-points depend on read vs write: +Layout tree: `Struct → Zoned(Stats) → Chunked → [Flat, Flat, ...]` +- **Flat** single encoded segment · **Chunked** sequence of Flats · **Struct** one child/column +- **Zoned** (`vortex.stats`) wraps a child with per-chunk min/max for zone-map pruning -- **Decode side** (reader): - 1. `EncodingId.java` — add enum constant `VORTEX_FOO("vortex.foo")` - 2. `FooEncodingDecoder.java` — implement `EncodingDecoder` in `reader.decode` - 3. `reader/src/main/resources/META-INF/services/io.github.dfa1.vortex.reader.decode.EncodingDecoder` — add the FQN +Encoding IDs are strings (`"vortex.flat"`, `"fastlanes.bitpacked"`). `ReadRegistry` maps IDs → +`EncodingDecoder` via `ServiceLoader`; immutable after construction — register custom decoders on +the builder: `ReadRegistry.builder().registerServiceLoaded().register(myDecoder).build()`. -- **Encode side** (writer): - 1. `EncodingId.java` — add enum constant (if not already added above) - 2. `FooEncodingEncoder.java` — implement `EncodingEncoder` in `writer.encode` - 3. `writer/src/main/resources/META-INF/services/io.github.dfa1.vortex.writer.encode.EncodingEncoder` — add the FQN +### Adding an encoding -**Adding a new extension type:** split by side — implement `ExtensionDecoder` (reader), `ExtensionEncoder` (writer), or both: +Add `EncodingId` enum constant `VORTEX_FOO("vortex.foo")`, then per side: +- **Decode:** `FooEncodingDecoder implements EncodingDecoder` in `reader.decode` + FQN in + `reader/.../META-INF/services/io.github.dfa1.vortex.reader.decode.EncodingDecoder` +- **Encode:** `FooEncodingEncoder implements EncodingEncoder` in `writer.encode` + FQN in + `writer/.../META-INF/services/io.github.dfa1.vortex.writer.encode.EncodingEncoder` -- **Decode side** (reader): - 1. `ExtensionId.java` — add enum constant `VORTEX_FOO("vortex.foo")` - 2. `FooExtensionDecoder.java` — implement `ExtensionDecoder` in `reader.extension` - 3. Register via `ReadRegistry.builder().register(new FooExtensionDecoder())` — no service file; `registerServiceLoaded()` does not discover `ExtensionDecoder`. +### Adding an extension type -- **Encode side** (writer): - 1. `ExtensionId.java` — add enum constant (if not already added above) - 2. `FooExtensionEncoder.java` — implement `ExtensionEncoder` in `writer.encode` (or `writer`) - 3. `writer/src/main/resources/META-INF/services/io.github.dfa1.vortex.writer.ExtensionEncoder` — add the FQN +Add `ExtensionId` constant, then per side: +- **Decode:** `FooExtensionDecoder implements ExtensionDecoder` in `reader.extension`; register via + `ReadRegistry.builder().register(new FooExtensionDecoder())` — **no service file** + (`registerServiceLoaded()` does not discover `ExtensionDecoder`). +- **Encode:** `FooExtensionEncoder implements ExtensionEncoder` in `writer` + FQN in + `writer/.../META-INF/services/io.github.dfa1.vortex.writer.ExtensionEncoder` -### Memory model +## Memory model -`VortexReader` memory-maps the entire file into one `MemorySegment` (confined `Arena`). All `Array` buffers returned -during scan are zero-copy slices of that segment — their lifetime is tied to the `VortexReader`. Close the file to release -the mapped region. +`VortexReader` memory-maps the whole file into one confined-`Arena` `MemorySegment`. All `Array` +buffers returned during scan are zero-copy slices of it — lifetime tied to the reader; close to +release. -**Encoding output allocation rule:** never allocate `byte[]` + wrap with `MemorySegment.ofArray()` for decode output. -Always allocate from `ctx.arena()`: +**Allocation rule — never `new byte[]` + `MemorySegment.ofArray()` for decode output.** Always +`ctx.arena().allocate(...)` (off-heap, zero GC, scan-chunk lifetime). If a private helper lacks +`DecodeContext`, pass an `Arena arena` param from the `decode()` call site. ```java -// WRONG — heap allocation, GC pressure, extra copy -byte[] outBytes = new byte[(int) (n * elemBytes)]; -MemorySegment out = MemorySegment.ofArray(outBytes); - -// CORRECT — off-heap, zero GC, same lifetime as the scan chunk +// WRONG: heap alloc, GC pressure, extra copy +MemorySegment out = MemorySegment.ofArray(new byte[(int) (n * elemBytes)]); +// CORRECT MemorySegment out = ctx.arena().allocate(n * elemBytes); ``` -If the allocation is in a private static helper that doesn't receive `DecodeContext`, add an `Arena arena` parameter -and pass `ctx.arena()` from the `decode()` call site. - -**Hot-loop rule — never put modulo, division, or branch-with-variable-target on every element.** -A single `i % cap` per row blocks JIT auto-vectorization (C2 superword pass refuses to vectorize loops containing -`Op_ModL`/`Op_DivL`; no SIMD ISA has an integer-divide opcode). Even loop-invariant `cap` doesn't save you — -strength-reduction (Barrett/Granlund-Montgomery) requires a *compile-time* constant divisor. Per-element scalar -modulo also costs 20–40 cycles on Apple silicon vs ~1 cycle for surrounding loads. Net: one modulo in the body of -a 1M-row inner loop has cost 5–10× regressions in this codebase already (commits `ed658b7` → `051a794` → `442021f`). - -If you need broadcast/clamp/masking semantics for the rare path, **branch-split**: hoist the cap once outside the -loop, gate two specialized loop bodies on the cheap check. +**Hot-loop rule — no modulo/division/variable-target branch per element.** A single `i % cap` per +row blocks JIT auto-vectorization (C2 superword refuses `Op_ModL`/`Op_DivL`; no SIMD integer-divide +opcode) — and loop-invariant `cap` doesn't help (strength-reduction needs a *compile-time* constant +divisor). Scalar modulo is also 20–40 cycles vs ~1 for a load on Apple silicon. One modulo in a 1M-row +body has caused 5–10× regressions here (`ed658b7`→`051a794`→`442021f`). Same for bounds/validity-bit +checks and sign-extension switches — anything making the body non-uniform. For broadcast/clamp/mask, +**branch-split**: hoist the check once, gate two specialized loop bodies. ```java -// WRONG — modulo per element kills vectorization and adds 20+ cycles/iter long cap = SegmentBroadcast.capacity(src, 8); -for (long i = 0; i < n; i++) { - out.setAtIndex(LE_LONG, i, src.getAtIndex(LE_LONG, i % cap)); -} - -// CORRECT — fast path has zero modulos; slow path only fires for ConstantEncoding broadcast -long cap = SegmentBroadcast.capacity(src, 8); -if (cap == n) { - for (long i = 0; i < n; i++) { - out.setAtIndex(LE_LONG, i, src.getAtIndex(LE_LONG, i)); - } -} else { - for (long i = 0; i < n; i++) { - out.setAtIndex(LE_LONG, i, src.getAtIndex(LE_LONG, i % cap)); - } +if (cap == n) { // fast path: zero modulos, vectorizes + for (long i = 0; i < n; i++) { out.setAtIndex(LE_LONG, i, src.getAtIndex(LE_LONG, i)); } +} else { // slow path: only ConstantEncoding broadcast + for (long i = 0; i < n; i++) { out.setAtIndex(LE_LONG, i, src.getAtIndex(LE_LONG, i % cap)); } } ``` -The same rule applies to bounds checks, validity-bit checks, sign-extension switches — anything that -makes the body of a multi-million-row loop non-uniform. Profile with JFR (`-prof stack:lines=10`); if you -see `idiv` / `sdiv` / arithmetic helpers showing up as the hot frame, it's almost always this pattern. +Profile with JFR (`-prof stack:lines=10`); `idiv`/`sdiv`/arithmetic helpers as the hot frame is +almost always this. ## Reference implementation -When stuck on encoding/decoding behavior, consult the Rust reference implementation at -`https://github.com/spiraldb/vortex` (via `gh api repos/spiraldb/vortex/contents/`). - -Key paths: +When stuck on encode/decode behavior, read the Rust reference at +`https://github.com/spiraldb/vortex` (via `gh api repos/spiraldb/vortex/contents/`): +`encodings/fastlanes/src/{bitpacking,for}/`, `encodings/sparse/src/`, `encodings/alp/src/alp/`, and +`https://github.com/spiraldb/fastlanes-rs` (`src/bitpacking.rs`, `src/macros.rs`). -- `encodings/fastlanes/src/bitpacking/` — `fastlanes.bitpacked` wire format and algorithm -- `encodings/fastlanes/src/for/` — `fastlanes.for` (frame-of-reference) -- `encodings/sparse/src/` — `vortex.sparse` -- `encodings/alp/src/alp/` — `vortex.alp` -- `https://github.com/spiraldb/fastlanes-rs` — FastLanes bit-packing algorithm (`src/bitpacking.rs`, `src/macros.rs`) +**Never reverse-engineer wire formats by probing bytes.** Read the vtable `serialize`/`deserialize` +in the Rust source for the exact schema, then implement from spec. -Never reverse-engineer wire formats by probing byte patterns. Read the vtable `serialize`/`deserialize` -methods in the Rust source to get the exact protobuf schema, then implement from spec. +## Design decisions -## POM dependency ordering +- **DType is pluggable only via `Extension`.** `DType` is a sealed interface; downstream code must + not add variants. Use `new DType.Extension("ip.address", new DType.Primitive(PType.I32, false), + null, false)` and register decoders/encoders on the registries (or `ServiceLoader`). + Mirrors Rust (`vortex.date`, `vortex.uuid`, …). No SPI for DType variants planned. +- **Layout is a fixed set, no SPI.** `ScanIterator.decodeLayout()` dispatches the known IDs + (flat/chunked/zoned/struct/dict) and throws otherwise. Keep the fixed set; revisit only for a + concrete downstream case unaddressable by a different flat-segment encoding. +- **Small public APIs.** Don't expose internals — when in doubt, leave it out or make it private. +- **POM deps** grouped with comments: `` then ``, each with + project-internal (`io.github.dfa1.vortex:*`) deps first, then external. Omit empty sections. -In every module `pom.xml`, dependencies are grouped with comments: - -```xml - - - - - -``` - -Omit a section if empty (e.g. integration module has no production deps; performance has no test deps). - -## Pluggability decisions - -### DType — already pluggable via Extension - -`DType` is a sealed interface. The sealed variants (Primitive, Utf8, Struct, …) are -file-format primitives; downstream consumers **must not** add new variants. Instead use -`DType.Extension`: - -```java -// custom "ip.address" logical type over 4-byte I32 storage -DType.Extension ipDtype = new DType.Extension("ip.address", - new DType.Primitive(PType.I32, false), null, false); -``` - -Register decoders via `ReadRegistry.builder().register(myDecoder)` and encoders via -`ServiceLoader` or `WriteRegistry.builder().register(myEncoder)`. -This mirrors Rust exactly — `vortex.date`, `vortex.uuid`, etc. are all Extension types -in Rust too. **No SPI for DType variants is planned.** - -### Layout — fixed set, no SPI - -`Layout` is a plain record with a `String encodingId`. `ScanIterator.decodeLayout()` -dispatches on the four known IDs (flat/chunked/zoned/struct/dict) and throws for -anything else. Custom layouts would require reader changes to traverse the tree, -decode metadata, and handle child counts — no upstream Vortex format has shipped a -custom layout that requires Java support. +## Code style -Decision: **keep the fixed set**. Revisit only when a concrete downstream use case -arrives that cannot be addressed by choosing a different encoding for a flat segment. +- 4-space indent, **zero SonarQube bugs/smells**, no `sun.misc.Unsafe` or internal JDK APIs. +- Prefer explicit over clever; fail fast on unhandled cases. +- Idiomatic modern Java: reuse the JDK (override `Iterator.forEachRemaining`, don't invent + `forEachChunk`; use `Optional`, records, sealed types, pattern switches, virtual threads, FFM). + New APIs should feel like JDK APIs. +- Always braces for `if`/`else`/`for`/`while`, even one-liners (`if (c) { return a; }`). +- **Time quantities use `java.time.Duration`, never `long`** (no `long timeoutMs`/`delayNanos`). + Exception: low-level JDK interop taking `long ns` (`Thread.sleep`, `LockSupport.parkNanos`, + `System.nanoTime` math) — convert at the call site via `duration.toNanos()`/`toMillis()`. -## API design +### Javadoc (build-enforced: `failOnError` + `failOnWarnings`) -- Keep public interfaces as small as possible. -- Don't expose internals. When in doubt, leave it out or make it private. +- Every public method: main prose description, `@param` per parameter, `@return` (unless `void`). + Every public record: `@param` per component on the class doc. `@see`-only counts as no description. +- All `///` Markdown — **no HTML** (checkstyle `RegexpSingleline` blocks `

`,`

    `,`
  • `, + ``,`
    `,``, …). Use blank `///` for paragraphs, `- ` lists, ` ```java ``` `,
    +  `**bold**`. Cross-refs `[ClassName#method(ParamType)]` — verify the target exists (wrong refs are
    +  **errors**).
    +- Check: `./mvnw javadoc:javadoc -pl core` must produce zero output.
     
    -## Code style
    +### Encoding class structure
     
    -- indents are 4 spaces, enforced by checkstyle
    -- Zero SonarQube bugs/smells policy.
    -- **Javadoc style:** all `///` Markdown Javadoc — no HTML tags. Use Markdown syntax:
    -  blank `///` line for paragraph break, `- item` for lists, ` ```java ``` ` for code
    -  blocks, `**text**` for bold. Checkstyle enforces this (`RegexpSingleline` rule in
    -  `checkstyle.xml` blocks `

    `, `

      `, `
    • `, ``, `
      `, `
    `, etc.). -- No `sun.misc.Unsafe` or internal JDK APIs. -- Prefer explicit over clever. Fail fast on unhandled cases. -- Always prefer idiomatic modern Java. Reuse the standard library and language - features the JDK already provides — e.g. override `Iterator.forEachRemaining` - instead of inventing a parallel `forEachChunk`; use `Optional`, records, - sealed types, pattern switches, virtual threads, FFM — over hand-rolled - equivalents. New APIs should look and feel like JDK APIs Java developers - already know. -- Always use braces for `if`/`else`/`for`/`while` bodies, even single-liners: - ```java - // WRONG - if (cond) return a; - - // CORRECT - if (cond) { - return a; - } - ``` -- **Time quantities use `java.time.Duration`, not `long`.** Any parameter, field, or - return type representing a time amount (timeout, delay, interval, period, deadline-offset) - must be `Duration`. No `long timeoutMs` / `long delayNanos` / `long intervalSeconds` — - the unit lives in the type, not the variable name. - ```java - // WRONG - Optional readKey(long timeoutMs) throws IOException; - - // CORRECT - Optional readKey(Duration timeout) throws IOException; - ``` - Exception: low-level interop with JDK APIs that take `long ns` (e.g. `Thread.sleep`, - `LockSupport.parkNanos`, `System.nanoTime` arithmetic) — convert at the call site - via `duration.toNanos()` / `toMillis()`. - -## Encoding class structure - -Encoding classes with non-trivial encode **and** decode paths must separate them into -private static inner classes named `Encoder` and `Decoder`. Shared low-level helpers -(buffer math, proto serialization) live in the side that owns them or in a third inner -class if genuinely shared by both. +Encodings with non-trivial encode **and** decode separate them into private static inner classes +`Encoder` and `Decoder` (shared low-level helpers live with their owner or a third inner class): ```java public final class FooEncoding implements Encoding { - - @Override - public EncodeResult encode(DType dtype, Object data) { - return Encoder.encode(dtype, data); - } - - @Override - public Array decode(DecodeContext ctx) { - return Decoder.decode(ctx); - } - - private static final class Encoder { - static EncodeResult encode(DType dtype, Object data) { ...} - } - - private static final class Decoder { - static Array decode(DecodeContext ctx) { ...} - } + @Override public EncodeResult encode(DType dtype, Object data) { return Encoder.encode(dtype, data); } + @Override public Array decode(DecodeContext ctx) { return Decoder.decode(ctx); } + private static final class Encoder { static EncodeResult encode(DType dtype, Object data) { ... } } + private static final class Decoder { static Array decode(DecodeContext ctx) { ... } } } ``` -Simple encodings (≤ ~80 lines total, e.g. `NullEncoding`, `BoolEncoding`) are exempt. +Simple encodings (≤ ~80 lines, e.g. `NullEncoding`, `BoolEncoding`) are exempt. -### Metadata-only encodings - -Some encodings store all data in proto3 metadata — no buffers, no children (e.g. `SequenceEncoding`). -Their `EncodeResult` uses an `EncodeNode` with `metadata` set and an empty `bufferIndices` array: - -```java -ByteBuffer metaBuf = ByteBuffer.wrap(meta.encode()); -EncodeNode node = new EncodeNode(encodingId, metaBuf, new EncodeNode[0], new int[]{}); -return new EncodeResult(node, List.of(), null, null); -``` - -The decoder reads back via `ctx.metadata()`, not `ctx.buffer(n)`: +**Metadata-only encodings** (all data in proto3 metadata, no buffers/children, e.g. `SequenceEncoding`): +`EncodeResult` uses an `EncodeNode` with `metadata` set and empty `bufferIndices`; the decoder reads +`ctx.metadata()` (not `ctx.buffer(n)`): ```java +EncodeNode node = new EncodeNode(encodingId, ByteBuffer.wrap(meta.encode()), new EncodeNode[0], new int[]{}); +// decode: MemorySegment metaSeg = MemorySegment.ofBuffer(ctx.metadata().duplicate()); FooMetadata meta = FooMetadata.decode(metaSeg, 0, metaSeg.byteSize()); ``` -Generated proto records live in `io.github.dfa1.vortex.proto`. The runtime decoder -(`ProtoReader`, `ProtoWriter`) is package-private — generated code calls it directly. -For oneof messages (e.g. `ScalarValue`), prefer the static `ofXxxValue(v)` factory over -the 11-arg constructor. +Generated proto records live in `io.github.dfa1.vortex.proto`; the runtime (`ProtoReader`, +`ProtoWriter`) is package-private. For oneof messages (e.g. `ScalarValue`) prefer the static +`ofXxxValue(v)` factory over the multi-arg constructor. ## Testing -- Every feature needs unit tests covering: happy path, negative cases (invalid input, error conditions), and corner - cases (empty, zero, max values, boundary conditions). -- Unit tests must be fast — no file I/O, no network, no sleep. Mock or use in-memory data. -- Integration tests are critical: there is no formal spec, so interoperability with the Rust reference implementation is - the ground truth. Write integration tests for every encoding round-trip and file format boundary. -- JUnit 5 + Mockito (BDDMockito) + AssertJ. -- Every test has `// Given` / `// When` / `// Then` sections. -- Class under test is always named `sut`. -- Use `BDDMockito` exclusively: `given(mock.method()).willReturn(value)`. Never the reverse form. Only static-import - `given` and `then` — not `willReturn`/`willThrow`. -- Prefer `@ParameterizedTest` over copy-pasting tests. Use `@ValueSource` when possible; `@ArgumentsSource` when more - structure needed (test case must have a name). -- Use `@ParameterizedTest` with seeded random generators for encoding/decoding logic where input space is large — they - find corner cases that example tests miss. -- Acceptance tests run the built jar end-to-end with hosh scripts. -- Use `@Nested` to group related tests by scenario or feature within a test class: - ```java - class FooEncodingTest { - @Nested class Encode { @Test void roundTrips() { ... } } - @Nested class Decode { @Test void handlesEmpty() { ... } } - } - ``` - `@BeforeEach` inside a `@Nested` class applies only to that group. Private helpers go - at the end of the class they serve, after all `@Test` methods. - -## Random-data parameterized tests - -Use `@ParameterizedTest` + `@MethodSource` for random-input coverage. Put generators in `RandomArrays` (integration -module) -or a similar utility class. Static provider methods in the test class delegate to the generator: - -```java -static Stream i64ArrayProvider() { - return RandomArrays.i64Arrays(30); -} - -@ParameterizedTest -@MethodSource("i64ArrayProvider") -void roundTrips(long[] data) { ...} -``` - -Keep counts low (10–30) for integration tests that involve file I/O or JNI. +- Cover happy path, negative cases (invalid input / errors), and corners (empty, zero, max, + boundaries). Unit tests must be fast — no file I/O, network, or sleep; mock or use in-memory data. +- **Integration tests are ground truth** (no formal spec): interop with the Rust reference. Write + one for every encoding round-trip and file-format boundary. +- JUnit 5 + Mockito (BDDMockito) + AssertJ. Class under test named `sut`. Every test has + `// Given` / `// When` / `// Then`. BDDMockito only: `given(mock.m()).willReturn(v)` / + `then(...)` (static-import only `given`/`then`, never `willReturn`/`willThrow`). +- Prefer `@ParameterizedTest` over copy-paste (`@ValueSource`, else `@ArgumentsSource`/named cases). + For large input spaces use seeded-random `@MethodSource` generators — they find corners examples + miss. Put generators in `RandomArrays` (integration) or a similar util; keep counts low (10–30) + when the test does file I/O or JNI. +- `@Nested` groups related scenarios (`@BeforeEach` in a nested class applies only to it). Private + helpers go after all `@Test` methods. +- Acceptance tests run the built jar end-to-end via hosh scripts.