From f8e743fe1f0263afd0dc90c7ddb3b22783387b6b Mon Sep 17 00:00:00 2001 From: Davide Angelocola Date: Sat, 20 Jun 2026 13:39:28 +0200 Subject: [PATCH 1/3] =?UTF-8?q?docs:=20sync=20state=20=E2=80=94=20patched?= =?UTF-8?q?=20encode,=20ADR=20statuses,=200.8.0=20version?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Review-driven doc cleanup against current code: - compatibility.md: vortex.patched Encode ✅ (PatchedEncodingEncoder is implemented + SPI-registered; "encode not yet implemented" was stale). - README.md + compatibility.md: dependency snippet 0.7.3 → 0.8.0 (released). - ADR-0003: status now reflects Phase E (bounds typing via IoBounds) shipped, Phases A–D (message sanitization) pending; ADR.md index title widened to "sanitization + bounds typing". - ADR-0007/0010/0012/0014: normalize "Implemented" → "Completed" to match the documented status vocabulary (Proposed → Accepted → Completed | …). - TODO.md: narrow the asSlice line (IoBounds.slice shipped; only the Checkstyle rule + final sweep remain) and the sanitization line (Phases A–D remain); add a Compute section tracking ADR-0013 (was untracked). Co-Authored-By: Claude Opus 4.8 --- README.md | 2 +- TODO.md | 20 +++++++++++++------ .../adr/0003-vortex-exception-sanitization.md | 2 +- docs/adr/0007-pco-encode.md | 2 +- docs/adr/0010-lazy-decode.md | 2 +- docs/adr/0012-zero-copy-layout-decoding.md | 2 +- docs/adr/0014-variant-encoding-strategy.md | 2 +- docs/adr/ADR.md | 2 +- docs/compatibility.md | 6 +++--- 9 files changed, 24 insertions(+), 16 deletions(-) diff --git a/README.md b/README.md index 0cd2d323..853168ee 100644 --- a/README.md +++ b/README.md @@ -61,7 +61,7 @@ line-by-line diff is zero. io.github.dfa1.vortex vortex-reader - 0.7.3 + 0.8.0 ``` diff --git a/TODO.md b/TODO.md index f3094566..0e30d8c6 100644 --- a/TODO.md +++ b/TODO.md @@ -22,11 +22,11 @@ parser exception. Each entry below is either a known gap, a contract audit, or s ### Parser hardening -- [ ] **Audit every `MemorySegment.asSlice` call site for bounds wrapping** — - `grep -rn 'asSlice' core/src/main reader/src/main`. Each call on untrusted offsets/lengths - must throw `VortexException` rather than the JDK's `IndexOutOfBoundsException`. Either wrap - per call site, or route through an `IoBounds.slice(seg, off, len)` helper and add a - Checkstyle rule rejecting raw `asSlice` in `io`/`scan`/`encoding` packages. +- [ ] **Checkstyle rule rejecting raw `MemorySegment.asSlice` on untrusted offsets** — + the `IoBounds.slice(seg, off, len)` helper shipped (ADR-0003 Phase E) and the untrusted + file-structure + decode call sites already route through it. Remaining: add the Checkstyle + rule blocking raw `asSlice` in the reader file-structure / decode packages, then a final + `grep -rn 'asSlice' core/src/main reader/src/main` sweep to catch any site the migration missed. ### Per-encoding adversarial tests @@ -85,7 +85,9 @@ Per-encoding gotchas: ## API - [ ] **Error messages — structural sanitization of `VortexException`** — - see [ADR-0003](docs/adr/0003-vortex-exception-sanitization.md) for design and phasing. + Phase E (bounds typing via `IoBounds`) shipped; remaining is Phases A–D (the `Sanitize` + helper + `VortexError` catalog). See [ADR-0003](docs/adr/0003-vortex-exception-sanitization.md) + for design and phasing. - [ ] Use domain primitives (`UInt32`, `UInt64`, etc.) as value classes via Project Valhalla instead of raw `long`/`int` - See [ADR-0008](docs/adr/0008-domain-primitives-unsigned-integers.md) and https://dfa1.github.io/articles/rethink-domain-primitives-with-valhalla - Candidates: `PType` integer kinds, buffer offsets, row indices, byte lengths @@ -96,6 +98,12 @@ Per-encoding gotchas: - Validate, fail fast (no silent clamp): consumer-facing `slice` throws `IllegalArgumentException` on `offset < 0 || length < 0 || offset + length > length()`; scan-internal `Offset*` construction fed by untrusted layout metadata throws `VortexException` - Add the matching compact-constructor bounds guard to all `Offset*` records (currently document `inner length >= offset + length` but enforce nothing — broadcast inner silently wraps via `i % cap`) +## Compute + +- [ ] **Compute primitives — masks, kernels, no-materialise** — pushdown filter/compare/aggregate + kernels operating on Lazy arrays without materialising. See [ADR-0013](docs/adr/0013-compute-primitives.md) + (Proposed). Gate: a concrete downstream consumer (e.g. the vortex-arrow bridge or filter pushdown). + ## Encodings See [docs/compatibility.md](docs/compatibility.md) for the full encoding support table and S3 fixture status. diff --git a/docs/adr/0003-vortex-exception-sanitization.md b/docs/adr/0003-vortex-exception-sanitization.md index 543f9d8e..20d6bbf5 100644 --- a/docs/adr/0003-vortex-exception-sanitization.md +++ b/docs/adr/0003-vortex-exception-sanitization.md @@ -1,6 +1,6 @@ # ADR 0003: `VortexException` contract — message sanitization and bounds typing -- **Status:** Accepted — implementation pending (see Phases below) +- **Status:** Accepted — Phase E (bounds typing via `IoBounds`) done; Phases A–D (message sanitization) pending - **Date:** 2026-06-13 (bounds-typing scope added 2026-06-20) - **Deciders:** project maintainer - **Related:** [ADR 0001 — Split read and write runtimes](0001-split-read-and-write-runtimes.md), diff --git a/docs/adr/0007-pco-encode.md b/docs/adr/0007-pco-encode.md index 28115ff1..a9c5f293 100644 --- a/docs/adr/0007-pco-encode.md +++ b/docs/adr/0007-pco-encode.md @@ -1,6 +1,6 @@ # ADR 0007: Pure-Java `vortex.pco` encoder -- **Status:** Implemented (E1-E5, E7-E9 done; E6 partial: IntMult only; FloatMult/FloatQuant deferred as marginal vs. existing Classic+ALP cascade) +- **Status:** Completed (E1-E5, E7-E9 done; E6 partial: IntMult only; FloatMult/FloatQuant deferred as marginal vs. existing Classic+ALP cascade) - **Date:** 2026-06-13 - **Deciders:** project maintainer - **Supersedes:** — diff --git a/docs/adr/0010-lazy-decode.md b/docs/adr/0010-lazy-decode.md index 5fd37287..e8d2a4c1 100644 --- a/docs/adr/0010-lazy-decode.md +++ b/docs/adr/0010-lazy-decode.md @@ -1,6 +1,6 @@ # ADR 0010: Lazy decode for 1:1 transform encodings -- **Status:** Implemented +- **Status:** Completed - **Date:** 2026-06-13 - **Implemented:** 2026-06-14 — 2026-06-15 - **Deciders:** project maintainer diff --git a/docs/adr/0012-zero-copy-layout-decoding.md b/docs/adr/0012-zero-copy-layout-decoding.md index 814c1b19..8d86c2a7 100644 --- a/docs/adr/0012-zero-copy-layout-decoding.md +++ b/docs/adr/0012-zero-copy-layout-decoding.md @@ -1,6 +1,6 @@ # ADR 0012: Zero-copy layout decoding — lazy Chunked and Dict arrays -- **Status:** Implemented +- **Status:** Completed - **Date:** 2026-06-14 - **Implemented:** 2026-06-15 (PRs #38, #39, #42) - **Deciders:** project maintainer diff --git a/docs/adr/0014-variant-encoding-strategy.md b/docs/adr/0014-variant-encoding-strategy.md index a07d46ae..c41a29f1 100644 --- a/docs/adr/0014-variant-encoding-strategy.md +++ b/docs/adr/0014-variant-encoding-strategy.md @@ -1,6 +1,6 @@ # ADR 0014: Variant encoding strategy — chunked constants now, parquet.variant later -- **Status:** Implemented +- **Status:** Completed - **Date:** 2026-06-18 - **Deciders:** project maintainer - **Supersedes:** — diff --git a/docs/adr/ADR.md b/docs/adr/ADR.md index 7bb1598a..8bfa7b41 100644 --- a/docs/adr/ADR.md +++ b/docs/adr/ADR.md @@ -15,7 +15,7 @@ Each ADR is a Markdown file named `NNNN-short-title.md`. Use `template.md` as th |------|-----------------------------------------------|-----------|----------| | 0001 | Split read and write runtimes out of core | Completed | 0.7.0 | | 0002 | Pluggable DType, Layout, and Compute | Deferred | | -| 0003 | VortexException message sanitization | Accepted | | +| 0003 | VortexException sanitization + bounds typing | Accepted | | | 0004 | ResourceLimits + ReadOptions | Accepted | | | 0005 | Vector API adoption | Deferred | | | 0006 | Benchmark publishing | Accepted | | diff --git a/docs/compatibility.md b/docs/compatibility.md index db93a43d..18f9b546 100644 --- a/docs/compatibility.md +++ b/docs/compatibility.md @@ -12,14 +12,14 @@ A consumer that only needs to read Vortex files can depend on a strict subset: io.github.dfa1.vortex vortex-reader - 0.7.3 + 0.8.0 io.github.dfa1.vortex vortex-inspector - 0.7.3 + 0.8.0 ``` @@ -71,7 +71,7 @@ resolves only the standalone decoders in `reader`; no encoder class is loaded. | `fastlanes.delta` | `DeltaEncodingDecoder` | `DeltaEncodingEncoder` | ✅ | ✅ | Integer PTypes | | `fastlanes.for` | `FrameOfReferenceEncodingDecoder`| `FrameOfReferenceEncodingEncoder`| ✅ | ✅ | Integer PTypes | | `fastlanes.rle` | `RleEncodingDecoder` | `RleEncodingEncoder` | ✅ | ✅ | Chunk-based RLE | -| `vortex.patched` | `PatchedEncodingDecoder` | `PatchedEncodingEncoder` | ✅ | ❌ | Primitive PTypes; encode not yet implemented | +| `vortex.patched` | `PatchedEncodingDecoder` | `PatchedEncodingEncoder` | ✅ | ✅ | Primitive PTypes; base + chunked patches (1024-elem blocks) | | `vortex.variant` | `VariantEncodingDecoder` | `VariantEncodingEncoder` | ✅ | ✅ | Canonical container; constant / chunked-of-constants core + optional shredded child. Typed-scalar values only — nested objects need `parquet.variant` (ADR 0014) | | `vortex.onpair` | _none_ | _none_ | ❌ | ❌ | Experimental in Rust 0.74.0; not yet ported | From 29c81ccadbbebc10548d52840c375e3bbbe5a41e Mon Sep 17 00:00:00 2001 From: Davide Angelocola Date: Sat, 20 Jun 2026 13:49:17 +0200 Subject: [PATCH 2/3] test(integration): cover patched, fastlanes.delta, masked round-trips MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Close three integration-coverage gaps found in the state review: - fastlanes.delta — Java→Rust round-trip (forced DeltaEncodingEncoder over a monotonic I64 ramp; Rust trims the padded 1024-chunk back to n rows). - vortex.masked — direct Masked-over-primitive Java→Rust round-trip (nullable I64 with interleaved nulls via the boxed Long[] builder path). Previously only covered indirectly through nullable_date (Masked → Ext → Primitive). - vortex.patched — the bundled vortex-jni build rejects a standalone patched array ("Unknown encoding: vortex.patched"), so this is a Java write → Java read round-trip in a new JavaRoundTripIntegrationTest. Still exercises the writer encode + file format + reader decode end to end; four outliers (< n/20) force the patch path. Full integration suite: 271 green, 2 @Disabled (f16/uuid, JNI-blocked). Co-Authored-By: Claude Opus 4.8 --- .../JavaRoundTripIntegrationTest.java | 77 +++++++++++++++++++ .../JavaWritesRustReadsIntegrationTest.java | 61 +++++++++++++++ 2 files changed, 138 insertions(+) create mode 100644 integration/src/test/java/io/github/dfa1/vortex/integration/JavaRoundTripIntegrationTest.java diff --git a/integration/src/test/java/io/github/dfa1/vortex/integration/JavaRoundTripIntegrationTest.java b/integration/src/test/java/io/github/dfa1/vortex/integration/JavaRoundTripIntegrationTest.java new file mode 100644 index 00000000..96a4b625 --- /dev/null +++ b/integration/src/test/java/io/github/dfa1/vortex/integration/JavaRoundTripIntegrationTest.java @@ -0,0 +1,77 @@ +package io.github.dfa1.vortex.integration; + +import io.github.dfa1.vortex.core.DType; +import io.github.dfa1.vortex.core.PType; +import io.github.dfa1.vortex.reader.ReadRegistry; +import io.github.dfa1.vortex.reader.ScanOptions; +import io.github.dfa1.vortex.reader.VortexReader; +import io.github.dfa1.vortex.reader.array.IntArray; +import io.github.dfa1.vortex.writer.VortexWriter; +import io.github.dfa1.vortex.writer.WriteOptions; +import io.github.dfa1.vortex.writer.encode.PatchedEncodingEncoder; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; + +import java.io.IOException; +import java.nio.channels.FileChannel; +import java.nio.file.Path; +import java.nio.file.StandardOpenOption; +import java.util.ArrayList; +import java.util.List; +import java.util.Map; + +import static org.assertj.core.api.Assertions.assertThat; + +/// Java writer → Java reader round-trips for encodings the bundled `vortex-jni` build cannot read +/// back, so they have no Java→Rust coverage. This is still a real cross-module integration test: it +/// drives the writer's encode, the on-disk file format, and the reader's decode end to end. +/// +/// `vortex.patched` is the case here — the JNI reader rejects a standalone patched array with +/// "Unknown encoding: vortex.patched", so the round-trip is asserted on the Java side instead. +class JavaRoundTripIntegrationTest { + + private static final DType.Struct I32_SCHEMA = new DType.Struct( + List.of("v"), + List.of(new DType.Primitive(PType.I32, false)), + false); + + @Test + void patched_i32_javaWriteJavaRead(@TempDir Path tmp) throws IOException { + // Given — most values fit ~6 bits; four large outliers (< n/20) force the patch path + // (base inner array + patch index / patch value children). Non-negative, so the bit width + // is computed from the value itself rather than sign-extension. + Path file = tmp.resolve("java_patched_i32.vtx"); + int[] data = new int[120]; + for (int i = 0; i < data.length; i++) { + data[i] = i % 50; + } + data[7] = 5_000_000; + data[23] = 6_000_000; + data[61] = 7_000_000; + data[88] = 8_000_000; + try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE); + var sut = VortexWriter.create(ch, I32_SCHEMA, WriteOptions.defaults(), + List.of(new PatchedEncodingEncoder()))) { + // When + sut.writeChunk(Map.of("v", data)); + } + + // Then — the Java reader reconstructs base values + patched outliers exactly + int[] decoded = readIntColumn(file, "v"); + assertThat(decoded).containsExactly(data); + } + + private static int[] readIntColumn(Path file, String column) throws IOException { + try (var vf = VortexReader.open(file, ReadRegistry.loadAll()); + var iter = vf.scan(ScanOptions.columns(column))) { + var ints = new ArrayList(); + iter.forEachRemaining(c -> { + IntArray arr = (IntArray) c.columns().get(column); + for (long i = 0; i < arr.length(); i++) { + ints.add(arr.getInt(i)); + } + }); + return ints.stream().mapToInt(Integer::intValue).toArray(); + } + } +} diff --git a/integration/src/test/java/io/github/dfa1/vortex/integration/JavaWritesRustReadsIntegrationTest.java b/integration/src/test/java/io/github/dfa1/vortex/integration/JavaWritesRustReadsIntegrationTest.java index e911f988..5de32b0f 100644 --- a/integration/src/test/java/io/github/dfa1/vortex/integration/JavaWritesRustReadsIntegrationTest.java +++ b/integration/src/test/java/io/github/dfa1/vortex/integration/JavaWritesRustReadsIntegrationTest.java @@ -15,6 +15,7 @@ import io.github.dfa1.vortex.writer.encode.ByteBoolEncodingEncoder; import io.github.dfa1.vortex.writer.encode.ConstantEncodingEncoder; import io.github.dfa1.vortex.writer.encode.DateExtensionEncoder; +import io.github.dfa1.vortex.writer.encode.DeltaEncodingEncoder; import io.github.dfa1.vortex.writer.encode.FsstEncodingEncoder; import io.github.dfa1.vortex.writer.encode.ListData; import io.github.dfa1.vortex.writer.encode.ListEncodingEncoder; @@ -994,6 +995,66 @@ void javaWriter_rustReader_sparse_i32(@TempDir Path tmp) throws IOException { assertThat(decoded).containsExactly(data); } + @Test + void javaWriter_rustReader_delta_i64(@TempDir Path tmp) throws IOException { + // Given — FastLanes delta: monotonic data with varied positive steps. The encoder pads to a + // 1024-element chunk and stores per-lane bases + deltas; Rust must trim back to n rows. + Path file = tmp.resolve("java_delta_i64.vtx"); + long[] data = new long[50]; + long acc = 1_000L; + for (int i = 0; i < data.length; i++) { + acc += 1 + (i % 7); + data[i] = acc; + } + try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE); + var sut = VortexWriter.create(ch, TS_SCHEMA, WriteOptions.defaults(), + List.of(new DeltaEncodingEncoder()))) { + // When + sut.writeChunk(Map.of("ts", data)); + } + + // Then + long[] decoded = readLongColumn(file, "ts"); + assertThat(decoded).containsExactly(data); + } + + @Test + void javaWriter_rustReader_masked_nullableI64(@TempDir Path tmp) throws IOException { + // Given — nullable primitive I64 with interleaved nulls. The default write path wraps a + // nullable column as MaskedEncoding → PrimitiveEncoding (values + validity bool child). + // This is the direct Masked-over-primitive layout; nullable_date only covers Masked → Ext. + Path file = tmp.resolve("java_masked_i64.vtx"); + DType.Struct schema = new DType.Struct( + List.of("v"), + List.of(new DType.Primitive(PType.I64, true)), + false); + Long[] data = {10L, null, 30L, null, 50L, 60L}; + try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE); + var sut = VortexWriter.create(ch, schema, WriteOptions.defaults())) { + // When — boxed Long[] via the builder routes through the nullable → masked path + sut.writeChunk(c -> c.put("v", data)); + } + + // Then — Rust reads a nullable BigInt vector; null positions survive, values round-trip + String uri = file.toAbsolutePath().toUri().toString(); + DataSource ds = DataSource.open(SESSION, uri); + Scan scan = ds.scan(ScanOptions.of()); + var values = new ArrayList(); + while (scan.hasNext()) { + Partition partition = scan.next(); + try (ArrowReader reader = partition.scanArrow(ALLOCATOR)) { + while (reader.loadNextBatch()) { + VectorSchemaRoot root = reader.getVectorSchemaRoot(); + BigIntVector vec = (BigIntVector) root.getVector("v"); + for (int i = 0; i < root.getRowCount(); i++) { + values.add(vec.isNull(i) ? null : vec.get(i)); + } + } + } + } + assertThat(values).containsExactly(10L, null, 30L, null, 50L, 60L); + } + @Test void javaWriter_rustReader_bool_boolEncoding(@TempDir Path tmp) throws IOException { // Given — BoolEncoding: bit-packed boolean column From 2180e3e8ab015a26188e1a0a92b4aee62d0d3bb7 Mon Sep 17 00:00:00 2001 From: Davide Angelocola Date: Sat, 20 Jun 2026 13:58:36 +0200 Subject: [PATCH 3/3] feat(writer): accept boxed nullable arrays on the map writeChunk path MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The map entry point writeChunk(Map) only accepted raw primitive arrays, so a nullable column had to be supplied via the builder (writeChunk(c -> c.put(..., new Long[]{...}))). Passing a boxed Long[]/Integer[]/… through the map threw "unsupported data type". The two entry points now share ChunkImpl.validateAndAdapt: - writeChunk(Map) adapts every column up front (boxed nullable array → NullableData, raw arrays pass through) before the row-count check and encoding, so the map form routes nullable columns through MaskedEncoding exactly like the builder. - validateAndAdapt is now package-private and idempotent: an already-adapted NullableData passes through, since the builder pre-adapts before delegating to writeChunk(Map) (would otherwise double-adapt and reject NullableData). Tests: VortexWriterTest gains map-path accept (boxed Long[] with null) + reject (boxed on non-nullable) cases; the integration masked round-trip now drives the map form, verifying null preservation end-to-end through the JNI reader. Co-Authored-By: Claude Opus 4.8 --- .../JavaWritesRustReadsIntegrationTest.java | 4 +- .../github/dfa1/vortex/writer/ChunkImpl.java | 17 ++++++++- .../dfa1/vortex/writer/VortexWriter.java | 26 ++++++++++--- .../dfa1/vortex/writer/VortexWriterTest.java | 37 +++++++++++++++++++ 4 files changed, 75 insertions(+), 9 deletions(-) diff --git a/integration/src/test/java/io/github/dfa1/vortex/integration/JavaWritesRustReadsIntegrationTest.java b/integration/src/test/java/io/github/dfa1/vortex/integration/JavaWritesRustReadsIntegrationTest.java index 5de32b0f..4a905776 100644 --- a/integration/src/test/java/io/github/dfa1/vortex/integration/JavaWritesRustReadsIntegrationTest.java +++ b/integration/src/test/java/io/github/dfa1/vortex/integration/JavaWritesRustReadsIntegrationTest.java @@ -1031,8 +1031,8 @@ void javaWriter_rustReader_masked_nullableI64(@TempDir Path tmp) throws IOExcept Long[] data = {10L, null, 30L, null, 50L, 60L}; try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE); var sut = VortexWriter.create(ch, schema, WriteOptions.defaults())) { - // When — boxed Long[] via the builder routes through the nullable → masked path - sut.writeChunk(c -> c.put("v", data)); + // When — boxed Long[] via the map entry point routes through the nullable → masked path + sut.writeChunk(Map.of("v", data)); } // Then — Rust reads a nullable BigInt vector; null positions survive, values round-trip diff --git a/writer/src/main/java/io/github/dfa1/vortex/writer/ChunkImpl.java b/writer/src/main/java/io/github/dfa1/vortex/writer/ChunkImpl.java index a740d4e4..188f4d3e 100644 --- a/writer/src/main/java/io/github/dfa1/vortex/writer/ChunkImpl.java +++ b/writer/src/main/java/io/github/dfa1/vortex/writer/ChunkImpl.java @@ -42,10 +42,25 @@ Map finish() { return data; } - private static Object validateAndAdapt(String column, DType dtype, Object value) { + /// Validates a column's raw data against its schema dtype and adapts boxed nullable arrays + /// (`Long[]`, `Integer[]`, `Boolean[]`, …) into the internal [NullableData] carrier. Shared by + /// the builder ([#put]) and the map-based [VortexWriter#writeChunk(Map)] entry points so both + /// accept the same shapes. + /// + /// @param column the column name, for error messages + /// @param dtype the column's declared schema type + /// @param value the raw column data + /// @return the adapted data: the original array for non-nullable primitives, a [NullableData] + /// for boxed nullable arrays, or `value` unchanged for non-primitive carriers + static Object validateAndAdapt(String column, DType dtype, Object value) { if (value == null) { throw new IllegalArgumentException("null array for column: " + column); } + // Idempotent: an already-adapted NullableData (e.g. from the builder, which adapts before + // delegating to VortexWriter.writeChunk(Map)) passes through so a second call is a no-op. + if (value instanceof NullableData) { + return value; + } return switch (dtype) { case DType.Primitive p -> adaptPrimitive(column, p, value); case DType.Utf8 u -> adaptUtf8(column, u, value); diff --git a/writer/src/main/java/io/github/dfa1/vortex/writer/VortexWriter.java b/writer/src/main/java/io/github/dfa1/vortex/writer/VortexWriter.java index 93776f7a..f12275a4 100644 --- a/writer/src/main/java/io/github/dfa1/vortex/writer/VortexWriter.java +++ b/writer/src/main/java/io/github/dfa1/vortex/writer/VortexWriter.java @@ -369,11 +369,29 @@ public void writeChunk(java.util.function.Consumer builder) throws IOExce /// Write one chunk. Each column is encoded by the first registered encoder that accepts its dtype. /// + /// A nullable column may be supplied as a boxed array (`Long[]`, `Integer[]`, `Double[]`, + /// `Boolean[]`, …) with `null` marking absent rows; it routes through `MaskedEncoding` just like + /// the builder form. Non-nullable columns take the raw primitive array (`long[]`, `int[]`, …). + /// /// @param columns map from column name to typed array data /// @throws IOException if an I/O error occurs writing to the underlying channel /// @throws IllegalArgumentException if a schema column is missing from `columns`, /// or if column arrays disagree on row count public void writeChunk(Map columns) throws IOException { + // Adapt each column up front so the map entry point accepts the same shapes as the + // builder: boxed nullable arrays (Long[], Integer[], Boolean[], …) become NullableData, + // raw primitive arrays pass through. Done before the row-count check so length validation + // and encoding both see the normalized carrier. + Map adapted = new LinkedHashMap<>(); + for (int i = 0; i < schema.fieldNames().size(); i++) { + String colName = schema.fieldNames().get(i); + Object data = columns.get(colName); + if (data == null) { + throw new IllegalArgumentException("missing column: " + colName); + } + adapted.put(colName, ChunkImpl.validateAndAdapt(colName, schema.fieldTypes().get(i), data)); + } + // Pre-validate row counts so a length mismatch is rejected with a clear error // before any data is serialised. Without this check, the writer would produce a // file whose column chunks claim different row counts — readable but logically @@ -382,11 +400,7 @@ public void writeChunk(Map columns) throws IOException { String expectedFrom = null; for (int i = 0; i < schema.fieldNames().size(); i++) { String colName = schema.fieldNames().get(i); - Object data = columns.get(colName); - if (data == null) { - throw new IllegalArgumentException("missing column: " + colName); - } - long len = rowCountForValidation(colName, columns.get(colName)); + long len = rowCountForValidation(colName, adapted.get(colName)); if (expectedLen < 0) { expectedLen = len; expectedFrom = colName; @@ -400,7 +414,7 @@ public void writeChunk(Map columns) throws IOException { for (int i = 0; i < schema.fieldNames().size(); i++) { String colName = schema.fieldNames().get(i); DType colDtype = schema.fieldTypes().get(i); - Object data = columns.get(colName); + Object data = adapted.get(colName); // Auto-route extension columns: callers can pass List, List, // etc., and we route through the matching spec extension to produce the int[] / diff --git a/writer/src/test/java/io/github/dfa1/vortex/writer/VortexWriterTest.java b/writer/src/test/java/io/github/dfa1/vortex/writer/VortexWriterTest.java index 46cbe2f4..3755788e 100644 --- a/writer/src/test/java/io/github/dfa1/vortex/writer/VortexWriterTest.java +++ b/writer/src/test/java/io/github/dfa1/vortex/writer/VortexWriterTest.java @@ -12,6 +12,7 @@ import java.io.IOException; import java.nio.channels.FileChannel; +import java.nio.file.Files; import java.nio.file.Path; import java.nio.file.StandardOpenOption; import java.util.ArrayList; @@ -85,6 +86,42 @@ void writeChunk_autoroutesExtensionCollectionViaSpecExtension(@TempDir Path tmp) } } + @Test + void writeChunk_map_nullablePrimitive_acceptsBoxedArray(@TempDir Path tmp) throws IOException { + // Given — nullable I64 column passed to the MAP entry point as a boxed Long[] with a null. + // Regression: the map path used to reject boxed arrays ("unsupported data type: Long[]"); + // only the builder accepted them. Both now share ChunkImpl.validateAndAdapt, so the map + // form routes the column through nullable → MaskedEncoding. The null round-trip itself is + // asserted end-to-end (through the JNI reader) by the integration masked test. + var schema = new DType.Struct(List.of("v"), + List.of(new DType.Primitive(PType.I64, true)), false); + Path file = tmp.resolve("nullable_map.vtx"); + + // When + try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE); + var sut = VortexWriter.create(ch, schema, WriteOptions.defaults())) { + sut.writeChunk(Map.of("v", new Long[]{10L, null, 30L})); + } + + // Then — the masked file is well-formed + assertThat(Files.size(file)).isPositive(); + } + + @Test + void writeChunk_map_nonNullablePrimitive_rejectsBoxedArray(@TempDir Path tmp) throws IOException { + // Given — a non-nullable I64 column rejects a boxed array on the map path, same as the + // builder: boxed implies nullability, which the schema does not allow. + Path file = tmp.resolve("err.vtx"); + try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE); + var sut = VortexWriter.create(ch, SCHEMA, WriteOptions.defaults())) { + // When / Then + assertThatThrownBy(() -> sut.writeChunk(Map.of("id", new Long[]{1L, 2L}))) + .isInstanceOf(IllegalArgumentException.class) + .hasMessageContaining("non-nullable") + .hasMessageContaining("id"); + } + } + @Test void writeChunk_roundTripsTimeExtension(@TempDir Path tmp) throws IOException { // Given — milliseconds resolution exercises the I32 storage branch of