diff --git a/README.md b/README.md index 0cd2d323..853168ee 100644 --- a/README.md +++ b/README.md @@ -61,7 +61,7 @@ line-by-line diff is zero. io.github.dfa1.vortex vortex-reader - 0.7.3 + 0.8.0 ``` diff --git a/TODO.md b/TODO.md index f3094566..0e30d8c6 100644 --- a/TODO.md +++ b/TODO.md @@ -22,11 +22,11 @@ parser exception. Each entry below is either a known gap, a contract audit, or s ### Parser hardening -- [ ] **Audit every `MemorySegment.asSlice` call site for bounds wrapping** — - `grep -rn 'asSlice' core/src/main reader/src/main`. Each call on untrusted offsets/lengths - must throw `VortexException` rather than the JDK's `IndexOutOfBoundsException`. Either wrap - per call site, or route through an `IoBounds.slice(seg, off, len)` helper and add a - Checkstyle rule rejecting raw `asSlice` in `io`/`scan`/`encoding` packages. +- [ ] **Checkstyle rule rejecting raw `MemorySegment.asSlice` on untrusted offsets** — + the `IoBounds.slice(seg, off, len)` helper shipped (ADR-0003 Phase E) and the untrusted + file-structure + decode call sites already route through it. Remaining: add the Checkstyle + rule blocking raw `asSlice` in the reader file-structure / decode packages, then a final + `grep -rn 'asSlice' core/src/main reader/src/main` sweep to catch any site the migration missed. ### Per-encoding adversarial tests @@ -85,7 +85,9 @@ Per-encoding gotchas: ## API - [ ] **Error messages — structural sanitization of `VortexException`** — - see [ADR-0003](docs/adr/0003-vortex-exception-sanitization.md) for design and phasing. + Phase E (bounds typing via `IoBounds`) shipped; remaining is Phases A–D (the `Sanitize` + helper + `VortexError` catalog). See [ADR-0003](docs/adr/0003-vortex-exception-sanitization.md) + for design and phasing. - [ ] Use domain primitives (`UInt32`, `UInt64`, etc.) as value classes via Project Valhalla instead of raw `long`/`int` - See [ADR-0008](docs/adr/0008-domain-primitives-unsigned-integers.md) and https://dfa1.github.io/articles/rethink-domain-primitives-with-valhalla - Candidates: `PType` integer kinds, buffer offsets, row indices, byte lengths @@ -96,6 +98,12 @@ Per-encoding gotchas: - Validate, fail fast (no silent clamp): consumer-facing `slice` throws `IllegalArgumentException` on `offset < 0 || length < 0 || offset + length > length()`; scan-internal `Offset*` construction fed by untrusted layout metadata throws `VortexException` - Add the matching compact-constructor bounds guard to all `Offset*` records (currently document `inner length >= offset + length` but enforce nothing — broadcast inner silently wraps via `i % cap`) +## Compute + +- [ ] **Compute primitives — masks, kernels, no-materialise** — pushdown filter/compare/aggregate + kernels operating on Lazy arrays without materialising. See [ADR-0013](docs/adr/0013-compute-primitives.md) + (Proposed). Gate: a concrete downstream consumer (e.g. the vortex-arrow bridge or filter pushdown). + ## Encodings See [docs/compatibility.md](docs/compatibility.md) for the full encoding support table and S3 fixture status. diff --git a/docs/adr/0003-vortex-exception-sanitization.md b/docs/adr/0003-vortex-exception-sanitization.md index 543f9d8e..20d6bbf5 100644 --- a/docs/adr/0003-vortex-exception-sanitization.md +++ b/docs/adr/0003-vortex-exception-sanitization.md @@ -1,6 +1,6 @@ # ADR 0003: `VortexException` contract — message sanitization and bounds typing -- **Status:** Accepted — implementation pending (see Phases below) +- **Status:** Accepted — Phase E (bounds typing via `IoBounds`) done; Phases A–D (message sanitization) pending - **Date:** 2026-06-13 (bounds-typing scope added 2026-06-20) - **Deciders:** project maintainer - **Related:** [ADR 0001 — Split read and write runtimes](0001-split-read-and-write-runtimes.md), diff --git a/docs/adr/0007-pco-encode.md b/docs/adr/0007-pco-encode.md index 28115ff1..a9c5f293 100644 --- a/docs/adr/0007-pco-encode.md +++ b/docs/adr/0007-pco-encode.md @@ -1,6 +1,6 @@ # ADR 0007: Pure-Java `vortex.pco` encoder -- **Status:** Implemented (E1-E5, E7-E9 done; E6 partial: IntMult only; FloatMult/FloatQuant deferred as marginal vs. existing Classic+ALP cascade) +- **Status:** Completed (E1-E5, E7-E9 done; E6 partial: IntMult only; FloatMult/FloatQuant deferred as marginal vs. existing Classic+ALP cascade) - **Date:** 2026-06-13 - **Deciders:** project maintainer - **Supersedes:** — diff --git a/docs/adr/0010-lazy-decode.md b/docs/adr/0010-lazy-decode.md index 5fd37287..e8d2a4c1 100644 --- a/docs/adr/0010-lazy-decode.md +++ b/docs/adr/0010-lazy-decode.md @@ -1,6 +1,6 @@ # ADR 0010: Lazy decode for 1:1 transform encodings -- **Status:** Implemented +- **Status:** Completed - **Date:** 2026-06-13 - **Implemented:** 2026-06-14 — 2026-06-15 - **Deciders:** project maintainer diff --git a/docs/adr/0012-zero-copy-layout-decoding.md b/docs/adr/0012-zero-copy-layout-decoding.md index 814c1b19..8d86c2a7 100644 --- a/docs/adr/0012-zero-copy-layout-decoding.md +++ b/docs/adr/0012-zero-copy-layout-decoding.md @@ -1,6 +1,6 @@ # ADR 0012: Zero-copy layout decoding — lazy Chunked and Dict arrays -- **Status:** Implemented +- **Status:** Completed - **Date:** 2026-06-14 - **Implemented:** 2026-06-15 (PRs #38, #39, #42) - **Deciders:** project maintainer diff --git a/docs/adr/0014-variant-encoding-strategy.md b/docs/adr/0014-variant-encoding-strategy.md index a07d46ae..c41a29f1 100644 --- a/docs/adr/0014-variant-encoding-strategy.md +++ b/docs/adr/0014-variant-encoding-strategy.md @@ -1,6 +1,6 @@ # ADR 0014: Variant encoding strategy — chunked constants now, parquet.variant later -- **Status:** Implemented +- **Status:** Completed - **Date:** 2026-06-18 - **Deciders:** project maintainer - **Supersedes:** — diff --git a/docs/adr/ADR.md b/docs/adr/ADR.md index 7bb1598a..8bfa7b41 100644 --- a/docs/adr/ADR.md +++ b/docs/adr/ADR.md @@ -15,7 +15,7 @@ Each ADR is a Markdown file named `NNNN-short-title.md`. Use `template.md` as th |------|-----------------------------------------------|-----------|----------| | 0001 | Split read and write runtimes out of core | Completed | 0.7.0 | | 0002 | Pluggable DType, Layout, and Compute | Deferred | | -| 0003 | VortexException message sanitization | Accepted | | +| 0003 | VortexException sanitization + bounds typing | Accepted | | | 0004 | ResourceLimits + ReadOptions | Accepted | | | 0005 | Vector API adoption | Deferred | | | 0006 | Benchmark publishing | Accepted | | diff --git a/docs/compatibility.md b/docs/compatibility.md index db93a43d..18f9b546 100644 --- a/docs/compatibility.md +++ b/docs/compatibility.md @@ -12,14 +12,14 @@ A consumer that only needs to read Vortex files can depend on a strict subset: io.github.dfa1.vortex vortex-reader - 0.7.3 + 0.8.0 io.github.dfa1.vortex vortex-inspector - 0.7.3 + 0.8.0 ``` @@ -71,7 +71,7 @@ resolves only the standalone decoders in `reader`; no encoder class is loaded. | `fastlanes.delta` | `DeltaEncodingDecoder` | `DeltaEncodingEncoder` | ✅ | ✅ | Integer PTypes | | `fastlanes.for` | `FrameOfReferenceEncodingDecoder`| `FrameOfReferenceEncodingEncoder`| ✅ | ✅ | Integer PTypes | | `fastlanes.rle` | `RleEncodingDecoder` | `RleEncodingEncoder` | ✅ | ✅ | Chunk-based RLE | -| `vortex.patched` | `PatchedEncodingDecoder` | `PatchedEncodingEncoder` | ✅ | ❌ | Primitive PTypes; encode not yet implemented | +| `vortex.patched` | `PatchedEncodingDecoder` | `PatchedEncodingEncoder` | ✅ | ✅ | Primitive PTypes; base + chunked patches (1024-elem blocks) | | `vortex.variant` | `VariantEncodingDecoder` | `VariantEncodingEncoder` | ✅ | ✅ | Canonical container; constant / chunked-of-constants core + optional shredded child. Typed-scalar values only — nested objects need `parquet.variant` (ADR 0014) | | `vortex.onpair` | _none_ | _none_ | ❌ | ❌ | Experimental in Rust 0.74.0; not yet ported | diff --git a/integration/src/test/java/io/github/dfa1/vortex/integration/JavaRoundTripIntegrationTest.java b/integration/src/test/java/io/github/dfa1/vortex/integration/JavaRoundTripIntegrationTest.java new file mode 100644 index 00000000..96a4b625 --- /dev/null +++ b/integration/src/test/java/io/github/dfa1/vortex/integration/JavaRoundTripIntegrationTest.java @@ -0,0 +1,77 @@ +package io.github.dfa1.vortex.integration; + +import io.github.dfa1.vortex.core.DType; +import io.github.dfa1.vortex.core.PType; +import io.github.dfa1.vortex.reader.ReadRegistry; +import io.github.dfa1.vortex.reader.ScanOptions; +import io.github.dfa1.vortex.reader.VortexReader; +import io.github.dfa1.vortex.reader.array.IntArray; +import io.github.dfa1.vortex.writer.VortexWriter; +import io.github.dfa1.vortex.writer.WriteOptions; +import io.github.dfa1.vortex.writer.encode.PatchedEncodingEncoder; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; + +import java.io.IOException; +import java.nio.channels.FileChannel; +import java.nio.file.Path; +import java.nio.file.StandardOpenOption; +import java.util.ArrayList; +import java.util.List; +import java.util.Map; + +import static org.assertj.core.api.Assertions.assertThat; + +/// Java writer → Java reader round-trips for encodings the bundled `vortex-jni` build cannot read +/// back, so they have no Java→Rust coverage. This is still a real cross-module integration test: it +/// drives the writer's encode, the on-disk file format, and the reader's decode end to end. +/// +/// `vortex.patched` is the case here — the JNI reader rejects a standalone patched array with +/// "Unknown encoding: vortex.patched", so the round-trip is asserted on the Java side instead. +class JavaRoundTripIntegrationTest { + + private static final DType.Struct I32_SCHEMA = new DType.Struct( + List.of("v"), + List.of(new DType.Primitive(PType.I32, false)), + false); + + @Test + void patched_i32_javaWriteJavaRead(@TempDir Path tmp) throws IOException { + // Given — most values fit ~6 bits; four large outliers (< n/20) force the patch path + // (base inner array + patch index / patch value children). Non-negative, so the bit width + // is computed from the value itself rather than sign-extension. + Path file = tmp.resolve("java_patched_i32.vtx"); + int[] data = new int[120]; + for (int i = 0; i < data.length; i++) { + data[i] = i % 50; + } + data[7] = 5_000_000; + data[23] = 6_000_000; + data[61] = 7_000_000; + data[88] = 8_000_000; + try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE); + var sut = VortexWriter.create(ch, I32_SCHEMA, WriteOptions.defaults(), + List.of(new PatchedEncodingEncoder()))) { + // When + sut.writeChunk(Map.of("v", data)); + } + + // Then — the Java reader reconstructs base values + patched outliers exactly + int[] decoded = readIntColumn(file, "v"); + assertThat(decoded).containsExactly(data); + } + + private static int[] readIntColumn(Path file, String column) throws IOException { + try (var vf = VortexReader.open(file, ReadRegistry.loadAll()); + var iter = vf.scan(ScanOptions.columns(column))) { + var ints = new ArrayList(); + iter.forEachRemaining(c -> { + IntArray arr = (IntArray) c.columns().get(column); + for (long i = 0; i < arr.length(); i++) { + ints.add(arr.getInt(i)); + } + }); + return ints.stream().mapToInt(Integer::intValue).toArray(); + } + } +} diff --git a/integration/src/test/java/io/github/dfa1/vortex/integration/JavaWritesRustReadsIntegrationTest.java b/integration/src/test/java/io/github/dfa1/vortex/integration/JavaWritesRustReadsIntegrationTest.java index e911f988..4a905776 100644 --- a/integration/src/test/java/io/github/dfa1/vortex/integration/JavaWritesRustReadsIntegrationTest.java +++ b/integration/src/test/java/io/github/dfa1/vortex/integration/JavaWritesRustReadsIntegrationTest.java @@ -15,6 +15,7 @@ import io.github.dfa1.vortex.writer.encode.ByteBoolEncodingEncoder; import io.github.dfa1.vortex.writer.encode.ConstantEncodingEncoder; import io.github.dfa1.vortex.writer.encode.DateExtensionEncoder; +import io.github.dfa1.vortex.writer.encode.DeltaEncodingEncoder; import io.github.dfa1.vortex.writer.encode.FsstEncodingEncoder; import io.github.dfa1.vortex.writer.encode.ListData; import io.github.dfa1.vortex.writer.encode.ListEncodingEncoder; @@ -994,6 +995,66 @@ void javaWriter_rustReader_sparse_i32(@TempDir Path tmp) throws IOException { assertThat(decoded).containsExactly(data); } + @Test + void javaWriter_rustReader_delta_i64(@TempDir Path tmp) throws IOException { + // Given — FastLanes delta: monotonic data with varied positive steps. The encoder pads to a + // 1024-element chunk and stores per-lane bases + deltas; Rust must trim back to n rows. + Path file = tmp.resolve("java_delta_i64.vtx"); + long[] data = new long[50]; + long acc = 1_000L; + for (int i = 0; i < data.length; i++) { + acc += 1 + (i % 7); + data[i] = acc; + } + try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE); + var sut = VortexWriter.create(ch, TS_SCHEMA, WriteOptions.defaults(), + List.of(new DeltaEncodingEncoder()))) { + // When + sut.writeChunk(Map.of("ts", data)); + } + + // Then + long[] decoded = readLongColumn(file, "ts"); + assertThat(decoded).containsExactly(data); + } + + @Test + void javaWriter_rustReader_masked_nullableI64(@TempDir Path tmp) throws IOException { + // Given — nullable primitive I64 with interleaved nulls. The default write path wraps a + // nullable column as MaskedEncoding → PrimitiveEncoding (values + validity bool child). + // This is the direct Masked-over-primitive layout; nullable_date only covers Masked → Ext. + Path file = tmp.resolve("java_masked_i64.vtx"); + DType.Struct schema = new DType.Struct( + List.of("v"), + List.of(new DType.Primitive(PType.I64, true)), + false); + Long[] data = {10L, null, 30L, null, 50L, 60L}; + try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE); + var sut = VortexWriter.create(ch, schema, WriteOptions.defaults())) { + // When — boxed Long[] via the map entry point routes through the nullable → masked path + sut.writeChunk(Map.of("v", data)); + } + + // Then — Rust reads a nullable BigInt vector; null positions survive, values round-trip + String uri = file.toAbsolutePath().toUri().toString(); + DataSource ds = DataSource.open(SESSION, uri); + Scan scan = ds.scan(ScanOptions.of()); + var values = new ArrayList(); + while (scan.hasNext()) { + Partition partition = scan.next(); + try (ArrowReader reader = partition.scanArrow(ALLOCATOR)) { + while (reader.loadNextBatch()) { + VectorSchemaRoot root = reader.getVectorSchemaRoot(); + BigIntVector vec = (BigIntVector) root.getVector("v"); + for (int i = 0; i < root.getRowCount(); i++) { + values.add(vec.isNull(i) ? null : vec.get(i)); + } + } + } + } + assertThat(values).containsExactly(10L, null, 30L, null, 50L, 60L); + } + @Test void javaWriter_rustReader_bool_boolEncoding(@TempDir Path tmp) throws IOException { // Given — BoolEncoding: bit-packed boolean column diff --git a/writer/src/main/java/io/github/dfa1/vortex/writer/ChunkImpl.java b/writer/src/main/java/io/github/dfa1/vortex/writer/ChunkImpl.java index a740d4e4..188f4d3e 100644 --- a/writer/src/main/java/io/github/dfa1/vortex/writer/ChunkImpl.java +++ b/writer/src/main/java/io/github/dfa1/vortex/writer/ChunkImpl.java @@ -42,10 +42,25 @@ Map finish() { return data; } - private static Object validateAndAdapt(String column, DType dtype, Object value) { + /// Validates a column's raw data against its schema dtype and adapts boxed nullable arrays + /// (`Long[]`, `Integer[]`, `Boolean[]`, …) into the internal [NullableData] carrier. Shared by + /// the builder ([#put]) and the map-based [VortexWriter#writeChunk(Map)] entry points so both + /// accept the same shapes. + /// + /// @param column the column name, for error messages + /// @param dtype the column's declared schema type + /// @param value the raw column data + /// @return the adapted data: the original array for non-nullable primitives, a [NullableData] + /// for boxed nullable arrays, or `value` unchanged for non-primitive carriers + static Object validateAndAdapt(String column, DType dtype, Object value) { if (value == null) { throw new IllegalArgumentException("null array for column: " + column); } + // Idempotent: an already-adapted NullableData (e.g. from the builder, which adapts before + // delegating to VortexWriter.writeChunk(Map)) passes through so a second call is a no-op. + if (value instanceof NullableData) { + return value; + } return switch (dtype) { case DType.Primitive p -> adaptPrimitive(column, p, value); case DType.Utf8 u -> adaptUtf8(column, u, value); diff --git a/writer/src/main/java/io/github/dfa1/vortex/writer/VortexWriter.java b/writer/src/main/java/io/github/dfa1/vortex/writer/VortexWriter.java index 93776f7a..f12275a4 100644 --- a/writer/src/main/java/io/github/dfa1/vortex/writer/VortexWriter.java +++ b/writer/src/main/java/io/github/dfa1/vortex/writer/VortexWriter.java @@ -369,11 +369,29 @@ public void writeChunk(java.util.function.Consumer builder) throws IOExce /// Write one chunk. Each column is encoded by the first registered encoder that accepts its dtype. /// + /// A nullable column may be supplied as a boxed array (`Long[]`, `Integer[]`, `Double[]`, + /// `Boolean[]`, …) with `null` marking absent rows; it routes through `MaskedEncoding` just like + /// the builder form. Non-nullable columns take the raw primitive array (`long[]`, `int[]`, …). + /// /// @param columns map from column name to typed array data /// @throws IOException if an I/O error occurs writing to the underlying channel /// @throws IllegalArgumentException if a schema column is missing from `columns`, /// or if column arrays disagree on row count public void writeChunk(Map columns) throws IOException { + // Adapt each column up front so the map entry point accepts the same shapes as the + // builder: boxed nullable arrays (Long[], Integer[], Boolean[], …) become NullableData, + // raw primitive arrays pass through. Done before the row-count check so length validation + // and encoding both see the normalized carrier. + Map adapted = new LinkedHashMap<>(); + for (int i = 0; i < schema.fieldNames().size(); i++) { + String colName = schema.fieldNames().get(i); + Object data = columns.get(colName); + if (data == null) { + throw new IllegalArgumentException("missing column: " + colName); + } + adapted.put(colName, ChunkImpl.validateAndAdapt(colName, schema.fieldTypes().get(i), data)); + } + // Pre-validate row counts so a length mismatch is rejected with a clear error // before any data is serialised. Without this check, the writer would produce a // file whose column chunks claim different row counts — readable but logically @@ -382,11 +400,7 @@ public void writeChunk(Map columns) throws IOException { String expectedFrom = null; for (int i = 0; i < schema.fieldNames().size(); i++) { String colName = schema.fieldNames().get(i); - Object data = columns.get(colName); - if (data == null) { - throw new IllegalArgumentException("missing column: " + colName); - } - long len = rowCountForValidation(colName, columns.get(colName)); + long len = rowCountForValidation(colName, adapted.get(colName)); if (expectedLen < 0) { expectedLen = len; expectedFrom = colName; @@ -400,7 +414,7 @@ public void writeChunk(Map columns) throws IOException { for (int i = 0; i < schema.fieldNames().size(); i++) { String colName = schema.fieldNames().get(i); DType colDtype = schema.fieldTypes().get(i); - Object data = columns.get(colName); + Object data = adapted.get(colName); // Auto-route extension columns: callers can pass List, List, // etc., and we route through the matching spec extension to produce the int[] / diff --git a/writer/src/test/java/io/github/dfa1/vortex/writer/VortexWriterTest.java b/writer/src/test/java/io/github/dfa1/vortex/writer/VortexWriterTest.java index 46cbe2f4..3755788e 100644 --- a/writer/src/test/java/io/github/dfa1/vortex/writer/VortexWriterTest.java +++ b/writer/src/test/java/io/github/dfa1/vortex/writer/VortexWriterTest.java @@ -12,6 +12,7 @@ import java.io.IOException; import java.nio.channels.FileChannel; +import java.nio.file.Files; import java.nio.file.Path; import java.nio.file.StandardOpenOption; import java.util.ArrayList; @@ -85,6 +86,42 @@ void writeChunk_autoroutesExtensionCollectionViaSpecExtension(@TempDir Path tmp) } } + @Test + void writeChunk_map_nullablePrimitive_acceptsBoxedArray(@TempDir Path tmp) throws IOException { + // Given — nullable I64 column passed to the MAP entry point as a boxed Long[] with a null. + // Regression: the map path used to reject boxed arrays ("unsupported data type: Long[]"); + // only the builder accepted them. Both now share ChunkImpl.validateAndAdapt, so the map + // form routes the column through nullable → MaskedEncoding. The null round-trip itself is + // asserted end-to-end (through the JNI reader) by the integration masked test. + var schema = new DType.Struct(List.of("v"), + List.of(new DType.Primitive(PType.I64, true)), false); + Path file = tmp.resolve("nullable_map.vtx"); + + // When + try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE); + var sut = VortexWriter.create(ch, schema, WriteOptions.defaults())) { + sut.writeChunk(Map.of("v", new Long[]{10L, null, 30L})); + } + + // Then — the masked file is well-formed + assertThat(Files.size(file)).isPositive(); + } + + @Test + void writeChunk_map_nonNullablePrimitive_rejectsBoxedArray(@TempDir Path tmp) throws IOException { + // Given — a non-nullable I64 column rejects a boxed array on the map path, same as the + // builder: boxed implies nullability, which the schema does not allow. + Path file = tmp.resolve("err.vtx"); + try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE); + var sut = VortexWriter.create(ch, SCHEMA, WriteOptions.defaults())) { + // When / Then + assertThatThrownBy(() -> sut.writeChunk(Map.of("id", new Long[]{1L, 2L}))) + .isInstanceOf(IllegalArgumentException.class) + .hasMessageContaining("non-nullable") + .hasMessageContaining("id"); + } + } + @Test void writeChunk_roundTripsTimeExtension(@TempDir Path tmp) throws IOException { // Given — milliseconds resolution exercises the I32 storage branch of