Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,7 @@ line-by-line diff is zero.
<dependency>
<groupId>io.github.dfa1.vortex</groupId>
<artifactId>vortex-reader</artifactId>
<version>0.7.3</version>
<version>0.8.0</version>
</dependency>
```

Expand Down
20 changes: 14 additions & 6 deletions TODO.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,11 +22,11 @@ parser exception. Each entry below is either a known gap, a contract audit, or s

### Parser hardening

- [ ] **Audit every `MemorySegment.asSlice` call site for bounds wrapping** —
`grep -rn 'asSlice' core/src/main reader/src/main`. Each call on untrusted offsets/lengths
must throw `VortexException` rather than the JDK's `IndexOutOfBoundsException`. Either wrap
per call site, or route through an `IoBounds.slice(seg, off, len)` helper and add a
Checkstyle rule rejecting raw `asSlice` in `io`/`scan`/`encoding` packages.
- [ ] **Checkstyle rule rejecting raw `MemorySegment.asSlice` on untrusted offsets** —
the `IoBounds.slice(seg, off, len)` helper shipped (ADR-0003 Phase E) and the untrusted
file-structure + decode call sites already route through it. Remaining: add the Checkstyle
rule blocking raw `asSlice` in the reader file-structure / decode packages, then a final
`grep -rn 'asSlice' core/src/main reader/src/main` sweep to catch any site the migration missed.

### Per-encoding adversarial tests

Expand Down Expand Up @@ -85,7 +85,9 @@ Per-encoding gotchas:
## API

- [ ] **Error messages — structural sanitization of `VortexException`** —
see [ADR-0003](docs/adr/0003-vortex-exception-sanitization.md) for design and phasing.
Phase E (bounds typing via `IoBounds`) shipped; remaining is Phases A–D (the `Sanitize`
helper + `VortexError` catalog). See [ADR-0003](docs/adr/0003-vortex-exception-sanitization.md)
for design and phasing.
- [ ] Use domain primitives (`UInt32`, `UInt64`, etc.) as value classes via Project Valhalla instead of raw `long`/`int`
- See [ADR-0008](docs/adr/0008-domain-primitives-unsigned-integers.md) and https://dfa1.github.io/articles/rethink-domain-primitives-with-valhalla
- Candidates: `PType` integer kinds, buffer offsets, row indices, byte lengths
Expand All @@ -96,6 +98,12 @@ Per-encoding gotchas:
- Validate, fail fast (no silent clamp): consumer-facing `slice` throws `IllegalArgumentException` on `offset < 0 || length < 0 || offset + length > length()`; scan-internal `Offset*` construction fed by untrusted layout metadata throws `VortexException`
- Add the matching compact-constructor bounds guard to all `Offset*` records (currently document `inner length >= offset + length` but enforce nothing — broadcast inner silently wraps via `i % cap`)

## Compute

- [ ] **Compute primitives — masks, kernels, no-materialise** — pushdown filter/compare/aggregate
kernels operating on Lazy arrays without materialising. See [ADR-0013](docs/adr/0013-compute-primitives.md)
(Proposed). Gate: a concrete downstream consumer (e.g. the vortex-arrow bridge or filter pushdown).

## Encodings

See [docs/compatibility.md](docs/compatibility.md) for the full encoding support table and S3 fixture status.
Expand Down
2 changes: 1 addition & 1 deletion docs/adr/0003-vortex-exception-sanitization.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# ADR 0003: `VortexException` contract — message sanitization and bounds typing

- **Status:** Accepted — implementation pending (see Phases below)
- **Status:** Accepted — Phase E (bounds typing via `IoBounds`) done; Phases A–D (message sanitization) pending
- **Date:** 2026-06-13 (bounds-typing scope added 2026-06-20)
- **Deciders:** project maintainer
- **Related:** [ADR 0001 — Split read and write runtimes](0001-split-read-and-write-runtimes.md),
Expand Down
2 changes: 1 addition & 1 deletion docs/adr/0007-pco-encode.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# ADR 0007: Pure-Java `vortex.pco` encoder

- **Status:** Implemented (E1-E5, E7-E9 done; E6 partial: IntMult only; FloatMult/FloatQuant deferred as marginal vs. existing Classic+ALP cascade)
- **Status:** Completed (E1-E5, E7-E9 done; E6 partial: IntMult only; FloatMult/FloatQuant deferred as marginal vs. existing Classic+ALP cascade)
- **Date:** 2026-06-13
- **Deciders:** project maintainer
- **Supersedes:** —
Expand Down
2 changes: 1 addition & 1 deletion docs/adr/0010-lazy-decode.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# ADR 0010: Lazy decode for 1:1 transform encodings

- **Status:** Implemented
- **Status:** Completed
- **Date:** 2026-06-13
- **Implemented:** 2026-06-14 — 2026-06-15
- **Deciders:** project maintainer
Expand Down
2 changes: 1 addition & 1 deletion docs/adr/0012-zero-copy-layout-decoding.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# ADR 0012: Zero-copy layout decoding — lazy Chunked and Dict arrays

- **Status:** Implemented
- **Status:** Completed
- **Date:** 2026-06-14
- **Implemented:** 2026-06-15 (PRs #38, #39, #42)
- **Deciders:** project maintainer
Expand Down
2 changes: 1 addition & 1 deletion docs/adr/0014-variant-encoding-strategy.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# ADR 0014: Variant encoding strategy — chunked constants now, parquet.variant later

- **Status:** Implemented
- **Status:** Completed
- **Date:** 2026-06-18
- **Deciders:** project maintainer
- **Supersedes:** —
Expand Down
2 changes: 1 addition & 1 deletion docs/adr/ADR.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ Each ADR is a Markdown file named `NNNN-short-title.md`. Use `template.md` as th
|------|-----------------------------------------------|-----------|----------|
| 0001 | Split read and write runtimes out of core | Completed | 0.7.0 |
| 0002 | Pluggable DType, Layout, and Compute | Deferred | |
| 0003 | VortexException message sanitization | Accepted | |
| 0003 | VortexException sanitization + bounds typing | Accepted | |
| 0004 | ResourceLimits + ReadOptions | Accepted | |
| 0005 | Vector API adoption | Deferred | |
| 0006 | Benchmark publishing | Accepted | |
Expand Down
6 changes: 3 additions & 3 deletions docs/compatibility.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,14 +12,14 @@ A consumer that only needs to read Vortex files can depend on a strict subset:
<dependency>
<groupId>io.github.dfa1.vortex</groupId>
<artifactId>vortex-reader</artifactId>
<version>0.7.3</version>
<version>0.8.0</version>
</dependency>

<!-- optional: inspector for layout-tree introspection -->
<dependency>
<groupId>io.github.dfa1.vortex</groupId>
<artifactId>vortex-inspector</artifactId>
<version>0.7.3</version>
<version>0.8.0</version>
</dependency>
```

Expand Down Expand Up @@ -71,7 +71,7 @@ resolves only the standalone decoders in `reader`; no encoder class is loaded.
| `fastlanes.delta` | `DeltaEncodingDecoder` | `DeltaEncodingEncoder` | ✅ | ✅ | Integer PTypes |
| `fastlanes.for` | `FrameOfReferenceEncodingDecoder`| `FrameOfReferenceEncodingEncoder`| ✅ | ✅ | Integer PTypes |
| `fastlanes.rle` | `RleEncodingDecoder` | `RleEncodingEncoder` | ✅ | ✅ | Chunk-based RLE |
| `vortex.patched` | `PatchedEncodingDecoder` | `PatchedEncodingEncoder` | ✅ | | Primitive PTypes; encode not yet implemented |
| `vortex.patched` | `PatchedEncodingDecoder` | `PatchedEncodingEncoder` | ✅ | | Primitive PTypes; base + chunked patches (1024-elem blocks) |
| `vortex.variant` | `VariantEncodingDecoder` | `VariantEncodingEncoder` | ✅ | ✅ | Canonical container; constant / chunked-of-constants core + optional shredded child. Typed-scalar values only — nested objects need `parquet.variant` (ADR 0014) |
| `vortex.onpair` | _none_ | _none_ | ❌ | ❌ | Experimental in Rust 0.74.0; not yet ported |

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
package io.github.dfa1.vortex.integration;

import io.github.dfa1.vortex.core.DType;
import io.github.dfa1.vortex.core.PType;
import io.github.dfa1.vortex.reader.ReadRegistry;
import io.github.dfa1.vortex.reader.ScanOptions;
import io.github.dfa1.vortex.reader.VortexReader;
import io.github.dfa1.vortex.reader.array.IntArray;
import io.github.dfa1.vortex.writer.VortexWriter;
import io.github.dfa1.vortex.writer.WriteOptions;
import io.github.dfa1.vortex.writer.encode.PatchedEncodingEncoder;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import static org.assertj.core.api.Assertions.assertThat;

/// Java writer → Java reader round-trips for encodings the bundled `vortex-jni` build cannot read
/// back, so they have no Java→Rust coverage. This is still a real cross-module integration test: it
/// drives the writer's encode, the on-disk file format, and the reader's decode end to end.
///
/// `vortex.patched` is the case here — the JNI reader rejects a standalone patched array with
/// "Unknown encoding: vortex.patched", so the round-trip is asserted on the Java side instead.
class JavaRoundTripIntegrationTest {

private static final DType.Struct I32_SCHEMA = new DType.Struct(
List.of("v"),
List.of(new DType.Primitive(PType.I32, false)),
false);

@Test
void patched_i32_javaWriteJavaRead(@TempDir Path tmp) throws IOException {
// Given — most values fit ~6 bits; four large outliers (< n/20) force the patch path
// (base inner array + patch index / patch value children). Non-negative, so the bit width
// is computed from the value itself rather than sign-extension.
Path file = tmp.resolve("java_patched_i32.vtx");
int[] data = new int[120];
for (int i = 0; i < data.length; i++) {
data[i] = i % 50;
}
data[7] = 5_000_000;
data[23] = 6_000_000;
data[61] = 7_000_000;
data[88] = 8_000_000;
try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
var sut = VortexWriter.create(ch, I32_SCHEMA, WriteOptions.defaults(),
List.of(new PatchedEncodingEncoder()))) {
// When
sut.writeChunk(Map.of("v", data));
}

// Then — the Java reader reconstructs base values + patched outliers exactly
int[] decoded = readIntColumn(file, "v");
assertThat(decoded).containsExactly(data);
}

private static int[] readIntColumn(Path file, String column) throws IOException {
try (var vf = VortexReader.open(file, ReadRegistry.loadAll());
var iter = vf.scan(ScanOptions.columns(column))) {
var ints = new ArrayList<Integer>();
iter.forEachRemaining(c -> {
IntArray arr = (IntArray) c.columns().get(column);
for (long i = 0; i < arr.length(); i++) {
ints.add(arr.getInt(i));
}
});
return ints.stream().mapToInt(Integer::intValue).toArray();
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
import io.github.dfa1.vortex.writer.encode.ByteBoolEncodingEncoder;
import io.github.dfa1.vortex.writer.encode.ConstantEncodingEncoder;
import io.github.dfa1.vortex.writer.encode.DateExtensionEncoder;
import io.github.dfa1.vortex.writer.encode.DeltaEncodingEncoder;
import io.github.dfa1.vortex.writer.encode.FsstEncodingEncoder;
import io.github.dfa1.vortex.writer.encode.ListData;
import io.github.dfa1.vortex.writer.encode.ListEncodingEncoder;
Expand Down Expand Up @@ -994,6 +995,66 @@ void javaWriter_rustReader_sparse_i32(@TempDir Path tmp) throws IOException {
assertThat(decoded).containsExactly(data);
}

@Test
void javaWriter_rustReader_delta_i64(@TempDir Path tmp) throws IOException {
// Given — FastLanes delta: monotonic data with varied positive steps. The encoder pads to a
// 1024-element chunk and stores per-lane bases + deltas; Rust must trim back to n rows.
Path file = tmp.resolve("java_delta_i64.vtx");
long[] data = new long[50];
long acc = 1_000L;
for (int i = 0; i < data.length; i++) {
acc += 1 + (i % 7);
data[i] = acc;
}
try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
var sut = VortexWriter.create(ch, TS_SCHEMA, WriteOptions.defaults(),
List.of(new DeltaEncodingEncoder()))) {
// When
sut.writeChunk(Map.of("ts", data));
}

// Then
long[] decoded = readLongColumn(file, "ts");
assertThat(decoded).containsExactly(data);
}

@Test
void javaWriter_rustReader_masked_nullableI64(@TempDir Path tmp) throws IOException {
// Given — nullable primitive I64 with interleaved nulls. The default write path wraps a
// nullable column as MaskedEncoding → PrimitiveEncoding (values + validity bool child).
// This is the direct Masked-over-primitive layout; nullable_date only covers Masked → Ext.
Path file = tmp.resolve("java_masked_i64.vtx");
DType.Struct schema = new DType.Struct(
List.of("v"),
List.of(new DType.Primitive(PType.I64, true)),
false);
Long[] data = {10L, null, 30L, null, 50L, 60L};
try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
var sut = VortexWriter.create(ch, schema, WriteOptions.defaults())) {
// When — boxed Long[] via the map entry point routes through the nullable → masked path
sut.writeChunk(Map.of("v", data));
}

// Then — Rust reads a nullable BigInt vector; null positions survive, values round-trip
String uri = file.toAbsolutePath().toUri().toString();
DataSource ds = DataSource.open(SESSION, uri);
Scan scan = ds.scan(ScanOptions.of());
var values = new ArrayList<Long>();
while (scan.hasNext()) {
Partition partition = scan.next();
try (ArrowReader reader = partition.scanArrow(ALLOCATOR)) {
while (reader.loadNextBatch()) {
VectorSchemaRoot root = reader.getVectorSchemaRoot();
BigIntVector vec = (BigIntVector) root.getVector("v");
for (int i = 0; i < root.getRowCount(); i++) {
values.add(vec.isNull(i) ? null : vec.get(i));
}
}
}
}
assertThat(values).containsExactly(10L, null, 30L, null, 50L, 60L);
}

@Test
void javaWriter_rustReader_bool_boolEncoding(@TempDir Path tmp) throws IOException {
// Given — BoolEncoding: bit-packed boolean column
Expand Down
17 changes: 16 additions & 1 deletion writer/src/main/java/io/github/dfa1/vortex/writer/ChunkImpl.java
Original file line number Diff line number Diff line change
Expand Up @@ -42,10 +42,25 @@ Map<String, Object> finish() {
return data;
}

private static Object validateAndAdapt(String column, DType dtype, Object value) {
/// Validates a column's raw data against its schema dtype and adapts boxed nullable arrays
/// (`Long[]`, `Integer[]`, `Boolean[]`, …) into the internal [NullableData] carrier. Shared by
/// the builder ([#put]) and the map-based [VortexWriter#writeChunk(Map)] entry points so both
/// accept the same shapes.
///
/// @param column the column name, for error messages
/// @param dtype the column's declared schema type
/// @param value the raw column data
/// @return the adapted data: the original array for non-nullable primitives, a [NullableData]
/// for boxed nullable arrays, or `value` unchanged for non-primitive carriers
static Object validateAndAdapt(String column, DType dtype, Object value) {
if (value == null) {
throw new IllegalArgumentException("null array for column: " + column);
}
// Idempotent: an already-adapted NullableData (e.g. from the builder, which adapts before
// delegating to VortexWriter.writeChunk(Map)) passes through so a second call is a no-op.
if (value instanceof NullableData) {
return value;
}
return switch (dtype) {
case DType.Primitive p -> adaptPrimitive(column, p, value);
case DType.Utf8 u -> adaptUtf8(column, u, value);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -369,11 +369,29 @@ public void writeChunk(java.util.function.Consumer<Chunk> builder) throws IOExce

/// Write one chunk. Each column is encoded by the first registered encoder that accepts its dtype.
///
/// A nullable column may be supplied as a boxed array (`Long[]`, `Integer[]`, `Double[]`,
/// `Boolean[]`, …) with `null` marking absent rows; it routes through `MaskedEncoding` just like
/// the builder form. Non-nullable columns take the raw primitive array (`long[]`, `int[]`, …).
///
/// @param columns map from column name to typed array data
/// @throws IOException if an I/O error occurs writing to the underlying channel
/// @throws IllegalArgumentException if a schema column is missing from `columns`,
/// or if column arrays disagree on row count
public void writeChunk(Map<String, Object> columns) throws IOException {
// Adapt each column up front so the map entry point accepts the same shapes as the
// builder: boxed nullable arrays (Long[], Integer[], Boolean[], …) become NullableData,
// raw primitive arrays pass through. Done before the row-count check so length validation
// and encoding both see the normalized carrier.
Map<String, Object> adapted = new LinkedHashMap<>();
for (int i = 0; i < schema.fieldNames().size(); i++) {
String colName = schema.fieldNames().get(i);
Object data = columns.get(colName);
if (data == null) {
throw new IllegalArgumentException("missing column: " + colName);
}
adapted.put(colName, ChunkImpl.validateAndAdapt(colName, schema.fieldTypes().get(i), data));
}

// Pre-validate row counts so a length mismatch is rejected with a clear error
// before any data is serialised. Without this check, the writer would produce a
// file whose column chunks claim different row counts — readable but logically
Expand All @@ -382,11 +400,7 @@ public void writeChunk(Map<String, Object> columns) throws IOException {
String expectedFrom = null;
for (int i = 0; i < schema.fieldNames().size(); i++) {
String colName = schema.fieldNames().get(i);
Object data = columns.get(colName);
if (data == null) {
throw new IllegalArgumentException("missing column: " + colName);
}
long len = rowCountForValidation(colName, columns.get(colName));
long len = rowCountForValidation(colName, adapted.get(colName));
if (expectedLen < 0) {
expectedLen = len;
expectedFrom = colName;
Expand All @@ -400,7 +414,7 @@ public void writeChunk(Map<String, Object> columns) throws IOException {
for (int i = 0; i < schema.fieldNames().size(); i++) {
String colName = schema.fieldNames().get(i);
DType colDtype = schema.fieldTypes().get(i);
Object data = columns.get(colName);
Object data = adapted.get(colName);

// Auto-route extension columns: callers can pass List<LocalDate>, List<Instant>,
// etc., and we route through the matching spec extension to produce the int[] /
Expand Down
Loading