64 changes: 64 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,70 @@ All notable changes to **vortex-java** are documented here.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.6.0] — Unreleased

The headline theme is the **proto-rewrite**: `protobuf-java` is dropped in favour of an
in-tree MemorySegment-native proto3 codec, generated from `.proto` schemas by a new
`proto-gen` module. CLI uber-jar shrinks ~14% and the JDK 25 `sun.misc.Unsafe` stderr
warning (emitted by `protobuf-java`'s `UnsafeUtil`) is gone.

### Added

- **`proto-gen` module** — build-time `.proto` to Java code generator. Lexer + parser +
type registry + emitter. Outputs one immutable Java `record` per message and one Java
`enum` per proto enum, each carrying a `@Generated("io.github.dfa1.vortex.protogen.CodeGen")`
annotation. Records expose `decode(MemorySegment, long, long)` static factories and
`encode()` instance methods that operate directly on a memory segment — zero `byte[]`
copy, no `protobuf-java` runtime.
- **`ProtoReader` / `ProtoWriter`** — package-private proto3 wire-format primitives
under `io.github.dfa1.vortex.proto`. Reads varint / sint64 / fixed32 / fixed64 /
length-delimited / packed-repeated payloads, with bounds checks and a 10-byte cap on
varint length. 42 unit tests cover happy path + truncation + bounds.
- **Oneof factories** on generated records (e.g. `ScalarValue.ofInt64Value(123L)`) —
avoids the 11-arg constructor for `ScalarValue`'s oneof.
- **`PatchedMetadata` / `VariantMetadata`** — added to `encodings.proto`. Previously
hand-parsed with `CodedInputStream`; now go through the generated record path.
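
The varint handling described for `ProtoReader` above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class `VarintSketch`, its method names, the use of `ByteBuffer` instead of `MemorySegment`, and the unchecked exceptions are all assumptions chosen to keep the example self-contained.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; not the ProtoReader shipped in this PR.
final class VarintSketch {
    // A 64-bit value needs at most 10 groups of 7 bits.
    private static final int MAX_VARINT_BYTES = 10;

    // Decodes a base-128 varint starting at `offset`, never reading at or past `limit`.
    static long readVarint(ByteBuffer buf, int offset, int limit) {
        long result = 0;
        for (int i = 0; i < MAX_VARINT_BYTES; i++) {
            int pos = offset + i;
            if (pos >= limit) {
                throw new IllegalArgumentException("truncated varint at offset " + pos);
            }
            byte b = buf.get(pos);
            result |= (long) (b & 0x7F) << (7 * i);
            if ((b & 0x80) == 0) {
                return result; // continuation bit clear: last byte of the varint
            }
        }
        throw new IllegalArgumentException("varint longer than 10 bytes");
    }

    // Maps a zigzag-encoded sint64 back to its signed value.
    static long decodeZigZag64(long n) {
        return (n >>> 1) ^ -(n & 1);
    }
}
```

With this sketch, the two bytes `0xAC 0x02` decode to 300, and a buffer that ends on a byte with the continuation bit set is rejected instead of being read past its end.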

### Changed

- **Build-time tooling**: `regenerate-sources` profile no longer shells out to `protoc`.
Run `./mvnw compile -pl proto-gen` once, then
`./mvnw generate-sources -pl core -P regenerate-sources`. `brew install protobuf` is
no longer needed for normal development.
- **Encoding consumers**: 25 encoding classes (`ALP`, `Bitpacked`, `Dict`, `Rle`,
`Sparse`, `Sequence`, etc.) and 23 test files rewritten to use the new record API.
Constructor calls are positional; field accessors follow proto3 snake_case
(`meta.bit_width()`, not `meta.getBitWidth()`).
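
To make the accessor change concrete, here is a small sketch of what a consumer might look like after the rewrite. `BitpackedMetadataSketch` and its two fields are hypothetical stand-ins; the real generated records and their field sets are not shown in this diff.

```java
// Hypothetical stand-in for a generated record; the real field set may differ.
record BitpackedMetadataSketch(int bit_width, long offset) {}

final class RecordApiSketch {
    // Builds a record positionally and reads a field through its snake_case accessor.
    static int bitWidthOf(int width) {
        BitpackedMetadataSketch meta = new BitpackedMetadataSketch(width, 0L);
        return meta.bit_width(); // protobuf-java would have exposed this as getBitWidth()
    }
}
```

Because record components are named after the proto fields, the accessor names follow the `.proto` schema directly rather than the camel-case getter convention.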

### Removed

- **`com.google.protobuf:protobuf-java`** dependency dropped from `core`, `reader`,
`writer`, and root `dependencyManagement`. The `protobuf.version` property is gone.
CLI uber-jar: **14 MB → 12 MB**. JDK 25 `sun.misc.Unsafe::arrayBaseOffset` stderr
warning emitted by `UnsafeUtil` on every cold start: **gone**.
- `protoc` no longer required by the build. `brew install flatbuffers` covers `.fbs`
edits; `.proto` edits use the in-process generator.

### Compatibility

Wire-format compatibility with the Rust reference implementation is unchanged and is
verified by the full integration suite:

- `RustWritesJavaReadsIntegrationTest` (10 tests) — Rust writes, Java reads
- `JavaWritesRustReadsIntegrationTest` (194 tests) — Java writes, JNI reads
- `RustJavaReaderComparisonIntegrationTest` (25 tests) — both readers, same file
- `ParquetImportIntegrationTest` (5 tests) — round-trip through ParquetImporter

All 872 unit + 243 integration tests pass on JDK 25.

### Performance

No measurable change on bulk-read benchmarks (`RustVsJavaReadBenchmark.javaReadCascading`
within 1% of main, stdev ±2 ops/s). Proto metadata parse is < 1% of work on multi-million-row
scans; the win is architectural, not throughput.

[0.6.0]: https://github.com/dfa1/vortex-java/compare/v0.5.0...main

## [0.5.0] — 2026-06-09

The headline themes are an **interactive inspector TUI** for navigating Vortex files
29 changes: 23 additions & 6 deletions CLAUDE.md
@@ -18,15 +18,24 @@ Never use `mvn install` or `./mvnw install`.

Generated sources (`fbs`/`proto` → Java) are committed under `core/src/main/java`.
Normal builds need no external tools.

Proto-to-Java generation is in-process via the `proto-gen` module (no `protoc` needed).
The generator emits one record per message with a `decode(MemorySegment, long, long)` static
factory and an `encode()` method that operate directly on a memory segment: no `byte[]`
copy, no `protobuf-java` runtime, no `sun.misc.Unsafe`.

To regenerate after editing `.fbs` or `.proto` schemas:

```bash
-brew install flatbuffers protobuf
+brew install flatbuffers # only needed for .fbs edits
+./mvnw compile -pl proto-gen # build the proto generator (only on .proto edits)
./mvnw generate-sources -pl core -P regenerate-sources
# then commit the updated files
```

Any `flatc` version works — the profile strips the version guard automatically.
`flatc` runs every time the profile is active; if you only changed `.proto` files, revert any
spurious `fbs/` diffs with `git checkout -- core/src/main/java/io/github/dfa1/vortex/fbs/`.

```bash
# Build all modules
@@ -276,18 +285,26 @@ Simple encodings (≤ ~80 lines total, e.g. `NullEncoding`, `BoolEncoding`) are

### Metadata-only encodings

-Some encodings store all data in protobuf metadata — no buffers, no children (e.g. `SequenceEncoding`).
+Some encodings store all data in proto3 metadata — no buffers, no children (e.g. `SequenceEncoding`).
Their `EncodeResult` uses an `EncodeNode` with `metadata` set and an empty `bufferIndices` array:

```java
-ByteBuffer metaBuf = ByteBuffer.wrap(meta.toByteArray());
+ByteBuffer metaBuf = ByteBuffer.wrap(meta.encode());
EncodeNode node = new EncodeNode(encodingId, metaBuf, new EncodeNode[0], new int[]{});
-return new
+return new EncodeResult(node, List.of(), null, null);
```

-EncodeResult(node, List.of(), null,null);
+The decoder reads back via `ctx.metadata()`, not `ctx.buffer(n)`:

```java
MemorySegment metaSeg = MemorySegment.ofBuffer(ctx.metadata().duplicate());
FooMetadata meta = FooMetadata.decode(metaSeg, 0, metaSeg.byteSize());
```

-The decoder reads back via `ctx.metadata()`, not `ctx.buffer(n)`.
+Generated proto records live in `io.github.dfa1.vortex.proto`. The runtime decoder
+(`ProtoReader`, `ProtoWriter`) is package-private — generated code calls it directly.
+For oneof messages (e.g. `ScalarValue`), prefer the static `ofXxxValue(v)` factory over
+the 11-arg constructor.
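
The oneof factory pattern mentioned above might look roughly like the sketch below. `ScalarValueSketch` is a hypothetical, trimmed-down stand-in with three arms instead of eleven; the nullable boxed components mirror how `ArrayStats` checks `sv.int64_value() != null` in this PR, but the generated code itself is not shown here and may differ.

```java
// Hypothetical, trimmed-down stand-in for a generated oneof record.
record ScalarValueSketch(Long int64_value, Double f64_value, String string_value) {

    // Each factory sets exactly one arm and leaves the others null.
    static ScalarValueSketch ofInt64Value(long v) {
        return new ScalarValueSketch(v, null, null);
    }

    static ScalarValueSketch ofF64Value(double v) {
        return new ScalarValueSketch(null, v, null);
    }

    static ScalarValueSketch ofStringValue(String v) {
        return new ScalarValueSketch(null, null, java.util.Objects.requireNonNull(v));
    }

    // Returns whichever arm is set, or null when none is.
    Object value() {
        if (int64_value != null) {
            return int64_value;
        }
        if (f64_value != null) {
            return f64_value;
        }
        return string_value;
    }
}
```

A call such as `ScalarValueSketch.ofInt64Value(123L)` reads more clearly than a positional constructor call padded with nulls, and it makes it harder to set two arms by accident.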

## Testing

27 changes: 16 additions & 11 deletions SECURITY.md
@@ -1,9 +1,10 @@
# Security Policy

`vortex-java` reads and writes the [Vortex columnar file format](https://github.com/vortex-data/vortex).
-The reader memory-maps and parses untrusted binary input — trailers, FlatBuffers, Protobuf
-metadata, and per-segment encoded data. Robustness against malformed input is treated as a
-correctness contract, not a best-effort feature.
+The reader memory-maps and parses untrusted binary input — trailers, FlatBuffers, proto3
+metadata (via the in-tree MemorySegment-native `ProtoReader` — no `protobuf-java` runtime),
+and per-segment encoded data. Robustness against malformed input is treated as a correctness
+contract, not a best-effort feature.

## Supported versions

@@ -12,9 +13,9 @@ only if the vulnerability is critical and the fix is mechanical.

| Version | Status |
| ------- | ----------------------- |
-| 0.4.x | Supported |
-| 0.3.x | Critical fixes only |
-| < 0.3 | End of life |
+| 0.6.x | Supported |
+| 0.5.x | Critical fixes only |
+| < 0.5 | End of life |

## Reporting a vulnerability

@@ -46,7 +47,7 @@ In scope:
- Any malformed `.vortex` input that causes the reader to throw an exception other than
`io.github.dfa1.vortex.core.VortexException` (e.g. `IndexOutOfBoundsException`,
`NegativeArraySizeException`, `OutOfMemoryError`, `StackOverflowError`, raw FlatBuffer
-runtime exceptions, raw Protobuf parser exceptions, or a JVM crash via the FFM layer).
+runtime exceptions, raw `IOException` from the proto3 reader, or a JVM crash via the FFM layer).
- Any malformed `.vortex` input that causes the reader to allocate memory disproportionate
to its on-disk size (zip-bomb-style amplification).
- Any malformed `.vortex` input that causes silent data corruption — wrong row count,
@@ -58,9 +59,10 @@ Out of scope:

- Denial of service from legitimately large inputs (multi-gigabyte files). Use the
resource caps in `ReadOptions` (planned) to bound them.
-- Vulnerabilities in third-party dependencies (`vortex-jni`, `zstd-jni`, FlatBuffers runtime,
-Protobuf runtime). Report those upstream; we'll bump the dependency once a fixed version
-is available.
+- Vulnerabilities in third-party dependencies (`vortex-jni`, `zstd-jni`, FlatBuffers runtime).
+Report those upstream; we'll bump the dependency once a fixed version is available.
+Vortex no longer depends on `protobuf-java` — proto3 parsing is handled by the in-tree
+`ProtoReader` (issues there are in scope).
- Performance regressions or correctness bugs unrelated to malformed input — please open
a regular issue.

@@ -76,9 +78,12 @@ exception**. Concretely:
self-referential FlatBuffer cycles).
- Layout metadata is capped at 4 MiB.
- `Decimal` precision is restricted to `[1, 38]`; `scale` to `[0, precision]`.
-- `PType` ordinals from Protobuf are bounds-checked.
+- `PType` ordinals from proto3 are bounds-checked.
- `ConstantEncoding` and dict-layout decode allocate `O(1)` memory regardless of the
declared row count (zip-bomb mitigation).
- `ProtoReader` enforces varint length ≤ 10 bytes, rejects truncated len-delim regions,
and validates segment bounds on every read. (0.6.0+ — replaces the `protobuf-java`
parser path; same exception contract.)
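
The rejection of truncated length-delimited regions can be illustrated with a short sketch. It is not the `ProtoReader` from this PR: `LenDelimSketch`, the `ByteBuffer`-based signature, and the unchecked exception type are assumptions made so the example runs on its own.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; not the ProtoReader shipped in this PR.
final class LenDelimSketch {
    // Reads a length-delimited field at `offset` and returns a zero-copy view of its payload.
    static ByteBuffer readLenDelim(ByteBuffer buf, int offset, int limit) {
        long len = 0;
        int pos = offset;
        boolean terminated = false;
        // The length prefix is a varint of at most 10 bytes.
        for (int i = 0; i < 10 && !terminated; i++) {
            if (pos >= limit) {
                throw new IllegalArgumentException("truncated length prefix");
            }
            byte b = buf.get(pos++);
            len |= (long) (b & 0x7F) << (7 * i);
            terminated = (b & 0x80) == 0;
        }
        if (!terminated) {
            throw new IllegalArgumentException("length prefix longer than 10 bytes");
        }
        // Compare against the bytes that remain instead of computing pos + len,
        // which could overflow for a hostile length.
        if (len < 0 || len > limit - pos) {
            throw new IllegalArgumentException("declared length " + len + " exceeds remaining bytes");
        }
        return buf.slice(pos, (int) len);
    }
}
```

The important design point is that the declared length is validated before any view or allocation is created, so a forged length cannot trigger an out-of-bounds read or an oversized allocation.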

The regression suite lives under `reader/src/test/java/.../*SecurityTest`. Run with
`./mvnw test -Dtest='*SecurityTest'`.
25 changes: 15 additions & 10 deletions core/pom.xml
@@ -20,10 +20,6 @@
<groupId>com.google.flatbuffers</groupId>
<artifactId>flatbuffers-java</artifactId>
</dependency>
-<dependency>
-<groupId>com.google.protobuf</groupId>
-<artifactId>protobuf-java</artifactId>
-</dependency>
<dependency>
<groupId>io.airlift</groupId>
<artifactId>aircompressor-v3</artifactId>
@@ -56,7 +52,7 @@
Normal builds need no external tools.

To regenerate after editing .fbs or .proto schemas:
-brew install flatbuffers protobuf
+brew install flatbuffers
./mvnw generate-sources -pl core -P regenerate-sources
Then commit the updated files.
Any flatc version works — the profile strips the version guard automatically.
@@ -93,18 +89,27 @@
</arguments>
</configuration>
</execution>
-<!-- 2. Generate Java from Protobuf schemas -->
+<!--
+2. Generate MemorySegment-native Java from Protobuf schemas via vortex-proto-gen.
+
+Pre-step: run `./mvnw compile -pl proto-gen` once so this exec finds the classes.
+We use a direct exec (rather than declaring vortex-proto-gen as a Maven dep) to avoid
+an artificial provided-scope dep that would leak into the published core POM.
+-->
<execution>
-<id>protoc-generate</id>
+<id>protogen-generate</id>
<phase>generate-sources</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
-<executable>protoc</executable>
+<executable>java</executable>
<arguments>
-<argument>--java_out=${project.basedir}/src/main/java</argument>
-<argument>--proto_path=${project.basedir}/src/main/proto</argument>
+<argument>-cp</argument>
+<argument>${project.basedir}/../proto-gen/target/classes</argument>
+<argument>io.github.dfa1.vortex.protogen.Main</argument>
+<argument>--out</argument>
+<argument>${project.basedir}/src/main/java/io/github/dfa1/vortex/proto</argument>
<argument>${project.basedir}/src/main/proto/dtype.proto</argument>
<argument>${project.basedir}/src/main/proto/scalar.proto</argument>
<argument>${project.basedir}/src/main/proto/encodings.proto</argument>
43 changes: 29 additions & 14 deletions core/src/main/java/io/github/dfa1/vortex/core/ArrayStats.java
@@ -1,9 +1,11 @@
package io.github.dfa1.vortex.core;

-import com.google.protobuf.InvalidProtocolBufferException;
-import io.github.dfa1.vortex.proto.ScalarProtos;
+import io.github.dfa1.vortex.proto.ScalarValue;

+import java.io.IOException;
+import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;
+import java.nio.charset.StandardCharsets;

/// Per-array statistics embedded in the encoding tree.
///
@@ -52,18 +54,31 @@ private static Object decodeScalar(ByteBuffer bytes) {
             return null;
         }
         try {
-            ScalarProtos.ScalarValue sv = ScalarProtos.ScalarValue.parseFrom(bytes.duplicate());
-            return switch (sv.getKindCase()) {
-                case INT64_VALUE -> sv.getInt64Value();
-                case UINT64_VALUE -> sv.getUint64Value();
-                case F32_VALUE -> sv.getF32Value();
-                case F64_VALUE -> sv.getF64Value();
-                case BOOL_VALUE -> sv.getBoolValue();
-                case STRING_VALUE -> sv.getStringValue();
-                case BYTES_VALUE -> sv.getBytesValue().toStringUtf8();
-                default -> null;
-            };
-        } catch (InvalidProtocolBufferException e) {
+            MemorySegment seg = MemorySegment.ofBuffer(bytes.duplicate());
+            ScalarValue sv = ScalarValue.decode(seg, 0, seg.byteSize());
+            if (sv.int64_value() != null) {
+                return sv.int64_value();
+            }
+            if (sv.uint64_value() != null) {
+                return sv.uint64_value();
+            }
+            if (sv.f32_value() != null) {
+                return sv.f32_value();
+            }
+            if (sv.f64_value() != null) {
+                return sv.f64_value();
+            }
+            if (sv.bool_value() != null) {
+                return sv.bool_value();
+            }
+            if (sv.string_value() != null) {
+                return sv.string_value();
+            }
+            if (sv.bytes_value() != null) {
+                return new String(sv.bytes_value(), StandardCharsets.UTF_8);
+            }
+            return null;
+        } catch (IOException e) {
             throw new VortexException("invalid scalar value in array stats", e);
         }
     }