dfa1 · Jun 10, 2026 · Jun 9, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,70 @@ All notable changes to **vortex-java** are documented here.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.6.0] — Unreleased
+
+The headline theme is the **proto-rewrite**: `protobuf-java` is dropped in favour of an
+in-tree MemorySegment-native proto3 codec, generated from `.proto` schemas by a new
+`proto-gen` module. The CLI uber-jar shrinks by ~14%, and the JDK 25 `sun.misc.Unsafe` stderr
+warning (emitted by `protobuf-java`'s `UnsafeUtil`) is gone.
+
+### Added
+
+- **`proto-gen` module** — build-time `.proto` to Java code generator. Lexer + parser +
+  type registry + emitter. Outputs one immutable Java `record` per message and one Java
+  `enum` per proto enum, each carrying a `@Generated("io.github.dfa1.vortex.protogen.CodeGen")`
+  annotation. Records expose `decode(MemorySegment, long, long)` static factories and
+  `encode()` instance methods that operate directly on a memory segment — zero `byte[]`
+  copy, no `protobuf-java` runtime.
+- **`ProtoReader` / `ProtoWriter`** — package-private proto3 wire-format primitives
+  under `io.github.dfa1.vortex.proto`. Reads varint / sint64 / fixed32 / fixed64 /
+  length-delimited / packed-repeated payloads, with bounds checks and a 10-byte cap on
+  varint length. 42 unit tests cover happy path + truncation + bounds.
+- **Oneof factories** on generated records (e.g. `ScalarValue.ofInt64Value(123L)`) —
+  avoids the 11-arg constructor for `ScalarValue`'s oneof.
+- **`PatchedMetadata` / `VariantMetadata`** — added to `encodings.proto`. Previously
+  hand-parsed with `CodedInputStream`; now go through the generated record path.
+
+### Changed
+
+- **Build-time tooling**: `regenerate-sources` profile no longer shells out to `protoc`.
+  Run `./mvnw compile -pl proto-gen` once, then
+  `./mvnw generate-sources -pl core -P regenerate-sources`. `brew install protobuf` is
+  no longer needed for normal development.
+- **Encoding consumers**: 25 encoding classes (`ALP`, `Bitpacked`, `Dict`, `Rle`,
+  `Sparse`, `Sequence`, etc.) and 23 test files rewritten to use the new record API.
+  Constructor calls are positional; field accessors follow proto3 snake_case
+  (`meta.bit_width()`, not `meta.getBitWidth()`).
+
+### Removed
+
+- **`com.google.protobuf:protobuf-java`** dependency dropped from `core`, `reader`,
+  `writer`, and root `dependencyManagement`. The `protobuf.version` property is gone.
+  CLI uber-jar: **14 MB → 12 MB**. JDK 25 `sun.misc.Unsafe::arrayBaseOffset` stderr
+  warning emitted by `UnsafeUtil` on every cold start: **gone**.
+- `protoc` no longer required by the build. `brew install flatbuffers` covers `.fbs`
+  edits; `.proto` edits use the in-process generator.
+
+### Compatibility
+
+Wire-format compatibility with the Rust reference implementation is unchanged and is
+verified by the full integration suite:
+
+- `RustWritesJavaReadsIntegrationTest` (10 tests) — Rust writes, Java reads
+- `JavaWritesRustReadsIntegrationTest` (194 tests) — Java writes, JNI reads
+- `RustJavaReaderComparisonIntegrationTest` (25 tests) — both readers, same file
+- `ParquetImportIntegrationTest` (5 tests) — round-trip through ParquetImporter
+
+All 872 unit + 243 integration tests pass on JDK 25.
+
+### Performance
+
+No measurable change on bulk-read benchmarks (`RustVsJavaReadBenchmark.javaReadCascading`
+within 1% of main, stdev ±2 ops/s). Proto metadata parse is < 1% of work on multi-million-row
+scans; the win is architectural, not throughput.
+
+[0.6.0]: https://github.com/dfa1/vortex-java/compare/v0.5.0...main
+
 ## [0.5.0] — 2026-06-09
 
 The headline themes are an **interactive inspector TUI** for navigating Vortex files
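
Reviewer note: a minimal sketch of what the oneof factories and snake_case accessors described in the 0.6.0 entry might look like. `ScalarValueSketch` and its three components are hypothetical stand-ins; the generated `ScalarValue` is said to have 11 components and its exact shape is not shown in this diff. The sketch assumes unset oneof members are represented as `null`, which is consistent with the null checks in the `ArrayStats` hunk.

```java
// Hypothetical stand-in for a generated oneof record; not the actual generated code.
record ScalarValueSketch(Long int64_value, Double f64_value, String string_value) {

    // One static factory per oneof member: sets that member, leaves the rest null.
    static ScalarValueSketch ofInt64Value(long v) {
        return new ScalarValueSketch(v, null, null);
    }

    static ScalarValueSketch ofF64Value(double v) {
        return new ScalarValueSketch(null, v, null);
    }

    static ScalarValueSketch ofStringValue(String v) {
        return new ScalarValueSketch(null, null, v);
    }

    // Returns whichever member is set, or null if none is.
    Object value() {
        if (int64_value != null) return int64_value;
        if (f64_value != null) return f64_value;
        return string_value;
    }
}
```

With this shape, `ScalarValueSketch.ofInt64Value(123L).int64_value()` returns the boxed value and every other accessor returns `null`, so callers never have to count positional `null` arguments.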

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -18,15 +18,24 @@ Never use `mvn install` or `./mvnw install`.
 
 Generated sources (`fbs`/`proto` → Java) are committed under `core/src/main/java`.
 Normal builds need no external tools.
+
+Proto-to-Java generation is in-process via the `proto-gen` module (no `protoc` needed).
+The generator emits one record per message with a `decode(MemorySegment, long, long)` static
+factory and an `encode()` method that operate directly on a memory segment: no `byte[]`
+copy, no `protobuf-java` runtime, no `sun.misc.Unsafe`.
+
 To regenerate after editing `.fbs` or `.proto` schemas:
 
 ```bash
-brew install flatbuffers protobuf
+brew install flatbuffers              # only needed for .fbs edits
+./mvnw compile -pl proto-gen          # build the proto generator (only on .proto edits)
 ./mvnw generate-sources -pl core -P regenerate-sources
 # then commit the updated files
 ```
 
 Any `flatc` version works — the profile strips the version guard automatically.
+`flatc` runs every time the profile is active; if you only changed `.proto` files, revert any
+spurious `fbs/` diffs with `git checkout -- core/src/main/java/io/github/dfa1/vortex/fbs/`.
 
 ```bash
 # Build all modules
@@ -276,18 +285,26 @@ Simple encodings (≤ ~80 lines total, e.g. `NullEncoding`, `BoolEncoding`) are
 
 ### Metadata-only encodings
 
-Some encodings store all data in protobuf metadata — no buffers, no children (e.g. `SequenceEncoding`).
+Some encodings store all data in proto3 metadata — no buffers, no children (e.g. `SequenceEncoding`).
 Their `EncodeResult` uses an `EncodeNode` with `metadata` set and an empty `bufferIndices` array:
 
 ```java
-ByteBuffer metaBuf = ByteBuffer.wrap(meta.toByteArray());
+ByteBuffer metaBuf = ByteBuffer.wrap(meta.encode());
 EncodeNode node = new EncodeNode(encodingId, metaBuf, new EncodeNode[0], new int[]{});
-return new
+return new EncodeResult(node, List.of(), null, null);
+```
 
-EncodeResult(node, List.of(), null,null);
+The decoder reads back via `ctx.metadata()`, not `ctx.buffer(n)`:
+
+```java
+MemorySegment metaSeg = MemorySegment.ofBuffer(ctx.metadata().duplicate());
+FooMetadata meta = FooMetadata.decode(metaSeg, 0, metaSeg.byteSize());
 ```
 
-The decoder reads back via `ctx.metadata()`, not `ctx.buffer(n)`.
+Generated proto records live in `io.github.dfa1.vortex.proto`. The runtime codec
+(`ProtoReader`, `ProtoWriter`) is package-private; generated code in the same package calls it directly.
+For oneof messages (e.g. `ScalarValue`), prefer the static `ofXxxValue(v)` factory over
+the 11-arg constructor.
 
 ## Testing
 

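Reviewer note: for readers unfamiliar with what `meta.encode()` has to produce for a metadata-only encoding, here is a minimal sketch of proto3 varint-field encoding. `VarintFieldSketch` is hypothetical and is not the in-tree `ProtoWriter`; it writes to a `ByteArrayOutputStream` rather than a memory segment to stay self-contained. The field number used in the usage note is an assumption for illustration, not taken from `encodings.proto`.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of proto3 varint-field encoding; not the in-tree ProtoWriter.
final class VarintFieldSketch {

    private static final int WIRE_TYPE_VARINT = 0;

    // Writes v as a base-128 varint: 7 payload bits per byte, high bit set when more bytes follow.
    static void writeVarint(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7; // unsigned shift so negative values terminate after 10 bytes
        }
        out.write((int) v);
    }

    // Encodes a single varint field: key = (fieldNumber << 3) | wireType, then the value.
    static byte[] encodeVarintField(int fieldNumber, long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, ((long) fieldNumber << 3) | WIRE_TYPE_VARINT);
        writeVarint(out, value);
        return out.toByteArray();
    }
}
```

Assuming a message whose only field is a varint with field number 1, `VarintFieldSketch.encodeVarintField(1, 300)` yields the three bytes `08 AC 02`: one key byte followed by the two-byte varint for 300.
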
diff --git a/SECURITY.md b/SECURITY.md
@@ -1,9 +1,10 @@
 # Security Policy
 
 `vortex-java` reads and writes the [Vortex columnar file format](https://github.com/vortex-data/vortex).
-The reader memory-maps and parses untrusted binary input — trailers, FlatBuffers, Protobuf
-metadata, and per-segment encoded data. Robustness against malformed input is treated as a
-correctness contract, not a best-effort feature.
+The reader memory-maps and parses untrusted binary input — trailers, FlatBuffers, proto3
+metadata (via the in-tree MemorySegment-native `ProtoReader` — no `protobuf-java` runtime),
+and per-segment encoded data. Robustness against malformed input is treated as a correctness
+contract, not a best-effort feature.
 
 ## Supported versions
 
@@ -12,9 +13,9 @@ only if the vulnerability is critical and the fix is mechanical.
 
 | Version | Status                  |
 | ------- | ----------------------- |
-| 0.4.x   | Supported               |
-| 0.3.x   | Critical fixes only     |
-| < 0.3   | End of life             |
+| 0.6.x   | Supported               |
+| 0.5.x   | Critical fixes only     |
+| < 0.5   | End of life             |
 
 ## Reporting a vulnerability
 
@@ -46,7 +47,7 @@ In scope:
 - Any malformed `.vortex` input that causes the reader to throw an exception other than
   `io.github.dfa1.vortex.core.VortexException` (e.g. `IndexOutOfBoundsException`,
   `NegativeArraySizeException`, `OutOfMemoryError`, `StackOverflowError`, raw FlatBuffer
-  runtime exceptions, raw Protobuf parser exceptions, or a JVM crash via the FFM layer).
+  runtime exceptions, raw `IOException` from the proto3 reader, or a JVM crash via the FFM layer).
 - Any malformed `.vortex` input that causes the reader to allocate memory disproportionate
   to its on-disk size (zip-bomb-style amplification).
 - Any malformed `.vortex` input that causes silent data corruption — wrong row count,
@@ -58,9 +59,10 @@ Out of scope:
 
 - Denial of service from legitimately large inputs (multi-gigabyte files). Use the
   resource caps in `ReadOptions` (planned) to bound them.
-- Vulnerabilities in third-party dependencies (`vortex-jni`, `zstd-jni`, FlatBuffers runtime,
-  Protobuf runtime). Report those upstream; we'll bump the dependency once a fixed version
-  is available.
+- Vulnerabilities in third-party dependencies (`vortex-jni`, `zstd-jni`, FlatBuffers runtime).
+  Report those upstream; we'll bump the dependency once a fixed version is available.
+  `vortex-java` no longer depends on `protobuf-java`; proto3 parsing is handled by the in-tree
+  `ProtoReader` (issues there are in scope).
 - Performance regressions or correctness bugs unrelated to malformed input — please open
   a regular issue.
 
@@ -76,9 +78,12 @@ exception**. Concretely:
   self-referential FlatBuffer cycles).
 - Layout metadata is capped at 4 MiB.
 - `Decimal` precision is restricted to `[1, 38]`; `scale` to `[0, precision]`.
-- `PType` ordinals from Protobuf are bounds-checked.
+- `PType` ordinals from proto3 are bounds-checked.
 - `ConstantEncoding` and dict-layout decode allocate `O(1)` memory regardless of the
   declared row count (zip-bomb mitigation).
+- `ProtoReader` enforces varint length ≤ 10 bytes, rejects truncated length-delimited regions,
+  and validates segment bounds on every read. (Since 0.6.0; replaces the `protobuf-java`
+  parser path with the same exception contract.)
 
 The regression suite lives under `reader/src/test/java/.../*SecurityTest`. Run with
 `./mvnw test -Dtest='*SecurityTest'`.
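
Reviewer note: the `ProtoReader` hardening bullet (10-byte varint cap, truncation rejection, bounds checks) can be illustrated with a minimal sketch. `HardenedReadSketch` is hypothetical and is not the in-tree implementation: the real reader is described as operating on `MemorySegment`, whereas this sketch uses `ByteBuffer` purely to stay self-contained, and its error messages are invented for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical sketch of bounds-checked proto3 reads; not the in-tree ProtoReader.
final class HardenedReadSketch {

    private static final int MAX_VARINT_BYTES = 10;

    // Reads a base-128 varint at buf.position(), rejecting truncated or over-long input.
    static long readVarint(ByteBuffer buf) throws IOException {
        long result = 0;
        for (int i = 0; i < MAX_VARINT_BYTES; i++) {
            if (!buf.hasRemaining()) {
                throw new IOException("truncated varint");
            }
            byte b = buf.get();
            result |= (long) (b & 0x7F) << (7 * i);
            if ((b & 0x80) == 0) {
                return result; // high bit clear: this was the last byte
            }
        }
        throw new IOException("varint longer than " + MAX_VARINT_BYTES + " bytes");
    }

    // Reads a length prefix and returns a view of exactly that many bytes,
    // rejecting lengths that run past the end of the buffer.
    static ByteBuffer readLengthDelimited(ByteBuffer buf) throws IOException {
        long len = readVarint(buf);
        if (len < 0 || len > buf.remaining()) {
            throw new IOException("length-delimited field exceeds buffer: " + len);
        }
        ByteBuffer slice = buf.slice();
        slice.limit((int) len);
        buf.position(buf.position() + (int) len);
        return slice;
    }
}
```

The length check runs before any slice or copy is made, so a declared length larger than the remaining input fails fast instead of driving an allocation. In `vortex-java` the resulting `IOException` would presumably be wrapped in `VortexException`, as the `ArrayStats` hunk does.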

diff --git a/core/pom.xml b/core/pom.xml
@@ -20,10 +20,6 @@
 			<groupId>com.google.flatbuffers</groupId>
 			<artifactId>flatbuffers-java</artifactId>
 		</dependency>
-		<dependency>
-			<groupId>com.google.protobuf</groupId>
-			<artifactId>protobuf-java</artifactId>
-		</dependency>
 		<dependency>
 			<groupId>io.airlift</groupId>
 			<artifactId>aircompressor-v3</artifactId>
@@ -56,7 +52,7 @@
 	  Normal builds need no external tools.
 
 	  To regenerate after editing .fbs or .proto schemas:
-	    brew install flatbuffers protobuf
+	    brew install flatbuffers
 	    ./mvnw generate-sources -pl core -P regenerate-sources
 	  Then commit the updated files.
 	  Any flatc version works — the profile strips the version guard automatically.
@@ -93,18 +89,27 @@
 									</arguments>
 								</configuration>
 							</execution>
-							<!-- 2. Generate Java from Protobuf schemas -->
+							<!--
+								2. Generate MemorySegment-native Java from Protobuf schemas via vortex-proto-gen.
+
+								Pre-step: run `./mvnw compile -pl proto-gen` once so this exec finds the classes.
+								We use a direct exec (rather than declaring vortex-proto-gen as a Maven dep) to avoid
+								an artificial provided-scope dep that would leak into the published core POM.
+							-->
 							<execution>
-								<id>protoc-generate</id>
+								<id>protogen-generate</id>
 								<phase>generate-sources</phase>
 								<goals>
 									<goal>exec</goal>
 								</goals>
 								<configuration>
-									<executable>protoc</executable>
+									<executable>java</executable>
 									<arguments>
-										<argument>--java_out=${project.basedir}/src/main/java</argument>
-										<argument>--proto_path=${project.basedir}/src/main/proto</argument>
+										<argument>-cp</argument>
+										<argument>${project.basedir}/../proto-gen/target/classes</argument>
+										<argument>io.github.dfa1.vortex.protogen.Main</argument>
+										<argument>--out</argument>
+										<argument>${project.basedir}/src/main/java/io/github/dfa1/vortex/proto</argument>
 										<argument>${project.basedir}/src/main/proto/dtype.proto</argument>
 										<argument>${project.basedir}/src/main/proto/scalar.proto</argument>
 										<argument>${project.basedir}/src/main/proto/encodings.proto</argument>

diff --git a/core/src/main/java/io/github/dfa1/vortex/core/ArrayStats.java b/core/src/main/java/io/github/dfa1/vortex/core/ArrayStats.java
@@ -1,9 +1,11 @@
 package io.github.dfa1.vortex.core;
 
-import com.google.protobuf.InvalidProtocolBufferException;
-import io.github.dfa1.vortex.proto.ScalarProtos;
+import io.github.dfa1.vortex.proto.ScalarValue;
 
+import java.io.IOException;
+import java.lang.foreign.MemorySegment;
 import java.nio.ByteBuffer;
+import java.nio.charset.StandardCharsets;
 
 /// Per-array statistics embedded in the encoding tree.
 ///
@@ -52,18 +54,31 @@ private static Object decodeScalar(ByteBuffer bytes) {
             return null;
         }
         try {
-            ScalarProtos.ScalarValue sv = ScalarProtos.ScalarValue.parseFrom(bytes.duplicate());
-            return switch (sv.getKindCase()) {
-                case INT64_VALUE -> sv.getInt64Value();
-                case UINT64_VALUE -> sv.getUint64Value();
-                case F32_VALUE -> sv.getF32Value();
-                case F64_VALUE -> sv.getF64Value();
-                case BOOL_VALUE -> sv.getBoolValue();
-                case STRING_VALUE -> sv.getStringValue();
-                case BYTES_VALUE -> sv.getBytesValue().toStringUtf8();
-                default -> null;
-            };
-        } catch (InvalidProtocolBufferException e) {
+            MemorySegment seg = MemorySegment.ofBuffer(bytes.duplicate());
+            ScalarValue sv = ScalarValue.decode(seg, 0, seg.byteSize());
+            if (sv.int64_value() != null) {
+                return sv.int64_value();
+            }
+            if (sv.uint64_value() != null) {
+                return sv.uint64_value();
+            }
+            if (sv.f32_value() != null) {
+                return sv.f32_value();
+            }
+            if (sv.f64_value() != null) {
+                return sv.f64_value();
+            }
+            if (sv.bool_value() != null) {
+                return sv.bool_value();
+            }
+            if (sv.string_value() != null) {
+                return sv.string_value();
+            }
+            if (sv.bytes_value() != null) {
+                return new String(sv.bytes_value(), StandardCharsets.UTF_8);
+            }
+            return null;
+        } catch (IOException e) {
             throw new VortexException("invalid scalar value in array stats", e);
         }
     }