proto-rewrite: replace protobuf-java with MemorySegment-native codec#24
Merged
Conversation
Replaces the protobuf-java runtime with a custom .proto-to-Java generator that emits immutable records decoding directly off a MemorySegment slice (zero copy, no Unsafe, no Kotlin). Foundation done: - proto-gen module: lexer, parser, AST, type registry, code generator, CLI entry. 9 tests pass. - core proto runtime: ProtoReader (varint, sint64 zigzag, fixed32/64, string, bytes, len-delim segment slice, packed-repeated, skip-unknown) + ProtoWriter + WireType. 42 unit tests pass. - Generator emits 42 record/enum classes from vortex's 3 .proto files; all 45 (42 generated + 3 runtime) compile in isolation. - Maven regenerate-sources profile rewired: ./mvnw compile -pl proto-gen ./mvnw generate-sources -pl core -P regenerate-sources - Old DTypeProtos/ScalarProtos/EncodingProtos.java (34K LOC) deleted. In progress: - 3 of 25 consumer encodings rewritten (Sequence, Bool, Constant). - 22 remaining encodings + 22 tests still reference the old API. - protobuf-java still in deps until consumer rewrite + verify is done. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rewritten to use the MemorySegment-native record API instead of protobuf-java: ArrayStats, Sequence, Bool, Constant, Primitive, VarBin, Rle, Delta, Dict, Sparse, Zstd, FrameOfReference. Generator now stamps @generated("io.github.dfa1.vortex.protogen.CodeGen") on every emitted record + enum. Regenerated all 42 proto files. Remaining: 13 main encodings (Alp, AlpRd, Bitpacked, DateTimeParts, DecimalByteParts, Decimal, Fsst, List, ListView, Patched, Pco, RunEnd, Variant) + 22 test files + verify run + drop protobuf-java dep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Final 13 encodings rewritten: AlpEncoding, AlpRdEncoding, BitpackedEncoding, DateTimePartsEncoding, DecimalEncoding, DecimalBytePartsEncoding, FsstEncoding, ListEncoding, ListViewEncoding, PatchedEncoding, PcoEncoding, RunEndEncoding, VariantEncoding. Added two .proto messages for hand-parsed metadata that previously used CodedInputStream directly: PatchedMetadata, VariantMetadata. Both are now decoded via the generated records. Full reactor compiles (`./mvnw compile`). Tests still reference old API. Remaining: 23 test files + verify + drop protobuf-java dep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
All 25 main encoding files + 23 test files rewritten to use the generated MemorySegment-native record API. protobuf-java removed from core/reader/writer pom.xml + root dependencyManagement. Test results: ./mvnw verify → BUILD SUCCESS Unit: 872 tests, 0 failures Integration: 243 tests, 0 failures (incl. Rust round-trip) Wire-format compatibility with the Rust reference implementation confirmed by: RustWritesJavaReadsIntegrationTest (10 tests) JavaWritesRustReadsIntegrationTest (194 tests) RustJavaReaderComparisonIntegrationTest (25 tests) ParquetImportIntegrationTest (5 tests) CLI uber-jar: 14 MB → 12 MB. No more sun.misc.Unsafe warning on startup (was emitted by protobuf-java's UnsafeUtil). Generated classes carry @generated("io.github.dfa1.vortex.protogen.CodeGen"). Added two .proto messages for previously hand-parsed metadata: PatchedMetadata, VariantMetadata. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CLAUDE.md: - regenerate-sources flow: drop `brew install protobuf`, add `./mvnw compile -pl proto-gen` pre-step - metadata-only encoding example uses `.encode()` not `.toByteArray()` and includes the decode-side snippet - note `ProtoReader`/`ProtoWriter` are package-private + the `ofXxxValue` factory pattern for oneof messages SECURITY.md: - update threat-model intro to reference the in-tree `ProtoReader` instead of `protobuf-java` - supported versions bumped: 0.6.x supported, 0.5.x critical-only, < 0.5 EOL - drop `protobuf-java` from out-of-scope deps; add ProtoReader hardening to defensive guarantees (varint cap, truncated-len-delim, bounds) - exception list: `IOException` from proto reader replaces "raw Protobuf parser exceptions" CHANGELOG.md: - new [0.6.0] section: proto-rewrite summary, ProtoReader/Writer, proto-gen module, oneof factories, dependency removal (`protobuf-java`), wire-format compatibility evidence (Rust integration tests), perf note (within noise on bulk reads) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Blockers from PR #24 review: 1. Enum decode now raises IOException, not IllegalArgumentException, on unknown wire values. SECURITY.md promises PType ordinals are bounds-checked, but the generated `Enum.fromValue` threw IAE — which escaped the encoders' `catch (IOException)` blocks. The generator now wraps every `fromValue(int)` decode call in a try/catch that translates IAE → IOException with the original wire value in the message. 2. Generated records with `byte[]` components now override `equals`/`hashCode` with `Arrays.equals` / `Arrays.hashCode`. Records' auto-generated equality uses `Objects.equals`, which falls through to reference equality for arrays — so structurally equal records like `ScalarValue.ofBytesValue(new byte[]{1,2,3})` compared unequal. 3. SINGLE `string`/`bytes` fields no longer NPE on encode when null. The default-value check is now `var != null && !var.isEmpty()` and `var != null && var.length != 0`. Smaller follow-ups in the same pass: - Enum init uses the first declared constant (e.g. `PType.U8`) instead of `fromValue(0)`. Works for any enum, not just those with a zero entry. - `PROTO_PKG` hardcode dropped; the generator now honors each type's `option java_package` via `ResolvedType.javaPackage()`. - `TypeRegistry.resolve` fails on ambiguous suffix matches instead of returning whichever the HashMap iterator hit first. Lists all candidates in the error. - `ProtoWriter.ensure` clamps growth at `Integer.MAX_VALUE - 8` and pre-checks for overflow on `pos + extra`. - `Main` bounds-checks `--out` so a trailing `--out` prints usage instead of AIOOBE. - Stale "skip google.protobuf.NullValue" comment removed. Test additions in ProtoRuntimeTest: - VarintOverflow: 11 continuation bytes throws "varint overflow" - UnknownEnum: Primitive decode with PType=99 throws IOException - SingleNullEncode: Extension(null, null, null).encode() returns empty - ByteArrayEquality: structurally equal ScalarValues compare equal ./mvnw verify — all unit + integration tests pass (incl. Rust round-trips). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
protobuf-javadependency from core/reader/writer; replace with hand-rolled proto3 codec onMemorySegment(ProtoReader/ProtoWriter).proto-genmodule: lexer/parser/codegen producing@Generatedrecords + enums from.protofiles (42 messages regenerated).PatchedMetadataandVariantMetadata.proto messages for previously hand-parsed metadata.sun.misc.Unsafewarning (was from protobuf-javaUnsafeUtil).Wire-format compatibility
Confirmed against Rust reference impl:
RustWritesJavaReadsIntegrationTest(10)JavaWritesRustReadsIntegrationTest(194)RustJavaReaderComparisonIntegrationTest(25)ParquetImportIntegrationTest(5)Test plan
./mvnw verify— 872 unit + 243 integration, 0 failures@Generatedstamping🤖 Generated with Claude Code