Skip to content

proto-rewrite: replace protobuf-java with MemorySegment-native codec#24

Merged
dfa1 merged 6 commits into
mainfrom
proto-rewrite
Jun 10, 2026
Merged

proto-rewrite: replace protobuf-java with MemorySegment-native codec#24
dfa1 merged 6 commits into
mainfrom
proto-rewrite

Conversation

@dfa1

@dfa1 dfa1 commented Jun 10, 2026

Copy link
Copy Markdown
Owner

Summary

  • Drop protobuf-java dependency from core/reader/writer; replace with hand-rolled proto3 codec on MemorySegment (ProtoReader/ProtoWriter).
  • New proto-gen module: lexer/parser/codegen producing @Generated records + enums from .proto files (42 messages regenerated).
  • All 25 encoding impls + 23 encoding tests migrated to the generated record API. Added PatchedMetadata and VariantMetadata .proto messages for previously hand-parsed metadata.
  • CLI uber-jar 14 MB → 12 MB; no more sun.misc.Unsafe warning (was from protobuf-java UnsafeUtil).

Wire-format compatibility

Confirmed against Rust reference impl:

  • RustWritesJavaReadsIntegrationTest (10)
  • JavaWritesRustReadsIntegrationTest (194)
  • RustJavaReaderComparisonIntegrationTest (25)
  • ParquetImportIntegrationTest (5)

Test plan

  • ./mvnw verify — 872 unit + 243 integration, 0 failures
  • Rust round-trip integration tests pass
  • Reviewer: confirm CLAUDE.md/SECURITY.md/CHANGELOG updates match scope
  • Reviewer: spot-check generated code style + @Generated stamping

🤖 Generated with Claude Code

dfa1 and others added 6 commits June 10, 2026 08:13
Replaces the protobuf-java runtime with a custom .proto-to-Java
generator that emits immutable records decoding directly off a
MemorySegment slice (zero copy, no Unsafe, no Kotlin).

Foundation done:
- proto-gen module: lexer, parser, AST, type registry, code generator,
  CLI entry. 9 tests pass.
- core proto runtime: ProtoReader (varint, sint64 zigzag, fixed32/64,
  string, bytes, len-delim segment slice, packed-repeated, skip-unknown)
  + ProtoWriter + WireType. 42 unit tests pass.
- Generator emits 42 record/enum classes from vortex's 3 .proto files;
  all 45 (42 generated + 3 runtime) compile in isolation.
- Maven regenerate-sources profile rewired:
  ./mvnw compile -pl proto-gen
  ./mvnw generate-sources -pl core -P regenerate-sources
- Old DTypeProtos/ScalarProtos/EncodingProtos.java (34K LOC) deleted.

In progress:
- 3 of 25 consumer encodings rewritten (Sequence, Bool, Constant).
- 22 remaining encodings + 22 tests still reference the old API.
- protobuf-java still in deps until consumer rewrite + verify is done.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rewritten to use the MemorySegment-native record API instead of protobuf-java:
ArrayStats, Sequence, Bool, Constant, Primitive, VarBin, Rle, Delta,
Dict, Sparse, Zstd, FrameOfReference.

Generator now stamps @generated("io.github.dfa1.vortex.protogen.CodeGen")
on every emitted record + enum. Regenerated all 42 proto files.

Remaining: 13 main encodings (Alp, AlpRd, Bitpacked, DateTimeParts,
DecimalByteParts, Decimal, Fsst, List, ListView, Patched, Pco, RunEnd,
Variant) + 22 test files + verify run + drop protobuf-java dep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Final 13 encodings rewritten: AlpEncoding, AlpRdEncoding, BitpackedEncoding,
DateTimePartsEncoding, DecimalEncoding, DecimalBytePartsEncoding,
FsstEncoding, ListEncoding, ListViewEncoding, PatchedEncoding, PcoEncoding,
RunEndEncoding, VariantEncoding.

Added two .proto messages for hand-parsed metadata that previously used
CodedInputStream directly: PatchedMetadata, VariantMetadata. Both are now
decoded via the generated records.

Full reactor compiles (`./mvnw compile`). Tests still reference old API.

Remaining: 23 test files + verify + drop protobuf-java dep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
All 25 main encoding files + 23 test files rewritten to use the
generated MemorySegment-native record API. protobuf-java removed
from core/reader/writer pom.xml + root dependencyManagement.

Test results:
  ./mvnw verify → BUILD SUCCESS
  Unit:       872 tests, 0 failures
  Integration: 243 tests, 0 failures (incl. Rust round-trip)

Wire-format compatibility with the Rust reference implementation
confirmed by:
  RustWritesJavaReadsIntegrationTest   (10 tests)
  JavaWritesRustReadsIntegrationTest  (194 tests)
  RustJavaReaderComparisonIntegrationTest (25 tests)
  ParquetImportIntegrationTest          (5 tests)

CLI uber-jar: 14 MB → 12 MB. No more sun.misc.Unsafe warning on
startup (was emitted by protobuf-java's UnsafeUtil).

Generated classes carry @generated("io.github.dfa1.vortex.protogen.CodeGen").
Added two .proto messages for previously hand-parsed metadata:
PatchedMetadata, VariantMetadata.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CLAUDE.md:
- regenerate-sources flow: drop `brew install protobuf`, add
  `./mvnw compile -pl proto-gen` pre-step
- metadata-only encoding example uses `.encode()` not `.toByteArray()`
  and includes the decode-side snippet
- note `ProtoReader`/`ProtoWriter` are package-private + the
  `ofXxxValue` factory pattern for oneof messages

SECURITY.md:
- update threat-model intro to reference the in-tree `ProtoReader`
  instead of `protobuf-java`
- supported versions bumped: 0.6.x supported, 0.5.x critical-only,
  < 0.5 EOL
- drop `protobuf-java` from out-of-scope deps; add ProtoReader hardening
  to defensive guarantees (varint cap, truncated-len-delim, bounds)
- exception list: `IOException` from proto reader replaces
  "raw Protobuf parser exceptions"

CHANGELOG.md:
- new [0.6.0] section: proto-rewrite summary, ProtoReader/Writer,
  proto-gen module, oneof factories, dependency removal
  (`protobuf-java`), wire-format compatibility evidence (Rust
  integration tests), perf note (within noise on bulk reads)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Blockers from PR #24 review:

1. Enum decode now raises IOException, not IllegalArgumentException, on
   unknown wire values. SECURITY.md promises PType ordinals are
   bounds-checked, but the generated `Enum.fromValue` threw IAE — which
   escaped the encoders' `catch (IOException)` blocks. The generator now
   wraps every `fromValue(int)` decode call in a try/catch that
   translates IAE → IOException with the original wire value in the
   message.

2. Generated records with `byte[]` components now override
   `equals`/`hashCode` with `Arrays.equals` / `Arrays.hashCode`. Records'
   auto-generated equality uses `Objects.equals`, which falls through to
   reference equality for arrays — so structurally equal records like
   `ScalarValue.ofBytesValue(new byte[]{1,2,3})` compared unequal.

3. SINGLE `string`/`bytes` fields no longer NPE on encode when null.
   The default-value check is now `var != null && !var.isEmpty()` and
   `var != null && var.length != 0`.

Smaller follow-ups in the same pass:

- Enum init uses the first declared constant (e.g. `PType.U8`) instead
  of `fromValue(0)`. Works for any enum, not just those with a zero entry.
- `PROTO_PKG` hardcode dropped; the generator now honors each type's
  `option java_package` via `ResolvedType.javaPackage()`.
- `TypeRegistry.resolve` fails on ambiguous suffix matches instead of
  returning whichever the HashMap iterator hit first. Lists all
  candidates in the error.
- `ProtoWriter.ensure` clamps growth at `Integer.MAX_VALUE - 8` and
  pre-checks for overflow on `pos + extra`.
- `Main` bounds-checks `--out` so a trailing `--out` prints usage
  instead of AIOOBE.
- Stale "skip google.protobuf.NullValue" comment removed.

Test additions in ProtoRuntimeTest:
- VarintOverflow: 11 continuation bytes throws "varint overflow"
- UnknownEnum: Primitive decode with PType=99 throws IOException
- SingleNullEncode: Extension(null, null, null).encode() returns empty
- ByteArrayEquality: structurally equal ScalarValues compare equal

./mvnw verify — all unit + integration tests pass (incl. Rust round-trips).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dfa1 dfa1 merged commit b527f84 into main Jun 10, 2026
3 checks passed
@dfa1 dfa1 deleted the proto-rewrite branch June 10, 2026 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant