feat(msgpack): add magic-ID type-discriminant prefix to wire format [PoC] #63
GordonYuanyc wants to merge 6 commits into
…mpls

This is a proof-of-concept / starting point for making serialized sketch binaries self-describing. Every `to_msgpack()` output now starts with a single type-discriminant byte (analogous to Prometheus magic bytes for gauge/histogram), and `from_msgpack()` validates that byte before deserialising the payload.

Magic-ID table (stable, never reuse a value):
  0x01  HllSketch
  0x02  CountMinSketch
  0x03  CountMinSketchWithHeap
  0x04  CountSketch
  0x05  DdSketch
  0x06  KllSketch
  0x07  HydraKllSketch
  0x08  SetAggregator
  0x09  DeltaResult

Wire format after this change:
  [ magic_id: u8 | <rmp_serde msgpack payload> ]

API is unchanged: callers still call `to_msgpack()` / `from_msgpack()`. Round-trips still pass. The `BadMagicId` error variant surfaces mismatches at decode time instead of producing a confusing type error. All 426 unit tests pass.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…es on all sketch types

All native sketch serialization methods now also carry a magic-ID prefix (range 0x81–0x89, separate from the portable 0x01–0x09 range):
  0x81  CountMin<_, _> (native map-format)
  0x82  Count<_, _, _> (native map-format)
  0x83  CountL2HH / CMSHeap (native map-format)
  0x84  HyperLogLogImpl<_,_,_>
  0x85  HyperLogLogHIPImpl<_>
  0x86  DDSketch (native map-format)
  0x87  KLL<T>
  0x88  KLLDynamic<T>
  0x89  KMV

The portable KllSketch and HydraKllSketch wire formats embed KLL bytes via KLL::serialize_to_bytes / deserialize_from_bytes; those call sites pick up the magic byte automatically and the portable round-trips continue to work.

The native MessagePackCodec shims in message_pack_format/native/ delegate to these methods, so they too get the magic byte without any further changes.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- cargo fmt --all to fix formatting diffs caught by CI
- Add docs/msgpack-magic-ids.md documenting the full magic-ID table, the portable vs native distinction, the embedded-KLL relationship, and instructions for adding future sketch types

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Completes the magic-ID rollout across all serialize_to_bytes / deserialize_from_bytes call sites in the codebase. The EH files (eh.rs, eh_sketch_list.rs, eh_univ_optimized.rs) have no serialization methods and need no changes.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The native header grows from 1 byte to 2 bytes:
  [ family+mode byte | hasher byte | <rmp_serde named payload> ]

Key changes:

serialize_to_bytes / deserialize_from_bytes are now in Mode-specialized impl blocks instead of a single generic impl, so the first byte encodes both the sketch family and the phantom-type parameter:
  CountMin<_, RegularPath, _>       → 0x81
  CountMin<_, FastPath, _>          → 0x82
  Count<_, RegularPath, _>          → 0x83
  Count<_, FastPath, _>             → 0x84
  HyperLogLogImpl<Classic, _, _>    → 0x86
  HyperLogLogImpl<ErtlMLE, _, _>    → 0x87

A new hasher_magic_id() method on SketchHasher (default = 0xff = HASHER_UNKNOWN) fills the second byte. DefaultXxHasher returns 0x01 (HASHER_DEFAULT_XX). check_hasher_id<H>(stored) skips the check when either side is 0xff, so custom hashers interoperate without registering.

Types without an H parameter (DDSketch, KLL, KLLDynamic, KMV, Hydra, UnivMon, HyperLogLogHIPImpl) always write 0xff or HASHER_DEFAULT_XX as the second byte for a consistent 2-byte header across all native blobs.

The native MessagePackCodec shims in message_pack_format/native/ are updated to match the specialized impl split.

Docs updated: docs/msgpack-magic-ids.md now shows the 2-byte native header layout and the full updated ID table.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
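The 2-byte header check described in this commit could be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: the `SketchHasher` trait here is a stripped-down stand-in (the real one presumably also hashes, and `hasher_magic_id()` may take `&self`), and `CustomHasher`, `HeaderError`, and `strip_native_header` are hypothetical names. Only the constant values and the "skip when either side is 0xff" rule come from the commit message.

```rust
/// Sentinel for an unregistered hasher; value taken from the commit message.
pub const HASHER_UNKNOWN: u8 = 0xff;
/// ID reported by DefaultXxHasher, per the commit message.
pub const HASHER_DEFAULT_XX: u8 = 0x01;

/// Simplified stand-in for the crate's SketchHasher trait (hashing omitted).
pub trait SketchHasher {
    /// Assumed to be an associated function; defaults to "unknown".
    fn hasher_magic_id() -> u8 {
        HASHER_UNKNOWN
    }
}

pub struct DefaultXxHasher;
impl SketchHasher for DefaultXxHasher {
    fn hasher_magic_id() -> u8 {
        HASHER_DEFAULT_XX
    }
}

/// Hypothetical custom hasher that keeps the default (unregistered) ID.
pub struct CustomHasher;
impl SketchHasher for CustomHasher {}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    Truncated,
    BadMagicId { expected: u8, got: u8 },
    HasherMismatch { expected: u8, got: u8 },
}

/// Accept the stored hasher byte if either side is unknown or both agree.
pub fn check_hasher_id<H: SketchHasher>(stored: u8) -> Result<(), HeaderError> {
    let local = H::hasher_magic_id();
    if stored == HASHER_UNKNOWN || local == HASHER_UNKNOWN || stored == local {
        Ok(())
    } else {
        Err(HeaderError::HasherMismatch { expected: local, got: stored })
    }
}

/// Validate [ family+mode byte | hasher byte | payload ] and return the payload.
pub fn strip_native_header<H: SketchHasher>(
    family_mode: u8,
    bytes: &[u8],
) -> Result<&[u8], HeaderError> {
    match bytes {
        [got, hasher, payload @ ..] => {
            if *got != family_mode {
                return Err(HeaderError::BadMagicId { expected: family_mode, got: *got });
            }
            check_hasher_id::<H>(*hasher)?;
            Ok(payload)
        }
        // Fewer than two bytes: there is no complete header to inspect.
        _ => Err(HeaderError::Truncated),
    }
}
```

Because the family and mode share one byte, a FastPath blob (0x82) handed to a RegularPath decoder (expecting 0x81) fails at the first comparison, before any payload is parsed.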
Closes the coverage gap: the earlier split-impl work only had RegularPath round-trip tests. New tests:
- count_min_fast_path_round_trip_serialization: verifies magic bytes 0x82 / HASHER_DEFAULT_XX and that data survives serialize→deserialize
- count_min_mode_mismatch_is_rejected: FastPath bytes → RegularPath decoder must error
- count_sketch_fast_path_round_trip_serialization: same for Count
- count_sketch_mode_mismatch_is_rejected
- hll_magic_bytes_are_variant_specific: Classic=0x86, ErtlMLE=0x87
- hll_variant_mismatch_is_rejected: ErtlMLE bytes → Classic decoder must error

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
One thought: a 2-digit hex magic ID restricts us to 256 IDs. While this seems reasonable for the short term, it's not out of the question that the number of sketches may go beyond that. We can make it 3 digits, or, if we want to be fully future-proof (at the cost of some complexity), the magic ID can be preceded by a number that denotes the number of digits in the magic ID. Are there any references in other frameworks as to how this is handled?
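The length-prefix idea raised in this comment could look roughly like the sketch below. It is only an illustration under assumptions: the length is counted in bytes rather than hex digits, IDs are capped at 8 bytes so they fit in a `u64`, and `encode_id` / `decode_id` are hypothetical names that do not exist in this PR.

```rust
/// Encode `id` as [ len: u8 | id bytes, big-endian, leading zeros dropped ].
pub fn encode_id(id: u64) -> Vec<u8> {
    let be = id.to_be_bytes();
    // Skip leading zero bytes, but always keep at least one byte (for id == 0).
    let first = be.iter().position(|&b| b != 0).unwrap_or(be.len() - 1);
    let id_bytes = &be[first..];
    let mut out = Vec::with_capacity(1 + id_bytes.len());
    out.push(id_bytes.len() as u8);
    out.extend_from_slice(id_bytes);
    out
}

/// Decode a length-prefixed ID; returns the ID and the remaining payload,
/// or None if the prefix is malformed or the input is too short.
pub fn decode_id(bytes: &[u8]) -> Option<(u64, &[u8])> {
    let (&len, rest) = bytes.split_first()?;
    let len = len as usize;
    if len == 0 || len > 8 || rest.len() < len {
        return None;
    }
    let (id_bytes, payload) = rest.split_at(len);
    // Fold the big-endian bytes back into an integer.
    let id = id_bytes.iter().fold(0u64, |acc, &b| (acc << 8) | b as u64);
    Some((id, payload))
}
```

With this layout today's IDs cost two bytes instead of one (`0x01` becomes `[1, 0x01]`), and an ID above 255 simply uses a longer prefix.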
I agree that a single-byte sketch ID gives us an unnecessary 256-ID ceiling. Prometheus and OTAP both use lightweight metadata before the raw binary payload so the receiver can interpret the following bytes from the bytes themselves. Prometheus stores a chunk encoding before each chunk payload; OTAP wraps raw Arrow IPC bytes with schema_id/type metadata. For us, a thin wrapper around the existing MessagePack payload should be enough:
With this wrapper, the serialized bytes are self-describing enough for the deserializer to determine which sketch decoder to use.
Sounds good. This encoding scheme makes sense.
Summary
Adds a single leading byte (the "magic ID") to every serialised sketch binary — both the portable cross-language format and the internal Rust-only format — so any byte blob is self-describing. Analogous to Prometheus magic bytes for gauge/histogram types.
Wire format before:
  [ <rmp_serde msgpack payload> ]
Wire format after:
  [ magic_id: u8 | <rmp_serde msgpack payload> ]
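A minimal sketch of the prefix-and-validate step, assuming the payload has already been MessagePack-encoded elsewhere. `prepend_magic`, `strip_magic`, and the `Empty` variant are hypothetical names used for illustration; only the `BadMagicId { expected, got }` shape and the ID values come from this PR.

```rust
/// Two of the portable magic IDs from the table in this PR.
pub const MAGIC_HLL_SKETCH: u8 = 0x01;
pub const MAGIC_COUNT_MIN_SKETCH: u8 = 0x02;

#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    /// Hypothetical variant: the input had no bytes, so there is no ID to check.
    Empty,
    /// The leading byte did not match the type the caller asked for.
    BadMagicId { expected: u8, got: u8 },
}

/// Build [ magic_id: u8 | payload ] from an already-encoded payload.
pub fn prepend_magic(magic_id: u8, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + payload.len());
    out.push(magic_id);
    out.extend_from_slice(payload);
    out
}

/// Check the leading byte and return the payload slice that follows it.
pub fn strip_magic(expected: u8, bytes: &[u8]) -> Result<&[u8], Error> {
    match bytes.split_first() {
        None => Err(Error::Empty),
        Some((&got, rest)) if got == expected => Ok(rest),
        Some((&got, _)) => Err(Error::BadMagicId { expected, got }),
    }
}
```

A decoder for `HllSketch` would call `strip_magic(MAGIC_HLL_SKETCH, bytes)` first and hand only the returned slice to the MessagePack deserializer, so a `CountMinSketch` blob is rejected with `BadMagicId` before any payload parsing starts.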
Magic-ID table
Portable (cross-language, Go reads these):
  0x01  HllSketch
  0x02  CountMinSketch
  0x03  CountMinSketchWithHeap
  0x04  CountSketch
  0x05  DdSketch
  0x06  KllSketch
  0x07  HydraKllSketch
  0x08  SetAggregator
  0x09  DeltaResult

Native (Rust-internal, `serialize_to_bytes`/`deserialize_from_bytes`):
  0x81  CountMin<_, _>
  0x82  Count<_, _, _> (CountSketch)
  0x83  CountL2HH/CMSHeap
  0x84  HyperLogLogImpl<_, _, _>
  0x85  HyperLogLogHIPImpl<_>
  0x86  DDSketch (native map-format)
  0x87  KLL<T>
  0x88  KLLDynamic<T>
  0x89  KMV

What changed
- `src/message_pack_format/magic_ids.rs`: all IDs in one place (portable 0x01–0x09, native 0x81–0x89)
- `src/message_pack_format/error.rs`: new `Error::BadMagicId { expected, got }` variant
- `MessagePackCodec` impls in `message_pack_format/portable/` updated
- `serialize_to_bytes`/`deserialize_from_bytes` methods on sketch types in `src/sketches/` updated
- `MessagePackCodec` shims in `message_pack_format/native/` delegate to the sketch methods, so they pick up magic IDs automatically without any further changes
- `KllSketch` and `HydraKllSketch` formats embed raw `KLL::serialize_to_bytes()` bytes; those call sites pick up the magic byte automatically and continue to round-trip correctly

API
Unchanged: callers still call `to_msgpack()`/`from_msgpack()` or `serialize_to_bytes()`/`deserialize_from_bytes()`. All round-trips pass.

Open questions / follow-ups
compute_delta/apply_delta_bytes) get their own IDs or share the base sketch ID?

Test plan
- `cargo test`: all 426 tests pass
- `tests/msgpack_compat.rs`
- `BadMagicId`/`Uncategorized` error surfaces on wrong-type or truncated input

🤖 Generated with Claude Code