
feat(msgpack): add magic-ID type-discriminant prefix to wire format [PoC] #63

Draft
GordonYuanyc wants to merge 6 commits into main from feat/msgpack-magic-id



@GordonYuanyc GordonYuanyc commented Jun 17, 2026

Collaborator

Summary

This is a proof-of-concept PR: a starting point for discussion, not a merge-ready change.

Adds a single leading byte (the "magic ID") to every serialised sketch binary, in both the portable cross-language format and the internal Rust-only format, so that any byte blob is self-describing. This is analogous to Prometheus magic bytes for gauge/histogram types.

Wire format before:

[ <msgpack payload> ]

Wire format after:

[ magic_id: u8 | <msgpack payload> ]
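The framing itself is small. The following is a minimal sketch of it, assuming the payload has already been encoded to MessagePack bytes elsewhere; `frame` and `unframe` are hypothetical helper names, not functions from this PR.

```rust
/// Prepend a one-byte magic ID to an already-encoded MessagePack payload.
/// (Hypothetical helper; the PR does this inside each codec impl.)
fn frame(magic_id: u8, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + payload.len());
    out.push(magic_id);
    out.extend_from_slice(payload);
    out
}

/// Split a framed blob into (magic_id, payload).
/// Returns `None` when the blob is empty and so has no magic byte.
fn unframe(bytes: &[u8]) -> Option<(u8, &[u8])> {
    let (&magic_id, payload) = bytes.split_first()?;
    Some((magic_id, payload))
}
```

The cost is one extra byte per blob, and the payload slice is handed to the MessagePack decoder unchanged.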

Magic-ID table

Portable (cross-language, Go reads these):

| ID   | Type                   |
|------|------------------------|
| 0x01 | HllSketch              |
| 0x02 | CountMinSketch         |
| 0x03 | CountMinSketchWithHeap |
| 0x04 | CountSketch            |
| 0x05 | DdSketch               |
| 0x06 | KllSketch              |
| 0x07 | HydraKllSketch         |
| 0x08 | SetAggregator          |
| 0x09 | DeltaResult            |

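Native (Rust-internal, serialize_to_bytes / deserialize_from_bytes). This is the initial single-byte assignment; a later commit in this PR widens the native header to 2 bytes and reassigns several of these IDs per mode and variant: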

| ID   | Type                           |
|------|--------------------------------|
| 0x81 | CountMin<_, _>                 |
| 0x82 | Count<_, _, _> (CountSketch)   |
| 0x83 | CountL2HH / CMSHeap            |
| 0x84 | HyperLogLogImpl<_, _, _>       |
| 0x85 | HyperLogLogHIPImpl<_>          |
| 0x86 | DDSketch (native map-format)   |
| 0x87 | KLL<T>                         |
| 0x88 | KLLDynamic<T>                  |
| 0x89 | KMV                            |
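To illustrate what "self-describing" buys a reader of an unknown blob, here is a rough sketch of a lookup from the leading byte to the sketch type. The ID values are copied from the two tables above; the `Family` enum and `describe` function are hypothetical and do not appear in the PR.

```rust
/// Which wire-format family a magic ID belongs to (hypothetical type).
#[derive(Debug, PartialEq)]
enum Family {
    Portable,
    Native,
}

/// Map a leading byte to (family, type name), per the tables above.
/// Unassigned values return `None` instead of being guessed at.
fn describe(magic_id: u8) -> Option<(Family, &'static str)> {
    let entry = match magic_id {
        0x01 => (Family::Portable, "HllSketch"),
        0x02 => (Family::Portable, "CountMinSketch"),
        0x03 => (Family::Portable, "CountMinSketchWithHeap"),
        0x04 => (Family::Portable, "CountSketch"),
        0x05 => (Family::Portable, "DdSketch"),
        0x06 => (Family::Portable, "KllSketch"),
        0x07 => (Family::Portable, "HydraKllSketch"),
        0x08 => (Family::Portable, "SetAggregator"),
        0x09 => (Family::Portable, "DeltaResult"),
        0x81 => (Family::Native, "CountMin<_, _>"),
        0x82 => (Family::Native, "Count<_, _, _>"),
        0x83 => (Family::Native, "CountL2HH / CMSHeap"),
        0x84 => (Family::Native, "HyperLogLogImpl<_, _, _>"),
        0x85 => (Family::Native, "HyperLogLogHIPImpl<_>"),
        0x86 => (Family::Native, "DDSketch"),
        0x87 => (Family::Native, "KLL<T>"),
        0x88 => (Family::Native, "KLLDynamic<T>"),
        0x89 => (Family::Native, "KMV"),
        _ => return None,
    };
    Some(entry)
}
```

A debugging tool could call `describe(bytes[0])` on a stored blob to report its type without attempting a full decode.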

What changed

  • src/message_pack_format/magic_ids.rs — all IDs in one place (portable 0x01–0x09, native 0x81–0x89)
  • src/message_pack_format/error.rs — new Error::BadMagicId { expected, got } variant
  • All 9 MessagePackCodec impls in message_pack_format/portable/ updated
  • All serialize_to_bytes / deserialize_from_bytes methods on sketch types in src/sketches/ updated
  • The native MessagePackCodec shims in message_pack_format/native/ delegate to the sketch methods, so they pick up magic IDs automatically without any further changes
  • The portable KllSketch and HydraKllSketch formats embed raw KLL::serialize_to_bytes() bytes; those call sites pick up the magic byte automatically and continue to round-trip correctly
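The last bullet means a portable KLL blob carries two magic bytes, one per layer. The sketch below shows that nesting together with a `BadMagicId` check. It is deliberately simplified: the native bytes are placed directly after the portable ID, whereas in the PR they sit inside a MessagePack payload. `strip_magic` and the `Truncated` variant are hypothetical stand-ins; only `BadMagicId { expected, got }` is named in the PR.

```rust
use std::fmt;

const KLL_SKETCH: u8 = 0x06; // portable KllSketch, from the table above
const NATIVE_KLL: u8 = 0x87; // native KLL<T>, from the table above

#[derive(Debug, PartialEq)]
enum Error {
    /// The leading byte did not match the type being decoded.
    BadMagicId { expected: u8, got: u8 },
    /// Input ended before the magic byte (hypothetical variant).
    Truncated,
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::BadMagicId { expected, got } => {
                write!(f, "bad magic id: expected {:#04x}, got {:#04x}", expected, got)
            }
            Error::Truncated => write!(f, "input too short for a magic id"),
        }
    }
}

/// Validate the leading byte and return the bytes that follow it.
fn strip_magic(expected: u8, bytes: &[u8]) -> Result<&[u8], Error> {
    match bytes.split_first() {
        Some((&got, rest)) if got == expected => Ok(rest),
        Some((&got, _)) => Err(Error::BadMagicId { expected, got }),
        None => Err(Error::Truncated),
    }
}
```

Decoding a portable KLL blob in this model is two calls: `strip_magic(KLL_SKETCH, blob)` for the outer layer, then `strip_magic(NATIVE_KLL, inner)` for the embedded native bytes.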

API

Unchanged — callers still call to_msgpack() / from_msgpack() or serialize_to_bytes() / deserialize_from_bytes(). All round-trips pass.

Open questions / follow-ups

  • Should the delta formats (compute_delta / apply_delta_bytes) get their own IDs or share the base sketch ID?
  • The Go mirror PR is sketchlib-go#68 — the two must land together once IDs are finalised.
  • Backward-compatibility strategy for bytes serialised without the prefix (rolling deploy, version-detection shim, etc.).
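On the backward-compatibility question, one possible version-detection shim is to inspect the first byte. This sketch assumes that un-prefixed legacy payloads are MessagePack-encoded structs, so they start with an array or map marker (fixmap 0x80–0x8f, fixarray 0x90–0x9f, array16/32 and map16/32 at 0xdc–0xdf). Under that assumption the portable IDs 0x01–0x09 are unambiguous, but the native IDs 0x81–0x89 coincide with fixmap headers for maps of 1 to 9 entries. The `Framing` enum and `classify` function are hypothetical.

```rust
/// Result of inspecting the first byte of a blob (hypothetical type).
#[derive(Debug, PartialEq)]
enum Framing {
    /// First byte is a magic ID and cannot be a container marker.
    Prefixed(u8),
    /// First byte is a MessagePack container marker, not a magic ID.
    Legacy,
    /// First byte is both a magic ID and a valid fixmap header.
    Ambiguous(u8),
    /// Neither a known magic ID nor a container marker.
    Unknown(u8),
    Empty,
}

/// MessagePack array/map markers: fixmap, fixarray, array16/32, map16/32.
fn is_msgpack_container(b: u8) -> bool {
    matches!(b, 0x80..=0x9f | 0xdc..=0xdf)
}

/// Magic-ID ranges from the PR description (portable and native).
fn is_known_magic(b: u8) -> bool {
    matches!(b, 0x01..=0x09 | 0x81..=0x89)
}

fn classify(bytes: &[u8]) -> Framing {
    match bytes.first() {
        None => Framing::Empty,
        Some(&b) => match (is_known_magic(b), is_msgpack_container(b)) {
            (true, true) => Framing::Ambiguous(b),
            (true, false) => Framing::Prefixed(b),
            (false, true) => Framing::Legacy,
            (false, false) => Framing::Unknown(b),
        },
    }
}
```

If this assumption holds, a first-byte check alone would not be enough for native map-format blobs, which may be an argument for a distinct multi-byte marker or for a coordinated cut-over.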

Test plan

  • cargo test — all 426 tests pass
  • Round-trip tests for every sketch type in tests/msgpack_compat.rs
  • A BadMagicId or Uncategorized error surfaces on wrong-type or truncated input
  • Portable KLL / HydraKLL round-trips verified (they embed native KLL bytes)

🤖 Generated with Claude Code

GordonYuanyc and others added 6 commits June 17, 2026 16:00
…mpls

This is a proof-of-concept / starting point for making serialized sketch
binaries self-describing.  Every `to_msgpack()` output now starts with a
single type-discriminant byte (analogous to Prometheus magic bytes for
gauge/histogram), and `from_msgpack()` validates that byte before
deserialising the payload.

Magic-ID table (stable, never reuse a value):
  0x01  HllSketch
  0x02  CountMinSketch
  0x03  CountMinSketchWithHeap
  0x04  CountSketch
  0x05  DdSketch
  0x06  KllSketch
  0x07  HydraKllSketch
  0x08  SetAggregator
  0x09  DeltaResult

Wire format after this change:
  [ magic_id: u8 | <rmp_serde msgpack payload> ]

API is unchanged: callers still call `to_msgpack()` / `from_msgpack()`.
Round-trips still pass.  The `BadMagicId` error variant surfaces
mismatches at decode time instead of producing a confusing type error.

All 426 unit tests pass.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…es on all sketch types

All native sketch serialization methods now also carry a magic-ID prefix
(range 0x81–0x89, separate from the portable 0x01–0x09 range):

  0x81  CountMin<_, _>        (native map-format)
  0x82  Count<_, _, _>        (native map-format)
  0x83  CountL2HH / CMSHeap   (native map-format)
  0x84  HyperLogLogImpl<_,_,_>
  0x85  HyperLogLogHIPImpl<_>
  0x86  DDSketch               (native map-format)
  0x87  KLL<T>
  0x88  KLLDynamic<T>
  0x89  KMV

The portable KllSketch and HydraKllSketch wire formats embed KLL bytes via
KLL::serialize_to_bytes / deserialize_from_bytes; those call sites pick up
the magic byte automatically and the portable round-trips continue to work.

The native MessagePackCodec shims in message_pack_format/native/ delegate to
these methods, so they too get the magic byte without any further changes.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- cargo fmt --all to fix formatting diffs caught by CI
- Add docs/msgpack-magic-ids.md documenting the full magic-ID table,
  the portable vs native distinction, the embedded-KLL relationship,
  and instructions for adding future sketch types

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Completes the magic-ID rollout across all serialize_to_bytes /
deserialize_from_bytes call sites in the codebase. The EH files
(eh.rs, eh_sketch_list.rs, eh_univ_optimized.rs) have no serialization
methods and need no changes.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The native header grows from 1 byte to 2 bytes:
  [ family+mode byte | hasher byte | <rmp_serde named payload> ]

Key changes:

serialize_to_bytes / deserialize_from_bytes are now in Mode-specialized
impl blocks instead of a single generic impl, so the first byte encodes
both the sketch family and the phantom-type parameter:

  CountMin<_, RegularPath, _> → 0x81
  CountMin<_, FastPath,    _> → 0x82
  Count<_,   RegularPath, _> → 0x83
  Count<_,   FastPath,    _> → 0x84
  HyperLogLogImpl<Classic,  _, _> → 0x86
  HyperLogLogImpl<ErtlMLE,  _, _> → 0x87
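The mode-specialized impl split described above can be sketched roughly as follows. This is a toy model: `CountMin` here has a single type parameter, the payload is raw little-endian counters instead of MessagePack, and only the first header byte is modelled. The `encode` and `decode` helpers are hypothetical; the 0x81 / 0x82 values come from the list above.

```rust
use std::marker::PhantomData;

struct RegularPath;
struct FastPath;

/// Toy stand-in for the real sketch; only the mode marker matters here.
struct CountMin<Mode> {
    counters: Vec<u64>,
    _mode: PhantomData<Mode>,
}

impl<Mode> CountMin<Mode> {
    fn new(counters: Vec<u64>) -> Self {
        CountMin { counters, _mode: PhantomData }
    }
}

// One impl block per mode, so each mode gets its own magic byte.
impl CountMin<RegularPath> {
    const MAGIC: u8 = 0x81;
    fn serialize_to_bytes(&self) -> Vec<u8> {
        encode(Self::MAGIC, &self.counters)
    }
    fn deserialize_from_bytes(bytes: &[u8]) -> Result<Self, String> {
        decode(Self::MAGIC, bytes).map(Self::new)
    }
}

impl CountMin<FastPath> {
    const MAGIC: u8 = 0x82;
    fn serialize_to_bytes(&self) -> Vec<u8> {
        encode(Self::MAGIC, &self.counters)
    }
    fn deserialize_from_bytes(bytes: &[u8]) -> Result<Self, String> {
        decode(Self::MAGIC, bytes).map(Self::new)
    }
}

/// Hypothetical toy encoding: magic byte, then little-endian u64 counters.
fn encode(magic: u8, counters: &[u64]) -> Vec<u8> {
    let mut out = vec![magic];
    for c in counters {
        out.extend_from_slice(&c.to_le_bytes());
    }
    out
}

/// Inverse of `encode`; rejects a wrong magic byte or a partial counter.
fn decode(magic: u8, bytes: &[u8]) -> Result<Vec<u64>, String> {
    let (&got, body) = bytes.split_first().ok_or("empty input")?;
    if got != magic {
        return Err(format!("bad magic id: expected {:#04x}, got {:#04x}", magic, got));
    }
    if body.len() % 8 != 0 {
        return Err("truncated counters".to_string());
    }
    Ok(body
        .chunks_exact(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        })
        .collect())
}
```

Because the magic byte is fixed per concrete type, feeding FastPath bytes to the RegularPath decoder fails at the first byte instead of silently reinterpreting the counters.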

A new hasher_magic_id() method on SketchHasher (default = 0xff =
HASHER_UNKNOWN) fills the second byte.  DefaultXxHasher returns 0x01
(HASHER_DEFAULT_XX).  check_hasher_id<H>(stored) skips the check when
either side is 0xff, so custom hashers interoperate without registering.
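A minimal sketch of that hasher check, assuming `hasher_magic_id` is an associated function without a receiver and that `check_hasher_id` returns a `Result` (neither signature is shown in the commit message). `CustomHasher` is a hypothetical unregistered hasher.

```rust
const HASHER_UNKNOWN: u8 = 0xff;
const HASHER_DEFAULT_XX: u8 = 0x01;

trait SketchHasher {
    /// Unregistered hashers fall back to HASHER_UNKNOWN.
    fn hasher_magic_id() -> u8 {
        HASHER_UNKNOWN
    }
}

struct DefaultXxHasher;

impl SketchHasher for DefaultXxHasher {
    fn hasher_magic_id() -> u8 {
        HASHER_DEFAULT_XX
    }
}

/// Hypothetical custom hasher that does not register an ID.
struct CustomHasher;

impl SketchHasher for CustomHasher {}

/// Compare the stored hasher byte with the decoder's hasher.
/// The check is skipped when either side is HASHER_UNKNOWN.
fn check_hasher_id<H: SketchHasher>(stored: u8) -> Result<(), String> {
    let ours = H::hasher_magic_id();
    if stored == HASHER_UNKNOWN || ours == HASHER_UNKNOWN || stored == ours {
        Ok(())
    } else {
        Err(format!(
            "hasher mismatch: blob has {:#04x}, decoder uses {:#04x}",
            stored, ours
        ))
    }
}
```

The trade-off is that an unregistered hasher is never rejected, so two different custom hashers would pass the check against each other.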

Types without an H parameter (DDSketch, KLL, KLLDynamic, KMV, Hydra,
UnivMon, HyperLogLogHIPImpl) always write 0xff or HASHER_DEFAULT_XX as
the second byte for a consistent 2-byte header across all native blobs.

The native MessagePackCodec shims in message_pack_format/native/ are
updated to match the specialized impl split.

Docs updated: docs/msgpack-magic-ids.md now shows the 2-byte native
header layout and the full updated ID table.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Closes the coverage gap: the earlier split-impl work only had RegularPath
round-trip tests.  New tests:

  count_min_fast_path_round_trip_serialization  — verifies magic bytes
    0x82 / HASHER_DEFAULT_XX and data survives serialize→deserialize
  count_min_mode_mismatch_is_rejected           — FastPath bytes →
    RegularPath decoder must error
  count_sketch_fast_path_round_trip_serialization — same for Count
  count_sketch_mode_mismatch_is_rejected
  hll_magic_bytes_are_variant_specific          — Classic=0x86, ErtlMLE=0x87
  hll_variant_mismatch_is_rejected              — ErtlMLE bytes → Classic
    decoder must error

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@milindsrivastava1997


One thought: a two-digit hex (one-byte) magic ID restricts us to 256 IDs. While this seems reasonable for the short term, it's not out of the question that the number of sketches may go beyond that. We could make it 3 digits or, for complete future-proofing (at the cost of some complexity), prefix the magic ID with a number that denotes how many digits it has. Are there any references in other frameworks for how this is handled?

@GordonYuanyc

Collaborator Author

I agree that a single-byte sketch ID gives us an unnecessary 256-ID ceiling.

Prometheus and OTAP both put lightweight metadata in front of the raw binary payload so the receiver can tell how to interpret the payload from the bytes alone. Prometheus stores a chunk encoding before each chunk payload; OTAP wraps raw Arrow IPC bytes with schema_id/type metadata.

For us, a thin wrapper around the existing MessagePack payload should be enough:

[ "ASK1" | version:u8 | kind_id_len:u8 | kind_id:bytes | msgpack_payload ]

"ASK1" identifies this as an ASAP sketch binary. version is for future wrapper-layout changes. kind_id_len + kind_id avoids the 256-ID limit while keeping small IDs compact. kind_id should be canonical unsigned big-endian with no leading zero bytes.

@GordonYuanyc

Collaborator Author

With this wrapper, the serialized bytes are self-describing enough for the deserializer to determine which sketch decoder to use.
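A rough sketch of the proposed wrapper, under several assumptions that the thread does not settle: the version byte is 1, `kind_id` fits in a `u64` (so `kind_id_len` is at most 8), and a kind of 0 is encoded as a zero-length `kind_id`. `wrap_blob`, `unwrap_blob` and `encode_kind_id` are hypothetical names.

```rust
const MAGIC: &[u8; 4] = b"ASK1";
const VERSION: u8 = 1; // assumed value; the thread only says `version:u8`

/// Canonical unsigned big-endian encoding with no leading zero bytes.
fn encode_kind_id(kind: u64) -> Vec<u8> {
    let be = kind.to_be_bytes();
    let first = be.iter().position(|&b| b != 0).unwrap_or(be.len());
    be[first..].to_vec()
}

/// Build [ "ASK1" | version | kind_id_len | kind_id | payload ].
fn wrap_blob(kind: u64, payload: &[u8]) -> Vec<u8> {
    let id = encode_kind_id(kind);
    let mut out = Vec::with_capacity(6 + id.len() + payload.len());
    out.extend_from_slice(MAGIC);
    out.push(VERSION);
    out.push(id.len() as u8); // at most 8 for a u64 kind
    out.extend_from_slice(&id);
    out.extend_from_slice(payload);
    out
}

/// Parse the wrapper and return (kind, msgpack payload).
fn unwrap_blob(bytes: &[u8]) -> Result<(u64, &[u8]), String> {
    if bytes.len() < 6 || !bytes.starts_with(MAGIC) {
        return Err("not an ASK1 blob".to_string());
    }
    if bytes[4] != VERSION {
        return Err(format!("unsupported wrapper version {}", bytes[4]));
    }
    let len = bytes[5] as usize;
    let rest = &bytes[6..];
    if len > 8 || rest.len() < len {
        return Err("bad kind_id length".to_string());
    }
    let (id, payload) = rest.split_at(len);
    if id.first() == Some(&0) {
        return Err("non-canonical kind_id (leading zero byte)".to_string());
    }
    let kind = id.iter().fold(0u64, |acc, &b| (acc << 8) | u64::from(b));
    Ok((kind, payload))
}
```

IDs below 256 cost one `kind_id` byte, so a small-ID blob carries 7 bytes of header in this layout. The returned kind is what a deserializer would match on to pick the sketch decoder.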

@milindsrivastava1997


Sounds good. This encoding scheme makes sense.


2 participants