Add msgpack as alternative serialization format for entity storage #2

@deucalioncodes

Problem

ic-python-db currently uses json.dumps/json.loads for all entity serialization in db_engine.py. While CPython 3.13 on WASI likely has the C-accelerated _json module, there is still overhead from:

  1. JSON text format is larger than binary alternatives (string quoting, key repetition)
  2. Every db.save() and db.load() goes through JSON encode/decode
  3. On IC canisters, larger payloads mean more StableBTreeMap storage and more cycles

The current format is compact JSON (no indentation). The pretty=True option only exists on the read-only dump_json() and raw_dump_json() export/debug utilities.

Proposal

Add msgpack as the internal serialization format in db_engine.py.

Benefits

  • Smaller payloads — msgpack is typically 30-40% smaller than compact JSON for the same data
  • Faster encoding/decoding — even pure-Python msgpack is competitive with C-JSON for small dicts; a C-accelerated msgpack would be significantly faster
  • Binary-native — no string escaping overhead for keys/values

Size comparison for typical entity dicts

Format              Relative size
JSON (pretty)       ~2-3x baseline
JSON (compact)      1x (current)
msgpack             ~0.6-0.7x
CBOR                ~0.6-0.7x
Protocol Buffers    ~0.3-0.5x
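
The ratios above are estimates and depend heavily on entity shape. A minimal sketch for measuring them on real data might look like the following. The sample entity is hypothetical (not taken from ic-python-db), and msgpack is treated as an optional third-party dependency so the script still runs without it.

```python
import json

try:
    import msgpack  # third-party; optional for this sketch
except ImportError:
    msgpack = None

# Hypothetical entity dict, used only for illustration.
entity = {
    "_type": "User",
    "_id": "42",
    "name": "Alice",
    "age": 30,
    "active": True,
    "tags": ["admin", "staff"],
}


def encoded_sizes(data):
    """Return the encoded size in bytes for each available format."""
    sizes = {
        "json_pretty": len(json.dumps(data, indent=2).encode("utf-8")),
        "json_default": len(json.dumps(data).encode("utf-8")),
        # separators=(",", ":") drops the spaces json.dumps adds by default.
        "json_compact": len(json.dumps(data, separators=(",", ":")).encode("utf-8")),
    }
    if msgpack is not None:
        sizes["msgpack"] = len(msgpack.packb(data))
    return sizes


sizes = encoded_sizes(entity)
baseline = sizes["json_default"]
for name, size in sizes.items():
    print(f"{name:13s} {size:4d} bytes  {size / baseline:.2f}x")
```

Running this over a sample of real stored entities, rather than one toy dict, would give a more trustworthy baseline before committing to the format change.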

Implementation approach

The change is localized to db_engine.py — swap json.dumps/json.loads with msgpack equivalents:

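import msgpack

# Sketch of the two methods in db_engine.py; the enclosing class is omitted.

def save(self, type_name, id, data):
    # The key layout is unchanged; only the value encoding changes.
    key = f"{type_name}@{id}"
    self._db_storage.insert(key, msgpack.packb(data))

def load(self, type_name, id):
    key = f"{type_name}@{id}"
    data = self._db_storage.get(key)
    # A missing (or empty) value means the entity is not stored.
    if data:
        return msgpack.unpackb(data)
    return None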

Additional optimization: strip redundant fields from storage

Since the storage key is {type_name}@{id}, the _type and _id fields inside the stored value are redundant. Stripping them at the save boundary and re-injecting on load saves ~15-20% on top of the format change:

def save(self, type_name, id, data):
    key = f"{type_name}@{id}"
    # Drop fields that are already encoded in the storage key.
    to_store = {k: v for k, v in data.items() if k not in ("_type", "_id")}
    self._db_storage.insert(key, msgpack.packb(to_store))

def load(self, type_name, id):
    key = f"{type_name}@{id}"
    data = self._db_storage.get(key)
    if data:
        result = msgpack.unpackb(data)
        # Re-inject the fields stripped in save() from the caller's arguments.
        result["_type"] = type_name
        result["_id"] = id
        return result
    return None

This is safe because _type and _id are only needed in the public API (serialize()/deserialize()), not in the stored representation. The _serialize_base() method already adds them when building the dict for external use.
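
To check the savings estimate and the round trip independently of the format change, a small sketch using only the standard library could look like this. The entity and the helper names `strip_redundant` and `restore_redundant` are hypothetical. JSON is used here only so the sketch runs without msgpack installed.

```python
import json

# Hypothetical entity; fields other than _type and _id are made up.
entity = {"_type": "User", "_id": "42", "name": "Alice", "age": 30}

REDUNDANT_FIELDS = ("_type", "_id")


def strip_redundant(data):
    """Drop fields that can be recovered from the storage key."""
    return {k: v for k, v in data.items() if k not in REDUNDANT_FIELDS}


def restore_redundant(stored, type_name, id):
    """Re-inject the fields encoded in the storage key."""
    result = dict(stored)
    result["_type"] = type_name
    result["_id"] = id
    return result


full = json.dumps(entity).encode("utf-8")
stripped = json.dumps(strip_redundant(entity)).encode("utf-8")
saving = 1 - len(stripped) / len(full)
print(f"{len(full)} -> {len(stripped)} bytes ({saving:.0%} smaller)")

# The round trip must reproduce the original dict exactly.
round_tripped = restore_redundant(json.loads(stripped), "User", "42")
assert round_tripped == entity
```

The saving shrinks as entities grow, since `_type` and `_id` are a fixed cost per record. The round trip also only holds if the `id` passed to load has the same type as the `_id` that was stripped (for example, both `str`).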

Combined savings: ~45-55% reduction vs current compact JSON.

Alternatives considered

CBOR (Concise Binary Object Representation)

  • Similar size/speed to msgpack
  • Self-describing, RFC 8949
  • Used in parts of the IC ecosystem (HTTP interface)
  • No clear advantage over msgpack for this use case

Protocol Buffers

  • 50-70% smaller than JSON (schema-based, no field names stored)
  • But requires a .proto schema definition and code generation
  • Since ic-python-db is schema-flexible (entities defined dynamically at runtime), you'd be forced into a generic map<string, Value> structure, which erases most of protobuf's size advantage
  • Would only pay off with a per-entity-type generated proto, requiring a code generation pipeline — massive architectural shift
  • Verdict: overkill for this library

User-configurable format

  • Considered and rejected. Reasons:
    • Serialization format is an internal storage detail, not a user-facing concern
    • Migration complexity explodes (format detection on every read, full data migration on switch)
    • Testing surface doubles
    • Users would always want "the fastest and smallest" — the library should own this decision
  • Better approach: version the storage format internally with a metadata key, keep JSON for public-facing APIs (dump_json, serialize, deserialize)

Design decisions

  1. The library owns the internal format — users don't choose it
  2. JSON stays for public APIs: dump_json(), serialize(), deserialize() remain JSON-based for human readability and interop
  3. Version the storage format — store a metadata key indicating the format version so future changes can auto-migrate transparently
  4. Migration path — on load, try msgpack first; if it fails, fall back to JSON (for pre-migration data). On save, always use the new format.
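
The migration path in point 4 can be sketched roughly as below. The function names `decode_stored` and `encode_for_storage` are hypothetical, and the JSON-only fallback when msgpack is not installed exists only to keep the sketch runnable; it is not part of the proposal.

```python
import json

try:
    import msgpack  # third-party; the sketch degrades to JSON-only without it
except ImportError:
    msgpack = None


def decode_stored(raw):
    """Decode a stored value, preferring msgpack and falling back to JSON."""
    if isinstance(raw, str):
        # Pre-migration values were stored as str, so they must be JSON.
        return json.loads(raw)
    if msgpack is not None:
        try:
            value = msgpack.unpackb(raw)
            if isinstance(value, dict):
                return value
        except ValueError:
            # msgpack's ExtraData and FormatError subclass ValueError,
            # and incomplete input is reported as ValueError as well.
            pass
    return json.loads(raw.decode("utf-8"))


def encode_for_storage(data):
    """Always write the new format when it is available."""
    if msgpack is not None:
        return msgpack.packb(data)
    return json.dumps(data).encode("utf-8")


legacy = json.dumps({"name": "Alice"})
assert decode_stored(legacy) == {"name": "Alice"}
assert decode_stored(legacy.encode("utf-8")) == {"name": "Alice"}
assert decode_stored(encode_for_storage({"name": "Alice"})) == {"name": "Alice"}
```

The fallback works for JSON objects because the leading `{` byte (0x7b) decodes as a one-byte msgpack integer, so the remaining bytes trigger an extra-data error. Relying on a decode failure is still less robust than the explicit format-version key from point 3, which removes the guesswork.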

Considerations

  • Storage interface change: Storage.insert/get currently use str values. With msgpack the values would be bytes. Need to update Storage ABC and MemoryStorage, or base64-encode (which eats into size savings).
  • WASI C extension: For maximum benefit, the msgpack C extension would need to be cross-compiled for wasm32-wasip1. Pure-Python msgpack may not outperform C-accelerated JSON.
  • Audit log: Also uses json.dumps — should be updated consistently.
  • Depends on: Issue #1 (Fix redundant save on every Entity.load(): eliminates unnecessary json.dumps + storage write) should be fixed first (done).
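
To put a number on the base64 option from the storage interface point: base64 emits 4 output characters for every 3 input bytes, so it inflates any binary payload by about a third. A minimal sketch, where the 60-byte payload is an arbitrary stand-in for a msgpack-encoded entity:

```python
import base64


def base64_overhead(payload):
    """Return (raw_size, encoded_size) for a binary payload stored as str."""
    encoded = base64.b64encode(payload).decode("ascii")
    return len(payload), len(encoded)


# 60 arbitrary bytes standing in for a msgpack-encoded entity.
raw_size, encoded_size = base64_overhead(bytes(range(60)))
print(raw_size, encoded_size)  # 60 80
```

At roughly 1.33x, a msgpack payload at 0.6-0.7x of compact JSON would end up at about 0.8-0.93x after base64, which gives back most of the size benefit. That favors changing Storage to accept bytes values.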
