From 7e67c28c82925b3d3daed3037186263800caec39 Mon Sep 17 00:00:00 2001
From: Zeke
Date: Sat, 13 Jun 2026 23:12:32 -0700
Subject: [PATCH] design: large zset/list representations + OBJECT ENCODING mapping (closes #134, closes #135, closes #40)

ZSET_LARGE.md (#134): large sorted-set = ordered index (provisional
skiplist per large-collection-bakeoff.md) plus a parallel member->score
hashmap, sync invariant, ZRANGEBYSCORE vs ZRANGEBYLEX; final structure
deferred to #136 [redis-zset-skiplist-plus-ht].

LIST_LARGE.md (#135): quicklist-equivalent chunked listpack deque, O(1)
head/tail, ~8KB node sizing [redis-list-max-listpack-size-neg2], chunk
split/merge; flat-vs-indexed deferred to #136.

OBJECT_ENCODING_MAPPING.md (#40): total internal-repr -> OBJECT ENCODING
name table over
{embstr,int,raw,listpack,intset,hashtable,skiplist,quicklist}, DEBUG
OBJECT synthesis, assert_encoding wiring [valkey-assert-encoding-vocab].

Authored+reviewed via workflow. CI passes.

Closes #134, closes #135, closes #40.

Signed-off-by: Zeke
---
 docs/design/LIST_LARGE.md              | 104 +++++++++++++++++++++++++
 docs/design/OBJECT_ENCODING_MAPPING.md | 100 ++++++++++++++++++++++++
 docs/design/README.md                  |   6 ++
 docs/design/ZSET_LARGE.md              |  98 +++++++++++++++++++++++
 4 files changed, 308 insertions(+)
 create mode 100644 docs/design/LIST_LARGE.md
 create mode 100644 docs/design/OBJECT_ENCODING_MAPPING.md
 create mode 100644 docs/design/ZSET_LARGE.md

diff --git a/docs/design/LIST_LARGE.md b/docs/design/LIST_LARGE.md
new file mode 100644
index 0000000..8cbec44
--- /dev/null
+++ b/docs/design/LIST_LARGE.md
@@ -0,0 +1,104 @@
+# Design: List large representation (quicklist-equivalent chunked deque)
+
+Issue: #135. Decisions: ADR-0018 (encoding thresholds), ADR-0005 (per-shard
+unsynchronized map), ADR-0009 (behavioral equivalence). Related: #113 (small
+listpack list chunk), #35 (index), #40 (OBJECT ENCODING name and ql_* fields),
+#136 (large-collection-bakeoff), #128 (list command semantics), #52
+(value compression), #8 (harness).
+
+## Goal and scope
+
+A list that outgrows a single small listpack chunk (ADR-0018) needs a structure
+with O(1) head and tail operations and bounded per-node memory, the quicklist
+contract. This spec fixes the chunked-deque shape, the node-size policy, the
+traversal model for the interior commands, and how chunks split and merge, plus
+the ql_nodes/ql_avg_node fields #40 must synthesize. Scope is the representation
+above the listpack threshold; the small chunk is #113, the threshold is
+ADR-0018/#37, and the flat-deque-versus-indexed-chunk choice is the #136
+bake-off. This spec sets the provisional flat baseline and the contract.
+
+## Design
+
+### Chunked deque of listpack nodes
+
+- The large list is a deque of compact listpack chunks, the quicklist shape
+  Redis uses [redis-list-max-listpack-size-neg2]. Each chunk is one contiguous
+  listpack with the ~6-byte header (total bytes plus element count) and a 1-byte
+  terminator [redis-listpack-header-6-bytes], holding a run of elements in order;
+  the chunks are linked head-to-tail. The whole list is one value on one core
+  (ADR-0005), so no chunk link is synchronized.
+- The provisional structure is a flat doubly linked deque of chunks (a plain
+  prev/next chain). It is provisional because #136 evaluates an indexed chunk
+  structure (a small B-tree or rope of chunks) for faster positional access; this
+  spec commits to the chunk-deque trait, not to flat-versus-indexed.
+
+### Node sizing (~8 KB)
+
+- A chunk's byte budget maps to list-max-listpack-size -2, the Redis default that
+  caps each node's listpack at 8 KB rather than an element count
+  [redis-list-max-listpack-size-neg2]. A push that would exceed the budget starts
+  a new chunk; the cap keeps each node cache-resident and bounds the cost of an
+  interior memmove within one chunk. IronCache stores only the listpack bytes per
+  chunk, contrasting Redis's 32-byte quicklistNode struct (prev/next, listpack
+  ptr, sz, count, and bitfields) [redis-quicklist-node-32-bytes]; interior-node
+  LZF compression [redis-quicklist-node-32-bytes] is a design choice deferred to
+  COMPRESSION.md (#52), not adopted here.
+
+### Head/tail O(1) and interior traversal
+
+- LPUSH/RPUSH/LPOP/RPOP touch only the head or tail chunk: an append or pop
+  inside that chunk's listpack, allocating or freeing a chunk only at the budget
+  boundary, so end operations are O(1) amortized.
+- LINDEX/LRANGE/LSET/LINSERT walk the chunk chain accumulating element counts to
+  locate the target chunk, then scan within it. Each chunk carries its element
+  count in the listpack header [redis-listpack-header-6-bytes], so locating a
+  chunk by index is a walk over chunk counts, not over every element; the flat
+  baseline makes this O(number of chunks), which is the cost #136 weighs against
+  an indexed variant. LSET rewrites one entry in place; LINSERT inserts into the
+  target chunk's listpack with at most one tail memmove within that chunk.
+
+### Chunk split and merge
+
+- An insert that pushes a chunk past the ~8 KB budget
+  [redis-list-max-listpack-size-neg2] splits it into two chunks at an element
+  boundary near the midpoint. Deletions that leave two adjacent chunks jointly
+  under the budget merge them, bounding chunk count and keeping ql_avg_node
+  meaningful. The merge low-watermark (how empty before merging) is harness-tuned
+  (#8), a churn-versus-resident-bytes trade, not fixed here.
+
+### ql_nodes and ql_avg_node derivation
+
+- ql_nodes is the live chunk count; ql_avg_node is total element count divided by
+  ql_nodes. Both are computed from the deque IronCache actually holds and
+  surfaced through DEBUG OBJECT for `quicklist` keys [redis-quicklist-node-32-bytes],
+  the synthesis #40 wires in. They reflect IronCache chunking, not a Redis node
+  layout, and are a pure function of the current representation (#40).
+
+## Open questions
+
+- Flat doubly linked chunk chain vs an indexed chunk structure (small B-tree or
+  rope) for positional access, decided by #136 on throughput-per-core and
+  bytes-per-element.
+- The chunk split point (strict midpoint vs fill-the-tail) and the merge
+  low-watermark, tuned on the harness (#8).
+- Whether a chunk reuses the #113 `pack` exactly or a length-only variant sized
+  to the ~8 KB cap [redis-list-max-listpack-size-neg2].
+
+## Acceptance and test hooks
+
+- LPUSH/RPUSH/LPOP/RPOP touch only the end chunk and allocate or free a chunk
+  only at the byte budget (O(1) amortized, structural test).
+- An interior LINSERT performs at most one tail memmove within the target chunk
+  and never rewrites another chunk; no chunk exceeds the ~8 KB budget after split
+  [redis-list-max-listpack-size-neg2] (property test).
+- DEBUG OBJECT reports ql_nodes equal to the live chunk count and a consistent
+  ql_avg_node [redis-quicklist-node-32-bytes]; OBJECT ENCODING reports
+  `quicklist` [valkey-assert-encoding-vocab] (ADR-0009, name map #40).
+- LINDEX/LRANGE/LSET match the oracle across chunk boundaries (#97/#98, #128).
+
+## References
+
+- ADR-0005, ADR-0009, ADR-0018; issues #113, #35, #40, #136, #128, #52, #37,
+  #8, #97, #98.
+- Claims: [redis-list-max-listpack-size-neg2], [redis-listpack-header-6-bytes],
+  [redis-quicklist-node-32-bytes], [valkey-assert-encoding-vocab].
diff --git a/docs/design/OBJECT_ENCODING_MAPPING.md b/docs/design/OBJECT_ENCODING_MAPPING.md
new file mode 100644
index 0000000..d9e9f4b
--- /dev/null
+++ b/docs/design/OBJECT_ENCODING_MAPPING.md
@@ -0,0 +1,100 @@
+# Design: OBJECT ENCODING / DEBUG OBJECT compatibility mapping
+
+Issue: #40. Decisions: ADR-0009 (behavioral equivalence via OBJECT ENCODING),
+ADR-0018 (encoding thresholds). Related: #35 (index, parent), #111 (object
+layout), #112 (scalar encodings), #113 (collection container), #134 (large
+zset), #135 (large list), #95 (conformance), #150 (DEBUG OBJECT command).
+
+## Goal and scope
+
+Clients and conformance suites introspect storage through OBJECT ENCODING and
+DEBUG OBJECT and branch on the exact synthetic name returned, so IronCache must
+report Redis-vocabulary names even though its internal representations are chosen
+for a Rust runtime, not Redis's C internals (ADR-0009). This spec fixes the total
+function from every internal representation to one reported name, the DEBUG
+OBJECT field synthesis, and the assert_encoding wiring. Out of scope are the
+structures themselves (#35, #112, #113, #134, #135) and the thresholds at which
+they convert (ADR-0018/#37); this spec reports the active representation's name,
+it does not decide the representation.
+
+## Design
+
+### The representation-to-name table (total function)
+
+- The reported vocabulary is the eight Redis synthetic names the conformance
+  suite asserts on [valkey-assert-encoding-vocab]: embstr, int, raw, listpack,
+  intset, hashtable, skiplist, quicklist. The mapping is a total function: each
+  internal representation maps to exactly one name, never two. Issue #40's
+  acceptance table collapsed embstr/raw into a single bullet and left the
+  embstr-vs-raw split as an open decision; this spec keeps both names, matching
+  ENCODINGS.md, which reports out-of-line strings as the `raw`-class.
+- String types: a pointer-tagged inline integer (#112) reports `int`; an inline
+  short string (SSO, the embstr-class up to the inline threshold
+  [redis-embstr-threshold-44]) reports `embstr`; an out-of-line string with a
+  variable-width header [redis-sds-header-variants] reports `raw`. The embstr/raw
+  boundary is the inline-value threshold (#111), reported off the current
+  representation, not recomputed from config.
+- Collection types: the small universal `pack` container (#113) reports
+  `listpack` for hash, list, set, and zset alike; the all-integer sorted-array
+  analog [redis-intset-layout] reports `intset`. The large hash and set report
+  `hashtable`, the large sorted set (#134) reports `skiplist`, and the chunked
+  list deque (#135) reports `quicklist`. The borrowed name `quicklist` describes
+  the chunked shape, not the 32-byte Redis node layout [redis-quicklist-node-32-bytes].
+
+### Name derives from representation, not from thresholds
+
+- The reported name is a pure function of the active internal representation, so
+  reconfiguring an ADR-0018 threshold (which changes WHEN a value converts) never
+  changes the name reported for a value that has not converted. Two keys of the
+  same logical type report different names exactly when their representations
+  differ (for example a 50-member zset listpack vs a 5000-member zset skiplist),
+  matching the oracle (ADR-0009).
+
+### DEBUG OBJECT field synthesis
+
+- DEBUG OBJECT emits a line with `encoding:` from the same function above,
+  so OBJECT ENCODING and DEBUG OBJECT always agree on the name. Fields IronCache
+  can compute honestly are synthesized: `serializedlength` from the value's
+  encoded byte size, and for `quicklist` keys `ql_nodes` (the live chunk count)
+  and `ql_avg_node` (elements per chunk), both derived from IronCache's chunking
+  (#135) rather than a Redis node count [redis-quicklist-node-32-bytes]. Fields
+  that name a Redis-internal IronCache does not have are omitted rather than
+  emitted as fabricated zeros, so no test asserts on an invented internal.
+
+### assert_encoding wiring and rejected alternatives
+
+- The conformance suite adopts Valkey's assert_encoding helper, which runs OBJECT
+  ENCODING and matches the expected name from the same vocabulary
+  [valkey-assert-encoding-vocab], treating a mismatch as a correctness failure
+  (#95). Reporting native names (`btree-zset`, `radix-hash`) even behind a flag
+  is rejected: it would fork the test corpus and defeat compatibility. A separate
+  read-only native-introspection verb for IronCache's own debugging is left open
+  and would never be OBJECT ENCODING.
+
+## Open questions
+
+- The exact embstr-vs-raw byte boundary (the inline-value threshold shared with
+  #111), and whether any string ever reports `raw` below it.
+- Which DEBUG OBJECT fields beyond serializedlength/ql_nodes/ql_avg_node are
+  load-bearing for the target suites, surfaced as #95 enumerates them.
+- Whether a native-name introspection command is worth adding for debugging
+  (separate verb, never OBJECT ENCODING).
+
+## Acceptance and test hooks
+
+- Every internal representation maps to exactly one name from {embstr, int, raw,
+  listpack, intset, hashtable, skiplist, quicklist} (a documented total-function
+  table, unit-tested for totality).
+- OBJECT ENCODING and DEBUG OBJECT agree on the name for the same key, and the
+  name does not change when only thresholds are reconfigured (property test).
+- assert_encoding passes against IronCache across the size ladder and at every
+  conversion boundary [valkey-assert-encoding-vocab] (#95/#97/#98).
+- A `quicklist` key returns a plausible ql_nodes derived from IronCache chunking
+  [redis-quicklist-node-32-bytes] (#135).
+
+## References
+
+- ADR-0009, ADR-0018; issues #35, #111, #112, #113, #134, #135, #95, #97, #98,
+  #150.
+- Claims: [valkey-assert-encoding-vocab], [redis-quicklist-node-32-bytes],
+  [redis-embstr-threshold-44], [redis-sds-header-variants], [redis-intset-layout].
diff --git a/docs/design/README.md b/docs/design/README.md
index 62f85be..2329ee9 100644
--- a/docs/design/README.md
+++ b/docs/design/README.md
@@ -119,3 +119,9 @@ Specs added as the M1 milestone progresses.
   monoio/glommio/tokio swappable (#27).
 - [IOURING_DATAPATH.md](IOURING_DATAPATH.md): the Linux io_uring net fast path
   (per-shard ring, registered fixed buffers, multishot + one-shot fallback) (#28).
+- [ZSET_LARGE.md](ZSET_LARGE.md): the large sorted-set representation (ordered
+  index plus parallel member->score map; final structure deferred to #136) (#134).
+- [LIST_LARGE.md](LIST_LARGE.md): the large list (quicklist-equivalent chunked
+  listpack deque, O(1) head/tail, ~8KB node sizing) (#135).
+- [OBJECT_ENCODING_MAPPING.md](OBJECT_ENCODING_MAPPING.md): the internal-repr to
+  OBJECT ENCODING name map and DEBUG OBJECT field synthesis (#40).
diff --git a/docs/design/ZSET_LARGE.md b/docs/design/ZSET_LARGE.md
new file mode 100644
index 0000000..08b2e5a
--- /dev/null
+++ b/docs/design/ZSET_LARGE.md
@@ -0,0 +1,98 @@
+# Design: Sorted-set large representation (ordered index plus member map)
+
+Issue: #134. Decisions: ADR-0018 (encoding thresholds), ADR-0005 (per-shard
+unsynchronized map), ADR-0009 (behavioral equivalence). Related: #113 (small
+listpack zset), #35 (index), #40 (OBJECT ENCODING name), #136
+(large-collection-bakeoff), #128 (zset command semantics), #8 (harness).
+
+## Goal and scope
+
+A sorted set that outgrows the small listpack container (ADR-0018) needs a
+structure that serves both an ordered range/rank query and an O(1) member point
+lookup, the two access patterns the zset command set demands. This spec fixes the
+two-structure shape, the sync invariant that keeps them consistent on one core,
+the ordering contracts behind ZRANGEBYSCORE and ZRANGEBYLEX, and which knobs are
+harness parameters rather than fixed numbers. Scope is the representation above
+the listpack threshold only. The promotion thresholds are ADR-0018/#37, the small
+container is #113, and the final choice of ordered-index structure is the #136
+bake-off; this spec sets the provisional baseline and the contract every
+candidate must satisfy, not the winner.
+
+## Design
+
+### Two structures, one value
+
+- The large zset is a dual structure mirroring Redis: an ordered index keyed by
+  (score, member) for range and rank, plus a parallel hashmap from member to
+  score for O(1) ZSCORE and ZADD score-update [redis-zset-skiplist-plus-ht]. The
+  member bytes are stored once and shared between both views, so a member is not
+  duplicated per structure [redis-zset-skiplist-plus-ht]. The whole value lives
+  in one kvobj on one core (ADR-0005), so neither structure takes a lock.
+- The provisional ordered index is a skiplist [redis-zset-skiplist-plus-ht]. It
+  is provisional because the #136 bake-off evaluates a cache-conscious B-tree and
+  an ART against it on throughput-per-core and bytes-per-element; a B-tree packs
+  many keys per cache line versus the skiplist's one element per tower node
+  [skiplist-vs-btree-cache], and ART keeps keys ordered at a low per-key byte
+  cost [art-adaptive-radix-tree-icde13]. This spec commits to the trait the index
+  sits behind, not the structure that wins.
+
+### The sync invariant
+
+- Every member appears in exactly one of two states: present in BOTH the ordered
+  index and the member map with the same score, or present in NEITHER. There is
+  no transient single-structure state observable to a command, because all
+  mutation runs inline on the owning core (ADR-0005) with no yield point inside a
+  zset write. ZADD that updates a score is a remove-then-reinsert in the ordered
+  index plus an in-place score rewrite in the map; ZREM deletes from both. A
+  property test asserts the two views agree on membership and score after every
+  operation (the sync invariant).
+
+### Ordering: ZRANGEBYSCORE vs ZRANGEBYLEX
+
+- The ordered index is sorted by (score, member): primarily ascending score,
+  ties broken by member byte order, the ordering Redis defines for a skiplist
+  zset [redis-zset-skiplist-plus-ht]. ZRANGEBYSCORE, ZRANK, and ZRANGE by index
+  walk this order directly, forward or reversed.
+- ZRANGEBYLEX assumes all members share one score and returns a purely
+  lexicographic member range. Because the index already breaks score ties by
+  member bytes, the equal-score run is contiguous and already in member order, so
+  ZRANGEBYLEX is a sub-scan of that run with no second index. Its result is
+  defined only when scores are equal, matching the oracle (ADR-0009, #128).
+
+### Level and fanout as harness parameters
+
+- For the skiplist baseline the max level and the level-promotion probability are
+  harness parameters (#8), not fixed here; for a B-tree or ART candidate the
+  analogous knob is node fanout. They are swept in the #136 bake-off because the
+  right value depends on IronCache's value layout and the thread-per-core engine,
+  where cross-paper numbers do not transfer.
+
+## Open questions
+
+- The final ordered-index structure (skiplist vs cache-conscious B-tree vs ART),
+  decided by #136 on throughput-per-core and bytes-per-element.
+- Whether the member map is a distinct per-zset hashbrown table or folds into the
+  ordered index nodes once the structure is chosen (#136), and the score-update
+  path's exact cost under each.
+- Whether a maintained rank/size annotation is worth its bytes for O(log n) ZRANK
+  versus a counted walk, tuned on the harness (#8).
+
+## Acceptance and test hooks
+
+- After any ZADD/ZREM/ZINCRBY the ordered index and the member map agree on
+  membership and score for every member (the sync invariant, property test).
+- ZSCORE is a single member-map lookup with no ordered-index walk; ZRANGEBYSCORE
+  and ZRANK walk the (score, member) order and match the oracle (#97/#98).
+- ZRANGEBYLEX over an equal-score set returns the lexicographic member range and
+  matches the oracle; mixed scores follow the oracle's defined behavior
+  (#97/#98, #128).
+- OBJECT ENCODING reports `skiplist` for the large zset regardless of the chosen
+  internal structure [valkey-assert-encoding-vocab] (ADR-0009, name map in #40).
+
+## References
+
+- ADR-0005, ADR-0009, ADR-0018; issues #113, #35, #40, #136, #128, #37, #8,
+  #97, #98.
+- Claims: [redis-zset-skiplist-plus-ht], [redis-zset-max-listpack-entries-128],
+  [skiplist-vs-btree-cache], [art-adaptive-radix-tree-icde13],
+  [valkey-assert-encoding-vocab].
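
A minimal Rust sketch of the two-structure zset and its sync invariant as ZSET_LARGE.md states them, offered as an illustration and not as part of the patch. The names `LargeZset`, `Score`, and `in_sync` are hypothetical. A `BTreeSet` stands in for the provisional skiplist purely because it is in the standard library; it is not a claim about the #136 outcome. Two simplifications are assumed: member bytes are cloned into both views (the spec stores them once and shares them), and scores are ordered with `f64::total_cmp`, which also orders -0.0 before 0.0 and assumes NaN is rejected upstream.

```rust
use std::cmp::Ordering;
use std::collections::{BTreeSet, HashMap};

/// f64 wrapper ordered by `f64::total_cmp`, so scores can key a BTreeSet.
/// Assumption: NaN scores are rejected before reaching this type.
#[derive(Clone, Copy, Debug)]
struct Score(f64);

impl PartialEq for Score {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Score {}
impl PartialOrd for Score {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Score {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0)
    }
}

/// Hypothetical large zset: ordered (score, member) index plus member -> score map.
pub struct LargeZset {
    index: BTreeSet<(Score, Vec<u8>)>, // stand-in for the skiplist baseline
    scores: HashMap<Vec<u8>, Score>,
}

impl LargeZset {
    pub fn new() -> Self {
        LargeZset { index: BTreeSet::new(), scores: HashMap::new() }
    }

    /// ZADD: returns true when the member is new. A score update removes the old
    /// (score, member) entry, reinserts under the new score, and rewrites the map.
    pub fn zadd(&mut self, member: &[u8], score: f64) -> bool {
        if let Some(old) = self.scores.get(member).copied() {
            self.index.remove(&(old, member.to_vec()));
        }
        self.index.insert((Score(score), member.to_vec()));
        self.scores.insert(member.to_vec(), Score(score)).is_none()
    }

    /// ZREM: delete from both views, or touch neither when the member is absent.
    pub fn zrem(&mut self, member: &[u8]) -> bool {
        match self.scores.remove(member) {
            Some(old) => self.index.remove(&(old, member.to_vec())),
            None => false,
        }
    }

    /// ZSCORE: one map lookup, no ordered-index walk.
    pub fn zscore(&self, member: &[u8]) -> Option<f64> {
        self.scores.get(member).map(|s| s.0)
    }

    /// ZRANGEBYSCORE with inclusive bounds, in (score, member) order.
    /// The empty member is the smallest, so the scan starts at the first
    /// entry whose score is at least `min`.
    pub fn zrange_by_score(&self, min: f64, max: f64) -> Vec<Vec<u8>> {
        self.index
            .range((Score(min), Vec::new())..)
            .take_while(|(s, _)| s.0 <= max)
            .map(|(_, m)| m.clone())
            .collect()
    }

    /// The sync invariant: both views agree on membership and score.
    pub fn in_sync(&self) -> bool {
        self.index.len() == self.scores.len()
            && self.index.iter().all(|(s, m)| self.scores.get(m) == Some(s))
    }
}
```

Because ties on score fall back to member byte order inside the tuple key, an equal-score run is contiguous and already lexicographic, which is the property the spec relies on to serve ZRANGEBYLEX without a second index. A property test in the spirit of the acceptance hooks would call `in_sync()` after every randomized ZADD or ZREM.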