Merged
104 changes: 104 additions & 0 deletions docs/design/LIST_LARGE.md
@@ -0,0 +1,104 @@
# Design: List large representation (quicklist-equivalent chunked deque)

Issue: #135. Decisions: ADR-0018 (encoding thresholds), ADR-0005 (per-shard
unsynchronized map), ADR-0009 (behavioral equivalence). Related: #113 (small
listpack list chunk), #35 (index), #40 (OBJECT ENCODING name and ql_* fields),
#136 (large-collection-bakeoff), #128 (list command semantics), #52
(value compression), #8 (harness).

## Goal and scope

A list that outgrows a single small listpack chunk (ADR-0018) needs a structure
with O(1) head and tail operations and bounded per-node memory, the quicklist
contract. This spec fixes the chunked-deque shape, the node-size policy, the
traversal model for the interior commands, and how chunks split and merge, plus
the ql_nodes/ql_avg_node fields #40 must synthesize. Scope is the representation
above the listpack threshold; the small chunk is #113, the threshold is
ADR-0018/#37, and the flat-deque-versus-indexed-chunk choice is the #136
bake-off. This spec sets the provisional flat baseline and the contract.

## Design

### Chunked deque of listpack nodes

- The large list is a deque of compact listpack chunks, the quicklist shape
Redis uses [redis-list-max-listpack-size-neg2]. Each chunk is one contiguous
listpack with the ~6-byte header (total bytes plus element count) and a 1-byte
terminator [redis-listpack-header-6-bytes], holding a run of elements in order;
the chunks are linked head-to-tail. The whole list is one value on one core
(ADR-0005), so no chunk link is synchronized.
- The provisional structure is a flat doubly linked deque of chunks (a plain
prev/next chain). It is provisional because #136 evaluates an indexed chunk
structure (a small B-tree or rope of chunks) for faster positional access; this
spec commits to the chunk-deque trait, not to flat-versus-indexed.
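
The chunk framing above can be sketched as follows: a minimal illustration of an empty chunk with the 6-byte header and 1-byte terminator. The field widths (4-byte total length, 2-byte element count) follow the listpack header described above; the little-endian encoding and all function names are assumptions of this sketch, not IronCache's API.

```rust
/// Header size: 4-byte total length plus 2-byte element count.
const HEADER_LEN: usize = 6;
/// Terminator byte that closes every chunk.
const TERMINATOR: u8 = 0xFF;

/// Build an empty chunk: header plus terminator, no entries.
/// (Hypothetical helper; little-endian fields are an assumption.)
fn new_chunk_bytes() -> Vec<u8> {
    let total = (HEADER_LEN + 1) as u32;
    let mut buf = Vec::with_capacity(HEADER_LEN + 1);
    buf.extend_from_slice(&total.to_le_bytes()); // total bytes
    buf.extend_from_slice(&0u16.to_le_bytes()); // element count
    buf.push(TERMINATOR);
    buf
}

/// Read the total byte length from the header.
fn chunk_total_bytes(buf: &[u8]) -> u32 {
    u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]])
}

/// Read the element count from the header.
fn chunk_element_count(buf: &[u8]) -> u16 {
    u16::from_le_bytes([buf[4], buf[5]])
}
```

Because the count sits in the header, a traversal can skip a whole chunk without decoding its entries.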

### Node sizing (~8 KB)

- A chunk's byte budget follows list-max-listpack-size -2, the Redis default,
which caps each node's listpack by size (8 KB) rather than by element count
[redis-list-max-listpack-size-neg2]. A push that would exceed the budget starts
a new chunk; the cap keeps each node cache-resident and bounds the cost of an
interior memmove within one chunk. IronCache stores only the listpack bytes per
chunk, in contrast to Redis's 32-byte quicklistNode struct (prev/next, listpack
ptr, sz, count, and bitfields) [redis-quicklist-node-32-bytes]; interior-node
LZF compression [redis-quicklist-node-32-bytes] is a design choice deferred to
COMPRESSION.md (#52), not adopted here.
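
A minimal sketch of the budget rule, assuming payload bytes stand in for listpack entries (header and per-entry overhead are ignored) and a `VecDeque` stands in for the chunk chain. `BudgetChunk`, `BudgetList`, and `CHUNK_BUDGET` are hypothetical names, not the IronCache types.

```rust
use std::collections::VecDeque;

/// Per-chunk byte budget (8 KB), mirroring list-max-listpack-size -2.
const CHUNK_BUDGET: usize = 8 * 1024;

/// Hypothetical chunk: raw element payloads stand in for listpack entries.
struct BudgetChunk {
    bytes: usize,
    items: Vec<Vec<u8>>,
}

/// Hypothetical chunked list; the VecDeque stands in for the prev/next chain.
struct BudgetList {
    chunks: VecDeque<BudgetChunk>,
}

impl BudgetList {
    fn new() -> Self {
        BudgetList { chunks: VecDeque::new() }
    }

    /// Append at the tail, starting a new chunk when the budget would be exceeded.
    fn rpush(&mut self, item: Vec<u8>) {
        let fits = match self.chunks.back() {
            Some(tail) => tail.bytes + item.len() <= CHUNK_BUDGET,
            None => false,
        };
        if !fits {
            self.chunks.push_back(BudgetChunk { bytes: 0, items: Vec::new() });
        }
        let tail = self.chunks.back_mut().expect("tail chunk exists");
        tail.bytes += item.len();
        tail.items.push(item);
    }

    /// Pop from the tail, freeing the tail chunk once it becomes empty.
    fn rpop(&mut self) -> Option<Vec<u8>> {
        let tail = self.chunks.back_mut()?;
        let item = tail.items.pop()?;
        tail.bytes -= item.len();
        if tail.items.is_empty() {
            self.chunks.pop_back();
        }
        Some(item)
    }
}
```

In this sketch an element larger than the budget simply gets a chunk of its own; how IronCache treats oversized elements is not specified here.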

### Head/tail O(1) and interior traversal

- LPUSH/RPUSH/LPOP/RPOP touch only the head or tail chunk: an append or pop
inside that chunk's listpack, allocating or freeing a chunk only at the budget
boundary, so end operations are O(1) amortized.
- LINDEX/LRANGE/LSET/LINSERT walk the chunk chain accumulating element counts to
locate the target chunk, then scan within it. Each chunk carries its element
count in the listpack header [redis-listpack-header-6-bytes], so locating a
chunk by index is a walk over chunk counts, not over every element; the flat
baseline makes this O(number of chunks), which is the cost #136 weighs against
an indexed variant. LSET rewrites one entry in place; LINSERT inserts into the
target chunk's listpack with at most one tail memmove within that chunk.
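
The positional walk in the flat baseline can be sketched as below, assuming the per-chunk element counts have already been read from the chunk headers. `locate` is a hypothetical helper; negative (from-the-tail) indices are left out.

```rust
/// Locate the chunk holding list index `index`, given per-chunk element counts.
/// Returns (chunk position, offset inside that chunk), or None if out of range.
/// Cost is O(number of chunks): only counts are summed, no element is decoded.
fn locate(chunk_counts: &[usize], index: usize) -> Option<(usize, usize)> {
    let mut skipped = 0usize;
    for (pos, &count) in chunk_counts.iter().enumerate() {
        if index < skipped + count {
            return Some((pos, index - skipped));
        }
        skipped += count;
    }
    None
}
```

For chunks holding 3, 4, and 2 elements, index 3 resolves to the first element of the second chunk.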

### Chunk split and merge

- An insert that pushes a chunk past the ~8 KB budget
[redis-list-max-listpack-size-neg2] splits it into two chunks at an element
boundary near the midpoint. When deletions leave two adjacent chunks jointly
under the budget, the chunks merge, which bounds the chunk count and keeps
ql_avg_node meaningful. The merge low-watermark (how empty a chunk must be
before it merges) is harness-tuned (#8), a churn-versus-resident-bytes trade,
not fixed here.
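
A rough sketch of the two operations, assuming the split point is the element boundary nearest the byte midpoint (the strict-midpoint option from the open questions) and that merging uses the plain "jointly under the budget" test with no low-watermark. `split_chunk` and `try_merge` are hypothetical helpers over raw payloads.

```rust
/// Split a chunk's elements at the element boundary nearest the byte midpoint.
/// Elements are never cut in half; the left side always keeps at least one.
fn split_chunk(items: Vec<Vec<u8>>) -> (Vec<Vec<u8>>, Vec<Vec<u8>>) {
    let total: usize = items.iter().map(|e| e.len()).sum();
    let mut left_bytes = 0usize;
    let mut cut = 0usize;
    for item in &items {
        if cut > 0 && left_bytes + item.len() > total / 2 {
            break;
        }
        left_bytes += item.len();
        cut += 1;
    }
    // Degenerate case (for example all-empty elements): fall back to the count midpoint.
    if cut == items.len() && items.len() > 1 {
        cut = items.len() / 2;
    }
    let mut left = items;
    let right = left.split_off(cut);
    (left, right)
}

/// Merge `right` into `left` when their combined payload fits the budget.
/// Returns true if the merge happened; `right` is left empty in that case.
fn try_merge(left: &mut Vec<Vec<u8>>, right: &mut Vec<Vec<u8>>, budget: usize) -> bool {
    let left_bytes: usize = left.iter().map(|e| e.len()).sum();
    let right_bytes: usize = right.iter().map(|e| e.len()).sum();
    if left_bytes + right_bytes > budget {
        return false;
    }
    left.append(right);
    true
}
```

A harness-tuned low-watermark would replace the `budget` comparison in `try_merge` with a stricter threshold to reduce split/merge churn.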

### ql_nodes and ql_avg_node derivation

- ql_nodes is the live chunk count; ql_avg_node is total element count divided by
ql_nodes. Both are computed from the deque IronCache actually holds and
surfaced through DEBUG OBJECT for `quicklist` keys [redis-quicklist-node-32-bytes],
the synthesis #40 wires in. They reflect IronCache chunking, not a Redis node
layout, and are a pure function of the current representation (#40).
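
The derivation is small enough to state directly. The sketch below assumes the per-chunk element counts are available as a slice and reports 0.0 for an empty deque; `ql_stats` is a hypothetical name.

```rust
/// Derive (ql_nodes, ql_avg_node) from per-chunk element counts.
/// A pure function of the current chunking, with no stored counters.
fn ql_stats(chunk_counts: &[usize]) -> (usize, f64) {
    let nodes = chunk_counts.len();
    let total: usize = chunk_counts.iter().sum();
    let avg = if nodes == 0 {
        0.0
    } else {
        total as f64 / nodes as f64
    };
    (nodes, avg)
}
```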

## Open questions

- Flat doubly linked chunk chain vs an indexed chunk structure (small B-tree or
rope) for positional access, decided by #136 on throughput-per-core and
bytes-per-element.
- The chunk split point (strict midpoint vs fill-the-tail) and the merge
low-watermark, tuned on the harness (#8).
- Whether a chunk reuses the #113 `pack` exactly or a length-only variant sized
to the ~8 KB cap [redis-list-max-listpack-size-neg2].

## Acceptance and test hooks

- LPUSH/RPUSH/LPOP/RPOP touch only the end chunk and allocate or free a chunk
only at the byte budget (O(1) amortized, structural test).
- An interior LINSERT performs at most one tail memmove within the target chunk
and never rewrites another chunk; no chunk exceeds the ~8 KB budget after split
[redis-list-max-listpack-size-neg2] (property test).
- DEBUG OBJECT reports ql_nodes equal to the live chunk count and a consistent
ql_avg_node [redis-quicklist-node-32-bytes]; OBJECT ENCODING reports
`quicklist` [valkey-assert-encoding-vocab] (ADR-0009, name map #40).
- LINDEX/LRANGE/LSET match the oracle across chunk boundaries (#97/#98, #128).

## References

- ADR-0005, ADR-0009, ADR-0018; issues #113, #35, #40, #136, #128, #52, #37,
#8, #97, #98.
- Claims: [redis-list-max-listpack-size-neg2], [redis-listpack-header-6-bytes],
[redis-quicklist-node-32-bytes], [valkey-assert-encoding-vocab].
100 changes: 100 additions & 0 deletions docs/design/OBJECT_ENCODING_MAPPING.md
@@ -0,0 +1,100 @@
# Design: OBJECT ENCODING / DEBUG OBJECT compatibility mapping

Issue: #40. Decisions: ADR-0009 (behavioral equivalence via OBJECT ENCODING),
ADR-0018 (encoding thresholds). Related: #35 (index, parent), #111 (object
layout), #112 (scalar encodings), #113 (collection container), #134 (large
zset), #135 (large list), #95 (conformance), #150 (DEBUG OBJECT command).

## Goal and scope

Clients and conformance suites introspect storage through OBJECT ENCODING and
DEBUG OBJECT and branch on the exact synthetic name returned, so IronCache must
report Redis-vocabulary names even though its internal representations are chosen
for a Rust runtime, not Redis's C internals (ADR-0009). This spec fixes the total
function from every internal representation to one reported name, the DEBUG
OBJECT field synthesis, and the assert_encoding wiring. Out of scope are the
structures themselves (#35, #112, #113, #134, #135) and the thresholds at which
they convert (ADR-0018/#37); this spec reports the active representation's name,
it does not decide the representation.

## Design

### The representation-to-name table (total function)

- The reported vocabulary is the eight Redis synthetic names the conformance
suite asserts on [valkey-assert-encoding-vocab]: embstr, int, raw, listpack,
intset, hashtable, skiplist, quicklist. The mapping is a total function: each
internal representation maps to exactly one name, never two. Issue #40's
acceptance table collapsed embstr/raw into a single bullet and left the
embstr-vs-raw split as an open decision; this spec keeps both names, matching
ENCODINGS.md, which reports out-of-line strings as the `raw`-class.
- String types: a pointer-tagged inline integer (#112) reports `int`; an inline
short string (SSO, the embstr-class up to the inline threshold
[redis-embstr-threshold-44]) reports `embstr`; an out-of-line string with a
variable-width header [redis-sds-header-variants] reports `raw`. The embstr/raw
boundary is the inline-value threshold (#111), reported off the current
representation, not recomputed from config.
- Collection types: the small universal `pack` container (#113) reports
`listpack` for hash, list, set, and zset alike; the all-integer sorted-array
analog [redis-intset-layout] reports `intset`. The large hash and set report
`hashtable`, the large sorted set (#134) reports `skiplist`, and the chunked
list deque (#135) reports `quicklist`. The borrowed name `quicklist` describes
the chunked shape, not the 32-byte Redis node layout [redis-quicklist-node-32-bytes].
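
One way to make the table total by construction is an exhaustive `match` with no wildcard arm, so adding a representation without naming it fails to compile. The sketch below assumes a hypothetical `Repr` enum; the variant names are illustrative and are not IronCache's internal types.

```rust
/// Hypothetical internal representations; variant names are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Repr {
    InlineInt,
    InlineStr,
    HeapStr,
    Pack,
    IntArray,
    LargeHash,
    LargeSet,
    LargeZset,
    ChunkedList,
}

/// Every variant, used to check totality against the reported vocabulary.
const ALL_REPRS: [Repr; 9] = [
    Repr::InlineInt,
    Repr::InlineStr,
    Repr::HeapStr,
    Repr::Pack,
    Repr::IntArray,
    Repr::LargeHash,
    Repr::LargeSet,
    Repr::LargeZset,
    Repr::ChunkedList,
];

/// Map a representation to exactly one reported name.
/// No wildcard arm: a new variant must be named before the code compiles.
fn encoding_name(repr: Repr) -> &'static str {
    match repr {
        Repr::InlineInt => "int",
        Repr::InlineStr => "embstr",
        Repr::HeapStr => "raw",
        Repr::Pack => "listpack",
        Repr::IntArray => "intset",
        Repr::LargeHash | Repr::LargeSet => "hashtable",
        Repr::LargeZset => "skiplist",
        Repr::ChunkedList => "quicklist",
    }
}
```

The function takes only the representation, never a threshold or config value, which is what keeps the reported name independent of ADR-0018 settings.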

### Name derives from representation, not from thresholds

- The reported name is a pure function of the active internal representation, so
reconfiguring an ADR-0018 threshold (which changes WHEN a value converts) never
changes the name reported for a value that has not converted. Two keys of the
same logical type report different names exactly when their representations
differ (for example a 50-member zset listpack vs a 5000-member zset skiplist),
matching the oracle (ADR-0009).

### DEBUG OBJECT field synthesis

- DEBUG OBJECT emits a line with `encoding:<name>` from the same function above,
so OBJECT ENCODING and DEBUG OBJECT always agree on the name. Fields IronCache
can compute honestly are synthesized: `serializedlength` from the value's
encoded byte size, and for `quicklist` keys `ql_nodes` (the live chunk count)
and `ql_avg_node` (elements per chunk), both derived from IronCache's chunking
(#135) rather than a Redis node count [redis-quicklist-node-32-bytes]. Fields
that describe a Redis internal with no IronCache counterpart are omitted rather
than emitted as fabricated zeros, so no test ends up asserting on an invented
value.
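
The omit-rather-than-fabricate rule can be sketched with an `Option`: the quicklist fields are appended only when the key actually has them. The struct, the field order, and the two-decimal formatting of `ql_avg_node` are assumptions of this sketch; the real reply carries more fields than are shown.

```rust
/// Hypothetical inputs for one DEBUG OBJECT reply line.
struct DebugInfo {
    encoding: &'static str,
    serialized_len: usize,
    /// (ql_nodes, ql_avg_node); Some only for `quicklist` keys.
    quicklist: Option<(usize, f64)>,
}

/// Build the reply line; fields without a real value are omitted, not zeroed.
fn debug_object_line(info: &DebugInfo) -> String {
    let mut line = format!(
        "encoding:{} serializedlength:{}",
        info.encoding, info.serialized_len
    );
    if let Some((nodes, avg)) = info.quicklist {
        line.push_str(&format!(" ql_nodes:{} ql_avg_node:{:.2}", nodes, avg));
    }
    line
}
```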

### assert_encoding wiring and rejected alternatives

- The conformance suite adopts Valkey's assert_encoding helper, which runs OBJECT
ENCODING and matches the expected name from the same vocabulary
[valkey-assert-encoding-vocab], treating a mismatch as a correctness failure
(#95). Reporting native names (`btree-zset`, `radix-hash`), even behind a flag,
is rejected: it would fork the test corpus and defeat compatibility. A separate
read-only native-introspection verb for IronCache's own debugging is left open;
if added, it would never be exposed through OBJECT ENCODING.

## Open questions

- The exact embstr-vs-raw byte boundary (the inline-value threshold shared with
#111), and whether any string ever reports `raw` below it.
- Which DEBUG OBJECT fields beyond serializedlength/ql_nodes/ql_avg_node are
load-bearing for the target suites, surfaced as #95 enumerates them.
- Whether a native-name introspection command is worth adding for debugging
(separate verb, never OBJECT ENCODING).

## Acceptance and test hooks

- Every internal representation maps to exactly one name from {embstr, int, raw,
listpack, intset, hashtable, skiplist, quicklist} (a documented total-function
table, unit-tested for totality).
- OBJECT ENCODING and DEBUG OBJECT agree on the name for the same key, and the
name does not change when only thresholds are reconfigured (property test).
- assert_encoding passes against IronCache across the size ladder and at every
conversion boundary [valkey-assert-encoding-vocab] (#95/#97/#98).
- A `quicklist` key reports a ql_nodes equal to the live chunk count, derived
from IronCache chunking [redis-quicklist-node-32-bytes] (#135).

## References

- ADR-0009, ADR-0018; issues #35, #111, #112, #113, #134, #135, #95, #97, #98,
#150.
- Claims: [valkey-assert-encoding-vocab], [redis-quicklist-node-32-bytes],
[redis-embstr-threshold-44], [redis-sds-header-variants], [redis-intset-layout].
6 changes: 6 additions & 0 deletions docs/design/README.md
@@ -119,3 +119,9 @@ Specs added as the M1 milestone progresses.
monoio/glommio/tokio swappable (#27).
- [IOURING_DATAPATH.md](IOURING_DATAPATH.md): the Linux io_uring net fast path
(per-shard ring, registered fixed buffers, multishot + one-shot fallback) (#28).
- [ZSET_LARGE.md](ZSET_LARGE.md): the large sorted-set representation (ordered
index plus parallel member->score map; final structure deferred to #136) (#134).
- [LIST_LARGE.md](LIST_LARGE.md): the large list (quicklist-equivalent chunked
listpack deque, O(1) head/tail, ~8 KB node sizing) (#135).
- [OBJECT_ENCODING_MAPPING.md](OBJECT_ENCODING_MAPPING.md): the internal-repr to
OBJECT ENCODING name map and DEBUG OBJECT field synthesis (#40).
98 changes: 98 additions & 0 deletions docs/design/ZSET_LARGE.md
@@ -0,0 +1,98 @@
# Design: Sorted-set large representation (ordered index plus member map)

Issue: #134. Decisions: ADR-0018 (encoding thresholds), ADR-0005 (per-shard
unsynchronized map), ADR-0009 (behavioral equivalence). Related: #113 (small
listpack zset), #35 (index), #40 (OBJECT ENCODING name), #136
(large-collection-bakeoff), #128 (zset command semantics), #8 (harness).

## Goal and scope

A sorted set that outgrows the small listpack container (ADR-0018) needs a
structure that serves both an ordered range/rank query and an O(1) member point
lookup, the two access patterns the zset command set demands. This spec fixes the
two-structure shape, the sync invariant that keeps them consistent on one core,
the ordering contracts behind ZRANGEBYSCORE and ZRANGEBYLEX, and which knobs are
harness parameters rather than fixed numbers. Scope is the representation above
the listpack threshold only. The promotion thresholds are ADR-0018/#37, the small
container is #113, and the final choice of ordered-index structure is the #136
bake-off; this spec sets the provisional baseline and the contract every
candidate must satisfy, not the winner.

## Design

### Two structures, one value

- The large zset is a dual structure mirroring Redis: an ordered index keyed by
(score, member) for range and rank, plus a parallel hashmap from member to
score for O(1) ZSCORE and an O(1) lookup of the old score on a ZADD score
update [redis-zset-skiplist-plus-ht]. The
member bytes are stored once and shared between both views, so a member is not
duplicated per structure [redis-zset-skiplist-plus-ht]. The whole value lives
in one kvobj on one core (ADR-0005), so neither structure takes a lock.
- The provisional ordered index is a skiplist [redis-zset-skiplist-plus-ht]. It
is provisional because the #136 bake-off evaluates a cache-conscious B-tree and
an ART against it on throughput-per-core and bytes-per-element; a B-tree packs
many keys per cache line versus the skiplist's one element per tower node
[skiplist-vs-btree-cache], and ART keeps keys ordered at a low per-key byte
cost [art-adaptive-radix-tree-icde13]. This spec commits to the trait the index
sits behind, not the structure that wins.

### The sync invariant

- Every member is in exactly one of two states: present in BOTH the ordered
index and the member map with the same score, or present in NEITHER. There is
no transient single-structure state observable to a command, because all
mutation runs inline on the owning core (ADR-0005) with no yield point inside a
zset write. A ZADD that updates a score is a remove-then-reinsert in the
ordered index plus an in-place score rewrite in the map; ZREM deletes from both. A
property test asserts the two views agree on membership and score after every
operation (the sync invariant).
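
A minimal sketch of the invariant, assuming a `BTreeSet` stands in for the ordered index (the spec's provisional index is a skiplist) and `Rc<[u8]>` stands in for the shared member bytes. `Score`, `DualZset`, and `in_sync` are hypothetical names; NaN scores are assumed to be rejected before this layer.

```rust
use std::cmp::Ordering;
use std::collections::{BTreeSet, HashMap};
use std::rc::Rc;

/// f64 score with a total order so it can key an ordered set.
#[derive(Clone, Copy, Debug)]
struct Score(f64);

impl PartialEq for Score {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Score {}
impl PartialOrd for Score {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Score {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0)
    }
}

/// Hypothetical dual structure: ordered index plus member-to-score map.
struct DualZset {
    index: BTreeSet<(Score, Rc<[u8]>)>,
    scores: HashMap<Rc<[u8]>, Score>,
}

impl DualZset {
    fn new() -> Self {
        DualZset { index: BTreeSet::new(), scores: HashMap::new() }
    }

    /// ZADD: insert a new member, or move an existing one to a new score.
    fn zadd(&mut self, member: &[u8], score: f64) {
        let new = Score(score);
        let existing = self
            .scores
            .get_key_value(member)
            .map(|(k, s)| (Rc::clone(k), *s));
        match existing {
            Some((key, old)) => {
                // Remove-then-reinsert in the index, in-place rewrite in the map.
                self.index.remove(&(old, Rc::clone(&key)));
                self.index.insert((new, key));
                if let Some(slot) = self.scores.get_mut(member) {
                    *slot = new;
                }
            }
            None => {
                // Member bytes are allocated once and shared by both views.
                let key: Rc<[u8]> = Rc::from(member);
                self.index.insert((new, Rc::clone(&key)));
                self.scores.insert(key, new);
            }
        }
    }

    /// ZREM: delete from both structures; returns whether the member existed.
    fn zrem(&mut self, member: &[u8]) -> bool {
        match self.scores.remove_entry(member) {
            Some((key, score)) => {
                self.index.remove(&(score, key));
                true
            }
            None => false,
        }
    }

    /// ZSCORE: a single map lookup, no index walk.
    fn zscore(&self, member: &[u8]) -> Option<f64> {
        self.scores.get(member).map(|s| s.0)
    }

    /// The sync invariant: both views agree on membership and score.
    fn in_sync(&self) -> bool {
        self.index.len() == self.scores.len()
            && self.index.iter().all(|(s, m)| self.scores.get(m) == Some(s))
    }
}
```

A property test would call `in_sync` after every generated ZADD/ZREM step.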

### Ordering: ZRANGEBYSCORE vs ZRANGEBYLEX

- The ordered index is sorted by (score, member): primarily ascending score,
ties broken by member byte order, the ordering Redis defines for a skiplist
zset [redis-zset-skiplist-plus-ht]. ZRANGEBYSCORE, ZRANK, and ZRANGE by index
walk this order directly, forward or reversed.
- ZRANGEBYLEX assumes all members share one score and returns a purely
lexicographic member range. Because the index already breaks score ties by
member bytes, the equal-score run is contiguous and already in member order, so
ZRANGEBYLEX is a sub-scan of that run with no second index. Its result is
defined only when scores are equal, matching the oracle (ADR-0009, #128).
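
The sub-scan can be illustrated with a range query over the (score, member) order. This sketch simplifies scores to `i64` to stay short, uses a `BTreeSet` in place of the provisional skiplist, and supports inclusive bounds only; `range_by_lex` is a hypothetical helper.

```rust
use std::collections::BTreeSet;

/// Members of the equal-score run with `min <= member <= max`, in byte order.
/// Works because ties on score are already ordered by member bytes.
fn range_by_lex(
    index: &BTreeSet<(i64, Vec<u8>)>,
    score: i64,
    min: &[u8],
    max: &[u8],
) -> Vec<Vec<u8>> {
    // An inverted range is empty (BTreeSet::range would panic on it).
    if min > max {
        return Vec::new();
    }
    index
        .range((score, min.to_vec())..=(score, max.to_vec()))
        .map(|(_, member)| member.clone())
        .collect()
}
```

No second index is consulted: the scan touches only the contiguous run of entries sharing the given score.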

### Level and fanout as harness parameters

- For the skiplist baseline the max level and the level-promotion probability are
harness parameters (#8), not fixed here; for a B-tree or ART candidate the
analogous knob is node fanout. They are swept in the #136 bake-off because the
right value depends on IronCache's value layout and the thread-per-core engine,
so numbers published for other systems do not transfer.

## Open questions

- The final ordered-index structure (skiplist vs cache-conscious B-tree vs ART),
decided by #136 on throughput-per-core and bytes-per-element.
- Whether the member map is a distinct per-zset hashbrown table or folds into the
ordered index nodes once the structure is chosen (#136), and the score-update
path's exact cost under each.
- Whether a maintained rank/size annotation is worth its bytes for O(log n) ZRANK
versus a counted walk, tuned on the harness (#8).

## Acceptance and test hooks

- After any ZADD/ZREM/ZINCRBY the ordered index and the member map agree on
membership and score for every member (the sync invariant, property test).
- ZSCORE is a single member-map lookup with no ordered-index walk; ZRANGEBYSCORE
and ZRANK walk the (score, member) order and match the oracle (#97/#98).
- ZRANGEBYLEX over an equal-score set returns the lexicographic member range and
matches the oracle; mixed scores follow the oracle's defined behavior
(#97/#98, #128).
- OBJECT ENCODING reports `skiplist` for the large zset regardless of the chosen
internal structure [valkey-assert-encoding-vocab] (ADR-0009, name map in #40).

## References

- ADR-0005, ADR-0009, ADR-0018; issues #113, #35, #40, #136, #128, #37, #8,
#97, #98.
- Claims: [redis-zset-skiplist-plus-ht], [redis-zset-max-listpack-entries-128],
[skiplist-vs-btree-cache], [art-adaptive-radix-tree-icde13],
[valkey-assert-encoding-vocab].