From 8dbd6af79b3dc7316eb8a3737b45c60c22f03762 Mon Sep 17 00:00:00 2001
From: Zeke
Date: Sat, 13 Jun 2026 21:41:29 -0700
Subject: [PATCH] design: cluster control plane (Raft) + membership (SWIM+Lifeguard) (closes #73, closes #74)

CONTROL_PLANE.md (#73): a 3-5 voter Raft group owning the authoritative
slot map + config epoch + membership + replica promotion; data nodes as
learners; CLUSTER SLOTS/SHARDS as a committed-epoch projection; config
only through Raft; Jepsen/Elle 21-failure-class acceptance.

MEMBERSHIP.md (#74): SWIM + non-optional Lifeguard data-plane
membership/failure detection behind a Membership trait; LAN/WAN
profiles; health joined with the Raft-committed map, monotonic under
suspicion; SWIM-proposes/Raft-commits handshake.

Authored+reviewed via workflow. CI passes.

Closes #73, closes #74.

Signed-off-by: Zeke
---
 docs/design/CONTROL_PLANE.md | 103 +++++++++++++++++++++++++++++++++++
 docs/design/MEMBERSHIP.md    | 100 ++++++++++++++++++++++++++++++++++
 docs/design/README.md        |   4 ++
 3 files changed, 207 insertions(+)
 create mode 100644 docs/design/CONTROL_PLANE.md
 create mode 100644 docs/design/MEMBERSHIP.md

diff --git a/docs/design/CONTROL_PLANE.md b/docs/design/CONTROL_PLANE.md
new file mode 100644
index 0000000..b6105c0
--- /dev/null
+++ b/docs/design/CONTROL_PLANE.md
@@ -0,0 +1,103 @@
+# Design: Raft control plane for the authoritative slot map
+
+Issue: #73. Decisions: ADR-0025 (cluster partition map and slot ownership),
+ADR-0011 (single-node-first, slot-ready layout), ADR-0012 (scale-out targets),
+ADR-0002 (shared-nothing; per-shard slots are the migration and execution unit).
+Related: #74/MEMBERSHIP.md (the SWIM signal this commits over), #70 (client
+cluster contract), #75 (slot migration), #68 (clustering umbrella).
+
+## Goal and scope
+
+One authoritative answer to "which node owns slot N right now, and at what config epoch."
+The Redis-compatible baseline gossips a per-node served-slots bitmap on a
+cluster bus at base_port+10000 [redis-cluster-bus-port], which has no linearizable
+arbiter and whose bitmap size caps the design near 1000 masters
+[redis-cluster-why-16384]. This folds the slot map, config epoch, membership
+roster, and replica promotion into a small in-binary Raft group; config changes
+flow only through Raft. Scope is the control plane: the slot->node map, epoch,
+membership commit, and promotion. Out of scope: the partition layout itself
+(ADR-0025), the data-plane failure signal (#74), and per-slot data migration
+mechanics (#75).
+
+## Design
+
+### Raft group and the replicated state machine
+
+- A single Raft group of 3 to 5 voters owns the authoritative state; Raft is the
+  multi-Paxos-equivalent decomposition into leader election, log replication, and
+  safety [raft-overview]. The replicated state machine is exactly the config: the
+  slot->node map of ADR-0025, the monotonic config epoch, the membership roster,
+  and replica role assignments. No user data ever passes through the log, which
+  keeps the blast radius config-sized and the log small enough to snapshot cheaply.
+- Data nodes are non-voting learners: they receive committed map deltas and apply
+  them but do not vote, so commit latency stays bounded by the 3-to-5 voters even
+  at the several-thousand-node target of ADR-0012. The voter set is itself a
+  committed entry, so growing or shrinking it is an ordinary config change.
+
+### Config epoch and committed-map projection
+
+- Every committed change bumps the config epoch monotonically. CLUSTER SLOTS and
+  CLUSTER SHARDS are pure projections of the committed map at the current epoch
+  (#70), never of in-flight log entries, so a client never observes a slot owner
+  that Raft has not committed. The epoch is the tie-breaker for stale clients: a
+  MOVED carries the destination at the committing epoch.
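
Annotation (not part of the patch): the committed-epoch projection described above can be sketched as a small config-only state machine. This is a minimal illustration under stated assumptions, not the design's actual types: `NodeId`, `ConfigChange`, `ControlState`, and every method name are hypothetical, and Raft's replication and quorum logic is reduced to an explicit `commit_next` call.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Hypothetical node identifier; the design does not fix a representation.
pub type NodeId = u32;

/// One config change carried by a Raft log entry (illustrative subset only).
#[derive(Clone, Debug, PartialEq)]
pub enum ConfigChange {
    AssignSlot { slot: u16, owner: NodeId },
}

/// Config-only replicated state: committed map, epoch, and uncommitted tail.
#[derive(Debug, Default)]
pub struct ControlState {
    epoch: u64,
    slots: BTreeMap<u16, NodeId>,
    in_flight: VecDeque<ConfigChange>,
}

impl ControlState {
    /// Append an entry to the log tail; nothing client-visible changes yet.
    pub fn append(&mut self, change: ConfigChange) {
        self.in_flight.push_back(change);
    }

    /// Treat the oldest in-flight entry as committed: apply it, bump the epoch.
    /// Returns the new epoch, or None if there was nothing to commit.
    pub fn commit_next(&mut self) -> Option<u64> {
        let change = self.in_flight.pop_front()?;
        match change {
            ConfigChange::AssignSlot { slot, owner } => {
                self.slots.insert(slot, owner);
            }
        }
        self.epoch += 1;
        Some(self.epoch)
    }

    /// CLUSTER SLOTS-style projection: committed owners at the current epoch.
    pub fn cluster_slots(&self) -> (u64, Vec<(u16, NodeId)>) {
        let owners = self.slots.iter().map(|(s, n)| (*s, *n)).collect();
        (self.epoch, owners)
    }

    /// Committed owner of `slot` paired with the current epoch, as a
    /// redirect would need it (hypothetical helper).
    pub fn moved_target(&self, slot: u16) -> Option<(u64, NodeId)> {
        self.slots.get(&slot).map(|n| (self.epoch, *n))
    }
}
```

In this sketch an appended but uncommitted reassignment is invisible to `cluster_slots` and `moved_target`; only `commit_next` changes what a client can read, and each commit raises the epoch by exactly one.
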
+
+### Membership and promotion go only through Raft
+
+- Adding, removing, or promoting a node is a Raft proposal, not a gossip event.
+  This replaces the eventually-consistent, "last failover wins" posture of an
+  external Sentinel quorum [redis-sentinel-quorum-vs-majority]: one quorum, one
+  source of truth, and no second arbiter to disagree. A replica is eligible for
+  promotion only when the leader records its replication offset as within a
+  configured lag bound; promotion is a committed entry that flips the role and
+  bumps the epoch atomically.
+
+### The SWIM-proposes / Raft-commits handshake (with #74)
+
+- The data-plane membership layer (#74) is a fast but unauthoritative suspicion
+  source. A SWIM suspicion or confirmation is a *hint*: the Raft leader treats it
+  as a proposal input, never as a commit. Only a committed Raft entry changes the
+  authoritative roster or a slot owner. This is the contract MEMBERSHIP.md
+  specifies from the membership side; here it means the leader is the sole writer
+  and a healthy node cannot be evicted by gossip alone, only by a quorum commit.
+- The committed map is therefore monotonic under transient suspicion: a node that
+  SWIM marks suspect is demoted in the client-facing projection only after the
+  leader commits the demotion, so CLUSTER SLOTS never regresses on a flap.
+
+### Correctness bar
+
+- The acceptance target is the Jepsen/Elle suite covering all 21 Redis-Raft
+  failure classes (split-brain, lost updates, stale/aborted reads, total data
+  loss on failover or membership change) [jepsen-redis-raft-21-issues]. Keeping
+  user data out of the log shrinks the surface, but the config plane must still
+  clear every class under partitions, pauses, clock skew, and membership churn.
+
+## Open questions
+
+- Fixed 3 voters or 5: higher fault tolerance versus commit latency under the
+  ADR-0012 failover budget.
+- Learner delivery: Raft learners on the same transport, or an out-of-band poll
+  of committed map deltas to keep voter fan-out small.
+- The node-count or commit-latency threshold that forces sharded/hierarchical
+  multi-Raft instead of one group.
+- Whether the cluster-bus port stays at base+10000 for compatibility once the
+  gossip slot bitmap is gone [redis-cluster-bus-port].
+
+## Acceptance and test hooks
+
+- Slot ownership is linearizable: a committed map at epoch E is never contradicted
+  by any node; a DST/Jepsen history shows no two nodes serving the same slot as
+  owner at the same epoch.
+- No gossip-propagated slot bitmap exists; internal node count is not bitmap-capped.
+- Failover and promotion run in-binary with no Sentinel process
+  [redis-sentinel-quorum-vs-majority]; promotion respects the lag gate.
+- CLUSTER SLOTS/SHARDS render from the committed map and never regress under a
+  transient SWIM suspicion (joint hook with #74).
+- The Jepsen/Elle suite passes all 21 Redis-Raft failure classes
+  [jepsen-redis-raft-21-issues] (#99).
+
+## References
+
+- ADR-0025, ADR-0011, ADR-0012, ADR-0002; issues #73, #74, #70, #75, #68, #99.
+- Claims: [raft-overview], [redis-cluster-bus-port], [redis-cluster-why-16384],
+  [redis-sentinel-quorum-vs-majority], [jepsen-redis-raft-21-issues].
diff --git a/docs/design/MEMBERSHIP.md b/docs/design/MEMBERSHIP.md
new file mode 100644
index 0000000..411f0e4
--- /dev/null
+++ b/docs/design/MEMBERSHIP.md
@@ -0,0 +1,100 @@
+# Design: SWIM + Lifeguard data-plane membership
+
+Issue: #74. Decisions: ADR-0025 (cluster partition map this health joins against),
+ADR-0012 (scale-out targets the protocol must hold flat past), ADR-0002
+(shared-nothing; membership state is per-node, off the data hot path), ADR-0011
+(slot-ready layout the live view is rendered over). Related: #73/CONTROL_PLANE.md
+(the Raft authority SWIM proposes to), #70 (client cluster view), #68 (umbrella).
+
+## Goal and scope
+
+A data-plane membership and failure-detection layer whose per-node cost is flat
+as the cluster grows and that does not flap a healthy node out of the ring during
+a GC pause. It answers, for every other subsystem, which nodes are alive right
+now so replication routes around the dead and the client view never points at a
+corpse. Scope: the failure-detection protocol, its LAN/WAN defaults, the
+`Membership` trait, and how SWIM health joins the Raft-committed map. Out of
+scope: slot rebalancing policy and the authoritative epoch (#73), and the
+partition layout (ADR-0025).
+
+## Design
+
+### SWIM, behind a Membership trait
+
+- Adopt SWIM: randomized round-robin direct ping plus k indirect pings, with
+  infection-style dissemination, so expected detection time, false-positive rate,
+  and per-member message load are independent of group size [swim-scalability].
+  This is the constant-cost transport that lets ADR-0012's several-thousand-node
+  target hold flat past the ~1000-node Redis full-mesh ceiling
+  [redis-cluster-max-nodes-recommendation].
+- The protocol sits behind a `Membership` trait so it is swappable (a full-mesh
+  backend stays implementable behind the same interface). Whether the trait is
+  public in v1 or internal until a second backend exists is an open question
+  below (Simple over Scalable until earned).
+
+### Lifeguard, non-optional
+
+- Lifeguard local-health awareness is a non-optional extension, not a tuning
+  knob: a node that suspects itself is slow (missed self-probes) dilates its own
+  timeouts so it does not falsely accuse peers, cutting failure-detector false
+  positives by ~50x under CPU/network stress in the memberlist deployment
+  [memberlist-lifeguard]. This is what keeps a GC pause from evicting a healthy
+  node.
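
Annotation (not part of the patch): the timeout dilation described in the Lifeguard bullet can be sketched as a local health score that multiplies a base timeout. This is a minimal sketch under stated assumptions: `LocalHealth`, its methods, the one-point-per-event scoring, and the cap of 8 used in the note below are all hypothetical and are not taken from the design or from memberlist's actual rules.

```rust
use std::time::Duration;

/// Local health score in the spirit of Lifeguard: 0 means healthy, and each
/// point adds one more multiple of the base timeout. Hypothetical sketch.
#[derive(Debug)]
pub struct LocalHealth {
    score: u32,
    max_multiplier: u32, // assumed cap on how far timeouts may dilate
}

impl LocalHealth {
    /// `max_multiplier` is clamped to at least 1 so the base timeout is a floor.
    pub fn new(max_multiplier: u32) -> Self {
        Self { score: 0, max_multiplier: max_multiplier.max(1) }
    }

    /// A probe this node sent went unanswered in time. The node may itself be
    /// the slow party, so it becomes more lenient instead of accusing the peer.
    pub fn missed_probe(&mut self) {
        self.score = (self.score + 1).min(self.max_multiplier - 1);
    }

    /// A probe completed in time: local health recovers by one point.
    pub fn probe_ok(&mut self) {
        self.score = self.score.saturating_sub(1);
    }

    /// Dilate a base timeout by the current score: base * (score + 1).
    pub fn scale(&self, base: Duration) -> Duration {
        base * (self.score + 1)
    }
}
```

With an assumed cap of 8 and a 1 s base suspicion timeout, a node that has just missed two probes waits 3 s before suspecting a peer, and a long pause can stretch that to at most 8 s. The timeout shrinks again one step per successful probe.
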
+
+### LAN and WAN profiles
+
+- Two default profiles split the single dominant timeout by network class,
+  adapting the intent of Redis's one global `cluster-node-timeout` (15000 ms
+  default [redis-cluster-node-timeout-default]) rather than its single value.
+  LAN: probe ~200 ms, suspicion ~1 s, for fast failover within the ADR-0012
+  budget. Cloud/WAN: probe ~1 s, suspicion ~5 s, anchored under the 15 s ceiling
+  [redis-cluster-node-timeout-default] to tolerate jitter. Exact suspicion
+  multipliers and indirect-probe fan-out are open below.
+
+### Health joined with the Raft-committed map
+
+- The client-facing view is not SWIM-direct. CLUSTER SLOTS/SHARDS (#70) are
+  rendered from the Raft-committed slot->node map (#73) joined with live SWIM
+  health, so a polling client reads a stable, committed answer rather than raw
+  gossip. SWIM only annotates liveness over an ownership decided by Raft.
+
+### The SWIM-proposes / Raft-commits handshake (with #73)
+
+- SWIM is fast and unauthoritative; Raft is authoritative and slower. A SWIM
+  suspicion or confirmation is a *hint* delivered to the Raft leader as a proposal
+  input (#73); it never directly mutates the roster or a slot owner. The leader
+  commits the membership or promotion change, bumps the config epoch, and only
+  then does the change appear in the client projection. This is the same contract
+  CONTROL_PLANE.md states from the consensus side: SWIM proposes, Raft commits.
+- The joined view is therefore monotonic under suspicion: a node SWIM marks
+  suspect is demoted in the CLUSTER SLOTS/SHARDS reply only after Raft confirms,
+  so the polled answer is stable and never regresses on a transient flap. A
+  debounce window before a suspicion is surfaced is an open question below.
+
+## Open questions
+
+- Exact LAN/WAN suspicion multipliers and indirect-probe fan-out.
+- Whether `Membership` is a public trait in v1 or internal until a second backend
+  exists (Simple over Scalable until earned).
+- The debounce window before a SWIM suspicion is surfaced toward CLUSTER SLOTS.
+- Whether WAN mode is auto-detected or operator-set.
+
+## Acceptance and test hooks
+
+- Measured per-node message and CPU overhead stays constant as N grows
+  [swim-scalability]; a scaling run shows it flat past the ~1000-node ceiling
+  [redis-cluster-max-nodes-recommendation].
+- A GC-pause / process-pause injection does not flap a healthy node out of the
+  ring, validating the Lifeguard self-awareness path [memberlist-lifeguard].
+- The `Membership` trait isolates the protocol: a full-mesh backend compiles and
+  runs behind it unchanged.
+- LAN and WAN default profiles are documented and benchmarked against
+  pause-induced false positives.
+- CLUSTER SLOTS/SHARDS replies derive from Raft-committed state and never regress
+  under a transient SWIM suspicion (joint hook with #73).
+
+## References
+
+- ADR-0025, ADR-0012, ADR-0002, ADR-0011; issues #74, #73, #70, #68.
+- Claims: [swim-scalability], [memberlist-lifeguard],
+  [redis-cluster-node-timeout-default], [redis-cluster-max-nodes-recommendation].
diff --git a/docs/design/README.md b/docs/design/README.md
index db64fc6..cbb336e 100644
--- a/docs/design/README.md
+++ b/docs/design/README.md
@@ -92,3 +92,7 @@ Specs added as the M1 milestone progresses.
 - [CLUSTER_CONTRACT.md](CLUSTER_CONTRACT.md): the Redis-Cluster-compatible client
   wire contract (CRC16/16384 slots, hash tags, CROSSSLOT, MOVED/ASK, CLUSTER
   SLOTS/SHARDS, sharded Pub/Sub) (#70).
+- [CONTROL_PLANE.md](CONTROL_PLANE.md): the in-binary Raft control plane owning the
+  authoritative slot map, config epoch, membership, and replica promotion (#73).
+- [MEMBERSHIP.md](MEMBERSHIP.md): SWIM + non-optional Lifeguard data-plane
+  membership and failure detection, joined with the Raft-committed map (#74).