ELares · Jun 14, 2026
@@ -0,0 +1,103 @@
+# Design: Raft control plane for the authoritative slot map
+
+Issue: #73. Decisions: ADR-0025 (cluster partition map and slot ownership),
+ADR-0011 (single-node-first, slot-ready layout), ADR-0012 (scale-out targets),
+ADR-0002 (shared-nothing; per-shard slots are the migration and execution unit).
+Related: #74/MEMBERSHIP.md (the SWIM signal this commits over), #70 (client
+cluster contract), #75 (slot migration), #68 (clustering umbrella).
+
+## Goal and scope
+
+One authoritative answer to "which node owns slot N right now, and at what config
+epoch." The Redis-compatible baseline gossips a per-node served-slots bitmap on a
+cluster bus at base_port+10000 [redis-cluster-bus-port], which has no linearizable
+arbiter and whose bitmap size caps the design near 1000 masters
+[redis-cluster-why-16384]. This folds the slot map, config epoch, membership
+roster, and replica promotion into a small in-binary Raft group; config changes
+flow only through Raft. Scope is the control plane: the slot->node map, epoch,
+membership commit, and promotion. Out of scope: the partition layout itself
+(ADR-0025), the data-plane failure signal (#74), and per-slot data migration
+mechanics (#75).
+
+## Design
+
+### Raft group and the replicated state machine
+
+- A single Raft group of 3 to 5 voters owns the authoritative state. Raft is
+  equivalent to multi-Paxos but decomposes consensus into leader election, log
+  replication, and safety [raft-overview]. The replicated state machine is
+  exactly the config: the
+  slot->node map of ADR-0025, the monotonic config epoch, the membership roster,
+  and replica role assignments. No user data ever passes through the log, which
+  keeps the blast radius config-sized and the log small enough to snapshot cheaply.
+- Data nodes are non-voting learners: they receive committed map deltas and apply
+  them but do not vote, so commit latency stays bounded by the 3-to-5 voters even
+  at the several-thousand-node target of ADR-0012. The voter set is itself a
+  committed entry, so growing or shrinking it is an ordinary config change.
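
A minimal sketch of the config-only state machine described above, under stated assumptions: `NodeId`, `ConfigCommand`, `ConfigState`, and the validation rules are hypothetical names and policies chosen for illustration, not the project's actual types. The 16384 slot count follows the client contract (#70). It shows the two properties the bullets rely on: only config entries exist, and every applied entry bumps the epoch by exactly one.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical node identifier; the real type is not specified in this design.
pub type NodeId = u32;

/// Slot count of the Redis-Cluster-compatible contract (#70).
pub const SLOT_COUNT: usize = 16384;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Role {
    Primary,
    Replica,
}

/// One committed log entry. Only config changes appear here, never user data.
#[derive(Clone, Debug)]
pub enum ConfigCommand {
    AddNode { node: NodeId, role: Role },
    RemoveNode { node: NodeId },
    AssignSlot { slot: u16, owner: NodeId },
    SetVoters { voters: BTreeSet<NodeId> },
}

#[derive(Clone, Debug)]
pub struct ConfigState {
    pub epoch: u64,
    pub slot_owner: Vec<Option<NodeId>>,
    pub roster: BTreeMap<NodeId, Role>,
    pub voters: BTreeSet<NodeId>,
}

impl ConfigState {
    pub fn new() -> Self {
        ConfigState {
            epoch: 0,
            slot_owner: vec![None; SLOT_COUNT],
            roster: BTreeMap::new(),
            voters: BTreeSet::new(),
        }
    }

    /// Apply one committed entry and return the new epoch.
    /// An invalid entry leaves both the state and the epoch untouched.
    pub fn apply(&mut self, cmd: &ConfigCommand) -> Result<u64, String> {
        match cmd {
            ConfigCommand::AddNode { node, role } => {
                self.roster.insert(*node, *role);
            }
            ConfigCommand::RemoveNode { node } => {
                // Assumed policy: a node must be drained before removal.
                if self.slot_owner.iter().any(|o| *o == Some(*node)) {
                    return Err(format!("node {} still owns slots", node));
                }
                self.roster.remove(node);
            }
            ConfigCommand::AssignSlot { slot, owner } => {
                let idx = usize::from(*slot);
                if idx >= SLOT_COUNT {
                    return Err(format!("slot {} out of range", slot));
                }
                if !self.roster.contains_key(owner) {
                    return Err(format!("unknown node {}", owner));
                }
                self.slot_owner[idx] = Some(*owner);
            }
            ConfigCommand::SetVoters { voters } => {
                // The voter set is itself ordinary replicated config.
                self.voters = voters.clone();
            }
        }
        self.epoch += 1;
        Ok(self.epoch)
    }
}
```

Because `apply` is deterministic and takes only the command, voters and learners that replay the same committed log should converge on the same map and the same epoch.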
+
+### Config epoch and committed-map projection
+
+- Every committed change bumps the config epoch monotonically. CLUSTER SLOTS and
+  CLUSTER SHARDS are pure projections of the committed map at the current epoch
+  (#70), never of in-flight log entries, so a client never observes a slot owner
+  that Raft has not committed. The epoch is the tie-breaker for stale clients: a
+  MOVED redirect names the owner recorded in the map as of the committing epoch.
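
One way to picture the projection, as a sketch only: the function below takes the committed per-slot owners (and nothing from in-flight entries) and coalesces them into contiguous ranges of the kind a CLUSTER SLOTS reply lists. `SlotRange` and `project_ranges` are hypothetical names, and the real reply carries more fields than are shown here.

```rust
/// Hypothetical node identifier.
pub type NodeId = u32;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SlotRange {
    pub start: u16,
    pub end: u16, // inclusive
    pub owner: NodeId,
}

/// Coalesce the committed per-slot owners into contiguous ranges.
/// Unassigned slots (None) are skipped and break a range.
pub fn project_ranges(committed: &[Option<NodeId>]) -> Vec<SlotRange> {
    let mut out: Vec<SlotRange> = Vec::new();
    for (slot, owner) in committed.iter().enumerate() {
        if let Some(owner) = *owner {
            // Extend the previous range only if it is adjacent and has the same owner.
            let extends = match out.last() {
                Some(last) => last.owner == owner && usize::from(last.end) + 1 == slot,
                None => false,
            };
            if extends {
                if let Some(last) = out.last_mut() {
                    last.end = slot as u16;
                }
            } else {
                out.push(SlotRange { start: slot as u16, end: slot as u16, owner });
            }
        }
    }
    out
}
```

Since the input is only ever the committed map, a proposal that is still in the log cannot leak into the client view through this path.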
+
+### Membership and promotion go only through Raft
+
+- Adding, removing, or promoting a node is a Raft proposal, not a gossip event.
+  This replaces the eventually-consistent, "last failover wins" posture of an
+  external Sentinel quorum [redis-sentinel-quorum-vs-majority]: one quorum, one
+  source of truth, and no second arbiter to disagree. A replica is eligible for
+  promotion only when the leader records its replication offset as within a
+  configured lag bound; promotion is a committed entry that flips the role and
+  bumps the epoch atomically.
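
The lag gate can be sketched as a pure predicate over offsets the leader has recorded. Everything named here (`ReplicaProgress`, `eligible_for_promotion`, `pick_candidate`, a lag bound expressed in bytes of replication offset) is an assumption for illustration; in particular, preferring the most caught-up eligible replica is one plausible policy, not something this design specifies.

```rust
/// Replication progress as recorded by the leader (hypothetical shape).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ReplicaProgress {
    pub node: u32,
    pub replication_offset: u64,
}

/// A replica passes the gate when its recorded lag is within the bound.
pub fn eligible_for_promotion(
    primary_offset: u64,
    replica: &ReplicaProgress,
    max_lag: u64,
) -> bool {
    primary_offset.saturating_sub(replica.replication_offset) <= max_lag
}

/// Assumed policy: among eligible replicas, prefer the most caught-up one.
pub fn pick_candidate(
    primary_offset: u64,
    replicas: &[ReplicaProgress],
    max_lag: u64,
) -> Option<&ReplicaProgress> {
    replicas
        .iter()
        .filter(|r| eligible_for_promotion(primary_offset, r, max_lag))
        .max_by_key(|r| r.replication_offset)
}
```

The chosen candidate would then be named in a single promotion entry, so the role flip and the epoch bump commit together or not at all.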
+
+### The SWIM-proposes / Raft-commits handshake (with #74)
+
+- The data-plane membership layer (#74) is a fast but unauthoritative suspicion
+  source. A SWIM suspicion or confirmation is a *hint*: the Raft leader treats it
+  as a proposal input, never as a commit. Only a committed Raft entry changes the
+  authoritative roster or a slot owner. This is the contract MEMBERSHIP.md
+  specifies from the membership side; here it means the leader is the sole writer
+  and a healthy node cannot be evicted by gossip alone, only by a quorum commit.
+- The committed map is therefore monotonic under transient suspicion: a node that
+  SWIM marks suspect is demoted in the client-facing projection only after the
+  leader commits the demotion, so CLUSTER SLOTS never regresses on a flap.
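
A sketch of the handshake from the leader's side, assuming hypothetical `SwimHint`, `Proposal`, and `Leader` types. The policy that a bare suspicion is ignored and only a confirmation becomes a proposal is an assumption made to keep the example small; the design leaves that threshold open. The point it illustrates is that the roster and epoch change only in `on_committed`.

```rust
use std::collections::BTreeSet;

/// Hypothetical node identifier.
pub type NodeId = u32;

/// Unauthoritative input from the data-plane membership layer (#74).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SwimHint {
    Suspect(NodeId),
    Confirmed(NodeId),
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Proposal {
    Evict(NodeId),
}

pub struct Leader {
    pub roster: BTreeSet<NodeId>,
    pub epoch: u64,
    pub pending: Vec<Proposal>,
}

impl Leader {
    /// A hint may produce a proposal but never touches the roster or epoch.
    pub fn on_swim_hint(&mut self, hint: SwimHint) -> Option<Proposal> {
        let node = match hint {
            // Assumed policy: a bare suspicion is not acted on.
            SwimHint::Suspect(_) => return None,
            SwimHint::Confirmed(n) => n,
        };
        let proposal = Proposal::Evict(node);
        if !self.roster.contains(&node) || self.pending.contains(&proposal) {
            return None;
        }
        self.pending.push(proposal);
        Some(proposal)
    }

    /// Called only after a quorum of voters has committed the entry.
    pub fn on_committed(&mut self, proposal: Proposal) {
        self.pending.retain(|p| *p != proposal);
        match proposal {
            Proposal::Evict(node) => {
                if self.roster.remove(&node) {
                    self.epoch += 1;
                }
            }
        }
    }
}
```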
+
+### Correctness bar
+
+- The acceptance target is the Jepsen/Elle suite covering all 21 Redis-Raft
+  failure classes (split-brain, lost updates, stale/aborted reads, total data
+  loss on failover or membership change) [jepsen-redis-raft-21-issues]. Keeping
+  user data out of the log shrinks the surface, but the config plane must still
+  clear every class under partitions, pauses, clock skew, and membership churn.
+
+## Open questions
+
+- Fixed 3 voters or 5: higher fault tolerance versus commit latency under the
+  ADR-0012 failover budget.
+- Learner delivery: Raft learners on the same transport, or an out-of-band poll
+  of committed map deltas to keep voter fan-out small.
+- The node-count or commit-latency threshold that forces sharded/hierarchical
+  multi-Raft instead of one group.
+- Whether the cluster-bus port stays at base_port+10000 for compatibility once the
+  gossip slot bitmap is gone [redis-cluster-bus-port].
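
For the 3-versus-5 question, the standard majority-quorum arithmetic is worth having at hand. The helper names are hypothetical; the formulas are the usual ones for a majority quorum.

```rust
/// Smallest number of voters that forms a majority.
pub fn majority(voters: u32) -> u32 {
    voters / 2 + 1
}

/// Number of voters that can fail while a majority remains reachable.
pub fn tolerated_failures(voters: u32) -> u32 {
    voters.saturating_sub(majority(voters))
}
```

With 3 voters a commit needs 2 acknowledgements and survives 1 failure; with 5 it needs 3 and survives 2. An even count such as 4 needs 3 acknowledgements yet still survives only 1 failure, which is why only odd sizes are on the table.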
+
+## Acceptance and test hooks
+
+- Slot ownership is linearizable: a committed map at epoch E is never contradicted
+  by any node; a DST/Jepsen history shows no two nodes serving the same slot as
+  owner at the same epoch.
+- No gossip-propagated slot bitmap exists; internal node count is not bitmap-capped.
+- Failover and promotion run in-binary with no Sentinel process
+  [redis-sentinel-quorum-vs-majority]; promotion respects the lag gate.
+- CLUSTER SLOTS/SHARDS render from the committed map and never regress under a
+  transient SWIM suspicion (joint hook with #74).
+- The Jepsen/Elle suite passes all 21 Redis-Raft failure classes
+  [jepsen-redis-raft-21-issues] (#99).
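
The first hook above can be checked mechanically over a recorded history. The sketch below assumes a hypothetical `Observation` record (one per "node N served slot S as owner at epoch E" event) and is far simpler than a real Jepsen/Elle checker; it only detects two different owners for the same slot at the same epoch.

```rust
use std::collections::HashMap;

/// One recorded event: `node` served `slot` as owner while at `epoch`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Observation {
    pub epoch: u64,
    pub slot: u16,
    pub node: u32,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Conflict {
    pub epoch: u64,
    pub slot: u16,
    pub first: u32,
    pub second: u32,
}

/// Return the first pair of nodes that both served the same slot as owner
/// at the same epoch, or None if the history has no such pair.
pub fn find_ownership_conflict(history: &[Observation]) -> Option<Conflict> {
    let mut owner_at: HashMap<(u64, u16), u32> = HashMap::new();
    for o in history {
        // Remember the first owner seen for this (epoch, slot) pair.
        let first = *owner_at.entry((o.epoch, o.slot)).or_insert(o.node);
        if first != o.node {
            return Some(Conflict {
                epoch: o.epoch,
                slot: o.slot,
                first,
                second: o.node,
            });
        }
    }
    None
}
```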
+
+## References
+
+- ADR-0025, ADR-0011, ADR-0012, ADR-0002; issues #73, #74, #70, #75, #68, #99.
+- Claims: [raft-overview], [redis-cluster-bus-port], [redis-cluster-why-16384],
+  [redis-sentinel-quorum-vs-majority], [jepsen-redis-raft-21-issues].
@@ -0,0 +1,100 @@
+# Design: SWIM + Lifeguard data-plane membership
+
+Issue: #74. Decisions: ADR-0025 (cluster partition map this health joins against),
+ADR-0012 (scale-out targets past which per-node cost must stay flat), ADR-0002
+(shared-nothing; membership state is per-node, off the data hot path), ADR-0011
+(slot-ready layout the live view is rendered over). Related: #73/CONTROL_PLANE.md
+(the Raft authority SWIM proposes to), #70 (client cluster view), #68 (umbrella).
+
+## Goal and scope
+
+A data-plane membership and failure-detection layer whose per-node cost is flat
+as the cluster grows and that does not flap a healthy node out of the ring during
+a GC pause. It answers, for every other subsystem, which nodes are alive right
+now so replication routes around the dead and the client view never points at a
+corpse. Scope: the failure-detection protocol, its LAN/WAN defaults, the
+`Membership` trait, and how SWIM health joins the Raft-committed map. Out of
+scope: slot rebalancing policy and the authoritative epoch (#73), and the
+partition layout (ADR-0025).
+
+## Design
+
+### SWIM, behind a Membership trait
+
+- Adopt SWIM: each protocol period a node pings one target chosen by randomized
+  round-robin and, if the direct ping times out, asks k other members to ping it
+  indirectly. Updates spread by infection-style dissemination, so expected
+  detection time, false-positive rate, and per-member message load are
+  independent of group size [swim-scalability]. This constant per-node cost is
+  what lets the design reach ADR-0012's several-thousand-node target, past the
+  ~1000-node Redis full-mesh ceiling [redis-cluster-max-nodes-recommendation].
+- The protocol sits behind a `Membership` trait so it is swappable (a full-mesh
+  backend stays implementable behind the same interface). Whether the trait is
+  public in v1 or internal until a second backend exists is an open question
+  below (Simple over Scalable until earned).
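
To make the swappability claim concrete, here is one possible shape for the trait with a toy full-mesh backend behind it. The method set, the `Liveness` enum, and the `FullMesh` type with its two thresholds are all assumptions for illustration; the design does not fix the trait's surface.

```rust
use std::collections::BTreeMap;

/// Hypothetical node identifier.
pub type NodeId = u32;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Liveness {
    Alive,
    Suspect,
    Dead,
}

/// Hypothetical trait surface; callers depend on this, not on the protocol.
pub trait Membership {
    /// Record that `node` answered a probe or heartbeat at `now_ms`.
    fn record_ack(&mut self, node: NodeId, now_ms: u64);
    /// All nodes this backend knows about.
    fn members(&self) -> Vec<NodeId>;
    /// Current verdict for `node`, or None if it is unknown.
    fn liveness(&self, node: NodeId, now_ms: u64) -> Option<Liveness>;

    /// Provided method: every backend gets this for free.
    fn alive_nodes(&self, now_ms: u64) -> Vec<NodeId> {
        self.members()
            .into_iter()
            .filter(|n| self.liveness(*n, now_ms) == Some(Liveness::Alive))
            .collect()
    }
}

/// Toy full-mesh backend: verdicts come from time since the last ack.
pub struct FullMesh {
    last_ack_ms: BTreeMap<NodeId, u64>,
    suspect_after_ms: u64,
    dead_after_ms: u64,
}

impl FullMesh {
    pub fn new(suspect_after_ms: u64, dead_after_ms: u64) -> Self {
        FullMesh { last_ack_ms: BTreeMap::new(), suspect_after_ms, dead_after_ms }
    }
}

impl Membership for FullMesh {
    fn record_ack(&mut self, node: NodeId, now_ms: u64) {
        self.last_ack_ms.insert(node, now_ms);
    }

    fn members(&self) -> Vec<NodeId> {
        self.last_ack_ms.keys().copied().collect()
    }

    fn liveness(&self, node: NodeId, now_ms: u64) -> Option<Liveness> {
        let last = *self.last_ack_ms.get(&node)?;
        let silent_for = now_ms.saturating_sub(last);
        Some(if silent_for < self.suspect_after_ms {
            Liveness::Alive
        } else if silent_for < self.dead_after_ms {
            Liveness::Suspect
        } else {
            Liveness::Dead
        })
    }
}
```

A SWIM backend would implement the same three required methods, so code written against `Membership` would not need to change when the backend does.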
+
+### Lifeguard, non-optional
+
+- Lifeguard local-health awareness is a non-optional extension, not a tuning
+  knob: a node that suspects it is itself slow (missed self-probes) dilates its own
+  timeouts so it does not falsely accuse peers, cutting failure-detector false
+  positives by ~50x under CPU/network stress in the memberlist deployment
+  [memberlist-lifeguard]. This is what keeps a GC pause from evicting a healthy
+  node.
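
The timeout dilation can be sketched as a saturating local-health score that multiplies the base timeout. This is an illustration of the idea only: `LocalHealth`, the events that move the score, and the cap are assumptions, and the block does not reproduce any particular library's API.

```rust
use std::time::Duration;

/// Saturating score of how unhealthy this node believes itself to be.
/// 0 means healthy; higher values stretch this node's own timeouts.
pub struct LocalHealth {
    score: u32,
    max_score: u32,
}

impl LocalHealth {
    pub fn new(max_score: u32) -> Self {
        LocalHealth { score: 0, max_score }
    }

    /// Evidence that this node may be slow (for example, a missed probe ack).
    pub fn on_probe_failed(&mut self) {
        self.score = (self.score + 1).min(self.max_score);
    }

    /// Evidence that this node is keeping up.
    pub fn on_probe_ok(&mut self) {
        self.score = self.score.saturating_sub(1);
    }

    /// Dilate a base timeout: healthy nodes use it as-is, slow nodes wait longer.
    pub fn scale(&self, base: Duration) -> Duration {
        base * (self.score + 1)
    }
}
```

A node coming out of a GC pause would see several failed probes in a row, raise its score, and so give its peers more time before suspecting them; the score decays again as probes succeed.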
+
+### LAN and WAN profiles
+
+- Two default profiles split the single dominant timeout by network class,
+  adapting the intent of Redis's one global `cluster-node-timeout` (15000 ms
+  default [redis-cluster-node-timeout-default]) rather than its single value.
+  LAN: probe ~200 ms, suspicion ~1 s, for fast failover within the ADR-0012
+  budget. Cloud/WAN: probe ~1 s, suspicion ~5 s, anchored under the 15 s ceiling
+  [redis-cluster-node-timeout-default] to tolerate jitter. Exact suspicion
+  multipliers and indirect-probe fan-out are open below.
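
As a config sketch, the two profiles could be plain constants with a sanity check against the 15 s ceiling. The `Profile` struct, its field names, and `is_valid` are hypothetical, and the values are the approximate figures quoted above rather than tuned defaults.

```rust
use std::time::Duration;

/// Redis's default `cluster-node-timeout`, used here only as an upper bound.
pub const NODE_TIMEOUT_CEILING: Duration = Duration::from_millis(15_000);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Profile {
    pub probe_interval: Duration,
    pub suspicion_timeout: Duration,
}

/// LAN: probe ~200 ms, suspicion ~1 s.
pub const LAN: Profile = Profile {
    probe_interval: Duration::from_millis(200),
    suspicion_timeout: Duration::from_secs(1),
};

/// Cloud/WAN: probe ~1 s, suspicion ~5 s.
pub const WAN: Profile = Profile {
    probe_interval: Duration::from_secs(1),
    suspicion_timeout: Duration::from_secs(5),
};

impl Profile {
    /// A profile is sane if it probes at least once per suspicion window
    /// and its suspicion timeout stays under the ceiling.
    pub fn is_valid(&self) -> bool {
        self.probe_interval < self.suspicion_timeout
            && self.suspicion_timeout < NODE_TIMEOUT_CEILING
    }
}
```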
+
+### Health joined with the Raft-committed map
+
+- The client-facing view is not SWIM-direct. CLUSTER SLOTS/SHARDS (#70) are
+  rendered from the Raft-committed slot->node map (#73) joined with live SWIM
+  health, so a polling client reads a stable, committed answer rather than raw
+  gossip. SWIM only annotates liveness over an ownership decided by Raft.
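
The join can be pictured as a left join from committed ranges onto SWIM health: ownership always comes from the committed side, and health is only an annotation. `CommittedRange`, `Health`, `ViewEntry`, and `join_view` are hypothetical names, and treating a node with no SWIM record as `Unknown` is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical node identifier.
pub type NodeId = u32;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Health {
    Alive,
    Suspect,
    Unknown,
}

/// A slot range and its owner as committed by Raft (#73).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CommittedRange {
    pub start: u16,
    pub end: u16,
    pub owner: NodeId,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ViewEntry {
    pub range: CommittedRange,
    pub health: Health,
}

/// Annotate committed ownership with live health. SWIM state can change the
/// annotation but can never add, remove, or reassign a range.
pub fn join_view(
    committed: &[CommittedRange],
    swim: &HashMap<NodeId, Health>,
) -> Vec<ViewEntry> {
    committed
        .iter()
        .map(|r| ViewEntry {
            range: *r,
            health: swim.get(&r.owner).copied().unwrap_or(Health::Unknown),
        })
        .collect()
}
```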
+
+### The SWIM-proposes / Raft-commits handshake (with #73)
+
+- SWIM is fast and unauthoritative; Raft is authoritative and slower. A SWIM
+  suspicion or confirmation is a *hint* delivered to the Raft leader as a proposal
+  input (#73); it never directly mutates the roster or a slot owner. The leader
+  commits the membership or promotion change, bumps the config epoch, and only
+  then does the change appear in the client projection. This is the same contract
+  CONTROL_PLANE.md states from the consensus side: SWIM proposes, Raft commits.
+- The joined view is therefore monotonic under suspicion: a node SWIM marks
+  suspect is demoted in the CLUSTER SLOTS/SHARDS reply only after Raft confirms,
+  so the polled answer is stable and never regresses on a transient flap. A
+  debounce window before a suspicion is surfaced is an open question below.
+
+## Open questions
+
+- Exact LAN/WAN suspicion multipliers and indirect-probe fan-out.
+- Whether `Membership` is a public trait in v1 or internal until a second backend
+  exists (Simple over Scalable until earned).
+- The debounce window before a SWIM suspicion is surfaced toward CLUSTER SLOTS.
+- Whether WAN mode is auto-detected or operator-set.
+
+## Acceptance and test hooks
+
+- Measured per-node message and CPU overhead stays constant as N grows
+  [swim-scalability]; a scaling run shows it flat past the ~1000-node ceiling
+  [redis-cluster-max-nodes-recommendation].
+- A GC-pause / process-pause injection does not flap a healthy node out of the
+  ring, validating the Lifeguard self-awareness path [memberlist-lifeguard].
+- The `Membership` trait isolates the protocol: a full-mesh backend compiles and
+  runs behind it unchanged.
+- LAN and WAN default profiles are documented and benchmarked against
+  pause-induced false positives.
+- CLUSTER SLOTS/SHARDS replies derive from Raft-committed state and never regress
+  under a transient SWIM suspicion (joint hook with #73).
+
+## References
+
+- ADR-0025, ADR-0012, ADR-0002, ADR-0011; issues #74, #73, #70, #68.
+- Claims: [swim-scalability], [memberlist-lifeguard],
+  [redis-cluster-node-timeout-default], [redis-cluster-max-nodes-recommendation].
@@ -92,3 +92,7 @@ Specs added as the M1 milestone progresses.
 - [CLUSTER_CONTRACT.md](CLUSTER_CONTRACT.md): the Redis-Cluster-compatible client
   wire contract (CRC16/16384 slots, hash tags, CROSSSLOT, MOVED/ASK, CLUSTER
   SLOTS/SHARDS, sharded Pub/Sub) (#70).
+- [CONTROL_PLANE.md](CONTROL_PLANE.md): the in-binary Raft control plane owning the
+  authoritative slot map, config epoch, membership, and replica promotion (#73).
+- [MEMBERSHIP.md](MEMBERSHIP.md): SWIM + non-optional Lifeguard data-plane
+  membership and failure detection, joined with the Raft-committed map (#74).