Merged
103 changes: 103 additions & 0 deletions docs/design/CONTROL_PLANE.md
@@ -0,0 +1,103 @@
# Design: Raft control plane for the authoritative slot map

Issue: #73. Decisions: ADR-0025 (cluster partition map and slot ownership),
ADR-0011 (single-node-first, slot-ready layout), ADR-0012 (scale-out targets),
ADR-0002 (shared-nothing; per-shard slots are the migration and execution unit).
Related: #74/MEMBERSHIP.md (the SWIM signal this commits over), #70 (client
cluster contract), #75 (slot migration), #68 (clustering umbrella).

## Goal and scope

One authoritative answer to "which node owns slot N right now, and at what config
epoch." The Redis-compatible baseline gossips a per-node served-slots bitmap on a
cluster bus at base_port+10000 [redis-cluster-bus-port], which has no linearizable
arbiter and whose bitmap size caps the design near 1000 masters
[redis-cluster-why-16384]. This design folds the slot map, config epoch, membership
roster, and replica promotion into a small in-binary Raft group; config changes
flow only through Raft. Scope is the control plane: the slot->node map, epoch,
membership commit, and promotion. Out of scope: the partition layout itself
(ADR-0025), the data-plane failure signal (#74), and per-slot data migration
mechanics (#75).

## Design

### Raft group and the replicated state machine

- A single Raft group of 3 to 5 voters owns the authoritative state; Raft matches
multi-Paxos in fault tolerance and decomposes consensus into leader election, log
replication, and safety [raft-overview]. The replicated state machine is exactly the config: the
slot->node map of ADR-0025, the monotonic config epoch, the membership roster,
and replica role assignments. No user data ever passes through the log, which
keeps the blast radius config-sized and the log small enough to snapshot cheaply.
- Data nodes are non-voting learners: they receive committed map deltas and apply
them but do not vote, so commit latency stays bounded by the 3-to-5 voters even
at the several-thousand-node target of ADR-0012. The voter set is itself a
committed entry, so growing or shrinking it is an ordinary config change.
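A minimal sketch of the config-only state machine described above, under stated assumptions: `NodeId`, `ConfigCommand`, `ConfigState`, and `ApplyError` are hypothetical names, the slot count is the 16384 of the client contract (#70), and a rejected entry is treated as a deterministic no-op that does not bump the epoch. Replica roles are left out for brevity.

```rust
use std::collections::BTreeSet;

/// Hypothetical node identifier; the design does not fix a concrete type.
type NodeId = u32;

/// Slot count of the Redis-Cluster-compatible layout (CRC16 mod 16384).
const SLOT_COUNT: usize = 16384;

/// Config-only commands carried in the Raft log. No user data appears here.
#[derive(Debug, Clone, PartialEq)]
enum ConfigCommand {
    AddNode { node: NodeId },
    RemoveNode { node: NodeId },
    AssignSlot { slot: u16, owner: NodeId },
    SetVoters { voters: BTreeSet<NodeId> },
}

#[derive(Debug, PartialEq)]
enum ApplyError {
    UnknownNode(NodeId),
    NodeOwnsSlots(NodeId),
    SlotOutOfRange(u16),
}

/// The replicated state machine: exactly the cluster config.
#[derive(Debug)]
struct ConfigState {
    epoch: u64,
    slot_owner: Vec<Option<NodeId>>,
    roster: BTreeSet<NodeId>,
    voters: BTreeSet<NodeId>,
}

impl ConfigState {
    fn new() -> Self {
        ConfigState {
            epoch: 0,
            slot_owner: vec![None; SLOT_COUNT],
            roster: BTreeSet::new(),
            voters: BTreeSet::new(),
        }
    }

    /// Apply one committed entry and return the new epoch.
    /// Assumption: a rejected entry changes nothing and leaves the epoch alone.
    fn apply(&mut self, cmd: ConfigCommand) -> Result<u64, ApplyError> {
        match cmd {
            ConfigCommand::AddNode { node } => {
                self.roster.insert(node);
            }
            ConfigCommand::RemoveNode { node } => {
                if !self.roster.contains(&node) {
                    return Err(ApplyError::UnknownNode(node));
                }
                if self.slot_owner.iter().any(|o| *o == Some(node)) {
                    return Err(ApplyError::NodeOwnsSlots(node));
                }
                self.roster.remove(&node);
                self.voters.remove(&node);
            }
            ConfigCommand::AssignSlot { slot, owner } => {
                if usize::from(slot) >= SLOT_COUNT {
                    return Err(ApplyError::SlotOutOfRange(slot));
                }
                if !self.roster.contains(&owner) {
                    return Err(ApplyError::UnknownNode(owner));
                }
                self.slot_owner[usize::from(slot)] = Some(owner);
            }
            ConfigCommand::SetVoters { voters } => {
                // The voter set is itself a committed entry.
                if let Some(missing) = voters.iter().find(|v| !self.roster.contains(*v)) {
                    return Err(ApplyError::UnknownNode(*missing));
                }
                self.voters = voters;
            }
        }
        // Every change that takes effect bumps the config epoch.
        self.epoch += 1;
        Ok(self.epoch)
    }
}
```

Because the log carries only these small commands, a snapshot is just the `ConfigState` value itself.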

### Config epoch and committed-map projection

- Every committed change bumps the config epoch monotonically. CLUSTER SLOTS and
CLUSTER SHARDS are pure projections of the committed map at the current epoch
(#70), never of in-flight log entries, so a client never observes a slot owner
that Raft has not committed. The epoch is the tie-breaker for stale clients: a
MOVED redirect names the owner committed at the current epoch, so a client holding
an older map converges on the committed owner.
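As an illustration of "pure projection", the sketch below collapses a committed slot->owner table into the contiguous ranges a CLUSTER SLOTS style reply would list. `SlotRange` and `project_ranges` are hypothetical names, and the input is assumed to be the committed table only, with no more than 16384 entries; in-flight log entries never reach this function.

```rust
/// Hypothetical node identifier; the design does not fix a concrete type.
type NodeId = u32;

/// One contiguous run of slots owned by a single node.
#[derive(Debug, Clone, PartialEq)]
struct SlotRange {
    start: u16,
    end: u16, // inclusive
    owner: NodeId,
}

/// Collapse a committed slot->owner table into contiguous ranges.
/// Unassigned slots (None) are skipped and break a run.
/// Assumption: the table has at most 16384 entries, so indices fit in u16.
fn project_ranges(committed: &[Option<NodeId>]) -> Vec<SlotRange> {
    let mut out: Vec<SlotRange> = Vec::new();
    for (slot, entry) in committed.iter().enumerate() {
        let owner = match entry {
            Some(o) => *o,
            None => continue,
        };
        // Extend the previous range when the owner matches and the slot is adjacent.
        if let Some(last) = out.last_mut() {
            if last.owner == owner && usize::from(last.end) + 1 == slot {
                last.end = slot as u16;
                continue;
            }
        }
        out.push(SlotRange { start: slot as u16, end: slot as u16, owner });
    }
    out
}
```

The function is deterministic in its input, so two nodes that have applied the same committed epoch render the same reply.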

### Membership and promotion go only through Raft

- Adding, removing, or promoting a node is a Raft proposal, not a gossip event.
This replaces the eventually-consistent, "last failover wins" posture of an
external Sentinel quorum [redis-sentinel-quorum-vs-majority]: one quorum, one
source of truth, and no second arbiter to disagree. A replica is eligible for
promotion only when the leader records its replication offset as within a
configured lag bound; promotion is a committed entry that flips the role and
bumps the epoch atomically.
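The lag gate and the atomic role flip can be sketched as follows. All names are hypothetical, offsets are assumed to be byte counts recorded by the leader, and two policies are assumptions the text does not fix: the most caught-up eligible replica is preferred, and the old primary is demoted to a replica rather than removed.

```rust
/// Hypothetical node identifier; the design does not fix a concrete type.
type NodeId = u32;

/// A replica is eligible only when its recorded offset is within `max_lag`
/// of the primary's offset.
fn promotion_eligible(primary_offset: u64, replica_offset: u64, max_lag: u64) -> bool {
    primary_offset.saturating_sub(replica_offset) <= max_lag
}

/// Pick the most caught-up eligible replica, if any.
/// Assumption: preferring the highest offset is illustrative policy only.
fn pick_candidate(
    primary_offset: u64,
    replicas: &[(NodeId, u64)],
    max_lag: u64,
) -> Option<NodeId> {
    replicas
        .iter()
        .filter(|r| promotion_eligible(primary_offset, r.1, max_lag))
        .max_by_key(|r| r.1)
        .map(|r| r.0)
}

/// Role assignment for one shard inside the replicated state machine.
#[derive(Debug, PartialEq)]
struct ShardRoles {
    epoch: u64,
    primary: NodeId,
    replicas: Vec<NodeId>,
}

impl ShardRoles {
    /// Applied only for a committed promotion entry: the role flip and the
    /// epoch bump happen in the same apply step. Returns false (and changes
    /// nothing) if the node is not a replica of this shard.
    fn apply_promotion(&mut self, new_primary: NodeId) -> bool {
        let idx = match self.replicas.iter().position(|r| *r == new_primary) {
            Some(i) => i,
            None => return false,
        };
        let old = std::mem::replace(&mut self.primary, new_primary);
        self.replicas[idx] = old; // assumption: old primary becomes a replica
        self.epoch += 1;
        true
    }
}
```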

### The SWIM-proposes / Raft-commits handshake (with #74)

- The data-plane membership layer (#74) is a fast but unauthoritative suspicion
source. A SWIM suspicion or confirmation is a *hint*: the Raft leader treats it
as a proposal input, never as a commit. Only a committed Raft entry changes the
authoritative roster or a slot owner. This is the contract MEMBERSHIP.md
specifies from the membership side; here it means the leader is the sole writer
and a healthy node cannot be evicted by gossip alone, only by a quorum commit.
- The committed map is therefore monotonic under transient suspicion: a node that
SWIM marks suspect is demoted in the client-facing projection only after the
leader commits the demotion, so CLUSTER SLOTS never regresses on a flap.
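A sketch of the handshake from the leader's side, with hypothetical names throughout (`SwimHint`, `Proposal`, `Leader`). Two policies here are assumptions, since the debounce behaviour is still open: only a SWIM confirmation queues a removal proposal, and a later alive report withdraws a proposal that has not yet been committed. What the sketch is meant to show is the invariant: `on_hint` never touches the roster or the epoch, only `on_commit` does.

```rust
use std::collections::BTreeSet;

/// Hypothetical node identifier; the design does not fix a concrete type.
type NodeId = u32;

/// Unauthoritative input from the data-plane membership layer.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SwimHint {
    Suspect(NodeId),
    Confirm(NodeId),
    Alive(NodeId),
}

/// What the leader may submit to the Raft log.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Proposal {
    RemoveNode(NodeId),
}

struct Leader {
    epoch: u64,
    roster: BTreeSet<NodeId>,
    pending: Vec<Proposal>,
}

impl Leader {
    fn new(nodes: &[NodeId]) -> Self {
        Leader {
            epoch: 0,
            roster: nodes.iter().copied().collect(),
            pending: Vec::new(),
        }
    }

    /// SWIM side: a hint is only a proposal input and never mutates the roster.
    fn on_hint(&mut self, hint: SwimHint) {
        match hint {
            // Assumption: suspicion alone proposes nothing.
            SwimHint::Suspect(_) => {}
            SwimHint::Confirm(n) => {
                let p = Proposal::RemoveNode(n);
                if self.roster.contains(&n) && !self.pending.contains(&p) {
                    self.pending.push(p);
                }
            }
            // Assumption: an alive report withdraws an uncommitted proposal.
            SwimHint::Alive(n) => self.pending.retain(|p| *p != Proposal::RemoveNode(n)),
        }
    }

    /// Raft side: called only after a quorum has committed the entry.
    fn on_commit(&mut self, p: Proposal) {
        match p {
            Proposal::RemoveNode(n) => {
                if self.roster.remove(&n) {
                    self.epoch += 1;
                }
            }
        }
    }
}
```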

### Correctness bar

- The acceptance target is the Jepsen/Elle suite covering all 21 Redis-Raft
failure classes (including split-brain, lost updates, stale/aborted reads, and total
data loss on failover or membership change) [jepsen-redis-raft-21-issues]. Keeping
user data out of the log shrinks the surface, but the config plane must still
clear every class under partitions, pauses, clock skew, and membership churn.

## Open questions

- Fixed 3 voters or 5: higher fault tolerance versus commit latency under the
ADR-0012 failover budget.
- Learner delivery: Raft learners on the same transport, or an out-of-band poll
of committed map deltas to keep voter fan-out small.
- The node-count or commit-latency threshold that forces sharded/hierarchical
multi-Raft instead of one group.
- Whether the cluster-bus port stays at base+10000 for compatibility once the
gossip slot bitmap is gone [redis-cluster-bus-port].

## Acceptance and test hooks

- Slot ownership is linearizable: a committed map at epoch E is never contradicted
by any node; a DST/Jepsen history shows no two nodes serving the same slot as
owner at the same epoch.
- No gossip-propagated slot bitmap exists; internal node count is not bitmap-capped.
- Failover and promotion run in-binary with no Sentinel process
[redis-sentinel-quorum-vs-majority]; promotion respects the lag gate.
- CLUSTER SLOTS/SHARDS render from the committed map and never regress under a
transient SWIM suspicion (joint hook with #74).
- The Jepsen/Elle suite passes all 21 Redis-Raft failure classes
[jepsen-redis-raft-21-issues] (#99).
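The first hook above (no two nodes serving the same slot as owner at the same epoch) reduces to a simple check over a recorded history. The sketch below is one way such a checker could look; `Observation`, `Conflict`, and `find_conflict` are hypothetical names and do not describe the Jepsen or Elle APIs.

```rust
use std::collections::HashMap;

/// Hypothetical node identifier; the design does not fix a concrete type.
type NodeId = u32;

/// One recorded fact from a test history: at `epoch`, `owner` served `slot`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Observation {
    epoch: u64,
    slot: u16,
    owner: NodeId,
}

/// Two different owners observed for the same (epoch, slot).
#[derive(Debug, PartialEq)]
struct Conflict {
    epoch: u64,
    slot: u16,
    first: NodeId,
    second: NodeId,
}

/// Return the first contradiction in the history, or None if every
/// (epoch, slot) pair has exactly one observed owner.
fn find_conflict(history: &[Observation]) -> Option<Conflict> {
    let mut seen: HashMap<(u64, u16), NodeId> = HashMap::new();
    for obs in history {
        // Remember the first owner seen for this (epoch, slot).
        let first = *seen.entry((obs.epoch, obs.slot)).or_insert(obs.owner);
        if first != obs.owner {
            return Some(Conflict {
                epoch: obs.epoch,
                slot: obs.slot,
                first,
                second: obs.owner,
            });
        }
    }
    None
}
```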

## References

- ADR-0025, ADR-0011, ADR-0012, ADR-0002; issues #73, #74, #70, #75, #68, #99.
- Claims: [raft-overview], [redis-cluster-bus-port], [redis-cluster-why-16384],
[redis-sentinel-quorum-vs-majority], [jepsen-redis-raft-21-issues].
100 changes: 100 additions & 0 deletions docs/design/MEMBERSHIP.md
@@ -0,0 +1,100 @@
# Design: SWIM + Lifeguard data-plane membership

Issue: #74. Decisions: ADR-0025 (cluster partition map this health joins against),
ADR-0012 (scale-out targets the protocol must hold flat past), ADR-0002
(shared-nothing; membership state is per-node, off the data hot path), ADR-0011
(slot-ready layout the live view is rendered over). Related: #73/CONTROL_PLANE.md
(the Raft authority SWIM proposes to), #70 (client cluster view), #68 (umbrella).

## Goal and scope

A data-plane membership and failure-detection layer whose per-node cost is flat
as the cluster grows and that does not flap a healthy node out of the ring during
a GC pause. It answers, for every other subsystem, which nodes are alive right
now so replication routes around the dead and the client view never points at a
corpse. Scope: the failure-detection protocol, its LAN/WAN defaults, the
`Membership` trait, and how SWIM health joins the Raft-committed map. Out of
scope: slot rebalancing policy and the authoritative epoch (#73), and the
partition layout (ADR-0025).

## Design

### SWIM, behind a Membership trait

- Adopt SWIM: randomized round-robin direct ping plus k indirect pings, with
infection-style dissemination, so expected detection time, false-positive rate,
and per-member message load are independent of group size [swim-scalability].
This is the constant-cost transport that lets ADR-0012's several-thousand-node
target hold flat past the ~1000-node Redis full-mesh ceiling
[redis-cluster-max-nodes-recommendation].
- The protocol sits behind a `Membership` trait so it is swappable (a full-mesh
backend stays implementable behind the same interface). Whether the trait is
public in v1 or internal until a second backend exists is an open question
below (Simple over Scalable until earned).
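One possible shape for the trait, as a sketch only: the method set, the `Health` enum, and the `FixedView` backend are all hypothetical, and the real interface is still subject to the open question below. The point is that callers depend on `&dyn Membership`, so a SWIM backend and a full-mesh backend are interchangeable.

```rust
use std::collections::BTreeMap;

/// Hypothetical node identifier; the design does not fix a concrete type.
type NodeId = u32;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Alive,
    Suspect,
    Dead,
}

/// Hypothetical trait surface; the actual methods are not specified here.
trait Membership {
    /// Current local opinion about one node, or None if it is unknown.
    fn health(&self, node: NodeId) -> Option<Health>;
    /// Every node this backend currently tracks.
    fn members(&self) -> Vec<NodeId>;
}

/// Stand-in backend with a fixed view, useful for tests. A SWIM or
/// full-mesh backend would implement the same trait.
struct FixedView {
    view: BTreeMap<NodeId, Health>,
}

impl FixedView {
    fn from_pairs(pairs: &[(NodeId, Health)]) -> Self {
        FixedView { view: pairs.iter().copied().collect() }
    }
}

impl Membership for FixedView {
    fn health(&self, node: NodeId) -> Option<Health> {
        self.view.get(&node).copied()
    }

    fn members(&self) -> Vec<NodeId> {
        self.view.keys().copied().collect()
    }
}

/// Caller code sees only the trait: keep known nodes that are not dead,
/// so replication routes around the dead.
fn live_targets(m: &dyn Membership, replicas: &[NodeId]) -> Vec<NodeId> {
    replicas
        .iter()
        .copied()
        .filter(|n| matches!(m.health(*n), Some(Health::Alive) | Some(Health::Suspect)))
        .collect()
}
```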

### Lifeguard, non-optional

- Lifeguard local-health awareness is a non-optional extension, not a tuning
knob: a node that suspects it is itself slow (missed self-probes) dilates its own
timeouts so it does not falsely accuse peers, cutting failure-detector false
positives by ~50x under CPU/network stress in the memberlist deployment
[memberlist-lifeguard]. This is what keeps a GC pause from evicting a healthy
node.
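The timeout dilation can be sketched as a saturating local-health counter that scales the base timeout, in the spirit of Lifeguard's local health multiplier. `LocalHealth`, its method names, and the cap are assumptions for illustration, and the signals that move the counter are simplified to a probe succeeding or failing.

```rust
use std::time::Duration;

/// Saturating local-health score: 0 means healthy, higher means this node
/// suspects that it is itself slow.
struct LocalHealth {
    score: u32,
    max_score: u32, // assumption: the cap is a configuration choice
}

impl LocalHealth {
    fn new(max_score: u32) -> Self {
        LocalHealth { score: 0, max_score }
    }

    /// A probe this node sent got no timely answer: evidence that the
    /// local node (not only the peer) may be slow.
    fn probe_failed(&mut self) {
        self.score = (self.score + 1).min(self.max_score);
    }

    /// A probe completed in time: recover one step toward healthy.
    fn probe_ok(&mut self) {
        self.score = self.score.saturating_sub(1);
    }

    /// Dilate a base timeout by (score + 1), so a struggling node waits
    /// longer before accusing a peer.
    fn scale(&self, base: Duration) -> Duration {
        base * (self.score + 1)
    }
}
```

With a 200 ms base and two recent failures, the node waits 600 ms before treating a probe as missed, which is the slack that absorbs a short pause.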

### LAN and WAN profiles

- Two default profiles split the single dominant timeout by network class,
adapting the intent of Redis's one global `cluster-node-timeout` (15000 ms
default [redis-cluster-node-timeout-default]) rather than its single value.
LAN: probe ~200 ms, suspicion ~1 s, for fast failover within the ADR-0012
budget. Cloud/WAN: probe ~1 s, suspicion ~5 s, anchored under the 15 s ceiling
[redis-cluster-node-timeout-default] to tolerate jitter. Exact suspicion
multipliers and indirect-probe fan-out are open below.
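The two profiles could be expressed as plain configuration values. This is a sketch: `NetworkClass`, `Profile`, and `validate` are hypothetical names, and the durations are the approximate figures quoted above rather than settled defaults.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
enum NetworkClass {
    Lan,
    Wan,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Profile {
    probe_interval: Duration,
    suspicion_timeout: Duration,
}

/// Redis's single global `cluster-node-timeout` default, used as a ceiling.
const REDIS_NODE_TIMEOUT: Duration = Duration::from_millis(15_000);

/// Approximate defaults from the text; exact multipliers remain open.
fn default_profile(class: NetworkClass) -> Profile {
    match class {
        NetworkClass::Lan => Profile {
            probe_interval: Duration::from_millis(200),
            suspicion_timeout: Duration::from_secs(1),
        },
        NetworkClass::Wan => Profile {
            probe_interval: Duration::from_secs(1),
            suspicion_timeout: Duration::from_secs(5),
        },
    }
}

/// Sanity checks an operator override would also have to pass.
fn validate(p: &Profile) -> Result<(), String> {
    if p.probe_interval >= p.suspicion_timeout {
        return Err("probe interval must be shorter than the suspicion timeout".to_string());
    }
    if p.suspicion_timeout >= REDIS_NODE_TIMEOUT {
        return Err("suspicion timeout must stay under the 15 s ceiling".to_string());
    }
    Ok(())
}
```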

### Health joined with the Raft-committed map

- The client-facing view is not SWIM-direct. CLUSTER SLOTS/SHARDS (#70) are
rendered from the Raft-committed slot->node map (#73) joined with live SWIM
health, so a polling client reads a stable, committed answer rather than raw
gossip. SWIM only annotates liveness over an ownership decided by Raft.
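A sketch of the join, assuming hypothetical types (`CommittedRange`, `ViewEntry`, `Health`) and an `Unknown` state for nodes SWIM has no opinion on yet. Ownership is copied verbatim from the committed ranges; SWIM health is attached as an annotation and cannot change who owns a range.

```rust
use std::collections::HashMap;

/// Hypothetical node identifier; the design does not fix a concrete type.
type NodeId = u32;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Health {
    Alive,
    Suspect,
    Dead,
    Unknown, // assumption: no SWIM state recorded for this node yet
}

/// A slot range as committed by Raft.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CommittedRange {
    start: u16,
    end: u16,
    owner: NodeId,
}

/// One row of the client-facing view: committed ownership plus liveness.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ViewEntry {
    range: CommittedRange,
    health: Health,
}

/// Join committed ownership with live SWIM health. The output has exactly
/// one entry per committed range, in the same order, with the same owner.
fn join_view(committed: &[CommittedRange], swim: &HashMap<NodeId, Health>) -> Vec<ViewEntry> {
    committed
        .iter()
        .map(|r| ViewEntry {
            range: *r,
            health: swim.get(&r.owner).copied().unwrap_or(Health::Unknown),
        })
        .collect()
}
```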

### The SWIM-proposes / Raft-commits handshake (with #73)

- SWIM is fast and unauthoritative; Raft is authoritative and slower. A SWIM
suspicion or confirmation is a *hint* delivered to the Raft leader as a proposal
input (#73); it never directly mutates the roster or a slot owner. The leader
commits the membership or promotion change, bumps the config epoch, and only
then does the change appear in the client projection. This is the same contract
CONTROL_PLANE.md states from the consensus side: SWIM proposes, Raft commits.
- The joined view is therefore monotonic under suspicion: a node SWIM marks
suspect is demoted in the CLUSTER SLOTS/SHARDS reply only after Raft confirms,
so the polled answer is stable and never regresses on a transient flap. A
debounce window before a suspicion is surfaced is an open question below.

## Open questions

- Exact LAN/WAN suspicion multipliers and indirect-probe fan-out.
- Whether `Membership` is a public trait in v1 or internal until a second backend
exists (Simple over Scalable until earned).
- The debounce window before a SWIM suspicion is surfaced toward CLUSTER SLOTS.
- Whether WAN mode is auto-detected or operator-set.

## Acceptance and test hooks

- Measured per-node message and CPU overhead stays constant as N grows
[swim-scalability]; a scaling run shows it flat past the ~1000-node ceiling
[redis-cluster-max-nodes-recommendation].
- A GC-pause / process-pause injection does not flap a healthy node out of the
ring, validating the Lifeguard self-awareness path [memberlist-lifeguard].
- The `Membership` trait isolates the protocol: a full-mesh backend compiles and
runs behind it unchanged.
- LAN and WAN default profiles are documented and benchmarked against
pause-induced false positives.
- CLUSTER SLOTS/SHARDS replies derive from Raft-committed state and never regress
under a transient SWIM suspicion (joint hook with #73).

## References

- ADR-0025, ADR-0012, ADR-0002, ADR-0011; issues #74, #73, #70, #68.
- Claims: [swim-scalability], [memberlist-lifeguard],
[redis-cluster-node-timeout-default], [redis-cluster-max-nodes-recommendation].
4 changes: 4 additions & 0 deletions docs/design/README.md
@@ -92,3 +92,7 @@ Specs added as the M1 milestone progresses.
- [CLUSTER_CONTRACT.md](CLUSTER_CONTRACT.md): the Redis-Cluster-compatible client
wire contract (CRC16/16384 slots, hash tags, CROSSSLOT, MOVED/ASK, CLUSTER
SLOTS/SHARDS, sharded Pub/Sub) (#70).
- [CONTROL_PLANE.md](CONTROL_PLANE.md): the in-binary Raft control plane owning the
authoritative slot map, config epoch, membership, and replica promotion (#73).
- [MEMBERSHIP.md](MEMBERSHIP.md): SWIM + non-optional Lifeguard data-plane
membership and failure detection, joined with the Raft-committed map (#74).