Merged
92 changes: 92 additions & 0 deletions docs/adr/0026-replication-consistency-model.md
@@ -0,0 +1,92 @@
# ADR-0026: Default replication and consistency model (async primary/replica plus WAIT)

Status: Accepted
Issue: #76

## Context

IronCache must ship a single default replication and consistency model before
the opt-in tiers (#77, #78, #12) can be specified against a fixed baseline. The
default has to be correct under the failure modes operators actually hit and
stay drop-in compatible with how Redis clients already reason about durability.
The top two tenets, Compatible then Efficient, both bear on the choice: clients
should port over unchanged, and the hot write path should pay no quorum tax.
Scope here is the default model only; this ADR does not specify the streaming
protocol, replica handoff mechanics, or the read contract (#147 owns the
replica-read contract, #149 owns node lifecycle).

Redis Cluster replicates asynchronously between a primary and its replicas by
default, and exposes WAIT for callers that want bounded synchronous
acknowledgement [redis-cluster-async-replication]. WAIT confirms in-memory
replica receipt, not disk persistence, and the Redis docs are explicit that it
does not make the store strongly consistent: a write synchronously replicated
to several replicas can still be lost [redis-wait-not-cp]. The Jepsen analysis
reaches the same conclusion from the failure side, that default async
replication can drop acknowledged writes on failover [redis-wait-not-strongly-consistent].
The two alternatives are per-shard Raft or quorum writes, which remove
single-node-failover write loss at the cost of write latency and operational
weight on every write, and Dynamo-style leaderless quorums, which stay writable
through a sloppy quorum over the first N healthy nodes plus hinted handoff and
app-level conflict resolution [dynamo-quorum-sloppy-hinted].
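
The loss window described above can be made concrete with a toy model. This is a minimal sketch, not Redis or IronCache code: `Primary`, `Replica`, `replicate_to`, and `acked_replicas` are hypothetical names, and `acked_replicas` only imitates what a WAIT-style count reports (in-memory receipt by replicas).

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica: holds whatever the primary has shipped to it so far."""
    acked_offset: int = 0
    log: list = field(default_factory=list)


@dataclass
class Primary:
    """Toy primary: acknowledges a write before any replica has received it."""
    offset: int = 0
    log: list = field(default_factory=list)
    replicas: list = field(default_factory=list)

    def write(self, value):
        # The client is acknowledged at this point; replication happens later.
        self.log.append(value)
        self.offset += 1
        return self.offset

    def replicate_to(self, replica):
        # In-memory receipt only; nothing here touches disk.
        replica.log = list(self.log)
        replica.acked_offset = self.offset

    def acked_replicas(self):
        # WAIT-style count: replicas that have received the current offset.
        return sum(1 for r in self.replicas if r.acked_offset >= self.offset)


r1, r2 = Replica(), Replica()
primary = Primary(replicas=[r1, r2])

primary.write("a")
primary.replicate_to(r1)
primary.replicate_to(r2)

primary.write("b")          # acknowledged to the client
primary.replicate_to(r1)    # only r1 has received "b" so far
acked = primary.acked_replicas()

# The primary fails and r2, which never received "b", is promoted.
promoted = r2
lost = [v for v in primary.log if v not in promoted.log]
```

In this run one replica has acknowledged the write, yet promoting the other replica still drops it, which is the sense in which a WAIT-style acknowledgement narrows the window without closing it.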

## Decision

- **Default to asynchronous primary/replica replication, with WAIT exposed for
bounded synchronous acks.** This is the Compatible and Efficient choice:
clients that already speak Redis replication semantics and tooling port over
unchanged, and the steady-state write path pays no quorum round-trip
[redis-cluster-async-replication]. `WAIT numreplicas timeout` is offered as a
per-command durability floor, not a consistency mode.
- **Document the default as best-effort, not CP, and name the loss window.**
There is a write-loss window: a write acknowledged to the client but not yet
replicated can be lost on primary failover or on the minority side of a
partition. WAIT bounds this window but does not eliminate it, because it
confirms in-memory receipt only and is not strong consistency
[redis-wait-not-cp] [redis-wait-not-strongly-consistent]. This honesty is a
shipped requirement, surfaced to clients per #147.
- **Ship three guardrail defaults.** `replica-read-only` is on, so replicas
reject writes and cannot silently diverge [redis-replica-read-only-default].
`min-replicas-to-write` is wired so a primary can stop accepting writes when
too few replicas are in sync [redis-min-replicas-to-write-default].
`min-replicas-max-lag` bounds how stale an in-sync replica may be before it
stops counting toward that floor [redis-min-replicas-max-lag-default]. The
shipped numeric defaults track the pinned upstream values in claims.yaml.
- **Strong consistency is opt-in, never a tax on every write.** No-acknowledged-
write-loss on single-node failover is real value, but it is delivered through
an opt-in quorum/Raft tier (#78, #12), layered on this async baseline, not by
changing the default. Whether that becomes a headline differentiator is
deferred to those issues; this ADR commits only to the async default.
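
The write floor formed by `min-replicas-to-write` and `min-replicas-max-lag` can be sketched as a single predicate. This is an illustrative sketch under assumptions: the function name and parameters are hypothetical, the numbers in the usage lines are arbitrary rather than the shipped defaults, and the rule that a zero value disables the check mirrors upstream Redis behaviour rather than anything this ADR specifies.

```python
def primary_accepts_writes(replica_lags_s, min_replicas_to_write, min_replicas_max_lag_s):
    """Return True if the primary should keep accepting writes.

    replica_lags_s: seconds since each attached replica was last heard from.
    A value of 0 for either setting disables the guardrail (assumption that
    mirrors upstream Redis).
    """
    if min_replicas_to_write == 0 or min_replicas_max_lag_s == 0:
        return True
    # Only replicas within the lag bound count toward the floor.
    in_sync = sum(1 for lag in replica_lags_s if lag <= min_replicas_max_lag_s)
    return in_sync >= min_replicas_to_write


# Arbitrary illustration values: floor of 2 replicas, lag bound of 10 seconds.
healthy = primary_accepts_writes([1, 2, 30], 2, 10)
degraded = primary_accepts_writes([1, 30, 40], 2, 10)
disabled = primary_accepts_writes([], 0, 10)
```

With two replicas inside the bound the primary keeps accepting writes; with only one it refuses them, which is the fail-toward-refusing-writes posture the guardrails are meant to give.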

## Rejected Alternatives

- **Per-shard Raft or quorum writes by default.** Removes acknowledged-write
loss on single-node failover and gives a clean CP story. Rejected as the
default: it adds write latency and operational weight to every write and
diverges from Redis defaults, breaking Compatible, which ranks above the
consistency gain. It survives as the opt-in tier in #78 and #12, layered on
this baseline rather than replacing it.
- **Dynamo-style sloppy quorum with hinted handoff.** Stays writable during
partitions via a sloppy quorum over the first N healthy nodes, hinted handoff,
and vector-clock conflict resolution [dynamo-quorum-sloppy-hinted]. Rejected:
its conflict, read-repair, and merge model is foreign to the Redis data model,
surprising for compatibility-focused users, and complex, so it violates
Compatible and Simple for a marginal availability gain. This is the rejection
this ADR exists to freeze so it is not relitigated.

## Consequences

- Unmodified Redis clients and replication tooling work against IronCache with
no protocol change, and the steady-state write path carries no quorum tax,
satisfying Compatible then Efficient [redis-cluster-async-replication].
- The default is explicitly best-effort, not CP. The acknowledged-but-
unreplicated write-loss window on failover and partition is documented and
surfaced to clients (#147), and WAIT is positioned as a durability floor that
bounds but does not close it [redis-wait-not-cp] [redis-wait-not-strongly-consistent].
- Three guardrails ship by design: read-only replicas, a minimum in-sync-
replicas write floor, and a max-replica-lag bound, so a misconfigured or
lagging fleet fails toward refusing writes rather than silently diverging
[redis-replica-read-only-default] [redis-min-replicas-to-write-default]
[redis-min-replicas-max-lag-default].
- The opt-in strong-consistency tier (#78, #12) is unblocked to build on a
fixed async baseline, and the replica-read contract (#147) and node lifecycle
(#149) are specified against this decision rather than against an open one.
1 change: 1 addition & 0 deletions docs/adr/INDEX.md
@@ -31,6 +31,7 @@ in [OPEN.md](OPEN.md); research questions in [QUESTIONS.md](QUESTIONS.md).
| [0023](0023-cold-tier-engine.md) | Cold-tier engine (reject RocksDB/LSM, adopt hybrid log) | Accepted | #65 |
| [0024](0024-geo-command-scope.md) | Geo command family scope (non-goal for v1) | Accepted | #133 |
| [0025](0025-cluster-partition-count.md) | Cluster keyspace partition count (16384 dual-purpose unit) | Accepted | #72 |
| [0026](0026-replication-consistency-model.md) | Default replication and consistency model (async primary/replica plus WAIT) | Accepted | #76 |

As `[DECISION]` issues close, each adds its row here and its `NNNN-*.md` record.
The numbering is monotonic and never reused, even after supersession.
93 changes: 93 additions & 0 deletions docs/design/NODE_LIFECYCLE.md
@@ -0,0 +1,93 @@
# Design: Cluster bootstrap and node lifecycle (seed/MEET join, learner-to-voter-to-slot-owner)

Issue: #149. Decisions: ADR-0026 (async primary/replica default, replica
guardrails). Related: #73 (CONTROL_PLANE: Raft slot map and config epoch), #74
(MEMBERSHIP: SWIM plus Lifeguard), #69 (single-node-first staged path), #75
(migration), #1 (vision).

## Goal and scope

The existing specs own steady-state membership (#74 SWIM), the authoritative map
(#73 Raft), and migration (#75), but nothing owns how a node enters or leaves
the cluster. This spec owns the lifecycle: cold-start seed discovery and a CLUSTER
MEET-equivalent handshake, the staged promotion of a joining node from
SWIM-discovered to Raft learner to voter to slot owner, the operator/CLI
add-node and remove-node surface, and the single-node-to-first-replica
bootstrap that #69's staged path assumes. Scope is the transitions between
states; the SWIM signal itself (#74), the Raft commit semantics (#73), and the
slot-migration mechanics (#75) are owned elsewhere and only invoked here.

## Design

### Seed and MEET join

- A new node boots with a seed list (operator-supplied or CLI add-node). It
contacts a seed, which performs a MEET-equivalent handshake: the seed
introduces the joiner into the SWIM membership view (#74) so the rest of the
ring learns of it through normal gossip, with no full-mesh fan-out.
- SWIM membership is a hint, not authority: a node SWIM has discovered is not
yet part of the cluster's committed state. Per #74's contract, SWIM proposes
and Raft commits, so MEET only makes a node a candidate for promotion.

### Learner to voter to slot-owner promotion

- A SWIM-discovered node is first admitted to the Raft control plane (#73) as a
non-voting learner: it receives committed slot-map deltas and config-epoch
updates but does not vote, so it cannot affect commit latency or quorum while
it catches up [raft-overview].
- A learner is promoted to voter only by an explicit committed control-plane
decision (#73), keeping the voter set small (the #73 3-to-5 voter group) and
the slot map linearizable. Voter promotion is a control-plane role change, not
a data assignment.
- Becoming a slot owner is the last and separate step: the control plane assigns
slots and, for a replica being promoted toward ownership, applies a
replication-lag gate before the node is eligible, since replication is async
(ADR-0026). Replica handoff reuses PSYNC2-style secondary-replid resync so
promotion does not force a full resync [redis-psync2-secondary-replid]. Slot
movement itself runs through migration (#75) under MOVED/ASK.
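
The staged promotion above can be read as a small state machine in which no stage is skipped. The sketch below is illustrative only: `Stage`, `promote`, and the lag parameters are hypothetical names, the lag threshold is an open question in this spec, and requiring a committed decision for every step (including learner admission) is a simplifying assumption.

```python
from enum import IntEnum


class Stage(IntEnum):
    SWIM_HINT = 0    # discovered via gossip, not yet committed state
    LEARNER = 1      # receives slot-map deltas, does not vote
    VOTER = 2        # control-plane role change, no data assignment
    SLOT_OWNER = 3   # slots assigned by the control plane


def promote(stage, *, committed, lag_s=None, max_lag_s=None):
    """Advance exactly one stage, or raise if the gate for that step is unmet."""
    if stage is Stage.SLOT_OWNER:
        raise ValueError("already a slot owner")
    if not committed:
        raise PermissionError("promotion needs a committed control-plane decision")
    nxt = Stage(stage + 1)
    if nxt is Stage.SLOT_OWNER:
        # Replication is async, so ownership is gated on replication lag.
        if lag_s is None or max_lag_s is None or lag_s > max_lag_s:
            raise RuntimeError("replication-lag gate not satisfied")
    return nxt


stage = Stage.SWIM_HINT
stage = promote(stage, committed=True)   # -> LEARNER
stage = promote(stage, committed=True)   # -> VOTER

try:
    promote(stage, committed=True, lag_s=5.0, max_lag_s=2.0)
    gate_blocked = False
except RuntimeError:
    gate_blocked = True

owner = promote(stage, committed=True, lag_s=1.0, max_lag_s=2.0)
```

Because `promote` only ever moves to the next stage, a caller cannot jump from a SWIM hint straight to slot ownership, which matches the "never skipped" acceptance hook.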

### Add/remove-node operator surface

- add-node (CLI/operator) supplies a seed and triggers the MEET handshake, then
the staged learner-to-voter-to-owner path above; the operator observes each
stage via CLUSTER SHARDS health/role (#74) rather than poking internal state.
- remove-node is the reverse and drains first: slots are migrated off (#75),
the node is demoted from voter to learner to leave the quorum cleanly, then
removed from the committed membership (#73) and finally from the SWIM view
(#74). A node is never removed from the map while it still owns a slot.
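
The remove-node ordering can be written down as a plan builder that always drains before it demotes or removes. This is a hypothetical sketch: `remove_node_steps` and the step labels are invented for illustration and do not name real IronCache operations.

```python
def remove_node_steps(owned_slots, role):
    """Return the ordered steps for removing a node, draining slots first."""
    steps = []
    if owned_slots:
        # A node is never removed from the map while it still owns a slot.
        steps.append(("migrate_slots", sorted(owned_slots)))
    if role == "voter":
        # Step down to learner so the node leaves the quorum cleanly.
        steps.append(("demote_to_learner", None))
    steps.append(("remove_from_committed_membership", None))
    steps.append(("remove_from_swim_view", None))
    return steps


voter_plan = remove_node_steps({5, 1}, "voter")
learner_plan = remove_node_steps(set(), "learner")
```

A voter that owns slots gets all four steps in order, while an idle learner only needs the two membership removals.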

### Single-node to first-replica bootstrap

- A single node boots as a degenerate one-voter control plane owning all 16384
slots, consistent with #69's single-node-first staged layout. The first
replica joins by the same seed/MEET path, enters as a learner, and attaches as
an async replica of the primary under ADR-0026 (replica-read-only on,
min-replicas guardrails inactive at one replica). This is the transition #69
assumes but does not specify: the inter-stage step from standalone to a
primary-with-replica pair.

## Open questions

- The replication-lag threshold that gates a replica from learner to
slot-owner-eligible (ties to ADR-0026 min-replicas-max-lag and #73's
promotion-policy open decision).
- Whether learner admission to Raft is automatic on SWIM discovery or requires
an explicit operator add-node (#73 lists data-nodes-as-learners as open).
- Seed-list bootstrapping when all seeds are down, and how MEET interacts with a
partitioned SWIM view.

## Acceptance and test hooks

- A node added by seed/MEET appears as a SWIM hint, then a Raft learner, then a
voter, then a slot owner, with each stage visible in CLUSTER SHARDS and never
skipped.
- remove-node drains all slots (#75) and demotes through learner before the
node leaves committed membership; no slot is orphaned.
- A standalone node accepts a first replica via MEET and reaches a
primary-with-async-replica pair per #69 and ADR-0026.

## References

- ADR-0026; issues #149, #73, #74, #69, #75, #1; specs CONTROL_PLANE (#73),
MEMBERSHIP (#74).
- Claims: [raft-overview], [redis-psync2-secondary-replid].
4 changes: 4 additions & 0 deletions docs/design/README.md
@@ -96,3 +96,7 @@ Specs added as the M1 milestone progresses.
authoritative slot map, config epoch, membership, and replica promotion (#73).
- [MEMBERSHIP.md](MEMBERSHIP.md): SWIM + non-optional Lifeguard data-plane
membership and failure detection, joined with the Raft-committed map (#74).
- [REPLICA_READ.md](REPLICA_READ.md): the replica-read contract (READONLY/
READWRITE, replica routing, bounded staleness surfaced to clients) (#147).
- [NODE_LIFECYCLE.md](NODE_LIFECYCLE.md): cluster bootstrap and node lifecycle
(seed/MEET join, learner to voter to slot-owner promotion, add/remove-node) (#149).
86 changes: 86 additions & 0 deletions docs/design/REPLICA_READ.md
@@ -0,0 +1,86 @@
# Design: Replica-read contract (READONLY/READWRITE, replica routing, bounded staleness)

Issue: #147. Decisions: ADR-0026 (async primary/replica default, best-effort not
CP, replica-read-only on). Related: #70 (CLUSTER_CONTRACT: 16384 slots, CRC16,
MOVED/ASK), #76 (replication default), #1 (vision).

## Goal and scope

Redis Cluster clients scale reads by sending READONLY on a connection and then
routing reads to replicas; this is part of the wire contract IronCache promises
to keep. ADR-0026 fixes replica-read-only on, so replicas reject writes, but it
does not decide whether clients may READ from replicas, how the
READONLY/READWRITE connection-state pair behaves, how a replica answers for the
slots it serves, or how async-replication staleness is bounded and surfaced.
This spec owns that command-pair and consistency contract. Scope: the
per-connection READONLY/READWRITE state machine, replica read routing under the
#70 slot view, and the bounded-staleness signal surfaced to clients. Out of
scope: the slot map authority (#73), migration redirection mechanics (#70), and
the write path (#76).

## Design

### READONLY/READWRITE connection state

- A connection carries one bit: read-write (default) or read-only. READONLY
sets the bit; READWRITE clears it. The bit is per-connection, not global, and
is unaffected by the node role. This mirrors the Redis Cluster READONLY/
READWRITE pair that lets a replica serve reads for slots it replicates
[redis-cluster-readonly-replica].
- On a replica, a read for an owned-or-replicated slot succeeds only when the
read-only bit is set; otherwise the replica returns MOVED to the primary, so
a default (read-write) connection keeps the strong-read behavior unmodified
clients expect [redis-cluster-readonly-replica]. Writes on a replica are
always rejected per ADR-0026's replica-read-only posture, independent of the
bit.
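
The per-connection bit and the replica's decision can be sketched in a few lines. This is an illustrative model, not the wire implementation: `Connection`, `replica_reply`, and the reply labels (`SERVE_LOCAL`, `MOVED`, `REJECT_WRITE`) are hypothetical names, and the exact error a replica returns for a write is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    read_only: bool = False  # connections start in read-write mode

    def handle(self, command):
        """Apply READONLY or READWRITE to this connection only."""
        name = command.upper()
        if name == "READONLY":
            self.read_only = True
        elif name == "READWRITE":
            self.read_only = False
        else:
            raise ValueError(f"not a connection-state command: {command}")
        return "OK"


def replica_reply(conn, is_write, slot_replicated):
    """Decide how a replica answers a command on this connection."""
    if is_write:
        # Writes are rejected on a replica regardless of the bit.
        return "REJECT_WRITE"
    if conn.read_only and slot_replicated:
        return "SERVE_LOCAL"
    # Default connections, and slots this replica does not serve, redirect.
    return "MOVED"


conn = Connection()
before = replica_reply(conn, is_write=False, slot_replicated=True)
conn.handle("READONLY")
during = replica_reply(conn, is_write=False, slot_replicated=True)
write = replica_reply(conn, is_write=True, slot_replicated=True)
conn.handle("READWRITE")
after = replica_reply(conn, is_write=False, slot_replicated=True)
```

The same read goes from redirected, to served locally, back to redirected as the bit is set and cleared, while the write is rejected throughout.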

### Replica read routing

- Slot ownership and the CLUSTER SLOTS/SHARDS projection come from
CLUSTER_CONTRACT (#70); this spec only adds the replica leg. A read-only
connection whose key hashes (CRC16 mod 16384, #70) to a slot this replica
replicates is answered locally; a key for a slot this node neither owns nor
replicates returns MOVED, driving the client's normal map refresh.
- Because replication is asynchronous (ADR-0026), a replica read may observe a
value older than the primary. This is the Envoy ReadPolicy model: non-primary
read targets may return stale data due to async replication
[envoy-redis-readpolicy] [redis-cluster-async-replication]. IronCache does
not silently proxy reads to the primary to hide this; the client chose the
replica by setting READONLY and is told the staleness bound.
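
The slot computation that routing relies on (CRC16 mod 16384) can be sketched as follows. The CRC variant assumed here is CRC-16/XMODEM (polynomial 0x1021, zero initial value), which is the variant Redis Cluster uses; hash-tag handling is outside what this section states and is deliberately left out, and the function names are illustrative.

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC-16/XMODEM: polynomial 0x1021, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: bytes) -> int:
    """Map a key to one of the 16384 slots (hash tags not handled)."""
    return crc16_xmodem(key) % 16384


check = crc16_xmodem(b"123456789")  # standard check input for this CRC
slot = key_slot(b"foo")
```

A replica would compare `key_slot(key)` against the set of slots it replicates: a hit on a read-only connection is answered locally, and a miss returns MOVED.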

### Bounded staleness surfaced to clients

- Each replica tracks its replication lag against the primary using the same
in-sync signal ADR-0026 bounds with min-replicas-max-lag. A replica whose lag
exceeds the configured staleness bound stops serving read-only reads for its
slots and returns MOVED, so a client never silently reads beyond the bound.
- The bound is observable, not just enforced: it is exposed through INFO
replication fields and the CLUSTER SHARDS health/role projection (#70), so a
client or operator can reason about the worst-case staleness of any replica
read. This makes the best-effort-not-CP property of ADR-0026 legible at the
read path rather than hidden.
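
The staleness gate and its observable counterpart can be sketched together. This is a hypothetical sketch: the function names, reply labels, and the report keys are invented for illustration, since the actual INFO field names and the default bound are not fixed by this spec.

```python
def replica_read_reply(read_only, slot_replicated, lag_s, staleness_bound_s):
    """Decide how a replica answers a read, enforcing the staleness bound."""
    if not read_only or not slot_replicated:
        return "MOVED"
    if lag_s > staleness_bound_s:
        # Past the bound the replica stops serving read-only reads.
        return "MOVED"
    return "SERVE_LOCAL"


def staleness_report(lag_s, staleness_bound_s):
    """What an operator could observe; key names are placeholders."""
    return {
        "replica_lag_s": lag_s,
        "staleness_bound_s": staleness_bound_s,
        "serving_reads": lag_s <= staleness_bound_s,
    }


fresh = replica_read_reply(True, True, lag_s=0.5, staleness_bound_s=2.0)
stale = replica_read_reply(True, True, lag_s=3.0, staleness_bound_s=2.0)
report = staleness_report(lag_s=3.0, staleness_bound_s=2.0)
```

Keeping the enforcement decision and the reported fields derived from the same two numbers means a client that sees MOVED can find the reason in the report rather than having to infer it.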

## Open questions

- Whether to expose an Envoy-style per-request ReadPolicy hint
(PREFER_REPLICA/PREFER_MASTER) beyond the binary READONLY/READWRITE bit
[envoy-redis-readpolicy], or keep the Redis-native pair only for v1.
- The exact default staleness bound, and whether it is derived from
min-replicas-max-lag (ADR-0026) or set independently per keyspace.
- How a replica that crosses the staleness bound interacts with the #70 ASK
path during an in-flight slot migration.

## Acceptance and test hooks

- A READONLY connection reads from a replica for a replicated slot; the same
connection after READWRITE gets MOVED to the primary for that slot.
- A replica past its staleness bound returns MOVED for read-only reads and its
lag is visible in INFO/CLUSTER SHARDS before and after crossing the bound.
- Unmodified redis-cli, go-redis, lettuce, and ioredis route replica reads via
READONLY without errors, matching the #70 contract.

## References

- ADR-0026; issues #147, #76, #70, #1; specs CLUSTER_CONTRACT (#70).
- Claims: [redis-cluster-readonly-replica], [envoy-redis-readpolicy],
[redis-cluster-async-replication].
31 changes: 31 additions & 0 deletions docs/prior-art/claims.yaml
@@ -6982,3 +6982,34 @@ claims:
    note: rustsec.org states the database is maintained by the Rust Secure Code Working Group, covers
      crates published via crates.io, uses RUSTSEC-YYYY-NNNN IDs (example RUSTSEC-2022-0051), and exports
      to OSV in real time for the GitHub Advisory Database/Dependabot. Cross-checked against github.com/rustsec/advisory-db.
- id: redis-cluster-readonly-replica
  dimension: resp-protocol-compat
  system: Redis Cluster (READONLY / READWRITE connection commands)
  version: 'Redis 3.0.0+ (current docs: redis.io latest)'
  claim: Redis Cluster surfaces replica reads with bounded staleness as a per-connection contract via
    the READONLY and READWRITE commands. READONLY tells a Redis Cluster replica node that the client is
    willing to read possibly stale data and is not interested in running write queries; with the connection
    in readonly mode the cluster only sends a redirection to the client when the operation involves keys
    not served by the replica's master (e.g. slots never owned by that master, or after a resharding).
    READWRITE resets the readonly flag back to the default, where read queries against a replica are redirected
    to the authoritative master. Both commands have been available since Redis 3.0.0 and belong to the
    `cluster` command group.
  value: READONLY = opt-in per-connection 'willing to read possibly stale data' from replica (redirect
    only for keys not served by replica's master); READWRITE resets to default; since Redis 3.0.0; group=cluster
  source_url: https://redis.io/docs/latest/commands/readonly/
  accessed_date: '2026-06-13'
  confidence: high
  confidence_reason: Read directly from the official redis.io command reference for READONLY; the 'willing
    to read possibly stale data' wording, redirection conditions, 'since 3.0.0' version, and 'cluster'
    command group are quoted from the upstream page, with READWRITE semantics corroborated by the companion
    redis.io page.
  load_bearing: false
  verification:
    verdict: confirmed
    best_source_url: https://redis.io/docs/latest/commands/readonly/
    note: 'WebFetch of redis.io/docs/latest/commands/readonly confirmed the verbatim description (''willing
      to read possibly stale data and is not interested in running write queries''), the redirection-only-for-unserved-keys
      behavior, ''Available since: 3.0.0'', and ''Group: cluster''. The complementary READWRITE command
      (redis.io/docs/latest/commands/readwrite) was surfaced in the same search and resets the readonly
      flag; redis-py-cluster docs independently note replicas may not return latest data due to asynchronous
      replication, consistent with the bounded-staleness framing.'