Raft is a consensus algorithm designed as an alternative to Paxos. It was created to be more understandable than Paxos while providing equivalent safety and performance guarantees. In Chorus, Raft ensures that all nodes agree on the same sequence of operations, providing strong consistency guarantees for the distributed key-value store.
Raft maintains a replicated log of operations across all nodes in the cluster. The log is the source of truth for the system state, and all nodes must have identical logs to maintain consistency.
Raft uses a leader-based approach where one node is elected as the leader and handles all client requests. Followers replicate the leader's log and redirect client requests to the leader.
Raft guarantees several safety properties:
- Election Safety: At most one leader can be elected per term
- Leader Append-Only: A leader never overwrites or deletes entries in its log
- Log Matching: If two logs contain an entry with the same index and term, the logs are identical in all entries up through that index
- Leader Completeness: If a log entry is committed in a given term, it will be present in the logs of all future leaders
Each log entry in Chorus contains the following fields (a command-encoding sketch follows the list):
- Term: The term number when the entry was created
- Index: The position in the log
- Command: The actual operation (SET or DELETE)
- Timestamp: When the entry was created
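For illustration only, the application-level command carried in an entry might be encoded like the Go sketch below. The struct, field names, and JSON encoding are assumptions rather than Chorus's actual types; the term and index are stamped on the entry by Raft itself.

```go
package chorus

import "encoding/json"

// Command is a hypothetical application-level payload stored in a log
// entry's Data field; Raft assigns the entry's term and index.
type Command struct {
	Op        string `json:"op"`        // "SET" or "DELETE"
	Key       string `json:"key"`
	Value     string `json:"value,omitempty"`
	Timestamp int64  `json:"timestamp"` // creation time, e.g. Unix nanoseconds
}

// encodeCommand serializes a command so it can be proposed to the Raft log.
func encodeCommand(c Command) ([]byte, error) {
	return json.Marshal(c)
}
```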
The Finite State Machine (FSM) processes the following command types (command application is sketched after the list):
- SET Command: Stores or updates a key-value pair
- DELETE Command: Removes a key-value pair
- Snapshot Command: Creates a snapshot of the current state
  - Triggered when the log size exceeds SnapshotThreshold
  - Occurs at regular intervals defined by SnapshotInterval
  - Creates a point-in-time view of the entire key-value store
  - Used when a new node joins the cluster
  - Applied during node recovery after prolonged downtime
  - Reduces recovery time by avoiding full log replay
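Assuming Chorus builds on the hashicorp/raft library (which the FSM and storage interface names later in this section suggest) and JSON-encoded commands as sketched above, applying SET and DELETE could look roughly like this; the fsm type and its fields are illustrative, not the project's actual code:

```go
package chorus

import (
	"encoding/json"
	"sync"

	"github.com/hashicorp/raft"
)

type fsm struct {
	mu    sync.Mutex
	store map[string]string
}

// Apply is invoked by Raft once an entry is committed; it decodes the
// command and mutates the key-value store accordingly.
func (f *fsm) Apply(entry *raft.Log) interface{} {
	var c struct {
		Op    string `json:"op"`
		Key   string `json:"key"`
		Value string `json:"value"`
	}
	if err := json.Unmarshal(entry.Data, &c); err != nil {
		return err
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	switch c.Op {
	case "SET":
		f.store[c.Key] = c.Value
	case "DELETE":
		delete(f.store, c.Key)
	}
	return nil
}
```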
Chorus provides several configurable Raft parameters (a configuration sketch follows the list):
- HeartbeatTimeout: How often the leader sends heartbeats (default: 1s)
- ElectionTimeout: How long followers wait without hearing from a leader before starting an election (default: 1s)
- CommitTimeout: Maximum time to wait for log commit (default: 50ms)
- LeaderLeaseTimeout: Leader lease timeout (default: 500ms)
- MaxAppendEntries: Maximum entries per AppendEntries RPC (default: 64)
- SnapshotInterval: Time between automatic snapshots (default: 30s)
- SnapshotThreshold: Number of log entries accumulated since the last snapshot before a new one is taken (default: 1024)
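If Chorus uses hashicorp/raft, whose Config struct has fields matching these names, the defaults listed above could be applied roughly as follows; treat this as a sketch, not the project's actual setup code:

```go
package chorus

import (
	"time"

	"github.com/hashicorp/raft"
)

// newRaftConfig applies the documented defaults on top of the library defaults.
func newRaftConfig(nodeID string) *raft.Config {
	cfg := raft.DefaultConfig()
	cfg.LocalID = raft.ServerID(nodeID)
	cfg.HeartbeatTimeout = 1 * time.Second
	cfg.ElectionTimeout = 1 * time.Second
	cfg.CommitTimeout = 50 * time.Millisecond
	cfg.LeaderLeaseTimeout = 500 * time.Millisecond
	cfg.MaxAppendEntries = 64
	cfg.SnapshotInterval = 30 * time.Second
	cfg.SnapshotThreshold = 1024
	return cfg
}
```

If that library is in use, note that it requires LeaderLeaseTimeout to be no larger than HeartbeatTimeout and ElectionTimeout to be at least HeartbeatTimeout; the defaults above satisfy both constraints.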
In the current implementation, cluster membership is statically configured; a bootstrap sketch follows the list:
- Nodes are defined in configuration files
- Initial cluster bootstrap uses predefined node list
- All nodes must know about each other
- The first node (node1) bootstraps the cluster
- Subsequent nodes join by contacting existing nodes
- Once joined, nodes participate in consensus
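Under the same hashicorp/raft assumption, bootstrap and join could be expressed roughly as below; the helper names, node IDs, and timeout are illustrative:

```go
package chorus

import (
	"time"

	"github.com/hashicorp/raft"
)

// bootstrap is called once, on the first node, with the predefined node list.
func bootstrap(r *raft.Raft, servers []raft.Server) error {
	return r.BootstrapCluster(raft.Configuration{Servers: servers}).Error()
}

// join is called on the current leader to add a new node as a voting member.
func join(r *raft.Raft, id, addr string) error {
	return r.AddVoter(raft.ServerID(id), raft.ServerAddress(addr), 0, 10*time.Second).Error()
}
```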
Performance is shaped by the following factors:
- Latency: Determined by the time to replicate an entry to a majority of nodes
- Throughput: Limited by the leader's processing capacity and network bandwidth
- Consistency: Strong consistency is guaranteed for all writes
Read behavior depends on which node serves the request (a read-path sketch follows the list):
- Leader Reads: Always return the latest committed data
- Follower Reads: May return slightly stale data but avoid a round trip through the leader
- Availability: Reads can be served by any non-candidate node
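As an illustration of these read paths (again assuming hashicorp/raft), a strongly consistent read can verify leadership before answering, while a relaxed read is served from the node's local copy of the state; the kvStore type and read method are hypothetical:

```go
package chorus

import (
	"errors"
	"sync"

	"github.com/hashicorp/raft"
)

// kvStore is an illustrative local view of the replicated state.
type kvStore struct {
	mu   sync.RWMutex
	data map[string]string
}

// read serves a key. When consistent is true, the node first confirms it is
// still the leader, so the result reflects the latest committed write; when
// false, any node answers from its local copy, which may lag the leader.
func (s *kvStore) read(r *raft.Raft, key string, consistent bool) (string, error) {
	if consistent {
		if err := r.VerifyLeader().Error(); err != nil {
			// Not (or no longer) the leader: the client should retry against the leader.
			return "", err
		}
	}
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[key]
	if !ok {
		return "", errors.New("key not found")
	}
	return v, nil
}
```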
Recovery after a node restart can take several forms:
- Full Recovery: Requires replaying the entire log from the beginning
- Snapshot Recovery: Much faster; restores the latest snapshot and replays only the log entries that follow it
- Incremental Catch-up: Followers that are only slightly behind receive the missing entries through AppendEntries rather than a full snapshot
Chorus implements the standard Raft RPCs:
- RequestVote: Sent by candidates during elections
- AppendEntries: Sent by leaders to replicate log entries
- InstallSnapshot: Sent by leaders to transfer snapshots
The implementation uses several storage interfaces (a wiring sketch follows the list):
- LogStore: Persistent storage for Raft log
- StableStore: Storage for stable state (current term, voted for)
- SnapshotStore: Storage for snapshots
- Transport: Network transport for RPC communication
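Under the hashicorp/raft assumption, these four pieces are commonly wired together as shown below, here using raft-boltdb for the log and stable stores; the file paths, connection pool size, and timeouts are illustrative, not Chorus's actual values:

```go
package chorus

import (
	"net"
	"os"
	"path/filepath"
	"time"

	"github.com/hashicorp/raft"
	raftboltdb "github.com/hashicorp/raft-boltdb"
)

func openRaft(cfg *raft.Config, fsm raft.FSM, dataDir, bindAddr string) (*raft.Raft, error) {
	// LogStore and StableStore: one BoltDB file holds the log entries plus
	// stable state (current term, voted-for).
	boltStore, err := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft.db"))
	if err != nil {
		return nil, err
	}
	// SnapshotStore: keep the two most recent snapshots on disk.
	snaps, err := raft.NewFileSnapshotStore(dataDir, 2, os.Stderr)
	if err != nil {
		return nil, err
	}
	// Transport: TCP transport carrying RequestVote, AppendEntries, and
	// InstallSnapshot RPCs between nodes.
	addr, err := net.ResolveTCPAddr("tcp", bindAddr)
	if err != nil {
		return nil, err
	}
	trans, err := raft.NewTCPTransport(bindAddr, addr, 3, 10*time.Second, os.Stderr)
	if err != nil {
		return nil, err
	}
	return raft.NewRaft(cfg, fsm, boltStore, boltStore, snaps, trans)
}
```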
The FSM in Chorus:
- Applies commands to the key-value store
- Handles snapshot creation and restoration
- Provides thread-safe access to the store (snapshot handling and locking are sketched below)
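A sketch of the snapshot side under the same assumptions: Snapshot copies the store while holding a lock so writers are blocked only briefly, and Restore rebuilds the store from a snapshot stream; the type names and JSON format are illustrative:

```go
package chorus

import (
	"encoding/json"
	"io"
	"sync"

	"github.com/hashicorp/raft"
)

type kvFSM struct {
	mu    sync.RWMutex
	store map[string]string
}

// Snapshot returns a point-in-time copy of the store for persisting.
func (f *kvFSM) Snapshot() (raft.FSMSnapshot, error) {
	f.mu.RLock()
	defer f.mu.RUnlock()
	clone := make(map[string]string, len(f.store))
	for k, v := range f.store {
		clone[k] = v
	}
	return &kvSnapshot{state: clone}, nil
}

// Restore replaces the store with the contents of a snapshot stream.
func (f *kvFSM) Restore(rc io.ReadCloser) error {
	defer rc.Close()
	state := make(map[string]string)
	if err := json.NewDecoder(rc).Decode(&state); err != nil {
		return err
	}
	f.mu.Lock()
	f.store = state
	f.mu.Unlock()
	return nil
}

type kvSnapshot struct{ state map[string]string }

// Persist writes the captured state to the snapshot sink.
func (s *kvSnapshot) Persist(sink raft.SnapshotSink) error {
	if err := json.NewEncoder(sink).Encode(s.state); err != nil {
		sink.Cancel()
		return err
	}
	return sink.Close()
}

func (s *kvSnapshot) Release() {}
```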