A hands-on Java 21 study project covering every major distributed-systems concept required for the Elasticsearch Distributed Systems engineering role. Every class is extensively annotated with study notes, interview talking points, and direct references to real Elasticsearch internals.
| Package | Class | Concept |
|---|---|---|
| model | `Node` | Node roles, quorum formula, master-eligible nodes |
| model | `ShardRouting` | Primary/replica model, ISR, seqNo, primaryTerm, shard lifecycle |
| model | `ClusterState` | Immutable state, two-phase commit publication, builder pattern |
| service | `RaftLeaderElection` | Raft consensus — terms, roles, voting, heartbeats, split-brain avoidance |
| service | `ShardAllocationService` | Allocation constraints, same-node exclusion, Murmur3 routing hash, ISR promotion |
| service | `ConcurrentIndexingService` | StampedLock, LongAdder, optimistic CAS, Callable-based replica fan-out |
| util | `TranslogWriter` | Write-ahead log, fsync strategy, CRC frames, crash-recovery replay |
| util | `AsyncNetworkChannel` | Netty request/response correlation, timeout scheduling, backpressure |
| client | `ElasticsearchClientFactory` | Official Java client lifecycle, connection pooling, auth |
| config | `ElasticsearchConfig` | Environment-based configuration record |
| — | `Main` | End-to-end wiring demo — runs the full lifecycle |
### RaftLeaderElection

- Three roles: FOLLOWER → CANDIDATE → LEADER
- Randomised election timeouts to prevent split votes
- Term-based epoch: any message with a higher term causes an immediate step-down
- `knownLeaderId()` returns `Optional<String>` — an explicit "no leader yet" contract
- `votingConfiguration()` returns the current Raft quorum membership
- Quorum and fault-tolerance formulas (see the Java sketch below):
```
quorum         = ⌊N / 2⌋ + 1
faultTolerance = ⌊(N - 1) / 2⌋
```
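A minimal sketch of the quorum arithmetic; the class and method names here are illustrative, not part of the project:

```java
/** Quorum math for N master-eligible nodes (illustrative helper, not the project's API). */
public final class QuorumMath {

    /** Smallest majority: ⌊N/2⌋ + 1. Integer division performs the floor. */
    static int quorum(int masterEligibleNodes) {
        return masterEligibleNodes / 2 + 1;
    }

    /** Nodes that may fail while a majority survives: ⌊(N-1)/2⌋. */
    static int faultTolerance(int masterEligibleNodes) {
        return (masterEligibleNodes - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 3, 4, 5, 7}) {
            System.out.printf("N=%d  quorum=%d  faultTolerance=%d%n",
                    n, quorum(n), faultTolerance(n));
        }
        // N=4 yields quorum=3, faultTolerance=1 — the same tolerance as N=3,
        // which is why odd cluster sizes are recommended.
    }
}
```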
### ShardAllocationService

- Hard constraint: primary and replica of the same shard never land on the same node
- ISR promotion on primary failure (the replica with the highest seqNo wins)
- Document routing via consistent hash (sketched after this list): `targetShard = |Murmur3(routing_value)| % num_primary_shards`
- Why shard counts are fixed at index creation time
- Disk threshold watermarking and rebalance throttling
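A minimal routing sketch. To stay dependency-free it substitutes `String#hashCode` for Murmur3 (Elasticsearch itself hashes the routing value with Murmur3); the class and method names are illustrative:

```java
/** Routing sketch: maps a routing value to a primary shard index. */
public final class DocumentRouter {

    static int targetShard(String routingValue, int numPrimaryShards) {
        int hash = routingValue.hashCode();           // stand-in for Murmur3(routing_value)
        // floorMod rather than Math.abs(hash) % n: abs overflows for Integer.MIN_VALUE
        return Math.floorMod(hash, numPrimaryShards);
    }

    public static void main(String[] args) {
        // The same routing value always lands on the same shard — which is exactly
        // why number_of_shards cannot change after index creation: a different
        // modulus would re-route existing documents and make them unfindable.
        System.out.println(targetShard("user-42", 3));
        System.out.println(targetShard("user-42", 3)); // identical result
    }
}
```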
### ConcurrentIndexingService

- `StampedLock` — optimistic reads (no lock overhead on the hot path) with fallback to a pessimistic read lock (sketched below)
- `AtomicLong` seqNo generator — a single CAS instruction, lock-free
- `LongAdder` ops counter — stripe-sharded to eliminate CAS contention under parallel writes
- `Callable<Long>` replica tasks — return the replica's local checkpoint for global checkpoint advancement
- ISR replica discovery from the live `ClusterState` routing table
- Optimistic concurrency control: `if_seq_no` + `if_primary_term`
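A minimal sketch of the hot-path concurrency pattern those bullets describe; field and method names are hypothetical, not the project's actual API:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.StampedLock;

/** Illustrative sketch of the StampedLock / AtomicLong / LongAdder trio. */
public final class CheckpointTracker {
    private final StampedLock lock = new StampedLock();
    private final AtomicLong seqNoGenerator = new AtomicLong(-1); // CAS-based, lock-free
    private final LongAdder opsCounter = new LongAdder();          // striped, low contention
    private long localCheckpoint = -1;                             // guarded by lock

    long nextSeqNo() {
        opsCounter.increment();
        return seqNoGenerator.incrementAndGet();
    }

    /** Optimistic read: no lock acquired on the happy path. */
    long localCheckpoint() {
        long stamp = lock.tryOptimisticRead();
        long value = localCheckpoint;
        if (!lock.validate(stamp)) {     // a writer slipped in; fall back
            stamp = lock.readLock();     // pessimistic read lock
            try {
                value = localCheckpoint;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return value;
    }

    void advanceCheckpoint(long seqNo) {
        long stamp = lock.writeLock();
        try {
            localCheckpoint = Math.max(localCheckpoint, seqNo);
        } finally {
            lock.unlockWrite(stamp);
        }
    }
}
```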
### TranslogWriter

- Write-ahead log appended before the client is acknowledged
- Frame format (append sketched below): `seqNo(8) | primaryTerm(8) | bodyLen(4) | body | CRC32(4)`
- `syncOnWrite=true` → `FileChannel.force()` per request (durability = request mode)
- `readOpsFrom(path, offset)` — crash-recovery replay using `RandomAccessFile.seek()` to jump directly to the Lucene commit offset
- A CRC mismatch on the trailing entry = a partial write from a crash → safe truncation
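A minimal sketch of appending one frame in the layout above; `FrameAppender` is a hypothetical simplification of the real writer:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.zip.CRC32;

/** Appends one translog frame: seqNo(8) | primaryTerm(8) | bodyLen(4) | body | CRC32(4). */
public final class FrameAppender {

    static void append(FileChannel channel, long seqNo, long primaryTerm,
                       byte[] body, boolean syncOnWrite) throws IOException {
        ByteBuffer frame = ByteBuffer.allocate(8 + 8 + 4 + body.length + 4);
        frame.putLong(seqNo).putLong(primaryTerm).putInt(body.length).put(body);

        CRC32 crc = new CRC32();
        crc.update(frame.array(), 0, frame.position()); // checksum everything before the CRC field
        frame.putInt((int) crc.getValue());

        frame.flip();
        while (frame.hasRemaining()) {
            channel.write(frame);
        }
        if (syncOnWrite) {
            channel.force(false); // request-mode durability: fsync before the client is acked
        }
    }
}
```

A caller would obtain the channel via `FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)`.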
### AsyncNetworkChannel

- Netty's fire-and-forget + correlation-map pattern (sketched below)
- Every request gets a unique `requestId`; a `CompletableFuture` is stored in a `ConcurrentHashMap`
- The response frame completes the future; a timeout callback on the event loop cleans it up
- Mirrors `TransportService#sendRequest` + `PendingResponseHandlers` in the ES source
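A minimal sketch of the correlation-map pattern, with a `ScheduledExecutorService` standing in for Netty's event loop; all names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

/** Fire-and-forget sends correlated to responses by requestId. */
public final class RequestCorrelator {
    private final Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    private final AtomicLong requestIds = new AtomicLong();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    /** Register the future, schedule its timeout, return immediately. */
    CompletableFuture<String> send(String request, long timeoutMillis) {
        long id = requestIds.incrementAndGet();
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(id, future);
        timer.schedule(() -> {
            // Timeout path: remove the entry so the map cannot leak, then fail the future.
            CompletableFuture<String> f = pending.remove(id);
            if (f != null) {
                f.completeExceptionally(new TimeoutException("request " + id + " timed out"));
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
        // ... here the frame (id + request) would be written to the wire ...
        return future;
    }

    /** Called when a response frame arrives: look up by requestId and complete. */
    void onResponse(long requestId, String response) {
        CompletableFuture<String> future = pending.remove(requestId);
        if (future != null) {
            future.complete(response);
        }
    }
}
```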
### ClusterState

- Fully immutable Java `record` with deep defensive copies (sketched below)
- Only the elected master produces a new state (via `Builder`)
- The version is monotonically increasing — stale states are silently ignored
- Two-phase commit: pre-publish to quorum → commit → nodes apply atomically
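A minimal sketch of the immutable-record and monotonic-version ideas; the fields shown are illustrative, not the project's real `ClusterState`:

```java
import java.util.List;

/** Immutable cluster state with a defensive copy and a monotonic version gate. */
public record ClusterStateSketch(long version, String masterNodeId, List<String> nodeIds) {

    public ClusterStateSketch {
        nodeIds = List.copyOf(nodeIds); // defensive copy: callers cannot mutate our list
    }

    /** Apply a published state only if it is newer; stale states are silently ignored. */
    static ClusterStateSketch applyIfNewer(ClusterStateSketch current, ClusterStateSketch published) {
        return published.version() > current.version() ? published : current;
    }
}
```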
### Cheat sheet

```
# Cluster sizing
Quorum      : ⌊N/2⌋ + 1   (N = master-eligible nodes)
Fault tol.  : ⌊(N-1)/2⌋
Rec. sizes  : 1, 3, 5, 7  (odd — 4 buys nothing over 3)

# Document routing
shard = |Murmur3(routing_value)| % number_of_primary_shards

# Durability modes (index.translog.durability)
request → fsync on every bulk request (default, no data loss)
async   → fsync on interval (default 5s, up to 5s of data loss)

# Checkpoints
localCheckpoint  = highest seqNo this copy has processed consecutively
globalCheckpoint = min(localCheckpoint) across all ISR members
                 = safe translog truncation point
```
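A one-method sketch of the global-checkpoint formula above, assuming a map from ISR member to its local checkpoint (illustrative names):

```java
import java.util.Map;

/** globalCheckpoint = min(localCheckpoint) across all in-sync replicas. */
public final class GlobalCheckpoint {

    static long compute(Map<String, Long> isrLocalCheckpoints) {
        return isrLocalCheckpoints.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(-1); // no ISR members tracked yet
    }

    public static void main(String[] args) {
        // Primary at 17, one replica at 15 → the translog is safe to truncate up to 15.
        System.out.println(compute(Map.of("primary", 17L, "replica-1", 15L))); // prints 15
    }
}
```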
### Build & run

Requirements: Java 21+, Maven 3.9+

```bash
# Build
mvn clean package -DskipTests

# Run tests
mvn test

# Run the demo
mvn exec:java -Dexec.mainClass="com.elasticsearch.distributed.Main"
```

The demo output walks through:
- 3-node cluster bootstrap
- Raft election simulation
- Index allocation (3 primaries × 1 replica)
- Node failure + ISR promotion
- Concurrent indexing with translog writes
- Async inter-node request/response
### Project layout

```
src/main/java/com/elasticsearch/distributed/
├── Main.java                          ← end-to-end wiring demo
├── client/
│   └── ElasticsearchClientFactory.java
├── config/
│   └── ElasticsearchConfig.java
├── model/
│   ├── ClusterState.java
│   ├── Node.java
│   └── ShardRouting.java
├── service/
│   ├── ConcurrentIndexingService.java
│   ├── RaftLeaderElection.java
│   └── ShardAllocationService.java
└── util/
    ├── AsyncNetworkChannel.java
    └── TranslogWriter.java
```