-
Notifications
You must be signed in to change notification settings - Fork 137
Open
Description
Summary
The execution engine in epaxos-exec.go contains an unbounded busy-wait loop that can cause system-wide deadlock when processing transitive dependencies.
Bug Type
| Attribute | Value |
|---|---|
| Type | Liveness Violation |
| Severity | 🔴 Critical |
| Threat Model | CFT (Compliant) |
Problematic Code
File: src/epaxos/epaxos-exec.go, Lines 71-83
func (e *Exec) strongconnect(v *Instance, index *int) bool {
// ...
for q := int32(0); q < int32(e.r.N); q++ {
inst := v.Deps[q]
for i := e.r.ExecedUpTo[q] + 1; i <= inst; i++ {
// BUG: Unbounded wait for instance to exist
for e.r.InstanceSpace[q][i] == nil || e.r.InstanceSpace[q][i].Cmds == nil || v.Cmds == nil {
time.Sleep(1000 * 1000) // ← NO TIMEOUT! Can wait forever!
}
// ...
// BUG: Unbounded wait for instance to be COMMITTED
for e.r.InstanceSpace[q][i].Status != epaxosproto.COMMITTED {
time.Sleep(1000 * 1000) // ← NO TIMEOUT! INFINITE WAIT!
}
// ...
}
}
}Root Cause
strongconnect(Tarjan's SCC algorithm) busy-waits for dependencies to becomeCOMMITTED- The main
executeCommandsloop (which triggers recovery) is blocked waiting forstrongconnectto return - Recovery for uncommitted dependencies cannot be initiated because the recovery trigger is in the blocked loop
- Circular dependency: Execution waits for recovery, but recovery cannot start because execution is blocked
Trigger Conditions
- Command X is
COMMITTEDwith dependencies - One of X's transitive dependencies (e.g., Z) is
PREACCEPTEDbut notCOMMITTED - Z's leader has crashed
executeCommand(X)is called, which callsstrongconnectstrongconnectenters unbounded busy-wait for Z to becomeCOMMITTED- Recovery for Z cannot be triggered (blocked by
strongconnect) - System deadlock
Deadlock Scenario
Dependency Chain: X → Y → Z
State:
X: COMMITTED
Y: COMMITTED
Z: PREACCEPTED (leader crashed)
Execution Flow:
executeCommands()
→ executeCommand(X)
→ findSCC(X)
→ strongconnect(X)
→ check dependency Y
→ strongconnect(Y)
→ check dependency Z
→ for Z.Status != COMMITTED {
time.Sleep(...) // INFINITE WAIT!
}
// Recovery CANNOT be triggered!
// Because executeCommands is blocked here!
Impact
- System-wide execution deadlock
- All commands depending on the stuck chain cannot execute
- No automatic recovery possible
- Requires manual intervention (restart)
Metadata
Metadata
Assignees
Labels
No labels