Skip to content

[Liveness Bug] Execution Engine Deadlock in strongconnect Due to Unbounded Busy-Wait #26

@SherlockShemol

Description

@SherlockShemol

Summary

The execution engine in epaxos-exec.go contains an unbounded busy-wait loop that can cause system-wide deadlock when processing transitive dependencies.

Bug Type

Attribute Value
Type Liveness Violation
Severity 🔴 Critical
Threat Model CFT (Compliant)

Problematic Code

File: src/epaxos/epaxos-exec.go, Lines 71-83

func (e *Exec) strongconnect(v *Instance, index *int) bool {
    // ...
    for q := int32(0); q < int32(e.r.N); q++ {
        inst := v.Deps[q]
        for i := e.r.ExecedUpTo[q] + 1; i <= inst; i++ {
            // BUG: Unbounded wait for instance to exist
            for e.r.InstanceSpace[q][i] == nil || e.r.InstanceSpace[q][i].Cmds == nil || v.Cmds == nil {
                time.Sleep(1000 * 1000)  // ← NO TIMEOUT! Can wait forever!
            }
            // ...
            // BUG: Unbounded wait for instance to be COMMITTED
            for e.r.InstanceSpace[q][i].Status != epaxosproto.COMMITTED {
                time.Sleep(1000 * 1000)  // ← NO TIMEOUT! INFINITE WAIT!
            }
            // ...
        }
    }
}

Root Cause

  1. strongconnect (Tarjan's SCC algorithm) busy-waits for dependencies to become COMMITTED
  2. The main executeCommands loop (which triggers recovery) is blocked waiting for strongconnect to return
  3. Recovery for uncommitted dependencies cannot be initiated because the recovery trigger is in the blocked loop
  4. Circular dependency: Execution waits for recovery, but recovery cannot start because execution is blocked

Trigger Conditions

  1. Command X is COMMITTED with dependencies
  2. One of X's transitive dependencies (e.g., Z) is PREACCEPTED but not COMMITTED
  3. Z's leader has crashed
  4. executeCommand(X) is called, which calls strongconnect
  5. strongconnect enters unbounded busy-wait for Z to become COMMITTED
  6. Recovery for Z cannot be triggered (blocked by strongconnect)
  7. System deadlock

Deadlock Scenario

Dependency Chain: X → Y → Z

State:
  X: COMMITTED
  Y: COMMITTED  
  Z: PREACCEPTED (leader crashed)

Execution Flow:
  executeCommands() 
    → executeCommand(X)
      → findSCC(X)
        → strongconnect(X)
          → check dependency Y
            → strongconnect(Y)
              → check dependency Z
                → for Z.Status != COMMITTED {
                      time.Sleep(...)  // INFINITE WAIT!
                  }
                  // Recovery CANNOT be triggered!
                  // Because executeCommands is blocked here!

Impact

  • System-wide execution deadlock
  • All commands depending on the stuck chain cannot execute
  • No automatic recovery possible
  • Requires manual intervention (restart)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions