
clanker talk doesn't restart Hermes bridge on crash — REPL error-spams until exit #21

@rafeegnash

Description


Problem

When the Hermes bridge process crashes mid-session, Prompt returns an error, and the clanker talk REPL loop just prints it and carries on. Every subsequent prompt is sent to the dead bridge, so the user sees a stream of errors until they type exit. There is no health check and no restart logic.

Where

  • cmd/talk.go:111-114: error handling in the REPL loop
  • internal/hermes/runner.go: no IsHealthy() check and no restart mechanism

Fix

  1. In the REPL loop, detect "bridge process exited" errors, call runner.Stop(), then start a fresh runner before accepting the next input.
  2. Cap restarts at 3 per minute; beyond that, give up and exit the REPL.
  3. On exit, log the underlying cause: the bridge's stderr or exit code, as captured in the dispatch loop from #10 (feat(k8s/cluster): add errorHint method to EKS provider).
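// In the REPL loop (cmd/talk.go), after Prompt returns err:
if isBridgeExitError(err) {
    // Give up after more than 3 restarts within the last minute.
    if restartCount.IncrementWithinWindow() > 3 {
        return fmt.Errorf("bridge keeps crashing, giving up: %w", err)
    }
    runner.Stop()
    if startErr := runner.Start(ctx); startErr != nil {
        return fmt.Errorf("restarting bridge: %w", startErr)
    }
    continue // bridge restarted; go back to reading the next input
}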

Acceptance criteria

  • Manual test: kill the bridge process mid-session; the next prompt restarts the bridge transparently.
  • The restart counter prevents infinite restart loops.
  • The error message on permanent failure includes the tail of the bridge's stderr.

Metadata


Assignees

No one assigned

Labels

bug (Something isn't working)
priority: critical (Must fix immediately - security or data loss risk)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
