
fix(hermes): runner concurrency + bridge restart in talk#187

Merged
rafeegnash merged 1 commit into master from fix/20-hermes-scanner-race
Jun 4, 2026

@rafeegnash

Summary

Bundled fix for #20 (hermes Runner scanner data race + lost notifications) and #21 (clanker talk doesn't recover when the bridge crashes). They share the same plumbing: the dispatcher's done signal is exactly where bridge-death detection lives, so splitting them into separate PRs would leave a seam where the two halves could drift apart.

#20 — Runner dispatcher

  • Single dispatcher goroutine owns bufio.Scanner. No other goroutine touches it. (Pre-fix: call() and Prompt() both called Scan() on the shared scanner → race + cross-talk.)
  • Responses route by ID via a per-request inbox map; notifications go to the currently-active prompt's sink.
  • promptMu serialises Prompt() calls — the wire protocol doesn't tag notifications with a request ID, so overlapping prompts can't be demuxed safely.
  • ErrBridgeExited + IsBridgeExitError so callers can distinguish "bridge died" from "you asked a bad question."
  • Pending callers unblock when the dispatcher sees EOF; previously they hung forever.
  • Prompt drains queued notifications before yielding the final event so deltas aren't reordered behind the final when the bridge sends them back-to-back.
  • Panic recovery on the dispatcher and prompt-streamer goroutines.

#21 — Bridge restart in talk

  • cmd/talk REPL retries one prompt after restarting the bridge on ErrBridgeExited.
  • Sliding-window cap: 3 restarts per minute. If the bridge keeps crashing, the REPL surfaces the error and exits cleanly rather than burning CPU in a restart loop.
  • Small ringBuffer captures last 4 KiB of bridge stderr — when the restart message surfaces it includes the trailing 3 lines (typically the Python traceback's ModuleNotFoundError: ... line) instead of just "bridge process exited."

Test plan

  • go test -race -count=1 ./internal/hermes/... — passes
  • go vet ./... — clean
  • gofmt -d — clean
  • New tests cover:
    • ID-routed responses don't cross
    • Notifications delivered in order, then final
    • Concurrent Prompt callers don't see each other's deltas
    • Bridge-death unblocks pending calls with ErrBridgeExited
    • 200-delta flood: notif sink drops without wedging the final response
    • IsBridgeExitError recognises wrapped + joined errors

Closes #20
Closes #21

Two bundled fixes — they share the same plumbing.

#20 — Runner had multiple goroutines calling scanner.Scan on the same
bufio.Scanner. Under -race that's an immediate data race; in
production it manifested as cross-talk between concurrent prompts and
silent loss of the final response when the bridge died mid-call.

Replace the ad-hoc reader with a single dispatcher goroutine that owns
the scanner. Responses route to the awaiting goroutine via a per-ID
inbox map; notifications go to the currently-active prompt's sink.
Prompts are serialised by promptMu (the wire protocol doesn't tag
notifications with a request ID, so overlapping prompts can't be
demuxed). Pending callers unblock with ErrBridgeExited when the
dispatcher sees EOF — previously they hung indefinitely.

Prompt now drains queued notifications before emitting the final
event so deltas never get reordered behind the final when both
channels are buffered.
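
The drain step could be sketched like this. When the notification and final-response channels are both ready, `select` picks between them at random, so the final can win even though deltas were queued first. The names `event` and `streamPrompt` are hypothetical, and the sketch assumes the dispatcher enqueues every delta before it delivers the final response, since it reads the stream sequentially.

```go
package main

import "fmt"

// event is an illustrative stream item: Kind is "delta" or "final".
type event struct {
	Kind string
	Text string
}

// streamPrompt forwards deltas as they arrive. When the final response shows
// up, it first flushes any deltas still queued, then emits the final event.
func streamPrompt(notifs <-chan string, final <-chan string, out chan<- event) {
	defer close(out)
	for {
		select {
		case d := <-notifs:
			out <- event{Kind: "delta", Text: d}
		case f := <-final:
		drain:
			for {
				select {
				case d := <-notifs:
					out <- event{Kind: "delta", Text: d}
				default:
					break drain // nothing left queued
				}
			}
			out <- event{Kind: "final", Text: f}
			return
		}
	}
}

func main() {
	notifs := make(chan string, 8)
	final := make(chan string, 1)
	out := make(chan event, 16)

	// Simulate a bridge that sent everything back-to-back before we looked.
	notifs <- "Hel"
	notifs <- "lo"
	final <- "Hello"

	streamPrompt(notifs, final, out)
	for ev := range out {
		fmt.Println(ev.Kind, ev.Text)
	}
}
```

Whichever branch the outer `select` takes first, the two deltas are emitted before the final event.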

#21 — talk's REPL now restarts the bridge transparently when it dies
mid-session, capped at 3 restarts per minute. Before, every prompt
after the first crash just printed "bridge process exited" forever.
A small stderrTail ring buffer captures the last 4 KiB of bridge
stderr so the error message includes the actual Python traceback
(usually a ModuleNotFoundError) rather than the bare exit string.

Tests cover ID routing, ordered delta-then-final delivery, prompt
serialisation under concurrent callers, bridge-death unblock, and a
flood scenario that exercises the dispatcher's drop-on-full notif
sink semantics. All pass under -race.

Closes #20
Closes #21
@rafeegnash rafeegnash merged commit f516b6b into master Jun 4, 2026
5 checks passed
@rafeegnash rafeegnash deleted the fix/20-hermes-scanner-race branch June 4, 2026 15:20