fix(smithy): silent rate-limit attributes limit to agent provider, not hardcoded 'claude'#110

Open: fnnzzz wants to merge 1 commit into stoneforge-ai:master from fnnzzz:fix/silent-rate-limit-uses-agent-provider

fnnzzz commented May 7, 2026

Summary

When the dispatch daemon detects a rate limit from a session that exited without producing any output (the "silent rate limit" / rapid-exit branch), or from the orphan-recovery loop noticing a rate-limit pattern in a worker's session history, and no fallbackChain is configured, it previously called handleRateLimitDetected('claude', resetTime) — attributing the limit to 'claude' regardless of the worker's actual provider.

For codex (or any non-claude) workers this surfaces in the dashboard rate-limit banner as "Dispatch paused — claude hit its rate limit." and causes getRateLimitStatus() to report 'claude' as the limited executable even when no claude session has ever run. Dispatch is still correctly paused (any tracked limit pauses dispatch in no-fallback-chain mode), but the attribution is misleading and masks the real failing provider from operators.

This PR adds a private resolveDefaultExecutableForAgent(agent) helper that mirrors resolveExecutableWithFallback's resolution priority — agent's executablePath override → workspace defaultExecutablePaths[provider] → bare provider name — without doing any rate-limit tracker lookups. Both rapid-exit sites now attribute the rate limit to the failing worker's executable, so the banner and getRateLimitStatus() reflect what actually hit the limit.

'claude-code' (the canonical provider name) maps to 'claude' (the binary name) so existing tracker entries and banner display for default claude workers are unchanged.
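The resolution order described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `AgentLike` and `WorkspaceSettingsLike` shapes, the `DEFAULT_PROVIDER` constant, the `PROVIDER_BINARY` table, and the choice to key `defaultExecutablePaths` by canonical provider name are all assumptions made for the example. Only the priority order and the 'claude-code' to 'claude' mapping come from the description.

```typescript
// Hypothetical shapes: the real agent and workspace types in smithy are not shown in this PR.
interface AgentLike {
  provider?: string;       // e.g. 'claude-code' or 'codex'; assumed optional
  executablePath?: string; // per-agent override
}

interface WorkspaceSettingsLike {
  // Assumed to be keyed by canonical provider name.
  defaultExecutablePaths?: Record<string, string>;
}

// Assumed default when an agent has no provider set.
const DEFAULT_PROVIDER = "claude-code";

// Canonical provider name -> binary name. Only this one entry is stated in the PR.
const PROVIDER_BINARY: Record<string, string> = { "claude-code": "claude" };

function resolveDefaultExecutableForAgent(
  agent: AgentLike,
  settings: WorkspaceSettingsLike = {},
): string {
  // 1. The agent-level override wins.
  if (agent.executablePath) return agent.executablePath;

  const provider = agent.provider ?? DEFAULT_PROVIDER;

  // 2. Otherwise use the workspace default for this provider, if any.
  const workspaceDefault = settings.defaultExecutablePaths?.[provider];
  if (workspaceDefault) return workspaceDefault;

  // 3. Otherwise fall back to the bare provider name, mapped to the
  //    binary name where the two differ. No rate-limit tracker is consulted.
  return PROVIDER_BINARY[provider] ?? provider;
}
```

Under these assumptions, a worker with `provider: 'codex'` resolves to `'codex'`, while a worker with no provider resolves to `'claude'`, which matches the backward-compatibility claim above.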

Sites changed

  • packages/smithy/src/services/dispatch-daemon.ts: attachRapidExitDetector (silent rate limit on rapid exit) and spawnRecoveryStewardForTask (orphan-recovery rate-limit pattern detection). Both replace handleRateLimitDetected('claude', …) with handleRateLimitDetected(this.resolveDefaultExecutableForAgent(worker), …) in the else branch where fallbackChain is empty. The configured-fallbackChain branch is unchanged.
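As a rough illustration of the changed else branch, the sketch below uses hypothetical stand-ins: `Worker`, `RateLimitRecorder`, `resolveExecutable`, and `recordSilentRateLimit` are invented for this example and simplify whatever the daemon really does. Only the shape of the change (a hardcoded 'claude' replaced by a per-worker lookup when the chain is empty) is taken from the description.

```typescript
// Hypothetical stand-ins; the daemon's real types and tracker are not shown in the PR.
type Worker = { provider?: string; executablePath?: string };

class RateLimitRecorder {
  readonly limits = new Map<string, Date>();

  handleRateLimitDetected(executable: string, resetsAt: Date): void {
    this.limits.set(executable, resetsAt);
  }
}

// Simplified resolver: override first, else provider name, with 'claude-code' -> 'claude'.
// The workspace-default step is omitted to keep the example short.
function resolveExecutable(worker: Worker): string {
  if (worker.executablePath) return worker.executablePath;
  const provider = worker.provider ?? "claude-code";
  return provider === "claude-code" ? "claude" : provider;
}

function recordSilentRateLimit(
  recorder: RateLimitRecorder,
  worker: Worker,
  fallbackChain: readonly string[],
  resetTime: Date,
): void {
  if (fallbackChain.length > 0) {
    // Configured-chain branch: unchanged by the PR, so elided here.
    return;
  }
  // Before the fix: recorder.handleRateLimitDetected("claude", resetTime);
  recorder.handleRateLimitDetected(resolveExecutable(worker), resetTime);
}

// Usage: a codex worker with no fallback chain is recorded under 'codex'.
const recorder = new RateLimitRecorder();
recordSilentRateLimit(recorder, { provider: "codex" }, [], new Date(Date.now() + 60_000));
```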

Out of scope

  • The pause behaviour itself: isDispatchPaused() still pauses on any tracked limit when no fallback chain is configured. Per-provider independent pausing is what fallbackChain is for; this PR only fixes the attribution of the recorded limit.
  • The assistant-rate-limit-message path: it already routes the actual session.executablePath from the spawner via the rate_limited event and was not affected by this hardcode.
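To show why the fix changes attribution but not pausing, here is a small sketch of the pause check. The no-chain rule (any tracked limit pauses dispatch) is stated above. The chain rule used here, pausing only when every chain entry is limited, is an assumption made for illustration and may differ from the real isDispatchPaused(), which presumably also handles limit expiry. The standalone function signature is likewise hypothetical.

```typescript
// Hypothetical standalone version of the pause check; the real method lives on the daemon.
function isDispatchPaused(
  limitedExecutables: ReadonlySet<string>,
  fallbackChain: readonly string[],
): boolean {
  if (fallbackChain.length === 0) {
    // Stated in the PR: with no chain, any tracked limit pauses dispatch.
    return limitedExecutables.size > 0;
  }
  // Assumption: with a chain, dispatch pauses only once every entry is limited.
  return fallbackChain.every((exe) => limitedExecutables.has(exe));
}

// With no chain, recording the limit under 'claude' or 'codex' pauses dispatch either way,
// so only the name reported to operators changes.
const pausedWithOldAttribution = isDispatchPaused(new Set(["claude"]), []);
const pausedWithNewAttribution = isDispatchPaused(new Set(["codex"]), []);
```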

Tests

Adds 4 regression tests in packages/smithy/src/services/dispatch-daemon.bun.test.ts:

  • silent rapid-exit on a codex worker (provider: 'codex') records 'codex', not 'claude'
  • silent rapid-exit on a default-provider worker still records 'claude' (backward compat)
  • pattern-detected rate limit on a codex worker records 'codex'
  • pattern-detected rate limit honors worker.executablePath override

Test plan

  • bun test src/services/dispatch-daemon.bun.test.ts — 144 pass, 2 skip, 0 fail (140 → 144 with the 4 new tests)
  • bun test src (full smithy suite) — 1572 pass, 19 skip, 0 fail
  • tsc --noEmit clean

🤖 Generated with Claude Code

…t hardcoded 'claude'

When the dispatch daemon detects a rate limit from a session that exited
without producing any output ("silent rate limit" / rapid-exit branch),
or from the orphan-recovery loop noticing a rate-limit pattern in a
worker's session history, and no `fallbackChain` is configured, it
previously called `handleRateLimitDetected('claude', resetTime)` —
attributing the limit to `'claude'` regardless of the worker's actual
provider.

For codex (or any non-claude) workers this surfaced in the dashboard
banner as "Dispatch paused — claude hit its rate limit." and caused
`getRateLimitStatus()` to report `'claude'` as the limited executable
even when no claude session ever ran. Dispatch was still correctly
paused (any tracked limit pauses dispatch in no-fallback-chain mode),
but the attribution was misleading and masked the real failing
provider from operators.

This change adds a private `resolveDefaultExecutableForAgent(agent)`
helper that mirrors `resolveExecutableWithFallback`'s resolution
priority — agent's `executablePath` override → workspace
`defaultExecutablePaths[provider]` → bare provider name — without doing
any rate-limit tracker lookups. Both rapid-exit sites now attribute
the rate limit to the failing worker's executable.

`'claude-code'` (the canonical provider name) maps to `'claude'` (the
binary name) so existing tracker entries and banner display for default
claude workers are unchanged.

Adds 4 regression tests in `dispatch-daemon.bun.test.ts`:
- silent rapid-exit on a codex worker records 'codex', not 'claude'
- silent rapid-exit on a default-provider worker still records 'claude'
- pattern-detected rate limit on a codex worker records 'codex'
- pattern-detected rate limit honors `worker.executablePath` override

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>