fix(smithy): silent rate-limit attributes limit to agent provider, not hardcoded 'claude' #110
Open
fnnzzz wants to merge 1 commit into
When the dispatch daemon detects a rate limit from a session that exited
without producing any output ("silent rate limit" / rapid-exit branch),
or from the orphan-recovery loop noticing a rate-limit pattern in a
worker's session history, and no `fallbackChain` is configured, it
previously called `handleRateLimitDetected('claude', resetTime)` —
attributing the limit to `'claude'` regardless of the worker's actual
provider.
For codex (or any non-claude) workers this surfaced in the dashboard
banner as "Dispatch paused — claude hit its rate limit." and caused
`getRateLimitStatus()` to report `'claude'` as the limited executable
even when no claude session ever ran. Dispatch was still correctly
paused (any tracked limit pauses dispatch in no-fallback-chain mode),
but the attribution was misleading and masked the real failing
provider from operators.
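The effect described above can be reproduced with a toy model. This is only a sketch: `ToyRateLimitTracker` is hypothetical, and the method signatures and the return shape of `getRateLimitStatus()` are assumptions made for illustration, not the smithy implementation. It shows why dispatch still paused correctly while the reported executable was wrong.

```typescript
// Toy model (hypothetical): tracks one reset time per executable name.
class ToyRateLimitTracker {
  private limits = new Map<string, Date>();

  // Records a limit under whatever name the caller passes in.
  handleRateLimitDetected(executable: string, resetTime: Date): void {
    this.limits.set(executable, resetTime);
  }

  // No-fallback-chain mode: any tracked, unexpired limit pauses dispatch.
  isDispatchPaused(now: Date = new Date()): boolean {
    let paused = false;
    this.limits.forEach((reset) => {
      if (reset.getTime() > now.getTime()) paused = true;
    });
    return paused;
  }

  // Assumed shape: the names of executables that are currently limited.
  getRateLimitStatus(now: Date = new Date()): string[] {
    const limited: string[] = [];
    this.limits.forEach((reset, executable) => {
      if (reset.getTime() > now.getTime()) limited.push(executable);
    });
    return limited;
  }
}

// Old behavior: a codex worker exits silently, but 'claude' is recorded.
const tracker = new ToyRateLimitTracker();
tracker.handleRateLimitDetected('claude', new Date(Date.now() + 60000));

console.log(tracker.isDispatchPaused());   // true: the pause itself is correct
console.log(tracker.getRateLimitStatus()); // [ 'claude' ]: the attribution is not
```

Because the pause check only asks whether any limit is tracked, the wrong name never affected scheduling; it only affected what operators saw.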
This change adds a private `resolveDefaultExecutableForAgent(agent)`
helper that mirrors `resolveExecutableWithFallback`'s resolution
priority — agent's `executablePath` override → workspace
`defaultExecutablePaths[provider]` → bare provider name — without doing
any rate-limit tracker lookups. Both rapid-exit sites now attribute
the rate limit to the failing worker's executable.
`'claude-code'` (the canonical provider name) maps to `'claude'` (the
binary name) so existing tracker entries and banner display for default
claude workers are unchanged.
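A minimal sketch of that resolution order, assuming simplified types. `AgentLike`, `WorkspaceConfigLike`, the free-function form, and the `'claude-code'` default for a missing provider are all hypothetical; only the field names `executablePath`, `provider`, and `defaultExecutablePaths` and the `'claude-code'` to `'claude'` mapping come from the description above. The real helper is a private method and may differ.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface AgentLike {
  provider?: string;       // e.g. 'codex'; assumed to default to 'claude-code'
  executablePath?: string; // per-agent override
}

interface WorkspaceConfigLike {
  defaultExecutablePaths?: Record<string, string>;
}

// Canonical provider name -> binary name. Only this one entry is documented.
const PROVIDER_BINARY = new Map<string, string>([['claude-code', 'claude']]);

function resolveDefaultExecutableForAgent(
  agent: AgentLike,
  config: WorkspaceConfigLike = {},
): string {
  // 1. The agent's own executablePath override wins.
  if (agent.executablePath) return agent.executablePath;

  const provider = agent.provider ?? 'claude-code';

  // 2. Workspace-level default for this provider.
  const workspaceDefault = config.defaultExecutablePaths?.[provider];
  if (workspaceDefault) return workspaceDefault;

  // 3. Bare provider name, with 'claude-code' mapped to the 'claude' binary.
  return PROVIDER_BINARY.get(provider) ?? provider;
}

console.log(resolveDefaultExecutableForAgent({ provider: 'codex' })); // codex
console.log(resolveDefaultExecutableForAgent({}));                    // claude
console.log(
  resolveDefaultExecutableForAgent({ provider: 'codex', executablePath: '/opt/bin/codex' }),
); // /opt/bin/codex
```

Note that the sketch never consults rate-limit state, which matches the stated intent of resolving a name for attribution only.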
Adds 4 regression tests in `dispatch-daemon.bun.test.ts`:
- silent rapid-exit on a codex worker records 'codex', not 'claude'
- silent rapid-exit on a default-provider worker still records 'claude'
- pattern-detected rate limit on a codex worker records 'codex'
- pattern-detected rate limit honors `worker.executablePath` override
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

When the dispatch daemon detects a rate limit from a session that exited without producing any output (the "silent rate limit" / rapid-exit branch), or from the orphan-recovery loop noticing a rate-limit pattern in a worker's session history, and no `fallbackChain` is configured, it previously called `handleRateLimitDetected('claude', resetTime)`, attributing the limit to `'claude'` regardless of the worker's actual provider.

For codex (or any non-claude) workers this surfaces in the dashboard rate-limit banner as `Dispatch paused — claude hit its rate limit.` and causes `getRateLimitStatus()` to report `'claude'` as the limited executable even when no claude session has ever run. Dispatch is still correctly paused (any tracked limit pauses dispatch in no-fallback-chain mode), but the attribution is misleading and masks the real failing provider from operators.

This PR adds a private `resolveDefaultExecutableForAgent(agent)` helper that mirrors `resolveExecutableWithFallback`'s resolution priority (agent's `executablePath` override → workspace `defaultExecutablePaths[provider]` → bare provider name) without doing any rate-limit tracker lookups. Both rapid-exit sites now attribute the rate limit to the failing worker's executable, so the banner and `getRateLimitStatus()` reflect what actually hit the limit.

`'claude-code'` (the canonical provider name) maps to `'claude'` (the binary name) so existing tracker entries and banner display for default claude workers are unchanged.

## Sites changed

- `packages/smithy/src/services/dispatch-daemon.ts`: `attachRapidExitDetector` (silent rate limit on rapid exit) and `spawnRecoveryStewardForTask` (orphan-recovery rate-limit pattern detection); both replace `handleRateLimitDetected('claude', …)` with `handleRateLimitDetected(this.resolveDefaultExecutableForAgent(worker), …)` in the `else` branch where `fallbackChain` is empty. The configured-`fallbackChain` branch is unchanged.

## Out of scope

- `isDispatchPaused()` still pauses on any tracked limit when no fallback chain is configured. Per-provider independent pausing is what `fallbackChain` is for; this PR only fixes the attribution of the recorded limit.
- … `session.executablePath` from the spawner via the `rate_limited` event and was not affected by this hardcode.

## Tests

Adds 4 regression tests in `packages/smithy/src/services/dispatch-daemon.bun.test.ts`:

- silent rapid-exit on a codex worker (`provider: 'codex'`) records `'codex'`, not `'claude'`
- silent rapid-exit on a default-provider worker still records `'claude'` (backward compat)
- pattern-detected rate limit on a codex worker records `'codex'`
- pattern-detected rate limit honors the `worker.executablePath` override

## Test plan

- `bun test src/services/dispatch-daemon.bun.test.ts`: 144 pass, 2 skip, 0 fail (140 → 144 with the 4 new tests)
- `bun test src` (full smithy suite): 1572 pass, 19 skip, 0 fail
- `tsc --noEmit` clean

🤖 Generated with Claude Code