feat(runner): one-shot retry on opencode null-with-no-errorKind#87
Merged
Conversation
Victor caught this on the PR #83 audit: qwen3.6-plus reviewer ran on opencode-go, exited cleanly with empty output (no errorKind, no message), and went straight to fallback chain advance — no retry attempt. The PR #79 retry classifier returned false because isRetryableErrorKind(undefined) was hard-false. That's the right default for codex/claude/gemini (a null with no kind usually means the model genuinely produced nothing — retry would produce nothing again), but opencode-go's gateway has known transport flakes where a second attempt succeeds with the same prompt. Fix: extend isRetryableErrorKind to accept an optional `lineage` hint. When `kind` is undefined AND `lineage === 'opencode'`, treat as retryable. Other lineages keep the conservative default. The lineage hint does NOT override an explicit non-retryable kind — auth/quota/ db-corrupt are still terminal regardless of lineage. Both reviewer-driver and doer-driver call sites now pass `entry.lineage` so the chain step picks up the new behaviour. Retry visibility: no UI work needed — the existing `transient_retry` cli_warning already renders as an amber chip on the participant card via participant-card.tsx's `participant.warnings` block. The chip appears the moment retry fires; the message reads "Transient X failure on Y/Z — retrying once before advancing fallback." Tests: 974 -> 975 passing (+5 new cases on the lineage hint behavior).
This was referenced May 24, 2026
Merged
chorus-codes
added a commit
that referenced
this pull request
May 24, 2026
Codex's bundled MCP server boots racily — slow first start, handshake timeout under load, server crash on first start. The error LOOKS auth-shaped (was originally classified terminal under "auth issue, same remediation as token_refresh_lost") but real codex auth failures always surface as token_refresh_lost. mcp_handshake_failed is almost always a transient boot race. Caught when codex hit this on the PR #87 audit chat — the run shows "MCP_HANDSHAKE_FAILED — failed: handshaking with MCP server failed" and went straight to claude-sonnet-4-6 fallback with no retry. Victor confirmed codex was working fine in the previous chat just minutes earlier — definitionally transient. Fix: move mcp_handshake_failed from terminal to transient in isRetryableErrorKind. Worst case: wastes one extra codex call on a genuine misconfig (fails twice and advances to fallback as before). Best case: catches the boot race and saves the fallback swap. Note: kindToStatus still maps mcp_handshake_failed -> auth_invalid for cli-health tracking. That's a separate concern (it triggers the 10-min precheck cooldown + "Auth broken" badge); flagging as follow-up but PR #81's auto-heal on cred-file mtime change covers the common recovery path. Worth revisiting in a dedicated PR if the badge proves sticky in practice. Tests: 978 passing. One terminal test removed, one transient test added. Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
This was referenced May 24, 2026
Merged
chorus-codes
added a commit
that referenced
this pull request
May 25, 2026
…pass + moonshot gap) (#91) * refactor(runner): shim-owned retry policy (closes PR #87 no_output bypass) PR #87 added a lineage-dispatch escape valve in isRetryableErrorKind for opencode null-with-no-kind. The chorus self-audit (6 request-changes / 2 approves) found two real problems: 1. THE BUG (antigravity + gemini, HIGH): The COMMON opencode-go failure is empty-stdout exit with eventCount=0. The runner's finally block synthesises kind='no_output' for that case — and no_output is universally terminal — so the new opencode- null retry was UNREACHABLE for the bug it shipped to fix. It only helped the rare case where opencode emits some events but no usable content. 2. SLIPPERY SLOPE (every reviewer): Central classifier with a lineage exception is a one-off escape valve that grows into a table. Moonshot (also opencode-go) was already an unprotected gap (codex flagged via isOpenCodeFamily idiom). Fix: move the retry-on-null-kind and retry-on-no-output policies out of the central classifier and into the shim layer. Each AgentShim declares its own `retryPolicy: { extraKinds?, onNullKind?, onNoOutput? }` block. - opencode shim: opts into both flags (closes the no_output bypass) - kimi shim (moonshot): same opts-in (closes the moonshot gap) - codex/claude/gemini/grok/antigravity/openrouter shims: unchanged — fall through to the universal transient list as before isRetryableErrorKind signature changes from (kind, lineage?) to (kind, shim?). Universal terminal/transient layer unchanged; shim policies layer on top. Universal terminal kinds (quota_exhausted, token_refresh_lost, opencode_db_corrupt) still beat shim opt-ins — explicit terminal kinds cannot be overridden by a shim. Tests: 978 -> 982 passing. Test cases now use tiny shim stubs instead of lineage strings, matching the production call shape. Out of scope (separate follow-ups flagged in audit): - Telemetry: record retry-succeeded rows so the policy's effectiveness is measurable. Universal across reviewers. - Verdict-parse-failure false positive (codex finding #5): opencode-cli-7 traced this and found non-empty content always returns a real verdict, so apology-shaped responses don't reach the retry path. Closed as non-issue. * fix(reviewer): classify ambiguous-verdict so opencode retry skips apology shapes PR #91 audit catch (multiple reviewers, MEDIUM-HIGH severity): when a reviewer writes non-empty prose with no approve/reject keyword, verdictFromReviewerText returns null. Pre-fix, lastError.kind stayed undefined → the new opencode shim's onNullKind=true policy fired a retry that wasted tokens on the same prose shape. This isn't the transport flake retry was designed for — it's a successful model run that just didn't match the verdict heuristic (apology, refusal, off-topic continuation). Classify it as `verdict_ambiguous` (not in any retry list) so the driver advances the chain cleanly. Guarded on `!lastError.kind` so a real classified failure earlier in the run (e.g. stream_failure with the finally block writing a REVIEWER FAILED summary to answer.md) survives the verdict check. Closes my own PR description claim (PR #91) that opencode-cli-7 had closed this — that earlier trace was incomplete. The audit correctly re-traced it and found the path IS reachable. Tests: 983 -> 984 (+1 contract test for verdict_ambiguous + 1 classifier test confirming it's terminal even on opencode shim). --------- Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Re-opened from #85 after rebase onto post-#83 main. The conflict was a clean integration: PR #83 added the tryClaim wrapper + retry loop; this PR adds the `entry.lineage` second arg to `isRetryableErrorKind` inside that loop.
Why
Victor caught on the PR #83 audit screenshot: qwen3.6-plus on opencode-go exited with empty answer (no `errorKind` classified) and went straight to claude-sonnet-4-6 fallback — the PR #79 retry never fired because `isRetryableErrorKind(undefined)` is hard-false.
Fix
`isRetryableErrorKind(kind, lineage?)` — when `kind === undefined && lineage === 'opencode'`, treat as retryable. Other lineages unchanged.
Why opencode specifically:
Retry visibility
The existing `transient_retry` cli_warning already renders as an amber chip on the participant card via `participant.warnings` rendering. With this PR, opencode failures that previously went silent-to-fallback now fire the warning, so users see "TRANSIENT_RETRY — Transient failure on opencode/opencode-go/qwen3.6-plus — retrying once before advancing fallback." before any swap banner appears.
Tests
977 → 978 passing (+5 cases for the new lineage hint behavior; +1 case for the contract that an explicit terminal kind still wins on opencode).
Test plan