feat(runner): one-shot retry on opencode null-with-no-errorKind by chorus-codes · Pull Request #87 · chorus-codes/chorus

chorus-codes · 2026-05-24T02:04:31Z

Re-opened from #85 after rebase onto post-#83 main. The conflict was a clean integration: PR #83 added the tryClaim wrapper + retry loop; this PR adds the `entry.lineage` second arg to `isRetryableErrorKind` inside that loop.

Why

Victor caught on the PR #83 audit screenshot: qwen3.6-plus on opencode-go exited with empty answer (no `errorKind` classified) and went straight to claude-sonnet-4-6 fallback — the PR #79 retry never fired because `isRetryableErrorKind(undefined)` is hard-false.

Fix

`isRetryableErrorKind(kind, lineage?)` — when `kind === undefined && lineage === 'opencode'`, treat as retryable. Other lineages unchanged.

Why opencode specifically:

opencode-go's gateway has known transport flakes where the subprocess exits 0 with empty stdout but a second attempt succeeds with the same prompt
codex/claude/gemini null-with-no-kind almost always means the model genuinely produced nothing — retry would produce the same nothing
An explicit non-retryable kind (auth/quota/db-corrupt/no_output) still wins; the lineage hint never overrides

Retry visibility

The existing `transient_retry` cli_warning already renders as an amber chip on the participant card via `participant.warnings` rendering. With this PR, opencode failures that previously went silent-to-fallback now fire the warning, so users see "TRANSIENT_RETRY — Transient failure on opencode/opencode-go/qwen3.6-plus — retrying once before advancing fallback." before any swap banner appears.

Tests

977 → 978 passing (+5 cases for the new lineage hint behavior; +1 case for the contract that an explicit terminal kind still wins on opencode).

Test plan

`pnpm typecheck` clean
`pnpm test` 978/978 passing
Chorus self-audit (re-uses PR feat(runner): one-shot retry on opencode null-with-no-errorKind + retry chip visibility #85's audit since the diff is unchanged content-wise)

Victor caught this on the PR #83 audit: qwen3.6-plus reviewer ran on opencode-go, exited cleanly with empty output (no errorKind, no message), and went straight to fallback chain advance — no retry attempt. The PR #79 retry classifier returned false because isRetryableErrorKind(undefined) was hard-false. That's the right default for codex/claude/gemini (a null with no kind usually means the model genuinely produced nothing — retry would produce nothing again), but opencode-go's gateway has known transport flakes where a second attempt succeeds with the same prompt. Fix: extend isRetryableErrorKind to accept an optional `lineage` hint. When `kind` is undefined AND `lineage === 'opencode'`, treat as retryable. Other lineages keep the conservative default. The lineage hint does NOT override an explicit non-retryable kind — auth/quota/ db-corrupt are still terminal regardless of lineage. Both reviewer-driver and doer-driver call sites now pass `entry.lineage` so the chain step picks up the new behaviour. Retry visibility: no UI work needed — the existing `transient_retry` cli_warning already renders as an amber chip on the participant card via participant-card.tsx's `participant.warnings` block. The chip appears the moment retry fires; the message reads "Transient X failure on Y/Z — retrying once before advancing fallback." Tests: 974 -> 975 passing (+5 new cases on the lineage hint behavior).

Codex's bundled MCP server boots racily — slow first start, handshake timeout under load, server crash on first start. The error LOOKS auth-shaped (was originally classified terminal under "auth issue, same remediation as token_refresh_lost") but real codex auth failures always surface as token_refresh_lost. mcp_handshake_failed is almost always a transient boot race. Caught when codex hit this on the PR #87 audit chat — the run shows "MCP_HANDSHAKE_FAILED — failed: handshaking with MCP server failed" and went straight to claude-sonnet-4-6 fallback with no retry. Victor confirmed codex was working fine in the previous chat just minutes earlier — definitionally transient. Fix: move mcp_handshake_failed from terminal to transient in isRetryableErrorKind. Worst case: wastes one extra codex call on a genuine misconfig (fails twice and advances to fallback as before). Best case: catches the boot race and saves the fallback swap. Note: kindToStatus still maps mcp_handshake_failed -> auth_invalid for cli-health tracking. That's a separate concern (it triggers the 10-min precheck cooldown + "Auth broken" badge); flagging as follow-up but PR #81's auto-heal on cred-file mtime change covers the common recovery path. Worth revisiting in a dedicated PR if the badge proves sticky in practice. Tests: 978 passing. One terminal test removed, one transient test added. Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>

…pass + moonshot gap) (#91) * refactor(runner): shim-owned retry policy (closes PR #87 no_output bypass) PR #87 added a lineage-dispatch escape valve in isRetryableErrorKind for opencode null-with-no-kind. The chorus self-audit (6 request-changes / 2 approves) found two real problems: 1. THE BUG (antigravity + gemini, HIGH): The COMMON opencode-go failure is empty-stdout exit with eventCount=0. The runner's finally block synthesises kind='no_output' for that case — and no_output is universally terminal — so the new opencode- null retry was UNREACHABLE for the bug it shipped to fix. It only helped the rare case where opencode emits some events but no usable content. 2. SLIPPERY SLOPE (every reviewer): Central classifier with a lineage exception is a one-off escape valve that grows into a table. Moonshot (also opencode-go) was already an unprotected gap (codex flagged via isOpenCodeFamily idiom). Fix: move the retry-on-null-kind and retry-on-no-output policies out of the central classifier and into the shim layer. Each AgentShim declares its own `retryPolicy: { extraKinds?, onNullKind?, onNoOutput? }` block. - opencode shim: opts into both flags (closes the no_output bypass) - kimi shim (moonshot): same opts-in (closes the moonshot gap) - codex/claude/gemini/grok/antigravity/openrouter shims: unchanged — fall through to the universal transient list as before isRetryableErrorKind signature changes from (kind, lineage?) to (kind, shim?). Universal terminal/transient layer unchanged; shim policies layer on top. Universal terminal kinds (quota_exhausted, token_refresh_lost, opencode_db_corrupt) still beat shim opt-ins — explicit terminal kinds cannot be overridden by a shim. Tests: 978 -> 982 passing. Test cases now use tiny shim stubs instead of lineage strings, matching the production call shape. Out of scope (separate follow-ups flagged in audit): - Telemetry: record retry-succeeded rows so the policy's effectiveness is measurable. Universal across reviewers. - Verdict-parse-failure false positive (codex finding #5): opencode-cli-7 traced this and found non-empty content always returns a real verdict, so apology-shaped responses don't reach the retry path. Closed as non-issue. * fix(reviewer): classify ambiguous-verdict so opencode retry skips apology shapes PR #91 audit catch (multiple reviewers, MEDIUM-HIGH severity): when a reviewer writes non-empty prose with no approve/reject keyword, verdictFromReviewerText returns null. Pre-fix, lastError.kind stayed undefined → the new opencode shim's onNullKind=true policy fired a retry that wasted tokens on the same prose shape. This isn't the transport flake retry was designed for — it's a successful model run that just didn't match the verdict heuristic (apology, refusal, off-topic continuation). Classify it as `verdict_ambiguous` (not in any retry list) so the driver advances the chain cleanly. Guarded on `!lastError.kind` so a real classified failure earlier in the run (e.g. stream_failure with the finally block writing a REVIEWER FAILED summary to answer.md) survives the verdict check. Closes my own PR description claim (PR #91) that opencode-cli-7 had closed this — that earlier trace was incomplete. The audit correctly re-traced it and found the path IS reachable. Tests: 983 -> 984 (+1 contract test for verdict_ambiguous + 1 classifier test confirming it's terminal even on opencode shim). --------- Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>

chorus-codes merged commit a7f05e2 into main May 24, 2026
2 checks passed

chorus-codes deleted the feat/retry-opencode-null-v2 branch May 24, 2026 02:06

This was referenced May 24, 2026

chore: bump to v0.8.56 #88

Merged

fix(runner): mcp_handshake_failed is transient, not auth #89

Merged

This was referenced May 24, 2026

chore: bump to v0.8.57 #90

Merged

refactor(runner): shim-owned retry policy (closes PR #87 no_output bypass + moonshot gap) #91

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(runner): one-shot retry on opencode null-with-no-errorKind#87

feat(runner): one-shot retry on opencode null-with-no-errorKind#87
chorus-codes merged 1 commit into
mainfrom
feat/retry-opencode-null-v2

chorus-codes commented May 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

chorus-codes commented May 24, 2026

Why

Fix

Retry visibility

Tests

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant