Skip to content

feat(runner): one-shot retry on opencode null-with-no-errorKind#87

Merged
chorus-codes merged 1 commit into
mainfrom
feat/retry-opencode-null-v2
May 24, 2026
Merged

feat(runner): one-shot retry on opencode null-with-no-errorKind#87
chorus-codes merged 1 commit into
mainfrom
feat/retry-opencode-null-v2

Conversation

@chorus-codes
Copy link
Copy Markdown
Owner

Re-opened from #85 after rebase onto post-#83 main. The conflict was a clean integration: PR #83 added the tryClaim wrapper + retry loop; this PR adds the `entry.lineage` second arg to `isRetryableErrorKind` inside that loop.

Why

Victor caught on the PR #83 audit screenshot: qwen3.6-plus on opencode-go exited with empty answer (no `errorKind` classified) and went straight to claude-sonnet-4-6 fallback — the PR #79 retry never fired because `isRetryableErrorKind(undefined)` is hard-false.

Fix

`isRetryableErrorKind(kind, lineage?)` — when `kind === undefined && lineage === 'opencode'`, treat as retryable. Other lineages unchanged.

Why opencode specifically:

  • opencode-go's gateway has known transport flakes where the subprocess exits 0 with empty stdout but a second attempt succeeds with the same prompt
  • codex/claude/gemini null-with-no-kind almost always means the model genuinely produced nothing — retry would produce the same nothing
  • An explicit non-retryable kind (auth/quota/db-corrupt/no_output) still wins; the lineage hint never overrides

Retry visibility

The existing `transient_retry` cli_warning already renders as an amber chip on the participant card via `participant.warnings` rendering. With this PR, opencode failures that previously went silent-to-fallback now fire the warning, so users see "TRANSIENT_RETRY — Transient failure on opencode/opencode-go/qwen3.6-plus — retrying once before advancing fallback." before any swap banner appears.

Tests

977 → 978 passing (+5 cases for the new lineage hint behavior; +1 case for the contract that an explicit terminal kind still wins on opencode).

Test plan

Victor caught this on the PR #83 audit: qwen3.6-plus reviewer ran on
opencode-go, exited cleanly with empty output (no errorKind, no
message), and went straight to fallback chain advance — no retry
attempt. The PR #79 retry classifier returned false because
isRetryableErrorKind(undefined) was hard-false. That's the right
default for codex/claude/gemini (a null with no kind usually means
the model genuinely produced nothing — retry would produce nothing
again), but opencode-go's gateway has known transport flakes where
a second attempt succeeds with the same prompt.

Fix: extend isRetryableErrorKind to accept an optional `lineage` hint.
When `kind` is undefined AND `lineage === 'opencode'`, treat as
retryable. Other lineages keep the conservative default. The lineage
hint does NOT override an explicit non-retryable kind — auth/quota/
db-corrupt are still terminal regardless of lineage.

Both reviewer-driver and doer-driver call sites now pass
`entry.lineage` so the chain step picks up the new behaviour.

Retry visibility: no UI work needed — the existing `transient_retry`
cli_warning already renders as an amber chip on the participant card
via participant-card.tsx's `participant.warnings` block. The chip
appears the moment retry fires; the message reads "Transient X
failure on Y/Z — retrying once before advancing fallback."

Tests: 974 -> 975 passing (+5 new cases on the lineage hint behavior).
@chorus-codes chorus-codes merged commit a7f05e2 into main May 24, 2026
2 checks passed
@chorus-codes chorus-codes deleted the feat/retry-opencode-null-v2 branch May 24, 2026 02:06
chorus-codes added a commit that referenced this pull request May 24, 2026
Codex's bundled MCP server boots racily — slow first start, handshake
timeout under load, server crash on first start. The error LOOKS
auth-shaped (was originally classified terminal under "auth issue,
same remediation as token_refresh_lost") but real codex auth failures
always surface as token_refresh_lost. mcp_handshake_failed is almost
always a transient boot race.

Caught when codex hit this on the PR #87 audit chat — the run shows
"MCP_HANDSHAKE_FAILED — failed: handshaking with MCP server failed"
and went straight to claude-sonnet-4-6 fallback with no retry. Victor
confirmed codex was working fine in the previous chat just minutes
earlier — definitionally transient.

Fix: move mcp_handshake_failed from terminal to transient in
isRetryableErrorKind. Worst case: wastes one extra codex call on a
genuine misconfig (fails twice and advances to fallback as before).
Best case: catches the boot race and saves the fallback swap.

Note: kindToStatus still maps mcp_handshake_failed -> auth_invalid for
cli-health tracking. That's a separate concern (it triggers the 10-min
precheck cooldown + "Auth broken" badge); flagging as follow-up but
PR #81's auto-heal on cred-file mtime change covers the common
recovery path. Worth revisiting in a dedicated PR if the badge proves
sticky in practice.

Tests: 978 passing. One terminal test removed, one transient test added.

Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
chorus-codes added a commit that referenced this pull request May 25, 2026
…pass + moonshot gap) (#91)

* refactor(runner): shim-owned retry policy (closes PR #87 no_output bypass)

PR #87 added a lineage-dispatch escape valve in isRetryableErrorKind for
opencode null-with-no-kind. The chorus self-audit (6 request-changes /
2 approves) found two real problems:

1. THE BUG (antigravity + gemini, HIGH):
   The COMMON opencode-go failure is empty-stdout exit with eventCount=0.
   The runner's finally block synthesises kind='no_output' for that
   case — and no_output is universally terminal — so the new opencode-
   null retry was UNREACHABLE for the bug it shipped to fix. It only
   helped the rare case where opencode emits some events but no usable
   content.

2. SLIPPERY SLOPE (every reviewer):
   Central classifier with a lineage exception is a one-off escape
   valve that grows into a table. Moonshot (also opencode-go) was
   already an unprotected gap (codex flagged via isOpenCodeFamily idiom).

Fix: move the retry-on-null-kind and retry-on-no-output policies out of
the central classifier and into the shim layer. Each AgentShim declares
its own `retryPolicy: { extraKinds?, onNullKind?, onNoOutput? }` block.

- opencode shim: opts into both flags (closes the no_output bypass)
- kimi shim (moonshot): same opts-in (closes the moonshot gap)
- codex/claude/gemini/grok/antigravity/openrouter shims: unchanged —
  fall through to the universal transient list as before

isRetryableErrorKind signature changes from (kind, lineage?) to
(kind, shim?). Universal terminal/transient layer unchanged; shim
policies layer on top. Universal terminal kinds (quota_exhausted,
token_refresh_lost, opencode_db_corrupt) still beat shim opt-ins —
explicit terminal kinds cannot be overridden by a shim.

Tests: 978 -> 982 passing. Test cases now use tiny shim stubs instead
of lineage strings, matching the production call shape.

Out of scope (separate follow-ups flagged in audit):
- Telemetry: record retry-succeeded rows so the policy's effectiveness
  is measurable. Universal across reviewers.
- Verdict-parse-failure false positive (codex finding #5): opencode-cli-7
  traced this and found non-empty content always returns a real verdict,
  so apology-shaped responses don't reach the retry path. Closed as
  non-issue.

* fix(reviewer): classify ambiguous-verdict so opencode retry skips apology shapes

PR #91 audit catch (multiple reviewers, MEDIUM-HIGH severity): when a
reviewer writes non-empty prose with no approve/reject keyword,
verdictFromReviewerText returns null. Pre-fix, lastError.kind stayed
undefined → the new opencode shim's onNullKind=true policy fired a
retry that wasted tokens on the same prose shape.

This isn't the transport flake retry was designed for — it's a
successful model run that just didn't match the verdict heuristic
(apology, refusal, off-topic continuation). Classify it as
`verdict_ambiguous` (not in any retry list) so the driver advances
the chain cleanly.

Guarded on `!lastError.kind` so a real classified failure earlier in
the run (e.g. stream_failure with the finally block writing a
REVIEWER FAILED summary to answer.md) survives the verdict check.

Closes my own PR description claim (PR #91) that opencode-cli-7
had closed this — that earlier trace was incomplete. The audit
correctly re-traced it and found the path IS reachable.

Tests: 983 -> 984 (+1 contract test for verdict_ambiguous + 1
classifier test confirming it's terminal even on opencode shim).

---------

Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant