Skip to content

fix(runner): mcp_handshake_failed is transient, not auth#89

Merged
chorus-codes merged 1 commit into
mainfrom
fix/mcp-handshake-retryable
May 24, 2026
Merged

fix(runner): mcp_handshake_failed is transient, not auth#89
chorus-codes merged 1 commit into
mainfrom
fix/mcp-handshake-retryable

Conversation

@chorus-codes
Copy link
Copy Markdown
Owner

Why

Codex hit `MCP_HANDSHAKE_FAILED` on the PR #87 audit chat, went straight to claude-sonnet-4-6 fallback with no retry attempt. Victor confirmed codex was working fine minutes earlier — definitionally transient.

Root cause

`mcp_handshake_failed` was classified terminal in `isRetryableErrorKind` under the assumption "auth issue, same remediation as token_refresh_lost." That assumption is wrong: real codex auth failures always surface as `token_refresh_lost` (the explicit refresh-token error). `mcp_handshake_failed` is almost always codex's bundled MCP server booting racily.

Fix

One-line: add `mcp_handshake_failed` to the transient case list. Worst case wastes one extra codex call on a genuine misconfig (fails twice, advances to fallback as before). Best case catches the boot race.

Out of scope (follow-up)

`kindToStatus` in `cli-health.ts` still maps `mcp_handshake_failed` → `auth_invalid`, which triggers the 10-min precheck cooldown and "Auth broken" badge. PR #81's mtime auto-heal covers the common recovery path (next `codex login` or any cred file touch clears it). Worth revisiting separately if the badge proves sticky.

Tests

978 passing. One terminal test removed (`mcp_handshake_failed is terminal`), one transient test added.

Test plan

  • `pnpm typecheck` clean
  • `pnpm test` 978/978 passing

Codex's bundled MCP server boots racily — slow first start, handshake
timeout under load, server crash on first start. The error LOOKS
auth-shaped (was originally classified terminal under "auth issue,
same remediation as token_refresh_lost") but real codex auth failures
always surface as token_refresh_lost. mcp_handshake_failed is almost
always a transient boot race.

Caught when codex hit this on the PR #87 audit chat — the run shows
"MCP_HANDSHAKE_FAILED — failed: handshaking with MCP server failed"
and went straight to claude-sonnet-4-6 fallback with no retry. Victor
confirmed codex was working fine in the previous chat just minutes
earlier — definitionally transient.

Fix: move mcp_handshake_failed from terminal to transient in
isRetryableErrorKind. Worst case: wastes one extra codex call on a
genuine misconfig (fails twice and advances to fallback as before).
Best case: catches the boot race and saves the fallback swap.

Note: kindToStatus still maps mcp_handshake_failed -> auth_invalid for
cli-health tracking. That's a separate concern (it triggers the 10-min
precheck cooldown + "Auth broken" badge); flagging as follow-up but
PR #81's auto-heal on cred-file mtime change covers the common
recovery path. Worth revisiting in a dedicated PR if the badge proves
sticky in practice.

Tests: 978 passing. One terminal test removed, one transient test added.
@chorus-codes chorus-codes merged commit 8c9f5b7 into main May 24, 2026
2 checks passed
@chorus-codes chorus-codes deleted the fix/mcp-handshake-retryable branch May 24, 2026 02:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant