v0.8.58: Codex/Responses client reliability — retry/backoff parity, land function_call_output + xhigh effort fixes

## Why
The OpenAI Codex provider (`openai-codex`) is the only lane that bypasses CodeWhale's retry stack: a 429 or transient 5xx from the ChatGPT backend kills the turn immediately, while every Chat Completions provider transparently retries with backoff. This blocks #2984 (promote Codex out of preview) and makes Codex-parity benchmark runs flaky. Three correctness fixes for this path already exist as uncommitted work on branch `v0.8.58-constitution` and must be anchored: user-role `ToolResult` blocks were silently dropped (breaking all multi-turn tool calling on Codex), tool schemas went to the wire unsanitized, and effort `"maximum"` was not mapped to `xhigh`. Related: #2955 (Codex usage telemetry).

## Current state
- `crates/tui/src/client/responses.rs:80-113` — `handle_responses_stream` builds the request and calls raw `builder.json(&body).send()` (lines 103-107), then bails on any non-2xx:
  `anyhow::bail!("Responses API error (HTTP {status}): {raw}")` — no retry, no `Retry-After`, no `retry_status` banner, no `mark_request_failure` circuit-breaker accounting.
- `crates/tui/src/client/chat.rs:268` — the chat stream path goes through `self.send_with_retry(|| self.http_client.post(&url).json(&body))`.
- `crates/tui/src/client.rs:1019-1083` — `send_with_retry` wraps `with_retry` (`crates/tui/src/llm_client/mod.rs:1002`), honors `Retry-After` via `extract_retry_after` (`llm_client/mod.rs:1127`), sanitizes error bodies with `sanitize_http_error_body`, drives the `crate::retry_status` banner, and calls `wait_for_rate_limit` / `mark_request_success` / `mark_request_failure`.
- Dispatch: `crates/tui/src/client.rs:1141-1142` and `1151-1152` route `ApiProvider::OpenaiCodex` to `handle_responses_message` / `handle_responses_stream`.
- Endpoint: `CODEX_RESPONSES_PATH = "/codex/responses"` (`responses.rs:24`), appended to `self.base_url`. `RequestPayloadMode` (`crates/tui/src/config.rs:377-381`) has only a `ChatCompletions` variant today.
- In-flight fixes (uncommitted in the local working tree; if you are on a fresh clone where `codex_responses_reasoning_effort` does not exist in `responses.rs`, recreate them per the Plan):
  1. `convert_messages_to_responses_input` (`responses.rs:483`): a `ContentBlock::ToolResult` arm in the `"user"` role branch (`responses.rs:504-523`) that flushes pending `input_text` items and emits `{"type": "function_call_output", "call_id": ..., "output": ...}` via `parse_tool_use_id`. Previously user-role tool results hit `_ => {}` and were dropped, so the model never saw tool outputs.
  2. `tool_to_responses_function` (`responses.rs:597-607`): clones `tool.input_schema` and runs `schema_sanitize::sanitize_for_responses` (`crates/tui/src/tools/schema_sanitize.rs:85`) before sending `parameters`.
  3. `codex_responses_reasoning_effort` (`responses.rs:609-618`): extracted helper mapping `"xhigh" | "max" | "maximum"` → `xhigh`, `"off"/"none"/...` → omit reasoning, unknown → `medium`.
  4. Unit tests for all three in `responses.rs` `#[cfg(test)] mod tests` (`responses.rs:656-754`).

## Plan
1. Land the in-flight fixes (`crates/tui/src/client/responses.rs`): if the working tree already contains `codex_responses_reasoning_effort`, the `ToolResult` arm in the user-role branch, the `sanitize_for_responses` call, and the three tests listed above, commit them as-is. Otherwise implement them exactly as described in Current state items 1-4. Requires the `schema-sanitize-strict-mode` issue's `sanitize_for_responses` to exist.
2. Add retry/backoff to `handle_responses_stream` (`responses.rs:80-113`): replace the one-shot `builder...send()` with `self.send_with_retry(|| { ... })`, building the full request (POST `url`, `Content-Type`, `Accept: text/event-stream`, `OpenAI-Beta: responses=experimental`, `originator: codex_cli_rs`, optional `chatgpt-account-id` from `crate::oauth::codex_account_id()`, `.json(&body)`) inside the closure so each retry attempt gets a fresh `RequestBuilder`. `send_with_retry` is `pub(super)` in `crates/tui/src/client.rs:1019`, so it is callable from `responses.rs` (same module tree) with no visibility change.
3. Error mapping parity (`responses.rs`): keep a defensive non-success check after `send_with_retry` returns (mirroring `chat.rs:271-286`), but run the body through `sanitize_http_error_body(Some(self.api_provider.display_name()), status.as_u16(), &raw)` before bailing so terminal 4xx errors render like the chat path. Import what is needed from `super`.
4. Tests — must be in-src unit tests (the `#[cfg(test)]` module of `responses.rs` or `client.rs`): `crates/tui` is a bin-only crate (`crates/tui/Cargo.toml` has only a `[[bin]]` target; `mod client` is private in `crates/tui/src/main.rs:23`), so a `crates/tui/tests/` integration test CANNOT reach `DeepSeekClient` or `handle_responses_stream`. Use `wiremock` under `#[tokio::test]` (dev-dep, `crates/tui/Cargo.toml:78`; in-src wiremock pattern: `crates/tui/src/tools/web_search.rs:2365-2375`) to mount `POST /codex/responses` returning 429 with a `Retry-After: 0` header once, then a 200 SSE body; assert `handle_responses_stream` succeeds after the retry, and that a 400 surfaces without retry. Client setup: `DeepSeekClient::new` (`client.rs:582`) with a `Config` whose provider is `openai-codex` and whose `[providers.openai_codex] base_url` points at the mock (`deepseek_base_url` honors the provider override, `config.rs:2441-2460`; `http://127.0.0.1` passes `validate_base_url_security`, `client.rs:358`). Note: `deepseek_api_key()` for this provider calls `oauth::get_credentials()` (`config.rs:2610-2612`), which honors a `CODEX_ACCESS_TOKEN` env override (`crates/tui/src/oauth.rs:262`) — set a dummy token in the test using a mutex + save/restore env guard like the base-url tests do (`client.rs:3583-3614`, `ALLOW_INSECURE_HTTP_ENV_LOCK` pattern).
5. Stretch (optional, do NOT block landing): add `RequestPayloadMode::Responses` to `crates/tui/src/config.rs:377-381` and make the Responses path usable against standard `api.openai.com` `/v1/responses` instead of the hardwired `CODEX_RESPONSES_PATH`. If attempted, run `python3 scripts/check-provider-registry.py`. Skip entirely if it grows beyond ~100 lines.
6. Reference this issue and #2984 in the landing commit message.

## Acceptance criteria
- [ ] On `openai-codex`, a multi-turn tool-calling conversation serializes user-role `ToolResult` blocks as `function_call_output` input items (covered by `responses_input_includes_user_role_tool_results`).
- [ ] Tool parameters sent to `/codex/responses` are sanitized (`oneOf`/`anyOf`/`allOf`/`enum`/`not` stripped at the root) without mutating the original `Tool` (covered by `responses_function_tool_sanitizes_root_composition_schema`).
- [ ] `reasoning_effort: "maximum"` maps to `xhigh`; `"off"`/`"none"` omit the `reasoning` block (covered by `codex_reasoning_effort_uses_responses_labels`).
- [ ] A 429 (with `Retry-After`) or 5xx from `/codex/responses` is retried with backoff and succeeds when the backend recovers, instead of failing the turn on the first response.
- [ ] Retries surface in the `retry_status` banner and in `logging::warn` (`"HTTP retry reason=..."`) exactly like chat-path retries; client failure/success accounting (`mark_request_failure`/`mark_request_success`) fires for the Responses path.
- [ ] Non-retryable errors (400/401/403) fail fast with a sanitized error body that names the provider, matching chat-path formatting.
- [ ] All new/changed tests pass; no behavior change for non-Codex providers.

## Verification
```sh
cargo test -p codewhale-tui client::responses
cargo test -p codewhale-tui schema_sanitize
cargo test -p codewhale-tui --test tool_lifecycle_acceptance   # guard: existing mock-LLM flows still pass
cargo fmt --all -- --check
cargo clippy -p codewhale-tui --all-targets -- -D warnings
python3 scripts/check-provider-registry.py   # only if step 5 (stretch) touched config.rs provider data
```
Manual (requires Codex OAuth creds in `~/.codex/auth.json` via `codex login`; skip if headless without creds):
1. `cargo run -p codewhale-tui` → switch provider to `openai-codex`.
2. Ask for a task that forces two sequential tool calls (e.g. "read Cargo.toml, then list crates/"). Both tool results must round-trip — before fix 1, turn 2 stalls or the backend 400s on a `function_call` without matching output.
3. Retry path without creds: run the wiremock test from Plan step 4 (it sets a dummy `CODEX_ACCESS_TOKEN` itself; no real login needed) and confirm the log line `HTTP retry reason=rate_limited`.

## Out of scope
- #2984 itself (preview-flag removal / live round-trip sign-off) — this issue only makes the transport reliable enough for it.
- #2955 usage-telemetry alignment and #2961 cached/reasoning token normalization (`parse_responses_usage` stays as-is).
- WhaleFlow / v0.9.0 work (#2973-#2981).
- Streaming mid-stream reconnect/resume (retry covers request establishment only, same as the chat path).
- The stretch `RequestPayloadMode::Responses` is explicitly optional; do not let it block the retry work.

## Hints
- `send_with_retry` takes `FnMut() -> reqwest::RequestBuilder` and re-invokes the closure per attempt — build ALL headers inside the closure; do not move a consumed builder in.
- The bearer `Authorization` header is a default header on `self.http_client` (see comment at `responses.rs:87-91`) — do not add it again.
- After `send_with_retry` returns `Ok`, the status is always 2xx (`client.rs:1035-1036`); the post-call status check is belt-and-braces parity with `chat.rs:271`, not load-bearing.
- Retry classification lives in `crates/tui/src/llm_client/mod.rs` (`with_retry:1002`, `extract_retry_after:1127`, `RetryConfig` tests at `:1155`); nothing there should need changes.
- Existing tests in `responses.rs` show the `MessageRequest`/`Tool` construction shape — copy those rather than building fixtures from scratch.
- Env mutation in tests is process-global: take a `static Mutex` lock and restore prior values on drop, exactly as `client.rs:3583-3614` does for `ALLOW_INSECURE_HTTP_ENV` — otherwise the new test will flake against parallel tests.
- Composite tool-use ids are `"{call_id}|{item_id}"` (`parse_tool_use_id`, `responses.rs:622`); always emit `call_id` (left half) in `function_call_output`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.8.58: Codex/Responses client reliability — retry/backoff parity, land function_call_output + xhigh effort fixes #3019

Why

Current state

Plan

Acceptance criteria

Verification

Out of scope

Hints

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

v0.8.58: Codex/Responses client reliability — retry/backoff parity, land function_call_output + xhigh effort fixes #3019

Description

Why

Current state

Plan

Acceptance criteria

Verification

Out of scope

Hints

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions