Skip to content

fix(provider): stream-inactivity timeout + lower read_timeout (stall resilience)#163

Open
justrach wants to merge 1 commit into
release/0.2.17from
fix/provider-stall-resilience
Open

fix(provider): stream-inactivity timeout + lower read_timeout (stall resilience)#163
justrach wants to merge 1 commit into
release/0.2.17from
fix/provider-stall-resilience

Conversation

@justrach

@justrach justrach commented Jun 4, 2026

Copy link
Copy Markdown
Owner

Problem

The interactive session froze for 15+ minutes when a streaming provider stalled mid-response (observed on Codex / gpt-5.5 — a 22-min hang with zero output; even graff list agents hung 28 min).

Two compounding causes:

  1. read_timeout was 900s (15 min) — and it does not even fire on a content-stalled HTTP/2 stream, because http2_keep_alive PINGs (keep_alive_interval_secs: 60, keep_alive_while_idle: true) keep the socket alive, so reqwest never sees a read-idle period.
  2. No application-level inactivity timeout on the response stream, so a silent-but-connected stream hung indefinitely (and leaked a zombie process).

Fix

  • Stream inactivity timeout (forge_domain/src/result_stream_ext.rs): into_full_streaming now time-boxes each provider event with STREAM_IDLE_TIMEOUT = 120s. A stall surfaces Error::Retryable, which the existing retry_with_config honors -> it re-issues the request instead of hanging. This is transport-agnostic (every provider funnels through this one BoxStream consumer) and sits inside the retry boundary (orch.rs execute_chat_turn drives the full stream within the retried closure), so the retry actually fires.
  • Lower read_timeout 900s -> 180s (forge_infra/src/http.rs, forge_config/src/http.rs) as a secondary guard for genuinely dead connections.

Verification

  • cargo check -p forge_domain -p forge_infra clean.
  • cargo test -p forge_app retry:: — both pass, incl. should_retry_recognizes_anyhow_wrapped_retryable (confirms should_retry honors a Retryable even through .context() layers, so the timeout error is retried).

A real provider stall cannot be reproduced on demand, so this is verified by compile + retry-path tests + boundary analysis rather than an integration test.

🤖 Generated with Claude Code

…ll resilience

The interactive session could freeze for 15+ minutes when a streaming
provider (notably Codex / gpt-5.5) stalled mid-response. Two causes:

1. reqwest's read_timeout default was 900s (15 min), and it does not fire
   on a content-stalled HTTP/2 stream anyway -- http2 keepalive PINGs keep
   the socket "alive" so there is never a read-idle period.
2. No application-level inactivity timeout on the response stream, so a
   silent-but-connected stream hung indefinitely (leaking a zombie proc).

Fixes:
- Add STREAM_IDLE_TIMEOUT (120s): into_full_streaming now time-boxes each
  provider event; a stall returns Error::Retryable, which the existing
  retry_with_config honors and re-issues. Transport-agnostic (every
  provider funnels through this one stream consumer) and inside the retry
  boundary (orch.rs execute_chat_turn drives the stream in the retried
  closure), so the retry actually fires.
- Lower the read_timeout default 900s -> 180s in both config defaults as a
  secondary guard for genuinely dead connections.

Verified: forge_domain compiles; retry tests pass incl.
should_retry_recognizes_anyhow_wrapped_retryable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Action required: PR inactive for 5 days.
Status update or closure in 10 days.

@github-actions github-actions Bot added the state: inactive No current action needed/possible; issue fixed, out of scope, or superseded. label Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

state: inactive No current action needed/possible; issue fixed, out of scope, or superseded.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant