fix(provider): stream-inactivity timeout + lower read_timeout (stall resilience) #163
Open
justrach wants to merge 1 commit into
…ll resilience

The interactive session could freeze for 15+ minutes when a streaming provider (notably Codex / gpt-5.5) stalled mid-response. Two causes:

1. reqwest's read_timeout default was 900s (15 min), and it does not fire on a content-stalled HTTP/2 stream anyway -- http2 keepalive PINGs keep the socket "alive" so there is never a read-idle period.
2. No application-level inactivity timeout on the response stream, so a silent-but-connected stream hung indefinitely (leaking a zombie proc).

Fixes:

- Add STREAM_IDLE_TIMEOUT (120s): into_full_streaming now time-boxes each provider event; a stall returns Error::Retryable, which the existing retry_with_config honors and re-issues. Transport-agnostic (every provider funnels through this one stream consumer) and inside the retry boundary (orch.rs execute_chat_turn drives the stream in the retried closure), so the retry actually fires.
- Lower the read_timeout default 900s -> 180s in both config defaults as a secondary guard for genuinely dead connections.

Verified: forge_domain compiles; retry tests pass incl. should_retry_recognizes_anyhow_wrapped_retryable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Action required: PR inactive for 5 days.
Problem
The interactive session froze for 15+ minutes when a streaming provider stalled mid-response (observed on Codex / gpt-5.5: a 22-min hang with zero output; even `graff list agents` hung 28 min).

Two compounding causes:

1. `read_timeout` was 900s (15 min), and it does not even fire on a content-stalled HTTP/2 stream, because `http2_keep_alive` PINGs (`keep_alive_interval_secs: 60`, `keep_alive_while_idle: true`) keep the socket alive, so reqwest never sees a read-idle period.
2. There was no application-level inactivity timeout on the response stream, so a silent-but-connected stream hung indefinitely (leaking a zombie process).

Fix
- `forge_domain/src/result_stream_ext.rs`: `into_full_streaming` now time-boxes each provider event with `STREAM_IDLE_TIMEOUT = 120s`. A stall surfaces `Error::Retryable`, which the existing `retry_with_config` honors -> it re-issues the request instead of hanging. This is transport-agnostic (every provider funnels through this one `BoxStream` consumer) and sits inside the retry boundary (`orch.rs` `execute_chat_turn` drives the full stream within the retried closure), so the retry actually fires.
- `read_timeout` 900s -> 180s (`forge_infra/src/http.rs`, `forge_config/src/http.rs`) as a secondary guard for genuinely dead connections.

Verification
- `cargo check -p forge_domain -p forge_infra`: clean.
- `cargo test -p forge_app retry::`: both pass, incl. `should_retry_recognizes_anyhow_wrapped_retryable` (confirms `should_retry` honors a `Retryable` even through `.context()` layers, so the timeout error is retried).

A real provider stall cannot be reproduced on demand, so this is verified by compile + retry-path tests + boundary analysis rather than an integration test.
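
For reviewers who want to see the shape of the idea without reading the diff, here is a minimal std-only sketch of a per-event inactivity timeout placed inside a retry loop. It is not the code in this PR: the real change operates on an async `BoxStream` (where each `next()` would presumably be wrapped in something like `tokio::time::timeout`), while this analogue uses a blocking channel. `StreamError`, `collect_with_idle_timeout`, `retry_on_stall`, `healthy_stream`, and `stalled_stream` are all hypothetical names, and the timings are shortened for illustration.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical error type standing in for the PR's `Error::Retryable`.
#[derive(Debug, PartialEq)]
enum StreamError {
    Retryable(String),
}

/// Drains the channel until the producer hangs up. The timeout applies to the
/// gap before *each* event, not to the stream as a whole, so a long but
/// steadily progressing response is never cut off.
fn collect_with_idle_timeout(
    rx: &Receiver<String>,
    idle: Duration,
) -> Result<Vec<String>, StreamError> {
    let mut events = Vec::new();
    loop {
        match rx.recv_timeout(idle) {
            Ok(event) => events.push(event),
            // Producer dropped its sender: the stream ended normally.
            Err(RecvTimeoutError::Disconnected) => return Ok(events),
            // Still connected but silent for too long: report a stall.
            Err(RecvTimeoutError::Timeout) => {
                return Err(StreamError::Retryable(format!(
                    "no stream event for {:?}",
                    idle
                )))
            }
        }
    }
}

/// Simplified stand-in for a retry wrapper: the stream is consumed *inside*
/// the retried closure, so a stall on one attempt leads to a fresh request.
fn retry_on_stall<F>(
    max_attempts: usize,
    idle: Duration,
    mut open: F,
) -> Result<Vec<String>, StreamError>
where
    F: FnMut(usize) -> Receiver<String>,
{
    let mut last = StreamError::Retryable("no attempts made".to_string());
    for attempt in 0..max_attempts {
        match collect_with_idle_timeout(&open(attempt), idle) {
            Ok(events) => return Ok(events),
            Err(err) => last = err,
        }
    }
    Err(last)
}

/// Simulated provider that sends two chunks and then closes the stream.
fn healthy_stream() -> Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for chunk in ["hel", "lo"] {
            tx.send(chunk.to_string()).unwrap();
        }
    });
    rx
}

/// Simulated provider that sends one chunk, then stays connected but silent.
fn stalled_stream() -> Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        tx.send("hel".to_string()).unwrap();
        thread::sleep(Duration::from_secs(2));
        drop(tx);
    });
    rx
}
```

Calling `retry_on_stall(2, Duration::from_millis(200), |attempt| if attempt == 0 { stalled_stream() } else { healthy_stream() })` models the scenario described above: the first attempt stalls and is abandoned after the idle window, and the second attempt completes. If the stream were consumed outside the retried closure, the stall error would surface after the retry logic had already returned, which is why the boundary placement matters.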
🤖 Generated with Claude Code