Replace wall-clock limit with activity-based idle timeout#207
The wall-clock timer killed interactions after timeout_seconds * 5 of total elapsed time, regardless of whether Claude was actively working. This caused healthy long-running tasks (multi-step research, detailed planning) to be killed mid-work.
Replace with an idle timeout that resets every time the process emits a line of output. Same duration (10 min at defaults), different measurement: time since last output instead of time since start.
Increase bot lock acquire timeout from 660s to 3600s since interactions can now legitimately run much longer than 10 minutes.
Fixes #206
Review by Kai
PR Review: Replace wall-clock limit with activity-based idle timeout
The core logic is sound and the motivation is well-documented. A few findings:
Warning: Infinite tool loops now run uncapped for up to 1 hour
bot.py
The PR explicitly acknowledges this tradeoff, but it's worth surfacing in review: a process stuck in a tool loop emitting continuous output was previously killed after 10 minutes. It now runs until the 1-hour lock timeout fires. That's a 6× regression in worst-case blast radius for that failure mode. The PR's reasoning ("legitimate long tasks are more common") is defensible, but there's no intermediate backstop between "idle 10 min" and "lock 60 min" for the active-but-stuck case.
Warning:
Use mocked time.monotonic for both test_idle_timeout and test_active_process_survives_past_old_wall_clock. No test should ever wait for a real timeout to fire.
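The suggestion above can be illustrated with a minimal, self-contained sketch. The `idle_expired` helper and both test bodies are hypothetical stand-ins, not the PR's actual code; the snippet patches `time.monotonic` on the `time` module directly because it is standalone, whereas the PR describes patching the name in the `kai.claude` namespace.

```python
import time
from unittest import mock


def idle_expired(last_activity: float, idle_limit: float) -> bool:
    """Hypothetical helper: True once more than idle_limit seconds have passed."""
    return time.monotonic() - last_activity > idle_limit


def test_idle_check_fires_without_waiting() -> None:
    # The fake clock returns 1000.0 on the first call and 1601.0 on the
    # second, so 601 simulated seconds elapse with no real sleeping.
    with mock.patch("time.monotonic", side_effect=[1000.0, 1601.0]):
        last_activity = time.monotonic()
        assert idle_expired(last_activity, idle_limit=600.0)


def test_recent_activity_is_not_expired() -> None:
    # Only 5 simulated seconds pass, far below the 600 s idle limit.
    with mock.patch("time.monotonic", side_effect=[1000.0, 1005.0]):
        last_activity = time.monotonic()
        assert not idle_expired(last_activity, idle_limit=600.0)
```

Because `side_effect` is a fixed list, each test also pins down how many times the clock is read: an extra call would exhaust the list and fail loudly.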
Review by Kai
The prior review's findings are still open (fragile Suggestion:
Review by Kai
PR Review: Replace wall-clock limit with activity-based idle timeout
Clean
The The
Remaining open items (from prior reviews, no material change)
Both were raised previously and the author has made their decisions. No new issues found.
Review by Kai
PR Review: Replace wall-clock limit with activity-based idle timeout
Clean. No new issues found.
Verification of open items from prior reviews
Active-but-stuck blast radius: still 6× larger (10 min → 60 min). No material change; author's accepted tradeoff.
New code reviewed
Call sequence matches the documented pattern (1 init + 2 check/reset per active iteration = 5 before the jump).
Summary
The wall-clock timer in claude.py killed interactions after timeout_seconds * 5 (10 min at defaults) of total elapsed time, regardless of whether Claude was actively working. Two failures on Mar 27 - both healthy interactions doing multi-step research with continuous tool-use output:
This replaces the fixed wall-clock with an activity-based idle timeout that resets every time the process emits a line of output. Same duration, different measurement: "time since last output" instead of "time since start."
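The difference between "time since start" and "time since last output" can be sketched as follows. This is a simplified synchronous illustration, not the code in claude.py: `drain`, `IdleTimeout`, and the injectable `clock` parameter are hypothetical names, and the real loop presumably waits on each readline with its own timeout rather than iterating over a ready-made sequence.

```python
import time
from typing import Callable, Iterable, List


class IdleTimeout(Exception):
    """Raised when no output has arrived within the idle limit (hypothetical)."""


def drain(
    lines: Iterable[str],
    idle_limit: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Collect output lines, failing only after prolonged silence."""
    last_activity = clock()  # a wall-clock design would never update this
    collected: List[str] = []
    for line in lines:
        # Measure silence since the last output, not time since start.
        if clock() - last_activity > idle_limit:
            raise IdleTimeout(f"no output for more than {idle_limit} seconds")
        if line.strip():
            last_activity = clock()  # reset on every non-empty line
            collected.append(line)
    return collected
```

With a fake clock that advances 60 seconds per reading, twenty lines of output span well over 600 simulated seconds in total, yet `drain(lines, 600, clock=fake)` still completes because no single gap exceeds the limit.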
What changed
interaction_start becomes last_activity, reset on every non-empty readline. The idle check fires only after prolonged silence, not after prolonged work.
_LOCK_ACQUIRE_TIMEOUT increased from 660s to 3600s. With no wall-clock cap, the lock timeout needs to accommodate legitimately long interactions. The idle timer in claude.py is the real safety net.
Timeout behavior after this change
The "stuck tool loop" row changes from "caught" to "not caught." This is an acceptable tradeoff: true infinite tool loops are extremely rare, the user can always /stop, and killing legitimate long work is a worse failure mode.
Implementation note
With the current multipliers (readline=3x, idle=5x), the idle timeout can never fire independently in practice - the readline timeout always fires first for a truly silent process (3x < 5x). The idle timer is a "belt and suspenders" backup that would only matter if the multipliers were changed or if there were an edge case where readline keeps barely succeeding. The test_idle_timeout test uses a mocked time.monotonic in the kai.claude namespace to force the idle check to fire, which is the correct approach for verifying an error path that is dominated by another timeout in real execution.
Test plan
test_idle_timeout - mocked time forces idle check to fire, verifies error path and "no output" message
test_idle_timeout_normal_completion_unaffected - fast interaction completes without hitting idle timer
test_active_process_survives_past_old_wall_clock - core behavioral test: active process emits output for ~8s (past old 5s limit), completes successfully
test_readline_timeout (existing, unchanged) - fully dead process caught by per-readline timeout
test_timeout_kills_claude (bot.py, comment-only update) - lock timeout still works
Fixes #206
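The ordering of the three timeouts discussed in this PR can be checked with a few lines of arithmetic. This is only a sketch: the constant and function names are hypothetical, and the default `timeout_seconds` of 120 is an assumption inferred from "timeout_seconds * 5 (10 min at defaults)", not a value read from the code.

```python
READLINE_MULTIPLIER = 3       # per-readline timeout, from the implementation note
IDLE_MULTIPLIER = 5           # idle timeout, from the implementation note
LOCK_ACQUIRE_TIMEOUT = 3600   # seconds, raised from 660 in this PR


def effective_timeouts(timeout_seconds: float) -> dict:
    """Return each timeout in seconds for a given base timeout."""
    return {
        "readline": timeout_seconds * READLINE_MULTIPLIER,
        "idle": timeout_seconds * IDLE_MULTIPLIER,
        "lock": LOCK_ACQUIRE_TIMEOUT,
    }


def first_to_fire_when_silent(timeout_seconds: float) -> str:
    """Name the timeout with the shortest duration for a fully silent process."""
    timeouts = effective_timeouts(timeout_seconds)
    return min(timeouts, key=timeouts.get)


# Assumed default: 120 s, since 120 * 5 = 600 s = 10 min.
print(effective_timeouts(120))
print(first_to_fire_when_silent(120))
```

Under that assumption the readline timeout (360 s) is shorter than the idle timeout (600 s), which is why the idle path needs a mocked clock to be exercised in tests.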