feat: detect and kill stuck agents after 20 minutes of inactivity #173
What this does
When running in parallel mode, agents sometimes hang indefinitely — usually waiting for an API response that never comes back, or stuck in some internal loop. This leaves features permanently marked as "in progress" and wastes a concurrency slot.
This PR adds a simple inactivity timeout: if an agent produces no output for 20 minutes, it gets killed and the feature is released back to the queue.
How it works
- last_activity timestamp per agent, updated every time stdout produces output
- _check_stuck_agents() method runs each iteration of the main orchestrator loop
Why 20 minutes?
Working agents produce continuous output — tool calls, code generation, thinking blocks. Even complex features that take 1-2 hours always have activity. 20 minutes of complete silence reliably indicates something is wrong.
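The mechanism described under "How it works" can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the class name `StuckAgentTracker`, the constant `AGENT_INACTIVITY_TIMEOUT_S`, the `on_spawn`/`on_output`/`on_exit` hooks, and the `kill`/`release` callbacks are all hypothetical names chosen for the example. Only the `last_activity` timestamp, the per-iteration check, and the 20-minute threshold come from the description.

```python
import time
from typing import Callable, Dict, List

# Hypothetical constant name; the PR only states the 20-minute threshold.
AGENT_INACTIVITY_TIMEOUT_S = 20 * 60


class StuckAgentTracker:
    """Illustrative tracker of the last time each agent produced output."""

    def __init__(
        self,
        timeout_s: float = AGENT_INACTIVITY_TIMEOUT_S,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the check can be tested without waiting
        self.last_activity: Dict[str, float] = {}

    def on_spawn(self, feature_id: str) -> None:
        # Start the timer at spawn so an agent that never prints is still caught.
        self.last_activity[feature_id] = self.clock()

    def on_output(self, feature_id: str) -> None:
        # Assumed to be called for every line read from the agent's stdout.
        if feature_id in self.last_activity:
            self.last_activity[feature_id] = self.clock()

    def on_exit(self, feature_id: str) -> None:
        # Cleanup when an agent finishes or is killed.
        self.last_activity.pop(feature_id, None)

    def check_stuck_agents(
        self,
        kill: Callable[[str], None],
        release: Callable[[str], None],
    ) -> List[str]:
        """Kill agents silent for longer than the timeout and release their features."""
        now = self.clock()
        stuck = [
            fid for fid, ts in self.last_activity.items()
            if now - ts > self.timeout_s
        ]
        for fid in stuck:
            kill(fid)      # hypothetical callback: terminate the agent process
            release(fid)   # hypothetical callback: return the feature to the queue
            self.on_exit(fid)
        return stuck
```

In this sketch the orchestrator's main loop would call `check_stuck_agents(...)` once per iteration. A monotonic clock is used so that wall-clock adjustments cannot trigger a false kill; whether the PR does the same is not stated.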
Changes
- parallel_orchestrator.py: 80 lines added (constant, tracking dict, spawn hooks, check method, cleanup, main loop integration)
Test plan
- Run with --max-concurrency 2 and verify agents complete normally (no false kills)