feat: detect and kill stuck agents after 20 minutes of inactivity #173

maswa · 2026-02-07T19:26:09Z

What this does

When running in parallel mode, agents sometimes hang indefinitely — usually waiting for an API response that never comes back, or stuck in some internal loop. This leaves features permanently marked as "in progress" and wastes a concurrency slot.

This PR adds a simple inactivity timeout: if an agent produces no output for 20 minutes, it gets killed and the feature is released back to the queue.

How it works

Tracks a last_activity timestamp per agent, refreshed whenever the agent writes to stdout
Runs a _check_stuck_agents() method on each iteration of the main orchestrator loop
Kills the agent's process tree once it exceeds 1200 seconds (20 min) of silence
Cleans up activity tracking when agents complete normally
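
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in parallel_orchestrator.py: the `ActivityTracker` class, its method names other than `check_stuck_agents`, and the injectable `clock` parameter are assumptions made for the example. Only `AGENT_INACTIVITY_TIMEOUT` (1200 seconds) and the per-agent last-activity timestamp come from the PR. Killing the process tree and releasing the feature are left to the caller.

```python
import time
from typing import Callable, Dict, List

AGENT_INACTIVITY_TIMEOUT = 1200  # seconds (20 minutes), value taken from the PR


class ActivityTracker:
    """Hypothetical helper; the PR keeps this state on the orchestrator itself."""

    def __init__(
        self,
        timeout: float = AGENT_INACTIVITY_TIMEOUT,
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ) -> None:
        self.timeout = timeout
        self.clock = clock
        self.last_activity: Dict[str, float] = {}

    def on_spawn(self, agent_id: str) -> None:
        # Start the silence window when the agent is launched.
        self.last_activity[agent_id] = self.clock()

    def on_output(self, agent_id: str) -> None:
        # Any stdout line counts as activity and resets the window.
        if agent_id in self.last_activity:
            self.last_activity[agent_id] = self.clock()

    def on_complete(self, agent_id: str) -> None:
        # Normal completion: drop the entry so it is never reported as stuck.
        self.last_activity.pop(agent_id, None)

    def check_stuck_agents(self) -> List[str]:
        # Return agents silent for longer than the timeout and stop tracking them.
        now = self.clock()
        stuck = [
            agent_id
            for agent_id, last in self.last_activity.items()
            if now - last > self.timeout
        ]
        for agent_id in stuck:
            del self.last_activity[agent_id]
        return stuck
```

In a loop like the one described, the caller would invoke `check_stuck_agents()` once per iteration, then kill each returned agent and put its feature back in the queue. Using a monotonic clock here is a design choice of the sketch: it keeps the silence window unaffected by wall-clock adjustments.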

Why 20 minutes?

Working agents produce continuous output — tool calls, code generation, thinking blocks. Even complex features that take 1-2 hours always have activity. 20 minutes of complete silence reliably indicates something is wrong.

Changes

parallel_orchestrator.py — 80 lines added (constant, tracking dict, spawn hooks, check method, cleanup, main loop integration)

Test plan

Run parallel mode with --max-concurrency 2 and verify agents complete normally (no false kills)
Manually test by killing a network connection mid-agent to verify stuck detection triggers
Verify killed features become available for other agents to pick up

Agents that hang without producing output are now automatically detected and killed after 20 minutes of inactivity. This prevents features from being stuck indefinitely when an agent hangs.

Changes:
- Add AGENT_INACTIVITY_TIMEOUT constant (1200 seconds)
- Track last activity timestamp per agent
- Kill and restart agents with no output for 20+ minutes
- Clean up tracking on agent completion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ISSUE:
------
The production systemd service was starting uvicorn directly without ensuring the React frontend was built. This caused UI changes to not appear until someone manually ran 'npm run build'.

ROOT CAUSE:
-----------
The ExecStart line in autocoder-ui.service bypassed start_ui.py, which contains smart build detection logic:

  ExecStart=/home/stu/.../venv/bin/python -m uvicorn server.main:app

SOLUTION:
---------
Created a production wrapper script that:
1. Runs 'npm run build' to compile TypeScript and bundle React app
2. Starts uvicorn server
3. Ensures UI changes are reflected on every service restart

FILES CREATED:
--------------
1. start_ui_production.sh
   - Production launcher for systemd
   - Builds frontend before starting server
   - Reports build status in logs
   - Fails fast if build fails
2. docs/BUILD_PROCESS.md
   - Comprehensive documentation
   - Problem description and solution
   - How build process works
   - Troubleshooting guide
   - Verification steps
3. verify_feature_173.py
   - Automated verification script
   - Tests wrapper script exists and is executable
   - Verifies systemd service configuration
   - Tests TypeScript compilation
   - Confirms dist directory is created

SYSTEMD CHANGES:
----------------
Modified: ~/.config/systemd/user/autocoder-ui.service
  ExecStart: /home/stu/projects/autocoder/venv/bin/python ...
  → ExecStart: /home/stu/projects/autocoder/start_ui_production.sh

VERIFICATION:
-------------
All 6/6 checks passed:
✅ Wrapper script exists and is executable
✅ Systemd service uses wrapper script
✅ Wrapper script contains build command
✅ TypeScript strict mode enabled
✅ TypeScript compilation succeeds (7.03s)
✅ dist directory created with assets

IMPACT:
-------
Before: UI changes required manual 'npm run build' → service restart
After: UI changes automatically built on every service start
Build time: ~7 seconds (TypeScript + Vite bundling)
Output: ui/dist/ with optimized assets (~1.2 MB gzipped)

Marked feature AutoForgeAI#173 as PASSING.
