Fix connection slot leak when Chrome crashes while client is idle #48

Closed
marcosvrs wants to merge 3 commits into dgtlmoon:master from marcosvrs:fix/connection-slot-leak

Problem

When a Chrome process crashes (OOM, segfault, etc.) while the connected client is idle, its connection slot is leaked: nothing releases it for as long as the client stays connected and silent. Leaked slots accumulate until they reach MAX_CONCURRENT_CHROME_PROCESSES, after which every new connection is rejected.

Root cause

The proxy creates two bridging tasks:

taskA = asyncio.create_task(hereToChromeCDP(...))   # Chrome -> Client
taskB = asyncio.create_task(puppeteerToHere(...))    # Client -> Chrome
await taskA
await taskB

When Chrome crashes:

  1. taskA detects the broken CDP WebSocket (ConnectionClosed) -> completes
  2. taskB blocks on async for message in chrome_websocket -- the client WebSocket is still alive, so it waits for a message that never comes -> hangs forever
  3. await taskB never returns -> handler never exits -> websocket.wait_closed() callback never fires -> stats_disconnect() never runs -> semaphore never released

The slot is only freed if the client happens to send another message (triggering a send to the dead Chrome ws) or disconnects. If the client is idle -- common for long-running browser sessions -- the slot is gone permanently.
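
The hang can be reproduced in isolation. The sketch below is not the code from server.py: `chrome_to_client`, `client_to_chrome`, and `sequential_handler` are hypothetical stand-ins, and an `asyncio.Event` that is never set plays the role of an idle client socket. It only illustrates why awaiting the two tasks in sequence never returns once the Chrome side has finished.

```python
import asyncio


async def chrome_to_client() -> None:
    """Stand-in for the Chrome-side bridge: returns once Chrome 'crashes'."""
    await asyncio.sleep(0.01)  # simulated crash shortly after connecting


async def client_to_chrome(client_activity: asyncio.Event) -> None:
    """Stand-in for the client-side bridge: an idle client sends nothing."""
    await client_activity.wait()  # never set in this demo, so this blocks


async def sequential_handler() -> None:
    """Mirrors the pre-fix shape: await one bridge, then the other."""
    task_a = asyncio.create_task(chrome_to_client())
    task_b = asyncio.create_task(client_to_chrome(asyncio.Event()))
    try:
        await task_a  # completes after the simulated crash
        await task_b  # blocks: nothing ever wakes the idle side
    finally:
        task_b.cancel()  # demo-only cleanup so the event loop can shut down


async def handler_hangs(timeout: float = 0.2) -> bool:
    """Return True if the handler is still running after `timeout` seconds."""
    try:
        await asyncio.wait_for(sequential_handler(), timeout)
    except asyncio.TimeoutError:
        return True
    return False


if __name__ == "__main__":
    print("handler hung:", asyncio.run(handler_hangs()))
```

In the real proxy there is no outer timeout, so nothing plays the role of `asyncio.wait_for` here and the handler stays blocked until the client speaks or disconnects.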

Evidence

After 3 days of operation with MAX_CONCURRENT_CHROME_PROCESSES=3:

  • Stats: Active count 3 of max 3 with only 2 live Chrome processes
  • Network: 4 CLOSE_WAIT connections to dead Chrome debug ports (11845, 11680, 11854, 11875) that no longer exist
  • Processes: Zombie Chrome entries in the process table from killed-but-not-reaped processes

Fix

1. Cross-task cancellation (primary fix)

Replace the sequential await taskA; await taskB with asyncio.wait(..., return_when=asyncio.FIRST_COMPLETED):

done, pending = await asyncio.wait(
    [taskA, taskB],
    return_when=asyncio.FIRST_COMPLETED
)
for task in pending:
    task.cancel()

When either side of the proxy dies, the other is cancelled immediately. The handler returns, the WebSocket closes, and stats_disconnect() fires normally to release the slot.
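
A self-contained sketch of the same pattern, under stated assumptions: `bridge_until`, `handle_connection`, and the `asyncio.Semaphore` are hypothetical stand-ins for the two bridging coroutines and for the slot accounting that `stats_disconnect()` performs in server.py. It also awaits the cancelled task before returning, which the snippet above does not do; that extra step is a common precaution so the cancellation has actually finished, not something the PR is known to include.

```python
import asyncio


async def bridge_until(closed: asyncio.Event) -> None:
    """Stand-in for one proxy direction: runs until its socket closes."""
    await closed.wait()


async def handle_connection(
    chrome_closed: asyncio.Event,
    client_closed: asyncio.Event,
    slots: asyncio.Semaphore,  # hypothetical stand-in for slot accounting
) -> None:
    await slots.acquire()
    try:
        task_a = asyncio.create_task(bridge_until(chrome_closed))
        task_b = asyncio.create_task(bridge_until(client_closed))
        _done, pending = await asyncio.wait(
            {task_a, task_b}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()
        # Let the cancelled side unwind; swallow its CancelledError.
        await asyncio.gather(*pending, return_exceptions=True)
    finally:
        slots.release()  # runs because the handler now returns


async def slot_held_after_chrome_crash() -> bool:
    """Crash 'Chrome' while the client is idle, then report slot state."""
    slots = asyncio.Semaphore(1)
    chrome_closed, client_closed = asyncio.Event(), asyncio.Event()
    handler = asyncio.create_task(
        handle_connection(chrome_closed, client_closed, slots)
    )
    await asyncio.sleep(0.01)  # handler now holds the only slot
    chrome_closed.set()        # simulated crash; the client stays silent
    await asyncio.wait_for(handler, timeout=1)
    return slots.locked()      # True would mean the slot is still held


if __name__ == "__main__":
    print("slot still held:", asyncio.run(slot_held_after_chrome_crash()))
```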

2. Process reaping (secondary fix)

Add chrome_process.wait() after SIGKILL in cleanup_chrome_by_pid() to reap dead Chrome processes instead of leaving zombies in the process table. Related: #46
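
A minimal sketch of kill-then-reap, assuming a `subprocess.Popen` handle. How `chrome_process` is obtained inside `cleanup_chrome_by_pid()` is not shown here, so `kill_and_reap` is a hypothetical helper and a sleeping Python child stands in for Chrome.

```python
import subprocess
import sys


def kill_and_reap(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Kill a child and collect its exit status so it is not left as a zombie."""
    proc.kill()  # SIGKILL on POSIX; no-op if the child was already reaped
    # wait() collects the exit status, removing the process-table entry.
    # It raises subprocess.TimeoutExpired if the child has not exited in time.
    return proc.wait(timeout=timeout)


if __name__ == "__main__":
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print("exit status:", kill_and_reap(child))
```

Without the `wait()` call the killed child remains in the process table until its parent collects the exit status, which matches the zombie entries described under Evidence.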

Testing

Deployed the patched server.py into a running dgtlmoon/sockpuppetbrowser container. After 9+ hours and 108 processed connections:

  • Active count stays at 0 when idle (previously accumulated leaked slots)
  • All Chrome processes properly reaped (reaped successfully in logs)
  • Zero CLOSE_WAIT connections to dead ports
  • Health checks passing continuously

When Chrome crashes, hereToChromeCDP detects the broken CDP WebSocket
and completes, but puppeteerToHere blocks forever on
`async for message in chrome_websocket` because the client connection
is still alive. The handler never returns, websocket.wait_closed()
never fires, and the semaphore is never released — permanently leaking
the connection slot.

Replace sequential `await taskA; await taskB` with
`asyncio.wait(FIRST_COMPLETED)` so that when either side of the proxy
dies, the other is cancelled and the handler can return normally.

Also add chrome_process.wait() after SIGKILL in cleanup_chrome_by_pid
to reap dead Chrome processes instead of leaving zombies in the process
table.

- test_chrome_crash_blocks_without_fix: proves sequential await hangs
  forever when Chrome dies while client is idle (the bug)
- test_chrome_crash_releases_slot_with_fix: proves FIRST_COMPLETED
  cancels the idle task and completes promptly (the fix)
- test_message_forwarding: verifies proxy correctness is preserved
- test_cleanup_reaps_chrome_process: verifies wait() is called after kill
- test_cleanup_handles_already_dead_process: verifies graceful fallback
- test_stats_disconnect_releases_semaphore: verifies slot accounting

The previous tests exercised the proxy pattern directly in the test
rather than through server.py, so they passed regardless of the fix.

Now test_handler_returns_after_chrome_crash and
test_slot_released_after_chrome_crash call launchPuppeteerChromeProxy
end-to-end with mocked Chrome/WebSocket deps. They fail on unfixed
code (2s timeout hit because the handler blocks forever) and pass on
the fixed code.
@marcosvrs closed this by deleting the head repository May 6, 2026