Fix connection slot leak when Chrome crashes while client is idle#48
Closed
marcosvrs wants to merge 3 commits into
Closed
Fix connection slot leak when Chrome crashes while client is idle#48marcosvrs wants to merge 3 commits into
marcosvrs wants to merge 3 commits into
Conversation
When Chrome crashes, hereToChromeCDP detects the broken CDP WebSocket and completes, but puppeteerToHere blocks forever on `async for message in chrome_websocket` because the client connection is still alive. The handler never returns, websocket.wait_closed() never fires, and the semaphore is never released — permanently leaking the connection slot. Replace sequential `await taskA; await taskB` with `asyncio.wait(FIRST_COMPLETED)` so that when either side of the proxy dies, the other is cancelled and the handler can return normally. Also add chrome_process.wait() after SIGKILL in cleanup_chrome_by_pid to reap dead Chrome processes instead of leaving zombies in the process table.
- test_chrome_crash_blocks_without_fix: proves sequential await hangs forever when Chrome dies while client is idle (the bug) - test_chrome_crash_releases_slot_with_fix: proves FIRST_COMPLETED cancels the idle task and completes promptly (the fix) - test_message_forwarding: verifies proxy correctness is preserved - test_cleanup_reaps_chrome_process: verifies wait() is called after kill - test_cleanup_handles_already_dead_process: verifies graceful fallback - test_stats_disconnect_releases_semaphore: verifies slot accounting
The previous tests exercised the proxy pattern directly in the test rather than through server.py, so they passed regardless of the fix. Now test_handler_returns_after_chrome_crash and test_slot_released_after_chrome_crash call launchPuppeteerChromeProxy end-to-end with mocked Chrome/WebSocket deps. They fail on unfixed code (2s timeout hit because the handler blocks forever) and pass on the fixed code.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When a Chrome process crashes (OOM, segfault, etc.) while the connected client is idle, the connection slot is permanently leaked and never reclaimed. Over time this exhausts
MAX_CONCURRENT_CHROME_PROCESSESand all new connections are rejected.Root cause
The proxy creates two bridging tasks:
When Chrome crashes:
taskAdetects the broken CDP WebSocket (ConnectionClosed) -> completestaskBblocks onasync for message in chrome_websocket-- the client WebSocket is still alive, so it waits for a message that never comes -> hangs foreverawait taskBnever returns -> handler never exits ->websocket.wait_closed()callback never fires ->stats_disconnect()never runs -> semaphore never releasedThe slot is only freed if the client happens to send another message (triggering a send to the dead Chrome ws) or disconnects. If the client is idle -- common for long-running browser sessions -- the slot is gone permanently.
Evidence
After 3 days of operation with
MAX_CONCURRENT_CHROME_PROCESSES=3:Active count 3 of max 3with only 2 live Chrome processesCLOSE_WAITconnections to dead Chrome debug ports (11845, 11680, 11854, 11875) that no longer existFix
1. Cross-task cancellation (primary fix)
Replace sequential
await taskA; await taskBwithasyncio.wait(FIRST_COMPLETED):When either side of the proxy dies, the other is cancelled immediately. The handler returns, the WebSocket closes, and
stats_disconnect()fires normally to release the slot.2. Process reaping (secondary fix)
Add
chrome_process.wait()after SIGKILL incleanup_chrome_by_pid()to reap dead Chrome processes instead of leaving zombies in the process table. Related: #46Testing
Deployed the patched
server.pyinto a runningdgtlmoon/sockpuppetbrowsercontainer. After 9+ hours and 108 processed connections:reaped successfullyin logs)CLOSE_WAITconnections to dead ports