Prevent parallel test-runner deadlock instead of adding timeout enforcement #1034
Draft
tersteegh wants to merge 1 commit into
…cement
Investigation: several Delft3D_WindowsTest_virtual builds on TeamCity were
hanging (TC_EXECUTION_TIMEOUT) with zero further log output right after
"Creating N processes to run test cases on.", CPU/RAM flat for 70+ minutes
(ruling out resource exhaustion), and no single bad commit correlating to the
hangs. A full memory dump (procdump -ma) of a live hung build (#7485927,
[fm drtc dwaves], agent c-teamcity33130) was captured and analyzed with cdb.
Full write-up: doc/investigations/2026-07-03_windows_testbench_hang_investigation.md.
Root cause: TestSetRunner.run_tests_in_parallel() throttles concurrent worker
processes via a multiprocessing.Manager Condition/Value pair, and separately
waits on Pool.join()/AsyncResult.get(). A worker (run_test_case, executing
inside the pool) never returns because the engine subprocess it launches
(e.g. dimr.exe) can hang indefinitely: TestCase's computed maxRunTime was only
ever used for logging, never applied to the actual subprocess.run() call
unless the <program> element specified its own maxRunTime, so program.py's
timeout defaulted to 0 ("no timeout") in that case. With the worker blocked
forever, the main process's slot-wait/Pool.join()/AsyncResult.get() also block
forever - exactly what the dump showed (main thread stuck in
PyThread_acquire_lock_timed with no progress).
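
To make the failure mode concrete, here is a minimal sketch of how a configured value of 0 can turn into an unbounded subprocess.run() call. The helper names (effective_timeout, run_engine) are hypothetical and the 0-means-no-timeout mapping is written from the description above; this is not the actual program.py code.

```python
import subprocess
import sys


def effective_timeout(program_max_run_time):
    """Map the configured value to subprocess.run()'s timeout argument.

    Assumption: 0 means "no timeout", as described for program.py.
    """
    return program_max_run_time if program_max_run_time > 0 else None


def run_engine(cmd, program_max_run_time=0):
    """Run the engine command; hypothetical stand-in for the worker's call."""
    timeout = effective_timeout(program_max_run_time)
    try:
        # With timeout=None this call waits for as long as the child lives,
        # so a hung engine blocks the worker (and the scheduler) indefinitely.
        subprocess.run(cmd, timeout=timeout, check=False)
        return "finished"
    except subprocess.TimeoutExpired:
        return "timed out"
```

With the default of 0 the call only returns if the child exits on its own; any positive value bounds the wait, which is the property the fix relies on.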
Rather than add a policy-level timeout to test_set_runner.py that detects and
fails the build after a stuck worker (bolting a symptom-fix onto the
scheduler), this fixes the deadlock structurally so it cannot occur:
- TestCase.__initializeProgramList__: default program_config.max_run_time
to the test case's computed maxRunTime when the program itself doesn't
specify one, so subprocess.run() always enforces a bound. This guarantees
run_test_case() (the pool worker) always returns in bounded time, which in
turn guarantees the process-slot Condition wait and AsyncResult.get()/
Pool.join() in run_tests_in_parallel() can never block forever either.
- Program.__execute__: add an explicit except subprocess.TimeoutExpired
branch with a clear log message, and fix the generic Exception handler's
use of e.filename (which doesn't exist on TimeoutExpired or most exception
types and would raise a masking AttributeError instead of logging the real
error).
- test_set_runner.py: harden run_test_case()'s process-slot release
(in_use.value -= ...; idle_process.notify_all()) by moving it into a
finally block, so the slot is always released regardless of exit path
(including exceptions not caught by `except Exception`, or an error raised
by logger.test_finished() itself) - previously any such path would leak the
slot forever and deadlock the remaining run.
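
The three changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: resolve_max_run_time, execute, Slots and the log callback are hypothetical names, and a threading.Condition with a plain counter stands in for the multiprocessing.Manager Condition/Value pair so the sketch runs in a single process.

```python
import subprocess
import sys
import threading


def resolve_max_run_time(program_max_run_time, test_case_max_run_time):
    """Change 1: use the test case's bound when the program specifies none."""
    if program_max_run_time and program_max_run_time > 0:
        return program_max_run_time
    return test_case_max_run_time


def execute(cmd, max_run_time, log):
    """Change 2: report a timeout explicitly; never assume e.filename exists."""
    try:
        subprocess.run(cmd, timeout=max_run_time, check=True)
        return True
    except subprocess.TimeoutExpired as e:
        log(f"timed out after {e.timeout} s: {e.cmd}")
    except Exception as e:
        # Only OSError-style exceptions carry .filename; getattr avoids
        # raising a masking AttributeError inside the handler.
        log(f"failed: {e!r} (file: {getattr(e, 'filename', None)})")
    return False


class Slots:
    """Threading stand-in (assumption) for the Manager Condition/Value pair."""

    def __init__(self):
        self.in_use = 0
        self.idle_process = threading.Condition()


def run_test_case(slots, work):
    """Change 3: release the process slot on every exit path."""
    try:
        return work()
    finally:
        # Runs for normal returns, Exception subclasses, and BaseException
        # subclasses alike, so the slot cannot leak.
        with slots.idle_process:
            slots.in_use -= 1
            slots.idle_process.notify_all()
```

The point of the finally block is that the release no longer depends on which except clauses exist: even an exception that bypasses `except Exception` still decrements the counter and wakes the waiting scheduler.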
Not yet addressed (documented as follow-up in the investigation doc):
- On Windows, subprocess.run(timeout=...) does not kill grandchild processes
(e.g. MPI ranks via a cmd /c wrapper), so they may be orphaned after a
timeout fires; process-tree termination is a recommended follow-up.
- If a pool worker's OS process is killed outright (crash/OOM) rather than
hanging, raw multiprocessing.Pool has a known limitation where the
corresponding AsyncResult can never complete. Not the mechanism observed in
the dump (the worker was alive, just blocked), so out of scope here.
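
For the first follow-up item, one possible shape of process-tree termination is sketched below. It is an assumption about how such a follow-up could look, not something taken from the investigation doc: run_with_tree_kill is a hypothetical helper, the Windows branch shells out to taskkill with /T (include child processes) and /F (force), and the POSIX branch puts the child in its own session so the whole process group can be signalled.

```python
import os
import signal
import subprocess
import sys


def run_with_tree_kill(cmd, timeout):
    """Run cmd; on timeout, terminate it together with its descendants.

    Returns the exit code, or None if the timeout fired.
    """
    windows = sys.platform == "win32"
    # POSIX only: a new session makes the child a process-group leader,
    # so its pid doubles as the group id for os.killpg().
    proc = subprocess.Popen(cmd, start_new_session=not windows)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        if windows:
            # /T walks the child tree, /F forces termination.
            subprocess.run(
                ["taskkill", "/PID", str(proc.pid), "/T", "/F"], check=False
            )
        else:
            os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the direct child
        return None
```

Whether taskkill reaches every MPI rank started through a cmd /c wrapper would still need to be validated on the build agents.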
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
9226ef5 to
36fd3b3
Summary
Investigation into TeamCity Delft3D_WindowsTest_virtual builds hanging (TC_EXECUTION_TIMEOUT) on Windows. Root-caused via a live memory dump (procdump -ma) of a hung build's python.exe process, analyzed with cdb/WinDbg thread-stack inspection.

Root cause: TestSetRunner.run_tests_in_parallel() throttles concurrent worker processes with a multiprocessing.Manager Condition/Value pair, and separately waits on Pool.join()/AsyncResult.get(). A worker (run_test_case) never returns because the engine subprocess it launches (e.g. dimr.exe) can hang indefinitely: TestCase's computed maxRunTime was only ever used for logging, never applied to the actual subprocess.run() call unless the <program> element specified its own maxRunTime. With the worker blocked forever, the main process's slot-wait/Pool.join()/AsyncResult.get() also block forever.

Fix (structural, not a policy timeout):
- TestCase.__initializeProgramList__ now defaults program_config.max_run_time to the test case's computed maxRunTime when the program doesn't specify one, so subprocess.run() always enforces a bound. This guarantees a pool worker always returns, which in turn guarantees the scheduler can never block forever either.
- Program.__execute__ gets an explicit except subprocess.TimeoutExpired branch (and a fix for a latent e.filename AttributeError bug in the generic exception handler).
- test_set_runner.py's process-slot release is hardened with a finally block so a slot can never leak regardless of how a worker exits.

No artificial build-failing timeout/enforcement was added to the scheduler - the deadlock is prevented at its source instead.

Full investigation write-up (build log analysis, perfmon CPU/RAM data, dump thread-stack analysis, root cause reasoning) is in doc/investigations/2026-07-03_windows_testbench_hang_investigation.md.

Draft PR - opening for visibility/early feedback while validation is still in progress.