Prevent parallel test-runner deadlock instead of adding timeout enforcement #1034

Draft
tersteegh wants to merge 1 commit into main from all/task/DEVOPDSC-testbench-timeout-enforcement

tersteegh commented Jul 3, 2026

Summary

This PR follows an investigation into TeamCity Delft3D_WindowsTest_virtual builds hanging (TC_EXECUTION_TIMEOUT) on Windows. The root cause was identified from a live memory dump (procdump -ma) of a hung build's python.exe process, analyzed with cdb/WinDbg thread-stack inspection.

Root cause: TestSetRunner.run_tests_in_parallel() throttles concurrent worker processes with a multiprocessing.Manager Condition/Value pair, and separately waits on Pool.join()/AsyncResult.get(). A worker (run_test_case) never returns when the engine subprocess it launches (e.g. dimr.exe) hangs indefinitely: TestCase's computed maxRunTime was only ever used for logging and was never applied to the actual subprocess.run() call unless the <program> element specified its own maxRunTime. With the worker blocked forever, the main process's slot wait, Pool.join() and AsyncResult.get() also block forever.

Fix (structural, not a policy timeout):

  • TestCase.__initializeProgramList__ now lets program_config.max_run_time fall back to the test case's computed maxRunTime when the program does not specify its own, so subprocess.run() always enforces a bound. This guarantees that a pool worker always returns, which in turn guarantees that the scheduler can never block forever either.
  • Program.__execute__ gets an explicit except subprocess.TimeoutExpired branch, plus a fix for a latent bug in the generic exception handler, which accessed e.filename and could itself raise an AttributeError.
  • test_set_runner.py's process-slot release is moved into a finally block, so a slot can never leak regardless of how a worker exits.

No artificial build-failing timeout or enforcement was added to the scheduler; the deadlock is prevented at its source instead.
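As a rough illustration of the first two points, the sketch below shows one way the fallback and the explicit TimeoutExpired branch could fit together. It is not the code in this PR: effective_timeout and run_bounded are hypothetical names, and treating a value of 0 or less as "not specified" is an assumption taken from the "0 means no timeout" behaviour described above.

```python
import subprocess
import sys
from typing import Optional, Sequence


def effective_timeout(program_max_run_time: float,
                      testcase_max_run_time: float) -> Optional[float]:
    """Pick the bound handed to subprocess.run() (hypothetical helper).

    Assumption: a value <= 0 means "not specified".
    """
    if program_max_run_time > 0:
        return program_max_run_time
    if testcase_max_run_time > 0:
        return testcase_max_run_time
    return None  # no bound available: subprocess.run() would wait forever


def run_bounded(cmd: Sequence[str], timeout: Optional[float]) -> str:
    """Run cmd and describe how it ended instead of blocking indefinitely."""
    try:
        completed = subprocess.run(cmd, timeout=timeout, check=False)
    except subprocess.TimeoutExpired as e:
        # subprocess.run() kills the direct child before re-raising.
        return f"timeout after {e.timeout}s"
    except Exception as e:
        # filename exists on OSError but not on exceptions in general, so
        # getattr() avoids a masking AttributeError while logging.
        return f"error: {e!r} (filename={getattr(e, 'filename', None)})"
    return f"exit code {completed.returncode}"


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_bounded(sleeper, effective_timeout(0, 1.0)))
```

With a bound always in place, the call returns after at most the chosen timeout, which is the property the worker needs in order to hand control back to the pool.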

Full investigation write-up (build log analysis, perfmon CPU/RAM data, dump thread-stack analysis, root cause reasoning) is in doc/investigations/2026-07-03_windows_testbench_hang_investigation.md.

Draft PR - opening for visibility/early feedback while validation is still in progress.

…cement

Investigation: several Delft3D_WindowsTest_virtual builds on TeamCity were
hanging (TC_EXECUTION_TIMEOUT) with zero further log output right after
"Creating N processes to run test cases on.", CPU/RAM flat for 70+ minutes
(ruling out resource exhaustion), and no single bad commit correlating to the
hangs. A full memory dump (procdump -ma) of a live hung build (#7485927,
[fm drtc dwaves], agent c-teamcity33130) was captured and analyzed with cdb.
Full write-up: doc/investigations/2026-07-03_windows_testbench_hang_investigation.md.

Root cause: TestSetRunner.run_tests_in_parallel() throttles concurrent worker
processes via a multiprocessing.Manager Condition/Value pair, and separately
waits on Pool.join()/AsyncResult.get(). A worker (run_test_case, executing
inside the pool) never returns because the engine subprocess it launches
(e.g. dimr.exe) can hang indefinitely: TestCase's computed maxRunTime was only
ever used for logging, never applied to the actual subprocess.run() call
unless the <program> element specified its own maxRunTime, so program.py's
timeout defaulted to 0 ("no timeout") in that case. With the worker blocked
forever, the main process' slot-wait/Pool.join()/AsyncResult.get() also block
forever - exactly what the dump showed (main thread stuck in
PyThread_acquire_lock_timed with no progress).

Rather than add a policy-level timeout to test_set_runner.py that detects and
fails the build after a stuck worker (bolting a symptom-fix onto the
scheduler), this fixes the deadlock structurally so it cannot occur:

- TestCase.__initializeProgramList__: let program_config.max_run_time fall
  back to the test case's computed maxRunTime when the program itself doesn't
  specify one, so subprocess.run() always enforces a bound. This guarantees
  run_test_case() (the pool worker) always returns in bounded time, which in
  turn guarantees the process-slot Condition wait and AsyncResult.get()/
  Pool.join() in run_tests_in_parallel() can never block forever either.
- Program.__execute__: add an explicit except subprocess.TimeoutExpired
  branch with a clear log message, and fix the generic Exception handler's
  use of e.filename (which doesn't exist on TimeoutExpired or most exception
  types and would raise a masking AttributeError instead of logging the real
  error).
- test_set_runner.py: harden run_test_case()'s process-slot release
  (in_use.value -= ...; idle_process.notify_all()) by moving it into a
  finally block, so the slot is always released regardless of exit path
  (including exceptions not caught by `except Exception`, or an error raised
  by logger.test_finished() itself) - previously any such path would leak the
  slot forever and deadlock the remaining run.
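
The slot-release pattern from the last point can be sketched as follows. This is a
simplified stand-in, not the code in test_set_runner.py: run_with_slot and its
parameters are hypothetical names, and a threading.Condition plus a plain
namespace object replace the multiprocessing.Manager Condition/Value proxies,
which offer the same `with`, `.value` and notify_all() usage.

```python
import threading
from types import SimpleNamespace
from typing import Callable


def run_with_slot(in_use, idle_process, work: Callable[[], None],
                  cost: int = 1) -> None:
    """Run work() and always give the process slot back (hypothetical helper).

    in_use needs a .value attribute; idle_process must behave like a Condition.
    """
    try:
        work()
    finally:
        # Runs on normal return, on Exception, and on BaseException
        # subclasses such as SystemExit or KeyboardInterrupt.
        with idle_process:
            in_use.value -= cost
            idle_process.notify_all()


if __name__ == "__main__":
    in_use = SimpleNamespace(value=1)
    idle_process = threading.Condition()

    def failing_work() -> None:
        raise SystemExit(1)  # not caught by `except Exception`

    try:
        run_with_slot(in_use, idle_process, failing_work)
    except SystemExit:
        pass
    print(in_use.value)  # the slot was released despite the exit path
```

Because the release sits in finally rather than after the try/except, no exit
path of the worker body can skip it.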

Not yet addressed (documented as follow-up in the investigation doc):
- On Windows, subprocess.run(timeout=...) does not kill grandchild processes
  (e.g. MPI ranks via a cmd /c wrapper), so they may be orphaned after a
  timeout fires; process-tree termination is a recommended follow-up.
- If a pool worker's OS process is killed outright (crash/OOM) rather than
  hanging, raw multiprocessing.Pool has a known limitation where the
  corresponding AsyncResult can never complete. Not the mechanism observed in
  the dump (the worker was alive, just blocked), so out of scope here.
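
For the first follow-up, one possible shape of process-tree termination is
sketched below. It is an assumption-laden illustration, not part of this change:
run_and_kill_tree is a hypothetical name, the Windows branch shells out to
taskkill with /T (include child processes) and /F (force), and the POSIX branch
uses a new session plus os.killpg. taskkill /T follows parent-child links, so a
Windows Job Object may be the more robust option when intermediate processes
have already exited.

```python
import os
import signal
import subprocess
import sys
from typing import Optional, Sequence


def run_and_kill_tree(cmd: Sequence[str], timeout: float) -> Optional[int]:
    """Return the exit code, or None if the process tree was killed on timeout."""
    windows = sys.platform == "win32"
    # On POSIX, start a new session so the child leads its own process group.
    proc = subprocess.Popen(cmd, start_new_session=not windows)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        if windows:
            # /T also terminates child processes, /F forces termination.
            subprocess.run(
                ["taskkill", "/PID", str(proc.pid), "/T", "/F"],
                check=False,
                capture_output=True,
            )
        else:
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the group exited between the timeout and the kill
        proc.wait()  # reap the direct child
        return None


if __name__ == "__main__":
    # The child spawns a grandchild that would outlive a plain kill of the child.
    child = (
        "import subprocess, sys; "
        "subprocess.Popen([sys.executable, '-c', "
        "'import time; time.sleep(60)']).wait()"
    )
    print(run_and_kill_tree([sys.executable, "-c", child], timeout=1.0))
```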

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tersteegh tersteegh force-pushed the all/task/DEVOPDSC-testbench-timeout-enforcement branch from 9226ef5 to 36fd3b3 Compare July 3, 2026 13:15
@tersteegh tersteegh changed the title from "Detect and fail fast on stuck parallel test workers instead of silent 90min hang" to "Prevent parallel test-runner deadlock instead of adding timeout enforcement" Jul 3, 2026