Fix SubprocessRunner blocking cooperative thread pool, causing permanent refresh stall by bryant24hao · Pull Request #566 · steipete/CodexBar

bryant24hao · 2026-03-18T20:31:17Z

Summary

SubprocessRunner.run() called waitUntilExit() and readDataToEndOfFile() inside Task closures on Swift's cooperative thread pool. When multiple providers refreshed concurrently, these blocking calls starved the pool, preventing the timeout mechanism (Task.sleep) from firing. A single hung subprocess permanently blocked all future refreshes via the isRefreshing guard in UsageStore.refresh().
This PR moves the blocking calls to DispatchQueue.global() via withCheckedContinuation (same pattern as KiroStatusProbe.runCommand()), freeing the cooperative pool so timeouts fire reliably.
The timeout regression test deleted in 3961770 is restored and now passes reliably — this is the direct proof the fix works.

Root cause

SubprocessRunner.run() creates 3 Tasks with blocking calls:
  Task { readDataToEndOfFile() }   ← blocks cooperative thread
  Task { readDataToEndOfFile() }   ← blocks cooperative thread
  Task { waitUntilExit() }         ← blocks cooperative thread

8 providers × 3 blocking calls = 24 threads needed > 8-12 available
→ Task.sleep (timeout) can't be scheduled → timeout never fires
→ withTaskGroup never completes → isRefreshing stays true forever
→ all subsequent refreshes silently dropped

Confirmed by maintainer in commit 3961770: "waitUntilExit() blocks the cooperative thread pool, starving the timeout task on low-core CI runners"

Changes

readDataOffPool() / waitForExitOffPool() — new helpers that run blocking calls on DispatchQueue.global() via withCheckedContinuation, keeping the cooperative pool free
Kill process inside timeout task — terminateProcess() is called before throwing .timedOut, so withThrowingTaskGroup can exit promptly (previously the kill was in the catch block after the group, creating a deadlock)
Race guard — if the timeout kills the process but the exit code arrives at group.next() first, terminationReason == .uncaughtSignal detects this and reclassifies as .timedOut
terminateProcess() helper — extracted SIGTERM→SIGKILL escalation logic to deduplicate between timeout task and catch block

Test plan

Restore throwsTimedOutWhenProcessHangs (deleted in 3961770) — now passes in ~1s instead of hanging for 30s
Add concurrentHungProcessesAllTimeOut — 8 concurrent hung subprocesses all timeout in ~2s, proving no starvation
Add concurrentCallsDoNotStarve — 20 concurrent normal calls complete without blocking each other
Existing readsLargeStdoutWithoutDeadlock continues to pass
Full test suite: 890/899 pass, 9 pre-existing ClaudeOAuth keychain test failures (unrelated, depend on local keychain state)
./Scripts/lint.sh lint passes (0 violations)

Closes #189

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 710061c3d4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-18T20:36:55Z

Sources/CodexBarCore/Host/Process/SubprocessRunner.swift

                    try await Task.sleep(for: .seconds(timeout))
+                    // Kill the process BEFORE throwing so the exit-code task can complete
+                    // and withThrowingTaskGroup can exit promptly.
+                    self.terminateProcess(process, processGroup: processGroup)
                    throw SubprocessRunnerError.timedOut(label)


Avoid throwing timeout after the process already exited

After waitUntilExit() was moved behind DispatchQueue.global() in SubprocessRunner.run, there is now an extra scheduling hop before exitCodeTask can finish. For commands that complete right around the deadline, this timeout task can wake first, terminateProcess will no-op because process.isRunning is already false, and we still throw .timedOut. That turns successful near-threshold commands into false failures/fallbacks; the timeout branch should only throw if the process was still running or if it actually performed the kill.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-18T20:36:55Z

Sources/CodexBarCore/Host/Process/SubprocessRunner.swift

+            // Race guard: if the timeout task killed the process but the exit code arrived
+            // at group.next() before the .timedOut throw, the process will have been killed
+            // by a signal. Reclassify as timeout so callers get the correct error type.
+            if process.terminationReason == .uncaughtSignal {


Only treat signalled exits as timeouts when we killed them

This branch now maps every .uncaughtSignal exit to .timedOut, but nothing records that the timeout task actually fired. Any subprocess launched through SubprocessRunner that crashes or is terminated by a signal for its own reasons will now be misreported as a timeout instead of a real subprocess failure, which changes fallback/UI behavior and hides the actual failure mode from callers.

Useful? React with 👍 / 👎.

…ent refresh stall SubprocessRunner.run() called waitUntilExit() and readDataToEndOfFile() inside Task closures that ran on Swift's cooperative thread pool. When multiple providers refreshed concurrently, these blocking calls starved the pool, preventing the timeout mechanism (Task.sleep) from firing. A single hung subprocess could permanently block all future refreshes via the isRefreshing guard in UsageStore.refresh(). Changes: - Move waitUntilExit() and readDataToEndOfFile() to DispatchQueue.global() via withCheckedContinuation, freeing the cooperative pool for timeout scheduling - Kill the process inside the timeout task before throwing, so withThrowingTaskGroup can exit promptly (previously it waited for all child tasks, which couldn't complete because the process kill was in the catch block) - Add race guard: if the timeout kills the process but the exit code arrives at group.next() first, detect via terminationReason == .uncaughtSignal and reclassify as .timedOut - Extract terminateProcess() helper to deduplicate SIGTERM→SIGKILL escalation The timeout regression test deleted in 3961770 ("waitUntilExit() blocks the cooperative thread pool, starving the timeout task on low-core CI runners") is restored and now passes reliably. Closes steipete#189 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- terminateProcess() now returns Bool: true if it actually killed the process, false if it had already exited. The timeout task only throws .timedOut when the process was actually killed, avoiding false timeouts for near-threshold commands. - Race guard uses both terminationReason == .uncaughtSignal AND elapsed time >= timeout to distinguish our timeout kills from unrelated signal crashes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bryant24hao · 2026-03-19T20:18:12Z

@codex review

The two P2 suggestions from the initial review have been addressed in the second commit (eca08e4):

"Avoid throwing timeout after the process already exited" — terminateProcess() now returns Bool; the timeout task only throws .timedOut when it actually killed the process (guard self.terminateProcess(...) else { return await exitCodeTask.value }).
"Only treat signalled exits as timeouts when we killed them" — The race guard now requires both terminationReason == .uncaughtSignal AND elapsed >= timeout - 0.5, so unrelated signal crashes are not misclassified.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eca08e4a67

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-19T20:21:52Z

Sources/CodexBarCore/Host/Process/SubprocessRunner.swift

+            if process.terminationReason == .uncaughtSignal,
+               Date().timeIntervalSince(start) >= timeout - 0.5


Stop inferring timeout ownership from wall-clock time

A subprocess that exits by signal on its own shortly before the deadline is still reclassified as .timedOut here, even when the timeout task never killed it. For example, sh -c 'sleep 0.9; kill -TERM $$' with timeout: 1 will satisfy both conditions and lose the real failure mode; for any timeout under 0.5s, every signalled exit is misclassified because the threshold is negative. This affects commands that intentionally self-signal or are killed externally, so the guard needs to key off an explicit “we sent SIGTERM/SIGKILL” flag rather than elapsed time.

Useful? React with 👍 / 👎.

chatgpt-codex-connector bot reviewed Mar 18, 2026

View reviewed changes

bryant24hao and others added 2 commits March 19, 2026 12:48

bryant24hao force-pushed the fix/subprocess-cooperative-pool-starvation branch from 24105e5 to eca08e4 Compare March 19, 2026 04:49

chatgpt-codex-connector bot reviewed Mar 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix SubprocessRunner blocking cooperative thread pool, causing permanent refresh stall#566

Fix SubprocessRunner blocking cooperative thread pool, causing permanent refresh stall#566
bryant24hao wants to merge 2 commits intosteipete:mainfrom
bryant24hao:fix/subprocess-cooperative-pool-starvation

bryant24hao commented Mar 18, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 18, 2026

Uh oh!

chatgpt-codex-connector bot Mar 18, 2026

Uh oh!

bryant24hao commented Mar 19, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		if process.terminationReason == .uncaughtSignal,
		Date().timeIntervalSince(start) >= timeout - 0.5

Conversation

bryant24hao commented Mar 18, 2026

Summary

Root cause

Changes

Test plan

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

bryant24hao commented Mar 19, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant