Fix SubprocessRunner blocking cooperative thread pool, causing permanent refresh stall #566

Open
bryant24hao wants to merge 2 commits into steipete:main from
bryant24hao:fix/subprocess-cooperative-pool-starvation
Conversation

@bryant24hao
Contributor

Summary

  • SubprocessRunner.run() called waitUntilExit() and readDataToEndOfFile() inside Task closures on Swift's cooperative thread pool. When multiple providers refreshed concurrently, these blocking calls starved the pool, preventing the timeout mechanism (Task.sleep) from firing. A single hung subprocess permanently blocked all future refreshes via the isRefreshing guard in UsageStore.refresh().
  • This PR moves the blocking calls to DispatchQueue.global() via withCheckedContinuation (same pattern as KiroStatusProbe.runCommand()), freeing the cooperative pool so timeouts fire reliably.
  • The timeout regression test deleted in 3961770 is restored and now passes reliably — this is the direct proof the fix works.

Root cause

SubprocessRunner.run() creates 3 Tasks with blocking calls:
  Task { readDataToEndOfFile() }   ← blocks cooperative thread
  Task { readDataToEndOfFile() }   ← blocks cooperative thread
  Task { waitUntilExit() }         ← blocks cooperative thread

8 providers × 3 blocking calls = 24 threads needed > 8-12 available
→ Task.sleep (timeout) can't be scheduled → timeout never fires
→ withTaskGroup never completes → isRefreshing stays true forever
→ all subsequent refreshes silently dropped

Confirmed by maintainer in commit 3961770: "waitUntilExit() blocks the cooperative thread pool, starving the timeout task on low-core CI runners"
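The timeout race at the center of this failure can be sketched roughly as follows. This is a minimal illustration, not the code in SubprocessRunner: `withTimeout` and `TimedOut` are hypothetical names. The point it shows is that the timeout is itself a child task that needs a free cooperative thread to resume after `Task.sleep`, and that the group cannot return until every child has finished.

```swift
import Foundation

// Hypothetical error type for this sketch.
struct TimedOut: Error {}

/// Races `work` against a sleeping timeout task (hypothetical helper).
func withTimeout<T: Sendable>(
    seconds: Double,
    _ work: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await work() }
        group.addTask {
            // Needs a free cooperative-pool thread to resume after the sleep.
            try await Task.sleep(nanoseconds: UInt64(seconds * 1_000_000_000))
            throw TimedOut()
        }
        // Cancel whichever child is still running once a result or error arrives.
        defer { group.cancelAll() }
        guard let first = try await group.next() else { throw TimedOut() }
        return first
    }
}
```

If `work` blocks a thread and ignores cancellation, the group still waits for it on exit, which is why the PR kills the subprocess before throwing from the timeout task.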

Changes

  1. readDataOffPool() / waitForExitOffPool() — new helpers that run blocking calls on DispatchQueue.global() via withCheckedContinuation, keeping the cooperative pool free
  2. Kill process inside timeout task: terminateProcess() is called before throwing .timedOut, so withThrowingTaskGroup can exit promptly (previously the kill was in the catch block after the group, so the group kept waiting for child tasks that could not complete, creating a deadlock)
  3. Race guard — if the timeout kills the process but the exit code arrives at group.next() first, terminationReason == .uncaughtSignal detects this and reclassifies as .timedOut
  4. terminateProcess() helper — extracted SIGTERM→SIGKILL escalation logic to deduplicate between timeout task and catch block
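A minimal sketch of what the two off-pool helpers from item 1 might look like is shown here. The function names come from the PR description, but the bodies and signatures are assumptions, and `runEcho` is a hypothetical usage example. Under strict concurrency checking, capturing `Process` or `FileHandle` in the dispatch closure may need extra Sendable annotations.

```swift
import Foundation

// Assumed bodies for the helpers named above; the PR's real code may differ.

/// Waits for `process` to exit on a GCD thread instead of a cooperative-pool thread.
func waitForExitOffPool(_ process: Process) async -> Int32 {
    await withCheckedContinuation { (continuation: CheckedContinuation<Int32, Never>) in
        DispatchQueue.global().async {
            process.waitUntilExit()  // blocks only this GCD thread
            continuation.resume(returning: process.terminationStatus)
        }
    }
}

/// Reads a pipe to EOF on a GCD thread.
func readDataOffPool(_ handle: FileHandle) async -> Data {
    await withCheckedContinuation { (continuation: CheckedContinuation<Data, Never>) in
        DispatchQueue.global().async {
            continuation.resume(returning: handle.readDataToEndOfFile())
        }
    }
}

/// Hypothetical usage: run `/bin/echo hello` and collect output and exit status.
func runEcho() async throws -> (output: String, status: Int32) {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/bin/echo")
    process.arguments = ["hello"]
    let stdout = Pipe()
    process.standardOutput = stdout
    try process.run()
    let data = await readDataOffPool(stdout.fileHandleForReading)
    let status = await waitForExitOffPool(process)
    return (String(decoding: data, as: UTF8.self), status)
}
```

While the GCD thread is parked in the blocking call, the awaiting task is suspended rather than occupying a cooperative thread, so other tasks (including a sleeping timeout) can still be scheduled.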

Test plan

  • Restore throwsTimedOutWhenProcessHangs (deleted in 3961770) — now passes in ~1s instead of hanging for 30s
  • Add concurrentHungProcessesAllTimeOut — 8 concurrent hung subprocesses all time out in ~2s, proving no starvation
  • Add concurrentCallsDoNotStarve — 20 concurrent normal calls complete without blocking each other
  • Existing readsLargeStdoutWithoutDeadlock continues to pass
  • Full test suite: 890/899 pass, 9 pre-existing ClaudeOAuth keychain test failures (unrelated, depend on local keychain state)
  • ./Scripts/lint.sh lint passes (0 violations)

Closes #189

🤖 Generated with Claude Code

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 710061c3d4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 139 to 143
try await Task.sleep(for: .seconds(timeout))
// Kill the process BEFORE throwing so the exit-code task can complete
// and withThrowingTaskGroup can exit promptly.
self.terminateProcess(process, processGroup: processGroup)
throw SubprocessRunnerError.timedOut(label)

P2: Avoid throwing timeout after the process already exited

After waitUntilExit() was moved behind DispatchQueue.global() in SubprocessRunner.run, there is now an extra scheduling hop before exitCodeTask can finish. For commands that complete right around the deadline, this timeout task can wake first, terminateProcess will no-op because process.isRunning is already false, and we still throw .timedOut. That turns successful near-threshold commands into false failures/fallbacks; the timeout branch should only throw if the process was still running or if it actually performed the kill.

// Race guard: if the timeout task killed the process but the exit code arrived
// at group.next() before the .timedOut throw, the process will have been killed
// by a signal. Reclassify as timeout so callers get the correct error type.
if process.terminationReason == .uncaughtSignal {

P2: Only treat signalled exits as timeouts when we killed them

This branch now maps every .uncaughtSignal exit to .timedOut, but nothing records that the timeout task actually fired. Any subprocess launched through SubprocessRunner that crashes or is terminated by a signal for its own reasons will now be misreported as a timeout instead of a real subprocess failure, which changes fallback/UI behavior and hides the actual failure mode from callers.

bryant24hao and others added 2 commits March 19, 2026 12:48
…ent refresh stall

SubprocessRunner.run() called waitUntilExit() and readDataToEndOfFile() inside
Task closures that ran on Swift's cooperative thread pool. When multiple providers
refreshed concurrently, these blocking calls starved the pool, preventing the
timeout mechanism (Task.sleep) from firing. A single hung subprocess could
permanently block all future refreshes via the isRefreshing guard in
UsageStore.refresh().

Changes:
- Move waitUntilExit() and readDataToEndOfFile() to DispatchQueue.global() via
  withCheckedContinuation, freeing the cooperative pool for timeout scheduling
- Kill the process inside the timeout task before throwing, so
  withThrowingTaskGroup can exit promptly (previously it waited for all child
  tasks, which couldn't complete because the process kill was in the catch block)
- Add race guard: if the timeout kills the process but the exit code arrives at
  group.next() first, detect via terminationReason == .uncaughtSignal and
  reclassify as .timedOut
- Extract terminateProcess() helper to deduplicate SIGTERM→SIGKILL escalation

The timeout regression test deleted in 3961770 ("waitUntilExit() blocks the
cooperative thread pool, starving the timeout task on low-core CI runners") is
restored and now passes reliably.

Closes steipete#189

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- terminateProcess() now returns Bool: true if it actually killed the
  process, false if it had already exited. The timeout task only throws
  .timedOut when the process was actually killed, avoiding false timeouts
  for near-threshold commands.
- Race guard uses both terminationReason == .uncaughtSignal AND elapsed
  time >= timeout to distinguish our timeout kills from unrelated signal
  crashes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bryant24hao bryant24hao force-pushed the fix/subprocess-cooperative-pool-starvation branch from 24105e5 to eca08e4 Compare March 19, 2026 04:49
@bryant24hao
Contributor Author

@codex review

The two P2 suggestions from the initial review have been addressed in the second commit (eca08e4):

  1. "Avoid throwing timeout after the process already exited": terminateProcess() now returns Bool; the timeout task only throws .timedOut when it actually killed the process (guard self.terminateProcess(...) else { return await exitCodeTask.value }).

  2. "Only treat signalled exits as timeouts when we killed them" — The race guard now requires both terminationReason == .uncaughtSignal AND elapsed >= timeout - 0.5, so unrelated signal crashes are not misclassified.
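As a rough illustration of item 1, a Bool-returning terminate helper could look like the sketch below. It is an assumption about the shape of the change, not the PR's code: the real helper also takes a processGroup argument, which is omitted here, and the `grace` parameter is hypothetical.

```swift
import Foundation

/// Hypothetical sketch: returns true only if this call found the process
/// still running and sent it SIGTERM; returns false if it had already exited.
@discardableResult
func terminateProcess(_ process: Process, grace: TimeInterval = 0.5) -> Bool {
    guard process.isRunning else { return false }  // already exited: nothing to kill
    process.terminate()                            // sends SIGTERM
    let pid = process.processIdentifier
    // Escalate to SIGKILL after the grace period without blocking the caller.
    DispatchQueue.global().asyncAfter(deadline: .now() + grace) {
        if process.isRunning { kill(pid, SIGKILL) }
    }
    return true
}
```

There is still a small window between the `isRunning` check and `terminate()` in which the process can exit by itself, so a true return value means "we sent a signal", not "our signal is what ended the process".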

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eca08e4a67

Comment on lines +161 to +162
if process.terminationReason == .uncaughtSignal,
Date().timeIntervalSince(start) >= timeout - 0.5

P2: Stop inferring timeout ownership from wall-clock time

A subprocess that exits by signal on its own shortly before the deadline is still reclassified as .timedOut here, even when the timeout task never killed it. For example, sh -c 'sleep 0.9; kill -TERM $$' with timeout: 1 will satisfy both conditions and lose the real failure mode; for any timeout under 0.5s, every signalled exit is misclassified because the threshold is negative. This affects commands that intentionally self-signal or are killed externally, so the guard needs to key off an explicit “we sent SIGTERM/SIGKILL” flag rather than elapsed time.
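One way the explicit flag suggested here might be expressed is sketched below. All names (`KillFlag`, `ExitClassification`, `classify`) are hypothetical and do not appear in the PR; the sketch only shows classification driven by a recorded "we sent the signal" bit rather than by elapsed time.

```swift
import Foundation

/// Hypothetical lock-protected flag set by the timeout path when it signals the process.
final class KillFlag: @unchecked Sendable {
    private let lock = NSLock()
    private var killed = false

    func markKilled() {
        lock.lock()
        killed = true
        lock.unlock()
    }

    var wasKilled: Bool {
        lock.lock()
        defer { lock.unlock() }
        return killed
    }
}

// Hypothetical result type for this sketch.
enum ExitClassification: Equatable {
    case exited(Int32)
    case signalled
    case timedOut
}

/// A signalled exit counts as a timeout only when the timeout path recorded a kill.
func classify(
    reason: Process.TerminationReason,
    status: Int32,
    killedByTimeout: Bool
) -> ExitClassification {
    if reason == .uncaughtSignal {
        return killedByTimeout ? .timedOut : .signalled
    }
    return .exited(status)
}
```

With this shape, a self-signalling command such as the `sh -c 'sleep 0.9; kill -TERM $$'` example is reported as signalled regardless of how close it runs to the deadline, because the flag is never set. Whether a normal exit after a recorded kill should also count as a timeout is a separate design choice that this sketch leaves as a plain exit.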

Development

Successfully merging this pull request may close these issues.

Antigravity is not refreshing after a few hours

1 participant