Fix SubprocessRunner blocking cooperative thread pool, causing permanent refresh stall#566
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 710061c3d4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| try await Task.sleep(for: .seconds(timeout)) | ||
| // Kill the process BEFORE throwing so the exit-code task can complete | ||
| // and withThrowingTaskGroup can exit promptly. | ||
| self.terminateProcess(process, processGroup: processGroup) | ||
| throw SubprocessRunnerError.timedOut(label) |
There was a problem hiding this comment.
Avoid throwing timeout after the process already exited
After waitUntilExit() was moved behind DispatchQueue.global() in SubprocessRunner.run, there is now an extra scheduling hop before exitCodeTask can finish. For commands that complete right around the deadline, this timeout task can wake first, terminateProcess will no-op because process.isRunning is already false, and we still throw .timedOut. That turns successful near-threshold commands into false failures/fallbacks; the timeout branch should only throw if the process was still running or if it actually performed the kill.
Useful? React with 👍 / 👎.
| // Race guard: if the timeout task killed the process but the exit code arrived | ||
| // at group.next() before the .timedOut throw, the process will have been killed | ||
| // by a signal. Reclassify as timeout so callers get the correct error type. | ||
| if process.terminationReason == .uncaughtSignal { |
There was a problem hiding this comment.
Only treat signalled exits as timeouts when we killed them
This branch now maps every .uncaughtSignal exit to .timedOut, but nothing records that the timeout task actually fired. Any subprocess launched through SubprocessRunner that crashes or is terminated by a signal for its own reasons will now be misreported as a timeout instead of a real subprocess failure, which changes fallback/UI behavior and hides the actual failure mode from callers.
Useful? React with 👍 / 👎.
…ent refresh stall SubprocessRunner.run() called waitUntilExit() and readDataToEndOfFile() inside Task closures that ran on Swift's cooperative thread pool. When multiple providers refreshed concurrently, these blocking calls starved the pool, preventing the timeout mechanism (Task.sleep) from firing. A single hung subprocess could permanently block all future refreshes via the isRefreshing guard in UsageStore.refresh(). Changes: - Move waitUntilExit() and readDataToEndOfFile() to DispatchQueue.global() via withCheckedContinuation, freeing the cooperative pool for timeout scheduling - Kill the process inside the timeout task before throwing, so withThrowingTaskGroup can exit promptly (previously it waited for all child tasks, which couldn't complete because the process kill was in the catch block) - Add race guard: if the timeout kills the process but the exit code arrives at group.next() first, detect via terminationReason == .uncaughtSignal and reclassify as .timedOut - Extract terminateProcess() helper to deduplicate SIGTERM→SIGKILL escalation The timeout regression test deleted in 3961770 ("waitUntilExit() blocks the cooperative thread pool, starving the timeout task on low-core CI runners") is restored and now passes reliably. Closes steipete#189 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- terminateProcess() now returns Bool: true if it actually killed the process, false if it had already exited. The timeout task only throws .timedOut when the process was actually killed, avoiding false timeouts for near-threshold commands. - Race guard uses both terminationReason == .uncaughtSignal AND elapsed time >= timeout to distinguish our timeout kills from unrelated signal crashes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
24105e5 to
eca08e4
Compare
|
@codex review The two P2 suggestions from the initial review have been addressed in the second commit (eca08e4):
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eca08e4a67
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if process.terminationReason == .uncaughtSignal, | ||
| Date().timeIntervalSince(start) >= timeout - 0.5 |
There was a problem hiding this comment.
Stop inferring timeout ownership from wall-clock time
A subprocess that exits by signal on its own shortly before the deadline is still reclassified as .timedOut here, even when the timeout task never killed it. For example, sh -c 'sleep 0.9; kill -TERM $$' with timeout: 1 will satisfy both conditions and lose the real failure mode; for any timeout under 0.5s, every signalled exit is misclassified because the threshold is negative. This affects commands that intentionally self-signal or are killed externally, so the guard needs to key off an explicit “we sent SIGTERM/SIGKILL” flag rather than elapsed time.
Useful? React with 👍 / 👎.
Summary
SubprocessRunner.run()calledwaitUntilExit()andreadDataToEndOfFile()insideTaskclosures on Swift's cooperative thread pool. When multiple providers refreshed concurrently, these blocking calls starved the pool, preventing the timeout mechanism (Task.sleep) from firing. A single hung subprocess permanently blocked all future refreshes via theisRefreshingguard inUsageStore.refresh().DispatchQueue.global()viawithCheckedContinuation(same pattern asKiroStatusProbe.runCommand()), freeing the cooperative pool so timeouts fire reliably.Root cause
Confirmed by maintainer in commit 3961770: "waitUntilExit() blocks the cooperative thread pool, starving the timeout task on low-core CI runners"
Changes
readDataOffPool()/waitForExitOffPool()— new helpers that run blocking calls onDispatchQueue.global()viawithCheckedContinuation, keeping the cooperative pool freeterminateProcess()is called before throwing.timedOut, sowithThrowingTaskGroupcan exit promptly (previously the kill was in thecatchblock after the group, creating a deadlock)group.next()first,terminationReason == .uncaughtSignaldetects this and reclassifies as.timedOutterminateProcess()helper — extracted SIGTERM→SIGKILL escalation logic to deduplicate between timeout task and catch blockTest plan
throwsTimedOutWhenProcessHangs(deleted in 3961770) — now passes in ~1s instead of hanging for 30sconcurrentHungProcessesAllTimeOut— 8 concurrent hung subprocesses all timeout in ~2s, proving no starvationconcurrentCallsDoNotStarve— 20 concurrent normal calls complete without blocking each otherreadsLargeStdoutWithoutDeadlockcontinues to pass./Scripts/lint.sh lintpasses (0 violations)Closes #189
🤖 Generated with Claude Code