fix: handle zombie jobs when process crashes but status remains "running" #218
D2758695161 wants to merge 3 commits into openai:main
Conversation
Before this fix, if a Codex task process crashed or was killed externally, the job record would remain in status "running" permanently, blocking all subsequent task calls. Now we check whether the PID is actually alive using process.kill(pid, 0). If the process is dead but the status is still "running", we mark the job as "failed" via upsertJob() and clean up the zombie state.

Fixes: openai#202
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4075eb3d8f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```js
} else {
  throw new Error(`Task ${activeTask.id} is still running. Use /codex:status before continuing it.`);
}
return null;
```
Continue thread lookup after cleaning zombie task
After the zombie cleanup path marks activeTask as failed, the function immediately returns null, which skips the normal fallback logic (trackedTask lookup and findLatestTaskThread). In a task --resume-last flow where a stale running job is cleaned up but an earlier resumable thread exists, this now raises "No previous Codex task thread was found" instead of resuming, so the recovery path still breaks for resumed tasks.
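The restructuring this review suggests could look like the sketch below: sweep the zombie, then fall through to the normal lookup chain instead of returning null. The helper names (markJobFailed, the findLatestTaskThread fallback) and job shape are assumptions modeled on the snippets in this PR, not the repository's actual code.

```javascript
// Stub helpers standing in for this repo's real functions (assumptions):
function isProcessAlive(pid) {
  try { process.kill(pid, 0); return true; }
  catch (err) { return err.code === "EPERM"; } // EPERM: exists, not signalable
}
function markJobFailed(job) { job.status = "failed"; job.endedAt = Date.now(); }

// Restructured lookup: clean up the zombie, then continue to the fallbacks.
function resolveResumableThread(activeTask, trackedTask, findLatestTaskThread) {
  if (activeTask && activeTask.status === "running") {
    if (isProcessAlive(activeTask.pid)) {
      throw new Error(`Task ${activeTask.id} is still running. Use /codex:status before continuing it.`);
    }
    markJobFailed(activeTask); // zombie swept; do NOT return null here
  }
  // A stale running job should not hide earlier resumable threads.
  if (trackedTask) return trackedTask.threadId;
  return findLatestTaskThread(); // last-resort lookup still runs after a sweep
}
```

With this shape, a task --resume-last flow that sweeps a stale job still reaches findLatestTaskThread and can resume an earlier thread.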
```diff
  model: request.model,
  effort: request.effort,
- sandbox: request.write ? "workspace-write" : "read-only",
+ sandbox: request.fullAccess ? "danger-full-access" : (request.write ? "workspace-write" : "read-only"),
```
Wire --full-access through task request pipeline
The sandbox selection now checks request.fullAccess, but handleTask never reads options["full-access"] and never passes a fullAccess field into executeTaskRun/buildTaskRequest, so this condition is always false. As a result, the newly accepted --full-access flag is silently ignored and users cannot actually request danger-full-access mode.
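A minimal sketch of the wiring this review asks for: the option must be read and threaded into the request before the sandbox ternary can ever see it. The option key ("full-access") and field names are assumptions based on the diff above.

```javascript
// Hypothetical request builder: reads the CLI option and carries it into the
// request object, so the sandbox selection below it is no longer dead code.
function buildTaskRequest(options) {
  const request = {
    model: options.model,
    effort: options.effort,
    write: Boolean(options.write),
    fullAccess: Boolean(options["full-access"]), // previously never read
  };
  request.sandbox = request.fullAccess
    ? "danger-full-access"
    : request.write
      ? "workspace-write"
      : "read-only";
  return request;
}
```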
Fixes openai#216. When a Codex task process crashes/is killed externally, the job record stays in state.json with status: "running". This change:
- Adds isProcessAlive(pid) using process.kill(pid, 0)
- Adds sweepZombieJobs(cwd) that marks dead PIDs as failed
- All job status reads now clean up zombie entries automatically
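The sweep described in that commit message could be sketched as follows. This is an illustration only: the real sweepZombieJobs reads and writes state.json through this repo's helpers, which are replaced here by an in-memory job array, and the EPERM handling reflects the reviewer's later note rather than the commit as pushed.

```javascript
// Probe liveness without sending a real signal; EPERM means the PID exists
// but belongs to another user, so it still counts as alive.
function isProcessAlive(pid) {
  try { process.kill(pid, 0); return true; }
  catch (err) { return err.code === "EPERM"; }
}

// Mark any "running" job whose process is gone as failed (zombie sweep).
function sweepZombieJobs(jobs) {
  const now = new Date().toISOString();
  for (const job of jobs) {
    if (job.status === "running" && !isProcessAlive(job.pid)) {
      job.status = "failed";
      job.endedAt = now;
      job.error = "Process died unexpectedly — marked failed by zombie sweep";
    }
  }
  return jobs;
}
```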
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 49301547f9
```js
  process.kill(activeTask.pid, 0);
} catch (err) {
  // Process is dead but job status wasn't updated — mark it as failed and clean up
  if (err.code === "ESRCH" || err.code === "EPERM") {
```
Treat EPERM as active when probing task PID
The zombie cleanup branch currently treats EPERM from process.kill(pid, 0) as if the process were dead, but EPERM means the PID exists and is just not signalable by this user. In that case this path will incorrectly mark an active task as failed and allow another task/resume flow to start concurrently. This is especially risky in mixed-privilege environments (e.g., task started under a different uid).
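The corrected probe per this review note would treat only ESRCH as dead. A sketch, with the helper name assumed from the commit message:

```javascript
// Signal 0 performs an existence/permission check without delivering a signal.
function isProcessAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    if (err.code === "EPERM") return true; // exists under another uid: alive
    return false;                          // ESRCH: no such process
  }
}
```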
```js
job.status = "failed";
job.endedAt = now;
job.error = "Process died unexpectedly — marked failed by zombie sweep";
```
Persist standard failure fields in zombie sweep
When a zombie job is marked failed here, the code writes endedAt/error but does not set the standard failure fields used elsewhere (phase, pid, completedAt, errorMessage). As a result, downstream status/result rendering can show stale runtime metadata and omit the failure reason/timestamp for swept jobs. The zombie path should emit the same failure shape as other failure updates.
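Emitting the full failure shape this review asks for might look like the sketch below. The extra field names (phase, completedAt, errorMessage) come from the review comment itself; the overall record layout is an assumption, not the repo's actual schema.

```javascript
// Write the same failure fields other failure paths write, so downstream
// status/result rendering sees no stale runtime metadata for swept jobs.
function markSweptJobFailed(job, now = new Date().toISOString()) {
  job.status = "failed";
  job.phase = "failed";         // assumed phase value used by other failure updates
  job.pid = null;               // clear stale runtime metadata
  job.endedAt = now;
  job.completedAt = now;        // failure timestamp for result rendering
  job.error = "Process died unexpectedly — marked failed by zombie sweep";
  job.errorMessage = job.error; // keep both error fields consistent
  return job;
}
```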
Summary
When a Codex task process crashes or is killed externally, its job record remains in status "running" permanently. All subsequent task calls in the same Claude session then fail with the "Task ... is still running" error, which effectively blocks all Codex usage in that session.
Fix
Before blocking on an active task, we now verify the process is actually alive using process.kill(pid, 0). If the process is dead but the job status is still "running", we mark the job as "failed" via upsertJob() and clean up the zombie state.
Root Cause
resolveLatestTrackedTaskThread() was checking only the status field in the job record, not whether the actual process (stored in job.pid) was alive.
Testing
Fixes: #202
Fixes: #216