fix: handle zombie jobs when process crashes but status remains "running"#218

Open
D2758695161 wants to merge 3 commits into openai:main from D2758695161:feat/zombie-job-fix

Conversation

@D2758695161

Summary

When a Codex task process crashes or is killed externally, its job record remains in status "running" permanently. All subsequent task calls in the same Claude session fail with:

Task {job-id} is still running. Use /codex:status before continuing it.

This effectively blocks all Codex usage in that session.

Fix

Before blocking on an active task, we now verify the process is actually alive using process.kill(pid, 0). If the process is dead but the job status is still "running", we mark the job as "failed" via upsertJob() and clean up the zombie state.
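The liveness probe can be sketched as follows (a minimal Node.js illustration, not the PR's exact code; `isProcessAlive` is a hypothetical helper name). Signal `0` performs the existence/permission check without delivering a signal:

```javascript
// Illustrative sketch of the liveness check described above.
function isProcessAlive(pid) {
  if (!Number.isInteger(pid) || pid <= 0) return false; // guard invalid/stale pids
  try {
    // Signal 0: checks that the process exists without signaling it.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // ESRCH: no such process. EPERM: the process exists but is not
    // signalable by this user, so it should count as alive.
    return err.code === "EPERM";
  }
}
```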

Root Cause

resolveLatestTrackedTaskThread() was checking only the status field in the job record, not whether the actual process (stored in job.pid) was alive.

Testing

  • Start a Codex task
  • Kill the task process externally (e.g., kill -9 <pid>)
  • Verify the job status is updated to "failed"
  • Verify subsequent task calls are NOT blocked

Fixes: #202
Fixes: #216

fix: handle zombie jobs when process crashes but status remains "running"

Before this fix, if a Codex task process crashed or was killed externally,
the job record would remain in status "running" permanently, blocking
all subsequent task calls.

Now we check if the PID is actually alive using process.kill(pid, 0).
If the process is dead but status is still "running", we mark the job
as "failed" via upsertJob() and clean up the zombie state.

Fixes: openai#202

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4075eb3d8f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

} else {
throw new Error(`Task ${activeTask.id} is still running. Use /codex:status before continuing it.`);
}
return null;

P1 Badge Continue thread lookup after cleaning zombie task

After the zombie cleanup path marks activeTask as failed, the function immediately returns null, which skips the normal fallback logic (trackedTask lookup and findLatestTaskThread). In a task --resume-last flow where a stale running job is cleaned up but an earlier resumable thread exists, this now raises "No previous Codex task thread was found" instead of resuming, so the recovery path still breaks for resumed tasks.
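The shape of the fix this comment asks for can be sketched as follows. This is illustrative only: `markJobFailed`, `findTrackedTask`, and `findLatestTaskThread` are hypothetical stand-ins for the repo's real helpers, injected here so the sketch is self-contained.

```javascript
// After cleaning up a zombie, fall through to the normal fallback
// lookups instead of returning null immediately.
function resolveLatestTrackedTaskThread(activeTask, deps) {
  const { isProcessAlive, markJobFailed, findTrackedTask, findLatestTaskThread } = deps;
  if (activeTask && activeTask.status === "running") {
    if (isProcessAlive(activeTask.pid)) {
      throw new Error(`Task ${activeTask.id} is still running. Use /codex:status before continuing it.`);
    }
    // Zombie: mark it failed, then keep resolving rather than bailing out.
    markJobFailed(activeTask);
  }
  return findTrackedTask() ?? findLatestTaskThread();
}
```

With this shape, a `task --resume-last` flow that sweeps a stale job can still resume an earlier thread.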


model: request.model,
effort: request.effort,
sandbox: request.write ? "workspace-write" : "read-only",
sandbox: request.fullAccess ? "danger-full-access" : (request.write ? "workspace-write" : "read-only"),

P2 Badge Wire --full-access through task request pipeline

The sandbox selection now checks request.fullAccess, but handleTask never reads options["full-access"] and never passes a fullAccess field into executeTaskRun/buildTaskRequest, so this condition is always false. As a result, the newly accepted --full-access flag is silently ignored and users cannot actually request danger-full-access mode.
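A minimal sketch of the wiring this comment describes, with hypothetical function and option names (the real `handleTask`/`buildTaskRequest`/`executeTaskRun` signatures are not shown in this PR):

```javascript
// handleTask would read options["full-access"] and carry it into the
// request so the sandbox decision below can actually see it.
function buildTaskRequest(options) {
  return {
    model: options.model,
    effort: options.effort,
    write: Boolean(options.write),
    fullAccess: Boolean(options["full-access"]),
  };
}

function selectSandbox(request) {
  if (request.fullAccess) return "danger-full-access";
  return request.write ? "workspace-write" : "read-only";
}
```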


Fixes openai#216. When a Codex task process crashes/is killed externally,
the job record stays in state.json with status:"running".

This change:
- Adds isProcessAlive(pid) using process.kill(pid, 0)
- Adds sweepZombieJobs(cwd) that marks dead PIDs as failed
- All job status reads now clean up zombie entries automatically
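The sweep described in this commit message could look roughly like the following. The liveness check is injected so the sketch stays pure and testable; field names follow the commit message, but this is not the PR's actual implementation.

```javascript
// Mark any "running" job whose process is gone as failed.
function sweepZombieJobs(jobs, isAlive, now = new Date().toISOString()) {
  for (const job of jobs) {
    if (job.status === "running" && !isAlive(job.pid)) {
      job.status = "failed";
      job.endedAt = now;
      job.error = "Process died unexpectedly — marked failed by zombie sweep";
    }
  }
  return jobs;
}
```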

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49301547f9


process.kill(activeTask.pid, 0);
} catch (err) {
// Process is dead but job status wasn't updated — mark it as failed and clean up
if (err.code === "ESRCH" || err.code === "EPERM") {

P2 Badge Treat EPERM as active when probing task PID

The zombie cleanup branch currently treats EPERM from process.kill(pid, 0) as if the process were dead, but EPERM means the PID exists and is just not signalable by this user. In that case this path will incorrectly mark an active task as failed and allow another task/resume flow to start concurrently. This is especially risky in mixed-privilege environments (e.g., task started under a different uid).


Comment on lines +48 to +50
job.status = "failed";
job.endedAt = now;
job.error = "Process died unexpectedly — marked failed by zombie sweep";

P2 Badge Persist standard failure fields in zombie sweep

When a zombie job is marked failed here, the code writes endedAt/error but does not set the standard failure fields used elsewhere (phase, pid, completedAt, errorMessage). As a result, downstream status/result rendering can show stale runtime metadata and omit the failure reason/timestamp for swept jobs. The zombie path should emit the same failure shape as other failure updates.
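One way to satisfy this is a single helper used by every failure path, so swept jobs carry the same fields as ordinary failures. The field names below come from the review comment; the helper itself is a hypothetical sketch, not the repo's code.

```javascript
// Emit the standard failure shape regardless of which path failed the job.
function markJobFailed(job, errorMessage, now = new Date().toISOString()) {
  job.status = "failed";
  job.phase = "failed";
  job.pid = null; // the process is gone; drop stale runtime metadata
  job.completedAt = now;
  job.endedAt = now;
  job.errorMessage = errorMessage;
  job.error = errorMessage; // keep the legacy field in sync
  return job;
}
```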




Development

Successfully merging this pull request may close these issues.

Zombie running job blocks all subsequent task calls when process crashes