Fix #202: zombie job blocks subsequent task calls when process crashes #216

Open
D2758695161 wants to merge 2 commits into openai:main from D2758695161:fix/issue-202-zombie-job
@D2758695161

Problem

When a Codex task process crashes or is killed externally, its job record remains in status running permanently. All subsequent task calls in the same Claude session fail with:

Task {job-id} is still running. Use /codex:status before continuing it.

Root Cause

In resolveLatestTrackedTaskThread(), when an active task job is found, the function unconditionally throws an error if the job status is running, without checking whether the actual process is still alive.

Fix

In resolveLatestTrackedTaskThread(), before blocking on an active task:

  1. Check if the PID is alive using process.kill(pid, 0)
  2. If the process is dead (EPERM or process doesn't exist), mark the job as failed via upsertJob() and continue
  3. Only throw the blocking error if the process is actually alive

This ensures zombie jobs don't block subsequent task calls.
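
The liveness check in step 1 could look roughly like the sketch below. The helper name isProcessAlive and the handling of invalid PIDs are assumptions for illustration, not the PR's actual code. One caveat: in Node.js, process.kill(pid, 0) normally throws ESRCH when no such process exists, while EPERM usually means the process exists but cannot be signaled by the current user. This sketch therefore treats EPERM as alive, which differs from the wording in step 2.

```javascript
// Sketch only: the helper name and the EPERM handling are assumptions, not the PR's code.
function isProcessAlive(pid) {
  if (!Number.isInteger(pid) || pid <= 0) {
    return false; // no usable PID was recorded, so treat the job as not running
  }
  try {
    process.kill(pid, 0); // signal 0 only tests for existence; no signal is delivered
    return true;
  } catch (error) {
    // ESRCH: no such process. EPERM: the process exists but we may not signal it.
    return error.code === "EPERM";
  }
}

console.log(isProcessAlive(process.pid)); // the current process is alive
```

A PID can be reused by an unrelated process after a crash, so a check like this can report a dead job as alive. It cannot report a live job as dead, which keeps the guard conservative.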

Testing

Minimal targeted fix - no behavioral change for healthy jobs. Only affects the zombie job edge case where the process has died but the job status wasn't updated.

Fixes #202

…crashes

When a Codex task process crashes or is killed externally, its job
record remains in status running permanently. All subsequent task
calls in the same Claude session fail with Task {job-id} is still
running.

In resolveLatestTrackedTaskThread(), before blocking on an active
task, verify the PID is actually alive using process.kill(pid, 0).
If the process is dead, mark the job as failed and continue -
rather than throwing an error that blocks all subsequent task calls.

Fixes openai#202
D2758695161 requested a review from a team on April 13, 2026 17:31

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d44135bb5a

defaultPrompt: resumeThreadId ? DEFAULT_CONTINUE_PROMPT : "",
model: request.model,
effort: request.effort,
sandbox: request.fullAccess ? "danger-full-access" : (request.write ? "workspace-write" : "read-only"),

P2: Propagate --full-access into task requests

The new sandbox selection in executeTaskRun depends on request.fullAccess, but neither foreground nor background task request construction ever sets that field (they still pass only write). This makes --full-access a no-op: even when the flag is provided, request.fullAccess stays undefined and the sandbox falls back to workspace-write/read-only.
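
A minimal sketch of the propagation the review asks for is shown below. Only the sandbox expression mirrors the quoted diff line. The function names selectSandbox and buildTaskRequest and the shape of the options object are hypothetical.

```javascript
// Hypothetical helpers; only the sandbox expression is taken from the quoted diff line.
function selectSandbox(request) {
  return request.fullAccess ? "danger-full-access" : (request.write ? "workspace-write" : "read-only");
}

// Assumed shape of the parsed CLI options: { write: boolean, fullAccess: boolean }.
function buildTaskRequest(options) {
  return {
    write: Boolean(options.write),
    fullAccess: Boolean(options.fullAccess), // the field the review says is never set
  };
}

console.log(selectSandbox(buildTaskRequest({ fullAccess: true }))); // danger-full-access
console.log(selectSandbox({ write: true })); // workspace-write (fullAccess left undefined)
```

The second call shows the reported behavior: when fullAccess is never copied into the request, the expression can only yield workspace-write or read-only.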

Comment on lines +328 to +330
if (processAlive) {
  throw new Error(`Task ${activeTask.id} is still running. Use /codex:status before continuing it.`);
}

P1: Check all active tasks before allowing --resume-last

This guard now throws only when the first active task found is alive; if that first record is a zombie, the code marks it failed and immediately proceeds. Because only one active task is inspected, a second queued/running task in the same job list can be skipped, allowing --resume-last to proceed while another real task is still active.
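
One way to address this is to inspect every active task before deciding, as in the sketch below. The job fields (id, status, pid), the status names, and both function names are assumptions. The liveness check is passed in as a parameter so the logic can be exercised without real processes. The sketch also assumes a queued job has no process yet and therefore always counts as active.

```javascript
// Hypothetical sketch: job fields, status names, and function names are assumptions.
function partitionActiveTasks(jobs, isAlive) {
  const live = [];
  const zombies = [];
  for (const job of jobs) {
    if (job.status === "queued") {
      live.push(job); // assumed: queued jobs have no PID yet, so never treat them as zombies
    } else if (job.status === "running") {
      (isAlive(job.pid) ? live : zombies).push(job);
    }
  }
  return { live, zombies };
}

function assertNoLiveTask(jobs, isAlive) {
  const { live, zombies } = partitionActiveTasks(jobs, isAlive);
  if (live.length > 0) {
    throw new Error(`Task ${live[0].id} is still running. Use /codex:status before continuing it.`);
  }
  return zombies; // the caller can mark these as failed
}

const jobs = [
  { id: "a", status: "running", pid: 100 }, // zombie in this example
  { id: "b", status: "running", pid: 200 }, // still alive in this example
];
const fakeIsAlive = (pid) => pid === 200;
try {
  assertNoLiveTask(jobs, fakeIsAlive);
} catch (error) {
  console.log(error.message); // reports task b, not the zombie task a
}
```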

@D2758695161
Author

Claiming this issue.

Implementation: PR #218. The fix handles zombie jobs by checking whether the PID is alive with process.kill(pid, 0) before blocking. If the process is dead, it marks the job as failed via upsertJob().

@D2758695161
Author

Claiming this bug fix. The fix approach is clear: on job start, verify process liveness; if dead, upsertJob() with failed status. This prevents zombie jobs from blocking subsequent task calls.

D2758695161 added a commit to D2758695161/codex-plugin-cc that referenced this pull request Apr 14, 2026
Fixes openai#216. When a Codex task process crashes/is killed externally,
the job record stays in state.json with status:"running".

This change:
- Adds isProcessAlive(pid) using process.kill(pid, 0)
- Adds sweepZombieJobs(cwd) that marks dead PIDs as failed
- All job status reads now clean up zombie entries automatically
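
The sweep described in this commit message might be sketched as follows. This is not the referenced commit's code: the commit's sweepZombieJobs takes a cwd and works against state.json, whereas this illustrative version takes an in-memory job list and an injected liveness check, and the job fields are assumed.

```javascript
// Hypothetical sketch: operates on an in-memory list instead of state.json.
function sweepZombieJobs(jobs, isAlive) {
  let swept = 0;
  const next = jobs.map((job) => {
    if (job.status !== "running" || isAlive(job.pid)) {
      return job; // finished jobs and live jobs are left untouched
    }
    swept += 1;
    return { ...job, status: "failed" }; // copy rather than mutate the stored record
  });
  return { jobs: next, swept };
}

const result = sweepZombieJobs(
  [
    { id: "a", status: "running", pid: 100 },
    { id: "b", status: "completed", pid: 101 },
  ],
  () => false // pretend every recorded PID is dead
);
console.log(result.swept); // 1
```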

Development

Successfully merging this pull request may close these issues.

Zombie running job blocks all subsequent task calls when process crashes
