
feat(6.31): boot-time orphan reap + cfcf server reap interactive verb #36

Merged
fstamatelopoulos merged 1 commit into main from iteration-6/orphan-reaper-6.31
May 8, 2026


Summary

Closes the hard-crash hole left by v0.20.0's signal-handler-based cleanup. When the cfcf server dies via SIGKILL or an OS panic — bypassing the SIGINT/SIGTERM handlers in start.ts — its agent children get reparented to PID 1 and keep running, tying up ollama's model runner for up to 10 minutes per orphan. v0.20.0 fixed clean-stop reaping; this PR closes the post-crash + manual-recovery cases. Marks item 6.31 as ✅ shipped in docs/plan.md.

  • Boot-time auto-reap: startServer() scans for orphans on every start and kills any it finds (best-effort, never blocks boot).
  • cfcf server reap — new interactive verb: scans, prints candidates, asks Kill these N process(es)? [y/N]. Empty case prints No zombie agent processes detected. and exits. Supports -y for non-interactive use. Runs without the cfcf server.
  • New packages/core/src/orphan-reaper.ts with three conjoined filters (PPID==1 + same effective user + cfcf-spawn command shape). Matchers are tight enough that hand-typed agent commands from another shell would not match (cfcf always pairs claude -p with --dangerously-skip-permissions, etc.).
  • 25 unit tests covering every cfcf-spawn pattern, negative cases (interactive claude, ollama serve/pull, unrelated commands), parser robustness on malformed ps output, each filter in isolation, and the SIGTERM→SIGKILL flow with mocked process.kill.
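
The three conjoined filters can be sketched roughly as follows. This is a minimal illustration, not the code in orphan-reaper.ts: the `PsRow` shape and the names `parsePsSketch`, `looksLikeCfcfSpawn`, and `filterOrphansSketch` are hypothetical, and the matcher covers only the one command shape this PR spells out (`claude -p` paired with `--dangerously-skip-permissions`).

```typescript
// Hypothetical row shape; the module's real types are not shown in this PR.
interface PsRow {
  pid: number;
  ppid: number;
  user: string;
  etime: string;
  command: string;
}

// Parse `ps -eo pid,ppid,user,etime,command` output. The header line and any
// malformed line fail the regex (no leading numeric PID/PPID) and are skipped.
function parsePsSketch(output: string): PsRow[] {
  const rows: PsRow[] = [];
  for (const line of output.split("\n")) {
    const m = /^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(.+)$/.exec(line);
    if (!m) continue;
    rows.push({
      pid: Number(m[1]),
      ppid: Number(m[2]),
      user: m[3],
      etime: m[4],
      command: m[5].trim(),
    });
  }
  return rows;
}

// Illustrative shape matcher (assumption): only the `claude -p` +
// `--dangerously-skip-permissions` pairing named in the PR text is checked.
function looksLikeCfcfSpawn(command: string): boolean {
  const args = command.split(/\s+/);
  const bin = args[0]?.split("/").pop();
  return (
    bin === "claude" &&
    args.includes("-p") &&
    args.includes("--dangerously-skip-permissions")
  );
}

// All three filters must hold: reparented to PID 1, same user, cfcf-spawn shape.
function filterOrphansSketch(rows: PsRow[], currentUser: string): PsRow[] {
  return rows.filter(
    (r) => r.ppid === 1 && r.user === currentUser && looksLikeCfcfSpawn(r.command),
  );
}
```

Because the filters are conjoined, a hand-typed `claude -p hello` with PPID 1 is rejected by the shape check, and a correctly shaped command that still has a live parent is rejected by the PPID check.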

Test plan

  • bun run typecheck — clean
  • bun test packages/core — 700/700 pass (orphan-reaper.test.ts: 25/25)
  • Smoke: cfcf server reap on a clean machine → No zombie agent processes detected. exits 0
  • Smoke: cfcf server reap --help shows correct flags
  • Manual: kill server with SIGKILL while a loop is running, restart, verify orphans are reaped at boot
  • Manual: cfcf server reap interactive y/N flow against real orphans

Notes

  • The pre-existing packages/server/src/app.test.ts config-merge failure on main is not caused by this work (verified via git stash && bun test); it is tracked as a separate concern.
  • 6.30 + 6.32 entries unchanged here — model-compatibility and opencode-hang work belongs in their own follow-up PRs once we have the test data we're collecting now.

Closes the hard-crash hole left by v0.20.0's signal-handler-based
cleanup: when the cfcf server dies via SIGKILL or an OS panic, its
agent children get reparented to PID 1 and keep running, tying up
ollama's model runner for up to 10 minutes per orphan.

New `packages/core/src/orphan-reaper.ts` module:
- `findOrphanAgentProcesses()` scans `ps -eo pid,ppid,user,etime,command`
  with three conjoined filters: PPID==1 (orphan signature) + same
  effective user + cfcf-spawn command shape. The shape matchers are
  tight enough that a hand-typed `claude -p` from another shell would
  not match (cfcf always pairs `-p` with `--dangerously-skip-permissions`).
- `reapOrphans()` mirrors process-manager.ts's killProcessTree: group
  SIGTERM, 1.5s grace, group SIGKILL, with direct-PID fallback when
  the group target throws ESRCH.
- `classifyCommand`, `parsePsOutput`, `filterOrphans`, `formatOrphanLine`
  exported as pure helpers for unit testing.
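
The escalation described for `reapOrphans()` might look something like the sketch below. It is an illustration under stated assumptions, not the module's code: `KillFn`, `signalGroupThenPid`, and `reapSketch` are hypothetical names, the kill function is injected so it can be mocked, and the sketch assumes each orphan leads its own process group so that the negated PID addresses that group.

```typescript
// Injected kill function (hypothetical type) so the flow can be tested
// without sending real signals.
type KillFn = (pid: number, signal: "SIGTERM" | "SIGKILL") => void;

function errCode(e: unknown): string | undefined {
  return (e as { code?: string } | null)?.code;
}

// Signal the process group first (negative PID). If the group target throws
// ESRCH, fall back to signalling the PID directly. Returns true if a signal
// was delivered without an error.
function signalGroupThenPid(
  kill: KillFn,
  pid: number,
  signal: "SIGTERM" | "SIGKILL",
): boolean {
  try {
    kill(-pid, signal);
    return true;
  } catch (e) {
    if (errCode(e) !== "ESRCH") return false; // e.g. EPERM: give up on this PID
  }
  try {
    kill(pid, signal);
    return true;
  } catch {
    return false; // process already gone or not signalable
  }
}

// SIGTERM everything, wait out the grace period, then SIGKILL. This sketch
// sends SIGKILL unconditionally and relies on errors being swallowed for
// processes that already exited; whether the real module checks liveness
// first is not stated in the PR.
async function reapSketch(
  pids: number[],
  kill: KillFn,
  graceMs = 1500,
): Promise<void> {
  for (const pid of pids) signalGroupThenPid(kill, pid, "SIGTERM");
  await new Promise<void>((resolve) => setTimeout(resolve, graceMs));
  for (const pid of pids) signalGroupThenPid(kill, pid, "SIGKILL");
}
```

Under Node or Bun, `process.kill` can be supplied as the kill function, for example `reapSketch(pids, (pid, sig) => process.kill(pid, sig))`.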

Wired into:
- `packages/server/src/start.ts`: boot-time auto-reap after the
  stale-history-event cleanup. Best-effort — a scan failure logs and
  continues, never blocks server boot.
- `packages/cli/src/commands/server.ts`: new `cfcf server reap` verb
  that combines list + interactive y/N kill in a single mental model.
  Empty-state prints "No zombie agent processes detected." and exits.
  Supports `-y / --yes` for non-interactive use. Calls core directly
  — does NOT require the cfcf server to be running.
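
Two behaviours from the wiring above lend themselves to a small sketch: the default-No confirmation and the best-effort boot step. Both helpers are hypothetical (`isYes` and `bestEffort` do not appear in this PR), and accepting `yes` in addition to `y` is an assumption about the prompt.

```typescript
// Default-No confirmation for a "[y/N]" prompt: only an explicit y/yes
// (case-insensitive, surrounding whitespace ignored) confirms. An empty
// answer, i.e. just pressing Enter, counts as No.
function isYes(answer: string): boolean {
  return /^(y|yes)$/i.test(answer.trim());
}

// Best-effort wrapper: run a task, log any failure, and never rethrow, so
// the caller (for example a server boot sequence) always continues.
async function bestEffort(
  label: string,
  task: () => Promise<void>,
  log: (message: string) => void,
): Promise<boolean> {
  try {
    await task();
    return true;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log(`${label} failed: ${reason}`);
    return false;
  }
}
```

A `-y / --yes` flag would simply skip the prompt and take the confirmed path, while a boot-time caller would wrap the scan-and-kill step in `bestEffort` so that a failing `ps` invocation only produces a log line.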

Tests (25 in `orphan-reaper.test.ts`): every cfcf-spawn pattern + each
negative case (interactive claude, ollama serve/pull, unrelated
commands), parser robustness on malformed input, each filter in
isolation, the full SIGTERM-then-SIGKILL flow with mocked process.kill,
and the group-then-direct fallback.

docs/plan.md: 6.31 marked ✅ shipped post-v0.20.0; iter-6 active-set
callout updated. The pre-existing `app.test.ts` config-merge failure
on main is NOT caused by this work (verified via `git stash && bun test`)
and is tracked as a separate concern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fstamatelopoulos fstamatelopoulos merged commit c1b6c8a into main May 8, 2026
3 checks passed
@fstamatelopoulos fstamatelopoulos deleted the iteration-6/orphan-reaper-6.31 branch May 8, 2026 23:00