fix(container-runner): self-heal missing image before spawn#2620

Open
matmartinez wants to merge 1 commit into
nanocoai:mainfrom
matmartinez:fix/container-image-self-heal

@matmartinez

What

Adds a docker image inspect check at the start of spawnContainer. When the agent image is missing, the host rebuilds it before issuing docker run instead of crash-looping.

Why

Anyone running NanoClaw alongside Dokploy hits this — Dokploy ships with "Daily Docker Cleanup" enabled by default (Settings → Server). The cleanup runs, among others:

docker image prune --all --force
docker system prune --all --force

Both remove any tagged image that no existing container references. NanoClaw containers run with --rm and are short-lived, so the agent image is almost always evictable at cleanup time. Result: after Dokploy's nightly run, every wake exits with code=125 (Unable to find image '<tag>' locally) and the host crash-loops until someone manually re-runs container/build.sh.

Same hazard for users who run docker system prune -a manually, have a separate cron, or hit Docker Desktop's disk-pressure GC.

Related: #2378 / #2379 already track "container image is a fragile assumption the host doesn't defend against" — this patch addresses both for the missing-image case.

How it works

  • src/container-runtime.ts — new imageExists(tag) helper using docker image inspect.
  • src/container-runner.ts:
    • ensureImage(tag, agentGroupId) — per-tag mutex via imageRebuildLocks so concurrent wakes share one build.
    • rebuildImage — base image via container/build.sh, per-group via existing buildAgentGroupImage. If both are missing, base goes first.
    • One-line check in spawnContainer right after containerConfig is materialized.
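
To illustrate the locking and ordering described above, here is a minimal sketch, not the code in this PR. It assumes the existence check and the build step are injected as functions (`exists` and `build` are hypothetical parameters, as is `baseTag`), whereas the real `ensureImage` takes `(tag, agentGroupId)`. The point is only that concurrent callers for the same tag await one shared promise, and that a missing base image is built before the per-group image.

```typescript
// Illustrative sketch only. The parameter names and injected functions are
// assumptions made for this example, not the signatures used in the patch.
type Exists = (tag: string) => Promise<boolean>;
type Builder = (tag: string) => Promise<void>;

// One in-flight rebuild per tag; later callers await the same promise.
const imageRebuildLocks = new Map<string, Promise<void>>();

async function ensureImage(
  tag: string,
  baseTag: string,
  exists: Exists,
  build: Builder,
): Promise<void> {
  // Common case: image present, nothing to do.
  if (await exists(tag)) return;

  // No await between get() and set(), so check-and-set cannot interleave
  // with another caller on the single-threaded event loop.
  let pending = imageRebuildLocks.get(tag);
  if (!pending) {
    pending = (async () => {
      // If the per-group image is missing and the base is too, base goes first.
      if (tag !== baseTag && !(await exists(baseTag))) {
        await build(baseTag);
      }
      await build(tag);
    })().finally(() => {
      // Release the lock on success or failure so a later wake can retry.
      imageRebuildLocks.delete(tag);
    });
    imageRebuildLocks.set(tag, pending);
  }
  await pending;
}
```

In this sketch a failed build rejects for every caller that shared it, and the lock is cleared so the next wake starts a fresh attempt.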

Common case (image present): one extra image inspect per spawn (~10ms).
Missing case: synchronous rebuild, then spawn proceeds.
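
For reference, the existence check could look roughly like the sketch below. This is an assumption about shape, not the helper as committed: `RunDocker` and `runDocker` are hypothetical names, and the runner is injectable only so the check can be exercised without a Docker daemon. It relies on `docker image inspect <tag>` exiting non-zero when the image is not present locally.

```typescript
import { execFile } from "node:child_process";

// Hypothetical runner type: resolves with the exit code of `docker <args>`.
type RunDocker = (args: string[]) => Promise<number>;

const runDocker: RunDocker = (args) =>
  new Promise((resolve, reject) => {
    execFile("docker", args, (error) => {
      if (!error) return resolve(0);
      // A numeric code is the process exit status. Anything else (for
      // example ENOENT when the docker binary is missing) is a spawn
      // failure and is surfaced as an error instead of "image missing".
      if (typeof error.code === "number") return resolve(error.code);
      reject(error);
    });
  });

// True when `docker image inspect <tag>` exits 0. Note that any non-zero
// exit is treated as "missing", so an unreachable daemon would also land here.
async function imageExists(
  tag: string,
  run: RunDocker = runDocker,
): Promise<boolean> {
  return (await run(["image", "inspect", tag])) === 0;
}
```

Because the check shells out once per spawn, its cost is the process start plus one daemon round trip, which is consistent with the ~10ms figure above.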

How it was tested

End-to-end on macOS Docker Desktop, simulating a Dokploy-style prune:

  1. Confirmed image present, host healthy
  2. docker rmi <agent-image> to simulate Dokploy's cleanup
  3. Sent a test iMessage
  4. Logs:
    01:01:30.588  Base container image rebuilt
    01:01:30.624  Spawning container
    ...
    01:02:43.094  Message delivered
    

Build clean (pnpm run build), tests pass (pnpm test — 328/328).

Known interaction (not caused by this patch)

The first spawn after the rebuild was killed by host-sweep's claim-stuck check 5ms after it started, because an orphan processing_ack claim from the previous (code=125-looping) container had been aging for 72s. The next sweep tick spawned cleanly and the message landed. This is a pre-existing bug: host-sweep should treat a fresh spawn as a "reset" event for stale claims, but doesn't. The synchronous rebuild here just widens the window in which it manifests.

Filing a separate issue for the host-sweep refinement — not in scope for this PR.


— tenazas, here. lived this one front to back and reviewed the patch on the way through.

When an external tool deletes the agent image — most commonly Dokploy's
"Daily Docker Cleanup" (`docker image prune --all --force` on a cron) or
a manual `docker system prune -a` — every subsequent spawn exits with
code=125 and the host crash-loops until somebody re-runs
`container/build.sh`. The agent container is `--rm` and short-lived, so
it almost never appears "in use" at cleanup time and is reaped eagerly.

Adds a ~10ms `docker image inspect` at the start of `spawnContainer`.
When the image is missing, the host rebuilds before proceeding:

  • base image (CONTAINER_IMAGE) → invoke `container/build.sh`
  • per-group image → call existing `buildAgentGroupImage`
  • if both are missing, base goes first

Concurrent wakes that all detect the same missing tag share one rebuild
via `imageRebuildLocks` instead of racing N builds.

Common case (image present): one extra inspect call per spawn, ~10ms.
Missing case (post-prune): synchronous rebuild, then spawn proceeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@matmartinez matmartinez marked this pull request as ready for review May 26, 2026 05:21