fix(container-runner): self-heal missing image before spawn by matmartinez · Pull Request #2620 · nanocoai/nanoclaw

matmartinez · 2026-05-26T05:20:03Z

What

Adds a docker image inspect check at the start of spawnContainer. When the agent image is missing, the host rebuilds it before issuing docker run instead of crash-looping.

Why

Anyone running NanoClaw alongside Dokploy hits this — Dokploy ships with "Daily Docker Cleanup" enabled by default (Settings → Server). The cleanup runs, among others:

docker image prune --all --force
docker system prune --all --force

Both reap any tagged image whose containers aren't currently running. NanoClaw containers are --rm and short-lived, so the agent image is almost always evictable at cleanup time. Result: after Dokploy's nightly run, every wake exits with code=125 (Unable to find image '<tag>' locally) and the host crash-loops forever until someone manually re-runs container/build.sh.

Same hazard for users who run docker system prune -a manually, have a separate cron, or hit Docker Desktop's disk-pressure GC.

Related: #2378 / #2379 already track "container image is a fragile assumption the host doesn't defend against" — this patch addresses both for the missing-image case.

How it works

src/container-runtime.ts — new imageExists(tag) helper using docker image inspect.
src/container-runner.ts:
- ensureImage(tag, agentGroupId) — per-tag mutex via imageRebuildLocks so concurrent wakes share one build.
- rebuildImage — base image via container/build.sh, per-group via existing buildAgentGroupImage. If both are missing, base goes first.
- One-line check in spawnContainer right after containerConfig is materialized.
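
A minimal sketch of what an `imageExists(tag)` helper could look like, assuming it shells out to `docker image inspect` and treats a zero exit code as "present". The `Runner` type, the `runDocker` helper, and the injectable `run` parameter are hypothetical additions for illustration (they let the check be exercised without a Docker daemon); the PR's actual helper may differ.

```typescript
import { execFile } from "node:child_process";

// Hypothetical: resolves with the exit code of a command and never rejects.
type Runner = (cmd: string, args: string[]) => Promise<number>;

const runDocker: Runner = (cmd, args) =>
  new Promise((resolve) => {
    execFile(cmd, args, (error) => {
      if (!error) return resolve(0);
      // error.code is numeric for a non-zero exit and a string (e.g. "ENOENT")
      // when the binary could not be started; map the latter to -1.
      resolve(typeof error.code === "number" ? error.code : -1);
    });
  });

// Assumed signature: the PR only names imageExists(tag); `run` is added here.
export async function imageExists(
  tag: string,
  run: Runner = runDocker,
): Promise<boolean> {
  // `docker image inspect <tag>` exits 0 only when the image is available locally.
  return (await run("docker", ["image", "inspect", tag])) === 0;
}
```

Note that in this sketch any non-zero result, including an unreachable daemon, reads as "missing", so a caller would attempt a rebuild in that case as well.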

Common case (image present): one extra image inspect per spawn (~10ms).
Missing case: synchronous rebuild, then spawn proceeds.
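
The two paths above, plus the per-tag mutex, can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the names `ensureImage` and `imageRebuildLocks` come from the description, but the `Deps` object and the exact signatures are hypothetical, with the existence check and the rebuild injected so the locking logic can be run on its own.

```typescript
// Hypothetical dependency bundle; the real code presumably calls
// imageExists and rebuildImage directly.
type Deps = {
  exists: (tag: string) => Promise<boolean>;
  rebuild: (tag: string, agentGroupId: string) => Promise<void>;
};

// One in-flight rebuild promise per image tag.
const imageRebuildLocks = new Map<string, Promise<void>>();

export async function ensureImage(
  tag: string,
  agentGroupId: string,
  deps: Deps,
): Promise<void> {
  // A rebuild for this tag is already running: share it.
  const inFlight = imageRebuildLocks.get(tag);
  if (inFlight) return inFlight;

  // Common case: image present, nothing to do.
  if (await deps.exists(tag)) return;

  // Another caller may have started a rebuild while we were inspecting.
  const raced = imageRebuildLocks.get(tag);
  if (raced) return raced;

  // Release the lock whether the build succeeds or fails, so a later
  // wake can retry after a failed build.
  const release = () => {
    imageRebuildLocks.delete(tag);
  };
  const build = Promise.resolve()
    .then(() => deps.rebuild(tag, agentGroupId))
    .then(release, (err) => {
      release();
      throw err;
    });
  imageRebuildLocks.set(tag, build);
  return build;
}
```

Storing the promise rather than a boolean flag means every concurrent caller awaits the same build and sees the same success or failure.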

How it was tested

End-to-end on macOS Docker Desktop with a real Dokploy-style prune:

1. Confirmed image present, host healthy
2. docker rmi <agent-image> to simulate Dokploy's cleanup
3. Sent a test iMessage

Logs:

01:01:30.588  Base container image rebuilt
01:01:30.624  Spawning container
...
01:02:43.094  Message delivered

Build clean (pnpm run build), tests pass (pnpm test — 328/328).

Known interaction (not caused by this patch)

The first spawn after rebuild was killed by host-sweep's claim-stuck check 5ms after spawn — because an orphan processing_ack claim from the previous (code=125-looping) container had been aging for 72s. Next sweep tick spawned cleanly and the message landed. This is a pre-existing bug: the host-sweep should treat a fresh spawn as a "reset" event for stale claims, but doesn't. The synchronous rebuild here just widens the window where it manifests.

Filing a separate issue for the host-sweep refinement — not in scope for this PR.

— tenazas, here. lived this one front to back and reviewed the patch on the way through.

When an external tool deletes the agent image, most commonly via Dokploy's "Daily Docker Cleanup" (`docker image prune --all --force` on a cron) or a manual `docker system prune -a`, every subsequent spawn exits with code=125 and the host crash-loops until somebody re-runs `container/build.sh`. The agent container is `--rm` and short-lived, so it almost never appears "in use" at cleanup time and is reaped eagerly.

Adds a ~10ms `docker image inspect` at the start of `spawnContainer`. When the image is missing, the host rebuilds before proceeding:

• base image (CONTAINER_IMAGE) → invoke `container/build.sh`
• per-group image → call existing `buildAgentGroupImage`
• if both are missing, base goes first

Concurrent wakes that all detect the same missing tag share one rebuild via `imageRebuildLocks` instead of racing N builds.

Common case (image present): one extra inspect call per spawn, ~10ms.
Missing case (post-prune): synchronous rebuild, then spawn proceeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

matmartinez marked this pull request as ready for review May 26, 2026 05:21

matmartinez requested review from gabi-simons and gavrielc as code owners May 26, 2026 05:21

matmartinez wants to merge 1 commit into nanocoai:main from matmartinez:fix/container-image-self-heal
