fix(container-runner): self-heal missing image before spawn #2620
Open
matmartinez wants to merge 1 commit into
When an external tool deletes the agent image, most commonly Dokploy's "Daily Docker Cleanup" (`docker image prune --all --force` on a cron) or a manual `docker system prune -a`, every subsequent spawn exits with code=125 and the host crash-loops until somebody re-runs `container/build.sh`. The agent container is `--rm` and short-lived, so it almost never appears "in use" at cleanup time and is reaped eagerly.

Adds a ~10ms `docker image inspect` at the start of `spawnContainer`. When the image is missing, the host rebuilds before proceeding:

• base image (CONTAINER_IMAGE) → invoke `container/build.sh`
• per-group image → call existing `buildAgentGroupImage`
• if both are missing, base goes first

Concurrent wakes that all detect the same missing tag share one rebuild via `imageRebuildLocks` instead of racing N builds.

Common case (image present): one extra inspect call per spawn, ~10ms.
Missing case (post-prune): synchronous rebuild, then spawn proceeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What

Adds a `docker image inspect` check at the start of `spawnContainer`. When the agent image is missing, the host rebuilds it before issuing `docker run` instead of crash-looping.

Why
Anyone running NanoClaw alongside Dokploy hits this — Dokploy ships with "Daily Docker Cleanup" enabled by default (Settings → Server). The cleanup runs, among others:
Both reap any tagged image whose containers aren't currently running. NanoClaw containers are `--rm` and short-lived, so the agent image is almost always evictable at cleanup time. Result: after Dokploy's nightly run, every wake exits with `code=125` (`Unable to find image '<tag>' locally`) and the host crash-loops forever until someone manually re-runs `container/build.sh`.

Same hazard for users who run `docker system prune -a` manually, have a separate cron, or hit Docker Desktop's disk-pressure GC.

Related: #2378 / #2379 already track "container image is a fragile assumption the host doesn't defend against"; this patch addresses both for the missing-image case.
How it works

• `src/container-runtime.ts`: new `imageExists(tag)` helper using `docker image inspect`.
• `src/container-runner.ts`:
  • `ensureImage(tag, agentGroupId)`: per-tag mutex via `imageRebuildLocks` so concurrent wakes share one build.
  • `rebuildImage`: base image via `container/build.sh`, per-group via existing `buildAgentGroupImage`. If both are missing, base goes first.
  • `spawnContainer`: runs the check right after `containerConfig` is materialized.

Common case (image present): one extra `image inspect` per spawn (~10ms).

Missing case: synchronous rebuild, then spawn proceeds.
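The check-then-rebuild flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `imageExists`, `ensureImage`, and `imageRebuildLocks` come from the description, but their bodies are assumptions, and the `rebuild` and `exists` parameters are hypothetical injection points added here so the sketch stays self-contained (the PR's `ensureImage` takes `(tag, agentGroupId)` instead).

```typescript
import { execFile } from "node:child_process";

// Hypothetical callback types, introduced only for this sketch.
type ImageProbe = (tag: string) => Promise<boolean>;
type ImageBuilder = (tag: string) => Promise<void>;

// Resolves true when `docker image inspect <tag>` exits 0.
// Any failure (missing image, docker not on PATH, daemon down) resolves false.
function imageExists(tag: string): Promise<boolean> {
  return new Promise((resolve) => {
    execFile("docker", ["image", "inspect", tag], (err) => resolve(err === null));
  });
}

// One in-flight rebuild promise per image tag.
const imageRebuildLocks = new Map<string, Promise<void>>();

async function ensureImage(
  tag: string,
  rebuild: ImageBuilder,
  exists: ImageProbe = imageExists,
): Promise<void> {
  // Common case: image present, nothing else to do.
  if (await exists(tag)) return;

  // Missing case: join the rebuild already in flight for this tag, or start one.
  // The get/set pair runs without an intervening await, so only one caller
  // can ever start the build for a given tag.
  let inFlight = imageRebuildLocks.get(tag);
  if (!inFlight) {
    inFlight = rebuild(tag).finally(() => imageRebuildLocks.delete(tag));
    imageRebuildLocks.set(tag, inFlight);
  }
  await inFlight;
}
```

Storing the promise itself in the map means every concurrent wake awaits the same build and sees the same success or failure. Clearing the entry in `finally` lets a later wake retry after a failed build rather than awaiting a stale rejection.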
How it was tested
End-to-end on macOS Docker Desktop with a real Dokploy-style prune:
• `docker rmi <agent-image>` to simulate Dokploy's cleanup

Build clean (`pnpm run build`), tests pass (`pnpm test`: 328/328).

Known interaction (not caused by this patch)
The first spawn after rebuild was killed by host-sweep's `claim-stuck` check 5ms after spawn, because an orphan `processing_ack` claim from the previous (code=125-looping) container had been aging for 72s. Next sweep tick spawned cleanly and the message landed. This is a pre-existing bug: the host-sweep should treat a fresh spawn as a "reset" event for stale claims, but doesn't. The synchronous rebuild here just widens the window where it manifests.

Filing a separate issue for the host-sweep refinement; not in scope for this PR.
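For the follow-up issue, one possible reading of "treat a fresh spawn as a reset" is to measure a claim's age from whichever is later, the claim time or the current container's spawn time. The sketch below is purely illustrative: the `Claim` shape, the `isClaimStuck` name, and the 60s threshold in the usage lines are all hypothetical, since the host-sweep code is not part of this PR.

```typescript
// Hypothetical claim record; the real host-sweep data model is not shown here.
interface Claim {
  kind: string; // e.g. "processing_ack"
  claimedAtMs: number; // when the claim was taken
}

// Age is counted from the later of claim time and spawn time, so a claim
// orphaned by a previous container starts a fresh clock at each new spawn.
function isClaimStuck(
  claim: Claim,
  spawnedAtMs: number,
  nowMs: number,
  stuckAfterMs: number,
): boolean {
  const effectiveStartMs = Math.max(claim.claimedAtMs, spawnedAtMs);
  return nowMs - effectiveStartMs > stuckAfterMs;
}

// Scenario from this PR: claim taken at t=0, new container spawned 72s later,
// sweep runs 5ms after the spawn. The 60s threshold is an assumed value.
const orphan: Claim = { kind: "processing_ack", claimedAtMs: 0 };
const stuckRightAfterSpawn = isClaimStuck(orphan, 72_000, 72_005, 60_000);
const stuckMuchLater = isClaimStuck(orphan, 72_000, 72_000 + 60_001, 60_000);
```

Under these assumptions the orphan claim no longer kills the fresh container 5ms in, but it is still flagged if the new container also fails to clear it within the threshold.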
— tenazas, here. lived this one front to back and reviewed the patch on the way through.