feat(run): add `--mode crane` — materialize evals at run time by elronbandel · Pull Request #126 · Exgentic/eval-containers

elronbandel · 2026-06-14T12:10:46Z

What

A new, additive deployment mode. Where --mode container runs the pre-built evals/<benchmark>--<agent> combination image, --mode crane runs one generic core/crane-runner image that materializes the eval at run time: it runs crane export on the per-axis benchmarks/<b> and agents/<a> (and models/<gateway>) images, unpacks them into a rootfs, and fuses them in-container with bwrap/chroot. No Docker daemon and no DinD are needed.

So the per-(benchmark, agent) combination matrix becomes an optional pre-bake, and per-task benchmarks (SWE-bench) run from one image: the runner pulls the per-task rootfs, so a single image fans out across tasks, which a per-task combination image cannot do. It is the single-container analog of the generic compose artifact, and goes one step further: it composes the axes at run time rather than pulling a pre-fused product.
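The run-time materialization described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the `<registry>/<axis>/<name>` reference layout, the `registry.example` host, the `my-gateway` name, and the `/rootfs` target are assumptions. The only external behavior relied on is that `crane export IMAGE -` writes the image's flattened filesystem as a tarball to stdout.

```rust
/// Build the per-axis image references for one eval.
/// The `<registry>/<axis>/<name>` layout is an assumption for illustration.
fn axis_images(registry: &str, benchmark: &str, agent: &str, gateway: Option<&str>) -> Vec<String> {
    let mut images = vec![
        format!("{}/benchmarks/{}", registry, benchmark),
        format!("{}/agents/{}", registry, agent),
    ];
    // The gateway axis is optional, mirroring "(+ models/<gateway>)".
    if let Some(g) = gateway {
        images.push(format!("{}/models/{}", registry, g));
    }
    images
}

/// Render the shell pipeline that unpacks one image's filesystem into `rootfs`.
fn export_pipeline(image: &str, rootfs: &str) -> String {
    format!("crane export {} - | tar -xf - -C {}", image, rootfs)
}

fn main() {
    // Hypothetical registry, gateway name, and rootfs path.
    for image in axis_images("registry.example", "swe-bench", "claude-code", Some("my-gateway")) {
        println!("{}", export_pipeline(&image, "/rootfs"));
    }
}
```

Each axis is unpacked into the same directory, which is what lets one generic runner image stand in for every pre-built combination image.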

Additive — nothing existing changes

The compose, container, and job modes are untouched. The combination matrix still builds and works; this change only makes it optional. Opt in with --mode crane; if the mode is unused, there is zero impact.

Changes

cli: Mode::Crane + run_crane (mirrors run_container; passes EVAL_BENCHMARK_ENV=per-task via is_per_task_by_name) + naming::crane_runner_image (+ unit test).
containers/core/crane-runner: the runner image (Dockerfile + materialize) and a daemonless proof (crane-poc.sh).
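A rough sketch of how the CLI pieces listed above might fit together. The names `Mode::Crane`, `crane_runner_image`, `is_per_task_by_name`, and `EVAL_BENCHMARK_ENV=per-task` come from this PR; their signatures, the `parse_mode` helper, the registry/tag arguments, and the body of `is_per_task_by_name` are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Compose,
    Container,
    Job,
    Crane,
}

/// Hypothetical parser for the `--mode` flag value.
fn parse_mode(s: &str) -> Option<Mode> {
    match s {
        "compose" => Some(Mode::Compose),
        "container" => Some(Mode::Container),
        "job" => Some(Mode::Job),
        "crane" => Some(Mode::Crane),
        _ => None,
    }
}

/// Stand-in: the real per-task lookup is not shown in the PR.
fn is_per_task_by_name(benchmark: &str) -> bool {
    benchmark == "swe-bench"
}

/// Name of the single generic runner image (registry and tag are assumed inputs).
fn crane_runner_image(registry: &str, tag: &str) -> String {
    format!("{}/core/crane-runner:{}", registry, tag)
}

/// Extra environment for crane mode: per-task benchmarks get the per-task marker.
fn crane_envs(benchmark: &str) -> Vec<(String, String)> {
    let mut envs = Vec::new();
    if is_per_task_by_name(benchmark) {
        envs.push(("EVAL_BENCHMARK_ENV".to_string(), "per-task".to_string()));
    }
    envs
}

fn main() {
    if parse_mode("crane") == Some(Mode::Crane) {
        println!("image: {}", crane_runner_image("registry.example", "latest"));
        println!("envs:  {:?}", crane_envs("swe-bench"));
    }
}
```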

Verified

cargo build + naming test green; dry-runs correct (shared-env; per-task adds EVAL_BENCHMARK_ENV=per-task; --agent/--local guards).
Daemonless mechanism (crane-poc.sh): crane pulls a rootfs + chroot/bwrap runs in it + agent edits the testbed — no daemon.
Real-image fusion: the real swe-bench /testbed (sympy repo) ⊎ real claude-code /opt/agent reproduce the combination layout (922 MB, byte-level).
Isolation survives the fusion (the key safety check): export preserves root-only modes/owners (/tasks 0600, /tests & /opt/gateway 0700, root) — so the existing gosu agent / env -i pipeline keeps answers unreadable. The invariant is reused, not reinvented.
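The isolation check above could be automated along these lines. This is a sketch under assumptions: the `/rootfs` default and the idea of running it as a post-extract step are illustrative, and "root-only" is taken to mean owner uid 0 with no group or other permission bits, which covers the 0600 and 0700 modes listed.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// True when the owner is root and no group/other permission bits are set.
fn is_root_only(mode: u32, uid: u32) -> bool {
    uid == 0 && (mode & 0o077) == 0
}

/// Inspect a path without following symlinks.
fn check_path(path: &Path) -> io::Result<bool> {
    let meta = fs::symlink_metadata(path)?;
    Ok(is_root_only(meta.mode(), meta.uid()))
}

fn main() {
    // Hypothetical rootfs location; pass a different one as the first argument.
    let rootfs = std::env::args().nth(1).unwrap_or_else(|| "/rootfs".to_string());
    for rel in ["tasks", "tests", "opt/gateway"] {
        let path = Path::new(&rootfs).join(rel);
        match check_path(&path) {
            Ok(true) => println!("{}: root-only", path.display()),
            Ok(false) => println!("{}: NOT root-only", path.display()),
            Err(e) => println!("{}: {}", path.display(), e),
        }
    }
}
```

Note that preserving root ownership requires the extraction itself to run as root, which matches the "extract as root" item in the definition of done.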

🚧 Draft — definition of done (statically verified)

First cut proves the fusion. To fully replace container mode for every benchmark / every user / all the time:

A. Functional (run at all)

Pull the 3rd axis (models/<gateway> → /opt/gateway)
Run the agent's install.sh (symlinks; light)
Bake otelcol + process-compose + gosu + /usr/local/bin/{run,write-result} into the runner; extract as root, then exec /entrypoint.sh /usr/local/bin/run
Emit /output/task/result.json (via baked write-result)

B. Correctness (or scores are silently invalid)

Isolation: extract as root + run agent via gosu agent/env -i (export preserves perms — verified ✓)
Arch match: runner + node + pulled rootfs (swe-bench is x86_64; runner built here is arm64)
Digest-pin pulls; record digests in the result
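Two of the correctness gates above, arch match and digest pinning, are simple enough to sketch as pre-flight checks. The function names are hypothetical and the PR does not prescribe where such checks would live. The mapping relies on Rust's `std::env::consts::ARCH` values (`x86_64`, `aarch64`) and the OCI architecture names (`amd64`, `arm64`).

```rust
/// True when the reference ends in `@sha256:` followed by 64 lowercase hex digits.
fn is_digest_pinned(image: &str) -> bool {
    match image.rsplit_once("@sha256:") {
        Some((repo, hex)) => {
            !repo.is_empty()
                && hex.len() == 64
                && hex.bytes().all(|b| b.is_ascii_digit() || (b'a'..=b'f').contains(&b))
        }
        None => false,
    }
}

/// Map a Rust target architecture name to the OCI image architecture name.
fn oci_arch(rust_arch: &str) -> Option<&'static str> {
    match rust_arch {
        "x86_64" => Some("amd64"),
        "aarch64" => Some("arm64"),
        _ => None,
    }
}

/// True when the host can natively run a rootfs built for `image_arch`.
fn arch_matches(host_arch: &str, image_arch: &str) -> bool {
    oci_arch(host_arch) == Some(image_arch)
}

fn main() {
    let host = std::env::consts::ARCH;
    // swe-bench images are x86_64 (amd64) per the PR description.
    println!("host {} can run amd64 rootfs: {}", host, arch_matches(host, "amd64"));
    println!("tag-only ref pinned: {}", is_digest_pinned("benchmarks/swe-bench:latest"));
}
```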

C. Distribution (every end user)

Per-axis images published & pullable
Registry auth
Insecure/mirror support

D. Operational (all the time)

Ephemeral-disk sizing for the GB rootfs
bwrap→chroot fallback soundness
Caching/mirror for big pulls
Cleanup + pull retries
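For the pull-retry item, one common approach is exponential backoff around the pull step. The helper below is a generic sketch, not something the PR implements; `retry` and its parameters are hypothetical, and the closure stands in for whatever actually performs the pull.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Call `op` up to `attempts` times, doubling the delay after each failure.
/// `op` receives the 1-based attempt number.
fn retry<T, E>(
    attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error.
            Err(e) if attempt >= attempts => return Err(e),
            Err(_) => {
                sleep(base_delay * 2u32.pow(attempt - 1));
                attempt += 1;
            }
        }
    }
}

fn main() {
    // Simulated pull that fails twice before succeeding.
    let result: Result<&str, String> = retry(4, Duration::from_millis(10), |n| {
        if n < 3 {
            Err(format!("attempt {} failed", n))
        } else {
            Ok("pulled")
        }
    });
    println!("{:?}", result);
}
```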

E. Acceptance bar

Conformance: crane(X) == container(X) (same result.json) across shared-env + per-task, in CI
Bake target + publish core/crane-runner

F. Governance

Doctrine rule: when crane applies, digest-pin, isolation invariant

The hard conceptual risk (daemonless fusion of real images) is done. What remains is mostly A (wire the existing orchestrator) + E (conformance in CI), gated by B (isolation — verified preservable via reuse). Arch (B7) is the one hard external constraint.

Where --mode container runs the pre-built evals/<benchmark>--<agent> combination image, --mode crane runs ONE generic core/crane-runner image that materializes the eval at run time. It crane-exports the per-axis benchmarks/<b> + agents/<a> images into a rootfs and fuses them in-container with bwrap/chroot: no Docker daemon, no DinD.

So the per-(benchmark, agent) combination matrix becomes an optional pre-bake, and per-task benchmarks (SWE-bench) run from one image (the runner pulls the per-task rootfs).

Additive: compose/container/job modes are untouched and nothing depends on the new image. Opt in with --mode crane.

cli: Mode::Crane + run_crane (mirrors run_container, passes EVAL_BENCHMARK_ENV=per-task for per-task benchmarks) and naming::crane_runner_image with a unit test.
containers/core/crane-runner: the runner image (Dockerfile + materialize entrypoint) plus a daemonless proof (crane-poc.sh).

First cut: the fusion is proven daemonless (crane-poc.sh). Wiring the in-rootfs gateway/otel via process-compose and the root-only grader perms is the remaining build-out (see README); pin pulls by digest before shipping.

Doctrine note: late- vs early-binding change; needs a rule for when crane applies plus the digest-pin requirement.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

Static verification confirmed: export preserves root-only perms (so the answer-isolation invariant holds by reusing gosu/process-compose, not by inventing isolation); the orchestration is the existing 5-process pipeline launched by /usr/local/bin/run; arch must match (swe-bench is x86_64). README status now states the precise [3] build-out (bake core, pull gateway axis, extract as root, exec entrypoint+run) and the remaining gates (digest-pin, conformance, bake target, doctrine). Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

run.rs: --mode container and --mode crane duplicated ~30 lines of docker-run boilerplate; extract one docker_run_eval(image, envs, dry_run) both call. materialize: drop the unused in_root function (the [3] build-out is still a stub). Dockerfile: drop unused jq. crane-poc.sh: detect arch instead of hardcoding arm64 so it runs in any Linux container. Net -33 lines; behavior unchanged (dry-runs identical for container + crane, per-task env still injected, container mode unaffected). Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
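The shape of that refactor might look like the sketch below. Only the name and parameter list `docker_run_eval(image, envs, dry_run)` come from the commit message; the types, the return value, the `docker_run_args` helper, and the exact `docker run` flags (`--rm`, `-e`) are assumptions, since the real boilerplate is not shown here.

```rust
use std::io;
use std::process::Command;

/// Build the `docker run` argument list shared by container and crane modes.
fn docker_run_args(image: &str, envs: &[(String, String)]) -> Vec<String> {
    let mut args = vec!["run".to_string(), "--rm".to_string()];
    for (key, value) in envs {
        args.push("-e".to_string());
        args.push(format!("{}={}", key, value));
    }
    args.push(image.to_string());
    args
}

/// Run (or, with `dry_run`, only print) the eval container.
/// Returns whether the command succeeded; a dry run counts as success.
fn docker_run_eval(image: &str, envs: &[(String, String)], dry_run: bool) -> io::Result<bool> {
    let args = docker_run_args(image, envs);
    if dry_run {
        println!("docker {}", args.join(" "));
        return Ok(true);
    }
    let status = Command::new("docker").args(&args).status()?;
    Ok(status.success())
}

fn main() -> io::Result<()> {
    // Hypothetical image name; dry run only, so Docker is not required.
    let envs = vec![("EVAL_BENCHMARK_ENV".to_string(), "per-task".to_string())];
    docker_run_eval("core/crane-runner", &envs, true)?;
    Ok(())
}
```

With one helper, the two modes differ only in which image and environment they pass, which is consistent with the identical dry-run output the commit reports.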

Removing the dead in_root function left bubblewrap unused (the first cut only does crane export | tar -x); tar and chroot are already in debian:stable-slim. apt install is now just crane fetch deps (ca-certificates curl). bwrap lands with the [3] build-out, when first used. Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

containers/core/<x>/ holds image build sources (Dockerfile + COPYd files); crane-poc.sh was a standalone demo, COPYd nowhere and used by nothing. The daemonless mechanism is captured in the PR; dropped the file and its two references (Dockerfile header, README). Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

elronbandel added 4 commits June 14, 2026 16:23

elronbandel force-pushed the elron/crane-mode branch from f39196c to fbd5187 Compare June 14, 2026 13:23

elronbandel force-pushed the main branch from 39be206 to 90d90b8 Compare June 15, 2026 10:35

elronbandel force-pushed the elron/crane-mode branch from 7b6e146 to af1d25c Compare June 15, 2026 10:35

elronbandel force-pushed the main branch from cf2320e to 8bf6d75 Compare June 15, 2026 12:24

elronbandel force-pushed the elron/crane-mode branch from af1d25c to d4f7cef Compare June 15, 2026 12:24

elronbandel force-pushed the main branch from 3f4e4b8 to 9b46aee Compare June 15, 2026 13:47

elronbandel force-pushed the elron/crane-mode branch from d4f7cef to 7338d09 Compare June 15, 2026 13:47

elronbandel wants to merge 5 commits into main from elron/crane-mode
