feat(run): add --mode crane — materialize evals at run time#126

Draft
elronbandel wants to merge 5 commits into main from elron/crane-mode

@elronbandel elronbandel commented Jun 14, 2026

Contributor

What

A new, additive deployment mode. Where --mode container runs the pre-built evals/<benchmark>--<agent> combination image, --mode crane runs ONE generic core/crane-runner image that materializes the eval at run time: it runs crane export on the per-axis benchmarks/<b> + agents/<a> (+ models/<gateway>) images to build a rootfs and fuses them in-container with bwrap/chroot, with no Docker daemon and no DinD.

So the per-(benchmark, agent) combination matrix becomes an optional pre-bake, and per-task benchmarks (SWE-bench) run from one image — the runner pulls the per-task rootfs, so a single image fans out across tasks (which a per-task combination image cannot). It is the single-container analog of the generic compose artifact — one better: it composes the axes at run time rather than pulling a pre-fused product.

Additive — nothing existing changes

compose/container/job untouched; the combination matrix still builds and works (this just makes it optional); opt in with --mode crane; unused → zero impact.

Changes

  • cli: Mode::Crane + run_crane (mirrors run_container; passes EVAL_BENCHMARK_ENV=per-task via is_per_task_by_name) + naming::crane_runner_image (+ unit test).
  • containers/core/crane-runner: the runner image (Dockerfile + materialize) and a daemonless proof (crane-poc.sh).
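For reviewers who want the image-selection change in one place, here is a minimal sketch of how the two single-container modes could pick their image. It is not the code in this PR: the `registry` parameter, `combination_image`, `image_for`, and `crane_envs` are hypothetical names, and the real signature of `naming::crane_runner_image` may differ. Only the image paths and `EVAL_BENCHMARK_ENV=per-task` come from the description above.

```rust
/// Deployment modes named in this PR; `Crane` is the additive one.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Compose,
    Container,
    Job,
    Crane,
}

/// Assumed shape of `naming::crane_runner_image`: one generic runner image.
/// The `registry` prefix is an assumption made for this sketch.
pub fn crane_runner_image(registry: &str) -> String {
    format!("{registry}/core/crane-runner")
}

/// Hypothetical helper: the pre-built combination image used by container mode.
pub fn combination_image(registry: &str, benchmark: &str, agent: &str) -> String {
    format!("{registry}/evals/{benchmark}--{agent}")
}

/// Image a run would start. Only the two single-container modes are sketched;
/// compose and job are untouched by this PR and return `None` here.
pub fn image_for(mode: Mode, registry: &str, benchmark: &str, agent: &str) -> Option<String> {
    match mode {
        Mode::Container => Some(combination_image(registry, benchmark, agent)),
        Mode::Crane => Some(crane_runner_image(registry)),
        Mode::Compose | Mode::Job => None,
    }
}

/// Extra env for the runner: per-task benchmarks get EVAL_BENCHMARK_ENV=per-task.
pub fn crane_envs(per_task: bool) -> Vec<(&'static str, &'static str)> {
    if per_task {
        vec![("EVAL_BENCHMARK_ENV", "per-task")]
    } else {
        Vec::new()
    }
}
```

The point of the sketch is that crane mode ignores the (benchmark, agent) pair when choosing the image; the pair only matters later, when the runner pulls the per-axis images.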

Verified

  • cargo build + naming test green; dry-runs correct (shared-env; per-task adds EVAL_BENCHMARK_ENV=per-task; --agent/--local guards).
  • Daemonless mechanism (crane-poc.sh): crane pulls a rootfs + chroot/bwrap runs in it + agent edits the testbed — no daemon.
  • Real-image fusion: the real swe-bench /testbed (sympy repo) ⊎ real claude-code /opt/agent reproduce the combination layout (922 MB, byte-level).
  • Isolation survives the fusion (the key safety check): export preserves root-only modes/owners (/tasks 0600, /tests & /opt/gateway 0700, root) — so the existing gosu agent / env -i pipeline keeps answers unreadable. The invariant is reused, not reinvented.
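The isolation check above could be automated roughly as follows. This is a sketch under stated assumptions, not the verification actually run for this PR: `Observed`, `observe`, `is_root_only`, and `violations` are hypothetical names, and the expected modes are simply the ones quoted in the bullet (/tasks 0600, /tests and /opt/gateway 0700, owned by root).

```rust
use std::fs;
use std::io;
use std::os::unix::fs::{MetadataExt, PermissionsExt};
use std::path::Path;

/// Mode bits and owner uid observed on one extracted path.
pub struct Observed {
    pub mode: u32,
    pub uid: u32,
}

/// Paths (relative to the extracted rootfs) and the modes quoted in the PR.
pub const EXPECTED: [(&str, u32); 3] = [
    ("tasks", 0o600),
    ("tests", 0o700),
    ("opt/gateway", 0o700),
];

/// Read mode and owner without following symlinks.
pub fn observe(path: &Path) -> io::Result<Observed> {
    let meta = fs::symlink_metadata(path)?;
    Ok(Observed {
        mode: meta.permissions().mode() & 0o7777,
        uid: meta.uid(),
    })
}

/// Root-only means: owned by uid 0, exact expected mode, no group/other bits.
pub fn is_root_only(obs: &Observed, expected_mode: u32) -> bool {
    obs.uid == 0 && obs.mode == expected_mode && obs.mode & 0o077 == 0
}

/// List every expected path whose perms or owner drifted during extraction.
pub fn violations(rootfs: &Path) -> io::Result<Vec<String>> {
    let mut bad = Vec::new();
    for (rel, mode) in EXPECTED {
        let obs = observe(&rootfs.join(rel))?;
        if !is_root_only(&obs, mode) {
            bad.push(format!("{rel}: mode {:o}, uid {}", obs.mode, obs.uid));
        }
    }
    Ok(bad)
}
```

An empty result from `violations` would mean the fused rootfs still keeps answers unreadable to the unprivileged agent user, which is the invariant the gosu agent / env -i pipeline relies on.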

🚧 Draft — definition of done (statically verified)

First cut proves the fusion. To fully replace container mode for every benchmark / every user / all the time:

A. Functional (run at all)

  • Pull the 3rd axis (models/<gateway> → /opt/gateway)
  • Run the agent's install.sh (symlinks; light)
  • Bake otelcol + process-compose + gosu + /usr/local/bin/{run,write-result} into the runner; extract as root, then exec /entrypoint.sh /usr/local/bin/run
  • Emit /output/task/result.json (via baked write-result)

B. Correctness (or scores are silently invalid)

  • Isolation: extract as root + run agent via gosu agent/env -i (export preserves perms — verified ✓)
  • Arch match: runner + node + pulled rootfs (swe-bench is x86_64; runner built here is arm64)
  • Digest-pin pulls; record digests in the result
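The digest-pin gate in B could be enforced with a small check before any pull. The sketch below is an illustration only: `pinned_digest` and `require_pinned` are hypothetical helpers, it accepts only `sha256` digests, and how digests end up in result.json is left open.

```rust
/// Return the `sha256:<hex>` digest if the reference is digest-pinned.
/// Only sha256 with 64 lowercase hex characters is accepted in this sketch.
pub fn pinned_digest(image_ref: &str) -> Option<&str> {
    let (_, digest) = image_ref.rsplit_once('@')?;
    let hex = digest.strip_prefix("sha256:")?;
    let valid = hex.len() == 64
        && hex.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    if valid {
        Some(digest)
    } else {
        None
    }
}

/// Refuse to proceed unless every axis image is pinned; on success, return
/// the digests so the caller can record them alongside the result.
pub fn require_pinned(refs: &[&str]) -> Result<Vec<String>, String> {
    refs.iter()
        .map(|r| {
            pinned_digest(r)
                .map(str::to_string)
                .ok_or_else(|| format!("not digest-pinned: {r}"))
        })
        .collect()
}
```

Rejecting tag-only references up front keeps a late-bound run reproducible: the same digests always materialize the same rootfs.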

C. Distribution (every end user)

  • Per-axis images published & pullable
  • Registry auth
  • Insecure/mirror support

D. Operational (all the time)

  • Ephemeral-disk sizing for the GB rootfs
  • bwrap→chroot fallback soundness
  • Caching/mirror for big pulls
  • Cleanup + pull retries

E. Acceptance bar

  • Conformance: crane(X) == container(X) (same result.json) across shared-env + per-task, in CI
  • Bake target + publish core/crane-runner

F. Governance

  • Doctrine rule: when crane applies, digest-pin, isolation invariant

The hard conceptual risk (daemonless fusion of real images) is done. What remains is mostly A (wire the existing orchestrator) + E (conformance in CI), gated by B (isolation — verified preservable via reuse). Arch (B7) is the one hard external constraint.
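Since arch is called out as the hard constraint, a guard like the following could fail fast before a multi-GB pull. It is a sketch, not part of this PR: `oci_platform` and `arch_matches` are hypothetical names, and only the two architectures mentioned here (x86_64 and arm64) are mapped. The input strings follow the values of Rust's `std::env::consts::ARCH`.

```rust
/// Map a Rust target arch (as in `std::env::consts::ARCH`) to an OCI platform.
/// Only the two architectures discussed in this PR are covered.
pub fn oci_platform(arch: &str) -> Option<&'static str> {
    match arch {
        "x86_64" => Some("linux/amd64"),
        "aarch64" => Some("linux/arm64"),
        _ => None,
    }
}

/// True when the runner host can execute a rootfs built for `image_platform`.
pub fn arch_matches(host_arch: &str, image_platform: &str) -> bool {
    oci_platform(host_arch) == Some(image_platform)
}

/// Platform of the machine this code is running on, if it is one we map.
pub fn host_platform() -> Option<&'static str> {
    oci_platform(std::env::consts::ARCH)
}
```

With this, an arm64 runner asked to chroot into the x86_64 swe-bench rootfs would be rejected before extraction rather than failing at exec time.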

Where --mode container runs the pre-built evals/<benchmark>--<agent> combination image, --mode crane runs ONE generic core/crane-runner image that materializes the eval at run time. It crane-exports the per-axis benchmarks/<b> + agents/<a> images into a rootfs and fuses them in-container with bwrap/chroot: no Docker daemon, no DinD. So the per-(benchmark, agent) combination matrix becomes an optional pre-bake, and per-task benchmarks (SWE-bench) run from one image (the runner pulls the per-task rootfs).

Additive: compose/container/job modes are untouched and nothing depends on the new image. Opt in with --mode crane.

cli: Mode::Crane + run_crane (mirrors run_container, passes EVAL_BENCHMARK_ENV=per-task for per-task benchmarks) and naming::crane_runner_image with a unit test. containers/core/crane-runner: the runner image (Dockerfile + materialize entrypoint) plus a daemonless proof (crane-poc.sh).

First cut: the fusion is proven daemonless (crane-poc.sh). Wiring the in-rootfs gateway/otel via process-compose and the root-only grader perms is the remaining build-out (see README); pin pulls by digest before shipping. Doctrine note: late- vs early-binding change; needs a rule for when crane applies plus the digest-pin requirement.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Static verification confirmed: export preserves root-only perms (so the answer-isolation invariant holds by reusing gosu/process-compose, not by inventing isolation); the orchestration is the existing 5-process pipeline launched by /usr/local/bin/run; arch must match (swe-bench is x86_64). README status now states the precise [3] build-out (bake core, pull gateway axis, extract as root, exec entrypoint+run) and the remaining gates (digest-pin, conformance, bake target, doctrine).

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
run.rs: --mode container and --mode crane duplicated ~30 lines of docker-run boilerplate; extract one docker_run_eval(image, envs, dry_run) both call. materialize: drop the unused in_root function (the [3] build-out is still a stub). Dockerfile: drop unused jq. crane-poc.sh: detect arch instead of hardcoding arm64 so it runs in any Linux container.

Net -33 lines; behavior unchanged (dry-runs identical for container + crane, per-task env still injected, container mode unaffected).

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
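The shape of the shared helper described in the commit above might look like the sketch below. The name `docker_run_eval` and its three parameters come from the commit message; everything else is assumed for illustration, including the return type, the `docker_run_args` split, and the exact flags (`--rm`, `-e`), which may not match what run.rs passes.

```rust
use std::io;
use std::process::Command;

/// Build the argv for `docker run`; kept pure so dry-runs and tests share it.
pub fn docker_run_args(image: &str, envs: &[(&str, &str)]) -> Vec<String> {
    let mut args = vec!["run".to_string(), "--rm".to_string()];
    for (key, value) in envs {
        args.push("-e".to_string());
        args.push(format!("{key}={value}"));
    }
    args.push(image.to_string());
    args
}

/// One entry point for both modes: container passes the combination image,
/// crane passes the generic runner image plus its extra env.
/// Returns `None` on dry-run, otherwise the container's exit code if any.
pub fn docker_run_eval(
    image: &str,
    envs: &[(&str, &str)],
    dry_run: bool,
) -> io::Result<Option<i32>> {
    let args = docker_run_args(image, envs);
    if dry_run {
        // Print the command instead of running it.
        println!("docker {}", args.join(" "));
        return Ok(None);
    }
    let status = Command::new("docker").args(&args).status()?;
    Ok(status.code())
}
```

Keeping argv construction separate from execution is what makes "dry-runs identical for container + crane" cheap to assert.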
Removing the dead in_root function left bubblewrap unused (the first cut only does crane export | tar -x); tar and chroot are already in debian:stable-slim. apt install is now just crane fetch deps (ca-certificates curl). bwrap lands with the [3] build-out, when first used.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
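For readers following along in Rust rather than shell, the first-cut `crane export | tar -x` step mentioned above can be sketched as a two-process pipeline. This is not the materialize script from the PR: `export_args`, `untar_args`, and `materialize` are hypothetical names, and the tar flags (`-p`, `-f -`, `-C`) are assumptions chosen to preserve permissions and read from stdin.

```rust
use std::fs;
use std::io;
use std::process::{Command, Stdio};

/// argv for `crane export <image> -`, which streams the flattened rootfs to stdout.
pub fn export_args(image: &str) -> Vec<String> {
    vec!["export".to_string(), image.to_string(), "-".to_string()]
}

/// argv for tar: extract from stdin into `rootfs`, preserving permissions.
pub fn untar_args(rootfs: &str) -> Vec<String> {
    ["-x", "-p", "-f", "-", "-C", rootfs]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Pipe `crane export` into `tar -x`; true only if both processes succeed.
/// Needs `crane` and `tar` on PATH, and root if owners are to be preserved.
pub fn materialize(image: &str, rootfs: &str) -> io::Result<bool> {
    fs::create_dir_all(rootfs)?;
    let mut export = Command::new("crane")
        .args(export_args(image))
        .stdout(Stdio::piped())
        .spawn()?;
    let export_out = export.stdout.take().expect("stdout was piped");
    let untar = Command::new("tar")
        .args(untar_args(rootfs))
        .stdin(Stdio::from(export_out))
        .status()?;
    let exported = export.wait()?;
    Ok(exported.success() && untar.success())
}
```

Checking both exit statuses matters here: a failed pull can still leave tar exiting cleanly on a truncated stream.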
containers/core/<x>/ holds image build sources (Dockerfile + COPYd files); crane-poc.sh was a standalone demo, COPYd nowhere and used by nothing. The daemonless mechanism is captured in the PR; dropped the file and its two references (Dockerfile header, README).

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>