feat(run): add --mode crane - materialize evals at run time #126
Draft
elronbandel wants to merge 5 commits into
Where --mode container runs the pre-built evals/<benchmark>--<agent> combination image, --mode crane runs ONE generic core/crane-runner image that materializes the eval at run time. It crane-exports the per-axis benchmarks/<b> + agents/<a> images into a rootfs and fuses them in-container with bwrap/chroot: no Docker daemon, no DinD. So the per-(benchmark, agent) combination matrix becomes an optional pre-bake, and per-task benchmarks (SWE-bench) run from one image (the runner pulls the per-task rootfs). Additive: compose/container/job modes are untouched and nothing depends on the new image. Opt in with --mode crane. cli: Mode::Crane + run_crane (mirrors run_container, passes EVAL_BENCHMARK_ENV=per-task for per-task benchmarks) and naming::crane_runner_image with a unit test. containers/core/crane-runner: the runner image (Dockerfile + materialize entrypoint) plus a daemonless proof (crane-poc.sh). First cut: the fusion is proven daemonless (crane-poc.sh). Wiring the in-rootfs gateway/otel via process-compose and the root-only grader perms is the remaining build-out (see README); pin pulls by digest before shipping. Doctrine note: late- vs early-binding change; needs a rule for when crane applies plus the digest-pin requirement. Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
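The CLI side of this commit can be pictured roughly as below. This is a minimal sketch, not the code in the PR: the `registry` parameter and the `EVAL_BENCHMARK` / `EVAL_AGENT` variable names are assumptions made for illustration, and the real `naming::crane_runner_image` may take other inputs or append a tag or digest. Only `Mode::Crane`, `crane_runner_image`, and `EVAL_BENCHMARK_ENV=per-task` come from the commit message.

```rust
/// Deployment modes named in this PR; only `Crane` is new.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Compose,
    Container,
    Job,
    Crane,
}

/// Hypothetical naming helper: one generic runner image per registry.
/// The real helper's signature and tag handling are not shown in the PR text.
pub fn crane_runner_image(registry: &str) -> String {
    format!("{}/core/crane-runner", registry.trim_end_matches('/'))
}

/// Env vars handed to the runner so it knows which axes to materialize.
/// `EVAL_BENCHMARK` and `EVAL_AGENT` are assumed names;
/// `EVAL_BENCHMARK_ENV=per-task` is the variable named in the PR.
pub fn crane_envs(benchmark: &str, agent: &str, per_task: bool) -> Vec<(String, String)> {
    let mut envs = vec![
        ("EVAL_BENCHMARK".to_string(), benchmark.to_string()),
        ("EVAL_AGENT".to_string(), agent.to_string()),
    ];
    if per_task {
        // Per-task benchmarks (e.g. SWE-bench) get the extra marker.
        envs.push(("EVAL_BENCHMARK_ENV".to_string(), "per-task".to_string()));
    }
    envs
}
```

The point of the shape is that the image name no longer depends on the (benchmark, agent) pair; the pair travels as environment instead.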
Static verification confirmed: export preserves root-only perms (so the answer-isolation invariant holds by reusing gosu/process-compose, not by inventing isolation); the orchestration is the existing 5-process pipeline launched by /usr/local/bin/run; arch must match (swe-bench is x86_64). README status now states the precise [3] build-out (bake core, pull gateway axis, extract as root, exec entrypoint+run) and the remaining gates (digest-pin, conformance, bake target, doctrine). Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
run.rs: --mode container and --mode crane duplicated ~30 lines of docker-run boilerplate; extract one docker_run_eval(image, envs, dry_run) both call. materialize: drop the unused in_root function (the [3] build-out is still a stub). Dockerfile: drop unused jq. crane-poc.sh: detect arch instead of hardcoding arm64 so it runs in any Linux container. Net -33 lines; behavior unchanged (dry-runs identical for container + crane, per-task env still injected, container mode unaffected). Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
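A shared helper of the kind described here might look like the following sketch. Only the name `docker_run_eval` and its `(image, envs, dry_run)` parameters come from the commit message; the return type, the `docker_run_args` helper, and the exact flag set (`--rm`, `-e`) are assumptions, and the real helper may pass more (volumes, platform, and so on).

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Build the argv that follows `docker`. Hypothetical helper: `--rm` and `-e`
/// are standard docker flags, but the real flag set is not shown in the PR text.
pub fn docker_run_args(image: &str, envs: &[(String, String)]) -> Vec<String> {
    let mut args = vec!["run".to_string(), "--rm".to_string()];
    for (key, value) in envs {
        args.push("-e".to_string());
        args.push(format!("{}={}", key, value));
    }
    args.push(image.to_string());
    args
}

/// Print the command on a dry run (returning `None`), otherwise execute it
/// and return the exit status. Both container and crane mode would call this.
pub fn docker_run_eval(
    image: &str,
    envs: &[(String, String)],
    dry_run: bool,
) -> io::Result<Option<ExitStatus>> {
    let args = docker_run_args(image, envs);
    if dry_run {
        println!("docker {}", args.join(" "));
        return Ok(None);
    }
    Command::new("docker").args(&args).status().map(Some)
}
```

Keeping the argv construction in a pure function makes the "dry-runs identical" claim easy to assert in a unit test without a Docker daemon.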
Removing the dead in_root function left bubblewrap unused (the first cut only does crane export | tar -x); tar and chroot are already in debian:stable-slim. apt install is now just crane fetch deps (ca-certificates curl). bwrap lands with the [3] build-out, when first used. Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
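For readers unfamiliar with the `crane export | tar -x` step, here is a sketch of the same pipeline driven from Rust. The PR's `materialize` entrypoint is a script, so this is an illustration of the mechanism rather than the PR's code; `export_pipeline` and `export_rootfs` are hypothetical names, and the explicit `-f -` on tar is added here so that it reads from stdin regardless of build defaults.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Argv pair for `crane export <image> - | tar -x -f - -C <rootfs>`.
/// `crane export IMAGE -` writes the flattened image filesystem to stdout.
pub fn export_pipeline(image: &str, rootfs: &str) -> (Vec<String>, Vec<String>) {
    let crane = ["crane", "export", image, "-"];
    let tar = ["tar", "-x", "-f", "-", "-C", rootfs];
    (
        crane.iter().map(|s| s.to_string()).collect(),
        tar.iter().map(|s| s.to_string()).collect(),
    )
}

/// Run the pipeline without a Docker daemon. Returns true only if both
/// processes succeed. The rootfs directory is assumed to exist already.
pub fn export_rootfs(image: &str, rootfs: &str) -> io::Result<bool> {
    let (crane, tar) = export_pipeline(image, rootfs);
    let mut producer = Command::new(&crane[0])
        .args(&crane[1..])
        .stdout(Stdio::piped())
        .spawn()?;
    // Hand crane's stdout to tar as its stdin.
    let pipe = producer.stdout.take().expect("stdout was piped");
    let tar_status = Command::new(&tar[0])
        .args(&tar[1..])
        .stdin(Stdio::from(pipe))
        .status()?;
    let crane_status = producer.wait()?;
    Ok(crane_status.success() && tar_status.success())
}
```

Calling `export_rootfs` once per axis image into the same directory would layer benchmark and agent into one rootfs, which is the fusion the first cut relies on.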
containers/core/<x>/ holds image build sources (Dockerfile + COPYd files); crane-poc.sh was a standalone demo, COPYd nowhere and used by nothing. The daemonless mechanism is captured in the PR; dropped the file and its two references (Dockerfile header, README). Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
What
A new, additive deployment mode. Where --mode container runs the pre-built evals/<benchmark>--<agent> combination image, --mode crane runs ONE generic core/crane-runner image that materializes the eval at run time: it crane-exports the per-axis benchmarks/<b> + agents/<a> (+ models/<gateway>) images into a rootfs and fuses them in-container with bwrap/chroot. No Docker daemon, no DinD.

So the per-(benchmark, agent) combination matrix becomes an optional pre-bake, and per-task benchmarks (SWE-bench) run from one image: the runner pulls the per-task rootfs, so a single image fans out across tasks (which a per-task combination image cannot). It is the single-container analog of the generic compose artifact, and one better: it composes the axes at run time rather than pulling a pre-fused product.

Additive: nothing existing changes
compose/container/job untouched; the combination matrix still builds and works (this just makes it optional); opt in with --mode crane; unused → zero impact.

Changes
cli: Mode::Crane + run_crane (mirrors run_container; passes EVAL_BENCHMARK_ENV=per-task via is_per_task_by_name) + naming::crane_runner_image (+ unit test).
containers/core/crane-runner: the runner image (Dockerfile + materialize) and a daemonless proof (crane-poc.sh).

Verified
cargo build + naming test green; dry-runs correct (shared-env; per-task adds EVAL_BENCHMARK_ENV=per-task; --agent/--local guards).
Daemonless proof (crane-poc.sh): crane pulls a rootfs + chroot/bwrap runs in it + agent edits the testbed, with no daemon.
/testbed (sympy repo) ⊎ real claude-code /opt/agent reproduce the combination layout (922 MB, byte-level).
export preserves root-only modes/owners (/tasks 0600, /tests & /opt/gateway 0700, root), so the existing gosu agent / env -i pipeline keeps answers unreadable. The invariant is reused, not reinvented.

🚧 Draft: definition of done (statically verified)
First cut proves the fusion. To fully replace container mode for every benchmark / every user / all the time:

A. Functional (run at all)
Pull the gateway axis (models/<gateway> → /opt/gateway)
install.sh (symlinks; light)
Bake core /usr/local/bin/{run,write-result} into the runner; extract as root, then exec /entrypoint.sh /usr/local/bin/run
/output/task/result.json (via baked write-result)

B. Correctness (or scores are silently invalid)
gosu agent / env -i (export preserves perms, verified ✓)
Arch must match (SWE-bench is x86_64; runner built here is arm64)

C. Distribution (every end user)

D. Operational (all the time)

E. Acceptance bar
crane(X) == container(X) (same result.json) across shared-env + per-task, in CI
core/crane-runner

F. Governance
The hard conceptual risk (daemonless fusion of real images) is done. What remains is mostly A (wire the existing orchestrator) + E (conformance in CI), gated by B (isolation — verified preservable via reuse). Arch (B7) is the one hard external constraint.
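The acceptance bar crane(X) == container(X) could be checked in CI along these lines. This is only a sketch under a stated assumption: it treats "same result.json" as byte equality modulo trailing whitespace, whereas the real conformance check may need to compare parsed JSON or ignore volatile fields such as timestamps. `results_match` is a hypothetical name.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// True when both runs produced identical result.json bytes, ignoring
/// trailing whitespace. Byte equality is an assumption about what
/// "same result.json" means; a semantic JSON comparison may be needed.
pub fn results_match(crane: &Path, container: &Path) -> io::Result<bool> {
    let a = fs::read(crane)?;
    let b = fs::read(container)?;
    Ok(trim_end(&a) == trim_end(&b))
}

/// Strip trailing ASCII whitespace (for example a final newline).
fn trim_end(bytes: &[u8]) -> &[u8] {
    let end = bytes
        .iter()
        .rposition(|byte| !byte.is_ascii_whitespace())
        .map_or(0, |i| i + 1);
    &bytes[..end]
}
```

A CI job would run the same eval X once per mode, collect each /output/task/result.json, and fail the build when `results_match` returns false.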