feat(replay): record terminal-bench + skills-bench fixtures#146
feat(replay): record terminal-bench + skills-bench fixtures#146elronbandel wants to merge 1 commit into
Conversation
elronbandel
left a comment
There was a problem hiding this comment.
Review — request changes. The core fix (the podman build fallback) is the right approach; it's the base this is built on + the verification that block it.
Blocking
1. Built on the pre-#125 skills-bench → conflicts + a stale fixture.
This branch's containers/benchmarks/skills-bench/Dockerfile is the old shared-base version — it still carries the literal LABEL eval.benchmark.data_revision="312d07…" and the "We source the task's own Dockerfile for RUN pip/apt lines" grep approach. But #125 rewrote that file to the Harbor overlay (FROM ${TASK_BASE}, COPY environment/skills ./.claude/skills, the EVAL_SKILLS toggle). So:
- The PR is
CONFLICTINGwithmain; a rebase must keep #125's Dockerfile and re-apply only thereleasedlabel. - The
skills-bench-citation-checkfixture was recorded against the no-skills benchmark. On currentmainthe agent loads the skill and makes more model calls, so the recording will diverge (the replay model returnsREPLAY_EXHAUSTEDonce the agent out-runs it) —tests/replay/RULES.mdrule 2 calls that a re-record signal. The skills-bench fixture must be re-recorded against post-#125main. (terminal-bench is unaffected — #125 didn't touch it.)
2. Gates aren't green (R-8). Only DCO is passing on the PR; the rest of the suite — including the new replay test — hasn't passed, and the description itself notes cargo test --test replay can't run green on a podman/no-Rosetta host (bootstrap_core_bases → pyarrow under QEMU). The central new behaviour isn't CI-verified.
3. Rule + governed code in one submission (R-3). .agents/src/RULES.md principle 11 (the normative "falls back to podman build…") changes alongside the cli/src/build.rs that implements it. These belong in separate submissions. (The doc itself is accurate and changelogged, so R-6 is met — it's the coupling that's the issue.)
4. No issue reference (R-1). The description doesn't link the issue it resolves.
Nits
cli/src/build.rspodman_build_eval: the fallback triggers on anybake_with_enverror, but the message asserts "the builder can't resolve the local per-task base." A genuine Dockerfile/network failure would print that misleading reason and then fail again under podman. Consider matching the resolve-source-metadata error specifically, or softening the wording.tests/replay/test.rsensure_images: "skip if the eval image exists" can replay a stale local image on a dev box (fine on fresh CI; worth a one-line caveat).
Solid (credit where due)
- The
podman buildfallback is the right minimal fix — bake stays the single source of truth (docker buildx bake --print eval), scoped to per-task local builds only, mirroring the--builder ocbackend. - Fixtures are secret-clean — 0
sk-…matches in both. provenance.jsonadded pertests/replay/RULES.md6/9, honestly scoping the legacy backfill out; terminal-bench gains its missingdescription/data_revisionlabels (rule 21).
Path
Rebase onto main, re-apply the released label on #125's skills-bench Dockerfile, re-record the skills-bench fixture against current main, and get CI green. Optionally split the .agents/src/RULES.md change (R-3) and add the issue link (R-1).
461451b to
4caf356
Compare
|
Rebased onto current Blocking1. Pre-#125 skills-bench → conflicts + stale fixture — fixed. Rebased; the skills-bench Dockerfile is now #125's Harbor overlay with only the 2. Gates (R-8). The per-PR gates are daemon-free by design — 3. Rule + code (R-3). Kept together — you flagged it optional and R-6 is met (the principle-11 change is accurate + changelogged). Happy to split into a rule-first PR if you'd rather. 4. Issue (R-1). Opened #149 and linked it ( Nits
Also
|
2e5a50b to
9a41413
Compare
|
Follow-up (CLI-rules self-audit against Previously the fallback was gated on Everything else audited compliant: podman is in the tools table + changelog (the sanctioned way to add a tool), the fallback shells out (P5) from the bake-resolved spec (P3), and the printed |
Resolves #149. Per-task / built-from-source benchmarks couldn't build a combined eval image on a container-driver host: `build eval` runs `docker buildx bake`, but a docker-container buildx builder (e.g. a podman-backed Docker) can't resolve the per-task base from the local image store. Add a `podman build` fallback for the per-task eval combination, driven by the `bake --print` spec so bake stays the source of truth; CI's docker driver keeps using bake. - cli: per-task `build eval` falls back to `podman build` when buildx can't read the local store; `--dry-run` prints both the bake attempt and the podman fallback, since the fallback is what runs on a container-driver host (src/build.rs); src/RULES.md + changelog updated. - replay: ensure_images builds the per-task base+eval with --task-id (skips if the image is present); add terminal-bench + skills-bench replay_test! entries, recorded fixtures, and provenance.json. - benchmarks: mark terminal-bench + skills-bench released (rule 21a); add terminal-bench's missing description/data_revision labels (rule 21). - docs: fix the recording recipe (litellm gateway -> trajectory.jsonl). - ci: exclude large jsonl fixtures from check-added-large-files; fix the gitleaks fixture allowlist (singular `[allowlist]` + corrected path for the pinned 8.18.4) and document it honestly as a wholesale path exemption of the vetted fixture tree — 8.18.4 can't AND-combine path+field, so cleanliness is enforced at record time, not by the scanner (drop the dead per-field regexes). Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
9a41413 to
1d593a8
Compare
1d593a8 to
c3c6064
Compare
c3c6064 to
823fc3e
Compare
823fc3e to
787bd03
Compare
What
First replay fixtures for per-task / built-from-source benchmarks — terminal-bench (
configure-git-webserver) and skills-bench (citation-check) — and both markedeval.benchmark.released="true"(benchmarks/RULES.md rule 21a).Why it didn't work before
Recording a fixture needs a live run → the combined eval image →
eval-containers build eval→docker buildx bakewithFROM ${BENCHMARK_IMAGE}. On a container-driver buildx builder (a podman-backed Docker), bake can't resolve the per-task base from the local image store (build.sh/docker buildload it there, not a registry), so the eval image can't be stitched and no fixture can be recorded.Approach
build eval --task-idnow triesdocker buildx bakeand, when the builder can't resolve the local base, falls back topodman build(which reads the local store natively — the same reasonbuild.shuses podman). The fallback builds the spec fromdocker buildx bake --print eval, so bake stays the single source of truth (src/RULES.mdprinciple 3), mirroring the--builder ocbackend. CI'sdockerdriver keeps using bake; shared-env and remote---builderbuilds are untouched.Changes
cli/src/build.rs): per-task evalpodman buildfallback (podman_build_eval);.agents/src/RULES.mddocuments it + changelog row.tests/replay/test.rs):ensure_imagesbuilds the per-task base + eval with--task-id(skips if present) and pinsEVAL_REGISTRY; tworeplay_test!entries; recorded fixtures +provenance.json(rule 6/9).releasedlabel on both; terminal-bench gains the missingdescription/data_revisionlabels (rule 21).tests/LOCAL.mdrecording recipe corrected (litellm gateway →trajectory.jsonl; the default bifrost gateway emits OTeltraces.jsonl, which the replay model can't parse); skills-bench READMEbuildcommands fixed.check-added-large-filesexcludes large jsonl fixtures;.gitleaks.tomlfixture allowlist fixed (staletests/fixtures/path →tests/replay/fixtures/, and[[allowlists]]→[allowlist]for the pinned gitleaks 8.18.4, which ignores the array form).Verification
cargo build/fmt/clippyclean; fullpre-commitsuite passes.result.json).cargo test --test replaycan't run green on a podman/no-Rosetta host — itsbootstrap_core_bases()rebuildsbenchmark-base-hfandpyarrowfails under QEMU (pre-existing, unrelated to this change). Validated via the replay stack directly; should pass on amd64/Rosetta CI.OPENAI_API_KEYis not present in either fixture (0 matches); gitleaks hits areuser_api_key_hashone-way hashes (same shape as all 242 existing fixtures), exempted by the allowlist.Notes
provenance.jsoncovers the two new fixtures; the ~240 legacy fixtures remain unprovenanced (separate backfill).