feat(replay): record terminal-bench + skills-bench fixtures#146

Open: elronbandel wants to merge 1 commit into main from elron/focused-hofstadter-f8fa94
Conversation

@elronbandel

Contributor

What

First replay fixtures for per-task / built-from-source benchmarks — terminal-bench (configure-git-webserver) and skills-bench (citation-check) — and both marked eval.benchmark.released="true" (benchmarks/RULES.md rule 21a).

Why it didn't work before

Recording a fixture needs a live run → the combined eval image → eval-containers build eval → docker buildx bake with FROM ${BENCHMARK_IMAGE}. On a container-driver buildx builder (a podman-backed Docker), bake can't resolve the per-task base from the local image store (build.sh / docker build load it there, not into a registry), so the eval image can't be stitched and no fixture can be recorded.

Approach

build eval --task-id now tries docker buildx bake and, when the builder can't resolve the local base, falls back to podman build (which reads the local store natively, the same reason build.sh uses podman). The fallback builds from the spec printed by docker buildx bake --print eval, so bake stays the single source of truth (src/RULES.md principle 3), mirroring the --builder oc backend. CI's docker driver keeps using bake; shared-env and remote --builder builds are untouched.
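The approach above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in cli/src/build.rs: `BakeTarget`, `podman_build_args`, and `build_with_fallback` are hypothetical names, the struct is a simplified view of one target from `docker buildx bake --print`, and joining the Dockerfile path onto the context is an assumption about how the paths should be resolved.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Hypothetical, simplified view of one target as printed by
/// `docker buildx bake --print` (context, dockerfile, args, tags).
pub struct BakeTarget {
    pub context: String,
    pub dockerfile: String,
    pub args: BTreeMap<String, String>,
    pub tags: Vec<String>,
}

/// Translate the bake-resolved target into `podman build` arguments,
/// so the bake file stays the only place the build is described.
pub fn podman_build_args(t: &BakeTarget) -> Vec<String> {
    // Assumption: the Dockerfile path is relative to the build context.
    let dockerfile = Path::new(&t.context).join(&t.dockerfile);
    let mut argv = vec![
        "build".to_string(),
        "--file".to_string(),
        dockerfile.to_string_lossy().into_owned(),
    ];
    for (key, value) in &t.args {
        argv.push("--build-arg".to_string());
        argv.push(format!("{key}={value}"));
    }
    for tag in &t.tags {
        argv.push("--tag".to_string());
        argv.push(tag.clone());
    }
    argv.push(t.context.clone());
    argv
}

/// Run the primary build; if it fails, run the fallback.
/// Returns the name of the backend that succeeded.
pub fn build_with_fallback<P, F>(primary: P, fallback: F) -> Result<&'static str, String>
where
    P: FnOnce() -> Result<(), String>,
    F: FnOnce() -> Result<(), String>,
{
    match primary() {
        Ok(()) => Ok("bake"),
        Err(bake_err) => {
            eprintln!("docker buildx bake failed ({bake_err}); retrying with podman build");
            fallback()
                .map(|()| "podman")
                .map_err(|podman_err| format!("bake: {bake_err}; podman: {podman_err}"))
        }
    }
}
```

In a real CLI the two closures would presumably shell out via `std::process::Command`; keeping them as parameters here makes the fallback order easy to exercise without a container runtime.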

Changes

  • cli (cli/src/build.rs): per-task eval podman build fallback (podman_build_eval); .agents/src/RULES.md documents it + changelog row.
  • replay (tests/replay/test.rs): ensure_images builds the per-task base + eval with --task-id (skips if present) and pins EVAL_REGISTRY; two replay_test! entries; recorded fixtures + provenance.json (rule 6/9).
  • benchmarks: released label on both; terminal-bench gains the missing description / data_revision labels (rule 21).
  • docs: tests/LOCAL.md recording recipe corrected (litellm gateway → trajectory.jsonl; the default bifrost gateway emits OTel traces.jsonl, which the replay model can't parse); skills-bench README build commands fixed.
  • ci: check-added-large-files excludes large jsonl fixtures; .gitleaks.toml fixture allowlist fixed (stale tests/fixtures/ path → tests/replay/fixtures/, and [[allowlists]] → [allowlist] for the pinned gitleaks 8.18.4, which ignores the array form).

Verification

  • cargo build / fmt / clippy clean; full pre-commit suite passes.
  • Both fixtures replay-validated end-to-end (replay model serving the fixture → valid result.json).
  • Caveat: cargo test --test replay can't run green on a podman/no-Rosetta host — its bootstrap_core_bases() rebuilds benchmark-base-hf and pyarrow fails under QEMU (pre-existing, unrelated to this change). Validated via the replay stack directly; should pass on amd64/Rosetta CI.
  • Security: confirmed the real OPENAI_API_KEY is not present in either fixture (0 matches); gitleaks hits are user_api_key_hash one-way hashes (same shape as all 242 existing fixtures), exempted by the allowlist.

Notes

  • terminal-bench's replay relies on a single-store (podman-backed) host; a split-store docker CI would need a registry-push path (follow-up). skills-bench is unaffected.
  • provenance.json covers the two new fixtures; the ~240 legacy fixtures remain unprovenanced (separate backfill).

@elronbandel elronbandel left a comment

Contributor Author

Review: request changes. The core fix (the podman build fallback) is the right approach; what blocks it is the base this branch is built on and the verification.

Blocking

1. Built on the pre-#125 skills-bench → conflicts + a stale fixture.
This branch's containers/benchmarks/skills-bench/Dockerfile is the old shared-base version — it still carries the literal LABEL eval.benchmark.data_revision="312d07…" and the "We source the task's own Dockerfile for RUN pip/apt lines" grep approach. But #125 rewrote that file to the Harbor overlay (FROM ${TASK_BASE}, COPY environment/skills ./.claude/skills, the EVAL_SKILLS toggle). So:

  • The PR is CONFLICTING with main; a rebase must keep #125's Dockerfile and re-apply only the released label.
  • The skills-bench-citation-check fixture was recorded against the no-skills benchmark. On current main the agent loads the skill and makes more model calls, so the recording will diverge (the replay model returns REPLAY_EXHAUSTED once the agent out-runs it) — tests/replay/RULES.md rule 2 calls that a re-record signal. The skills-bench fixture must be re-recorded against post-#125 main. (terminal-bench is unaffected — #125 didn't touch it.)

2. Gates aren't green (R-8). Only DCO is passing on the PR; the rest of the suite — including the new replay test — hasn't passed, and the description itself notes cargo test --test replay can't run green on a podman/no-Rosetta host (bootstrap_core_bases → pyarrow under QEMU). The central new behaviour isn't CI-verified.

3. Rule + governed code in one submission (R-3). .agents/src/RULES.md principle 11 (the normative "falls back to podman build…") changes alongside the cli/src/build.rs that implements it. These belong in separate submissions. (The doc itself is accurate and changelogged, so R-6 is met — it's the coupling that's the issue.)

4. No issue reference (R-1). The description doesn't link the issue it resolves.
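To make the divergence in blocking point 1 concrete, here is a minimal sketch of a replay model that serves recorded responses in order. `ReplayModel` and its methods are hypothetical names for illustration; only the `REPLAY_EXHAUSTED` signal comes from this discussion, and the real replay model may work differently.

```rust
/// Hypothetical replay model: hands back recorded responses one per call.
pub struct ReplayModel {
    recorded: Vec<String>,
    next: usize,
}

impl ReplayModel {
    pub fn new(recorded: Vec<String>) -> Self {
        ReplayModel { recorded, next: 0 }
    }

    /// Return the next recorded response, or an error once the agent
    /// makes more model calls than the fixture contains.
    pub fn respond(&mut self) -> Result<&str, &'static str> {
        match self.recorded.get(self.next) {
            Some(response) => {
                self.next += 1;
                Ok(response.as_str())
            }
            None => Err("REPLAY_EXHAUSTED"),
        }
    }
}
```

A fixture recorded with fewer model calls than the agent now makes would hit the error branch, which is why a change in agent behaviour calls for a re-record.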

Nits

  • cli/src/build.rs podman_build_eval: the fallback triggers on any bake_with_env error, but the message asserts "the builder can't resolve the local per-task base." A genuine Dockerfile/network failure would print that misleading reason and then fail again under podman. Consider matching the resolve-source-metadata error specifically, or softening the wording.
  • tests/replay/test.rs ensure_images: "skip if the eval image exists" can replay a stale local image on a dev box (fine on fresh CI; worth a one-line caveat).
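One way to act on the first nit is to classify the bake failure before deciding what to report. The sketch below is hypothetical: `BakeFailure` and `classify_bake_failure` are illustrative names, and the matched phrase is an assumption about the wording of the resolve-source-metadata error, which should be checked against real builder output.

```rust
/// Hypothetical classification of a failed `docker buildx bake` run.
#[derive(Debug, PartialEq)]
pub enum BakeFailure {
    /// The builder could not resolve the per-task base image.
    UnresolvedLocalBase,
    /// Anything else (Dockerfile error, network failure, ...).
    Other,
}

/// Inspect captured stderr and decide whether the failure looks like the
/// "can't resolve the local base" case. The marker string is an assumption.
pub fn classify_bake_failure(stderr: &str, base_image: &str) -> BakeFailure {
    let mentions_resolve = stderr
        .to_lowercase()
        .contains("failed to resolve source metadata");
    if mentions_resolve && stderr.contains(base_image) {
        BakeFailure::UnresolvedLocalBase
    } else {
        BakeFailure::Other
    }
}
```

With this in place, only the `UnresolvedLocalBase` case would print the specific reason; other failures could still retry under podman, but with neutral wording.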

Solid (credit where due)

  • The podman build fallback is the right minimal fix — bake stays the single source of truth (docker buildx bake --print eval), scoped to per-task local builds only, mirroring the --builder oc backend.
  • Fixtures are secret-clean — 0 sk-… matches in both.
  • provenance.json added per tests/replay/RULES.md 6/9, honestly scoping the legacy backfill out; terminal-bench gains its missing description/data_revision labels (rule 21).

Path

Rebase onto main, re-apply the released label on #125's skills-bench Dockerfile, re-record the skills-bench fixture against current main, and get CI green. Optionally split the .agents/src/RULES.md change (R-3) and add the issue link (R-1).

@elronbandel elronbandel force-pushed the elron/focused-hofstadter-f8fa94 branch 2 times, most recently from 461451b to 4caf356 Compare June 15, 2026 07:54
@elronbandel

Contributor Author

Rebased onto current main (8edb7bb) and addressed the review. Force-pushed (4caf356).

Blocking

1. Pre-#125 skills-bench → conflicts + stale fixture — fixed. Rebased; the skills-bench Dockerfile is now #125's Harbor overlay with only the released label added (its description/data_revision were already there). The skills-bench fixture is re-recorded against post-#125 main with EVAL_SKILLS=on — the agent loads the citation-management skill and solves it (live reward=1, 19 model calls). terminal-bench was unaffected (#125 didn't touch it); its fixture moved to tests/run/replay/fixtures/. Both were re-validated end-to-end against current main via the replay model → valid result.json.

2. Gates (R-8). The per-PR gates are daemon-free by design: test.yml notes the container/e2e suites (incl. replay) "run elsewhere (release-images.yml on tag/main)", so the replay test isn't a PR gate. Locally the full pre-commit suite passes (gitleaks, hadolint, conftest, trivy, large-files) and clippy is clean; CI should now go green beyond DCO. The replay behavior is verified via the manual replay stack (cargo test --test replay can't run on a podman/no-Rosetta host, because bootstrap_core_bases builds benchmark-base-hf and pyarrow fails under QEMU). Caveat: the post-merge replay run for the built-from-source benchmarks needs the release runner to be podman-backed/single-store; otherwise terminal-bench needs the registry-push follow-up.

3. Rule + code (R-3). Kept together — you flagged it optional and R-6 is met (the principle-11 change is accurate + changelogged). Happy to split into a rule-first PR if you'd rather.

4. Issue (R-1). Opened #149 and linked it (Resolves #149).

Nits

  • Fallback message no longer asserts the cause — it now just reports docker buildx bake failed and that it's retrying with podman build (so a genuine Dockerfile/network failure isn't mislabeled).
  • ensure_images gained the stale-local-image caveat in its doc comment.

Also

  • The re-recorded skills-bench fixture is secret-clean (0 real-OPENAI_API_KEY matches). The gitleaks allowlist needed two fixes to actually fire: the stale path (tests/fixtures/ → tests/run/replay/fixtures/) and [[allowlists]] → singular [allowlist] (the pinned gitleaks 8.18.4 silently ignores the array form, added in 8.19).

@elronbandel elronbandel force-pushed the elron/focused-hofstadter-f8fa94 branch 2 times, most recently from 2e5a50b to 9a41413 Compare June 15, 2026 08:47
@elronbandel

Contributor Author

Follow-up (CLI-rules self-audit against .agents/src/RULES.md): closed a principle-2 (--dry-run) gap in the per-task build eval fallback.

Previously the fallback was gated on !dry_run, so build eval <b> --task-id <t> --dry-run printed only the docker buildx bake command — but on a container-driver host the command that actually runs is the podman build fallback, so the dry-run was misleading there. It now prints both: the bake attempt and the podman fallback (derived from the read-only bake --print spec, so bake stays the source of truth).
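A minimal sketch of that dry-run behaviour, assuming a hypothetical helper `dry_run_lines` and an invented comment-style prefix for the fallback line (the actual printed format is not shown in this thread):

```rust
/// Lines a hypothetical `--dry-run` would print for `build eval`.
/// `podman_fallback` is `Some(..)` only for per-task local builds;
/// shared-env and remote builds pass `None` and get one bake command.
pub fn dry_run_lines(bake_cmd: &str, podman_fallback: Option<&str>) -> Vec<String> {
    let mut lines = vec![bake_cmd.to_string()];
    if let Some(podman_cmd) = podman_fallback {
        // Shown as a second line so the reader sees what runs if bake fails.
        lines.push(format!("# fallback if bake fails: {podman_cmd}"));
    }
    lines
}
```

Passing the fallback as an `Option` keeps the decision about which builds get a fallback in the caller, while the printing stays the same for every build kind.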

Everything else audited compliant: podman is in the tools table + changelog (the sanctioned way to add a tool), the fallback shells out (P5) from the bake-resolved spec (P3), and the printed podman build is reproducible by hand (P1). Shared-env and remote --builder dry-runs are unchanged (one bake command, no podman). Amended into the existing commit to keep the PR a single clean change.

Resolves #149.

Per-task / built-from-source benchmarks couldn't build a combined eval image on a
container-driver host: `build eval` runs `docker buildx bake`, but a
docker-container buildx builder (e.g. a podman-backed Docker) can't resolve the
per-task base from the local image store. Add a `podman build` fallback for the
per-task eval combination, driven by the `bake --print` spec so bake stays the
source of truth; CI's docker driver keeps using bake.

- cli: per-task `build eval` falls back to `podman build` when buildx can't read
  the local store; `--dry-run` prints both the bake attempt and the podman
  fallback, since the fallback is what runs on a container-driver host
  (src/build.rs); src/RULES.md + changelog updated.
- replay: ensure_images builds the per-task base+eval with --task-id (skips if the
  image is present); add terminal-bench + skills-bench replay_test! entries,
  recorded fixtures, and provenance.json.
- benchmarks: mark terminal-bench + skills-bench released (rule 21a); add
  terminal-bench's missing description/data_revision labels (rule 21).
- docs: fix the recording recipe (litellm gateway -> trajectory.jsonl).
- ci: exclude large jsonl fixtures from check-added-large-files; fix the gitleaks
  fixture allowlist (singular `[allowlist]` + corrected path for the pinned
  8.18.4) and document it honestly as a wholesale path exemption of the vetted
  fixture tree — 8.18.4 can't AND-combine path+field, so cleanliness is enforced
  at record time, not by the scanner (drop the dead per-field regexes).

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
@elronbandel elronbandel force-pushed the elron/focused-hofstadter-f8fa94 branch from 9a41413 to 1d593a8 Compare June 15, 2026 09:06
@elronbandel elronbandel force-pushed the elron/focused-hofstadter-f8fa94 branch from 1d593a8 to c3c6064 Compare June 15, 2026 10:35
@elronbandel elronbandel force-pushed the elron/focused-hofstadter-f8fa94 branch from c3c6064 to 823fc3e Compare June 15, 2026 12:24
@elronbandel elronbandel force-pushed the elron/focused-hofstadter-f8fa94 branch from 823fc3e to 787bd03 Compare June 15, 2026 13:47