Problem
Per-task / built-from-source benchmarks (terminal-bench, skills-bench, swe-bench-pro, swe-lancer) can't get a replay fixture, so they can't pass the replay sweep or earn eval.benchmark.released="true" (benchmarks/RULES.md rule 21a).
Recording a fixture needs a live run, and a live run needs the combined eval image. That image is built by eval-containers build eval, which runs docker buildx bake with FROM ${BENCHMARK_IMAGE}. The per-task base is built into the local image store (build.sh / per-task docker build), not into a registry. On a docker-container buildx builder (e.g. a podman-backed Docker), bake resolves FROM against a registry rather than the local store, so it fails with "failed to resolve source metadata". Without the base, the eval image can't be stitched, so there is no live run and no fixture.
Impact
terminal-bench and skills-bench (both built-from-source after #125) have no replay fixtures and aren't released.
Fix direction
Have eval-containers build eval --task-id fall back to podman build (which reads the local image store) when buildx can't resolve the local base. The fallback should be driven by the docker buildx bake --print spec, so bake stays the source of truth. Then record fixtures and mark the benchmarks released.
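A rough sketch of how the fallback could translate the bake spec into a podman build invocation. This is an illustration, not the eval-containers implementation: the helper names (needs_fallback, podman_build_argv) and the target name "eval" are hypothetical, and it assumes the JSON emitted by docker buildx bake --print has the usual shape, with a top-level "target" map whose entries carry "context", "dockerfile", "args", "tags", and optionally "target".

```python
import posixpath
from typing import Any

# Error text quoted in the problem description above.
RESOLVE_ERROR = "failed to resolve source metadata"


def needs_fallback(bake_stderr: str) -> bool:
    """Return True when bake failed because it could not resolve the base image."""
    return RESOLVE_ERROR in bake_stderr


def podman_build_argv(bake_spec: dict[str, Any], target: str) -> list[str]:
    """Build a `podman build` argv from one target of a `bake --print` spec.

    Assumption: the Dockerfile path is relative to the build context unless
    it is absolute, so it is joined with the context here.
    """
    t = bake_spec["target"][target]
    context = t.get("context", ".")
    dockerfile = t.get("dockerfile", "Dockerfile")
    if not posixpath.isabs(dockerfile):
        dockerfile = posixpath.join(context, dockerfile)

    argv = ["podman", "build", "-f", dockerfile]
    if t.get("target"):
        # Multi-stage build stage to stop at, if the spec names one.
        argv += ["--target", t["target"]]
    for key, value in sorted(t.get("args", {}).items()):
        # Carries BENCHMARK_IMAGE (and any other args) through unchanged.
        argv += ["--build-arg", f"{key}={value}"]
    for tag in t.get("tags", []):
        argv += ["-t", tag]
    argv.append(context)
    return argv


# Hypothetical spec, shaped like `docker buildx bake --print` output.
example_spec = {
    "target": {
        "eval": {
            "context": "ctx",
            "dockerfile": "Dockerfile.eval",
            "args": {"BENCHMARK_IMAGE": "localhost/task-base:latest"},
            "tags": ["eval:task"],
        }
    }
}
example_argv = podman_build_argv(example_spec, "eval")
```

Because the argv is derived entirely from the printed spec, any change to the bake file flows through to the fallback without a second build definition to maintain.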