Per-task / built-from-source benchmarks can't produce replay fixtures #149

Description

@elronbandel

Problem

Per-task / built-from-source benchmarks (terminal-bench, skills-bench, swe-bench-pro, swe-lancer) can't get a replay fixture, so they can't pass the replay sweep or earn eval.benchmark.released="true" (benchmarks/RULES.md rule 21a).

Recording a fixture needs a live run → the combined eval image → eval-containers build eval → docker buildx bake with FROM ${BENCHMARK_IMAGE}. The per-task base is built into the local image store (build.sh / per-task docker build) and is never pushed to a registry. On a docker-container buildx builder (e.g. a podman-backed Docker), bake resolves FROM against a registry, not the local store, so it fails with "failed to resolve source metadata" → the eval image can't be stitched → no live run → no fixture.
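To make the failure mode concrete, here is a minimal sketch of how a wrapper could recognise it from the output of a bake run. The function name and the idea of matching on stderr are assumptions for illustration, not how eval-containers works today; only the error string comes from the report above.

```python
# Error text quoted in this issue; buildx emits it when a FROM image
# cannot be resolved by the builder.
RESOLVE_ERROR = "failed to resolve source metadata"


def is_base_resolution_failure(returncode: int, stderr: str) -> bool:
    """Return True if a failed bake run looks like an unresolved FROM image.

    Hypothetical helper: it only inspects the exit code and stderr text of a
    finished `docker buildx bake` process.
    """
    return returncode != 0 and RESOLVE_ERROR in stderr
```

Matching on the message is brittle if the wording changes between buildx versions, so a real implementation might prefer to check up front whether the base image exists only in the local store.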

Impact

terminal-bench and skills-bench (both built-from-source after #125) have no replay fixtures and aren't released.

Fix direction

Have eval-containers build eval --task-id fall back to podman build (which reads the local image store) when buildx can't resolve the local base. The podman invocation should be derived from the docker buildx bake --print spec so that bake stays the source of truth. Then record fixtures and mark the benchmarks released.
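A rough sketch of the translation step, assuming the JSON printed by docker buildx bake --print has a top-level "target" map whose entries carry "context", "dockerfile", "args" and "tags". The function name and the sample target are hypothetical, and only a small subset of bake fields is handled.

```python
import json
import os


def podman_commands_from_bake_spec(spec_json: str) -> list[list[str]]:
    """Turn `docker buildx bake --print` JSON into `podman build` argv lists.

    Hypothetical helper: it covers only context, dockerfile, args and tags,
    and ignores other bake fields (platforms, secrets, cache settings, ...).
    """
    spec = json.loads(spec_json)
    commands = []
    for _name, target in sorted(spec.get("target", {}).items()):
        context = target.get("context", ".")
        cmd = ["podman", "build"]
        dockerfile = target.get("dockerfile")
        if dockerfile:
            # Bake resolves the Dockerfile relative to the build context,
            # so join the two to get a path podman can open.
            cmd += ["-f", os.path.join(context, dockerfile)]
        for tag in target.get("tags") or []:
            cmd += ["-t", tag]
        for key, value in sorted((target.get("args") or {}).items()):
            cmd += ["--build-arg", f"{key}={value}"]
        cmd.append(context)
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Illustrative spec only; image names and tags are made up.
    sample = json.dumps({
        "target": {
            "eval": {
                "context": "ctx",
                "dockerfile": "Dockerfile",
                "args": {"BENCHMARK_IMAGE": "local/base:task"},
                "tags": ["eval:task"],
            }
        }
    })
    for argv in podman_commands_from_bake_spec(sample):
        print(" ".join(argv))
```

Because the argv lists are built from the printed spec and not from a second, hand-maintained build definition, any change to the bake file carries over to the fallback path automatically.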
