Per-task benchmarks at scale on OC: fan out N Jobs (an Indexed Job can't vary the image per index) #83

Description

@elronbandel

Context

Per-task benchmarks (compilebench, cybench, mle-bench, swe-bench, swe-bench-pro, swe-lancer, terminal-bench) bake one eval image per task: evals/<benchmark>-<task>--<agent> (see #79, rule 24f).

The chart's scale model for shared-env benchmarks is ONE Indexed Job (datasetSize → completionMode: Indexed, task id = $JOB_COMPLETION_INDEX): one image, N task indices.
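
For reference, a minimal sketch of what the shared-env model amounts to on the Kubernetes side. This is not the chart's actual template: the function names, the container name, and the mapping of datasetSize to `completions` are assumptions made for illustration. The `completionMode`, `completions` and `parallelism` fields and the `JOB_COMPLETION_INDEX` environment variable are standard batch/v1 Job features.

```python
def render_indexed_job(name, image, dataset_size, parallelism):
    """Build a batch/v1 Indexed Job manifest as a plain dict.

    Assumption: datasetSize maps directly to spec.completions.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completionMode": "Indexed",
            "completions": dataset_size,   # one completion per task index
            "parallelism": parallelism,    # how many indices run at once
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # A single pod template: every index runs this one image.
                    "containers": [{"name": "eval", "image": image}],
                }
            },
        },
    }


def task_index(environ):
    """Read the task id that Kubernetes injects into each indexed pod."""
    return int(environ["JOB_COMPLETION_INDEX"])
```

The single `containers` entry inside the one pod template is what ties every index to the same image.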

Problem

That cannot work for per-task benchmarks: each task is a different image, and a k8s Job uses one pod template (one image) for all indices. #79 added a guard rejecting datasetSize + perTask, so a per-task run is one Job per task. There is currently no mechanism to run a full per-task benchmark (e.g. all 500 SWE-bench Verified tasks) at scale on OC.

Options to investigate

  • CLI fan-out: eval-containers run <bench> --mode job --all-tasks renders one Job per id from tasks.txt, each pinning evals/<b>-<task>--<a>, admitted by Kueue for global concurrency.
  • External sweep (the oc/ tooling): loop tasks.txt → helm template … --set task=<id> --set perTask=true → apply, Kueue as the concurrency cap.
  • Result aggregation across N Jobs (each writes its own /output/<task>/result.json).
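
A rough sketch of how the fan-out and aggregation options above could fit together, to make the shape of the work concrete. It is not the eval-containers CLI or the oc/ tooling. All function names, the container name, and the default queue name `eval-queue` are hypothetical, and it assumes each Job's /output/<task>/result.json has already been collected under one local directory. The `kueue.x-k8s.io/queue-name` label is the standard way to hand a Job to Kueue; the contents of result.json are treated as opaque JSON.

```python
import json
import re
from pathlib import Path


def job_name(benchmark, task):
    """Make a DNS-1123-safe Job name (lowercase, [a-z0-9-], max 63 chars)."""
    raw = f"{benchmark}-{task}".lower()
    return re.sub(r"[^a-z0-9-]+", "-", raw).strip("-")[:63].rstrip("-")


def render_job(benchmark, task, agent, queue):
    """One plain (non-indexed) Job pinning the per-task image."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": job_name(benchmark, task),
            # Kueue picks up Jobs carrying this label.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            "suspend": True,  # created suspended; Kueue unsuspends on admission
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "eval",
                        "image": f"evals/{benchmark}-{task}--{agent}",
                    }],
                }
            },
        },
    }


def read_task_ids(path):
    """Read non-empty lines from a tasks.txt-style file."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def render_all(benchmark, agent, task_ids, queue="eval-queue"):
    """Fan out: one Job manifest per task id."""
    return [render_job(benchmark, t, agent, queue) for t in task_ids]


def aggregate(output_root, task_ids):
    """Merge per-task result.json files; list tasks with no result yet."""
    results, missing = {}, []
    for task in task_ids:
        path = Path(output_root) / task / "result.json"
        if path.is_file():
            results[task] = json.loads(path.read_text())
        else:
            missing.append(task)
    return {"results": results, "missing": missing}
```

Two caveats with this sketch: sanitizing task ids into Job names can make distinct ids collide, so a real implementation would probably need a uniqueness check or a hash suffix, and tracking `missing` explicitly keeps a partially failed sweep from being reported as a full run.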

Notes
