Context
Per-task benchmarks (compilebench, cybench, mle-bench, swe-bench, swe-bench-pro, swe-lancer, terminal-bench) bake one eval image per task — evals/<benchmark>-<task>--<agent> (see #79, rule 24f).
The chart's scale model for shared-env benchmarks is ONE Indexed Job (datasetSize → completionMode: Indexed, task id = $JOB_COMPLETION_INDEX): one image, N task indices.
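For reference, a minimal sketch of the Indexed Job shape described above. The chart's actual template is not shown in this issue, so the function names, the container name, and the idea that `datasetSize` maps directly to `completions` are assumptions for illustration. Only the `batch/v1` fields and the `JOB_COMPLETION_INDEX` variable are standard Kubernetes.

```python
import os
from typing import Mapping


def indexed_job(name: str, image: str, dataset_size: int, parallelism: int) -> dict:
    """Build one Indexed Job: a single pod template (one image), N completion indices."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completionMode": "Indexed",
            "completions": dataset_size,  # assumed to come from the chart's datasetSize
            "parallelism": parallelism,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # Every index runs this same image; only the index differs.
                    "containers": [{"name": "eval", "image": image}],
                }
            },
        },
    }


def current_task_index(env: Mapping[str, str] = os.environ) -> int:
    """Inside a pod of an Indexed Job, the task id is the completion index."""
    return int(env["JOB_COMPLETION_INDEX"])
```

The single `image` argument is the limitation at issue: nothing in the pod template can vary per index.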
Problem
That model cannot work for per-task benchmarks: each task is a different image, and a k8s Job has a single pod template (one image) shared by all indices. #79 added a guard rejecting datasetSize + perTask, so a per-task run is one Job per task. There's currently no mechanism to run a full per-task benchmark (e.g. all 500 SWE-bench Verified tasks) at scale on OC.
Options to investigate
- CLI fan-out: eval-containers run <bench> --mode job --all-tasks renders one Job per id from tasks.txt, each pinning evals/<b>-<task>--<a>, admitted by Kueue for global concurrency.
- External sweep (the oc/ tooling): loop tasks.txt → helm template … --set task=<id> --set perTask=true → apply, Kueue as the concurrency cap.
- Result aggregation across N Jobs (each writes its own /output/<task>/result.json).
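To make the fan-out and aggregation options concrete, here is a rough sketch of both, not a proposed implementation. It assumes tasks.txt holds one task id per line, that result.json is a JSON document, and that Jobs are handed to Kueue via the standard `kueue.x-k8s.io/queue-name` label with `suspend: true`. The helper names, the container name, and the queue name are hypothetical, and the image reference omits any registry or tag.

```python
import hashlib
import json
import re
from pathlib import Path


def read_tasks(path: Path) -> list[str]:
    """Read task ids, one per line, skipping blank lines (assumed tasks.txt format)."""
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


def job_name(bench: str, task: str) -> str:
    """Build a DNS-1123-safe Job name; a short hash keeps sanitized ids distinct."""
    raw = f"{bench}-{task}"
    slug = re.sub(r"[^a-z0-9-]+", "-", raw.lower()).strip("-")
    digest = hashlib.sha256(raw.encode()).hexdigest()[:8]
    prefix = slug[:54].rstrip("-") or "job"  # 54 + "-" + 8 = 63 chars at most
    return f"{prefix}-{digest}"


def render_job(bench: str, task: str, agent: str, queue: str) -> dict:
    """Render one suspended Job that pins the per-task image."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": job_name(bench, task),
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            "suspend": True,  # Kueue unsuspends the Job once it is admitted
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "eval", "image": f"evals/{bench}-{task}--{agent}"}
                    ],
                }
            },
        },
    }


def aggregate(output_root: Path, tasks: list[str]) -> dict:
    """Collect each task's result.json; tasks without one are reported as missing."""
    results, missing = {}, []
    for task in tasks:
        path = output_root / task / "result.json"
        if path.is_file():
            results[task] = json.loads(path.read_text())
        else:
            missing.append(task)
    return {"results": results, "missing": missing}
```

Each rendered dict can be serialized with `json.dumps` and applied directly, since Kubernetes accepts JSON manifests. Reporting `missing` separately keeps a crashed or unadmitted Job from silently shrinking the denominator of the final score.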
Notes
datasetSize/perTask model, the existing Kueue sweep concurrency.