refactor(chart): bundle benchmark presets in the chart (self-contained, OCI-ready) #26
Merged
…hmark values.yaml)

Make the Helm chart self-contained so that it renders, and can be published to an OCI registry, without the repo. The benchmark is now selected with `--set benchmark=<x>` instead of `-f benchmarks/<x>/values.yaml`, and a benchmark's bespoke topology lives in the chart at `presets/<x>.yaml`, loaded via Helm's `.Files.Get` and overlaid over the chart defaults.

- The 4 non-trivial benchmarks (osworld, tau-bench, visualwebarena, webarena) move to benchmarks/_chart/presets/<name>.yaml; the 98 trivial one-line values.yaml files are deleted (their identity now comes from --set benchmark).
- A new eval.values helper merges the selected preset over .Values; job.yaml reads the merged result. Presets set only structural keys, so per-run axes (agent/task/model via --set) always win.
- CLI run --mode job: drop the -f values, add --set benchmark=<x>.
- tests/helm.rs renders via --set benchmark; check.rs no longer requires a per-benchmark values.yaml (k8s works for every benchmark with no file).
- Doctrine (RULES.md 24/24b/24c/24e/25/29, add-benchmark skill+template, src/RULES.md, delivery/build) and docs retargeted at the preset model.

Renders byte-identical to the prior `-f values.yaml` form for all 102 benchmarks (trivial + the 4 rich); `cargo test --test helm` is green. Chart publishing (helm package/push) lands in a follow-up.
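The overlay described in this commit message can be illustrated with a small Python sketch. It is not the chart's code: the real logic lives in the `eval.values` Helm helper, whose internals are not shown here. The key names (`desktop.enabled`), the preset contents, and the functions `deep_merge` and `effective_values` are hypothetical stand-ins. The sketch assumes the preset is merged over values that already include the `--set` flags, so per-run axes survive only because presets never set them.

```python
from copy import deepcopy


def deep_merge(base, overlay):
    """Return a new dict where overlay wins and nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical chart defaults; the real values.yaml keys may differ.
CHART_DEFAULTS = {
    "benchmark": "",
    "agent": "",
    "task": "",
    "model": "",
    "desktop": {"enabled": False},
}

# Hypothetical preset contents: structural keys only, no per-run axes.
PRESETS = {
    "osworld": {"desktop": {"enabled": True}},
}


def effective_values(set_flags):
    """Mimic the described flow: defaults + --set, then the preset on top."""
    values = deep_merge(CHART_DEFAULTS, set_flags)  # stands in for .Values
    preset = PRESETS.get(values["benchmark"], {})   # no preset: empty overlay
    return deep_merge(values, preset)
```

Under these assumptions, `effective_values({"benchmark": "osworld", "model": "m1"})` picks up the preset's structural key while keeping the `--set` model, and a benchmark without a preset gets the defaults unchanged.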
This was referenced Jun 3, 2026
elronbandel added a commit that referenced this pull request on Jun 15, 2026:
refactor(chart): bundle benchmark presets in the chart (self-contained, OCI-ready)
Why

In `job` mode the chart was rendered as `helm template benchmarks/_chart -f benchmarks/<x>/values.yaml`. Both arguments are local paths, so deploying required cloning the repo. This PR makes the chart self-contained: it renders from `--set benchmark=<x>` alone, with each benchmark's bespoke topology bundled inside the chart. That is the prerequisite for packaging and publishing it to an OCI registry (follow-up PR).

What
- Benchmark selection changes from `-f benchmarks/<x>/values.yaml` to `--set benchmark=<x>`.
- The 4 non-trivial benchmarks (osworld, tau-bench, visualwebarena, webarena) keep their bespoke topology in `benchmarks/_chart/presets/<x>.yaml`, loaded with Helm's `.Files.Get` and overlaid over the chart defaults via a new `eval.values` helper. Standard benchmarks have no preset: `.Files.Get` returns empty and the chart defaults apply unchanged.
- The 98 trivial one-line `values.yaml` files are deleted; their identity now comes from `--set benchmark=<x>`.
- Presets set only structural keys. Per-run axes (agent/task/model) come from `--set` and always win. Verified no preset touches an axis.
- CLI (`run --mode job`): drops the `-f` values, adds `--set benchmark=<x>`.
- Tests: `tests/helm.rs` renders via `--set benchmark=<x>`; `tests/sanity/check.rs` no longer requires a per-benchmark `values.yaml` (the k8s surface works for every benchmark with no file).
- Doctrine and docs retargeted at the preset model: `RULES.md` rules 24/24b/24c/24e/25/29 + changelog, the add-benchmark skill & template, `src/RULES.md`, `delivery/build`, and the user docs (helm-chart, triple-mode, deploy guides, chart-values, README).

Verification
- Renders byte-identical to the prior `-f values.yaml` form for all 102 benchmarks (trivial + the 4 rich), confirmed by diff.
- `cargo test --test helm` (renders + kubeconform-validates all 102): green.
- `cargo test --test check`: the same 4 pre-existing reds as `main` (agents-smoke Dockerfile, hle fixture, README count); no new failures, and the values.yaml existence/pin checks are correctly gone.
- `cargo build` + `clippy` clean (pre-existing warnings only).

Live-cluster validation (server-side dry-run)
Validated against a real Kubernetes API server (kind), which exercises schema, defaulting, and admission, that is, everything short of containers starting:

- `helm template … --set benchmark=<x> | kubectl apply --dry-run=server` (the chart selection accepted by the API server for every benchmark, including the 4 rich presets).
- The `desktop` Deployment + Service + `wait-for-desktop` init container all render and validate alongside the Job.
- `eval-containers run <b> --mode job --dry-run` → helm → `kubectl apply --dry-run=server` accepted for trivial and rich.
- `--overlay deploy/values-openshift.yaml` injects `serviceAccountName: anyuid-sa` and validates.

Not covered (needs pullable images + the `eval-secrets` Secret): a container actually running a task to produce `result.json`.

Follow-up
Chart publishing (`helm package` + `helm push` to quay) is a separate small PR, per scoping. It is independent of #25 (registrySuffix), which also touches `job.yaml`; whichever merges second resolves a trivial conflict on the three image lines.