refactor(chart): bundle benchmark presets in the chart (self-contained, OCI-ready) by elronbandel · Pull Request #26 · Exgentic/eval-containers

elronbandel · 2026-06-03T15:14:28Z

Why

In job mode the chart was rendered with `helm template benchmarks/_chart -f benchmarks/<x>/values.yaml`. Both arguments are local paths, so deploying required cloning the repo. This PR makes the chart self-contained: it renders from `--set benchmark=<x>` alone, with each benchmark's bespoke topology bundled inside the chart. That's the prerequisite for packaging/publishing it to an OCI registry (follow-up PR).

What

Benchmark selection moves from -f benchmarks/<x>/values.yaml to --set benchmark=<x>.
Bespoke topology moves into the chart at benchmarks/_chart/presets/<x>.yaml, loaded with Helm's .Files.Get and overlaid over the chart defaults via a new eval.values helper. Standard benchmarks have no preset — .Files.Get returns empty and the chart defaults apply unchanged.
The 4 non-trivial benchmarks (osworld, tau-bench, visualwebarena, webarena) became presets (git-detected as renames); the 98 trivial one-line values.yaml files are deleted — their identity now comes from --set benchmark=<x>.
Merge semantics: presets set only structural keys (sidecars, resources, extraManifests); the per-run axes (agent/task/model) come from --set and always win. Verified no preset touches an axis.
CLI (run --mode job): drops the -f values, adds --set benchmark=<x>.
Tests: tests/helm.rs renders via --set benchmark=<x>; tests/sanity/check.rs no longer requires a per-benchmark values.yaml (the k8s surface works for every benchmark with no file).
Doctrine + docs retargeted at the preset model: RULES.md rules 24/24b/24c/24e/25/29 + changelog, the add-benchmark skill & template, src/RULES.md, delivery/build, and the user docs (helm-chart, triple-mode, deploy guides, chart-values, README).
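
The merge semantics above can be sketched in plain Rust. This is only a model of the idea, not the chart's `eval.values` helper (which is a Helm template): the `Value` type, the `overlay` and `touches_axis` functions, the sample values, and the assumption that the axes are top-level keys named `agent`, `task`, and `model` are all illustrative. The preset is overlaid on the values that already include `--set`, so the axes survive only because no preset defines them.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a Helm values tree (real values also hold lists, numbers, ...).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Scalar(String),
    Map(BTreeMap<String, Value>),
}

type Values = BTreeMap<String, Value>;

/// Per-run axes. Assumption: they are top-level keys supplied via --set.
const AXES: [&str; 3] = ["agent", "task", "model"];

/// Overlay `over` onto `base`: nested maps merge key by key,
/// any other value in `over` replaces the one in `base`.
fn overlay(base: &mut Values, over: &Values) {
    for (key, new) in over {
        if let (Some(Value::Map(dst)), Value::Map(src)) = (base.get_mut(key), new) {
            overlay(dst, src);
            continue;
        }
        base.insert(key.clone(), new.clone());
    }
}

/// True when a preset defines one of the per-run axes, which the PR rules out.
fn touches_axis(preset: &Values) -> bool {
    AXES.iter().any(|axis| preset.contains_key(*axis))
}

fn scalar(s: &str) -> Value {
    Value::Scalar(s.to_string())
}

fn map(entries: &[(&str, Value)]) -> Values {
    entries.iter().map(|(k, v)| (k.to_string(), v.clone())).collect()
}

fn main() {
    // Chart defaults plus --set values (hypothetical sample data).
    let mut values = map(&[
        ("benchmark", scalar("osworld")),
        ("agent", scalar("example-agent")),
        ("resources", Value::Map(map(&[("cpu", scalar("1")), ("memory", scalar("2Gi"))]))),
    ]);
    // A preset that only sets structural keys.
    let preset = map(&[("resources", Value::Map(map(&[("cpu", scalar("4"))])))]);

    assert!(!touches_axis(&preset));
    overlay(&mut values, &preset);
    println!("{:#?}", values);
}
```

A benchmark without a preset corresponds to overlaying an empty map, which leaves the values unchanged.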

Verification

Byte-identical render vs. the prior -f values.yaml form for all 102 benchmarks (trivial + the 4 rich) — confirmed by diff.
cargo test --test helm (renders + kubeconform-validates all 102) — green.
cargo test --test check — the same 4 pre-existing reds as main (agents-smoke Dockerfile, hle fixture, README count); no new failures, and the values.yaml existence/pin checks are correctly gone.
cargo build + clippy clean (pre-existing warnings only).
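
For reference, a byte-identical check of the kind described above can be sketched as follows. The PR only says the comparison was done "by diff"; the `first_diff_line` function is a hypothetical helper, not code from this repo.

```rust
/// Returns the 1-based line number of the first difference between two
/// rendered outputs, or None when they are byte-identical.
fn first_diff_line(old: &[u8], new: &[u8]) -> Option<usize> {
    if old == new {
        return None;
    }
    let mut old_lines = old.split(|b| *b == b'\n');
    let mut new_lines = new.split(|b| *b == b'\n');
    let mut line = 1;
    loop {
        match (old_lines.next(), new_lines.next()) {
            // Same line on both sides: keep scanning.
            (Some(a), Some(b)) if a == b => line += 1,
            // Lines differ, or one render ended before the other.
            _ => return Some(line),
        }
    }
}

fn main() {
    // Hypothetical render snippets standing in for the two helm outputs.
    let old = b"kind: Job\nname: osworld\n";
    let new = b"kind: Job\nname: osworld\n";
    match first_diff_line(old, new) {
        None => println!("byte-identical"),
        Some(n) => println!("first difference at line {}", n),
    }
}
```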

Live-cluster validation (server-side dry-run)

Validated against a real Kubernetes API server (kind), which exercises schema, defaulting, and admission — everything short of containers starting:

Whole fleet: 101/101 benchmarks pass `helm template … --set benchmark=<x> | kubectl apply --dry-run=server`; the API server accepts the chart selection for every benchmark, including the 4 rich presets.
Rich topology (osworld): the desktop Deployment + Service + wait-for-desktop init container all render and validate alongside the Job.
Full CLI path: eval-containers run <b> --mode job --dry-run → helm → kubectl apply --dry-run=server accepted for trivial and rich.
OpenShift overlay: --overlay deploy/values-openshift.yaml injects serviceAccountName: anyuid-sa and validates.
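
The render-then-dry-run pipeline used in these checks can be sketched with `std::process`. This is an illustration under assumptions, not the CLI's actual code: `helm_template_args`, `KUBECTL_ARGS`, and `server_dry_run` are hypothetical names, the chart path is passed in by the caller, and `-f -` is added so that `kubectl` reads the rendered manifests from stdin.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Arguments for rendering the chart from `--set benchmark=<x>` alone.
fn helm_template_args(chart: &str, benchmark: &str) -> Vec<String> {
    vec![
        "template".to_string(),
        chart.to_string(),
        "--set".to_string(),
        format!("benchmark={}", benchmark),
    ]
}

/// Server-side dry-run that reads manifests from stdin.
const KUBECTL_ARGS: [&str; 4] = ["apply", "--dry-run=server", "-f", "-"];

/// Pipes `helm template` into `kubectl apply --dry-run=server`.
/// Returns Ok(true) only when both commands exit successfully.
fn server_dry_run(chart: &str, benchmark: &str) -> io::Result<bool> {
    let mut helm = Command::new("helm")
        .args(helm_template_args(chart, benchmark))
        .stdout(Stdio::piped())
        .spawn()?;
    let rendered = helm.stdout.take().expect("stdout was piped");

    let kubectl = Command::new("kubectl")
        .args(KUBECTL_ARGS)
        .stdin(Stdio::from(rendered))
        .status();

    // Always reap helm, even if kubectl could not be started.
    let helm_ok = helm.wait()?.success();
    Ok(helm_ok && kubectl?.success())
}

fn main() {
    match server_dry_run("benchmarks/_chart", "osworld") {
        Ok(true) => println!("accepted by the API server"),
        Ok(false) => println!("rejected"),
        Err(e) => eprintln!("could not run helm/kubectl: {}", e),
    }
}
```

Running it requires `helm`, `kubectl`, and a reachable cluster; without them `main` only reports the error.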

Not covered (needs pullable images + the eval-secrets Secret): a container actually running a task to produce result.json.

Follow-up

Chart publishing (helm package + helm push to quay) is a separate small PR, per scoping. Independent of #25 (registrySuffix), which also touches job.yaml — whichever merges second resolves a trivial conflict on the three image lines.

…hmark values.yaml)

Make the Helm chart self-contained so it renders, and can be published to an OCI registry, without the repo. The benchmark is now selected with `--set benchmark=<x>` instead of `-f benchmarks/<x>/values.yaml`, and a benchmark's bespoke topology lives in the chart at `presets/<x>.yaml`, loaded via Helm's `.Files.Get` and overlaid over the chart defaults.

- The 4 non-trivial benchmarks (osworld, tau-bench, visualwebarena, webarena) move to benchmarks/_chart/presets/<name>.yaml; the 98 trivial one-line values.yaml files are deleted (their identity now comes from --set benchmark).
- A new eval.values helper merges the selected preset over .Values; job.yaml reads the merged result. Presets set only structural keys, so per-run axes (agent/task/model via --set) always win.
- CLI run --mode job: drop the -f values, add --set benchmark=<x>.
- tests/helm.rs renders via --set benchmark; check.rs no longer requires a per-benchmark values.yaml (k8s works for every benchmark with no file).
- Doctrine (RULES.md 24/24b/24c/24e/25/29, add-benchmark skill+template, src/RULES.md, delivery/build) and docs retargeted at the preset model.

Renders byte-identical to the prior `-f values.yaml` form for all 102 benchmarks (trivial + the 4 rich); `cargo test --test helm` is green.

Chart publishing (helm package/push) lands in a follow-up.


elronbandel added 3 commits June 3, 2026 18:14

docs(test): update helm.rs header to the --set benchmark model

22a8b45

docs(changelog): record self-contained Helm chart (--set benchmark + …

c704aab

…presets)

elronbandel merged commit a811811 into main Jun 3, 2026
1 check failed

This was referenced Jun 3, 2026

docs: point the 4 rich benchmark READMEs at their relocated chart preset #27

Merged

docs: k8s no-clone OCI parity + consistent one---set-per-axis #28

Merged

elronbandel added a commit that referenced this pull request Jun 15, 2026

Merge pull request #26 from Exgentic/elron/chart-native-presets

6e40620

refactor(chart): bundle benchmark presets in the chart (self-contained, OCI-ready)

elronbandel added a commit that referenced this pull request Jun 15, 2026

Merge pull request #26 from Exgentic/elron/chart-native-presets

1d806b5

refactor(chart): bundle benchmark presets in the chart (self-contained, OCI-ready)

elronbandel added a commit that referenced this pull request Jun 15, 2026

Merge pull request #26 from Exgentic/elron/chart-native-presets

af01ceb

refactor(chart): bundle benchmark presets in the chart (self-contained, OCI-ready)

elronbandel merged 3 commits into main from elron/chart-native-presets
