AIN-288 · one-command delta dry-run + CI #2
eval/run_delta.sh: a single command that generates the CC0 synthetic corpus (seed 7, CRN), then computes the routed-vs-baselines delta table (routed/ainfera_learned vs single_best / cheapest / round_robin / agent_baseline / ainfera_static / oracle) and writes it to preprint/results.md. Deterministic and reproducible.

.github/workflows/delta-dry-run.yml: CI runs the synthetic dry-run on every change under benchmark/, eval/, or datasets/ and asserts exit 0 and that a delta table is emitted.

The real headline number stays gated on the real judged corpus (not a published claim).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.
Reviewed by Cursor Bugbot for commit e9d6894.
    cd "$(dirname "$0")/.."
    SEED="${1:-7}"
    python datasets/generate.py --seed "$SEED"
    python eval/delta.py
Seed parameter not propagated to benchmark evaluation
Low Severity
The SEED parameter is passed to datasets/generate.py but not forwarded to eval/delta.py, which calls run() in run_benchmark.py where SEED = 7 is hardcoded. Running ./eval/run_delta.sh 42 generates data with seed 42 but evaluates the benchmark with seed 7, so the user-facing [SEED] argument only controls half the random state — silently contradicting the script's "one-command reproducible" and "CRN — deterministic" documentation.
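One possible shape for the fix, as a minimal sketch only: give eval/delta.py a `--seed` flag and pass it through to `run()` instead of relying on a module-level constant. The `--seed` flag on delta.py, the `run(seed=...)` signature, and the body of `run()` below are all assumptions made for illustration; the real `run()` in run_benchmark.py is not shown in this PR.

```python
import argparse
import random

DEFAULT_SEED = 7  # same default as SEED="${1:-7}" in eval/run_delta.sh


def run(seed: int = DEFAULT_SEED) -> dict:
    """Hypothetical stand-in for run() in run_benchmark.py.

    The only point illustrated here is that the seed arrives as a
    parameter rather than being hardcoded as SEED = 7.
    """
    rng = random.Random(seed)
    return {"seed": seed, "sample": rng.random()}


def parse_args(argv=None) -> argparse.Namespace:
    # Hypothetical CLI for eval/delta.py; the --seed flag is an assumption.
    parser = argparse.ArgumentParser(description="Compute the delta table.")
    parser.add_argument("--seed", type=int, default=DEFAULT_SEED)
    return parser.parse_args(argv)


def main(argv=None) -> dict:
    args = parse_args(argv)
    # Forward the seed so corpus generation and evaluation share it.
    return run(seed=args.seed)
```

With something like this in place, the last line of the script could become `python eval/delta.py --seed "$SEED"`, so that `./eval/run_delta.sh 42` would drive both halves of the random state from the same value.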


Wraps the existing benchmark into a one-command reproducible synthetic dry-run + a CI gate. Baselines: single_best/cheapest/round_robin/agent_baseline/ainfera_static/oracle vs routed. Headline number gated on real corpus.
Note
Low Risk
Adds CI and a shell entrypoint around existing synthetic eval scripts; no auth, data, or production routing changes.
Overview
Adds a one-command synthetic delta harness via eval/run_delta.sh, which seeds corpus generation (datasets/generate.py, default seed 7) and runs eval/delta.py to write the routed-vs-baselines table to preprint/results.md.

Introduces a GitHub Actions workflow (delta-dry-run) that runs on changes under benchmark/, eval/, and datasets/: installs Python deps, executes the harness with seed 7, and fails CI unless preprint/results.md exists and includes ainfera_learned.
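The CI assertion described in the summary (the results file exists and mentions ainfera_learned) could look roughly like the following. This is a sketch under assumptions, not the workflow's actual step: the function name `check_results` is hypothetical, and the real workflow may well do the check directly in shell.

```python
from pathlib import Path


def check_results(path: Path, required_row: str = "ainfera_learned") -> None:
    """Exit non-zero unless the results file exists and names the routed row.

    Hypothetical helper; it mirrors the two conditions in the summary above.
    """
    if not path.is_file():
        raise SystemExit(f"missing results file: {path}")
    text = path.read_text(encoding="utf-8")
    if required_row not in text:
        raise SystemExit(f"{path} does not mention {required_row}")
```

A CI step could call `check_results(Path("preprint/results.md"))` after the harness finishes. Note that a substring match only shows that a row name was written, not that the delta table is well-formed.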