
AIN-288 · one-command delta dry-run + CI#2

Merged: hizrianraz merged 1 commit into main from hizrianraz/ain-288-delta-oneshot on May 29, 2026

hizrianraz (Contributor) commented on May 29, 2026

Wraps the existing benchmark in a one-command, reproducible synthetic dry-run and adds a CI gate. The routed policy (ainfera_learned) is compared against six baselines: single_best, cheapest, round_robin, agent_baseline, ainfera_static, and oracle. The headline number stays gated on the real judged corpus.


Note

Low Risk
Adds CI and a shell entrypoint around existing synthetic eval scripts; no auth, data, or production routing changes.

Overview
Adds a one-command synthetic delta harness via eval/run_delta.sh, which seeds corpus generation (datasets/generate.py, default seed 7) and runs eval/delta.py to write the routed-vs-baselines table to preprint/results.md.

Introduces a GitHub Actions workflow (delta-dry-run) that runs on changes under benchmark/, eval/, and datasets/: installs Python deps, executes the harness with seed 7, and fails CI unless preprint/results.md exists and includes ainfera_learned.
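The gate described above (results file exists and mentions ainfera_learned) could be expressed as a small shell check. This is a minimal sketch, not the workflow's actual step: the function name `check_results` is hypothetical, and only the file path and the `ainfera_learned` marker come from the description.

```shell
#!/usr/bin/env bash
# Sketch of the CI assertion: fail unless the results file exists
# and contains the routed policy name. check_results is a hypothetical
# helper name; the real workflow step is not shown in this PR page.
check_results() {
  local results="${1:-preprint/results.md}"
  if [ ! -f "$results" ]; then
    echo "missing results file: $results" >&2
    return 1
  fi
  if ! grep -q "ainfera_learned" "$results"; then
    echo "no ainfera_learned entry in $results" >&2
    return 1
  fi
}
```

Note that a check like this only proves a table was emitted; it says nothing about whether the numbers in it are correct.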

Reviewed by Cursor Bugbot for commit e9d6894.

eval/run_delta.sh — single command: generate CC0 synthetic corpus (seed 7, CRN)
then compute the routed-vs-baselines delta table (routed/ainfera_learned vs
single_best / cheapest / round_robin / agent_baseline / ainfera_static / oracle)
→ preprint/results.md. Deterministic + reproducible.
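One way to spot-check the "deterministic + reproducible" claim is to run the harness twice with the same seed and compare checksums of the output. The sketch below assumes nothing about the harness internals; `check_deterministic` is a hypothetical helper that takes the output path followed by the command to run.

```shell
#!/usr/bin/env bash
# Sketch: run a command twice and confirm the output file is byte-identical.
# check_deterministic is a hypothetical helper, not part of this PR.
check_deterministic() {
  local out="$1"; shift
  local first second
  "$@" || return 1                 # first run
  first="$(cksum < "$out")"
  "$@" || return 1                 # second run, same arguments
  second="$(cksum < "$out")"
  [ "$first" = "$second" ]
}

# Possible usage from the repo root (not executed here):
#   check_deterministic preprint/results.md ./eval/run_delta.sh 7
```

A matching checksum shows run-to-run stability for one seed only; it does not show that the seed argument controls every source of randomness.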

.github/workflows/delta-dry-run.yml — CI runs the synthetic dry-run on every
change to benchmark/eval/datasets and asserts exit 0 + a delta table is emitted.
Real headline number stays gated on real judged corpus (not a published claim).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hizrianraz merged commit 1ecb46a into main on May 29, 2026
hizrianraz deleted the hizrianraz/ain-288-delta-oneshot branch on May 29, 2026 06:11

cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.

Comment thread eval/run_delta.sh
cd "$(dirname "$0")/.."
SEED="${1:-7}"
python datasets/generate.py --seed "$SEED"
python eval/delta.py

Seed parameter not propagated to benchmark evaluation

Low Severity

The SEED parameter is passed to datasets/generate.py but not forwarded to eval/delta.py, which calls run() in run_benchmark.py where SEED = 7 is hardcoded. Running ./eval/run_delta.sh 42 generates data with seed 42 but evaluates the benchmark with seed 7, so the user-facing [SEED] argument only controls half the random state — silently contradicting the script's "one-command reproducible" and "CRN — deterministic" documentation.
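A possible shape for the fix is to forward the same seed to both steps. This is a sketch under a stated assumption: eval/delta.py does not accept `--seed` in the code shown here, so that flag is hypothetical and would require delta.py and run_benchmark.py to read it instead of the hardcoded SEED = 7. The `PYTHON` override is also illustrative, added only so the wiring can be exercised without the real scripts.

```shell
#!/usr/bin/env bash
# Sketch only: assumes eval/delta.py is changed to accept --seed
# (hypothetical flag, not present in this PR).
run_delta() {
  local seed="${1:-7}"
  # Reject anything that is not a non-negative integer.
  case "$seed" in
    *[!0-9]*) echo "seed must be a non-negative integer: $seed" >&2; return 2 ;;
  esac
  # PYTHON is an illustrative override; defaults to the python on PATH.
  local py="${PYTHON:-python}"
  "$py" datasets/generate.py --seed "$seed" || return 1
  "$py" eval/delta.py --seed "$seed"
}
```

As in the original script, this would be called after changing to the repository root. An environment variable read by run_benchmark.py would be an alternative to a flag; either way, the point is that one value drives both corpus generation and evaluation.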
