AIN-288 · one-command delta dry-run + CI by hizrianraz · Pull Request #2 · ainfera-ai/research

hizrianraz · 2026-05-29T06:11:20Z

Wraps the existing benchmark in a one-command, reproducible synthetic dry-run and adds a CI gate. The routed policy is compared against six baselines: single_best, cheapest, round_robin, agent_baseline, ainfera_static, and oracle. The headline number stays gated on the real corpus.

Note

Low Risk
Adds CI and a shell entrypoint around existing synthetic eval scripts; no auth, data, or production routing changes.

Overview
Adds a one-command synthetic delta harness via eval/run_delta.sh, which seeds corpus generation (datasets/generate.py, default seed 7) and runs eval/delta.py to write the routed-vs-baselines table to preprint/results.md.

Introduces a GitHub Actions workflow (delta-dry-run) that runs on changes under benchmark/, eval/, and datasets/: installs Python deps, executes the harness with seed 7, and fails CI unless preprint/results.md exists and includes ainfera_learned.
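The gate described above (results file exists and mentions ainfera_learned) can be sketched in Python. This is a minimal illustration, not the workflow's actual step: the helper name `results_table_ok` is hypothetical, and only the path `preprint/results.md` and the marker `ainfera_learned` come from this PR.

```python
from pathlib import Path


def results_table_ok(path="preprint/results.md", marker="ainfera_learned"):
    """Return True when the results file exists and mentions the marker.

    Hypothetical helper: mirrors the two conditions the summary says the
    CI job asserts, nothing more (it does not validate the table itself).
    """
    results = Path(path)
    if not results.is_file():
        # Harness never wrote the table, so the gate should fail.
        return False
    return marker in results.read_text(encoding="utf-8")
```

A CI step could then end with something like `sys.exit(0 if results_table_ok() else 1)`, so that a missing file or a missing routed row fails the job.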

Reviewed by Cursor Bugbot for commit e9d6894. Bugbot is set up for automated code reviews on this repo.

eval/run_delta.sh: single command that generates the CC0 synthetic corpus (seed 7, CRN), then computes the routed-vs-baselines delta table (routed/ainfera_learned vs single_best / cheapest / round_robin / agent_baseline / ainfera_static / oracle) → preprint/results.md. Deterministic + reproducible.

.github/workflows/delta-dry-run.yml: CI runs the synthetic dry-run on every change to benchmark/eval/datasets and asserts exit 0 + a delta table is emitted.

Real headline number stays gated on real judged corpus (not a published claim).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.

cursor · 2026-05-29T06:16:07Z

+cd "$(dirname "$0")/.."
+SEED="${1:-7}"
+python datasets/generate.py --seed "$SEED"
+python eval/delta.py


Seed parameter not propagated to benchmark evaluation

Low Severity

The SEED parameter is passed to datasets/generate.py but not forwarded to eval/delta.py, which calls run() in run_benchmark.py where SEED = 7 is hardcoded. Running ./eval/run_delta.sh 42 generates data with seed 42 but evaluates the benchmark with seed 7, so the user-facing [SEED] argument only controls half the random state — silently contradicting the script's "one-command reproducible" and "CRN — deterministic" documentation.
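One possible way to close this gap, sketched under assumptions: the evaluation entrypoint accepts a `--seed` flag and threads it into the benchmark instead of reading a module-level constant. The `--seed` flag on eval/delta.py, the `parse_seed` helper, and the `run(seed=...)` signature are all hypothetical here; the real `run()` in run_benchmark.py is not shown in this PR, so a seeded stand-in is used.

```python
import argparse
import random

DEFAULT_SEED = 7  # matches the default the shell entrypoint already uses


def parse_seed(argv=None):
    """Parse a hypothetical --seed flag, falling back to DEFAULT_SEED."""
    parser = argparse.ArgumentParser(description="delta table (sketch)")
    parser.add_argument("--seed", type=int, default=DEFAULT_SEED)
    return parser.parse_args(argv).seed


def run(seed=DEFAULT_SEED):
    """Stand-in for the benchmark run: every random draw comes from `seed`.

    Assumption: the real benchmark can take its seed as an argument rather
    than from a hardcoded SEED = 7.
    """
    rng = random.Random(seed)  # local RNG, no hidden global state
    return [rng.random() for _ in range(3)]


# Usage: the same seed reproduces the run, a different seed changes it.
assert run(parse_seed(["--seed", "42"])) == run(42)
assert run(42) != run(DEFAULT_SEED)
```

With something like this in place, the last line of eval/run_delta.sh could forward the argument, for example `python eval/delta.py --seed "$SEED"`, so that a single value controls both corpus generation and evaluation.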

hizrianraz merged commit 1ecb46a into main May 29, 2026

hizrianraz deleted the hizrianraz/ain-288-delta-oneshot branch May 29, 2026 06:11

cursor Bot reviewed May 29, 2026
