Merged
18 changes: 18 additions & 0 deletions .github/workflows/delta-dry-run.yml
@@ -0,0 +1,18 @@
name: delta-dry-run
on:
  push:
    paths: ["benchmark/**", "eval/**", "datasets/**", ".github/workflows/delta-dry-run.yml"]
  pull_request:
    paths: ["benchmark/**", "eval/**", "datasets/**"]
jobs:
  delta:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -r benchmark/requirements.txt
      - name: AIN-288 synthetic delta dry-run (must exit 0, deterministic seed=7)
        run: ./eval/run_delta.sh 7
      - name: assert delta table emitted
        run: test -s preprint/results.md && grep -q 'ainfera_learned' preprint/results.md
15 changes: 15 additions & 0 deletions eval/run_delta.sh
@@ -0,0 +1,15 @@
#!/usr/bin/env bash
# AIN-288 · one-command reproducible delta harness (synthetic dry-run).
#
# ./eval/run_delta.sh [SEED] # default seed 7 (CRN — deterministic)
#
# Generates the CC0 synthetic corpus then computes the routed-vs-baselines
# delta table (routed/ainfera_learned vs single_best / cheapest / round_robin
# (random) / agent_baseline (agent-default) / ainfera_static / oracle) and
# writes preprint/results.md. Until a real judged corpus exists this is a
# DRY-RUN (labeled synthetic, not a published claim).
set -euo pipefail
cd "$(dirname "$0")/.."
SEED="${1:-7}"
python datasets/generate.py --seed "$SEED"
python eval/delta.py
Seed parameter not propagated to benchmark evaluation

Low Severity

The SEED parameter is passed to datasets/generate.py but is not forwarded to eval/delta.py, which calls run() in run_benchmark.py, where SEED = 7 is hardcoded. Running ./eval/run_delta.sh 42 therefore generates data with seed 42 but evaluates the benchmark with seed 7, so the user-facing [SEED] argument controls only half of the random state. This silently contradicts the script's "one-command reproducible" and "CRN — deterministic" documentation.
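
One possible shape for the fix, as a minimal sketch rather than the repository's actual code: eval/delta.py could accept a seed on the command line and pass it through to run(). The `--seed` flag, the `DEFAULT_SEED` name, and the body of `run()` below are assumptions made for illustration; the real run() in run_benchmark.py is not shown in this diff.

```python
import argparse
import random

# Assumption: mirrors the default of 7 used by eval/run_delta.sh.
DEFAULT_SEED = 7


def run(seed: int = DEFAULT_SEED) -> list[float]:
    """Hypothetical stand-in for run() in run_benchmark.py.

    The seed arrives as a parameter instead of a hardcoded module-level
    SEED, so the caller controls all random state.
    """
    rng = random.Random(seed)  # local RNG: no hidden global seeding
    return [rng.random() for _ in range(3)]


def parse_args(argv=None) -> argparse.Namespace:
    # Hypothetical CLI for eval/delta.py; the --seed flag is an assumption.
    parser = argparse.ArgumentParser(description="synthetic delta dry-run")
    parser.add_argument("--seed", type=int, default=DEFAULT_SEED)
    return parser.parse_args(argv)


def main(argv=None) -> list[float]:
    args = parse_args(argv)
    # Forward the same seed that the corpus generator received.
    return run(seed=args.seed)
```

Under this sketch, the last line of eval/run_delta.sh would become `python eval/delta.py --seed "$SEED"`, so a single argument determines both corpus generation and evaluation.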


Reviewed by Cursor Bugbot for commit e9d6894.
