title: "eval: track nightly full-eval green streak (3 consecutive days)"
labels: [eval, w2]
assignees: [SebAustin]
Problem
W1 acceptance requires the nightly workflow to stay green for three
consecutive days. Today we only have a pass/fail badge with no streak
counter or regression alert.
Acceptance criteria
The nightly workflow produces a summary.json artifact and posts a job summary with mean_judge_score, n_pass, and total_cost_usd.
A regression alert fires when the nightly workflow fails twice in a row.
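The streak and alert checks could be sketched as below. This is only an illustration under stated assumptions: run outcomes are assumed to arrive as a newest-first list of conclusion strings, one per nightly run, with "success" marking a green run. The function names and the `GREEN` constant are hypothetical and do not come from the existing workflow.

```python
from typing import Sequence

GREEN = "success"  # assumed conclusion string for a passing nightly run


def green_streak(conclusions: Sequence[str]) -> int:
    """Count consecutive green runs, starting from the most recent one."""
    streak = 0
    for conclusion in conclusions:  # assumed order: newest first
        if conclusion != GREEN:
            break
        streak += 1
    return streak


def streak_met(conclusions: Sequence[str], required: int = 3) -> bool:
    """True once the latest `required` nightly runs are all green."""
    return green_streak(conclusions) >= required


def failed_twice_in_a_row(conclusions: Sequence[str]) -> bool:
    """True when the two most recent runs both failed (alert condition)."""
    latest = conclusions[:2]
    return len(latest) == 2 and all(c != GREEN for c in latest)
```

With one run per night, `streak_met` corresponds to the three-consecutive-days requirement, and `failed_twice_in_a_row` is the condition that would trigger the alert.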
References
.github/workflows/eval-nightly.yml
prompts/05_evals_full_run.md