feat(evals): adopt thulr 0.1.3 — duel, review calibration, pareto, named criteria#13
feat(evals): adopt thulr 0.1.3 — duel, review calibration, pareto, named criteria#13justintime109 wants to merge 1 commit into
Conversation
…med criteria - eval:compare --pairwise runs thulr's position-swapped `duel` (relative win-rate judging; flips = judge position bias) over one self-contained trace per arm, replacing the hand-rolled in-process pairwise judge - human-review calibration: `eval:review` records SME verdicts; `eval` auto-folds them into `calibrate --reviews` (judge-vs-human TPR/TNR) - `eval:pareto` ranks failure modes across stored traces (free, no judge calls) - multi-dimension named criteria (thulr.criteria.<dim>) on the review cases (evidence_quality, impact_explanation) for score headroom; opt-in --score-guardrail=<dim> - eval:compare: configurable --timeout (subject + judge agents) and comma-separated --filter; judge.mjs no longer hardcodes a 120s judge cap - broaden the session-cache existence-check scorer (fixes a confirmed false negative where the judge was right and the regex missed the phrasing) - bridge: duelArgs/paretoArgs/reviewArgs/formatDuelSummary + tests; shared arg parser; 0.1.2 -> 0.1.3 docs/stamps evals/ is not packaged (.npmignore), so this is dev-tooling only — no version bump. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5fdf012acb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let duelReport = null; | ||
| if (pairwise && !dryRun) { | ||
| const eligible = rows.filter((r) => r.duelEligible); | ||
| if (!thulr.available()) { |
There was a problem hiding this comment.
Move thulr pairwise preflight before paid arms
When npm run eval:compare -- --pairwise is run without thulr on PATH, this availability check happens only after the loop above has already run every flows/plain arm and the absolute judges, so the command can spend tokens and wall-clock time and then merely print that the requested duel could not run. Since --pairwise is the new thulr-backed metric, fail or preflight thulr before starting the paid case loop so users do not get a successful-looking A/B run with no duel report.
Useful? React with 👍 / 👎.
Adopts thulr 0.1.3 across the opt-in eval harness (
evals/), adds three capabilities, and hardens the A/B path.evals/is not packaged (.npmignore), so this is dev-tooling only — no version bump.duel (relative A/B)
npm run eval:compare -- --pairwisenow shells out to thulr's calibrated, position-swappedduel(a win counts only when both orderings agree; opposite preferences are reported as flips = judge position bias) over one self-contained trace per arm — replacing the hand-rolled in-process pairwise judge.human-review calibration
npm run eval:reviewrecords SME verdicts (thulr review);npm run evalauto-discovers.thulr/reviews/<trace>.reviews.jsonand folds them intocalibrate --reviews(judge-vs-human TPR/TNR) on top of the deterministic-label axis.pareto
npm run eval:paretoranks failure modes across stored traces — which failure on which version to fix first (free, no judge calls).named criteria (score headroom)
The two hard review cases carry orthogonal
thulr.criteria.<dim>dimensions (evidence_quality,impact_explanation) judged alongside the completenesscriterion. Opt-in gating via--score-guardrail=<dim>.hardening
eval:comparegains--timeout(applies to subject and judge agents) and comma-separated--filter;judge.mjsno longer hardcodes a 120s judge cap.session-cacheexistence-check scorer after a confirmed false negative — the judge was right, the regex missed the "miss guard / unknown id throws" phrasing.tests / docs
npm run checkgreen.evals/README.md,CHANGELOG.md(Unreleased), and all 0.1.2 → 0.1.3 stamps updated.verification
thulr.criteria.<dim>attributes;thulr inspect-tracereports judge-grade with 0 issues.🤖 Generated with Claude Code