feat(evals): adopt thulr 0.1.3 — duel, review calibration, pareto, named criteria by justintime109 · Pull Request #13 · Thulr/pi-flows

justintime109 · 2026-06-13T17:19:42Z

Adopts thulr 0.1.3 across the opt-in eval harness (evals/), adds three capabilities, and hardens the A/B path. evals/ is not packaged (.npmignore), so this is dev-tooling only — no version bump.

duel (relative A/B)

npm run eval:compare -- --pairwise now shells out to thulr's calibrated, position-swapped duel over one self-contained trace per arm, replacing the hand-rolled in-process pairwise judge. A win counts only when both orderings agree; opposite preferences are reported as flips, which indicate judge position bias.
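
To make the win/flip rule concrete, here is a minimal sketch of how two position-swapped verdicts could be combined. It is not thulr's implementation: `classifyDuel`, `summarizeDuels`, the "A"/"B"/"tie" verdict encoding, and the choice to count any tie as a tie are all assumptions made for illustration.

```javascript
// Hypothetical sketch, not thulr's API.
// Each argument is the arm the judge preferred in one ordering: "A", "B", or "tie".
function classifyDuel(preferredWhenAFirst, preferredWhenBFirst) {
  // Assumption: a tie in either ordering makes the whole duel a tie.
  if (preferredWhenAFirst === "tie" || preferredWhenBFirst === "tie") return "tie";
  // Both orderings agree: the preference survived the position swap, so it is a win.
  if (preferredWhenAFirst === preferredWhenBFirst) {
    return preferredWhenAFirst === "A" ? "win_a" : "win_b";
  }
  // The preference followed the position rather than the arm: judge position bias.
  return "flip";
}

// Tally outcomes over a list of [verdictWhenAFirst, verdictWhenBFirst] pairs.
function summarizeDuels(pairs) {
  const summary = { win_a: 0, win_b: 0, flip: 0, tie: 0 };
  for (const [first, second] of pairs) {
    summary[classifyDuel(first, second)] += 1;
  }
  return summary;
}

console.log(summarizeDuels([["A", "A"], ["A", "B"], ["B", "B"]]));
```

A high flip count in such a summary says more about the judge than about either arm, which is why it is worth reporting separately from the win rate.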

human-review calibration

npm run eval:review records SME verdicts (thulr review); npm run eval auto-discovers .thulr/reviews/<trace>.reviews.json and folds them into calibrate --reviews (judge-vs-human TPR/TNR) on top of the deterministic-label axis.
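
For readers unfamiliar with the judge-vs-human metrics, the sketch below shows how TPR and TNR can be computed from paired verdicts. The record shape (`{ human, judge }` with "pass"/"fail" values), the function name, and the choice of "pass" as the positive class are assumptions; the actual format of the `.reviews.json` files is not shown in this PR.

```javascript
// Hypothetical sketch: the record shape is assumed, not the real reviews.json schema.
// The human verdict is treated as ground truth and "pass" as the positive class.
function judgeVsHuman(reviews) {
  let tp = 0;
  let fn = 0;
  let tn = 0;
  let fp = 0;
  for (const { human, judge } of reviews) {
    if (human === "pass" && judge === "pass") tp += 1;
    else if (human === "pass") fn += 1; // judge failed a case the human passed
    else if (judge === "fail") tn += 1;
    else fp += 1; // judge passed a case the human failed
  }
  // Return null instead of NaN when a class has no human-labelled examples.
  const ratio = (num, den) => (den === 0 ? null : num / den);
  return { tpr: ratio(tp, tp + fn), tnr: ratio(tn, tn + fp) };
}

console.log(
  judgeVsHuman([
    { human: "pass", judge: "pass" },
    { human: "pass", judge: "fail" },
    { human: "fail", judge: "fail" },
  ])
);
```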

pareto

npm run eval:pareto ranks failure modes across stored traces — which failure on which version to fix first (free, no judge calls).
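
As an illustration of the idea, a Pareto-style ranking can be produced by counting failures per (version, mode) pair and sorting by frequency. This is a sketch only: the input shape, the `rankFailureModes` name, and the sample failure-mode labels are hypothetical, and thulr's own ranking may weigh things differently.

```javascript
// Hypothetical sketch: input is a flat list of { version, mode } failure records.
function rankFailureModes(failures) {
  const buckets = new Map();
  for (const { version, mode } of failures) {
    const key = `${version}::${mode}`;
    const bucket = buckets.get(key) ?? { version, mode, count: 0 };
    bucket.count += 1;
    buckets.set(key, bucket);
  }
  // Most frequent first; ties broken by mode, then version, for a stable order.
  const ranked = [...buckets.values()].sort(
    (a, b) =>
      b.count - a.count ||
      a.mode.localeCompare(b.mode) ||
      a.version.localeCompare(b.version)
  );
  // Cumulative share shows how much of the total the top entries account for.
  let running = 0;
  return ranked.map((bucket) => {
    running += bucket.count;
    return { ...bucket, cumulativeShare: running / failures.length };
  });
}

console.log(
  rankFailureModes([
    { version: "v2", mode: "missing_evidence" },
    { version: "v2", mode: "missing_evidence" },
    { version: "v1", mode: "timeout" },
  ])
);
```

Because this is plain counting over stored traces, it needs no judge calls, which matches the "free" note above.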

named criteria (score headroom)

The two hard review cases carry orthogonal thulr.criteria.<dim> dimensions (evidence_quality, impact_explanation) judged alongside the completeness criterion. Opt-in gating via --score-guardrail=<dim>.

hardening

eval:compare gains --timeout (applies to subject and judge agents) and comma-separated --filter; judge.mjs no longer hardcodes a 120s judge cap.
Broadened the session-cache existence-check scorer after a confirmed false negative — the judge was right, the regex missed the "miss guard / unknown id throws" phrasing.
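
A rough sketch of how the new eval:compare options could be parsed follows. It is not the shared arg parser from this PR: the `--name=value` form, the interpretation of `--timeout` as seconds, the `parseCompareArgs` name, and the sample case names are all assumptions.

```javascript
// Hypothetical sketch, not the repo's shared arg parser.
// Assumes --timeout=<seconds> and --filter=<a,b,c> in --name=value form.
function parseCompareArgs(argv) {
  const opts = { pairwise: false, timeoutMs: null, filters: [] };
  for (const arg of argv) {
    if (arg === "--pairwise") {
      opts.pairwise = true;
    } else if (arg.startsWith("--timeout=")) {
      const seconds = Number(arg.slice("--timeout=".length));
      if (!Number.isFinite(seconds) || seconds <= 0) {
        throw new Error(`invalid --timeout value: ${arg}`);
      }
      // One value applies to both subject and judge agents.
      opts.timeoutMs = seconds * 1000;
    } else if (arg.startsWith("--filter=")) {
      // Comma-separated list; surrounding whitespace and empty entries are dropped.
      opts.filters = arg
        .slice("--filter=".length)
        .split(",")
        .map((name) => name.trim())
        .filter(Boolean);
    }
  }
  return opts;
}

console.log(parseCompareArgs(["--pairwise", "--timeout=300", "--filter=case-a,case-b"]));
```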

tests / docs

New bridge unit tests (duel/pareto/review args, duel-summary formatter, named-criteria emission, shared arg parser). 94 tests pass, npm run check green.
evals/README.md, CHANGELOG.md (Unreleased), and all 0.1.2 → 0.1.3 stamps updated.

verification

Offline: dry-run trace carries the new thulr.criteria.<dim> attributes; thulr inspect-trace reports judge-grade with 0 issues.
Live: duel ran end-to-end (zero-token fake-judge smoke + real A/B); named criteria judged into their own dimensions.

🤖 Generated with Claude Code

…med criteria

- eval:compare --pairwise runs thulr's position-swapped `duel` (relative win-rate judging; flips = judge position bias) over one self-contained trace per arm, replacing the hand-rolled in-process pairwise judge
- human-review calibration: `eval:review` records SME verdicts; `eval` auto-folds them into `calibrate --reviews` (judge-vs-human TPR/TNR)
- `eval:pareto` ranks failure modes across stored traces (free, no judge calls)
- multi-dimension named criteria (thulr.criteria.<dim>) on the review cases (evidence_quality, impact_explanation) for score headroom; opt-in --score-guardrail=<dim>
- eval:compare: configurable --timeout (subject + judge agents) and comma-separated --filter; judge.mjs no longer hardcodes a 120s judge cap
- broaden the session-cache existence-check scorer (fixes a confirmed false negative where the judge was right and the regex missed the phrasing)
- bridge: duelArgs/paretoArgs/reviewArgs/formatDuelSummary + tests; shared arg parser; 0.1.2 -> 0.1.3 docs/stamps

evals/ is not packaged (.npmignore), so this is dev-tooling only; no version bump.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5fdf012acb


chatgpt-codex-connector · 2026-06-13T17:24:38Z

+	let duelReport = null;
+	if (pairwise && !dryRun) {
+		const eligible = rows.filter((r) => r.duelEligible);
+		if (!thulr.available()) {


Move thulr pairwise preflight before paid arms

When npm run eval:compare -- --pairwise is run without thulr on PATH, this availability check happens only after the loop above has already run every flows/plain arm and the absolute judges, so the command can spend tokens and wall-clock time and then merely print that the requested duel could not run. Since --pairwise is the new thulr-backed metric, fail or preflight thulr before starting the paid case loop so users do not get a successful-looking A/B run with no duel report.
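
One possible shape for the suggested reordering is sketched below. It is an assumption-laden illustration, not the repo's code: `preflightPairwise`, `runCompare`, and the `deps` object are hypothetical names, `deps.thulrAvailable` stands in for the `thulr.available()` call in the diff, and the real case loop is presumably asynchronous.

```javascript
// Hypothetical sketch of failing fast before any paid arm runs.
function preflightPairwise({ pairwise, dryRun }, thulrAvailable) {
  // Only a real (non-dry-run) pairwise comparison needs thulr.
  if (pairwise && !dryRun && !thulrAvailable()) {
    throw new Error("--pairwise requires thulr on PATH; aborting before running any arm");
  }
}

function runCompare(options, deps) {
  // The check happens first, so no tokens are spent if the duel cannot run.
  preflightPairwise(options, deps.thulrAvailable);
  // Synchronous here for brevity; a real loop would await each case.
  return options.cases.map((testCase) => deps.runCase(testCase));
}

const rows = runCompare(
  { pairwise: true, dryRun: false, cases: ["case-a"] },
  { thulrAvailable: () => true, runCase: (id) => ({ id, duelEligible: true }) }
);
console.log(rows);
```

Throwing is one option; downgrading to a warning and skipping the duel would also work, but it would preserve the "successful-looking run with no duel report" outcome the comment objects to.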


justintime109 wants to merge 1 commit into main from feat/evals-thulr-0.1.3
