feat(evals): adopt thulr 0.1.3 — duel, review calibration, pareto, named criteria#13

Open
justintime109 wants to merge 1 commit into main from feat/evals-thulr-0.1.3

@justintime109

Adopts thulr 0.1.3 across the opt-in eval harness (evals/), adds three capabilities, and hardens the A/B path. evals/ is not packaged (.npmignore), so this is dev-tooling only — no version bump.

duel (relative A/B)

npm run eval:compare -- --pairwise now shells out to thulr's calibrated, position-swapped duel over one self-contained trace per arm, replacing the hand-rolled in-process pairwise judge. A win counts only when both orderings agree; opposite preferences are reported as flips, which indicate judge position bias.
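
For reviewers unfamiliar with position-swapped judging, here is a minimal sketch of the counting rule described above. It is not thulr's implementation: the names `combineOrderings` and `summarizeDuel`, the verdict labels, and the handling of a tie in only one ordering are all assumptions made for illustration.

```javascript
// Hypothetical sketch, not thulr's code.
// Each case carries two verdicts naming the preferred arm ("A", "B", or "tie"):
// `forward` from the A-then-B ordering, `swapped` from the B-then-A ordering.
function combineOrderings(forward, swapped) {
  if (forward === swapped) return forward; // both orderings agree
  // Assumption: a tie in only one ordering counts as no win for either arm.
  if (forward === "tie" || swapped === "tie") return "tie";
  return "flip"; // opposite preferences: the judge followed position
}

function summarizeDuel(cases) {
  const summary = { A: 0, B: 0, tie: 0, flip: 0 };
  for (const { forward, swapped } of cases) {
    summary[combineOrderings(forward, swapped)] += 1;
  }
  return summary;
}

console.log(
  summarizeDuel([
    { forward: "A", swapped: "A" },
    { forward: "A", swapped: "B" },
    { forward: "B", swapped: "B" },
  ])
);
```

A high flip count relative to wins would suggest the judge is reacting to ordering rather than content, which is why flips are reported separately instead of being split between the arms.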

human-review calibration

npm run eval:review records SME verdicts (thulr review); npm run eval auto-discovers .thulr/reviews/<trace>.reviews.json and folds them into calibrate --reviews (judge-vs-human TPR/TNR) on top of the deterministic-label axis.
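
As a rough illustration of what judge-vs-human TPR/TNR measures, the sketch below treats the human (SME) verdict as ground truth and "pass" as the positive class. Both choices are assumptions, as are the function name `judgeVsHuman` and the row shape; the actual `.reviews.json` schema is not shown in this PR.

```javascript
// Hypothetical sketch: human verdicts are treated as ground truth.
// Each row pairs the human verdict with the judge verdict for one trace
// (true = pass, false = fail). The row shape is an assumption.
function judgeVsHuman(rows) {
  let tp = 0, fn = 0, tn = 0, fp = 0;
  for (const { human, judge } of rows) {
    if (human && judge) tp += 1;        // both say pass
    else if (human && !judge) fn += 1;  // judge missed a human pass
    else if (!human && !judge) tn += 1; // both say fail
    else fp += 1;                       // judge passed a human fail
  }
  return {
    tpr: tp + fn > 0 ? tp / (tp + fn) : null, // null when no human passes exist
    tnr: tn + fp > 0 ? tn / (tn + fp) : null, // null when no human fails exist
  };
}

console.log(
  judgeVsHuman([
    { human: true, judge: true },
    { human: true, judge: false },
    { human: false, judge: false },
  ])
);
```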

pareto

npm run eval:pareto ranks failure modes across stored traces — which failure on which version to fix first (free, no judge calls).
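
A Pareto ranking of this kind can be pictured as a frequency count with a running share. The sketch below assumes each stored failure reduces to a `{ version, mode }` record; that shape and the name `paretoRank` are hypothetical and do not reflect thulr's trace format.

```javascript
// Hypothetical sketch: rank (version, failure mode) pairs by frequency.
// The { version, mode } record shape is an assumption for illustration.
function paretoRank(failures) {
  const counts = new Map();
  for (const { version, mode } of failures) {
    const key = JSON.stringify([version, mode]);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  const total = failures.length;
  let running = 0;
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1]) // most frequent first
    .map(([key, count]) => {
      const [version, mode] = JSON.parse(key);
      running += count;
      return { version, mode, count, cumulativeShare: running / total };
    });
}

console.log(
  paretoRank([
    { version: "v1", mode: "timeout" },
    { version: "v1", mode: "timeout" },
    { version: "v2", mode: "missing-evidence" },
  ])
);
```

Because this only counts what is already stored, it needs no judge calls, which matches the "free" claim above.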

named criteria (score headroom)

The two hard review cases carry orthogonal thulr.criteria.<dim> dimensions (evidence_quality, impact_explanation) judged alongside the completeness criterion. Opt-in gating via --score-guardrail=<dim>.
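
To make the attribute naming concrete, here is a small sketch that collects `thulr.criteria.<dim>` attributes from a flat attribute map and validates `--score-guardrail=<dim>` flags against them. The helper names, the attribute values, and the validation behaviour are assumptions; only the attribute prefix and the flag name come from this PR.

```javascript
// Hypothetical sketch: only the "thulr.criteria." prefix and the
// --score-guardrail flag name come from the PR; the rest is illustrative.
const CRITERIA_PREFIX = "thulr.criteria.";

// Collect named criteria from a flat attribute map, keyed by dimension.
function namedCriteria(attributes) {
  const dims = {};
  for (const [key, value] of Object.entries(attributes)) {
    if (key.startsWith(CRITERIA_PREFIX)) {
      dims[key.slice(CRITERIA_PREFIX.length)] = value;
    }
  }
  return dims;
}

// Return the dimensions requested via --score-guardrail=<dim>,
// rejecting any that the case does not define (assumed behaviour).
function guardrailDims(argv, dims) {
  const flag = "--score-guardrail=";
  const requested = argv
    .filter((arg) => arg.startsWith(flag))
    .map((arg) => arg.slice(flag.length));
  const known = Object.keys(dims);
  const unknown = requested.filter((dim) => !known.includes(dim));
  if (unknown.length > 0) {
    throw new Error(`unknown guardrail dimension(s): ${unknown.join(", ")}`);
  }
  return requested;
}

const dims = namedCriteria({
  "thulr.criteria.evidence_quality": "placeholder criterion text",
  "thulr.criteria.impact_explanation": "placeholder criterion text",
  "other.attribute": "ignored",
});
console.log(guardrailDims(["--score-guardrail=evidence_quality"], dims));
```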

hardening

  • eval:compare gains --timeout (applies to subject and judge agents) and comma-separated --filter; judge.mjs no longer hardcodes a 120s judge cap.
  • Broadened the session-cache existence-check scorer after a confirmed false negative — the judge was right, the regex missed the "miss guard / unknown id throws" phrasing.
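
The two new eval:compare options could be parsed along these lines. This is a sketch, not the shared arg parser from the PR: the `--flag=value` form, seconds as the timeout unit, and the name `parseCompareArgs` are assumptions.

```javascript
// Hypothetical sketch of parsing --timeout and a comma-separated --filter.
// Assumptions: flags use the --name=value form and the timeout is in seconds.
function parseCompareArgs(argv) {
  const opts = { timeoutSec: undefined, filters: [] };
  for (const arg of argv) {
    if (arg.startsWith("--timeout=")) {
      const n = Number(arg.slice("--timeout=".length));
      if (!Number.isFinite(n) || n <= 0) {
        throw new Error(`invalid --timeout: ${arg}`);
      }
      opts.timeoutSec = n; // would apply to both subject and judge agents
    } else if (arg.startsWith("--filter=")) {
      opts.filters = arg
        .slice("--filter=".length)
        .split(",")
        .map((s) => s.trim())
        .filter(Boolean); // drop empty entries such as a trailing comma
    }
  }
  return opts;
}

console.log(parseCompareArgs(["--timeout=300", "--filter=review,session-cache"]));
```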

tests / docs

  • New bridge unit tests (duel/pareto/review args, duel-summary formatter, named-criteria emission, shared arg parser). 94 tests pass, npm run check green.
  • evals/README.md, CHANGELOG.md (Unreleased), and all 0.1.2 → 0.1.3 stamps updated.

verification

  • Offline: dry-run trace carries the new thulr.criteria.<dim> attributes; thulr inspect-trace reports judge-grade with 0 issues.
  • Live: duel ran end-to-end (zero-token fake-judge smoke + real A/B); named criteria judged into their own dimensions.

🤖 Generated with Claude Code

…med criteria

- eval:compare --pairwise runs thulr's position-swapped `duel` (relative win-rate
  judging; flips = judge position bias) over one self-contained trace per arm,
  replacing the hand-rolled in-process pairwise judge
- human-review calibration: `eval:review` records SME verdicts; `eval` auto-folds
  them into `calibrate --reviews` (judge-vs-human TPR/TNR)
- `eval:pareto` ranks failure modes across stored traces (free, no judge calls)
- multi-dimension named criteria (thulr.criteria.<dim>) on the review cases
  (evidence_quality, impact_explanation) for score headroom; opt-in
  --score-guardrail=<dim>
- eval:compare: configurable --timeout (subject + judge agents) and
  comma-separated --filter; judge.mjs no longer hardcodes a 120s judge cap
- broaden the session-cache existence-check scorer (fixes a confirmed false
  negative where the judge was right and the regex missed the phrasing)
- bridge: duelArgs/paretoArgs/reviewArgs/formatDuelSummary + tests; shared arg
  parser; 0.1.2 -> 0.1.3 docs/stamps

evals/ is not packaged (.npmignore), so this is dev-tooling only — no version bump.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5fdf012acb

Comment thread evals/compare.mjs
let duelReport = null;
if (pairwise && !dryRun) {
  const eligible = rows.filter((r) => r.duelEligible);
  if (!thulr.available()) {

P2: Move thulr pairwise preflight before paid arms

When npm run eval:compare -- --pairwise is run without thulr on PATH, this availability check happens only after the loop above has already run every flows/plain arm and the absolute judges. The command can therefore spend tokens and wall-clock time, then merely print that the requested duel could not run. Since --pairwise is the new thulr-backed metric, preflight thulr and fail before starting the paid case loop, so users do not get a successful-looking A/B run with no duel report.
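
One possible shape for the suggested ordering is sketched below. It is not the code in evals/compare.mjs: `runCompare`, `thulrAvailable`, `runPaidArms`, and `runDuel` are hypothetical stand-ins injected as parameters, and the sketch is synchronous for brevity where the real harness is presumably async.

```javascript
// Hypothetical sketch: check for thulr before any paid work starts.
// All four callbacks are illustrative stand-ins, not the harness API.
function runCompare({ pairwise, dryRun, thulrAvailable, runPaidArms, runDuel }) {
  const wantsDuel = pairwise && !dryRun;
  if (wantsDuel && !thulrAvailable()) {
    // Fail here, before the case loop spends tokens or wall-clock time.
    throw new Error("--pairwise needs thulr on PATH; aborting before any paid arm runs");
  }
  const rows = runPaidArms(); // subject arms plus absolute judges
  const eligible = rows.filter((r) => r.duelEligible);
  const duelReport = wantsDuel ? runDuel(eligible) : null;
  return { rows, duelReport };
}

const result = runCompare({
  pairwise: true,
  dryRun: false,
  thulrAvailable: () => true,
  runPaidArms: () => [{ id: "case-1", duelEligible: true }],
  runDuel: (eligible) => ({ compared: eligible.length }),
});
console.log(result.duelReport);
```

With this ordering, a missing thulr surfaces as an immediate error instead of a completed A/B run that lacks its duel report.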
