Thulr · justintime109 · Jun 13, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -29,6 +29,16 @@ that must agree are `package.json`, `PI_FLOWS_VERSION` in
   summary now prints thulr's numeric score, pass-rate, and efficiency deltas from
   `thulr gate --json` before the human gate report, and `--noise-band=<n>` makes
   guardrail tolerance explicit.
+- Evals: adopt thulr 0.1.3. `npm run eval:compare -- --pairwise` now runs thulr's
+  calibrated, position-swapped **`duel`** (relative win-rate judging, flips reported
+  as judge position bias) over one self-contained trace per arm, replacing the
+  harness's hand-rolled in-process pairwise judge. `npm run eval:review` records
+  human SME verdicts and `npm run eval` folds them into calibration as a second
+  ground-truth axis (`--reviews`; judge-vs-human TPR/TNR), auto-discovering
+  `.thulr/reviews/<trace>.reviews.json`. `npm run eval:pareto` ranks failure modes
+  across stored traces (which failure on which prompt/config version to fix first).
+  Calibration also surfaces thulr 0.1.3's judge-trust gate: a judge blind in either
+  direction downgrades a clean gate PASS to WARN.
 - Vote/orchestrate quality: same-agent/model voters now receive complementary
   stances so ballots are not identical prompt replays, and orchestrate workers
   now see the overall goal/contract alongside their assigned subtask before

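Reviewer note on the judge-trust gate described in the CHANGELOG entry above: the following is a minimal sketch of how judge-vs-ground-truth TPR/TNR and the PASS-to-WARN downgrade could fit together. It is not thulr's implementation. The names `rates` and `applyTrustGate`, the `{ judge, truth }` case shape, and the choice of "pass" as the positive class are all assumptions made for illustration.

```javascript
// Hypothetical case shape: { judge: "pass" | "fail", truth: "pass" | "fail" }.
// Assumption: "pass" is the positive class, so TPR is the share of true passes
// the judge also passed, and TNR is the share of true fails it also failed.
function rates(cases) {
	const counts = { tp: 0, fn: 0, tn: 0, fp: 0 };
	for (const { judge, truth } of cases) {
		if (truth === "pass") counts[judge === "pass" ? "tp" : "fn"] += 1;
		else counts[judge === "fail" ? "tn" : "fp"] += 1;
	}
	// A rate with no labeled cases behind it is reported as null, not 0.
	const ratio = (hit, miss) => (hit + miss === 0 ? null : hit / (hit + miss));
	return { tpr: ratio(counts.tp, counts.fn), tnr: ratio(counts.tn, counts.fp) };
}

// A judge is treated as "blind" in a direction when that rate is exactly 0.
// Only a clean PASS is downgraded; any other gate result is left alone.
function applyTrustGate(gate, { tpr, tnr }) {
	const blind = [];
	if (tpr === 0) blind.push("TPR");
	if (tnr === 0) blind.push("TNR");
	if (gate === "PASS" && blind.length > 0) return { gate: "WARN", blind };
	return { gate, blind };
}

// A judge that passes everything never catches a true failure: TNR is 0.
const rubberStamp = rates([
	{ judge: "pass", truth: "pass" },
	{ judge: "pass", truth: "pass" },
	{ judge: "pass", truth: "fail" },
	{ judge: "pass", truth: "fail" },
]);
const gated = applyTrustGate("PASS", rubberStamp);
console.log(rubberStamp, gated);
```

In this sketch the rubber-stamp judge gets `tpr: 1` and `tnr: 0`, so the gate result becomes `WARN` with `TNR` named as the blind dimension.
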
diff --git a/evals/README.md b/evals/README.md
@@ -38,6 +38,7 @@ npm run eval -- --judge-model=anthropic/claude-opus-4-8   # thulr judge model (d
 npm run eval -- --judge-bin=/path/to/judge-wrapper   # override thulr's judge command
 npm run eval -- --samples=3        # judge each case 3×: majority verdict, mean score, flake warnings (3× judge spend)
 npm run eval -- --eval-set=.thulr/eval-sets/release.json   # overlay promoted criteria / guardrail authority
+npm run eval -- --reviews=.thulr/reviews/thulr-trace.reviews.json   # fold human SME verdicts into calibration (judge-vs-human TPR/TNR)
 npm run eval -- --efficiency-guardrail=cost_usd --efficiency-guardrail=tokens   # fail on spend/token regressions
 npm run eval -- --noise-band=0.10  # regression tolerance for score/pass-rate/efficiency guardrails (default 0.05)
 npm run eval -- --cap=1.00         # per-case USD ceiling on flow delegations (default 0.50)
@@ -76,7 +77,7 @@ verifies the binary, workspace, store, and that thulr's judge binary `pi` resolv
 3. `thulr label-failures --trace <file>` applies thulr's failure-mode ontology
    and writes labels for calibration/triage.
 4. `thulr judge --trace <file>` grades each case's answer against its inline
-   `criterion` → an EvalRun. thulr (0.1.2) reads everything from the trace — no
+   `criterion` → an EvalRun. thulr (0.1.3) reads everything from the trace — no
    separate cases-manifest or labels files. With `--samples=N` each case is judged
    N times and aggregated (majority verdict, ties fail safe; mean score) — the
    EvalRun's `score_stddev` then reports **judge noise** instead of cross-case
@@ -86,6 +87,12 @@ verifies the binary, workspace, store, and that thulr's judge binary `pi` resolv
    — how well the judge's verdicts track the inline deterministic labels, with
    failure labels included in the report. (An uncalibrated judge can silently
    certify regressions; this is the calibration the old single-judge setup lacked.)
+   Record human SME verdicts with `npm run eval:review` and the harness folds them
+   in as a second ground-truth axis (`--reviews`; judge-vs-human TPR/TNR) — see
+   [Human review & failure triage](#human-review--failure-triage). thulr 0.1.3 also
+   queues every judge/ground-truth disagreement onto `thulr queue` and feeds this
+   calibration into the gate: a judge blind in either direction (TPR or TNR 0% over
+   labeled cases) downgrades a clean PASS to WARN with the dimension named.
 6. Before gating, pi-flows writes `.thulr/runs/candidate.gate.json`, which is the
    judged EvalRun with calibration canaries filtered out and summaries
    recomputed. `thulr gate` compares that gate candidate to
@@ -224,7 +231,7 @@ and the same cross-model judge:
 
 ```bash
 npm run eval:compare                    # all cases, both arms
-npm run eval:compare -- --pairwise      # add order-controlled pairwise judging (the sensitive metric)
+npm run eval:compare -- --pairwise      # add thulr's relative duel (the sensitive metric)
 npm run eval:compare -- --filter=vote   # scope to keep cost down (runs both arms per case)
 npm run eval:compare -- --write=evals/compare.json
 npm run eval:compare -- --dry-run       # wiring smoke, no model
@@ -235,14 +242,50 @@ PI_FLOWS_TRACE_FILE=/tmp/ab.jsonl npm run eval:compare -- --pairwise --write=eva
 npm run trace:report -- /tmp/ab.jsonl
 ```
 
-`eval:compare` keeps its own **order-controlled pairwise** judge (run twice with
-positions swapped, scored a win only when both orderings agree, told *not* to
-reward length) — the sensitive head-to-head metric for small gaps that thulr's
-absolute per-dimension scoring can't resolve. A few objective checks are
-pi-flows-only by construction (route dispatch, the same-model vote warning); plain
-pi can't satisfy them, so read those as *capabilities flows adds*, not plain losses.
-Give a case a `baselinePrompt` when its flow params encode goal info outside `task`
-(e.g. a return contract) so the plain arm is graded on the same goal.
+With `--pairwise` the harness emits one self-contained trace per arm and shells out
+to **`thulr duel`** (0.1.3) — thulr's calibrated relative judge. It pairs the arms
+by case id, judges each shared case **twice with the answers swapped**, and counts a
+win only when both orderings agree; opposite preferences are a **flip** (judge
+position bias), reported as judge noise and excluded from the win rate. This is the
+sensitive head-to-head metric for small gaps that thulr's absolute per-dimension
+scoring can't resolve — and it replaces the harness's old in-process pairwise judge.
+The duel spends two judge calls per eligible case (both arms must have reached the
+model) and persists a `thulr.duel_report.v1` at `.thulr/runs/compare-duel.json`. A
+few objective checks are pi-flows-only by construction (route dispatch, the
+same-model vote warning); plain pi can't satisfy them, so read those as *capabilities
+flows adds*, not plain losses. Give a case a `baselinePrompt` when its flow params
+encode goal info outside `task` (e.g. a return contract) so the plain arm is graded
+on the same goal.
+
+## Human review & failure triage
+
+Two free (no judge tokens) thulr 0.1.3 workflows close the loop on judged runs.
+
+**Record human verdicts** so calibration measures the judge against a person, not
+only the deterministic labels:
+
+```bash
+npm run eval:review -- --list                              # reviewed / unreviewed case ids for the last trace
+npm run eval:review -- --case single-answer-quality-judged --verdict pass
+npm run eval:review -- --case route-classifies-bug-to-recon --verdict fail \
+  --failure-mode routing.wrong_agent --note "should have gone to recon"
+```
+
+Verdicts land in `.thulr/reviews/thulr-trace.reviews.json` — the path the next
+`npm run eval` auto-discovers — so a recorded verdict needs no flag on the next run.
+`calibrate` then reports a **human** section (judge-vs-human TPR/TNR), and human
+verdicts take precedence over auto labels for the cases they cover. Point at an
+explicit set with `npm run eval -- --reviews=<path>`.
+
+**Rank failure modes across every stored trace** — which failure on which prompt or
+config version to fix first, joining deterministic labels, human reviews, and stored
+EvalRun scores:
+
+```bash
+npm run eval:pareto                         # rank by prompt version over evals/thulr-trace.jsonl
+npm run eval:pareto -- --by=config-version  # split by subject config instead
+npm run eval:pareto -- --limit=10           # top N rows
+```
 
 ## Experiments: champion/challenger (and the optimizer)
 
@@ -302,6 +345,9 @@ Append to `cases.mjs`:
   params: { agent: "recon", task: "…" },   // the flow tool input
   cwd: "/optional/working/dir",
   criterion: "One strict, literal statement a correct answer must satisfy.",  // graded by thulr's judge
+  namedCriteria: {                           // optional: extra judge dimensions (0.1.3)
+    evidence_quality: "Each claim cites the specific code it refers to.",
+  },
   score(result, ctx) {                       // objective, deterministic check
     const ok = /expected/.test(result.content[0].text);
     return { pass: ok, score: ok ? 1 : 0, notes: "…" };
@@ -316,6 +362,15 @@ single literal statement of what a correct answer must say; thulr grades the ans
 text against it on a different vendor than the subject. Always provide a `mock` so
 `--dry-run` can exercise the runner — and the artifact emission — offline.
 
+**Named criteria (`namedCriteria`)** add thulr 0.1.3 multi-dimension judging: each
+`{ dimension: "criterion text" }` entry is emitted as `thulr.criteria.<dimension>`
+on the graded span and judged into **its own dimension** alongside the required
+`criterion` — with its own pass-rate, score delta, and calibration. Use them for
+*orthogonal* quality axes (e.g. `evidence_quality`, `impact_explanation`) so a
+near-saturated case still produces a gradient. Dimension names must be non-empty,
+whitespace-free, and not `criterion`. They are observed by default; gate one with
+`--score-guardrail=<dimension>` once it looks stable.
+
 ### Hard cases (`hard: true`)
 
 For **score-tracked** cases — ones that intentionally land mid-scale so a better
@@ -327,7 +382,10 @@ the run to be green — only a regression in their mean score blocks. Keep the `
 a *complete* answer so `--dry-run` stays green. See `review-finds-all-webhook-defects`
 (4 defects) and `review-finds-session-cache-defects` (3 defects) — multi-defect code
 reviews where a typical pass misses the subtler ones (signature verification, TTL
-validation), so a sharper prompt has room to climb.
+validation), so a sharper prompt has room to climb. Both also carry `namedCriteria`
+(`evidence_quality`, `impact_explanation`) so the judge grades *how well* each defect
+is explained, not just whether all were found — extra headroom on cases that would
+otherwise saturate at "found them all."
 
 A *frontier* subject model exhausts these small fixtures (it finds every defect), so
 the score pins at 1.0 with no headroom. Rather than pin a different model per case,

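Reviewer note on the `thulr duel` paragraph in the README hunk above: the win/flip rule it describes can be sketched as follows. This is an illustration under stated assumptions, not thulr's code. The function name `summarizeDuel`, the `{ id, first, swapped }` input shape, and the arm names `flows` and `plain` are hypothetical, and each preference is assumed to be already mapped back to the arm it favors.

```javascript
// Hypothetical input: for every shared case id, the arm the judge preferred in
// the first ordering and in the swapped ordering.
function summarizeDuel(cases) {
	const wins = { flows: 0, plain: 0 };
	const flips = [];
	for (const { id, first, swapped } of cases) {
		if (first === swapped) wins[first] += 1; // both orderings agree: a win
		else flips.push(id); // opposite preferences: judge position bias
	}
	const decided = wins.flows + wins.plain;
	return {
		wins,
		flips,
		// Flips stay out of the denominator, so they favor neither arm.
		flowsWinRate: decided === 0 ? null : wins.flows / decided,
	};
}

const report = summarizeDuel([
	{ id: "case-a", first: "flows", swapped: "flows" },
	{ id: "case-b", first: "flows", swapped: "plain" },
	{ id: "case-c", first: "plain", swapped: "plain" },
	{ id: "case-d", first: "flows", swapped: "flows" },
]);
console.log(report);
```

Here `case-b` is a flip, so the win rate is computed over the three decided cases only (two for `flows`, one for `plain`).
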
diff --git a/evals/args.mjs b/evals/args.mjs
@@ -0,0 +1,27 @@
+// Tiny argv parser for the eval CLI wrappers (review.mjs, pareto.mjs). Accepts both
+// `--name value` (the style thulr's own CLI uses) and `--name=value` (the style the
+// rest of the harness uses), plus bare boolean flags (`--list`, `--json`). Returns a
+// plain object keyed by flag name. A token that itself starts with `--` is never
+// consumed as a value, so a bare flag immediately before another flag stays boolean.
+// Repeated flags keep the last value; a positional is ignored unless it directly follows a `--name` flag (it becomes the value).
+export function parseArgs(argv) {
+	const opts = {};
+	for (let i = 0; i < argv.length; i++) {
+		const a = argv[i];
+		if (!a.startsWith("--")) continue;
+		const eq = a.indexOf("=");
+		if (eq !== -1) {
+			opts[a.slice(2, eq)] = a.slice(eq + 1);
+			continue;
+		}
+		const name = a.slice(2);
+		const next = argv[i + 1];
+		if (next !== undefined && !next.startsWith("--")) {
+			opts[name] = next;
+			i += 1;
+		} else {
+			opts[name] = true;
+		}
+	}
+	return opts;
+}
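
Reviewer note: a quick usage sketch for `parseArgs` as added above. The function body is copied from the hunk with the `export` keyword dropped so the snippet runs as a plain script. The argument lists are made-up examples that reuse flag names from the README.

```javascript
// Copy of parseArgs from evals/args.mjs, minus `export`.
function parseArgs(argv) {
	const opts = {};
	for (let i = 0; i < argv.length; i++) {
		const a = argv[i];
		if (!a.startsWith("--")) continue;
		const eq = a.indexOf("=");
		if (eq !== -1) {
			opts[a.slice(2, eq)] = a.slice(eq + 1);
			continue;
		}
		const name = a.slice(2);
		const next = argv[i + 1];
		if (next !== undefined && !next.startsWith("--")) {
			opts[name] = next;
			i += 1;
		} else {
			opts[name] = true;
		}
	}
	return opts;
}

// Space-separated and `=` styles mix freely, alongside a bare boolean flag.
const review = parseArgs(["--case", "route-classifies-bug-to-recon", "--verdict=fail", "--list"]);
console.log(review);
// { case: 'route-classifies-bug-to-recon', verdict: 'fail', list: true }

// A bare flag right before another flag stays boolean; the last repeat wins.
// Values are always strings, so callers must convert numbers themselves.
const pareto = parseArgs(["--json", "--limit", "5", "--limit=10"]);
console.log(pareto);
// { json: true, limit: '10' }

// Edge case: a token after a bare flag is consumed as that flag's value.
const edge = parseArgs(["--list", "extra"]);
console.log(edge);
// { list: 'extra' }
```

The last case is worth keeping in mind for boolean flags such as `--list`: anything that follows one without a leading `--` is read as its value.
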
diff --git a/evals/cases.mjs b/evals/cases.mjs
@@ -148,6 +148,13 @@ export const CASES = [
 		cwd: fixturesRepo,
 		baselinePrompt: "Review billing-webhook.js for ALL production-correctness defects, not just the most obvious one. Name each distinct defect and why it matters.",
 		criterion: "The review identifies ALL FOUR distinct defects: (1) recordPayment references `ledger`, which is never declared/initialized, so every call throws a ReferenceError (500); (2) no idempotency/deduplication, so a duplicate or retried delivery double-counts the payment; (3) no verification of the webhook's signature/authenticity, so a forged request is accepted as a real payment; (4) no input validation or error handling, so a malformed `req.body.data.object` throws unhandled and 500s. Fewer than four is incomplete.",
+		// Orthogonal quality dimensions (0.1.3 multi-dimension judging) — graded
+		// alongside the completeness `criterion` above so a review that names defects
+		// but explains them shallowly scores lower here, giving the suite headroom.
+		namedCriteria: {
+			evidence_quality: "Every defect named is pinned to the specific code that causes it (e.g. `recordPayment`/the undeclared `ledger`, `req.body.data.object`, the missing signature check) rather than described in vague or generic terms.",
+			impact_explanation: "Every defect states its concrete production impact (e.g. a ReferenceError 500 on every call, double-counted payments on a retried webhook delivery, a forged request accepted as a real payment, an unhandled 500 on a malformed body), not merely that something is wrong.",
+		},
 		score(r) {
 			const body = text(r);
 			const ledger = /ledger/i.test(body) && /(never (declared|defined|initiali)|undeclared|undefined|referenceerror|not (declared|defined|initiali))/i.test(body);
@@ -166,9 +173,20 @@ export const CASES = [
 		cwd: fixturesRepo,
 		baselinePrompt: "Review session-cache.js for ALL correctness and reliability defects, not just the most obvious one. Name each distinct defect and why it matters.",
 		criterion: "The review identifies ALL THREE distinct defects: (1) getSession reads `entry.expiresAt` without checking the id exists, so an unknown/missing id dereferences `undefined` and throws a TypeError; (2) expired entries are never evicted (getSession returns null but leaves them), so the store grows unbounded — a memory leak; (3) ttlSeconds is never validated, so a missing, NaN, or negative TTL produces a broken/garbage expiry. Fewer than three is incomplete.",
+		// Orthogonal quality dimensions (0.1.3 multi-dimension judging) — graded
+		// alongside the completeness `criterion` above so a shallow-but-complete
+		// review scores lower here, giving the suite headroom.
+		namedCriteria: {
+			evidence_quality: "Every defect named is pinned to the specific code that causes it (e.g. `getSession` dereferencing `entry.expiresAt`, the never-evicted `sessions` entries, the unvalidated `ttlSeconds`) rather than described in vague or generic terms.",
+			impact_explanation: "Every defect states its concrete impact (e.g. a TypeError when the id is unknown, unbounded memory growth from expired entries never being evicted, a broken expiry from a missing/NaN/negative TTL), not merely that something is wrong.",
+		},
 		score(r) {
 			const body = text(r);
-			const existence = /(entry|session|id)[^.]{0,40}(undefined|missing|absent|does(n'?t| not) exist|not (found|present|exist)|no[^.]{0,10}(existence|null|presence) check)|throws?[^.]{0,30}(unknown|missing|absent|undefined|no .{0,8}(id|session|entry))|typeerror|crash[^.]{0,20}(missing|unknown|absent|undefined)/i.test(body);
+			// Broadened after a real false negative: the model wrote "without a miss
+			// guard / unknown id throws / cache miss can crash", none of which the old
+			// pattern matched. Match the concept (a missing/unknown id or cache miss
+			// dereferences/throws, or a missing existence guard), not one phrasing.
+			const existence = /\btypeerror\b|(unknown|missing|absent|non-?existent|invalid|unrecogni[sz]ed)[^.]{0,30}\b(id|key|entry|session|lookup)\b|\bcache[- ]?miss\b|(entry|session|getsession)[^.]{0,40}(undefined|null|throw|crash|deref|not[^.]{0,8}(exist|found|present))|(no|missing|without|lacks?|add|needs?)[^.]{0,25}(existence|presence|null|miss|nil)?[- ]?(guard|check)|\bmiss[- ]?guard\b/i.test(body);
 			const leak = /memory leak|never (evict|delet|remov|clean|free|purg)|unbounded|grow[^.]{0,16}(forever|unbounded|indefinit|without bound)|not[^.]{0,8}(evict|delet|remov|clean|purg)|\bleak/i.test(body);
 			const ttl = /ttlseconds|\bttl\b/i.test(body) && /validat|negativ|\bnan\b|invalid|unchecked|non-numeric|immortal|never expir/i.test(body);
 			const found = [existence && "no-existence-check", leak && "memory-leak", ttl && "no-ttl-validation"].filter(Boolean);
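
Reviewer note: a small check of the broadened `existence` pattern against the three phrasings quoted in the new code comment. Both regexes are copied verbatim from the hunk above. The `samples` list and the negative control sentence are assumptions added for illustration, and each phrase is tested on its own, not as one combined answer.

```javascript
// Old pattern (the removed line) and new pattern (the added line), verbatim.
const oldExistence = /(entry|session|id)[^.]{0,40}(undefined|missing|absent|does(n'?t| not) exist|not (found|present|exist)|no[^.]{0,10}(existence|null|presence) check)|throws?[^.]{0,30}(unknown|missing|absent|undefined|no .{0,8}(id|session|entry))|typeerror|crash[^.]{0,20}(missing|unknown|absent|undefined)/i;
const newExistence = /\btypeerror\b|(unknown|missing|absent|non-?existent|invalid|unrecogni[sz]ed)[^.]{0,30}\b(id|key|entry|session|lookup)\b|\bcache[- ]?miss\b|(entry|session|getsession)[^.]{0,40}(undefined|null|throw|crash|deref|not[^.]{0,8}(exist|found|present))|(no|missing|without|lacks?|add|needs?)[^.]{0,25}(existence|presence|null|miss|nil)?[- ]?(guard|check)|\bmiss[- ]?guard\b/i;

const samples = [
	"without a miss guard",
	"unknown id throws",
	"cache miss can crash",
	// Negative control: describes the leak defect only, not the existence check.
	"The store grows unbounded because expired sessions are never evicted.",
];

const results = samples.map((body) => ({
	body,
	old: oldExistence.test(body),
	new: newExistence.test(body),
}));
for (const r of results) console.log(r.old, r.new, r.body);
// false true without a miss guard
// false true unknown id throws
// false true cache miss can crash
// false false The store grows unbounded because expired sessions are never evicted.
```

The pattern is deliberately loose (for example, the `no`/`add` alternatives carry no word boundary), so a check like this is only a spot test, not evidence that it never over-matches.
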