10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -29,6 +29,16 @@ that must agree are `package.json`, `PI_FLOWS_VERSION` in
summary now prints thulr's numeric score, pass-rate, and efficiency deltas from
`thulr gate --json` before the human gate report, and `--noise-band=<n>` makes
guardrail tolerance explicit.
- Evals: adopt thulr 0.1.3. `npm run eval:compare -- --pairwise` now runs thulr's
calibrated, position-swapped **`duel`** (relative win-rate judging, flips reported
as judge position bias) over one self-contained trace per arm, replacing the
harness's hand-rolled in-process pairwise judge. `npm run eval:review` records
human SME verdicts and `npm run eval` folds them into calibration as a second
ground-truth axis (`--reviews`; judge-vs-human TPR/TNR), auto-discovering
`.thulr/reviews/<trace>.reviews.json`. `npm run eval:pareto` ranks failure modes
across stored traces (which failure on which prompt/config version to fix first).
Calibration also surfaces thulr 0.1.3's judge-trust gate: a judge blind in either
direction (TPR or TNR of 0% over labeled cases) downgrades a clean gate PASS to WARN.
- Vote/orchestrate quality: same-agent/model voters now receive complementary
stances so ballots are not identical prompt replays, and orchestrate workers
now see the overall goal/contract alongside their assigned subtask before
80 changes: 69 additions & 11 deletions evals/README.md
@@ -38,6 +38,7 @@ npm run eval -- --judge-model=anthropic/claude-opus-4-8 # thulr judge model (d
npm run eval -- --judge-bin=/path/to/judge-wrapper # override thulr's judge command
npm run eval -- --samples=3 # judge each case 3×: majority verdict, mean score, flake warnings (3× judge spend)
npm run eval -- --eval-set=.thulr/eval-sets/release.json # overlay promoted criteria / guardrail authority
npm run eval -- --reviews=.thulr/reviews/thulr-trace.reviews.json # fold human SME verdicts into calibration (judge-vs-human TPR/TNR)
npm run eval -- --efficiency-guardrail=cost_usd --efficiency-guardrail=tokens # fail on spend/token regressions
npm run eval -- --noise-band=0.10 # regression tolerance for score/pass-rate/efficiency guardrails (default 0.05)
npm run eval -- --cap=1.00 # per-case USD ceiling on flow delegations (default 0.50)
@@ -76,7 +77,7 @@ verifies the binary, workspace, store, and that thulr's judge binary `pi` resolv
3. `thulr label-failures --trace <file>` applies thulr's failure-mode ontology
and writes labels for calibration/triage.
4. `thulr judge --trace <file>` grades each case's answer against its inline
`criterion` → an EvalRun. thulr (0.1.2) reads everything from the trace — no
`criterion` → an EvalRun. thulr (0.1.3) reads everything from the trace — no
separate cases-manifest or labels files. With `--samples=N` each case is judged
N times and aggregated (majority verdict, ties fail safe; mean score) — the
EvalRun's `score_stddev` then reports **judge noise** instead of cross-case
@@ -86,6 +87,12 @@ verifies the binary, workspace, store, and that thulr's judge binary `pi` resolv
— how well the judge's verdicts track the inline deterministic labels, with
failure labels included in the report. (An uncalibrated judge can silently
certify regressions; this is the calibration the old single-judge setup lacked.)
Record human SME verdicts with `npm run eval:review` and the harness folds them
in as a second ground-truth axis (`--reviews`; judge-vs-human TPR/TNR) — see
[Human review & failure triage](#human-review--failure-triage). thulr 0.1.3 also
queues every judge/ground-truth disagreement onto `thulr queue` and feeds this
calibration into the gate: a judge blind in either direction (TPR or TNR 0% over
labeled cases) downgrades a clean PASS to WARN with the dimension named.
6. Before gating, pi-flows writes `.thulr/runs/candidate.gate.json`, which is the
judged EvalRun with calibration canaries filtered out and summaries
recomputed. `thulr gate` compares that gate candidate to
@@ -224,7 +231,7 @@ and the same cross-model judge:

```bash
npm run eval:compare # all cases, both arms
npm run eval:compare -- --pairwise # add order-controlled pairwise judging (the sensitive metric)
npm run eval:compare -- --pairwise # add thulr's relative duel (the sensitive metric)
npm run eval:compare -- --filter=vote # scope to keep cost down (runs both arms per case)
npm run eval:compare -- --write=evals/compare.json
npm run eval:compare -- --dry-run # wiring smoke, no model
@@ -235,14 +242,50 @@ PI_FLOWS_TRACE_FILE=/tmp/ab.jsonl npm run eval:compare -- --pairwise --write=eva
npm run trace:report -- /tmp/ab.jsonl
```

`eval:compare` keeps its own **order-controlled pairwise** judge (run twice with
positions swapped, scored a win only when both orderings agree, told *not* to
reward length) — the sensitive head-to-head metric for small gaps that thulr's
absolute per-dimension scoring can't resolve. A few objective checks are
pi-flows-only by construction (route dispatch, the same-model vote warning); plain
pi can't satisfy them, so read those as *capabilities flows adds*, not plain losses.
Give a case a `baselinePrompt` when its flow params encode goal info outside `task`
(e.g. a return contract) so the plain arm is graded on the same goal.
With `--pairwise` the harness emits one self-contained trace per arm and shells out
to **`thulr duel`** (0.1.3) — thulr's calibrated relative judge. It pairs the arms
by case id, judges each shared case **twice with the answers swapped**, and counts a
win only when both orderings agree; opposite preferences are a **flip** (judge
position bias), reported as judge noise and excluded from the win rate. This is the
sensitive head-to-head metric for small gaps that thulr's absolute per-dimension
scoring can't resolve — and it replaces the harness's old in-process pairwise judge.
The duel spends two judge calls per eligible case (both arms must have reached the
model) and persists a `thulr.duel_report.v1` at `.thulr/runs/compare-duel.json`. A
few objective checks are pi-flows-only by construction (route dispatch, the
same-model vote warning); plain pi can't satisfy them, so read those as *capabilities
flows adds*, not plain losses. Give a case a `baselinePrompt` when its flow params
encode goal info outside `task` (e.g. a return contract) so the plain arm is graded
on the same goal.
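
The swap-and-agree rule described above can be sketched in a few lines. This is an illustrative sketch only, not thulr's implementation: `duelSketch`, the `judge` callback, the two example judges, and the shape of the tally are all hypothetical, and computing the win rate over decided (non-flip) cases is an assumption about what "excluded from the win rate" means.

```javascript
// Hypothetical sketch of position-swapped pairwise judging (not thulr's code).
// `judge(first, second)` is an assumed callback that returns "first" or "second".
function duelSketch(cases, judge) {
  const tally = { aWins: 0, bWins: 0, flips: 0 };
  for (const { a, b } of cases) {
    // Ordering 1: arm A is shown first.
    const forward = judge(a, b) === "first" ? "a" : "b";
    // Ordering 2: the answers are swapped, so arm B is shown first.
    const swapped = judge(b, a) === "first" ? "b" : "a";
    if (forward !== swapped) tally.flips += 1; // preference followed position
    else if (forward === "a") tally.aWins += 1;
    else tally.bWins += 1;
  }
  // Assumption: flips are left out of the denominator entirely.
  const decided = tally.aWins + tally.bWins;
  return { ...tally, winRateA: decided > 0 ? tally.aWins / decided : null };
}

// Two toy judges: one judges content, one only looks at position.
const preferLonger = (first, second) =>
  first.length >= second.length ? "first" : "second";
const alwaysFirst = () => "first";

const pairs = [
  { a: "long answer", b: "short" },
  { a: "x", b: "yyy" },
];
console.log(duelSketch(pairs, preferLonger)); // one win per arm, no flips
console.log(duelSketch(pairs, alwaysFirst)); // every case is a flip
```

A judge that always prefers whichever answer it sees first never produces a counted win under this rule, which is why flips read as judge noise rather than as evidence for either arm.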

## Human review & failure triage

Two free (no judge tokens) thulr 0.1.3 workflows close the loop on judged runs.

**Record human verdicts** so calibration measures the judge against a person, not
only the deterministic labels:

```bash
npm run eval:review -- --list # reviewed / unreviewed case ids for the last trace
npm run eval:review -- --case single-answer-quality-judged --verdict pass
npm run eval:review -- --case route-classifies-bug-to-recon --verdict fail \
--failure-mode routing.wrong_agent --note "should have gone to recon"
```

Verdicts land in `.thulr/reviews/thulr-trace.reviews.json` — the path the next
`npm run eval` auto-discovers — so a recorded verdict needs no flag on the next run.
`calibrate` then reports a **human** section (judge-vs-human TPR/TNR), and human
verdicts take precedence over auto labels for the cases they cover. Point at an
explicit set with `npm run eval -- --reviews=<path>`.
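
As a rough illustration of what judge-vs-ground-truth TPR/TNR means here, the sketch below merges human verdicts over auto labels and applies the PASS to WARN downgrade described earlier. It is not thulr's calibration code: the function names and input shapes are hypothetical, treating "positive" as "ground truth says pass" is an assumption, and folding human and auto labels into one table is a simplification of the separate human section the report prints.

```javascript
// Illustrative sketch only; names and data shapes are hypothetical.
// Assumption: a "positive" case is one whose ground truth is a pass.
function calibrateSketch(judgeVerdicts, autoLabels, humanVerdicts = {}) {
  let tp = 0, fn = 0, tn = 0, fp = 0;
  for (const [caseId, judgedPass] of Object.entries(judgeVerdicts)) {
    // Human verdicts take precedence over auto labels for the cases they cover.
    const truth = caseId in humanVerdicts ? humanVerdicts[caseId] : autoLabels[caseId];
    if (truth === undefined) continue; // unlabeled cases are not counted
    if (truth && judgedPass) tp += 1;
    else if (truth && !judgedPass) fn += 1;
    else if (!truth && judgedPass) fp += 1;
    else tn += 1;
  }
  return {
    tpr: tp + fn > 0 ? tp / (tp + fn) : null, // share of true passes the judge passed
    tnr: tn + fp > 0 ? tn / (tn + fp) : null, // share of true failures the judge failed
  };
}

// A judge that is blind in either direction downgrades a clean PASS to WARN.
function applyJudgeTrust(gateStatus, { tpr, tnr }) {
  const blind = tpr === 0 || tnr === 0;
  return gateStatus === "PASS" && blind ? "WARN" : gateStatus;
}

const judge = { a: true, b: true, c: false };
const auto = { a: true, b: true, c: false };
const human = { b: false }; // a reviewer overrides the auto label for case b
const rates = calibrateSketch(judge, auto, human);
console.log(rates, applyJudgeTrust("PASS", rates));
```

In this toy run the single human verdict turns case `b` into a false positive for the judge, which lowers TNR without touching TPR.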

**Rank failure modes across every stored trace** — which failure on which prompt or
config version to fix first, joining deterministic labels, human reviews, and stored
EvalRun scores:

```bash
npm run eval:pareto # rank by prompt version over evals/thulr-trace.jsonl
npm run eval:pareto -- --by=config-version # split by subject config instead
npm run eval:pareto -- --limit=10 # top N rows
```

## Experiments: champion/challenger (and the optimizer)

@@ -302,6 +345,9 @@ Append to `cases.mjs`:
params: { agent: "recon", task: "…" }, // the flow tool input
cwd: "/optional/working/dir",
criterion: "One strict, literal statement a correct answer must satisfy.", // graded by thulr's judge
namedCriteria: { // optional: extra judge dimensions (0.1.3)
evidence_quality: "Each claim cites the specific code it refers to.",
},
score(result, ctx) { // objective, deterministic check
const ok = /expected/.test(result.content[0].text);
return { pass: ok, score: ok ? 1 : 0, notes: "…" };
@@ -316,6 +362,15 @@ single literal statement of what a correct answer must say; thulr grades the ans
text against it on a different vendor than the subject. Always provide a `mock` so
`--dry-run` can exercise the runner — and the artifact emission — offline.

**Named criteria (`namedCriteria`)** add thulr 0.1.3 multi-dimension judging: each
`{ dimension: "criterion text" }` entry is emitted as `thulr.criteria.<dimension>`
on the graded span and judged into **its own dimension** alongside the required
`criterion` — with its own pass-rate, score delta, and calibration. Use them for
*orthogonal* quality axes (e.g. `evidence_quality`, `impact_explanation`) so a
near-saturated case still produces a gradient. Dimension names must be non-empty,
whitespace-free, and not `criterion`. They are observed by default; gate one with
`--score-guardrail=<dimension>` once it looks stable.
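
The naming rules above are easy to check before a trace is emitted. The helper below is a hypothetical sketch, not the harness's actual emitter: `criteriaAttributesSketch` is an invented name, and only the `thulr.criteria.<dimension>` key format and the three naming rules come from the text above.

```javascript
// Hypothetical helper: validate dimension names and build the span attributes.
function criteriaAttributesSketch(namedCriteria = {}) {
  const attrs = {};
  for (const [dimension, text] of Object.entries(namedCriteria)) {
    // Rules from the README: non-empty, whitespace-free, and not `criterion`.
    if (dimension === "" || /\s/.test(dimension) || dimension === "criterion") {
      throw new Error(`invalid dimension name: ${JSON.stringify(dimension)}`);
    }
    attrs[`thulr.criteria.${dimension}`] = text;
  }
  return attrs;
}

console.log(
  criteriaAttributesSketch({
    evidence_quality: "Each claim cites the specific code it refers to.",
  }),
);
```

Rejecting `criterion` as a dimension name keeps a named criterion from colliding with the required single `criterion` on the same span.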

### Hard cases (`hard: true`)

For **score-tracked** cases — ones that intentionally land mid-scale so a better
@@ -327,7 +382,10 @@ the run to be green — only a regression in their mean score blocks. Keep the `
a *complete* answer so `--dry-run` stays green. See `review-finds-all-webhook-defects`
(4 defects) and `review-finds-session-cache-defects` (3 defects) — multi-defect code
reviews where a typical pass misses the subtler ones (signature verification, TTL
validation), so a sharper prompt has room to climb.
validation), so a sharper prompt has room to climb. Both also carry `namedCriteria`
(`evidence_quality`, `impact_explanation`) so the judge grades *how well* each defect
is explained, not just whether all were found — extra headroom on cases that would
otherwise saturate at "found them all."

A *frontier* subject model exhausts these small fixtures (it finds every defect), so
the score pins at 1.0 with no headroom. Rather than pin a different model per case,
27 changes: 27 additions & 0 deletions evals/args.mjs
@@ -0,0 +1,27 @@
// Tiny argv parser for the eval CLI wrappers (review.mjs, pareto.mjs). Accepts both
// `--name value` (the style thulr's own CLI uses) and `--name=value` (the style the
// rest of the harness uses), plus bare boolean flags (`--list`, `--json`). Returns a
// plain object keyed by flag name. A token that itself starts with `--` is never
// consumed as a value, so a bare flag immediately before another flag stays boolean.
// Repeated flags keep the last value; positionals are ignored.
export function parseArgs(argv) {
const opts = {};
for (let i = 0; i < argv.length; i++) {
const a = argv[i];
if (!a.startsWith("--")) continue;
const eq = a.indexOf("=");
if (eq !== -1) {
opts[a.slice(2, eq)] = a.slice(eq + 1);
continue;
}
const name = a.slice(2);
const next = argv[i + 1];
if (next !== undefined && !next.startsWith("--")) {
opts[name] = next;
i += 1;
} else {
opts[name] = true;
}
}
return opts;
}
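
A quick way to see the three behaviours the header comment describes is to run the parser on a few argument lists. The function is restated here without `export` so the snippet runs standalone; the sample argument lists are made up for illustration.

```javascript
// Same logic as parseArgs in evals/args.mjs, restated so this snippet is self-contained.
function parseArgs(argv) {
  const opts = {};
  for (let i = 0; i < argv.length; i++) {
    const a = argv[i];
    if (!a.startsWith("--")) continue; // positionals are ignored
    const eq = a.indexOf("=");
    if (eq !== -1) {
      opts[a.slice(2, eq)] = a.slice(eq + 1); // --name=value
      continue;
    }
    const name = a.slice(2);
    const next = argv[i + 1];
    if (next !== undefined && !next.startsWith("--")) {
      opts[name] = next; // --name value
      i += 1;
    } else {
      opts[name] = true; // bare boolean flag
    }
  }
  return opts;
}

// Space-separated values.
console.log(parseArgs(["--case", "route-classifies-bug-to-recon", "--verdict", "fail"]));
// → { case: 'route-classifies-bug-to-recon', verdict: 'fail' }

// A bare flag directly before another flag stays boolean.
console.log(parseArgs(["--list", "--by=config-version"]));
// → { list: true, by: 'config-version' }

// Repeated flags keep the last value; the positional is dropped.
console.log(parseArgs(["--limit=10", "--limit", "5", "stray"]));
// → { limit: '5' }
```

Note that every value comes back as a string (or `true`), so a caller that wants a number for something like `--limit` has to convert it itself.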
20 changes: 19 additions & 1 deletion evals/cases.mjs
@@ -148,6 +148,13 @@ export const CASES = [
cwd: fixturesRepo,
baselinePrompt: "Review billing-webhook.js for ALL production-correctness defects, not just the most obvious one. Name each distinct defect and why it matters.",
criterion: "The review identifies ALL FOUR distinct defects: (1) recordPayment references `ledger`, which is never declared/initialized, so every call throws a ReferenceError (500); (2) no idempotency/deduplication, so a duplicate or retried delivery double-counts the payment; (3) no verification of the webhook's signature/authenticity, so a forged request is accepted as a real payment; (4) no input validation or error handling, so a malformed `req.body.data.object` throws unhandled and 500s. Fewer than four is incomplete.",
// Orthogonal quality dimensions (0.1.3 multi-dimension judging) — graded
// alongside the completeness `criterion` above so a review that names defects
// but explains them shallowly scores lower here, giving the suite headroom.
namedCriteria: {
evidence_quality: "Every defect named is pinned to the specific code that causes it (e.g. `recordPayment`/the undeclared `ledger`, `req.body.data.object`, the missing signature check) rather than described in vague or generic terms.",
impact_explanation: "Every defect states its concrete production impact (e.g. a ReferenceError 500 on every call, double-counted payments on a retried webhook delivery, a forged request accepted as a real payment, an unhandled 500 on a malformed body), not merely that something is wrong.",
},
score(r) {
const body = text(r);
const ledger = /ledger/i.test(body) && /(never (declared|defined|initiali)|undeclared|undefined|referenceerror|not (declared|defined|initiali))/i.test(body);
@@ -166,9 +173,20 @@ export const CASES = [
cwd: fixturesRepo,
baselinePrompt: "Review session-cache.js for ALL correctness and reliability defects, not just the most obvious one. Name each distinct defect and why it matters.",
criterion: "The review identifies ALL THREE distinct defects: (1) getSession reads `entry.expiresAt` without checking the id exists, so an unknown/missing id dereferences `undefined` and throws a TypeError; (2) expired entries are never evicted (getSession returns null but leaves them), so the store grows unbounded — a memory leak; (3) ttlSeconds is never validated, so a missing, NaN, or negative TTL produces a broken/garbage expiry. Fewer than three is incomplete.",
// Orthogonal quality dimensions (0.1.3 multi-dimension judging) — graded
// alongside the completeness `criterion` above so a shallow-but-complete
// review scores lower here, giving the suite headroom.
namedCriteria: {
evidence_quality: "Every defect named is pinned to the specific code that causes it (e.g. `getSession` dereferencing `entry.expiresAt`, the never-evicted `sessions` entries, the unvalidated `ttlSeconds`) rather than described in vague or generic terms.",
impact_explanation: "Every defect states its concrete impact (e.g. a TypeError when the id is unknown, unbounded memory growth from expired entries never being evicted, a broken expiry from a missing/NaN/negative TTL), not merely that something is wrong.",
},
score(r) {
const body = text(r);
const existence = /(entry|session|id)[^.]{0,40}(undefined|missing|absent|does(n'?t| not) exist|not (found|present|exist)|no[^.]{0,10}(existence|null|presence) check)|throws?[^.]{0,30}(unknown|missing|absent|undefined|no .{0,8}(id|session|entry))|typeerror|crash[^.]{0,20}(missing|unknown|absent|undefined)/i.test(body);
// Broadened after a real false negative: the model wrote "without a miss
// guard / unknown id throws / cache miss can crash", none of which the old
// pattern matched. Match the concept (a missing/unknown id or cache miss
// dereferences/throws, or a missing existence guard), not one phrasing.
const existence = /\btypeerror\b|(unknown|missing|absent|non-?existent|invalid|unrecogni[sz]ed)[^.]{0,30}\b(id|key|entry|session|lookup)\b|\bcache[- ]?miss\b|(entry|session|getsession)[^.]{0,40}(undefined|null|throw|crash|deref|not[^.]{0,8}(exist|found|present))|(no|missing|without|lacks?|add|needs?)[^.]{0,25}(existence|presence|null|miss|nil)?[- ]?(guard|check)|\bmiss[- ]?guard\b/i.test(body);
const leak = /memory leak|never (evict|delet|remov|clean|free|purg)|unbounded|grow[^.]{0,16}(forever|unbounded|indefinit|without bound)|not[^.]{0,8}(evict|delet|remov|clean|purg)|\bleak/i.test(body);
const ttl = /ttlseconds|\bttl\b/i.test(body) && /validat|negativ|\bnan\b|invalid|unchecked|non-numeric|immortal|never expir/i.test(body);
const found = [existence && "no-existence-check", leak && "memory-leak", ttl && "no-ttl-validation"].filter(Boolean);
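
The effect of broadening the `existence` pattern can be checked directly against the three phrasings quoted in the comment above. Both regular expressions below are copied from this hunk; the variable names `oldExistence` and `newExistence` and the extra probe sentence are illustrative additions.

```javascript
// Old and new `existence` patterns, copied from the evals/cases.mjs hunk above.
const oldExistence = /(entry|session|id)[^.]{0,40}(undefined|missing|absent|does(n'?t| not) exist|not (found|present|exist)|no[^.]{0,10}(existence|null|presence) check)|throws?[^.]{0,30}(unknown|missing|absent|undefined|no .{0,8}(id|session|entry))|typeerror|crash[^.]{0,20}(missing|unknown|absent|undefined)/i;
const newExistence = /\btypeerror\b|(unknown|missing|absent|non-?existent|invalid|unrecogni[sz]ed)[^.]{0,30}\b(id|key|entry|session|lookup)\b|\bcache[- ]?miss\b|(entry|session|getsession)[^.]{0,40}(undefined|null|throw|crash|deref|not[^.]{0,8}(exist|found|present))|(no|missing|without|lacks?|add|needs?)[^.]{0,25}(existence|presence|null|miss|nil)?[- ]?(guard|check)|\bmiss[- ]?guard\b/i;

// The phrasings from the reported false negative.
const phrasings = ["without a miss guard", "unknown id throws", "cache miss can crash"];
for (const p of phrasings) {
  console.log(p, "| old:", oldExistence.test(p), "| new:", newExistence.test(p));
}

// A probe unrelated to the missing-id defect, to show how wide the new pattern is.
const unrelated = "add a type check";
console.log(unrelated, "| old:", oldExistence.test(unrelated), "| new:", newExistence.test(unrelated));
```

All three quoted phrasings miss the old pattern and hit the new one. The unrelated probe also hits the new pattern, because the `(no|missing|without|lacks?|add|needs?) ... (guard|check)` alternative matches any nearby "check", so the broadened check trades some precision for recall.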