feat: judge reliability - calibration, uncertainty, and bias debiasing (v0.3.0) by LesterALeong · Pull Request #2 · LesterALeong/llm-evalgate

LesterALeong · 2026-06-04T22:12:59Z

Summary

An LLM judge is itself a model, with variance and bias. This release adds the tools to measure and correct for that, so a judge can be defended in a design review instead of trusted on faith. All three pieces use the existing Dimension interface and stay offline-testable via injected callables.

Uncertainty: self-consistency (`SelfConsistencyJudge`)

Samples a judge N times and reports the mean, the spread (population standard deviation), and a normal-approximation 95% confidence interval, so you know whether a score is solid or a coin flip. `JudgeDimension.score()` was added as an un-cached single-shot scorer; the existing `evaluate` was refactored to delegate to it, with cache and graceful-failure behavior unchanged.
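The sampling idea can be sketched without the library. The block below is a minimal illustration, not the `SelfConsistencyJudge` implementation: `ScoreSummary` and `sample_scores` are hypothetical names, the judge is any injected callable that returns a float, and the interval is assumed to be mean ± 1.96 · pstdev / √N.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ScoreSummary:  # hypothetical name, not the library's ScoreDistribution
    mean: float
    stdev: float
    ci95: Tuple[float, float]


def sample_scores(score_fn: Callable[[str], float], text: str, n: int = 5) -> ScoreSummary:
    """Call an injected scorer n times and summarize the spread of its scores."""
    if n < 1:
        raise ValueError("n must be at least 1")
    scores = [score_fn(text) for _ in range(n)]
    mean = statistics.fmean(scores)
    stdev = statistics.pstdev(scores)  # population standard deviation
    # Assumed normal-approximation interval: mean +/- 1.96 * stdev / sqrt(n).
    half_width = 1.96 * stdev / math.sqrt(n)
    return ScoreSummary(mean, stdev, (mean - half_width, mean + half_width))
```

Because the scorer is injected, a test can pass a stub that replays fixed scores, which keeps the check offline. A wide interval relative to the pass threshold suggests the verdict should not be trusted from a single call.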

Position-bias debiasing: `PairwiseJudge`

Runs a pairwise comparison in both answer orders and only trusts a verdict that survives the swap; if the judge flips, it returns a tie with consistent=False. position_bias_rate() quantifies how often a judge flips across a set of pairs.
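A minimal sketch of the swap check follows. It is an illustration under stated assumptions, not the `PairwiseJudge` code: `SwapResult`, `compare_both_orders`, and `flip_rate` are hypothetical names, and the injected judge is assumed to answer by position with `"first"`, `"second"`, or `"tie"`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

# Assumed judge contract: given (shown_first, shown_second), return
# "first", "second", or "tie" for whichever position it prefers.
PositionalJudge = Callable[[str, str], str]

_FORWARD = {"first": "A", "second": "B", "tie": "tie"}
_SWAPPED = {"first": "B", "second": "A", "tie": "tie"}


@dataclass(frozen=True)
class SwapResult:  # hypothetical name, not the library's PairwiseResult
    winner: str  # "A", "B", or "tie"
    consistent: bool


def compare_both_orders(judge: PositionalJudge, a: str, b: str) -> SwapResult:
    """Judge (a, b) and (b, a); keep the verdict only if both orders agree."""
    forward = _FORWARD[judge(a, b)]
    swapped = _SWAPPED[judge(b, a)]
    if forward == swapped:
        return SwapResult(forward, True)
    # The verdict followed the position rather than the answer: call it a tie.
    return SwapResult("tie", False)


def flip_rate(judge: PositionalJudge, pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of pairs whose verdict did not survive the swap."""
    results = [compare_both_orders(judge, a, b) for a, b in pairs]
    if not results:
        return 0.0
    return sum(not r.consistent for r in results) / len(results)
```

A judge that always prefers whichever answer is shown first flips on every pair, so its rate under this sketch is 1.0; a judge that decides on content alone scores 0.0.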

Calibration: `calibrate_judge`

Measures judge-vs-human agreement: Pearson/Spearman correlation and MAE against human scores, plus accuracy and Cohen's kappa against human pass/fail. verbosity_bias() checks the failure mode where a judge just rewards length. New pearson/spearman/mae stats added to bench.metrics.

Quality

118 tests, 98% coverage, ruff clean.
All 4 examples run offline (new examples/judge_calibration.py demos all three tools).
Wheel builds at 0.3.0 with the golden dataset still bundled and loadable from an installed context.
Backwards compatible: no change to existing gates, judge, agentic, or bench behavior.

Deferred to v0.3.1 (reviewer follow-ups, non-blocking)

Store the verbatim model response in PairwiseResult.raw (currently reconstructed).
Remove the double prompt-render in JudgeDimension.evaluate.

Packaging

Version 0.2.0 -> 0.3.0

Extract an un-cached single-shot score() from JudgeDimension.evaluate and refactor evaluate to reuse it (cache behavior unchanged). Add a SelfConsistencyJudge that samples a base judge N times via score() and reports the score distribution (mean, pstdev, normal-approx 95% CI).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Implements PairwiseJudge that runs A-vs-B comparisons in both orders and only trusts a verdict consistent across the swap, falling back to a tie otherwise. Adds parse_pairwise, position_bias_rate diagnostic, default prompt template, and an offline test suite.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Append pearson/spearman/mae to bench.metrics and add judge.calibration (CalibrationSample, CalibrationReport, calibrate_judge, verbosity_bias) measuring score-level and label-level agreement plus verbosity bias.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- Top-level + judge/ + bench/ __init__ exports for SelfConsistencyJudge, ScoreDistribution, PairwiseJudge, PairwiseResult, position_bias_rate, calibrate_judge, CalibrationSample, CalibrationReport, verbosity_bias, and the pearson/spearman/mae stats.
- examples/judge_calibration.py: offline demo of uncertainty, position-bias debiasing, calibration, and verbosity-bias.
- README: "Judge reliability" section.
- version 0.2.0 -> 0.3.0.

LesterALeong and others added 5 commits June 4, 2026 16:26

docs: note self-consistency cost (N model calls per text, cache bypas…

fb47421

…sed) Reviewer follow-up (MINOR). Other reviewer suggestions deferred to v0.3.1: verbatim raw in PairwiseResult, and the evaluate/score double-render.

LesterALeong merged commit 1fc1df2 into master Jun 4, 2026
4 checks passed

LesterALeong deleted the feat/v0.3.0-judge-reliability branch June 4, 2026 22:14
