feat: judge reliability - calibration, uncertainty, and bias debiasing (v0.3.0) #2

Merged: LesterALeong merged 5 commits into master from feat/v0.3.0-judge-reliability on Jun 4, 2026

@LesterALeong (Owner)

Summary

An LLM judge is itself a model, with variance and bias. This release adds the tools to measure and correct for that, so a judge can be defended in a design review instead of trusted on faith. All three pieces use the existing Dimension interface and stay offline-testable via injected callables.

Uncertainty: self-consistency (SelfConsistencyJudge)

Samples a judge N times and reports the mean, the spread (population standard deviation), and a normal-approximation 95% confidence interval, so you know whether a score is solid or a coin flip. JudgeDimension.score() was added as an un-cached single-shot scorer; the existing evaluate was refactored to delegate to it, with cache and graceful-failure behavior unchanged.
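
To make the idea concrete, here is a minimal standalone sketch of self-consistency sampling. It is not the SelfConsistencyJudge implementation: `ScoreSummary`, `sample_judge`, and the `score_once` callable are hypothetical names, and the interval formula (mean ± 1.96 · pstdev / √N) is an assumption about what "normal-approx 95% CI" means here.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScoreSummary:  # hypothetical; not the library's ScoreDistribution
    scores: tuple[float, ...]
    mean: float
    pstdev: float
    ci95: tuple[float, float]


def sample_judge(score_once: Callable[[], float], n: int = 5) -> ScoreSummary:
    """Call an injected single-shot scorer n times and summarize the samples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    scores = tuple(score_once() for _ in range(n))
    mean = statistics.fmean(scores)
    spread = statistics.pstdev(scores)  # population standard deviation
    # Assumed normal approximation: half-width = 1.96 * pstdev / sqrt(n).
    half_width = 1.96 * spread / math.sqrt(n)
    return ScoreSummary(scores, mean, spread, (mean - half_width, mean + half_width))


# Offline usage: a canned iterator stands in for a real model call.
canned = iter([0.8, 0.6, 0.7, 0.9, 0.5])
summary = sample_judge(lambda: next(canned), n=5)
print(summary.mean, summary.pstdev, summary.ci95)
```

A wide interval relative to the scoring scale suggests the judge's score for that item should not be treated as decisive on its own.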

Position-bias debiasing: PairwiseJudge

Runs a pairwise comparison in both answer orders and only trusts a verdict that survives the swap; if the judge flips, it returns a tie with consistent=False. position_bias_rate() quantifies how often a judge flips across a set of pairs.
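
The swap check can be sketched in a few lines. This is an illustration under stated assumptions, not the PairwiseJudge code: `SwapResult`, `compare_both_orders`, and `flip_rate` are hypothetical names, and the injected `compare(first, second)` callable is assumed to return `"first"`, `"second"`, or `"tie"`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Compare = Callable[[str, str], str]  # assumed to return "first", "second", or "tie"


@dataclass(frozen=True)
class SwapResult:  # hypothetical; not the library's PairwiseResult
    winner: str  # "A", "B", or "tie"
    consistent: bool


def compare_both_orders(compare: Compare, answer_a: str, answer_b: str) -> SwapResult:
    """Judge (A, B) and (B, A); keep the verdict only if both orders agree."""
    forward = {"first": "A", "second": "B", "tie": "tie"}[compare(answer_a, answer_b)]
    backward = {"first": "B", "second": "A", "tie": "tie"}[compare(answer_b, answer_a)]
    if forward == backward:
        return SwapResult(forward, True)
    # The verdict changed with answer order, so fall back to a tie.
    return SwapResult("tie", False)


def flip_rate(compare: Compare, pairs: Iterable[tuple[str, str]]) -> float:
    """Fraction of pairs whose verdict did not survive the swap."""
    results = [compare_both_orders(compare, a, b) for a, b in pairs]
    if not results:
        return 0.0
    return sum(not r.consistent for r in results) / len(results)


def always_first(first: str, second: str) -> str:
    """Stub judge with maximal position bias."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Stub judge that ignores position and picks the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


pairs = [("short", "a longer answer"), ("detailed reply", "ok")]
print(flip_rate(always_first, pairs))    # every verdict flips
print(flip_rate(prefers_longer, pairs))  # no verdict flips
```

The two stub judges show the extremes: one that always favors the first slot fails every swap, while one that decides on content alone never flips.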

Calibration: calibrate_judge

Measures judge-vs-human agreement: Pearson and Spearman correlation and mean absolute error (MAE) against human scores, plus accuracy and Cohen's kappa against human pass/fail labels. verbosity_bias() checks for the failure mode where a judge simply rewards length. New pearson, spearman, and mae stats were added to bench.metrics.
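
For reference, the agreement statistics named above can be written out from their textbook definitions. The helpers below are standalone illustrations with hypothetical names (`pearson_r`, `spearman_rho`, `mean_abs_error`, `cohen_kappa`, `length_score_correlation`); they are not the functions in bench.metrics or judge.calibration, whose signatures this PR does not show. Treating verbosity bias as the correlation between response length and judge score is an assumption about one plausible way to measure it.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant (assumption)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman_rho(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


def mean_abs_error(xs: Sequence[float], ys: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two binary raters."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    rate_a, rate_b = sum(a) / n, sum(b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout (assumption)
    return (observed - expected) / (1 - expected)


def length_score_correlation(responses: Sequence[str], scores: Sequence[float]) -> float:
    """Assumed verbosity check: do longer responses get higher judge scores?"""
    return pearson_r([float(len(r)) for r in responses], scores)


# Made-up illustration data, not results from this release.
judge_scores = [0.9, 0.2, 0.7, 0.4]
human_scores = [0.8, 0.1, 0.9, 0.3]
print(pearson_r(judge_scores, human_scores))
print(spearman_rho(judge_scores, human_scores))
print(mean_abs_error(judge_scores, human_scores))
print(cohen_kappa([True, True, False, False], [True, False, False, False]))
```

Kappa is worth reporting next to raw accuracy because accuracy alone can look high when nearly every sample passes, whereas kappa discounts the agreement expected by chance.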

Quality

  • 118 tests, 98% coverage, ruff clean.
  • All 4 examples run offline (new examples/judge_calibration.py demos all three tools).
  • Wheel builds at 0.3.0 with the golden dataset still bundled and loadable from an installed context.
  • Backwards compatible: no change to existing gates, judge, agentic, or bench behavior.

Deferred to v0.3.1 (reviewer follow-ups, non-blocking)

  • Store the verbatim model response in PairwiseResult.raw (currently reconstructed).
  • Remove the double prompt-render in JudgeDimension.evaluate.

Packaging

  • Version 0.2.0 -> 0.3.0

LesterALeong and others added 5 commits June 4, 2026 16:26

Extract an un-cached single-shot score() from JudgeDimension.evaluate and
refactor evaluate to reuse it (cache behavior unchanged). Add a
SelfConsistencyJudge that samples a base judge N times via score() and
reports the score distribution (mean, pstdev, normal-approx 95% CI).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Implements PairwiseJudge that runs A-vs-B comparisons in both orders and
only trusts a verdict consistent across the swap, falling back to a tie
otherwise. Adds parse_pairwise, position_bias_rate diagnostic, default
prompt template, and an offline test suite.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Append pearson/spearman/mae to bench.metrics and add judge.calibration
(CalibrationSample, CalibrationReport, calibrate_judge, verbosity_bias)
measuring score-level and label-level agreement plus verbosity bias.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- Top-level + judge/ + bench/ __init__ exports for SelfConsistencyJudge,
  ScoreDistribution, PairwiseJudge, PairwiseResult, position_bias_rate,
  calibrate_judge, CalibrationSample, CalibrationReport, verbosity_bias,
  and the pearson/spearman/mae stats.
- examples/judge_calibration.py: offline demo of uncertainty, position-bias
  debiasing, calibration, and verbosity-bias.
- README: "Judge reliability" section.
- version 0.2.0 -> 0.3.0.

…sed)

Reviewer follow-up (MINOR). Other reviewer suggestions deferred to v0.3.1:
verbatim raw in PairwiseResult, and the evaluate/score double-render.

@LesterALeong merged commit 1fc1df2 into master on Jun 4, 2026
4 checks passed
@LesterALeong deleted the feat/v0.3.0-judge-reliability branch on June 4, 2026 22:14