llm-judge-calibration

Measure how much your LLM judges actually agree.

When you use LLM-as-a-judge for evals, the silent failure mode is that your judges disagree with each other and you do not notice. The headline score still moves; the underlying signal is noise. This library runs N judge models on the same items, computes inter-judge agreement, and surfaces where the calibration breaks down.

What it does

Given a benchmark of (prompt, candidate_response) pairs and a set of judge models:

  1. Runs each judge on each item, producing a score per (item, judge) pair.
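  2. Computes agreement across judges using Cohen's kappa (averaged over all judge pairs), Krippendorff's alpha, intraclass correlation, and pairwise percentage agreement (the share of items on which two judges give exactly the same score).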
  3. Surfaces per-item disagreement so you can audit individual cases.
  4. Reports per-judge bias (each judge's mean score, so systematic offsets between judges stand out) and per-judge variance.

The output is one JSON report per run plus a printable summary.
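
To make step 2 concrete, here is a minimal sketch of the two simplest metrics in plain Python. It is not the library's implementation: the function names (`cohen_kappa`, `exact_match`, `pairwise_mean`) and the toy scores are made up for illustration, and the kappa shown is the unweighted variant, which treats scores as unordered categories.

```python
from collections import Counter
from itertools import combinations


def exact_match(a, b):
    """Share of items on which two judges gave exactly the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of scores."""
    n = len(a)
    observed = exact_match(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each judge's own score distribution.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both judges used one identical score throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def pairwise_mean(scores_by_judge, metric):
    """Average a two-judge metric over every pair of judges."""
    pairs = list(combinations(scores_by_judge.values(), 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)


# Hypothetical scores: three judges, six items, 1-5 scale.
scores = {
    "judge_a": [1, 2, 3, 4, 5, 3],
    "judge_b": [1, 2, 3, 4, 4, 3],
    "judge_c": [2, 2, 3, 5, 5, 1],
}
print("kappa (pairwise mean):", round(pairwise_mean(scores, cohen_kappa), 2))
print("exact-match agreement:", round(pairwise_mean(scores, exact_match), 2))
```

Kappa corrects raw agreement for the agreement you would expect by chance, which is why it can sit well below the exact-match figure when judges cluster on a few scores.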

Why it exists

LLM-as-a-judge is the default eval pattern for open-ended outputs (RAG quality, code review, reasoning traces). Two failures are common:

  • One bad judge drags the score. If you average four judges and one is systematically harsh on a class of answers, the headline metric is a lie.
  • Judges agree at the population level but disagree per item. Average looks fine; individual decisions are random. Useless for ranking.

This library makes both visible. It does not try to pick a "correct" judge. It surfaces the disagreement so you can decide what to do.
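
Both failure modes are easy to reproduce with toy numbers. The sketch below is illustrative only: the scores are invented and `leave_one_out_means` is a hypothetical helper, not part of this library.

```python
from statistics import mean


def leave_one_out_means(scores_by_judge):
    """Headline mean score with each judge dropped in turn."""
    result = {}
    for dropped in scores_by_judge:
        kept = [
            score
            for judge, row in scores_by_judge.items()
            if judge != dropped
            for score in row
        ]
        result[dropped] = mean(kept)
    return result


# Failure 1: one harsh judge drags the average (invented scores).
panel = {
    "judge_a": [4, 4, 5, 4],
    "judge_b": [4, 5, 4, 4],
    "judge_c": [4, 4, 4, 5],
    "harsh": [1, 2, 1, 1],
}
print(leave_one_out_means(panel))  # dropping "harsh" shifts the mean the most

# Failure 2: identical population means, little per-item agreement.
judge_a = [1, 5, 2, 4, 3, 3]
judge_b = [5, 1, 4, 2, 3, 3]
mean_gap = abs(mean(judge_a) - mean(judge_b))
exact = sum(x == y for x, y in zip(judge_a, judge_b)) / len(judge_a)
mean_abs_diff = mean(abs(x - y) for x, y in zip(judge_a, judge_b))
print(mean_gap, round(exact, 2), mean_abs_diff)
```

In the second case the two judges have the same average score, yet they match on only a third of the items and differ by two points on average, so any ranking built on one of them would not survive a swap to the other.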

Install

pip install llm-judge-calibration

Or from source:

git clone https://github.com/WatchTree-19/llm-judge-calibration
cd llm-judge-calibration
pip install -e .

Quick start

from llm_judge_calibration import Benchmark, JudgeConfig, JudgeRunner, CalibrationReport

bench = Benchmark.from_jsonl("examples/data/mini_eval.jsonl")

judges = [
    JudgeConfig(model="gpt-4o-mini", scale=(1, 5)),
    JudgeConfig(model="claude-3-5-sonnet-20240620", scale=(1, 5)),
    JudgeConfig(model="gpt-4o", scale=(1, 5)),
]

runner = JudgeRunner(bench=bench, judges=judges, rubric="rate the response from 1 to 5 on factual accuracy")
scores = runner.run()  # async under the hood; uses litellm for provider unification

report = CalibrationReport.from_scores(scores)
report.print_summary()
report.to_json("out/report.json")

Or via CLI:

llm-judge-calibration run \
    --bench examples/data/mini_eval.jsonl \
    --judge gpt-4o-mini --judge claude-3-5-sonnet-20240620 --judge gpt-4o \
    --rubric "rate the response from 1 to 5 on factual accuracy" \
    --out report.json

What the report tells you

======================================================================
calibration report
======================================================================
n items   : 30
n judges  : 3
rubric    : rate the response from 1 to 5 on factual accuracy

inter-judge agreement
  cohen's kappa (pairwise mean)   : 0.41  (moderate)
  krippendorff's alpha            : 0.46
  intraclass correlation (ICC2k)  : 0.55
  exact-match agreement           : 0.40

per-judge bias (mean score)
  gpt-4o-mini                     : 3.10
  claude-3-5-sonnet-20240620      : 3.67
  gpt-4o                          : 3.43

per-judge variance
  gpt-4o-mini                     : 0.93
  claude-3-5-sonnet-20240620      : 0.71
  gpt-4o                          : 0.85

top disagreement items
  #22   stdev=1.15   gpt-4o-mini=3  claude=5  gpt-4o=3
  #14   stdev=1.00   gpt-4o-mini=2  claude=4  gpt-4o=3
  #07   stdev=1.00   gpt-4o-mini=4  claude=2  gpt-4o=3

======================================================================

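Higher kappa / alpha / ICC means the judges are more aligned: 1.0 is perfect agreement, and values near 0 mean the judges agree no more often than chance would predict. Below 0.4 usually means the judges are too inconsistent with each other to support per-item decisions or rankings.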
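
For reference, here is a minimal sketch of how a kappa label and a "top disagreement" ranking could be derived. Two assumptions are baked in and neither is confirmed by this README: the label follows the widely used Landis and Koch bands, and items are ranked by the sample standard deviation of their scores. The function names and data are hypothetical.

```python
from statistics import stdev

# Landis and Koch (1977) bands: upper bound of each band and its label.
BANDS = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]


def kappa_label(kappa):
    """Map a kappa value to a qualitative label (assumed convention)."""
    if kappa < 0:
        return "poor"
    for upper, label in BANDS:
        if kappa <= upper:
            return label
    return "almost perfect"


def top_disagreements(scores_by_item, k=3):
    """Return the k items whose judge scores spread the most."""
    spread = {
        item_id: stdev(list(scores.values()))  # sample stdev across judges
        for item_id, scores in scores_by_item.items()
    }
    ranked = sorted(spread.items(), key=lambda kv: kv[1], reverse=True)
    return [(item_id, round(value, 2)) for item_id, value in ranked[:k]]


# Invented scores for three items and three judges.
items = {
    "a": {"judge_1": 3, "judge_2": 5, "judge_3": 3},
    "b": {"judge_1": 3, "judge_2": 3, "judge_3": 3},
    "c": {"judge_1": 2, "judge_2": 4, "judge_3": 3},
}
print(kappa_label(0.41))
print(top_disagreements(items, k=2))
```

With only three judges the standard deviation is coarse, so treat the ranking as a pointer to items worth reading, not as a precise measure.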

Benchmark format

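A Benchmark is a JSONL file where each line is one JSON object (pretty-printed here for readability; in the file, each object occupies a single line):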

{
  "id": "001",
  "prompt": "What is the capital of France?",
  "candidate_response": "Paris is the capital of France."
}

id and candidate_response are required. prompt is optional (judges work fine on response-only rubrics like "is this output hostile?"). Anything else in the line is preserved as metadata.
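
If you want to check a file before spending API calls on it, the rules above are simple to validate by hand. This is a standalone sketch that does not use the library's own loader; `parse_benchmark_line` and `load_benchmark` are hypothetical names, and the `metadata` key is just one way to hold the extra fields.

```python
import json

REQUIRED = ("id", "candidate_response")
KNOWN = REQUIRED + ("prompt",)


def parse_benchmark_line(line):
    """Parse one JSONL line and enforce the required fields."""
    record = json.loads(line)
    missing = [key for key in REQUIRED if key not in record]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    return {
        "id": record["id"],
        "prompt": record.get("prompt"),  # optional, None when absent
        "candidate_response": record["candidate_response"],
        # Everything else is kept as metadata.
        "metadata": {k: v for k, v in record.items() if k not in KNOWN},
    }


def load_benchmark(path):
    """Read a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [parse_benchmark_line(line) for line in fh if line.strip()]


example = '{"id": "001", "candidate_response": "Paris.", "topic": "geography"}'
print(parse_benchmark_line(example))
```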

Supported providers

Anything litellm supports: OpenAI, Anthropic, Cohere, Mistral, Together, Groq, local vLLM endpoints, and so on. Set the relevant env vars (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.) before running.
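
A quick preflight check can catch a missing key before a run starts. The helper below is a hypothetical sketch, not something this library ships; it only covers the two variables named above, so extend the list for whichever providers your judges use.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


needed = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
missing = missing_env_vars(needed)
if missing:
    print("Set these before running:", ", ".join(missing))
```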

What this library is not

  • Not a replacement for human-labelled ground truth. If you have human labels, use them. This is for the common case where you don't.
  • Not an automatic "pick the best judge" tool. It surfaces disagreement; you decide what to do with the information.
  • Not a benchmark itself. Bring your own items.

License

MIT. See LICENSE.
