Measure how much your LLM judges actually agree.
When you use LLM-as-a-judge for evals, the silent failure mode is that your judges disagree with each other and you do not notice. The headline score still moves; the underlying signal is noise. This library runs N judge models on the same items, computes inter-judge agreement, and surfaces where the calibration breaks down.
Given a benchmark of (prompt, candidate_response) pairs and a set of judge models:
- Runs each judge on each item, producing a score per (item, judge) pair.
- Computes agreement across judges using Cohen's kappa, Krippendorff's alpha, intraclass correlation, and pairwise percentage agreement.
- Surfaces per-item disagreement so you can audit individual cases.
- Reports per-judge bias (mean score offset) and per-judge variance.
The output is one JSON report per run plus a printable summary.
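To make the agreement metrics above concrete, here is a minimal, standard-library sketch of two of them: pairwise-mean Cohen's kappa and exact-match agreement. It is an illustration only, not the library's implementation; the function names and the toy scores are made up for this example, and the kappa is the unweighted variant, which treats a 1-vs-2 disagreement the same as a 1-vs-5 one.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical scores."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and the same length")
    n = len(a)
    # Observed agreement: fraction of items where both judges gave the same score.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each judge's marginal score distribution.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both judges used one identical score throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def pairwise_mean_kappa(scores):
    """Mean kappa over all judge pairs; `scores` maps judge name -> list of scores."""
    pairs = list(combinations(scores, 2))
    return sum(cohen_kappa(scores[j1], scores[j2]) for j1, j2 in pairs) / len(pairs)


def exact_match_agreement(scores):
    """Fraction of items on which every judge gave the same score."""
    rows = list(zip(*scores.values()))
    return sum(len(set(row)) == 1 for row in rows) / len(rows)


# Made-up scores for five items from three hypothetical judges.
toy = {
    "judge_a": [1, 2, 3, 4, 5],
    "judge_b": [1, 2, 3, 4, 4],
    "judge_c": [2, 2, 3, 5, 5],
}
print("pairwise mean kappa:", round(pairwise_mean_kappa(toy), 3))
print("exact-match agreement:", exact_match_agreement(toy))
```

On an ordinal 1 to 5 scale a weighted kappa or Krippendorff's alpha with an ordinal distance is often a better fit, since both give partial credit for near misses.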
LLM-as-a-judge is the default eval pattern for open-ended outputs (RAG quality, code review, reasoning traces). Two failures are common:
- One bad judge drags the score. If you average four judges and one is systematically harsh on a class of answers, the headline metric is a lie.
- Judges agree at the population level but disagree per item. Average looks fine; individual decisions are random. Useless for ranking.
This library makes both visible. It does not try to pick a "correct" judge. It surfaces the disagreement so you can decide what to do.
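Both failure modes can be reproduced with a few lines of standard-library Python. The sketch below uses made-up scores and hypothetical helper names; it is not how the library computes its report, only a way to see why a healthy-looking average can hide the problem.

```python
from statistics import mean, stdev


def per_judge_mean(scores):
    """Mean score per judge; `scores` maps judge name -> list of scores."""
    return {judge: mean(vals) for judge, vals in scores.items()}


def per_item_stdev(scores):
    """Sample standard deviation across judges, one value per item."""
    return [stdev(row) for row in zip(*scores.values())]


def offsets_from_grand_mean(scores):
    """Each judge's mean minus the mean of all judges' means."""
    means = per_judge_mean(scores)
    grand = mean(list(means.values()))
    return {judge: m - grand for judge, m in means.items()}


# Failure 2: identical averages, opposite decisions on every item.
split = {
    "judge_a": [1, 5, 1, 5],
    "judge_b": [5, 1, 5, 1],
}
print(per_judge_mean(split))   # both judges have the same mean
print(per_item_stdev(split))   # yet every item shows a large spread

# Failure 1: one systematically harsh judge pulls the average down.
harsh = {
    "judge_a": [4, 4, 4],
    "judge_b": [4, 4, 4],
    "judge_c": [2, 2, 2],
}
print(offsets_from_grand_mean(harsh))  # judge_c sits well below the others
```

Looking at the per-item spread and the per-judge offsets side by side is what separates "judges disagree everywhere" from "one judge is shifted".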
Install from PyPI:

    pip install llm-judge-calibration

Or from source:

    git clone https://github.com/WatchTree-19/llm-judge-calibration
    cd llm-judge-calibration
    pip install -e .

Run the judges from Python:

    from llm_judge_calibration import Benchmark, JudgeConfig, JudgeRunner, CalibrationReport

    bench = Benchmark.from_jsonl("examples/data/mini_eval.jsonl")
    judges = [
        JudgeConfig(model="gpt-4o-mini", scale=(1, 5)),
        JudgeConfig(model="claude-3-5-sonnet-20240620", scale=(1, 5)),
        JudgeConfig(model="gpt-4o", scale=(1, 5)),
    ]
    runner = JudgeRunner(bench=bench, judges=judges, rubric="rate the response from 1 to 5 on factual accuracy")
    scores = runner.run()  # async under the hood; uses litellm for provider unification
    report = CalibrationReport.from_scores(scores)
    report.print_summary()
    report.to_json("out/report.json")

Or via CLI:

    llm-judge-calibration run \
      --bench examples/data/mini_eval.jsonl \
      --judge gpt-4o-mini --judge claude-3-5-sonnet-20240620 --judge gpt-4o \
      --rubric "rate the response from 1 to 5 on factual accuracy" \
      --out report.json

Example summary output:

    ======================================================================
    calibration report
    ======================================================================
    n items  : 30
    n judges : 3
    rubric   : rate the response from 1 to 5 on factual accuracy

    inter-judge agreement
      cohen's kappa (pairwise mean)  : 0.41 (moderate)
      krippendorff's alpha           : 0.46
      intraclass correlation (ICC2k) : 0.55
      exact-match agreement          : 0.40

    per-judge bias (mean score)
      gpt-4o-mini                : 3.10
      claude-3-5-sonnet-20240620 : 3.67
      gpt-4o                     : 3.43

    per-judge variance
      gpt-4o-mini                : 0.93
      claude-3-5-sonnet-20240620 : 0.71
      gpt-4o                     : 0.85

    top disagreement items
      #22  stdev=1.15  gpt-4o-mini=3  claude=5  gpt-4o=3
      #14  stdev=1.00  gpt-4o-mini=2  claude=4  gpt-4o=3
      #07  stdev=1.00  gpt-4o-mini=4  claude=2  gpt-4o=3
    ======================================================================

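Higher kappa, alpha, or ICC means the judges are more aligned. For kappa and alpha, 1 is perfect agreement and 0 is what chance alone would produce. As a rule of thumb (the Landis and Koch bands for kappa), values below about 0.4 indicate weak agreement: the judges are not far from guessing relative to each other.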
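The "(moderate)" label in the sample report is consistent with the widely used Landis and Koch bands for kappa. The sketch below maps a kappa value to those bands; whether the library uses exactly these cut-offs is an assumption, and `agreement_label` is a hypothetical helper name.

```python
def agreement_label(kappa: float) -> str:
    """Map a kappa value to a Landis and Koch style verbal label."""
    if kappa < 0:
        return "poor"
    # Upper bounds are inclusive: 0.40 is "fair", 0.41 is "moderate".
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


for value in (0.15, 0.41, 0.72, 0.90):
    print(value, agreement_label(value))
```

These labels are conventions rather than statistical tests, so treat them as a reading aid and not as a pass/fail threshold.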
A Benchmark is a JSONL file where each line is a JSON object (pretty-printed here; in the file each object sits on a single line):

    {
      "id": "001",
      "prompt": "What is the capital of France?",
      "candidate_response": "Paris is the capital of France."
    }

id and candidate_response are required. prompt is optional (judges work fine on response-only rubrics like "is this output hostile?"). Anything else in the line is preserved as metadata.
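If you want to check a benchmark file before spending API calls on it, the format rules above can be sketched with the standard library. This is a stand-alone illustration, not the library's `Benchmark.from_jsonl`; `parse_items` and the output dict layout are assumptions made for this example.

```python
import json

REQUIRED = ("id", "candidate_response")
KNOWN = {"id", "prompt", "candidate_response"}


def parse_items(lines):
    """Parse JSONL lines into items, enforcing required fields.

    Unknown keys are kept under "metadata"; a missing prompt becomes None.
    """
    items = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED if key not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing required field(s) {missing}")
        items.append({
            "id": record["id"],
            "prompt": record.get("prompt"),
            "candidate_response": record["candidate_response"],
            "metadata": {k: v for k, v in record.items() if k not in KNOWN},
        })
    return items


sample = [
    '{"id": "001", "prompt": "What is the capital of France?", '
    '"candidate_response": "Paris is the capital of France."}',
    '{"id": "002", "candidate_response": "You are an idiot.", "source": "forum"}',
]
for item in parse_items(sample):
    print(item["id"], item["prompt"], item["metadata"])
```

To validate a file, pass an open file handle, for example `parse_items(open("examples/data/mini_eval.jsonl", encoding="utf-8"))`.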
Judges can be any model litellm supports: OpenAI, Anthropic, Cohere, Mistral, Together, Groq, local vLLM endpoints, and so on. Set the relevant environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.) before running.
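A small pre-flight check can catch a missing key before a long run starts. The sketch below is a generic helper, not part of the library; `missing_env_vars` is a hypothetical name, and which variables you need depends on the judges you configure.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Assumed set of keys for a run that mixes OpenAI and Anthropic judges.
needed = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
missing = missing_env_vars(needed)
if missing:
    print("Set these before running:", ", ".join(missing))
```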
What this library is not:
- Not a replacement for human-labelled ground truth. If you have human labels, use them. This is for the common case where you don't.
- Not an automatic "pick the best judge" tool. It surfaces disagreement; you decide what to do with the information.
- Not a benchmark itself. Bring your own items.
License: MIT. See LICENSE.