llm-judge-calibration

Measure how much your LLM judges actually agree.

When you use LLM-as-a-judge for evals, the silent failure mode is that your judges disagree with each other and you do not notice. The headline score still moves; the underlying signal is noise. This library runs N judge models on the same items, computes inter-judge agreement, and surfaces where the calibration breaks down.

What it does

Given a benchmark of (prompt, candidate_response) pairs and a set of judge models:

Runs each judge on each item, producing a score per (item, judge) pair.
Computes agreement across judges using Cohen's kappa (averaged over all judge pairs, since kappa is defined for two raters), Krippendorff's alpha, intraclass correlation, and pairwise percentage agreement.
Surfaces per-item disagreement so you can audit individual cases.
Reports per-judge bias (each judge's mean score, which exposes systematic offsets between judges) and per-judge variance.

The output is one JSON report per run plus a printable summary.
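
As a rough illustration of the agreement step listed above, here is a minimal pure-Python sketch of three of the metrics: mean pairwise Cohen's kappa, pairwise percentage agreement, and Krippendorff's alpha with the interval metric. The function names and the `example_scores` layout (judge name mapped to a list of scores, one per item, no missing values) are assumptions made for this example, not the library's internals.

```python
from collections import Counter
from itertools import combinations, permutations


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of scores."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: both judges pick the same label independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(scores):
    """Average kappa over every pair of judges."""
    pairs = list(combinations(scores, 2))
    return sum(cohen_kappa(scores[p], scores[q]) for p, q in pairs) / len(pairs)


def pairwise_percent_agreement(scores):
    """Average, over judge pairs, of the share of items scored identically."""
    pairs = list(combinations(scores, 2))
    n_items = len(next(iter(scores.values())))
    return sum(
        sum(x == y for x, y in zip(scores[p], scores[q])) / n_items
        for p, q in pairs
    ) / len(pairs)


def krippendorff_alpha_interval(scores):
    """Krippendorff's alpha, interval metric, assuming no missing scores."""
    units = list(zip(*scores.values()))   # one tuple of judge scores per item
    m = len(scores)                       # judges per item
    values = [v for unit in units for v in unit]
    n = len(values)
    # Observed disagreement: squared differences within each item.
    d_o = sum((x - y) ** 2 for u in units for x, y in permutations(u, 2)) / (n * (m - 1))
    # Expected disagreement: squared differences across all pooled scores.
    d_e = sum((x - y) ** 2 for x, y in permutations(values, 2)) / (n * (n - 1))
    # Alpha is undefined when every score is identical; return 1.0 by convention here.
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e


# Hypothetical scores: judge_a and judge_c always agree, judge_b agrees half the time.
example_scores = {
    "judge_a": [5, 5, 1, 1],
    "judge_b": [5, 1, 5, 1],
    "judge_c": [5, 5, 1, 1],
}
print(mean_pairwise_kappa(example_scores))
print(pairwise_percent_agreement(example_scores))
print(krippendorff_alpha_interval(example_scores))
```

Kappa treats scores as unordered labels, so a 1-versus-5 disagreement counts the same as 4-versus-5; the interval form of alpha penalises the former more heavily, which is one reason to report both.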

Why it exists

LLM-as-judge is the default eval pattern for open-ended outputs (RAG quality, code review, reasoning traces). Two failures are common:

One bad judge drags the score. If you average four judges and one is systematically harsh on a class of answers, the headline metric is a lie.
Judges agree at the population level but disagree per item. Average looks fine; individual decisions are random. Useless for ranking.

This library makes both visible. It does not try to pick a "correct" judge. It surfaces the disagreement so you can decide what to do.
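
Both failure modes are easy to reproduce with a few made-up scores. The sketch below uses only the standard library; `judge_offsets`, `per_item_stdev`, and the two score tables are hypothetical names and data chosen for illustration.

```python
from statistics import fmean, stdev


def judge_offsets(scores):
    """Each judge's mean score minus the average of all judges' means."""
    means = {judge: fmean(vals) for judge, vals in scores.items()}
    grand_mean = fmean(means.values())
    return {judge: m - grand_mean for judge, m in means.items()}


def per_item_stdev(scores):
    """Sample standard deviation of the judges' scores on each item."""
    return [stdev(item) for item in zip(*scores.values())]


# Failure 1: three judges agree, one is systematically two points harsher.
harsh_case = {
    "judge_a": [4, 4, 5, 3],
    "judge_b": [4, 4, 5, 3],
    "judge_c": [4, 4, 5, 3],
    "judge_harsh": [2, 2, 3, 1],
}

# Failure 2: identical averages, opposite verdicts on every item.
hidden_case = {
    "judge_a": [1, 5, 2, 4, 1, 5],
    "judge_b": [5, 1, 4, 2, 5, 1],
}

print(judge_offsets(harsh_case))    # judge_harsh sits 1.5 below the panel average
print(judge_offsets(hidden_case))   # both offsets are 0.0
print(per_item_stdev(hidden_case))  # every item has a spread well above zero
```

In the second case a dashboard that only tracks mean scores would show two perfectly matched judges, while any ranking built on their per-item scores would be arbitrary.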

Install

pip install llm-judge-calibration

Or from source:

git clone https://github.com/WatchTree-19/llm-judge-calibration
cd llm-judge-calibration
pip install -e .

Quick start

from llm_judge_calibration import Benchmark, JudgeConfig, JudgeRunner, CalibrationReport

bench = Benchmark.from_jsonl("examples/data/mini_eval.jsonl")

judges = [
    JudgeConfig(model="gpt-4o-mini", scale=(1, 5)),
    JudgeConfig(model="claude-3-5-sonnet-20240620", scale=(1, 5)),
    JudgeConfig(model="gpt-4o", scale=(1, 5)),
]

runner = JudgeRunner(bench=bench, judges=judges, rubric="rate the response from 1 to 5 on factual accuracy")
scores = runner.run()  # async under the hood; uses litellm for provider unification

report = CalibrationReport.from_scores(scores)
report.print_summary()
report.to_json("out/report.json")

Or via CLI:

llm-judge-calibration run \
    --bench examples/data/mini_eval.jsonl \
    --judge gpt-4o-mini --judge claude-3-5-sonnet-20240620 --judge gpt-4o \
    --rubric "rate the response from 1 to 5 on factual accuracy" \
    --out report.json

What the report tells you

======================================================================
calibration report
======================================================================
n items   : 30
n judges  : 3
rubric    : rate the response from 1 to 5 on factual accuracy

inter-judge agreement
  cohen's kappa (pairwise mean)   : 0.41  (moderate)
  krippendorff's alpha            : 0.46
  intraclass correlation (ICC2k)  : 0.55
  exact-match agreement           : 0.40

per-judge bias (mean score)
  gpt-4o-mini                     : 3.10
  claude-3-5-sonnet-20240620      : 3.67
  gpt-4o                          : 3.43

per-judge variance
  gpt-4o-mini                     : 0.93
  claude-3-5-sonnet-20240620      : 0.71
  gpt-4o                          : 0.85

top disagreement items
  #22   stdev=1.15   gpt-4o-mini=3  claude=5  gpt-4o=3
  #14   stdev=1.00   gpt-4o-mini=2  claude=4  gpt-4o=3
  #07   stdev=1.00   gpt-4o-mini=4  claude=2  gpt-4o=3

======================================================================

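Higher kappa / alpha / ICC means the judges are more aligned. As a rough guide, the widely used Landis and Koch bands call a kappa of 0.41 to 0.60 "moderate" and 0.21 to 0.40 only "fair"; below about 0.4, per-item scores from different judges are too inconsistent to rely on for ranking. Exact-match agreement is not chance-corrected, so read it alongside the other three rather than on its own.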
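
The "(moderate)" tag in the sample report matches the Landis and Koch (1977) descriptive bands for kappa. A minimal sketch of such a mapping follows; `agreement_label` is a hypothetical helper, and whether the library uses exactly these cutoffs is an assumption.

```python
def agreement_label(value):
    """Map a kappa-style coefficient to its Landis and Koch descriptive band."""
    if value < 0.0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if value <= upper:
            return label
    raise ValueError("agreement coefficients cannot exceed 1.0")


print(agreement_label(0.41))  # moderate
```

These bands were proposed for kappa; applying the same labels to alpha or ICC is a loose convention rather than a rule.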

Benchmark format

A Benchmark is a JSONL file where each line is a single JSON object (pretty-printed here for readability; in the file it must sit on one line):

{
  "id": "001",
  "prompt": "What is the capital of France?",
  "candidate_response": "Paris is the capital of France."
}

id and candidate_response are required. prompt is optional (judges work fine on response-only rubrics like "is this output hostile?"). Anything else in the line is preserved as metadata.
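
To check a file against these rules before spending API calls, a small validator along the following lines may help. It is a sketch under the rules stated above, not the library's `Benchmark.from_jsonl`; `BenchItem` and `parse_benchmark_lines` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BenchItem:
    id: str
    candidate_response: str
    prompt: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def parse_benchmark_lines(lines):
    """Parse an iterable of JSONL lines (for example an open file) into items."""
    items = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [k for k in ("id", "candidate_response") if k not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing required field(s) {missing}")
        items.append(
            BenchItem(
                id=record.pop("id"),
                candidate_response=record.pop("candidate_response"),
                prompt=record.pop("prompt", None),  # optional
                metadata=record,                    # everything else is kept
            )
        )
    return items


sample = [
    '{"id": "001", "prompt": "What is the capital of France?", '
    '"candidate_response": "Paris is the capital of France.", "topic": "geo"}',
    '{"id": "002", "candidate_response": "You are an idiot."}',
]
for item in parse_benchmark_lines(sample):
    print(item.id, item.prompt, item.metadata)
```

Because the function accepts any iterable of strings, passing an open file handle works the same way as the in-memory list used here.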

Supported providers

Anything litellm supports: OpenAI, Anthropic, Cohere, Mistral, Together, Groq, local vLLM endpoints, and so on. Set the relevant environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.) before running.
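
A run that fails halfway because one provider key is missing wastes the calls already made, so a quick pre-flight check can be worthwhile. The helper below is a hypothetical sketch using only the standard library; which variables you need depends on the judges you configure.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Assumed key names for a run that mixes OpenAI and Anthropic judges.
needed = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
absent = missing_env_vars(needed)
if absent:
    print("Set these before running:", ", ".join(absent))
```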

What this library is not

Not a replacement for human-labelled ground truth. If you have human labels, use them. This is for the common case where you don't.
Not an automatic "pick the best judge" tool. It surfaces disagreement; you decide what to do with the information.
Not a benchmark itself. Bring your own items.

License

MIT. See LICENSE.
