JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

A framework for quantifying prompt sensitivity in LLM-as-a-Judge evaluation systems.

Overview

Large language models are increasingly deployed as automated judges to evaluate the outputs of other models, yet the reliability of these systems remains poorly understood. JudgeSense is a reproducible benchmark that quantifies prompt sensitivity in LLM-as-a-Judge systems via the Judge Sensitivity Score (JSS), a metric measuring how often a judge's evaluation decision changes when prompt phrasing varies while evaluation intent stays constant. We evaluate 13 LLM judges across 4 evaluation tasks (factuality, coherence, preference, relevance) with 500 hand-validated prompt pairs and 3 independent runs each, and uncover systematic sensitivity driven by prompt polarity inversion. Our analysis reveals that polarity-inverted templates can reduce apparent agreement by up to 37 percentage points, that coherence JSS varies by more than 0.6 units across judges (range 0.39 to 0.99), and that sensitivity does not track model scale or recency.

This repository contains the full reproducible codebase, datasets, and evaluation artifacts accompanying the paper.

Key contributions

JSS metric: A novel, formally defined score for judge decision consistency across semantically equivalent prompts.
Public dataset: 500 semantically equivalent prompt pairs across 4 evaluation task types.
Empirical evaluation: Thirteen LLM judges (GPT-5.5, GPT-4o, GPT-4o-mini, Claude Opus 4.7, Claude Sonnet 4.5, Claude Haiku 4.5, Gemini 2.5 Flash, LLaMA-3.1-70B, Mistral-7B, DeepSeek-R1, Qwen-2.5-72B, Qwen 3.6 Flash, DeepSeek-V4 Flash) tested across 4 task types; coherence JSS ranges from 0.39 to 0.99 and does not correlate with model scale or recency.
Human validation: All 500 prompt pairs hand-validated by a primary annotator and independently re-reviewed by a second annotator with full agreement; 50 polarity-inverted Template 4 pairs labeled non-equivalent by both annotators and excluded from the published dataset.
Full reproducibility: All code, data, and results released under open licenses.

Installation

git clone https://github.com/rohithreddybc/judgesense.git
cd judgesense
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

Or install directly via pip (metrics only, minimal dependencies):

pip install judgesense
# For full evaluation capabilities (API clients, datasets):
pip install "judgesense[full]"

Quickstart

Copy .env.example to .env and add your API keys:

cp .env.example .env
# Edit .env with your keys

Run the full evaluation:

python src/evaluate.py --model gpt-4o-mini --task factuality --runs 3

Compute JSS from results:

python src/metrics.py --results data/results/raw_outputs/

Dataset

This project includes the JudgeSense benchmark dataset — 500 validated paraphrase pairs across 4 evaluation task types, released for prompt sensitivity research.

HuggingFace: Rohithreddybc/judgesense-benchmark
License: CC-BY-4.0
Size: 500 prompt pairs, 4 task types, 125 pairs per task

Key Insight: Prompt formulation often dominates model architecture in determining apparent judge consistency.

Quick usage

from datasets import load_dataset

ds = load_dataset("Rohithreddybc/judgesense-benchmark")
pairs = ds["factuality"]
print(f"{len(pairs)} factuality pairs loaded")

# Compute JSS from your judge's decisions
from judgesense import compute_jss
jss = compute_jss(decisions_a, decisions_b)
print(f"JSS: {jss:.3f}")

Schema

{
  "pair_id": "fact_001",
  "task_type": "factuality",
  "source_benchmark": "TruthfulQA",
  "prompt_a": "Is this factually correct? Answer YES or NO only.\n\nResponse: ...",
  "prompt_b": "Fact-check this response. Reply YES (correct) or NO (incorrect).\n\nResponse: ...",
  "response_being_judged": "The Earth orbits around the Sun.",
  "ground_truth_label": "accurate",
  "semantic_equivalence_score": 1.0
}

Key findings

Factuality (polarity-correction results)

Model	JSS (raw)	JSS (T4-corrected)	Delta
GPT-4o	0.63	0.98	+0.35
GPT-4o-mini	0.63	0.96	+0.33
Claude Haiku 4.5	0.63	0.97	+0.34
Claude Sonnet 4.5	0.63	0.97	+0.34
DeepSeek-R1	0.63	0.96	+0.33
LLaMA-3.1-70B	0.63	0.99	+0.36
Gemini 2.5 Flash	0.63	0.98	+0.35
Qwen-2.5-72B	0.63	0.98	+0.35
Mistral-7B	0.71	0.89	+0.18
GPT-5.5	0.63	0.98	+0.35
Claude Opus 4.7	0.63	0.99	+0.36
Qwen 3.6 Flash	0.63	0.97	+0.34
DeepSeek-V4 Flash	0.62	0.95	+0.33

Finding: Polarity-inverted prompt templates (T4) reduce raw JSS by 18 to 36 pp across all models. After T4 correction, all 13 judges achieve factuality JSS in [0.89, 0.99], demonstrating that prompt sensitivity in this task is primarily attributable to template polarity. Mistral-7B exhibits the highest residual sensitivity (JSS = 0.89 post-correction).

Coherence (most discriminating task)

Model	JSS (coherence)	Cohen's kappa
Claude Sonnet 4.5	0.99	0.986
Qwen-2.5-72B	0.92	0.842
GPT-4o	0.91	0.828
GPT-5.5	0.83	0.694
GPT-4o-mini	0.78	0.627
Claude Haiku 4.5	0.73	0.583
Claude Opus 4.7	0.70	0.580
LLaMA-3.1-70B	0.55	0.338
DeepSeek-R1	0.53	0.332
Qwen 3.6 Flash	0.51	0.372
DeepSeek-V4 Flash	0.50	0.349
Mistral-7B	0.48	-0.082
Gemini 2.5 Flash	0.39	-0.057

Finding: Coherence JSS spans 0.605 units across 13 judges and does not track model scale or release recency. Claude Opus 4.7 (0.70) scores lower than Claude Haiku 4.5 (0.73); GPT-5.5 (0.83) scores lower than GPT-4o (0.91). Two judges (Mistral-7B and Gemini 2.5 Flash) produce negative kappa, indicating systematic anti-agreement.

Reproducing paper results

Exact commands to replicate every number in the paper:

# 1. Build the prompt pair dataset
python src/dataset_builder.py --output data/prompt_pairs/

# 2. Run evaluations on all models
bash scripts/run_all_evals.sh

# 3. Compute metrics
python src/metrics.py --summarize

# 4. Run factuality JSS analysis (T4 polarity-corrected)
python analysis/factuality_jss_fixed.py

# 5. Per-template JSS breakdown
python analysis/per_template_factuality.py

# 6. Pair-level flip overlap
python analysis/factuality_pair_overlap.py

# 7. Generate publication figures (outputs/fig1, fig2, fig4)
python analysis/generate_figures.py

Repository structure

judgesense/
├── data/
│   ├── prompt_pairs/          # 4 JSONL files, one per task type
│   ├── results/               # Raw judge outputs + computed metrics
│   └── validation/manual/     # Human annotation results (500 pairs, 4 tasks)
├── src/
│   ├── dataset_builder.py     # Generates the prompt pair dataset
│   ├── models.py              # API wrappers (OpenAI, Anthropic, Google, Alibaba Cloud, Novita AI, HuggingFace)
│   ├── evaluate.py            # Main evaluation runner
│   ├── metrics.py             # JSS + decision flip rate + Cohen's kappa
│   └── utils.py               # Shared helpers
├── notebooks/
│   ├── 01_dataset_analysis.ipynb
│   ├── 02_results_analysis.ipynb
│   └── 03_figures.ipynb
├── analysis/
│   ├── factuality_jss_fixed.py    # Recompute JSS with T4 polarity correction
│   ├── per_template_factuality.py # Per-template JSS breakdown
│   ├── factuality_pair_overlap.py # Pair-level flip overlap analysis
│   └── generate_figures.py        # Publication-ready PDF figures
├── outputs/               # CSV results + publication-ready PDF figures
├── figures/               # Paper-ready PDF/PNG figures
├── tests/                 # Unit tests for metrics and dataset (29 tests)
├── requirements.txt
├── .env.example
└── README.md

Citation

If you use JudgeSense in your research, please cite:

@misc{bellibatlu2026judgesense,
      title={JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems}, 
      author={Rohith Reddy Bellibatlu},
      year={2026},
      eprint={2604.23478},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.23478}, 
}

License

Code: MIT License (see LICENSE)
Dataset: CC-BY-4.0

Contact

Rohith Reddy Bellibatlu — ORCID 0009-0003-6083-0364

This work is part of an independent research portfolio. All evaluations were conducted on public benchmarks and APIs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

Overview

Key contributions

Installation

Quickstart

Dataset

Quick usage

Schema

Key findings

Factuality (polarity-correction results)

Coherence (most discriminating task)

Reproducing paper results

Repository structure

Citation

License

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
analysis		analysis
data		data
figures		figures
judgesense-benchmark		judgesense-benchmark
notebooks		notebooks
outputs		outputs
scripts		scripts
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
dataset_builder.py		dataset_builder.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
run_parallel.bat		run_parallel.bat
upload_to_hf.py		upload_to_hf.py

Folders and files

Latest commit

History

Repository files navigation

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

Overview

Key contributions

Installation

Quickstart

Dataset

Quick usage

Schema

Key findings

Factuality (polarity-correction results)

Coherence (most discriminating task)

Reproducing paper results

Repository structure

Citation

License

Contact

About

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages