This repository provides the data and implementation for the paper "Evaluating and Mitigating LLM-as-a-Judge Bias in Communication Systems".
This repository contains the following main components:
- Evaluation Datasets: Bias-injected and clean datasets covering various communication scenarios. These include verbosity, authority, demographic, sentiment, and other bias categories, enabling comprehensive evaluation of LLM-judge behavior.
- Judge Implementations: Scripts for evaluating two representative LLM judges (GPT-judge and JudgeLM) under different prompting conditions (detailed rubric vs. minimal prompt).
- Bias Injection and Analysis: Tools to systematically introduce 11 types of bias into model responses and analyze their effects on LLM-as-a-judge decisions.
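As an illustration, a bias injection step can be as simple as a deterministic transformation of a clean response. The sketch below is ours, not the repository's implementation; the function names and the two example bias types are assumptions.

```python
# Minimal sketch of bias injection (illustrative only; the repository's
# actual injectors may differ). Function names are hypothetical.

def inject_verbosity(response: str) -> str:
    """Pad a clean response with redundant filler to simulate verbosity bias."""
    filler = (" To elaborate further, and to make absolutely sure every "
              "aspect is covered, let us restate the key point once more.")
    return response + filler * 2

def inject_authority(response: str) -> str:
    """Prepend a fabricated appeal to authority."""
    return "As confirmed by leading experts and official guidelines, " + response

if __name__ == "__main__":
    clean = "The capital of France is Paris."
    print(inject_verbosity(clean))
    print(inject_authority(clean))
```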
The dataset is organized as follows:
```
dataset/
│
├─ bias/                        # Bias-injected datasets we create
├─ eval/                        # Evaluation data
├─ result/                      # Our experiment results
├─ scratches/                   # Temporary files
├─ train/                       # Training data in Alpaca format
│
├─ Test_questions_92p.jsonl     # Test question set
├─ gpqa.jsonl                   # GPQA benchmark (graduate-level, Google-proof QA)
├─ judgelm_open_question.jsonl  # Open-ended question set for judge LLM evaluation
└─ mmlu.jsonl                   # MMLU benchmark dataset for evaluation
```
- Bias Types: verbosity, authority, demographic, sentiment, popularity, factual error, distraction, compassion-fade, chain-of-thought, etc.
- Benchmarks: MMLU-Pro, GPQA, JudgeLM evaluation tasks.
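The .jsonl files above are standard JSON Lines, so records can be streamed one per line. A minimal loader (the path below is one of the files shipped here; inspect the files for the actual record schema):

```python
import json

def load_jsonl(path: str):
    """Yield one JSON record per line from a .jsonl file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example: count records in the MMLU split.
records = list(load_jsonl("dataset/mmlu.jsonl"))
print(f"{len(records)} records loaded")
```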
Follow the steps below to set up the environment and run this project locally:
```bash
git clone https://github.com/Xxxxsir/Score-Judge.git
cd Score-Judge
```

We provide a pre-configured conda environment file. You can install all dependencies with:

```bash
conda env create -f environment.yml
conda activate judge
```

Before running any experiments, you need to specify your API keys. Edit judge_agent/llm_core/apikey.py and fill in your keys, for example:

```python
OPENAI_API_KEY = "your_openai_key_here"
HUGGINGFACE_API_KEY = "your_hf_key_here"
```
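To avoid committing secrets, a variant of apikey.py that reads from environment variables may be safer (this is our suggestion, not the repository's shipped file):

```python
import os

# Prefer environment variables; fall back to empty strings so a missing key
# fails at request time rather than at import time.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
HUGGINGFACE_API_KEY = os.environ.get("HUGGINGFACE_API_KEY", "")
```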
We evaluate two target LLMs and two judge LLMs:
| Model name | Link |
|---|---|
| Llama-3.1-8B-Instruct | 🤗 [Hugging Face] |
| Llama-3.1-8B | 🤗 [Hugging Face] |
| JudgeLM | 🤗 [Hugging Face] |
| GPT-4o | 🔗 [OpenAI API] |
We provide a pipeline to automatically generate and optimize bias-related responses for evaluation and analysis.
You can run the pipeline with the following Python call:
```python
run_pipeline(
    input_path=input_file_path,        # source questions/responses
    output_path=output_file_path,      # where generated responses are written
    model_name=model_name,             # generator model
    prompt_template=prompt_template_dict,
    score_aspects=aspects,             # aspects the judge scores on
    max_retries=5,                     # regenerate until accepted or retries run out
    min_accepted_score=9,              # keep responses scoring at least 9
    **score_config["0-10"],            # unpack the 0-10 scoring configuration
)
```
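For reference, the undefined names in the call above could be bound as follows; the concrete paths and aspect names here are illustrative assumptions, not shipped defaults:

```python
# Illustrative bindings for the names used above; actual values depend on
# your setup (paths, model, and aspect names here are assumptions).
input_file_path = "dataset/bias/verbosity.jsonl"
output_file_path = "dataset/result/verbosity_optimized.jsonl"
model_name = "gpt-4o"
aspects = ["helpfulness", "relevance", "accuracy"]
score_config = {"0-10": {"min_score": 0, "max_score": 10}}
# prompt_template_dict: pick or adapt a template from judge_agent/prompt.py.
```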
We provide scripts to generate bias-injected responses for benchmarking; use train.py:
```bash
# Replace <num_gpus> with the number of GPUs to use.
torchrun --nproc_per_node <num_gpus> --master-port 29501 train.py \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --dataset_name_or_path path/to/your/dataset \
    --output_dir path/to/output_dir \
    --num_train_epochs 4 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-4 \
    --full_finetune False \
    --bf16 True \
    --bits 4 \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_modules all \
    --lora_dropout 0.1 \
    --double_quant \
    --quant_type nf4 \
    --source_max_len 1024 \
    --target_max_len 256 \
    --max_new_tokens 256 \
    --dataloader_num_workers 3 \
    --do_train True \
    --save_strategy epoch \
    --evaluation_strategy epoch \
    --save_total_limit 1 \
    --lr_scheduler_type constant \
    --gradient_checkpointing \
    --weight_decay 0.01 \
    --max_grad_norm 1.0 \
    --warmup_ratio 0.05 \
    --seed 42 \
    --cache_dir ./data \
    --deepspeed config/ds_config_zero2.json
```
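The command references config/ds_config_zero2.json. If you need to recreate it, a minimal ZeRO stage-2 configuration along these lines is typical; this is a sketch that delegates batch sizes and precision to the HF Trainer via "auto", and the repository's shipped file may differ:

```python
import json

# Minimal DeepSpeed ZeRO-2 config sketch; "auto" lets the HF Trainer fill in
# batch sizes and precision from the CLI flags above.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_clipping": "auto",
}

with open("config/ds_config_zero2.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```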
Run the evaluation scripts to analyze judge bias across benchmarks: use inference.py to generate model responses and app.py to judge the answers.
```python
# Load the fine-tuned model (optionally with a PEFT/LoRA adapter).
model, tokenizer = load_model(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    hf_token="hf_xxx",                      # your Hugging Face access token
    use_peft_model=True,
    adapter_model_path="path/to/your/adapter",
)

# Generate answers for every question in the evaluation file.
generate_answers_from_file(
    file_path="path/to/your/eval_file",
    out_file_path="path/to/your/output_file",
    model=model,
    tokenizer=tokenizer,
    prompt_template=prompt_alpaca,          # Alpaca-style prompt template
    generate_fn=generate,
)
```
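The generate_fn hook lets you swap decoding strategies. A minimal greedy-decoding implementation with Hugging Face transformers might look like this; it is a sketch, and the repository's own generate may differ in signature and sampling settings:

```python
import torch

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> str:
    """Greedy-decoding sketch; the repository's generate_fn may differ."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Strip the prompt tokens and decode only the newly generated answer.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```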
You can specify a different judge prompt in judge_agent/prompt.py:
```python
run_llm_judge(
    input_path=input_file_path,
    output_path=input_file_path,            # judgments are written back to the
                                            # input file; pass a separate path to
                                            # keep the inputs untouched
    model_name=model_name,
    prompt_template=dialogue_judge_prompt,  # defined in judge_agent/prompt.py
    score_aspects=aspects,
    **score_config["0-10"],
)
```
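The exact shape of a prompt template is not documented here, so the example below is only a guess at a plausible structure; mirror the existing entries in judge_agent/prompt.py for the real format:

```python
# Hypothetical custom judge prompt; adapt the structure of the existing
# templates in judge_agent/prompt.py rather than copying this verbatim.
my_judge_prompt = (
    "You are an impartial judge. Rate the assistant's answer to the "
    "question below on each aspect from {min_score} to {max_score}.\n\n"
    "Question: {question}\n"
    "Answer: {answer}\n\n"
    "Return one integer score per aspect, with a one-sentence justification."
)
```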
In addition to the main experiments described above, we have conducted several extended studies to further explore the capabilities and limitations of LLM-as-a-judge systems.
We explored a novel setting where multiple types of bias are intentionally fused into a single model response. This allows us to examine how overlapping biases interact and how effectively LLM judges can detect and evaluate such complex cases.
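Fusing biases can be modeled as function composition over a clean response. The sketch below is our illustration, reusing the hypothetical injectors from the earlier sketch; it is not the repository's implementation:

```python
from functools import reduce

def fuse_biases(response: str, injectors) -> str:
    """Apply a sequence of bias injectors left to right."""
    return reduce(lambda text, inject: inject(text), injectors, response)

# Example: stack verbosity bias on top of authority bias.
fused = fuse_biases(
    "The capital of France is Paris.",
    [inject_authority, inject_verbosity],  # defined in the earlier sketch
)
```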
We also evaluated model judgment performance in a multi-turn conversation scenario, where the LLM agent engages in interactive dialogues rather than single-turn QA. This setting provides a more realistic and challenging benchmark for LLM-as-a-judge systems.
📂 All datasets and experimental results related to these extended studies can be found in the data/ours directory.
If you find our work useful, please cite:
```bibtex
@misc{gao2025evaluatingmitigatingllmasajudgebias,
  title         = {Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems},
  author        = {Jiaxin Gao and Chen Chen and Yanwen Jia and Xueluan Gong and Kwok-Yan Lam and Qian Wang},
  year          = {2025},
  eprint        = {2510.12462},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2510.12462},
}
```