🔍 Evaluating and Mitigating LLM-as-a-Judge Bias in Communication Systems

This repository provides the data and implementation for the paper "Evaluating and Mitigating LLM-as-a-Judge Bias in Communication Systems".


This repository contains the following main components:

  • Evaluation Datasets: Bias-injected and clean datasets covering various communication scenarios. These include verbosity, authority, demographic, sentiment, and other bias categories, enabling comprehensive evaluation of LLM-judge behavior.
  • Judge Implementations: Scripts for evaluating two representative LLM judges (GPT-4o and JudgeLM) under different prompting conditions (detailed rubric vs. minimal prompt).
  • Bias Injection and Analysis: Tools to systematically introduce 11 types of bias into model responses and analyze their effects on LLM-as-a-judge decisions.

Dataset Structure

The dataset is organized as follows:

```
dataset/
│
├─ bias/                         # bias datasets we created
├─ eval/                         # evaluation data used during training
├─ result/                       # our experiment results
├─ scratches/                    # temporary files
├─ train/                        # Alpaca-format training data
│
├─ Test_questions_92p.jsonl      # test dataset
├─ gpqa.jsonl                    # GPQA dataset (graduate-level, Google-proof QA benchmark)
├─ judgelm_open_question.jsonl   # open-ended question set for judge LLM evaluation
└─ mmlu.jsonl                    # MMLU benchmark dataset for evaluation
```

  • Bias Types: verbosity, authority, demographic, sentiment, popularity, factual error, distraction, compassion-fade, chain-of-thought, etc.
  • Benchmarks: MMLU-Pro, GPQA, and JudgeLM evaluation tasks.
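The exact schema of the `.jsonl` files is not documented here, but a bias-injected record might look roughly like the following. All field names and values in this sketch are illustrative assumptions, not the repository's actual format:

```python
import json

# Hypothetical bias-injected evaluation record; the real field names
# in the dataset's .jsonl files may differ.
record = {
    "question": "What is the capital of France?",
    "clean_answer": "The capital of France is Paris.",
    "biased_answer": "As a Harvard professor, I can assure you the capital of France is Paris.",
    "bias_type": "authority",
}

# Each record is stored as one JSON object per line (JSONL).
line = json.dumps(record)
print(json.loads(line)["bias_type"])  # → authority
```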

⚙️ Setup

Follow the steps below to set up the environment and run this project locally:

1️⃣ Clone the Repository

```bash
git clone https://github.com/Xxxxsir/Score-Judge.git
cd Score-Judge
```

2️⃣ Create and Activate the Environment

We provide a pre-configured conda environment file. You can install all dependencies with:

```bash
conda env create -f environment.yml
conda activate judge
```

3️⃣ Configure API Keys

Before running any experiments, you need to specify your API keys. Edit the file judge_agent/llm_core/apikey.py and fill in your keys, for example:

```python
OPENAI_API_KEY = "your_openai_key_here"
HUGGINGFACE_API_KEY = "your_hf_key_here"
```
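To avoid committing secrets to the repository, one option is to have apikey.py read the keys from environment variables instead of hardcoding them. A minimal sketch (the variable names are illustrative):

```python
import os

# Read API keys from the environment, falling back to placeholders.
# Set them in your shell, e.g.: export OPENAI_API_KEY="sk-..."
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "your_openai_key_here")
HUGGINGFACE_API_KEY = os.environ.get("HUGGINGFACE_API_KEY", "your_hf_key_here")
```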

Model Preparation

We evaluate two target LLMs and two judge LLMs:

| Model | Link |
| --- | --- |
| Llama-3.1-8B-Instruct | 🤗 [Hugging Face] |
| Llama-3.1-8B | 🤗 [Hugging Face] |
| JudgeLM | 🤗 [Hugging Face] |
| GPT-4o | 🔗 [OpenAI API] |

Bias Dataset Generation

We provide a pipeline to automatically generate and optimize bias-related responses for evaluation and analysis.
You can run the pipeline by using the following Python script:

```python
run_pipeline(
    input_path=input_file_path,
    output_path=output_file_path,
    model_name=model_name,
    prompt_template=prompt_template_dict,
    score_aspects=aspects,
    max_retries=5,
    min_accepted_score=9,
    **score_config["0-10"]
)
```

Bias Injection Fine-tune with LoRA

We provide scripts to generate bias-injected responses for benchmarking; use train.py:

```bash
torchrun --nproc_per_node <NUM_GPUS> --master_port 29501 train.py \
  --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
  --dataset_name_or_path path/to/your/dataset \
  --output_dir path/to/output_dir \
  --num_train_epochs 4 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --learning_rate 2e-4 \
  --full_finetune False \
  --bf16 True \
  --bits 4 \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_modules all \
  --lora_dropout 0.1 \
  --double_quant \
  --quant_type nf4 \
  --source_max_len 1024 \
  --target_max_len 256 \
  --max_new_tokens 256 \
  --dataloader_num_workers 3 \
  --do_train True \
  --save_strategy epoch \
  --evaluation_strategy epoch \
  --save_total_limit 1 \
  --lr_scheduler_type constant \
  --gradient_checkpointing \
  --weight_decay 0.01 \
  --max_grad_norm 1.0 \
  --warmup_ratio 0.05 \
  --seed 42 \
  --cache_dir ./data \
  --deepspeed config/ds_config_zero2.json
```
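Note that the flags above determine the effective global batch size: per-device batch size times gradient-accumulation steps times the process count passed to `--nproc_per_node`. A quick check, assuming 4 GPUs for illustration:

```python
# Effective global batch size implied by the training flags above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
nproc_per_node = 4  # assumed here for illustration; set to your GPU count

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * nproc_per_node)
print(effective_batch_size)  # → 32
```

If you change the GPU count, adjust `gradient_accumulation_steps` to keep the effective batch size (and thus the learning-rate schedule behavior) comparable.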

Bias Evaluation

Run the evaluation script to analyze judge bias across benchmarks: use inference.py to generate model responses and app.py to judge the answers.

```python
model, tokenizer = load_model(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    hf_token="hf_xxx",
    use_peft_model=True,
    adapter_model_path="path/to/your/adapter",
)

generate_answers_from_file(
    file_path="path/to/your/eval_file",
    out_file_path="path/to/your/output_file",
    model=model,
    tokenizer=tokenizer,
    prompt_template=prompt_alpaca,
    generate_fn=generate,
)
```

You can specify a different judge prompt in judge_agent/prompt.py:

```python
run_llm_judge(
    input_path=input_file_path,
    output_path=output_file_path,
    model_name=model_name,
    prompt_template=dialogue_judge_prompt,
    score_aspects=aspects,
    **score_config["0-10"],
)
```
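A judge prompted with a 0-10 rubric typically replies in free-form text, so the pipeline needs to extract a numeric score from the reply. The sketch below assumes a "Score: N" convention; the repository's actual judge output format may differ:

```python
import re

# Hedged sketch: pull a numeric score out of a judge's free-form reply
# and reject values outside the rubric's range. The "Score:" pattern is
# an assumption, not this repository's documented output format.
def parse_judge_score(judge_output, lo=0, hi=10):
    match = re.search(r"[Ss]core\s*[:=]?\s*(\d+(?:\.\d+)?)", judge_output)
    if match is None:
        return None  # judge gave no parseable score
    score = float(match.group(1))
    return score if lo <= score <= hi else None

print(parse_judge_score("Helpfulness is high. Score: 8"))  # → 8.0
```

Range-checking matters in practice: judges occasionally emit out-of-rubric numbers, and silently keeping them would skew aggregate bias statistics.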

Extended Studies

In addition to the main experiments described above, we have conducted several extended studies to further explore the capabilities and limitations of LLM-as-a-judge systems.

🔄 Bias Fusion Evaluation

We explored a novel setting where multiple types of bias are intentionally fused into a single model response. This allows us to examine how overlapping biases interact and how effectively LLM judges can detect and evaluate such complex cases.

🤖 Multi-turn Agent Evaluation

We also evaluated model judgment performance in a multi-turn conversation scenario, where the LLM agent engages in interactive dialogues rather than single-turn QA. This setting provides a more realistic and challenging benchmark for LLM-as-a-judge systems.

📂 All datasets and experimental results related to these extended studies can be found in the data/ours directory.

📚 Citation

If you find our work useful, please cite:

@misc{gao2025evaluatingmitigatingllmasajudgebias,
  title         = {Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems},
  author        = {Jiaxin Gao and Chen Chen and Yanwen Jia and Xueluan Gong and Kwok-Yan Lam and Qian Wang},
  year          = {2025},
  eprint        = {2510.12462},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2510.12462},
}

About

Score-Judge is a comprehensive evaluation framework designed to detect and mitigate preference bias in large language model (LLM) outputs. It provides a complete end-to-end pipeline, from data preprocessing and model fine-tuning to inference, scoring, and rewriting.
