Code for the paper "Evolving Fairness in Large Language Models: A Longitudinal Multi-Benchmark Study Across Model Families and Versions" — under review at COLM 2026.
A reproducible, modular framework for evaluating fairness drift in LLMs — defined as systematic changes in fairness metrics across successive model versions. Evaluates 12 model versions from 6 major providers across 7 established bias benchmarks (~200K examples).
TL;DR: Toxicity declines consistently across newer model releases, but stereotype scores often remain stable or worsen — revealing a fundamental tradeoff in current safety optimization pipelines.
| Finding | Detail |
|---|---|
| Toxicity improves | Consistent downward trend across all providers (mean drift: -0.005) |
| Stereotypes don't | Stereotype proxy scores stable or increasing, especially Gemini-2.5-Pro (+0.027) |
| Fairness is multidimensional | All pairwise correlations weak (\|r\| < 0.2) — gains in one metric don't predict others |
| Non-monotonic evolution | Version updates frequently improve one metric while regressing another |
| Provider differences | OpenAI and Gemma show lower stereotype scores; Gemini shows highest volatility |
Mean scores (± s.d.) by provider:

| Provider | Sentiment | Toxicity | Stereotype Proxy |
|---|---|---|---|
| OpenAI | 0.79 ± 0.10 | 0.01 ± 0.02 | 0.18 ± 0.12 |
| Anthropic | 0.76 ± 0.12 | 0.01 ± 0.02 | 0.25 ± 0.13 |
| Google (Gemini) | 0.72 ± 0.14 | 0.03 ± 0.05 | 0.32 ± 0.15 |
| Meta (LLaMA) | 0.74 ± 0.11 | 0.01 ± 0.02 | 0.20 ± 0.11 |
| Google (Gemma) | 0.78 ± 0.10 | 0.02 ± 0.03 | 0.22 ± 0.12 |
Version-to-version drift per metric:

| Metric | Mean Drift | Std. Dev. | 95% CI |
|---|---|---|---|
| Sentiment | +0.020 | 0.010 | [0.02, 0.06] |
| Toxicity | -0.005 | 0.003 | [-0.008, 0.002] |
| Stereotype Proxy | +0.010 | 0.020 | [-0.04, 0.02] |
Key figures and result tables are in examples/.
Full evaluation outputs (raw results per model, all plots, summary tables) are available as assets in the v1.0.0 Release.
This framework systematically evaluates fairness drift across model families and versions by:
- Querying models with prompts from 7 established bias benchmarks
- Computing fairness metrics across sentiment, toxicity, and stereotype dimensions
- Measuring version-to-version drift using bootstrap resampling (1,000 samples)
- Generating longitudinal visualizations and statistical analyses across providers
Commercial APIs:
| Provider | Models |
|---|---|
| OpenAI | GPT-4-turbo, GPT-4.1, GPT-4.1-mini, GPT-4o, GPT-4o-mini |
| Anthropic | Claude Sonnet 4, Claude Sonnet 4.5, Claude Opus 4.1 |
| Google Gemini | Gemini-2.5-Pro, 2.5-Flash, 2.5-Flash-Lite, 2.0-Flash |
Open-Weight Models (via HuggingFace):
| Provider | Models |
|---|---|
| Meta LLaMA | Llama-3.1 (8B, 70B, 405B), Llama-3.2 (1B, 3B) |
| Google Gemma | Gemma-2 (2B, 9B) |
All evaluations use deterministic decoding (temperature=0, seed=42), with a single query per prompt and no best-of-n sampling.
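As a configuration sketch, the shared decoding settings could be centralized as below. The key names are illustrative assumptions — each provider SDK spells them differently:

```python
# Shared decoding settings for all evaluation runs (key names illustrative;
# each provider's SDK uses its own parameter spelling).
DECODING_PARAMS = {
    "temperature": 0,  # greedy decoding, no sampling randomness
    "seed": 42,        # fixed seed where the API honors it
    "n": 1,            # one completion per prompt, no best-of-n
}
```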
| Dataset | Task Type | Demographic Axes |
|---|---|---|
| BOLD | Open-ended generation | Profession, gender, race, religion |
| StereoSet | Stereotype detection | Gender, race, religion, profession |
| CrowS-Pairs | Minimal pairs | Race, gender, religion, age, disability |
| BBQ | Question answering bias | Age, gender, race, nationality |
| HolisticBias | Diverse demographic prompts | 13 demographic axes |
| WinoBias | Gender coreference | Gender |
| RealToxicityPrompts | Toxicity generation | Open domain |
Primary Metrics:
- Sentiment Score (VADER, 0–1): Higher = more positive. Ideal: balanced across groups.
- Toxicity Score (Perspective API, 0–1): Higher = more toxic. Ideal: lower across all groups.
- Stereotype Score (pattern-based, 0–1): Higher = more stereotypical. Ideal: lower scores.
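For intuition, a pattern-based stereotype proxy can be as simple as lexicon matching. The sketch below is illustrative only — the patterns and scoring rule are assumptions, not the actual scorer in src/fairness_metrics.py:

```python
import re

# Hypothetical stereotype-associated patterns (a real lexicon would be
# far larger and benchmark-derived).
STEREOTYPE_PATTERNS = [r"\ball (wo)?men\b", r"\balways\b", r"\bnaturally\b"]

def stereotype_proxy(text: str) -> float:
    """Fraction of stereotype patterns matched in a response, in [0, 1]."""
    hits = sum(bool(re.search(p, text.lower())) for p in STEREOTYPE_PATTERNS)
    return min(hits / len(STEREOTYPE_PATTERNS), 1.0)
```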
Advanced Fairness Metrics:
- Demographic Parity, Equalized Odds, Equal Opportunity Difference
- Theil Index, KL Divergence, Maximum Mean Discrepancy (MMD)
- Individual Fairness, Conditional Demographic Parity, Intersectional Fairness
- Gini Coefficient, Predictive Parity, Statistical significance tests
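Two of these metrics admit compact definitions. The sketch below is illustrative, not the implementation in src/metrics.py — it shows demographic parity difference and the Theil index over per-group scores:

```python
import math

def demographic_parity_diff(group_rates: dict) -> float:
    """Largest gap in positive-outcome rate across demographic groups."""
    vals = list(group_rates.values())
    return max(vals) - min(vals)

def theil_index(group_means: list) -> float:
    """Theil T inequality index over per-group mean scores (0 = perfectly equal)."""
    mu = sum(group_means) / len(group_means)
    return sum((x / mu) * math.log(x / mu) for x in group_means if x > 0) / len(group_means)
```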
- Python 3.9 or higher
- API keys for target models (OpenAI, Anthropic, Google, HuggingFace)
- Perspective API key (optional, for toxicity scoring)
```bash
git clone https://github.com/pulipakav1/AI_bias.git
cd AI_bias
pip install -r requirements.txt
```

Copy the environment template and fill in your API keys:

```bash
cp .env.example .env
```

```
ANTHROPIC_API_KEY=your_anthropic_key
OPENAI_API_KEY=your_openai_key
GOOGLE_API_KEY=your_google_key
HUGGINGFACE_API_KEY=your_huggingface_key
PERSPECTIVE_API_KEY=your_perspective_key  # optional
```

Download the CrowS-Pairs dataset into data/:

```bash
wget https://raw.githubusercontent.com/nyu-mll/crows-pairs/master/data/crows_pairs_anonymized.csv \
  -O data/crows_pairs_anonymized.csv
```

Run the full benchmark, or filter it from the command line:

```bash
python main.py

# By provider
python main.py --provider=openai,claude

# By dataset
python main.py --dataset=bold,stereoset,crows_pairs

# By model
python main.py --model=gpt-4.1,claude-sonnet-4-20250514

# Open-weight models only
python main.py --hf-only

# Exclude open-weight models
python main.py --no-hf
```

The same runs can be launched from Python:

```python
from main import run_benchmark

run_benchmark(
    include_hf_models=True,
    dataset_filter=["bold", "stereoset"],
    provider_filter=["openai", "claude"],
    model_filter=["gpt-4.1", "claude-sonnet-4-20250514"]
)
```

```
AI_bias/
├── src/                            # Pipeline source code
│   ├── config.py                   # Models, API keys, constants
│   ├── data_loader.py              # Dataset loading & preprocessing
│   ├── model_interface.py          # Model API query interfaces
│   ├── fairness_metrics.py         # Fairness metric computation
│   ├── metrics.py                  # Advanced fairness metrics
│   └── visualization.py            # Plot generation & analysis
├── data/                           # Static input data
│   └── crows_pairs_anonymized.csv  # Download separately (see Setup)
├── notebooks/                      # Exploratory analysis
│   └── qualitative_analysis.ipynb
├── examples/                       # Key figures and result tables from paper
│   ├── plots/                      # Main visualizations
│   └── results/                    # Summary CSVs and metrics JSON
├── outputs/                        # Generated at runtime (gitignored)
│   ├── results/                    # Raw results and metrics
│   ├── plots/                      # All generated visualizations
│   └── tables/                     # Summary statistics tables
├── main.py                         # Benchmark entry point & CLI
├── requirements.txt
├── .env.example                    # API key template
└── .gitignore
```
- `raw_results_*.csv` — Raw query results with prompts and responses
- `raw_<provider>_<model>_*.csv` — Per-model raw results
- `results_with_metrics_*.csv` — Results with computed fairness metrics
- `comprehensive_metrics_*.json` — Advanced fairness metrics
- Model and provider comparison charts
- Version progression trajectories (sentiment, toxicity, stereotype)
- Metric correlation heatmaps
- Benchmark sensitivity analysis
- Demographic axis breakdowns
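The correlation heatmaps reduce to pairwise Pearson coefficients between metric series (e.g. per-version sentiment vs. toxicity). A minimal sketch of that computation — illustrative, not the repo's plotting code:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two aligned metric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```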
- `summary_statistics.csv` — Overall summary
- `model_version_summary.csv` — Per-model aggregated metrics
- `model_version_detailed_metrics.csv` — Detailed model-level metrics
- `pairwise_statistical_tests.csv` — Statistical significance tests
```bash
# Commercial models
python main.py --no-hf

# Open-weight models
python main.py --hf-only

# Qualitative analysis
jupyter notebook notebooks/qualitative_analysis.ipynb
```

Statistical analysis uses bootstrap resampling with 1,000 samples. All evaluations use deterministic decoding (temperature=0, seed=42).
Note on Gemini safety filters: Content filters are intentionally disabled during evaluation. Bias benchmarks contain prompts that would otherwise be blocked, producing empty responses rather than real model outputs. This follows standard practice in fairness evaluation research (Gehman et al., 2020).
- English-language benchmarks only — findings may not generalize to multilingual settings
- Single-turn prompts only — multi-turn bias dynamics not captured
- Automated proxy metrics — context-dependent harms may be missed without human evaluation
- Proprietary API opacity — causal attribution of drift not possible for commercial models
- Benchmark coverage — public datasets represent curated subsets of real-world interactions
- Dataset creators: BOLD, StereoSet, BBQ, CrowS-Pairs, RealToxicityPrompts, HolisticBias, WinoBias
- Model providers: OpenAI, Anthropic, Google, Meta, HuggingFace
- Libraries: VADER Sentiment, Perspective API, HuggingFace Datasets, matplotlib, seaborn, scipy
This project is licensed under the MIT License. See LICENSE for details.