Code for the paper "Evolving Fairness in Large Language Models: A Longitudinal Multi-Benchmark Study Across Model Families and Versions" — under review at COLM 2026.
A reproducible, modular framework for evaluating fairness drift in LLMs — defined as systematic changes in fairness metrics across successive model versions. Evaluates 12 model versions from 6 major providers across 7 established bias benchmarks (~200K examples).
TL;DR: Toxicity declines consistently across newer model releases, but stereotype scores often remain stable or worsen — revealing a fundamental tradeoff in current safety optimization pipelines.
| Finding | Detail |
|---|---|
| Toxicity improves | Consistent downward trend across all providers (mean drift: -0.005) |
| Stereotypes don't | Stereotype proxy scores stable or increasing, especially Gemini-2.5-Pro (+0.027) |
| Fairness is multidimensional | All pairwise correlations weak (\|r\| < 0.2) — gains in one metric don't predict others |
| Non-monotonic evolution | Version updates frequently improve one metric while regressing another |
| Provider differences | OpenAI and Gemma show lower stereotype scores; Gemini shows highest volatility |
Mean scores (± s.d.) by provider:

| Provider | Sentiment | Toxicity | Stereotype Proxy |
|---|---|---|---|
| OpenAI | 0.79 ± 0.10 | 0.01 ± 0.02 | 0.18 ± 0.12 |
| Anthropic | 0.76 ± 0.12 | 0.01 ± 0.02 | 0.25 ± 0.13 |
| Google (Gemini) | 0.72 ± 0.14 | 0.03 ± 0.05 | 0.32 ± 0.15 |
| Meta (LLaMA) | 0.74 ± 0.11 | 0.01 ± 0.02 | 0.20 ± 0.11 |
| Google (Gemma) | 0.78 ± 0.10 | 0.02 ± 0.03 | 0.22 ± 0.12 |
Version-to-version drift per metric:

| Metric | Mean Drift | Std. Dev. | 95% CI |
|---|---|---|---|
| Sentiment | +0.020 | 0.010 | [0.02, 0.06] |
| Toxicity | -0.005 | 0.003 | [-0.008, 0.002] |
| Stereotype Proxy | +0.010 | 0.020 | [-0.04, 0.02] |
Key figures and result tables are in examples/.
Full evaluation outputs (raw results per model, all plots, summary tables) are available as assets in the v1.0.0 Release.
This framework systematically evaluates fairness drift across model families and versions by:
- Querying models with prompts from 7 established bias benchmarks
- Computing fairness metrics across sentiment, toxicity, and stereotype dimensions
- Measuring version-to-version drift using bootstrap resampling (1,000 samples)
- Generating longitudinal visualizations and statistical analyses across providers
Commercial APIs:
| Provider | Models |
|---|---|
| OpenAI | GPT-4-turbo, GPT-4.1, GPT-4.1-mini, GPT-4o, GPT-4o-mini |
| Anthropic | Claude Sonnet 4, Claude Sonnet 4.5, Claude Opus 4.1 |
| Google Gemini | Gemini-2.5-Pro, 2.5-Flash, 2.5-Flash-Lite, 2.0-Flash |
Open-Weight Models (via HuggingFace):
| Provider | Models |
|---|---|
| Meta LLaMA | Llama-3.1 (8B, 70B, 405B), Llama-3.2 (1B, 3B) |
| Google Gemma | Gemma-2 (2B, 9B) |
All evaluations use deterministic decoding (temperature=0, seed=42), with a single query per prompt and no best-of-n sampling.
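As a configuration sketch, the shared decoding settings could be centralized as below. The key names are illustrative assumptions — each provider SDK spells them differently:

```python
# Shared decoding settings for all evaluation runs (key names illustrative;
# each provider's SDK uses its own parameter spelling).
DECODING_PARAMS = {
    "temperature": 0,  # greedy decoding, no sampling randomness
    "seed": 42,        # fixed seed where the API honors it
    "n": 1,            # one completion per prompt, no best-of-n
}
```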
| Dataset | Task Type | Demographic Axes |
|---|---|---|
| BOLD | Open-ended generation | Profession, gender, race, religion |
| StereoSet | Stereotype detection | Gender, race, religion, profession |
| CrowS-Pairs | Minimal pairs | Race, gender, religion, age, disability |
| BBQ | Question answering bias | Age, gender, race, nationality |
| HolisticBias | Diverse demographic prompts | 13 demographic axes |
| WinoBias | Gender coreference | Gender |
| RealToxicityPrompts | Toxicity generation | Open domain |
Primary Metrics:
- Sentiment Score (VADER, 0–1): Higher = more positive. Ideal: balanced across groups.
- Toxicity Score (Perspective API, 0–1): Higher = more toxic. Ideal: lower across all groups.
- Stereotype Score (pattern-based, 0–1): Higher = more stereotypical. Ideal: lower scores.
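For intuition, a pattern-based stereotype proxy can be as simple as lexicon matching. The sketch below is illustrative only — the patterns and scoring rule are assumptions, not the actual scorer in src/fairness_metrics.py:

```python
import re

# Hypothetical stereotype-associated patterns (a real lexicon would be
# far larger and benchmark-derived).
STEREOTYPE_PATTERNS = [r"\ball (wo)?men\b", r"\balways\b", r"\bnaturally\b"]

def stereotype_proxy(text: str) -> float:
    """Fraction of stereotype patterns matched in a response, in [0, 1]."""
    hits = sum(bool(re.search(p, text.lower())) for p in STEREOTYPE_PATTERNS)
    return min(hits / len(STEREOTYPE_PATTERNS), 1.0)
```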
Advanced Fairness Metrics:
- Demographic Parity, Equalized Odds, Equal Opportunity Difference
- Theil Index, KL Divergence, Maximum Mean Discrepancy (MMD)
- Individual Fairness, Conditional Demographic Parity, Intersectional Fairness
- Gini Coefficient, Predictive Parity, Statistical significance tests
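Two of these metrics admit compact definitions. The sketch below is illustrative, not the implementation in src/metrics.py — it shows demographic parity difference and the Theil index over per-group scores:

```python
import math

def demographic_parity_diff(group_rates: dict) -> float:
    """Largest gap in positive-outcome rate across demographic groups."""
    vals = list(group_rates.values())
    return max(vals) - min(vals)

def theil_index(group_means: list) -> float:
    """Theil T inequality index over per-group mean scores (0 = perfectly equal)."""
    mu = sum(group_means) / len(group_means)
    return sum((x / mu) * math.log(x / mu) for x in group_means if x > 0) / len(group_means)
```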
- Python 3.9 or higher
- API keys for target models (OpenAI, Anthropic, Google, HuggingFace)
- Perspective API key (optional, for toxicity scoring)
```bash
git clone https://github.com/pulipakav1/AI_bias.git
cd AI_bias
pip install -r requirements.txt
```

Copy the environment template and fill in your API keys:

```bash
cp .env.example .env
```

```
ANTHROPIC_API_KEY=your_anthropic_key
OPENAI_API_KEY=your_openai_key
GOOGLE_API_KEY=your_google_key
HUGGINGFACE_API_KEY=your_huggingface_key
PERSPECTIVE_API_KEY=your_perspective_key  # optional
```

Download the CrowS-Pairs dataset into data/:

```bash
wget https://raw.githubusercontent.com/nyu-mll/crows-pairs/master/data/crows_pairs_anonymized.csv \
  -O data/crows_pairs_anonymized.csv
```

Run the full benchmark, or filter it from the command line:

```bash
python main.py

# By provider
python main.py --provider=openai,claude

# By dataset
python main.py --dataset=bold,stereoset,crows_pairs

# By model
python main.py --model=gpt-4.1,claude-sonnet-4-20250514

# Open-weight models only
python main.py --hf-only

# Exclude open-weight models
python main.py --no-hf
```

The same runs can be launched from Python:

```python
from main import run_benchmark

run_benchmark(
    include_hf_models=True,
    dataset_filter=["bold", "stereoset"],
    provider_filter=["openai", "claude"],
    model_filter=["gpt-4.1", "claude-sonnet-4-20250514"]
)
```

```
AI_bias/
├── src/                            # Pipeline source code
│   ├── config.py                   # Models, API keys, constants
│   ├── data_loader.py              # Dataset loading & preprocessing
│   ├── model_interface.py          # Model API query interfaces
│   ├── fairness_metrics.py         # Fairness metric computation
│   ├── metrics.py                  # Advanced fairness metrics
│   └── visualization.py            # Plot generation & analysis
├── data/                           # Static input data
│   └── crows_pairs_anonymized.csv  # Download separately (see Setup)
├── notebooks/                      # Exploratory analysis
│   └── qualitative_analysis.ipynb
├── examples/                       # Key figures and result tables from paper
│   ├── plots/                      # Main visualizations
│   └── results/                    # Summary CSVs and metrics JSON
├── outputs/                        # Generated at runtime (gitignored)
│   ├── results/                    # Raw results and metrics
│   ├── plots/                      # All generated visualizations
│   └── tables/                     # Summary statistics tables
├── main.py                         # Benchmark entry point & CLI
├── requirements.txt
├── .env.example                    # API key template
└── .gitignore
```
- `raw_results_*.csv` — Raw query results with prompts and responses
- `raw_<provider>_<model>_*.csv` — Per-model raw results
- `results_with_metrics_*.csv` — Results with computed fairness metrics
- `comprehensive_metrics_*.json` — Advanced fairness metrics
- Model and provider comparison charts
- Version progression trajectories (sentiment, toxicity, stereotype)
- Metric correlation heatmaps
- Benchmark sensitivity analysis
- Demographic axis breakdowns
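The correlation heatmaps reduce to pairwise Pearson coefficients between metric series (e.g. per-version sentiment vs. toxicity). A minimal sketch of that computation — illustrative, not the repo's plotting code:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two aligned metric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```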
- `summary_statistics.csv` — Overall summary
- `model_version_summary.csv` — Per-model aggregated metrics
- `model_version_detailed_metrics.csv` — Detailed model-level metrics
- `pairwise_statistical_tests.csv` — Statistical significance tests
```bash
# Commercial models
python main.py --no-hf

# Open-weight models
python main.py --hf-only

# Qualitative analysis
jupyter notebook notebooks/qualitative_analysis.ipynb
```

Statistical analysis uses bootstrap resampling with 1,000 samples. All evaluations use deterministic decoding (temperature=0, seed=42).
Note on Gemini safety filters: Content filters are intentionally disabled during evaluation. Bias benchmarks contain prompts that would otherwise be blocked, producing empty responses rather than real model outputs. This follows standard practice in fairness evaluation research (Gehman et al., 2020).
- English-language benchmarks only — findings may not generalize to multilingual settings
- Single-turn prompts only — multi-turn bias dynamics not captured
- Automated proxy metrics — context-dependent harms may be missed without human evaluation
- Proprietary API opacity — causal attribution of drift not possible for commercial models
- Benchmark coverage — public datasets represent curated subsets of real-world interactions
- Dataset creators: BOLD, StereoSet, BBQ, CrowS-Pairs, RealToxicityPrompts, HolisticBias, WinoBias
- Model providers: OpenAI, Anthropic, Google, Meta, HuggingFace
- Libraries: VADER Sentiment, Perspective API, HuggingFace Datasets, matplotlib, seaborn, scipy
This project is licensed under the MIT License. See LICENSE for details.