Evolving Fairness in LLMs: A Longitudinal Multi-Benchmark Study


Code for the paper "Evolving Fairness in Large Language Models: A Longitudinal Multi-Benchmark Study Across Model Families and Versions" — under review at COLM 2026.

A reproducible, modular framework for evaluating fairness drift in LLMs — defined as systematic changes in fairness metrics across successive model versions. Evaluates 12 model versions from 6 major providers across 7 established bias benchmarks (~200K examples).


Key Findings

TL;DR: Toxicity declines consistently across newer model releases, but stereotype scores often remain stable or worsen — revealing a fundamental tradeoff in current safety optimization pipelines.

| Finding | Detail |
| --- | --- |
| Toxicity improves | Consistent downward trend across all providers (mean drift: -0.005) |
| Stereotypes don't | Stereotype proxy scores stable or increasing, especially Gemini-2.5-Pro (+0.027) |
| Fairness is multidimensional | All pairwise correlations weak (\|r\| < 0.2); gains in one metric don't predict others |
| Non-monotonic evolution | Version updates frequently improve one metric while regressing another |
| Provider differences | OpenAI and Gemma show lower stereotype scores; Gemini shows highest volatility |

Provider-Level Results

| Provider | Sentiment | Toxicity | Stereotype Proxy |
| --- | --- | --- | --- |
| OpenAI | 0.79 ± 0.10 | 0.01 ± 0.02 | 0.18 ± 0.12 |
| Anthropic | 0.76 ± 0.12 | 0.01 ± 0.02 | 0.25 ± 0.13 |
| Google (Gemini) | 0.72 ± 0.14 | 0.03 ± 0.05 | 0.32 ± 0.15 |
| Meta (LLaMA) | 0.74 ± 0.11 | 0.01 ± 0.02 | 0.20 ± 0.11 |
| Google (Gemma) | 0.78 ± 0.10 | 0.02 ± 0.03 | 0.22 ± 0.12 |

Aggregate Drift Across All Models

| Metric | Mean Drift | Std. Dev. | 95% CI |
| --- | --- | --- | --- |
| Sentiment | +0.020 | 0.010 | [0.02, 0.06] |
| Toxicity | -0.005 | 0.003 | [-0.008, 0.002] |
| Stereotype Proxy | +0.010 | 0.020 | [-0.04, 0.02] |

Results & Reproducibility

Key figures and result tables are in examples/.

Full evaluation outputs (raw results per model, all plots, summary tables) are available as assets in the v1.0.0 Release.


Overview

This framework systematically evaluates fairness drift across model families and versions by:

  • Querying models with prompts from 7 established bias benchmarks
  • Computing fairness metrics across sentiment, toxicity, and stereotype dimensions
  • Measuring version-to-version drift using bootstrap resampling (1,000 samples)
  • Generating longitudinal visualizations and statistical analyses across providers
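
The drift measurement in the third step can be sketched with a paired bootstrap over per-prompt metric scores (a minimal illustration under assumed inputs; the function name is not the repository's actual API):

```python
import random

def bootstrap_drift_ci(old_scores, new_scores, n_boot=1000, alpha=0.05, seed=42):
    """Estimate a point value and (1 - alpha) confidence interval for mean
    metric drift (new version minus old version) via paired bootstrap
    resampling. Illustrative sketch, not the repo's implementation."""
    rng = random.Random(seed)
    pairs = list(zip(old_scores, new_scores))
    drifts = []
    for _ in range(n_boot):
        # Resample prompt pairs with replacement and record the mean drift.
        sample = [rng.choice(pairs) for _ in pairs]
        drifts.append(sum(new - old for old, new in sample) / len(sample))
    drifts.sort()
    lo = drifts[int(alpha / 2 * n_boot)]
    hi = drifts[int((1 - alpha / 2) * n_boot) - 1]
    point = sum(new - old for old, new in pairs) / len(pairs)
    return point, (lo, hi)
```

With 1,000 resamples this matches the bootstrap configuration described above; the percentile interval is one common choice among several.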

Models Evaluated

Commercial APIs:

| Provider | Models |
| --- | --- |
| OpenAI | GPT-4-turbo, GPT-4.1, GPT-4.1-mini, GPT-4o, GPT-4o-mini |
| Anthropic | Claude Sonnet 4, Claude Sonnet 4.5, Claude Opus 4.1 |
| Google Gemini | Gemini-2.5-Pro, 2.5-Flash, 2.5-Flash-Lite, 2.0-Flash |

Open-Weight Models (via HuggingFace):

| Provider | Models |
| --- | --- |
| Meta LLaMA | Llama-3.1 (8B, 70B, 405B), Llama-3.2 (1B, 3B) |
| Google Gemma | Gemma-2 (2B, 9B) |

All evaluations use deterministic decoding (temperature=0, seed=42), with a single evaluation per prompt and no best-of-n sampling.
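
For reference, that decoding setup can be expressed as a shared request payload. The field names below follow the OpenAI chat-completions style and are assumptions for other providers; this is a sketch, not the repository's `model_interface.py`:

```python
# Decoding settings shared by every query: greedy decoding with a fixed
# seed and exactly one completion per prompt. Field names are illustrative.
DECODING = {"temperature": 0, "seed": 42, "n": 1}

def build_request(model: str, prompt: str) -> dict:
    """Assemble a chat-style request with the deterministic settings above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **DECODING,
    }
```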


Benchmarks

| Dataset | Task Type | Demographic Axes |
| --- | --- | --- |
| BOLD | Open-ended generation | Profession, gender, race, religion |
| StereoSet | Stereotype detection | Gender, race, religion, profession |
| CrowS-Pairs | Minimal pairs | Race, gender, religion, age, disability |
| BBQ | Question answering bias | Age, gender, race, nationality |
| HolisticBias | Diverse demographic prompts | 13 demographic axes |
| WinoBias | Gender coreference | Gender |
| RealToxicityPrompts | Toxicity generation | Open domain |

Fairness Metrics

Primary Metrics:

  • Sentiment Score (VADER, 0–1): Higher = more positive. Ideal: balanced across groups.
  • Toxicity Score (Perspective API, 0–1): Higher = more toxic. Ideal: lower across all groups.
  • Stereotype Score (pattern-based, 0–1): Higher = more stereotypical. Ideal: lower scores.

Advanced Fairness Metrics:

  • Demographic Parity, Equalized Odds, Equal Opportunity Difference
  • Theil Index, KL Divergence, Maximum Mean Discrepancy (MMD)
  • Individual Fairness, Conditional Demographic Parity, Intersectional Fairness
  • Gini Coefficient, Predictive Parity, Statistical significance tests
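
As a concrete example of the group-level metrics above, a demographic-parity gap over per-group mean scores can be computed in a few lines (a minimal sketch, not the repository's `metrics.py` implementation):

```python
from statistics import mean

def demographic_parity_gap(scores_by_group):
    """Difference between the highest and lowest group-level mean score.
    0.0 means perfectly balanced across demographic groups. Illustrative
    only; group keys and score scales are assumptions."""
    group_means = {group: mean(scores) for group, scores in scores_by_group.items()}
    return max(group_means.values()) - min(group_means.values())
```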

Installation

Prerequisites

  • Python 3.9 or higher
  • API keys for target models (OpenAI, Anthropic, Google, HuggingFace)
  • Perspective API key (optional, for toxicity scoring)

Setup

git clone https://github.com/pulipakav1/AI_bias.git
cd AI_bias
pip install -r requirements.txt

Copy the environment template and fill in your API keys:

cp .env.example .env

Then set the keys in .env:

ANTHROPIC_API_KEY=your_anthropic_key
OPENAI_API_KEY=your_openai_key
GOOGLE_API_KEY=your_google_key
HUGGINGFACE_API_KEY=your_huggingface_key
PERSPECTIVE_API_KEY=your_perspective_key  # optional

Download the CrowS-Pairs dataset into data/:

wget https://raw.githubusercontent.com/nyu-mll/crows-pairs/master/data/crows_pairs_anonymized.csv \
     -O data/crows_pairs_anonymized.csv

Usage

Run full benchmark

python main.py

Filtering options

# By provider
python main.py --provider=openai,claude

# By dataset
python main.py --dataset=bold,stereoset,crows_pairs

# By model
python main.py --model=gpt-4.1,claude-sonnet-4-20250514

# Open-weight models only
python main.py --hf-only

# Exclude open-weight models
python main.py --no-hf

Programmatic usage

from main import run_benchmark

run_benchmark(
    include_hf_models=True,
    dataset_filter=["bold", "stereoset"],
    provider_filter=["openai", "claude"],
    model_filter=["gpt-4.1", "claude-sonnet-4-20250514"]
)

Project Structure

AI_bias/
├── src/                           # Pipeline source code
│   ├── config.py                  # Models, API keys, constants
│   ├── data_loader.py             # Dataset loading & preprocessing
│   ├── model_interface.py         # Model API query interfaces
│   ├── fairness_metrics.py        # Fairness metric computation
│   ├── metrics.py                 # Advanced fairness metrics
│   └── visualization.py           # Plot generation & analysis
├── data/                          # Static input data
│   └── crows_pairs_anonymized.csv # Download separately (see Setup)
├── notebooks/                     # Exploratory analysis
│   └── qualitative_analysis.ipynb
├── examples/                      # Key figures and result tables from paper
│   ├── plots/                     # Main visualizations
│   └── results/                   # Summary CSVs and metrics JSON
├── outputs/                       # Generated at runtime (gitignored)
│   ├── results/                   # Raw results and metrics
│   ├── plots/                     # All generated visualizations
│   └── tables/                    # Summary statistics tables
├── main.py                        # Benchmark entry point & CLI
├── requirements.txt
├── .env.example                   # API key template
└── .gitignore

Outputs

outputs/results/

  • raw_results_*.csv — Raw query results with prompts and responses
  • raw_<provider>_<model>_*.csv — Per-model raw results
  • results_with_metrics_*.csv — Results with computed fairness metrics
  • comprehensive_metrics_*.json — Advanced fairness metrics

outputs/plots/

  • Model and provider comparison charts
  • Version progression trajectories (sentiment, toxicity, stereotype)
  • Metric correlation heatmaps
  • Benchmark sensitivity analysis
  • Demographic axis breakdowns

outputs/tables/

  • summary_statistics.csv — Overall summary
  • model_version_summary.csv — Per-model aggregated metrics
  • model_version_detailed_metrics.csv — Detailed model-level metrics
  • pairwise_statistical_tests.csv — Statistical significance tests

Reproducing Paper Results

# Commercial models
python main.py --no-hf

# Open-weight models
python main.py --hf-only

# Qualitative analysis
jupyter notebook notebooks/qualitative_analysis.ipynb

Statistical analysis uses bootstrap resampling with 1,000 samples. All evaluations use deterministic decoding (temperature=0, seed=42).

Note on Gemini safety filters: Safety filters are intentionally disabled during evaluation because bias benchmarks contain prompts that would otherwise be blocked, producing empty responses instead of real model outputs. This follows standard practice in fairness evaluation research (Gehman et al., 2020).


Limitations

  • English-language benchmarks only — findings may not generalize to multilingual settings
  • Single-turn prompts only — multi-turn bias dynamics not captured
  • Automated proxy metrics — context-dependent harms may be missed without human evaluation
  • Proprietary API opacity — causal attribution of drift not possible for commercial models
  • Benchmark coverage — public datasets represent curated subsets of real-world interactions


Acknowledgments

  • Dataset creators: BOLD, StereoSet, BBQ, CrowS-Pairs, RealToxicityPrompts, HolisticBias, WinoBias
  • Model providers: OpenAI, Anthropic, Google, Meta, HuggingFace
  • Libraries: VADER Sentiment, Perspective API, HuggingFace Datasets, matplotlib, seaborn, scipy

License

This project is licensed under the MIT License. See LICENSE for details.
