TempPerturb-Eval

TempPerturb-Eval is a framework for analyzing how RAG performance changes under:

- internal variation: generation temperature
- external noise: context perturbations (replace_half, remove_half, ner), compared against the original, unperturbed context

It is designed for controlled robustness evaluation on HotpotQA.

📄 Paper: TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness

🧾 Abstract

RAG evaluation often studies generation temperature and retrieval noise separately. TempPerturb-Eval analyzes their interaction by applying controlled context perturbations (replace_half, remove_half, ner) across temperatures and models on HotpotQA. The framework supports both correctness and stability analysis, revealing that higher temperatures can amplify perturbation sensitivity in model-dependent ways.

Key Contributions

- A diagnostic benchmark for RAG robustness under joint internal/external variation.
- An analysis workflow for perturbation-temperature interaction effects.
- Practical guidance for model/temperature selection under noisy retrieval.

✨ At a Glance

- Task: RAG robustness evaluation under temperature + context perturbation
- Dataset: HotpotQA (fullwiki, stratified 600-sample subset)
- Main outputs:
  - eval_results/comprehensive_scores/comprehensive_scores_full.csv
  - eval_results/figs/temperature_trends/*.png
  - eval_results/figs/variability/*.png

Pipeline Flow

Preprocess → Generate/Download Results → Evaluate → Visualize

🚀 Quick Start

🧰 Environment

Python 3.11.9 is recommended.

conda create -n TempPerturb-Eval python=3.11.9
conda activate TempPerturb-Eval
pip install -r requirements.txt

Create a local config from the template:

cp config.example.json config.json

Then edit config.json:

- set python_path
- add your API keys in api_keys

Important

config.json is intended for local use and is ignored by git. Do not commit real API keys.

Main configuration fields:

- models
- temperatures (typically 0.0 to 2.0)
- q_types (bridge, comparison)
- perturbation settings (original, replace_half, remove_half, ner)
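
As a rough illustration of the fields above, the following sketch loads a config file and checks that the expected keys are present. The key names python_path, api_keys, models, temperatures, and q_types come from this README; the "perturbations" key, the hard 0.0 to 2.0 limit, and all values shown are assumptions, so check config.example.json for the real schema.

```python
import json
import tempfile
from pathlib import Path

# Keys named in this README. "perturbations" is an assumed key name and is
# therefore not required here.
REQUIRED_KEYS = ["python_path", "api_keys", "models", "temperatures", "q_types"]


def load_config(path):
    """Load a JSON config and fail early on missing keys or odd temperatures."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    # The README calls 0.0 to 2.0 the typical range; treating it as a hard
    # limit is a choice made by this sketch, not by the project.
    for temperature in config["temperatures"]:
        if not 0.0 <= temperature <= 2.0:
            raise ValueError(f"temperature outside 0.0-2.0: {temperature}")
    return config


# Illustrative values only; not the project's real defaults.
example = {
    "python_path": "/usr/bin/python3",
    "api_keys": {"openai": "YOUR_KEY_HERE"},
    "models": ["gpt-3.5-turbo"],
    "temperatures": [0.0, 1.0, 2.0],
    "q_types": ["bridge", "comparison"],
    "perturbations": ["original", "replace_half", "remove_half", "ner"],
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example), encoding="utf-8")
    loaded = load_config(config_path)
    print(loaded["models"], loaded["temperatures"])
```

Failing at load time keeps a typo in config.json from surfacing only after API calls have already been made.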

📚 Data

This project uses the HotpotQA fullwiki train split, from which a stratified subset of 600 samples is drawn.

To build local stratified data:

python ./scripts/_1_preprocess.py

This creates files under data/stratified_train/.

Notes:

- The stratified setup targets 600 samples total (2 question types × 3 fact-count strata × 100 samples each).
- Preprocessing also prepares the processed_complete_answer references used by evaluation.
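
The 2 × 3 × 100 layout described above can be sketched as a grouped random draw. This is a minimal illustration, not the logic of _1_preprocess.py: the record fields q_type and num_facts, the fact counts (2, 3, 4), and the use of a local random.Random(42) instead of random.seed(42) are all assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_stratum=100, seed=42):
    """Draw up to `per_stratum` records from each (q_type, num_facts) group."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    groups = defaultdict(list)
    for record in records:
        groups[(record["q_type"], record["num_facts"])].append(record)

    sample = []
    for key in sorted(groups):  # sorted keys keep the output order deterministic
        pool = groups[key]
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample


# Synthetic stand-in for HotpotQA: 2 question types x 3 fact counts x 150 items.
records = [
    {"id": f"{q_type}-{num_facts}-{i}", "q_type": q_type, "num_facts": num_facts}
    for q_type in ("bridge", "comparison")
    for num_facts in (2, 3, 4)
    for i in range(150)
]

subset = stratified_sample(records)
print(len(subset))  # 6 strata x 100 = 600
```

Sampling per stratum rather than from the pooled data keeps every question type and fact count equally represented, which is what makes per-stratum comparisons meaningful.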

Run Pipeline

Option A: 📥 use pre-generated outputs

python ./scripts/download_results.py

Or specify the dataset repo explicitly:

python ./scripts/download_results.py --repo_id yongxin2020/TempPerturb-Eval-data --output_dir ./results

Option B: 🤖 generate outputs yourself

python ./scripts/_2_rag_systems.py \
    --model gpt-3.5-turbo \
    --q_type comparison \
    --num_facts 2 \
    --temperature 0.0 \
    --save_fp ./results/
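
To cover a full grid rather than a single setting, the command above can be generated for every combination of arguments. The sketch below only builds and prints the command lines; the flags are the ones shown in this README, while the grid values (in particular the num_facts values other than 2) are assumptions.

```python
import itertools
import shlex

models = ["gpt-3.5-turbo"]
q_types = ["bridge", "comparison"]
num_facts_values = [2, 3, 4]    # assumed strata; only 2 appears in this README
temperatures = [0.0, 1.0, 2.0]  # illustrative grid


def build_commands(models, q_types, num_facts_values, temperatures, save_fp="./results/"):
    """Return one argument list per (model, q_type, num_facts, temperature) combination."""
    commands = []
    for model, q_type, num_facts, temperature in itertools.product(
        models, q_types, num_facts_values, temperatures
    ):
        commands.append([
            "python", "./scripts/_2_rag_systems.py",
            "--model", model,
            "--q_type", q_type,
            "--num_facts", str(num_facts),
            "--temperature", str(temperature),
            "--save_fp", save_fp,
        ])
    return commands


commands = build_commands(models, q_types, num_facts_values, temperatures)
for command in commands:
    # Printed rather than executed; subprocess.run(command, check=True) would run it.
    print(shlex.join(command))
```

Keeping each command as an argument list avoids shell quoting problems if a model name ever contains special characters.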

📊 Evaluate

python ./scripts/_3_collect_all_results.py

Useful modes:

python ./scripts/_3_collect_all_results.py --test_mode
python ./scripts/_3_collect_all_results.py --models gpt-3.5-turbo --temperatures 0.0 1.0

Main output:

eval_results/comprehensive_scores/comprehensive_scores_full.csv
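
Once the CSV exists, a typical first step is to average a score per temperature and perturbation. The sketch below does this with the standard library on a few made-up rows; the column names (model, temperature, perturbation, rougeL) are assumptions, so inspect the header of comprehensive_scores_full.csv before adapting it.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up rows with assumed column names; the real CSV schema may differ.
SAMPLE_CSV = """model,temperature,perturbation,rougeL
gpt-3.5-turbo,0.0,original,0.50
gpt-3.5-turbo,0.0,original,0.70
gpt-3.5-turbo,1.0,remove_half,0.30
gpt-3.5-turbo,1.0,remove_half,0.50
"""


def mean_score_by_condition(csv_file, score_column):
    """Average `score_column` for each (temperature, perturbation) pair."""
    scores = defaultdict(list)
    for row in csv.DictReader(csv_file):
        key = (float(row["temperature"]), row["perturbation"])
        scores[key].append(float(row[score_column]))
    return {key: mean(values) for key, values in scores.items()}


summary = mean_score_by_condition(io.StringIO(SAMPLE_CSV), "rougeL")
for (temperature, perturbation), value in sorted(summary.items()):
    print(f"T={temperature} {perturbation}: {value:.2f}")
```

For the real file, pass an open file handle (opened with newline="") in place of the io.StringIO object.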

📈 Visualize

Use the minimal script-based pipeline:

python ./scripts/visualize.py

🧪 Metrics

The evaluation includes:

- BERTScore
- ROUGE-1/2/L

Auxiliary metrics (EM, F1, TTR) are still supported in the collector via --include_all_metrics for extended analysis.
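
For readers unfamiliar with the auxiliary metrics, here is a minimal sketch of how they are commonly computed. It reads EM as exact match, F1 as token-level overlap F1, and TTR as type-token ratio, and it uses SQuAD-style answer normalization; all of these are assumptions about the collector, whose exact definitions may differ.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def type_token_ratio(text):
    """Share of distinct tokens among all whitespace-separated tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


print(exact_match("The Eiffel Tower.", "eiffel tower"))  # 1.0
print(round(token_f1("Paris, France", "Paris"), 3))      # 0.667
print(type_token_ratio("the cat and the dog"))           # 0.8
```

EM and F1 measure correctness against the reference answer, while TTR describes lexical diversity of the output itself and needs no reference.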

Repository Structure

scripts/
├── _1_preprocess.py
├── _2_rag_systems.py
├── _3_collect_all_results.py
├── visualize.py
├── download_results.py
├── model_utils.py
└── perturbations.py

Data Availability

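Pre-generated results/ are hosted in the dataset repo yongxin2020/TempPerturb-Eval-data, which scripts/download_results.py accepts as --repo_id.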

comprehensive_scores_full.csv is large and may be reconstructed from:

eval_results/comprehensive_scores/intermediate/
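
If the intermediate directory holds several CSV files with a shared header, rebuilding the full table amounts to concatenating them. The sketch below shows that idea on temporary files; the file names and the assumption that the intermediate files are same-header CSVs are hypothetical, since this README does not describe their layout.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv_parts(parts_dir, output_path):
    """Concatenate every *.csv in `parts_dir`, requiring identical headers."""
    part_paths = sorted(Path(parts_dir).glob("*.csv"))
    if not part_paths:
        raise FileNotFoundError(f"no CSV files found in {parts_dir}")
    rows_written = 0
    header = None
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for part_path in part_paths:
            with open(part_path, newline="", encoding="utf-8") as part_file:
                reader = csv.reader(part_file)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # skip empty files
                if header is None:
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"header mismatch in {part_path.name}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Demo on throwaway files with hypothetical names and columns.
with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp) / "intermediate"
    parts.mkdir()
    (parts / "part_a.csv").write_text("model,score\nm1,0.5\n", encoding="utf-8")
    (parts / "part_b.csv").write_text("model,score\nm2,0.7\n", encoding="utf-8")
    merged_path = Path(tmp) / "merged.csv"
    total_rows = merge_csv_parts(parts, merged_path)
    merged_lines = merged_path.read_text(encoding="utf-8").splitlines()
    print(total_rows, merged_lines[0])
```

Write the merged file outside the parts directory so that a rerun does not pick up its own earlier output.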

Reproducibility Notes

- Seeds are fixed where applicable (random.seed(42)).
- One known invalid HotpotQA sample (5a7f3f7c55429934daa2fd45) is skipped during generation.
- For a reproducibility-first workflow, use the downloaded results/, then run evaluation and visualization.
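
The first two points can be illustrated with a short sketch that fixes the seed and filters out the known invalid sample. The ID and seed value are taken from this README; the "_id" field name follows the public HotpotQA JSON release and is an assumption about how this repository stores records, and the second record is a placeholder.

```python
import random

# ID quoted in this README. The "_id" field name is an assumption.
INVALID_IDS = {"5a7f3f7c55429934daa2fd45"}


def valid_records(records, id_field="_id"):
    """Yield records whose ID is not in the skip list."""
    for record in records:
        if record[id_field] not in INVALID_IDS:
            yield record


random.seed(42)  # same seed value the README mentions

# Placeholder records for illustration only.
records = [
    {"_id": "5a7f3f7c55429934daa2fd45", "question": "placeholder"},
    {"_id": "example-id-1", "question": "placeholder"},
]
kept = list(valid_records(records))
print([record["_id"] for record in kept])
```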

Notes

The HotpotQA fullwiki train split is used because supporting facts are unavailable in the test split.

Citation

If you use this project, please cite:

@misc{zhou2025tempperturbevaljointeffectsinternal,
    title={TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness},
    author={Yongxin Zhou and Philippe Mulhem and Didier Schwab},
    year={2025},
    eprint={2512.01183},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2512.01183},
}
