TempPerturb-Eval

TempPerturb-Eval is a framework for analyzing how RAG performance changes under:

  • internal variation: generation temperature
  • external noise: context perturbations (original, replace_half, remove_half, ner)

It is designed for controlled robustness evaluation on HotpotQA.
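To make the perturbation names concrete, here is a minimal sketch of what `remove_half` and `replace_half` could look like when a context is a list of sentences. This is an illustration under assumptions, not the code in `scripts/perturbations.py`: the function signatures, the `distractors` pool, and the random choice of which half to alter are all hypothetical. The `ner` setting is omitted because this README does not describe how it works.

```python
import random


def remove_half(sentences, rng):
    """Drop a random half of the context sentences, keeping the original order.

    Assumption: "half" is rounded down, so odd-length contexts keep the extra sentence.
    """
    n_keep = len(sentences) - len(sentences) // 2
    kept_idx = sorted(rng.sample(range(len(sentences)), n_keep))
    return [sentences[i] for i in kept_idx]


def replace_half(sentences, distractors, rng):
    """Overwrite a random half of the context sentences with distractor sentences.

    Assumption: `distractors` is a hypothetical pool of unrelated sentences
    holding at least len(sentences) // 2 entries.
    """
    n_replace = len(sentences) // 2
    targets = rng.sample(range(len(sentences)), n_replace)
    fillers = rng.sample(distractors, n_replace)
    perturbed = list(sentences)
    for idx, filler in zip(targets, fillers):
        perturbed[idx] = filler
    return perturbed
```

Passing an explicit `random.Random(42)` instance as `rng` keeps the perturbation reproducible without touching the global random state.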

📄 Paper: TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness

🧾 Abstract

RAG evaluation often studies generation temperature and retrieval noise separately. TempPerturb-Eval analyzes their interaction by applying controlled context perturbations (replace_half, remove_half, ner) across temperatures and models on HotpotQA. The framework supports both correctness and stability analysis, revealing that higher temperatures can amplify perturbation sensitivity in model-dependent ways.

Key Contributions

  1. A diagnostic benchmark for RAG robustness under joint internal/external variation.
  2. An analysis workflow for perturbation-temperature interaction effects.
  3. Practical guidance for model/temperature selection under noisy retrieval.

✨ At a Glance

  • Task: RAG robustness evaluation under temperature + context perturbation
  • Dataset: HotpotQA (fullwiki, stratified 600-sample subset)
  • Main outputs:
    • eval_results/comprehensive_scores/comprehensive_scores_full.csv
    • eval_results/figs/temperature_trends/*.png
    • eval_results/figs/variability/*.png

Pipeline Flow

Preprocess → Generate/Download Results → Evaluate → Visualize


🚀 Quick Start

🧰 Environment

Python 3.11.9 is recommended.

conda create -n TempPerturb-Eval python=3.11.9
conda activate TempPerturb-Eval
pip install -r requirements.txt

Create a local config from the template:

cp config.example.json config.json

Then edit config.json:

  • set python_path
  • add your API keys in api_keys

Important

config.json is intended for local use and is ignored by git. Do not commit real API keys.

Main configuration fields:

  • models
  • temperatures (typically 0.0 to 2.0)
  • q_types (bridge, comparison)
  • perturbation settings (original, replace_half, remove_half, ner)
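A quick sanity check on `config.json` before launching long runs can catch typos early. The sketch below assumes the fields listed above are top-level JSON keys and that `temperatures` and `q_types` are lists; the actual layout is defined by `config.example.json` and may differ. The helper names `check_config` and `load_config` are hypothetical.

```python
import json

# Question types named in this README.
ALLOWED_Q_TYPES = {"bridge", "comparison"}

# Assumption: these fields sit at the top level of config.json.
EXPECTED_KEYS = ("python_path", "api_keys", "models", "temperatures", "q_types")


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    for key in EXPECTED_KEYS:
        if key not in cfg:
            problems.append(f"missing field: {key}")
    for temp in cfg.get("temperatures", []):
        if not 0.0 <= temp <= 2.0:
            problems.append(f"temperature out of typical range: {temp}")
    for q_type in cfg.get("q_types", []):
        if q_type not in ALLOWED_Q_TYPES:
            problems.append(f"unknown q_type: {q_type}")
    return problems


def load_config(path="config.json"):
    """Read the local config file and return it as a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `print(check_config(load_config()))` lists any issues found in the local config.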

📚 Data

This project uses a stratified 600-sample subset of the HotpotQA fullwiki train split.

To build local stratified data:

python ./scripts/_1_preprocess.py

This creates files under data/stratified_train/.

Notes:

  • The stratified setup targets 600 samples total (2 question types × 3 fact-count strata × 100).
  • Preprocessing also prepares processed_complete_answer references used by evaluation.
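The stratification described above can be sketched as follows. This is an illustration, not the logic of `_1_preprocess.py`: it assumes each record is a dict with a `type` key (the question type) and a hypothetical precomputed `num_facts` key, and it uses a local seeded generator rather than the global `random.seed(42)`.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_stratum=100, seed=42):
    """Draw up to `per_stratum` records from each (question type, fact count) stratum.

    Assumption: every record carries "type" and "num_facts" keys.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["type"], rec["num_facts"])].append(rec)

    sample = []
    # Visit strata in sorted order so the result is deterministic for a given seed.
    for key in sorted(strata):
        pool = strata[key]
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample
```

With 2 question types, 3 fact-count strata, and `per_stratum=100`, this yields the 600-sample total, provided each stratum holds at least 100 records.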

Run Pipeline

Option A: 📥 use pre-generated outputs

python ./scripts/download_results.py

Or specify the dataset repo explicitly:

python ./scripts/download_results.py --repo_id yongxin2020/TempPerturb-Eval-data --output_dir ./results

Option B: 🤖 generate outputs yourself

python ./scripts/_2_rag_systems.py \
    --model gpt-3.5-turbo \
    --q_type comparison \
    --num_facts 2 \
    --temperature 0.0 \
    --save_fp ./results/
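To cover a grid of settings rather than a single run, the command above can be generated programmatically. The sketch below only builds argument lists from the flags shown in this README (`--model`, `--q_type`, `--num_facts`, `--temperature`, `--save_fp`); the helper name `build_commands` is hypothetical, and whether the script accepts every combination is an assumption.

```python
import itertools
import sys


def build_commands(models, q_types, num_facts_values, temperatures, save_fp="./results/"):
    """Return one argument list per (model, q_type, num_facts, temperature) combination."""
    commands = []
    for model, q_type, num_facts, temp in itertools.product(
        models, q_types, num_facts_values, temperatures
    ):
        commands.append([
            sys.executable, "./scripts/_2_rag_systems.py",
            "--model", model,
            "--q_type", q_type,
            "--num_facts", str(num_facts),
            "--temperature", str(temp),
            "--save_fp", save_fp,
        ])
    return commands
```

Each entry can then be executed with `subprocess.run(cmd, check=True)`. Note that every run calls the model API, so a large grid can be costly.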

📊 Evaluate

python ./scripts/_3_collect_all_results.py

Useful modes:

python ./scripts/_3_collect_all_results.py --test_mode
python ./scripts/_3_collect_all_results.py --models gpt-3.5-turbo --temperatures 0.0 1.0

Main output:

  • eval_results/comprehensive_scores/comprehensive_scores_full.csv
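Once the CSV exists, a per-condition summary is a natural first look at correctness (mean score) and stability (spread). The sketch below is an assumption-heavy illustration: the column names `model`, `temperature`, `perturbation`, and `bertscore_f1` are hypothetical and should be replaced with the actual header of `comprehensive_scores_full.csv`.

```python
import csv
import statistics
from collections import defaultdict


def summarize(rows, score_col="bertscore_f1",
              group_cols=("model", "temperature", "perturbation")):
    """Group rows by `group_cols` and return {group: (mean, population std)} for `score_col`.

    Assumption: all column names here are placeholders, not the real CSV schema.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_cols)
        groups[key].append(float(row[score_col]))
    return {
        key: (statistics.mean(values), statistics.pstdev(values))
        for key, values in groups.items()
    }


def summarize_csv(path, **kwargs):
    """Stream the CSV from disk and summarize it without loading it all as one table."""
    with open(path, newline="", encoding="utf-8") as fh:
        return summarize(csv.DictReader(fh), **kwargs)
```

A higher standard deviation within a (model, temperature, perturbation) group indicates less stable outputs for that condition.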

📈 Visualize

Use the minimal script-based pipeline:

python ./scripts/visualize.py

🧪 Metrics

The evaluation includes:

  • BERTScore
  • ROUGE-1/2/L

Auxiliary metrics (EM, F1, TTR) are still supported in the collector via --include_all_metrics for extended analysis.
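For readers unfamiliar with the auxiliary metrics, the sketch below shows common textbook definitions of exact match (EM), token-level F1, and type-token ratio (TTR). It is not taken from the collector script; the normalization steps (lowercasing, stripping punctuation and English articles) follow a widely used QA convention and are an assumption about what this repository does.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, ref):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(ref))


def token_f1(pred, ref):
    """Harmonic mean of token precision and recall over normalized tokens."""
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def type_token_ratio(text):
    """Unique tokens divided by total tokens (whitespace tokenization); 0.0 for empty text."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```

EM and F1 measure agreement with the reference answer, while TTR measures the lexical diversity of a single output and needs no reference.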

Repository Structure

scripts/
├── _1_preprocess.py
├── _2_rag_systems.py
├── _3_collect_all_results.py
├── visualize.py
├── download_results.py
├── model_utils.py
└── perturbations.py

Data Availability

Pre-generated results/ are hosted at:

  • Dataset on Hugging Face

comprehensive_scores_full.csv is large and may be reconstructed from:

  • eval_results/comprehensive_scores/intermediate/

Reproducibility Notes

  • Seeds are fixed where applicable (random.seed(42)).
  • One known invalid HotpotQA sample (5a7f3f7c55429934daa2fd45) is skipped during generation.
  • For a reproducibility-first workflow: use downloaded results/, then run evaluation + visualization.

Notes

  • HotpotQA fullwiki train split is used because supporting facts are unavailable in the test split.

Citation

If you use this project, please cite:

@misc{zhou2025tempperturbevaljointeffectsinternal,
    title={TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness},
    author={Yongxin Zhou and Philippe Mulhem and Didier Schwab},
    year={2025},
    eprint={2512.01183},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2512.01183},
}

About

[LREC 2026] TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness
