TempPerturb-Eval is a framework for analyzing how RAG performance changes under:
- internal variation: generation temperature
- external noise: context perturbations (original, replace_half, remove_half, ner)
It is designed for controlled robustness evaluation on HotpotQA.
RAG evaluation often studies generation temperature and retrieval noise separately. TempPerturb-Eval analyzes their interaction by applying controlled context perturbations (replace_half, remove_half, ner) across temperatures and models on HotpotQA. The framework supports both correctness and stability analysis, revealing that higher temperatures can amplify perturbation sensitivity in model-dependent ways.
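The perturbation names suggest their behavior, but the source does not define them. The following is only a minimal sketch of one plausible reading, assuming the context is a list of sentences: `remove_half` drops half of them and `replace_half` swaps half for distractor sentences. The function signatures, the sentence-level granularity, and the `distractors` argument are all assumptions and may differ from `scripts/perturbations.py`. The `ner` perturbation is not sketched because the source gives no detail about it.

```python
import random


def remove_half(sentences, rng):
    """Drop a random half of the context sentences, keeping the original order."""
    n_keep = len(sentences) - len(sentences) // 2
    kept_positions = sorted(rng.sample(range(len(sentences)), n_keep))
    return [sentences[i] for i in kept_positions]


def replace_half(sentences, distractors, rng):
    """Swap a random half of the context sentences for distractor sentences."""
    n_replace = len(sentences) // 2
    if n_replace > len(distractors):
        raise ValueError("not enough distractor sentences")
    positions = rng.sample(range(len(sentences)), n_replace)
    replacements = rng.sample(distractors, n_replace)
    perturbed = list(sentences)  # copy so the original context is untouched
    for pos, new_sentence in zip(positions, replacements):
        perturbed[pos] = new_sentence
    return perturbed


# Hypothetical toy context; a seeded generator keeps the perturbation repeatable.
rng = random.Random(42)
context = ["fact one", "fact two", "fact three", "fact four"]
noise = ["distractor one", "distractor two", "distractor three"]
print(remove_half(context, rng))
print(replace_half(context, noise, rng))
```

Passing an explicit `random.Random` instance keeps each perturbation reproducible without touching the global random state.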
- A diagnostic benchmark for RAG robustness under joint internal/external variation.
- An analysis workflow for perturbation-temperature interaction effects.
- Practical guidance for model/temperature selection under noisy retrieval.
- Task: RAG robustness evaluation under temperature + context perturbation
- Dataset: HotpotQA (fullwiki, stratified 600-sample subset)
- Main outputs:
  - eval_results/comprehensive_scores/comprehensive_scores_full.csv
  - eval_results/figs/temperature_trends/*.png
  - eval_results/figs/variability/*.png
Preprocess → Generate/Download Results → Evaluate → Visualize
Python 3.11.9 is recommended.
conda create -n TempPerturb-Eval python=3.11.9
conda activate TempPerturb-Eval
pip install -r requirements.txt
Create a local config from the template:
cp config.example.json config.json
Then edit config.json:
- set python_path
- add your API keys in api_keys
Important: config.json is intended for local use and is ignored by git. Do not commit real API keys.
Main configuration fields:
- models
- temperatures (typically 0.0 to 2.0)
- q_types (bridge, comparison)
- perturbation settings (original, replace_half, remove_half, ner)
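A quick sanity check of the config before a long run can catch missing fields early. The sketch below assumes the fields named above are top-level JSON keys; the exact layout of `config.example.json` is not shown in this document, so the example values, the shape of `api_keys`, and the helper `validate_config` are hypothetical.

```python
import json

# Field names taken from the list above; treating them as top-level keys is an assumption.
REQUIRED_KEYS = ("python_path", "api_keys", "models", "temperatures", "q_types")
KNOWN_Q_TYPES = ("bridge", "comparison")


def validate_config(cfg):
    """Return a list of problems; an empty list means the config looks usable."""
    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if key not in cfg]
    for temp in cfg.get("temperatures", []):
        if not 0.0 <= float(temp) <= 2.0:
            problems.append(f"temperature outside the typical 0.0-2.0 range: {temp}")
    for q_type in cfg.get("q_types", []):
        if q_type not in KNOWN_Q_TYPES:
            problems.append(f"unknown q_type: {q_type}")
    return problems


# Hypothetical config content; replace the placeholder key locally and never commit it.
example = json.loads("""
{
  "python_path": "/path/to/python",
  "api_keys": {"provider_name": "YOUR_KEY_HERE"},
  "models": ["gpt-3.5-turbo"],
  "temperatures": [0.0, 1.0, 2.0],
  "q_types": ["bridge", "comparison"]
}
""")
print(validate_config(example))
```

In practice the same function could be applied to `json.load(open("config.json"))` after editing the local file.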
This project uses the HotpotQA fullwiki train split, from which it draws a stratified subset of 600 samples.
To build local stratified data:
python ./scripts/_1_preprocess.py
This creates files under data/stratified_train/.
Notes:
- The stratified setup targets 600 samples total (2 question types × 3 fact-count strata × 100).
- Preprocessing also prepares processed_complete_answer references used by evaluation.
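The 2 × 3 × 100 layout can be illustrated with a small stratified sampler. This is a sketch under assumptions rather than the logic of `_1_preprocess.py`: the record keys `q_type` and `num_facts`, the strata values 2, 3, and 4, and the function `stratified_sample` are hypothetical, and the toy records stand in for real HotpotQA examples.

```python
import random
from collections import defaultdict


def stratified_sample(examples, per_stratum=100, seed=42):
    """Draw per_stratum examples from every (q_type, num_facts) bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for example in examples:
        buckets[(example["q_type"], example["num_facts"])].append(example)
    sample = []
    for key in sorted(buckets):  # fixed iteration order keeps the draw reproducible
        pool = buckets[key]
        if len(pool) < per_stratum:
            raise ValueError(f"stratum {key} has only {len(pool)} examples")
        sample.extend(rng.sample(pool, per_stratum))
    return sample


# Toy pool: 2 question types x 3 assumed fact-count strata x 150 candidates each.
toy = [
    {"_id": f"{q}-{n}-{i}", "q_type": q, "num_facts": n}
    for q in ("bridge", "comparison")
    for n in (2, 3, 4)
    for i in range(150)
]
subset = stratified_sample(toy)
print(len(subset))  # 2 * 3 * 100 = 600
```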
Download the pre-generated results:
python ./scripts/download_results.py
Or specify the dataset repo explicitly:
python ./scripts/download_results.py --repo_id yongxin2020/TempPerturb-Eval-data --output_dir ./results
Alternatively, generate results yourself, one configuration per run:
python ./scripts/_2_rag_systems.py \
  --model gpt-3.5-turbo \
  --q_type comparison \
  --num_facts 2 \
  --temperature 0.0 \
  --save_fp ./results/
Then run evaluation over all results:
python ./scripts/_3_collect_all_results.py
Useful modes:
python ./scripts/_3_collect_all_results.py --test_mode
python ./scripts/_3_collect_all_results.py --models gpt-3.5-turbo --temperatures 0.0 1.0
Main output:
eval_results/comprehensive_scores/comprehensive_scores_full.csv
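Because the generation script (`scripts/_2_rag_systems.py`) takes one model, question type, fact count, and temperature per call, a full sweep means many invocations. The sketch below only builds the command strings for such a grid; it uses the flags shown in this document, while the `NUM_FACTS` and `TEMPERATURES` values and the helper `build_commands` are illustrative assumptions.

```python
import itertools
import shlex

MODELS = ["gpt-3.5-turbo"]
Q_TYPES = ["bridge", "comparison"]
NUM_FACTS = [2, 3, 4]           # assumption: the three fact-count strata
TEMPERATURES = [0.0, 1.0, 2.0]  # illustrative subset of the 0.0-2.0 range


def build_commands(models, q_types, num_facts, temperatures, save_fp="./results/"):
    """Return one shell-safe command string per point of the sweep grid."""
    commands = []
    for model, q_type, n, temp in itertools.product(models, q_types, num_facts, temperatures):
        argv = [
            "python", "./scripts/_2_rag_systems.py",
            "--model", model,
            "--q_type", q_type,
            "--num_facts", str(n),
            "--temperature", str(temp),
            "--save_fp", save_fp,
        ]
        commands.append(shlex.join(argv))
    return commands


commands = build_commands(MODELS, Q_TYPES, NUM_FACTS, TEMPERATURES)
for command in commands[:2]:
    print(command)
```

Printing the commands first makes it easy to review the grid before launching anything that consumes API quota.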
Use the minimal script-based pipeline:
python ./scripts/visualize.py
The evaluation includes:
- BERTScore
- ROUGE-1/2/L
Auxiliary metrics (EM, F1, TTR) are still supported in the collector via --include_all_metrics for extended analysis.
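For readers unfamiliar with the auxiliary metrics, here is a minimal sketch of how EM, token-level F1, and TTR (type-token ratio) are commonly computed. It follows the usual SQuAD-style answer normalization and is not taken from the collector script, whose exact normalization and tokenization may differ.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, reference):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def type_token_ratio(text):
    """Share of distinct tokens among all whitespace tokens (lexical diversity)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("eiffel tower paris", "eiffel tower"))
print(type_token_ratio("yes yes no no"))
```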
scripts/
├── _1_preprocess.py
├── _2_rag_systems.py
├── _3_collect_all_results.py
├── visualize.py
├── download_results.py
├── model_utils.py
└── perturbations.py
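Pre-generated results/ are hosted in the dataset repo yongxin2020/TempPerturb-Eval-data, the one accepted by scripts/download_results.py via --repo_id.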
comprehensive_scores_full.csv is large and may be reconstructed from:
eval_results/comprehensive_scores/intermediate/
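The document does not say how the full CSV is rebuilt from the intermediate directory. One possible approach, assuming that directory holds CSV shards sharing a single header, is plain concatenation. The helper `merge_csv_shards`, the shard file names, and the column names in the demo are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv_shards(shard_dir, out_path):
    """Concatenate all *.csv files in shard_dir into out_path; return the data row count."""
    shards = sorted(Path(shard_dir).glob("*.csv"))
    if not shards:
        raise FileNotFoundError(f"no CSV shards found in {shard_dir}")
    header = None
    n_rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for shard in shards:
            with open(shard, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                shard_header = next(reader, None)
                if shard_header is None:
                    continue  # skip empty files
                if header is None:
                    header = shard_header
                    writer.writerow(header)  # write the header only once
                elif shard_header != header:
                    raise ValueError(f"header mismatch in {shard.name}")
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows


# Self-contained demo on temporary files with made-up columns.
with tempfile.TemporaryDirectory() as tmp:
    shard_dir = Path(tmp) / "intermediate"
    shard_dir.mkdir()
    (shard_dir / "part_a.csv").write_text("model,temperature,score\nm1,0.0,0.9\n", encoding="utf-8")
    (shard_dir / "part_b.csv").write_text("model,temperature,score\nm1,1.0,0.8\n", encoding="utf-8")
    merged_rows = merge_csv_shards(shard_dir, Path(tmp) / "merged.csv")
    merged_text = (Path(tmp) / "merged.csv").read_text(encoding="utf-8")
print(merged_rows)
```

Writing the merged file outside the shard directory avoids picking it up as a shard on a later run.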
- Seeds are fixed where applicable (random.seed(42)).
- One known invalid HotpotQA sample (5a7f3f7c55429934daa2fd45) is skipped during generation.
- For a reproducibility-first workflow: use downloaded results/, then run evaluation + visualization.
- The HotpotQA fullwiki train split is used because supporting facts are unavailable in the test split.
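The fixed seed and the skipped sample noted above could be combined in a generation loop roughly as follows. This is a sketch only: the `_id` key matches the raw HotpotQA JSON format but is an assumption about how the scripts store samples, and `iter_valid_samples` is a hypothetical helper.

```python
import random

# Sample ID reported as invalid in the notes above.
INVALID_IDS = {"5a7f3f7c55429934daa2fd45"}


def iter_valid_samples(samples, id_key="_id"):
    """Yield every sample whose ID is not on the known-invalid list."""
    for sample in samples:
        if sample[id_key] in INVALID_IDS:
            continue
        yield sample


random.seed(42)  # fix the global seed before any random choice is made

# Hypothetical records standing in for HotpotQA samples.
records = [
    {"_id": "5a7f3f7c55429934daa2fd45", "question": "placeholder"},
    {"_id": "example-id-1", "question": "placeholder"},
]
valid = list(iter_valid_samples(records))
print([r["_id"] for r in valid])
```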
If you use this project, please cite:
@misc{zhou2025tempperturbevaljointeffectsinternal,
title={TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness},
author={Yongxin Zhou and Philippe Mulhem and Didier Schwab},
year={2025},
eprint={2512.01183},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.01183},
}