End-to-end retrieval-augmented generation pipeline, built and evaluated on the rag-mini-wikipedia corpus. The project compares a naive top-1 pipeline against an enhanced pipeline with recall-oriented query rewriting and cross-encoder reranking, scoring both with deterministic SQuAD metrics over the full test split and LLM-judged RAGAS metrics on a 100-query evaluation slice.
Headline finding. Reranking lifts RAGAS context precision from 69.0% to 86.6% and faithfulness from 67.6% to 78.5%, while exact match drops from 41.5 to 33.7. Grounding quality and literal string accuracy are different axes, and a system that looks worse on string overlap can be substantially better grounded. Measuring both is the point of this project.
RAGAS metrics, 100 samples, judged with GPT-4o-mini:
| System | Context Precision | Context Recall | Faithfulness | Answer Relevancy |
|---|---|---|---|---|
| Naive, top-1 | 69.0% | 56.0% | 67.6% | 64.1% |
| Enhanced, rewrite plus rerank | 86.6% | 62.0% | 78.5% | 67.2% |
SQuAD metrics over the full 918-query test split:
| System | Exact Match | F1 |
|---|---|---|
| Naive, top-1 | 41.5 | 49.3 |
| Enhanced, rewrite plus rerank | 33.7 | 42.4 |
Wilson 95% confidence intervals and a two-proportion z-test are reported in docs/technical_report.md. Over the 918-query split the enhanced pipeline's EM is significantly lower than the naive baseline (95% CIs 38.4–44.7 vs 30.7–36.8; z=3.47, p<0.001), so the enhancement measurably trades literal accuracy for grounding, while the grounding gains are large and consistent.
Deterministic metrics such as exact match reward surface-form overlap with a reference answer. LLM-judged metrics such as faithfulness and context precision reward grounded, semantically correct answers. The enhanced pipeline retrieves and filters better evidence, then produces more verbose answers that match references less literally. Relying on a single metric family would have led to the wrong conclusion in either direction. The full analysis, including knowledge-leakage caveats where the generator answers from pretraining rather than retrieval, is in docs/technical_report.md.
Additional findings from the parameter sweeps:
- Retrieval depth hurts a small generator. With FLAN-T5-base, EM falls from 41.5 at top-1 to 21.0 at top-5 as weakly relevant passages dilute the prompt. Reranking down to top-3 from a top-10 candidate pool recovers precision.
- Bigger embeddings buy little here. MPNet at 768 dimensions edges MiniLM-L6 at 384 dimensions by under one F1 point at triple the encoding cost.
- A verification-style prompt beats the baseline prompt by roughly 2 EM points at top-1.
- Corpus:
rag-datasets/rag-mini-wikipediafrom Hugging Face, roughly 3,200 passages, cleaned and profiled before indexing. - Vector store: Milvus Lite with an IVF_FLAT index over
all-MiniLM-L6-v2embeddings. - Generator: FLAN-T5-base with swappable prompt templates.
- Enhancements: recall-oriented query rewriting plus
cross-encoder/ms-marco-MiniLM-L-6-v2reranking. - Evaluation: Hugging Face Evaluate for SQuAD EM and F1, RAGAS with GPT-4o-mini as judge, Wilson confidence intervals, and a two-proportion z-test.
All experiment parameters live in config/default.yaml and config/enhanced.yaml, so runs are repeatable without editing code or notebooks.
| Path | Purpose |
|---|---|
src/ |
Modular pipeline: naive_rag.py, enhanced_rag.py, evaluation.py, pipeline.py, utils.py, config.py |
notebooks/complete_analysis.ipynb |
Full end-to-end run with all experiments and visible outputs |
notebooks/data_exploration.ipynb |
Corpus profiling and evaluation-slice selection |
notebooks/system_evaluation.ipynb |
Smoke test of the modular src/ pipeline |
notebooks/final_analysis.ipynb |
Result visualization with confidence intervals |
config/ |
YAML experiment configuration |
results/ |
Exported metrics for every experiment, JSON and CSV |
docs/technical_report.md |
Full write-up with statistics and error analysis |
AI_USAGE_LOG.md |
Complete log of AI assistance used while building the project |
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtSet the OPENAI_API_KEY environment variable before running RAGAS cells. The first run downloads the corpus and models from Hugging Face, so it needs network access. Subsequent runs use the local cache.
Run the full scripted evaluation, nine generation configurations plus two RAGAS scoring passes:
python -m src.pipelineOutputs land in results/ with a modular_ prefix. The two RAGAS passes additionally require OPENAI_API_KEY; the committed RAGAS metrics in results/ragas_comparison.csv and results/ragas_comparison_delta.csv come from a prior keyed run, so the pipeline's modular_ragas_*.csv files are produced only on a keyed re-run and are not committed in this snapshot. The notebook route through complete_analysis.ipynb produces the unprefixed results and includes every chart and intermediate inspection step.
- FLAN-T5-base is a small generator. Results measure pipeline design, not frontier-model capability.
- The 100-query slice keeps RAGAS judging affordable, so confidence intervals are wide.
- RAGAS scores depend on the judge model. GPT-4o-mini was used throughout, and scores from a different judge would shift.
- The corpus overlaps the generator's pretraining data, so some top-1 answers succeed without retrieval. The report discusses this leakage and how to design around it.
This project began as graduate coursework at Carnegie Mellon and was built with documented AI assistance. AI_USAGE_LOG.md records every session, prompt purpose, and verification step. The headline EM/F1 and RAGAS metrics come from the committed notebook artifacts (results/naive_results.json, results/enhanced_results.json, results/ragas_comparison.csv), which score SQuAD metrics over the full 918-query test split. The scripted pipeline route (results/modular_*.json) scores EM/F1 on the 100-query RAGAS subset and so reports different point values (EM 46.0 to 37.0, F1 49.4 to 41.6) while showing the same effect: grounding improves as literal exact-match drops.
MIT