π§ π³Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering
This section describes how to run the full pipeline: build a retrieval corpus from DPR Wikipedia passages, serve the LLM, run RT-RAG on the 1k HotPotQA dev set, and evaluate.
cd RT-RAG
pip install -r requirements.txt
python -m spacy download en_core_web_smUse scratch (or another large filesystem) to avoid home-directory quota limits.
mkdir -p /scratch/$USER/wiki_corpus
cd /scratch/$USER/wiki_corpus
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
gunzip psgs_w100.tsv.gzConvert to corpus JSON (keeps first ~1.5M passages β 1 GB):
cd /path/to/RT-RAG
ln -s /scratch/$USER/wiki_corpus/psgs_w100.tsv psgs_w100.tsv # if not in repo
python main/convert_psgs_w100.py \
--input psgs_w100.tsv \
--output /scratch/$USER/wiki_corpus/wiki_psgs.json \
--max_passages 1500000Point the repo at the corpus and index on scratch:
ln -sf /scratch/$USER/wiki_corpus/wiki_psgs.json main/raw/wiki_psgs.json
mkdir -p /scratch/$USER/embedding_data
ln -sf /scratch/$USER/embedding_data embedding_data # if not alreadyBuild the BM25 index:
python main/build_bm25_index.py --dataset wiki_psgsUse scratch for the Hugging Face cache to avoid quota:
export HF_HOME=/scratch/$USER/hf_cache
export HUGGINGFACE_HUB_CACHE=/scratch/$USER/hf_cache
mkdir -p /scratch/$USER/hf_cacheStart vLLM (from the repo root or anywhere):
vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8000Optional: higher throughput with --max-num-seqs 32 --gpu-memory-utilization 0.95.
In another terminal, with the vLLM server running:
cd /path/to/RT-RAG
python main/load_data.pyPredictions are appended to:
outputs/wiki_psgs/bm25_chunk200_topk1_45_topk2_10/1.txt
(Config in main/config.py: DATA_PATH, DATASET, METHOD, etc.)
EM & F1 (answer accuracy):
python main/evaulate.py outputs/wiki_psgs/bm25_chunk200_topk1_45_topk2_10/1.txtRetrieval & reasoning metrics (Recall@k, supporting-fact accuracy):
python main/eval_retrieval_and_reasoning.py \
--hotpot_path hotpotqa_dev_1k.json \
--raw_corpus_path main/raw/wiki_psgs.json \
--results_file outputs/wiki_psgs/bm25_chunk200_topk1_45_topk2_10/1.txt \
--k 10hotpotqa_dev_1k.json is the original HotPotQA dev JSON with supporting_facts (for gold titles). If it's not in the repo root, pass the full path.
| Step | Command / output |
|---|---|
| Corpus | convert_psgs_w100.py β main/raw/wiki_psgs.json |
| Index | build_bm25_index.py --dataset wiki_psgs β embedding_data/wiki_psgs/200_2_2/ |
| LLM | vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8000 |
| Inference | load_data.py β outputs/wiki_psgs/.../1.txt |
| EM/F1 | evaulate.py <path-to-1.txt> |
| Retrieval/reasoning | eval_retrieval_and_reasoning.py --results_file <path-to-1.txt> |
| Area | Original behavior | Change made |
|---|---|---|
| Retrieval corpus | Corpus was built from HotPotQA dev context passages (build_corpus_from_hotpotqa.py β main/raw/hotpotqa_dev_1k.json). The same 1k examples were both searched and evaluated, so retrieval was artificially easy. |
Separated corpus from eval: retrieval now uses DPR Wikipedia passages (psgs_w100.tsv β wiki_psgs.json). The 1k HotPotQA questions are evaluation-only; they are not part of the search index. |
| Corpus pipeline | Only script to build a corpus was from HotPotQA contexts. | New main/convert_psgs_w100.py converts DPR's psgs_w100.tsv to the JSON format expected by the indexers. Optional --max_passages (default 1.5M) limits corpus size to ~1 GB for memory and disk. |
| Config | DATASET = "hotpotqa_dev_1k"; index/output paths referred to that corpus. |
DATASET = "wiki_psgs" everywhere (main config.py, build_bm25_index.py, build_dense_index/config.py). DATA_PATH still points to main/data/hotpotqa_dev_1k.jsonl for the 1k eval questions. |
| BM25 indexer | Loaded the full corpus JSON into memory (json.load()), then built the index. Large corpora (e.g. full Wikipedia) caused OOM. |
Streaming build: main/build_bm25_index.py uses ijson to stream the JSON and writes chunks to disk in phase 1, then builds BM25 in phase 2. Enables indexing ~1 GB+ corpora on limited RAM. Added ijson to requirements.txt. |
| Eval script | eval_retrieval_and_reasoning.py default --raw_corpus_path was main/raw/hotpotqa_dev_1k.json. |
Default set to main/raw/wiki_psgs.json so retrieval metrics use the same corpus as the index. |
| Speed / throughput | Defaults: 5 trees/question, height 4, 5 sampling iterations, 4 query-rewrite iterations, 8 concurrent jobs. | Tuned for faster runs: fewer trees (3), lower max height (3), fewer sampling (3) and rewrite (2) iterations, higher concurrency (16). All in main/config.py; revert if you need original quality/settings. |
| Documentation | README focused on LongBench, dense index, 14B model. | Quick Start added for HotPotQA 1k + Wikipedia BM25, scratch storage, vLLM (1.5B), and the full eval commands. |
RT-RAG systematically decomposes complex multi-hop questions into explicit binary reasoning trees. It leverages structured entity analysis and consensus-based tree selection to ensure e decomposition, clearly separating core queries, known entities, and unknown targets.
Once the tree is built, a bottom-up traversal strategy is used to iteratively rewrite and refine sub-questions. This process efficiently collects high-quality evidence while mitigating error propagation through recursive reasoning.
pip install -r requirements.txtTo serve Qwen2.5-14B-Instruct locally using vLLM with OpenAI-compatible API:
First, install vLLM:
pip install vllmThen, start the server:
vllm serve Qwen/Qwen2.5-14B-Instruct \
--dtype auto \
--api-key your-api-keyReplace
your-api-keywith a secure token. This key must match what you configure inconfig.py.
π Tip: For more details, see vLLM OpenAI-Compatible Server Docs
You can download models manually or use Hugging Face CLI:
huggingface-cli download BAAI/bge-reranker-basehuggingface-cli download Qwen/Qwen2.5-14B-InstructMake sure to login if authentication is required:
huggingface-cli loginThe preprocessed corpus is already in the raw folder.
Evaluation and retrieval data are from LongBench.
Update your configuration for embedding/index building:
| Parameter | Description |
|---|---|
raw_path |
Path to folder containing preprocessed JSON |
save_path |
Where to store FAISS index & metadata |
dataset_name |
Filename without .json |
chunk_size |
Max words per chunk (e.g., 200) |
min_sentence |
Min sentences per chunk (e.g., 2) |
overlap |
Overlapping sentences between chunks (e.g., 2) |
base_url |
API endpoint (e.g., http://localhost:8000/v1) |
api_key |
Your API key used with the embedding service |
Once main/build_dense_index/config.py is ready, build your FAISS index with:
python build_dense_index/dense_build_index.pyAfter the dense index is successfully built:
-
Configure runtime parameters in:
main/config.pyMake sure the dataset path, retrieval settings, API credentials, and output paths are correct and aligned with the built index.
-
Run the full dataset through the system:
python main/load_data.py
This step runs the entire dataset through the RT-RAG pipeline: it performs retrieval, reranking, tree generation, and LLM querying.
Once inference on the full dataset is complete, you can evaluate the generated answers using:
python main/evaulate.py /path/to/result.txtReplace
/path/to/result.txtwith the actual path to the output file generated bymain/load_data.py.
This script will compute metrics on the dataset.
The table below summarizes RT-RAG's performance across three benchmark datasets using two different backbone models:
| Model | Dataset | F1 | EM |
|---|---|---|---|
| GPT-4o-mini | MuSiQue | 54.42 | 41.50 |
| 2WikiMQA | 75.08 | 63.00 | |
| HotpotQA | 65.26 | 52.50 | |
| Average | 64.92 | 52.33 | |
| Qwen2.5-14B | MuSiQue | 50.04 | 39.00 |
| 2WikiMQA | 73.69 | 64.00 | |
| HotpotQA | 66.24 | 51.00 | |
| Average | 63.32 | 51.33 |
RT-RAG consistently outperforms all baselines across diverse multi-hop QA datasets.