Answering natural language queries over relational data often requires retrieving and reasoning over multiple tables, yet most retrievers optimize only for query-table relevance and ignore table-table compatibility. We introduce REaR (Retrieve, Expand and Refine), a three-stage, LLM-free framework that separates semantic relevance from structural joinability for efficient, high-fidelity multi-table retrieval. REaR (i) retrieves query-aligned tables, (ii) expands them with structurally joinable tables via fast, precomputed column-embedding comparisons, and (iii) refines the set by pruning noisy or weakly related candidates. Empirically, REaR is retriever-agnostic and consistently improves dense and sparse retrievers on complex table QA datasets (BIRD, MMQA, and Spider), raising both multi-table retrieval quality and downstream SQL execution accuracy. Despite being LLM-free, it delivers performance competitive with state-of-the-art LLM-augmented retrieval systems (e.g., ARM) at much lower latency and cost. Ablations confirm complementary gains from expansion and refinement, underscoring REaR as a practical, scalable building block for table-based downstream tasks (e.g., Text-to-SQL).
Paper: REaR: Retrieve, Expand, and Refine for Effective Multitable Retrieval
- Clone the repository:

```bash
git clone https://github.com/<your-username>/REaR.git
cd REaR
```

- Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate
```

- Install dependencies:

```bash
bash setup.sh
```

- Create a `.env` file with your API keys:

```bash
# LLM API Keys (at least one required for SQL generation)
GOOGLE_API_KEY=your-google-api-key-here
OPENAI_API_KEY=your-openai-api-key-here
DEEPINFRA_API_KEY=your-deepinfra-api-key-here
```
- Prepare your data: place your table repository JSON and questions file in an accessible path (see Data Format below).
- Build the vector indices (one-time preprocessing):

```bash
# Column-level FAISS index (for the Expand stage)
python preprocessing/create_col_store.py \
  --table-repository combined_database_with_desc.json \
  --embedding-model BAAI/bge-base-en-v1.5 \
  --index-out vs_col.bin \
  --metadata-out doc_metadata.json

# Table-level vector store (for the Retrieve stage)
python preprocessing/create_vector_score.py \
  --table-repository combined_database_with_desc.json \
  --output-dir storage_bge \
  --embedding-model BAAI/bge-base-en-v1.5
```

Run all four stages (Retrieve → Expand → Refine → Generate) in a single command:
```bash
python full_inference.py \
  --env-file .env \
  --questions-file merged_questions.json \
  --vector-store-path storage_bge \
  --faiss-index vs_col.bin \
  --faiss-metadata doc_metadata.json \
  --table-repository combined_database_with_desc.json \
  --output-file e2e_results.json \
  --intermediate-dir runs/
```

Alternatively, run each stage individually:

```bash
# Stage 1: Retrieve — top-k tables per question via dense retrieval
python retrieve.py \
  --questions-file merged_questions.json \
  --vector-store-path storage_bge \
  --output-file base.json

# Stage 2: Expand — discover joinable tables via column-level similarity
python expand.py \
  --retrieval-file base.json \
  --faiss-index vs_col.bin \
  --faiss-metadata doc_metadata.json \
  --output-file expansion.json

# Stage 3: Refine — prune with cross-encoder scoring
python refine.py \
  --expansion-file expansion.json \
  --table-repository combined_database_with_desc.json \
  --output-file pruned.json

# Stage 4: Generate — produce SQL from the pruned table set
python generate.py \
  --pruned-file pruned.json \
  --table-repository combined_database_with_desc.json \
  --provider gemini \
  --output-file sql.json \
  --env-file .env
```

Evaluate retrieval quality against the ground truth:

```bash
python evaluate_retrieval.py \
  --predictions-file base.json \
  --ground-truth-file merged_questions.json \
  --output-file eval.json
```

Table repository format:

```json
{
  "tables": {
    "table_name": [
      {
        "id": "tbl_<hash>",
        "columns": ["col1", "col2"],
        "data": [["val1", "val2"]],
        "primary_keys": ["col1"],
        "foreign_keys": [],
        "table_description": "Optional description"
      }
    ]
  }
}
```

Questions file format:

```json
{
  "questions": [
    {
      "question_id": "q1",
      "question": "Natural language query text",
      "tables": [{"id": "tbl_xxx"}]
    }
  ]
}
```

Each stage produces a JSON file that feeds into the next:
| Stage | Output | Description |
|---|---|---|
| Retrieve | `base.json` | Top-k candidate tables per question |
| Expand | `expansion.json` | Candidate set augmented with joinable tables |
| Refine | `pruned.json` | Final table set after cross-encoder pruning |
| Generate | `sql.json` | Generated SQL queries with metadata |
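Before running the pipeline, it can be useful to sanity-check that your inputs match the Data Format above. The helper below is a minimal sketch (the file layouts mirror the JSON examples shown earlier; the function names are illustrative, not part of the repository) that verifies every ground-truth table id referenced by a question exists in the table repository:

```python
import json

def load_table_ids(repo_path):
    """Collect every table id from the repository JSON ({"tables": {name: [entries]}})."""
    with open(repo_path) as f:
        repo = json.load(f)
    return {entry["id"] for entries in repo["tables"].values() for entry in entries}

def check_questions(questions_path, table_ids):
    """Return question_ids that reference a table id missing from the repository."""
    with open(questions_path) as f:
        questions = json.load(f)["questions"]
    return [q["question_id"] for q in questions
            if any(t["id"] not in table_ids for t in q["tables"])]
```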
```
REaR/
├── setup.sh                              # Dependency installation
├── imports.py                            # Centralized imports
├── utils.py                              # Utility functions
├── preprocessing/
│   ├── create_col_store.py               # Build column-level FAISS index
│   ├── create_vector_score.py            # Build table-level vector store
│   ├── generate_questions_list.py        # Extract questions from SQL
│   └── generate_table_descriptions.py    # Generate table descriptions
├── retrieve.py                           # Stage 1: Dense retrieval
├── expand.py                             # Stage 2: Join-aware expansion
├── refine.py                             # Stage 3: Cross-encoder refinement
├── generate.py                           # Stage 4: SQL generation
├── evaluate_retrieval.py                 # Evaluation metrics
└── full_inference.py                     # End-to-end runner
```
| Variable | Description | Required |
|---|---|---|
| `GOOGLE_API_KEY` | Google Generative AI API key | For Gemini provider |
| `OPENAI_API_KEY` | OpenAI API key | For OpenAI provider |
| `DEEPINFRA_API_KEY` | DeepInfra API key | For DeepInfra provider |
| Parameter | Stage | Default | Description |
|---|---|---|---|
| `--top-k` | Retrieve | `5` | Number of candidate tables per question |
| `--embedding-model` | Retrieve / Expand | `BAAI/bge-base-en-v1.5` | HuggingFace embedding model |
| `--top-n` | Refine | `5` | Final number of tables after pruning |
| `--alpha` | Refine | — | Weight for query-to-table scores |
| `--beta` | Refine | — | Weight for table-to-table scores |
| `--provider` | Generate | `gemini` | LLM provider (`gemini`, `openai`, `deepinfra`) |
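`--alpha` and `--beta` trade off query relevance against join compatibility during refinement. The sketch below shows one plausible weighted combination (the exact formula implemented in `refine.py` may differ; the function names and the use of a mean over T2T scores are assumptions for illustration):

```python
def combined_score(qt_score, t2t_scores, alpha=0.5, beta=0.5):
    """Blend a table's query-to-table (QT) score with the mean of its
    table-to-table (T2T) scores against the other candidates."""
    t2t = sum(t2t_scores) / len(t2t_scores) if t2t_scores else 0.0
    return alpha * qt_score + beta * t2t

def prune(candidates, top_n=5, alpha=0.5, beta=0.5):
    """candidates: list of (table_id, qt_score, [t2t_scores]); keep the top-n."""
    ranked = sorted(candidates,
                    key=lambda c: combined_score(c[1], c[2], alpha, beta),
                    reverse=True)
    return [table_id for table_id, _, _ in ranked[:top_n]]
```

Setting `alpha` high favors tables that directly answer the question; setting `beta` high favors tables that join well with the rest of the candidate set.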
| Provider | Model | Environment Variable |
|---|---|---|
| Google Gemini | `gemini-2.0-flash` | `GOOGLE_API_KEY` |
| OpenAI | `gpt-4o-mini` | `OPENAI_API_KEY` |
| DeepInfra | `Llama-3.2-3B-Instruct` | `DEEPINFRA_API_KEY` |
REaR's pipeline consists of four stages:

- Retrieve: A dual-encoder with FAISS ranks tables by dense vector similarity to the input question using `BAAI/bge-base-en-v1.5` embeddings.
- Expand: A column-level FAISS index discovers additional joinable tables that share compatible columns with the retrieved candidates.
- Refine: A cross-encoder (`jina-reranker-v2-base-multilingual`) scores table combinations using both query-to-table (QT) and table-to-table (T2T) relevance, then prunes the set to the top-n tables.
- Generate: An LLM produces SQL queries from the refined table set and the original question.
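The Expand stage's join discovery amounts to a nearest-neighbour search over column embeddings. The sketch below mimics FAISS's inner-product search with plain NumPy on unit-normalized vectors (the embeddings, threshold, and function name are illustrative assumptions, not the repository's code):

```python
import numpy as np

def find_joinable(query_cols, index_cols, col_owners, threshold=0.8):
    """query_cols: (q, d) embeddings of a retrieved table's columns.
    index_cols: (n, d) embeddings of all repository columns.
    col_owners: table id owning each indexed column.
    Returns table ids with at least one column above the cosine threshold."""
    q = query_cols / np.linalg.norm(query_cols, axis=1, keepdims=True)
    x = index_cols / np.linalg.norm(index_cols, axis=1, keepdims=True)
    sims = q @ x.T                      # cosine similarity: rows are unit-norm
    hits = (sims >= threshold).any(axis=0)
    return {col_owners[i] for i in np.flatnonzero(hits)}
```

Because the column index is precomputed once, each expansion at query time is just a matrix multiply over cached vectors, which is what keeps the stage LLM-free and fast.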
| Metric | Description |
|---|---|
| Recall | Coverage of ground-truth tables in the retrieved set |
| Precision | Fraction of retrieved tables that are relevant |
| F1 | Harmonic mean of recall and precision |
| Full Recall / Precision | Perfect match rate (all-or-nothing) |
| Coverage | Schema-field completeness beyond table-level recall |
| Redundancy | Pairwise similarity within the retrieved set |
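The first four metrics are set-level comparisons between predicted and ground-truth table ids; a minimal sketch of computing them per question (Coverage and Redundancy additionally need schema and embedding information, so they are omitted here):

```python
def retrieval_metrics(predicted, gold):
    """predicted, gold: sets of table ids for one question."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    full_recall = float(gold <= predicted)   # all-or-nothing match
    return {"precision": precision, "recall": recall,
            "f1": f1, "full_recall": full_recall}
```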
If you use REaR in your research, please cite our paper:

```bibtex
@misc{agarwal2025rearretrieveexpandrefine,
      title={REaR: Retrieve, Expand and Refine for Effective Multitable Retrieval},
      author={Rishita Agarwal and Himanshu Singhal and Peter Baile Chen
              and Manan Roy Choudhury and Dan Roth and Vivek Gupta},
      year={2025},
      eprint={2511.00805},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2511.00805}
}
```

This project is licensed under the Creative Commons Attribution 4.0 International License — see the LICENSE file for details.
