CoRAL-ASU/REaR

REaR: Retrieve, Expand, and Refine for Effective Multitable Retrieval


Figure: REaR architecture overview.

Abstract

Answering natural language queries over relational data often requires retrieving and reasoning over multiple tables, yet most retrievers optimize only for query–table relevance and ignore table–table compatibility. We introduce REaR (Retrieve, Expand and Refine), a three-stage, LLM-free framework that separates semantic relevance from structural joinability for efficient, high-fidelity multi-table retrieval. REaR (i) retrieves query-aligned tables, (ii) expands these with structurally joinable tables via fast, precomputed column-embedding comparisons, and (iii) refines them by pruning noisy or weakly related candidates. Empirically, REaR is retriever-agnostic and consistently improves dense and sparse retrievers on complex table QA datasets (BIRD, MMQA, and Spider) by improving both multi-table retrieval quality and downstream SQL execution. Despite being LLM-free, it delivers performance competitive with state-of-the-art LLM-augmented retrieval systems (e.g., ARM) while achieving much lower latency and cost. Ablations confirm complementary gains from expansion and refinement, underscoring REaR as a practical, scalable building block for table-based downstream tasks (e.g., Text-to-SQL).

Paper: REaR: Retrieve, Expand, and Refine for Effective Multitable Retrieval


Quick Start

Installation

  1. Clone the repository:

git clone https://github.com/<your-username>/REaR.git
cd REaR

  2. Create a virtual environment:

python -m venv venv
source venv/bin/activate

  3. Install dependencies:

bash setup.sh

Setup Configuration

  1. Create a .env file with your API keys:

# LLM API Keys (at least one required for SQL generation)
GOOGLE_API_KEY=your-google-api-key-here
OPENAI_API_KEY=your-openai-api-key-here
DEEPINFRA_API_KEY=your-deepinfra-api-key-here

  2. Prepare your data: Place your table repository JSON and questions file in an accessible path (see Data Format below).

  3. Build the vector indices (one-time preprocessing):

# Column-level FAISS index (for the Expand stage)
python preprocessing/create_col_store.py \
  --table-repository combined_database_with_desc.json \
  --embedding-model BAAI/bge-base-en-v1.5 \
  --index-out vs_col.bin \
  --metadata-out doc_metadata.json

# Table-level vector store (for the Retrieve stage)
python preprocessing/create_vector_score.py \
  --table-repository combined_database_with_desc.json \
  --output-dir storage_bge \
  --embedding-model BAAI/bge-base-en-v1.5
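
Conceptually, the column-level index built above pairs one embedding per column with a metadata entry identifying its (table, column) source, so the Expand stage can find joinable columns by nearest-neighbor search. A minimal sketch of that idea, using a toy hash-based embedder in place of BAAI/bge-base-en-v1.5 (the real script's internals may differ):

```python
import hashlib
import numpy as np

def toy_embed(text, dim=32):
    # Deterministic stand-in for a real embedding model.
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

tables = {
    "orders": ["order_id", "customer_id", "total"],
    "customers": ["customer_id", "name"],
}

# The "index": one vector per column, plus a metadata row per vector
# mapping it back to its (table, column) pair.
metadata = [(t, c) for t, cols in tables.items() for c in cols]
index = np.stack([toy_embed(f"{t}.{c}") for t, c in metadata])

def nearest_columns(query_col, k=2):
    # Vectors are unit-normalized, so a dot product gives cosine similarity.
    sims = index @ toy_embed(query_col)
    return [metadata[i] for i in np.argsort(-sims)[:k]]

print(nearest_columns("orders.customer_id", k=1))  # -> [('orders', 'customer_id')]
```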

Usage Examples

End-to-End Pipeline

Run all four stages (Retrieve → Expand → Refine → Generate) in a single command:

python full_inference.py \
  --env-file .env \
  --questions-file merged_questions.json \
  --vector-store-path storage_bge \
  --faiss-index vs_col.bin \
  --faiss-metadata doc_metadata.json \
  --table-repository combined_database_with_desc.json \
  --output-file e2e_results.json \
  --intermediate-dir runs/

Running Individual Stages

# Stage 1: Retrieve — top-k tables per question via dense retrieval
python retrieve.py \
  --questions-file merged_questions.json \
  --vector-store-path storage_bge \
  --output-file base.json

# Stage 2: Expand — discover joinable tables via column-level similarity
python expand.py \
  --retrieval-file base.json \
  --faiss-index vs_col.bin \
  --faiss-metadata doc_metadata.json \
  --output-file expansion.json

# Stage 3: Refine — prune with cross-encoder scoring
python refine.py \
  --expansion-file expansion.json \
  --table-repository combined_database_with_desc.json \
  --output-file pruned.json

# Stage 4: Generate — produce SQL from the pruned table set
python generate.py \
  --pruned-file pruned.json \
  --table-repository combined_database_with_desc.json \
  --provider gemini \
  --output-file sql.json \
  --env-file .env

Evaluation

python evaluate_retrieval.py \
  --predictions-file base.json \
  --ground-truth-file merged_questions.json \
  --output-file eval.json

Data Format

Table Repository

{
  "tables": {
    "table_name": [
      {
        "id": "tbl_<hash>",
        "columns": ["col1", "col2"],
        "data": [["val1", "val2"]],
        "primary_keys": ["col1"],
        "foreign_keys": [],
        "table_description": "Optional description"
      }
    ]
  }
}

Questions File

{
  "questions": [
    {
      "question_id": "q1",
      "question": "Natural language query text",
      "tables": [{"id": "tbl_xxx"}]
    }
  ]
}
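
As a quick sanity check before running the pipeline, you can verify that every table id referenced in the questions file exists in the table repository. A minimal sketch, assuming the two JSON layouts above (the inline sample data is illustrative):

```python
import json  # in practice, load both files with json.load

def collect_table_ids(repo):
    # The repository maps table names to lists of table entries, each with an "id".
    return {entry["id"] for entries in repo["tables"].values() for entry in entries}

def missing_references(repo, questions):
    # Return (question_id, table_id) pairs whose table id is not in the repository.
    known = collect_table_ids(repo)
    return [
        (q["question_id"], t["id"])
        for q in questions["questions"]
        for t in q["tables"]
        if t["id"] not in known
    ]

repo = {"tables": {"users": [{"id": "tbl_abc", "columns": ["id"], "data": []}]}}
qs = {"questions": [{"question_id": "q1", "question": "...",
                     "tables": [{"id": "tbl_abc"}, {"id": "tbl_zzz"}]}]}
print(missing_references(repo, qs))  # -> [('q1', 'tbl_zzz')]
```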

Pipeline Outputs

Each stage produces a JSON file that feeds into the next:

| Stage | Output | Description |
| --- | --- | --- |
| Retrieve | base.json | Top-k candidate tables per question |
| Expand | expansion.json | Candidate set augmented with joinable tables |
| Refine | pruned.json | Final table set after cross-encoder pruning |
| Generate | sql.json | Generated SQL queries with metadata |

Project Structure

REaR/
├── setup.sh                         # Dependency installation
├── imports.py                       # Centralized imports
├── utils.py                         # Utility functions
├── preprocessing/
│   ├── create_col_store.py          # Build column-level FAISS index
│   ├── create_vector_score.py       # Build table-level vector store
│   ├── generate_questions_list.py   # Extract questions from SQL
│   └── generate_table_descriptions.py  # Generate table descriptions
├── retrieve.py                      # Stage 1: Dense retrieval
├── expand.py                        # Stage 2: Join-aware expansion
├── refine.py                        # Stage 3: Cross-encoder refinement
├── generate.py                      # Stage 4: SQL generation
├── evaluate_retrieval.py            # Evaluation metrics
└── full_inference.py                # End-to-end runner

Configuration

Environment Variables

| Variable | Description | Required |
| --- | --- | --- |
| GOOGLE_API_KEY | Google Generative AI API key | For Gemini provider |
| OPENAI_API_KEY | OpenAI API key | For OpenAI provider |
| DEEPINFRA_API_KEY | DeepInfra API key | For DeepInfra provider |

Key Parameters

| Parameter | Stage | Default | Description |
| --- | --- | --- | --- |
| --top-k | Retrieve | 5 | Number of candidate tables per question |
| --embedding-model | Retrieve / Expand | BAAI/bge-base-en-v1.5 | HuggingFace embedding model |
| --top-n | Refine | 5 | Final number of tables after pruning |
| --alpha | Refine | | Weight for query-to-table scores |
| --beta | Refine | | Weight for table-to-table scores |
| --provider | Generate | gemini | LLM provider (gemini, openai, deepinfra) |
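
The --alpha and --beta flags suggest that the Refine stage scores each candidate as a weighted combination of its query-to-table (QT) score and its table-to-table (T2T) scores. A schematic sketch of one such linear combination (illustrative only; refine.py's exact formula may differ):

```python
def combined_score(qt_score, t2t_scores, alpha=1.0, beta=1.0):
    # qt_score: query-to-table relevance for one candidate table.
    # t2t_scores: table-to-table scores against the other candidates.
    t2t = sum(t2t_scores) / len(t2t_scores) if t2t_scores else 0.0
    return alpha * qt_score + beta * t2t

def prune(candidates, alpha=1.0, beta=1.0, top_n=5):
    # candidates: {table_name: (qt_score, [t2t_scores...])}
    ranked = sorted(
        candidates,
        key=lambda t: combined_score(*candidates[t], alpha=alpha, beta=beta),
        reverse=True,
    )
    return ranked[:top_n]

cands = {"orders": (0.9, [0.8]), "customers": (0.7, [0.8]), "logs": (0.2, [0.1])}
print(prune(cands, top_n=2))  # -> ['orders', 'customers']
```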

Supported LLM Providers

| Provider | Model | Environment Variable |
| --- | --- | --- |
| Google Gemini | gemini-2.0-flash | GOOGLE_API_KEY |
| OpenAI | gpt-4o-mini | OPENAI_API_KEY |
| DeepInfra | Llama-3.2-3B-Instruct | DEEPINFRA_API_KEY |

Architecture

REaR's pipeline consists of four stages:

  1. Retrieve: A dual-encoder with FAISS ranks tables by dense vector similarity to the input question using BAAI/bge-base-en-v1.5 embeddings.
  2. Expand: A column-level FAISS index discovers additional joinable tables that share compatible columns with the retrieved candidates.
  3. Refine: A cross-encoder (jina-reranker-v2-base-multilingual) scores table combinations using both query-to-table (QT) and table-to-table (T2T) relevance, then prunes the set to the top-n tables.
  4. Generate: An LLM produces SQL queries from the refined table set and the original question.
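
Schematically, the three LLM-free stages compose as retrieve → expand → refine. A toy sketch of that data flow, with hard-coded similarity scores standing in for the embedding and cross-encoder models (all table names and scores here are made up for illustration):

```python
# Toy scores standing in for the real models.
QT = {"orders": 0.9, "order_items": 0.3, "customers": 0.2, "logs": 0.1}  # query-to-table
JOIN = {("orders", "order_items"): 0.95, ("orders", "customers"): 0.9}  # column joinability

def retrieve(top_k=2):
    # Stage 1: rank tables by query-to-table similarity.
    return sorted(QT, key=QT.get, reverse=True)[:top_k]

def expand(candidates, threshold=0.85):
    # Stage 2: pull in tables joinable with any current candidate.
    out = set(candidates)
    for (a, b), score in JOIN.items():
        if score >= threshold and (a in candidates or b in candidates):
            out.update((a, b))
    return out

def refine(candidates, top_n=3):
    # Stage 3: prune the expanded set back down (here by QT score alone).
    return sorted(candidates, key=QT.get, reverse=True)[:top_n]

print(refine(expand(retrieve())))  # -> ['orders', 'order_items', 'customers']
```

Note how "customers" scores poorly on query relevance alone but survives because it is joinable with "orders"; this is exactly the gap the Expand stage is designed to close.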

Evaluation Metrics

| Metric | Description |
| --- | --- |
| Recall | Coverage of ground-truth tables in the retrieved set |
| Precision | Fraction of retrieved tables that are relevant |
| F1 | Harmonic mean of recall and precision |
| Full Recall / Precision | Perfect-match rate (all-or-nothing) |
| Coverage | Schema-field completeness beyond table-level recall |
| Redundancy | Pairwise similarity within the retrieved set |
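
For the set-based metrics, a minimal sketch of how recall, precision, F1, and full recall can be computed per question (evaluate_retrieval.py may define these slightly differently):

```python
def retrieval_metrics(predicted, gold):
    # predicted/gold: lists of table ids for one question.
    p, g = set(predicted), set(gold)
    tp = len(p & g)
    recall = tp / len(g) if g else 0.0
    precision = tp / len(p) if p else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    full_recall = float(g <= p)  # all-or-nothing: every gold table retrieved
    return {"recall": recall, "precision": precision, "f1": f1, "full_recall": full_recall}

print(retrieval_metrics(["t1", "t2", "t3"], ["t1", "t2"]))
```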

Citation

If you use REaR in your research, please cite our paper:

@misc{agarwal2025rearretrieveexpandrefine,
  title={REaR: Retrieve, Expand and Refine for Effective Multitable Retrieval},
  author={Rishita Agarwal and Himanshu Singhal and Peter Baile Chen
          and Manan Roy Choudhury and Dan Roth and Vivek Gupta},
  year={2025},
  eprint={2511.00805},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2511.00805}
}

License

This project is licensed under the Creative Commons Attribution 4.0 International License — see the LICENSE file for details.
