📚 ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge


ImpliRet (Implicit Fact Retrieval) flips the usual IR setup:
the query is intentionally surface-level (often just a *who / what / when*), while the evidence hides implicitly inside the document and must be inferred rather than string-matched (e.g., that LMU is a university in Germany, as in the GIF below).

| Dimension | Variants | Example in the document | User query asks for |
|---|---|---|---|
| Reasoning type (3) | Arithmetic, Temporal, World-Knowledge | "The 2024 model is 2.5 × last year's price." | "How much does the 2024 model cost?" |
| Discourse style (2) | Multi-speaker forum threads / uni-speaker chat logs | Ten-turn chat by one speaker vs. two-post Q&A | same |
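The arithmetic row above can be made concrete with a toy sketch (the $400 base price is invented for illustration): a lexical matcher finds no price string for the 2024 model, so the answer only exists after the multiplication is performed.

```python
# Toy illustration of the arithmetic variant: the 2024 price is never
# stated; it must be derived from the multiplier and last year's price.
# All values here are invented for illustration.
last_year_price = 400.0          # stated elsewhere in the document
stated_clue = "The 2024 model is 2.5 x last year's price."
multiplier = 2.5                 # parsed from the clue

# A matcher looking for the literal string "2024 model cost" finds no
# price; the answer only exists after reasoning over the document:
answer = multiplier * last_year_price
print(answer)  # 1000.0
```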

Corpus layout

  • 6 document pools = 3 reasoning types × 2 discourse styles
  • 6K documents + 6K queries (1 : 1), 1.5K in each pool
  • Each query has exactly one positive passage; the rest of the pool are hard negatives.
  • Dialogue/forum text is auto-generated by Gemma-3-27B and verified by a second LLM to ensure the implicit clue exists but is never stated explicitly.

Why it matters

The best baseline (ReasonIR-8B) reaches only ≈ 25 % nDCG@10, and even GPT-4.1 falters when asked to choose the right passage from 10 look-alikes—highlighting that document-side reasoning is still an open challenge.

Demo of ImpliRet reasoning



📈 Results

🔬 Retrieval & RAG Results

The table below reports nDCG@10 (↑ higher is better) for our baseline retrievers.

| Retriever | W. Know. | Arithmetic | Temporal | Average |
|---|---|---|---|---|
| **Sparse** | | | | |
| BM25 | 14.69 | 11.06 | 10.98 | 12.24 |
| **Late Interaction** | | | | |
| ColBERT v2 | 15.79 | 14.96 | 11.99 | 14.25 |
| **Dense Encoders** | | | | |
| Contriever | 16.50 | 13.70 | 12.73 | 14.31 |
| Dragon+ | 17.46 | 14.61 | 12.66 | 14.91 |
| ReasonIR-8B | 18.88 | 10.78 | 11.25 | 13.64 |
| **Knowledge-Graph-Augmented** | | | | |
| HippoRAG 2 | 16.62 | 14.13 | 12.83 | 14.53 |

*Table 2. nDCG@10 retrieval performance averaged over uni-speaker and multi-speaker documents.*
🧩 RAG‑style Evaluation

The table below shows ROUGE‑1 recall (R‑1@k) for three long‑context LLM readers when the top‑k retrieved documents (oracle setting) are supplied.

| Experiment | k | W. Know. | Arithmetic | Temporal | Average |
|---|---|---|---|---|---|
| Llama 3.3 70B | 1 | 73.79 | 90.13 | 81.85 | 81.92 |
| | 10 | 27.37 | 16.98 | 25.23 | 23.19 |
| | 30 | 17.43 | 4.42 | 10.29 | 10.71 |
| GPT-4.1 | 1 | 93.24 | 92.12 | 84.90 | 88.05 |
| | 10 | 62.21 | 23.86 | 15.59 | 35.06 |
| | 30 | 53.91 | 9.28 | 6.93 | 22.90 |
| GPT-o4-mini | 1 | 92.34 | 92.45 | 93.44 | 92.74 |
| | 10 | 88.11 | 76.61 | 73.94 | 79.55 |
| | 30 | 75.44 | 76.31 | 14.86 | 55.54 |

*Table 3. ROUGE‑1 recall (R‑1@k), averaged over uni‑speaker and multi‑speaker documents.*
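As a rough sketch of the metric above (not the official ROUGE scorer, which handles tokenization and stemming differently), ROUGE‑1 recall is the fraction of reference unigrams that also appear in the model's answer:

```python
from collections import Counter

def rouge1_recall(reference: str, prediction: str) -> float:
    """Unigram recall: overlapping tokens / reference tokens.

    A simplified sketch of ROUGE-1 recall; the official scorer
    applies its own tokenization and optional stemming.
    """
    ref = Counter(reference.lower().split())
    pred = Counter(prediction.lower().split())
    overlap = sum(min(count, pred[tok]) for tok, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

# 3 of the 6 reference tokens appear in the prediction -> 0.5
print(rouge1_recall("the 2024 model costs 1000 dollars",
                    "it costs 1000 dollars"))  # 0.5
```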


📂 Dataset

You can load the ImpliRet dataset via 🤗 Hugging Face like this:

  • Repository: zeinabTaghavi/ImpliRet
  • Reasoning Categories (split): arithmetic, wknow, temporal
  • Discourse styles (name): multispeaker, unispeaker
from datasets import load_dataset

ds = load_dataset(
    "zeinabTaghavi/ImpliRet",
    name="multispeaker",   # or "unispeaker"
    split="arithmetic"     # wknow | temporal
)

print(ds.features)        # quick schema check
print(ds[0]["question"])  # sanity sample
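Since the six pools are the cross-product of the two `name` configs and three `split`s, they can be enumerated in one loop (a small convenience sketch, not part of the repo):

```python
from itertools import product

NAMES = ("multispeaker", "unispeaker")          # discourse styles
SPLITS = ("arithmetic", "wknow", "temporal")    # reasoning categories

pools = list(product(NAMES, SPLITS))            # the 6 document pools
print(len(pools))  # 6

# Load each pool in turn (requires the `datasets` library):
# from datasets import load_dataset
# for name, split in pools:
#     ds = load_dataset("zeinabTaghavi/ImpliRet", name=name, split=split)
```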

🛠️ Benchmarks

1. Quick setup

# clone & install
$ git clone https://github.com/ZeinabTaghavi/ImpliRet.git
$ cd ImpliRet
$ python -m venv impliret_env && source impliret_env/bin/activate
$ pip install -r requirements.txt

Repository map
├── RAG_Style/
│   ├── experiment_configs   # Configs for RAG with retrievers or the oracle retriever
│   ├── model_configs        # Config for each LLM used in RAG_Style
│   ├── script               # Code for the asynchronous and synchronous experiments
│   ├── results
│   └── reports
├── Retrieval/
│   ├── retrievals           # Code for each retrieval experiment
│   ├── results
│   └── reports
└── README.md

2. Evaluate retrieval baselines

The supported retrievers are BM25s, ColBERTv2, Contriever, DragonPlus, HippoRAG 2, and ReasonIR.

Run a retriever (index creation) and generate its report with bash Retrieval/retrieve.sh, which performs the following steps:

# Running the retrieval for indexing
python ./Retrieval/retrieve_indexing.py  --output_folder ./Retrieval/results/ --category arithmetic --discourse multispeaker --retriever_name bm25

# Reporting
python Retrieval/reporting.py

Indexing results are written to Retrieval/results. Reports (MRR, nDCG@10 …) are stored in Retrieval/reports.
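Because each query has exactly one positive document, nDCG@10 reduces to a simple function of the positive's rank. The sketch below is independent of the repo's reporting code and only illustrates the metric:

```python
import math

def ndcg_at_10(positive_rank: int) -> float:
    """nDCG@10 with a single relevant document.

    positive_rank is 1-based. With one positive, the ideal DCG is 1,
    so nDCG@10 is just the discounted gain at that rank, and 0 if the
    positive falls outside the top 10.
    """
    if positive_rank > 10:
        return 0.0
    return 1.0 / math.log2(positive_rank + 1)

print(ndcg_at_10(1))             # 1.0 (positive ranked first)
print(round(ndcg_at_10(3), 3))   # 0.5 (1 / log2(4))
```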

⚠️  For running HippoRAG 2 and ReasonIR‑8B we used 4× A100 GPUs.


3. Evaluating RAG Style baselines

Here we evaluate both long-context and RAG settings. The experiment configs are in the RAG_Style/experiment_configs folder, and the model configs in RAG_Style/model_configs.

Running the Experiment

You can choose among three setups for running this experiment:

Note: All examples use the Arithmetic category (A in the file name) and the Multi Speaker discourse style (Multi in the file name).

1- Simplest way: load the model locally with vLLM via bash RAG_Style/s_run_tests.sh, which does the following in detail:

# Example:
# LLM: Llama 3.3 70B, retriever: BM25s
# Number of documents given to the LLM: 10
# Hence the configuration file is A_Multi_llama_bm_10.yaml
export HF_HOME=...
export HF_TOKEN=...
python ./RAG_Style/scripts/sync/sync_run_tests.py \
       --config ./RAG_Style/experiment_configs/bm/A_Multi_llama_bm_10.yaml

2- Load vLLM on a server with bash RAG_Style/async_run_multi_llama.sh, which does the following in detail:

export HF_HOME=...
export HF_TOKEN=...
# ------------------------------------------------------------------
# Start vLLM server via helper script (background) and wait for load
# ------------------------------------------------------------------
# run_tests.sh  (top of file)
PROJECT_ROOT=...  # adjust once
source "$PROJECT_ROOT/scripts/async/start_vllm.sh"

# Example:
# LLM: Llama 3.3 70B, retriever: Oracle (the positive document is in the context)
# Number of documents given to the LLM: 10 (1 positive, 9 negatives)
# Hence the configuration file is A_Multi_llama_10.yaml
python ./RAG_Style/scripts/async/async_run_tests.py \
       --config ./RAG_Style/experiment_configs/oracle_retriever/A_Multi_llama_10.yaml


# ------------------------------------------------------------------
# Shut down the vLLM server
# ------------------------------------------------------------------
echo "Stopping vLLM server (PID=$VLLM_PID)"
kill $VLLM_PID
wait $VLLM_PID 2>/dev/null

3- Use other models, such as GPT, that do not need a local vLLM server, with RAG_Style/async_run_multi_GPT.sh or, in detail, as follows. The outputs are hashed and stored in Experiments/evaluation/results.

# Example:
# LLM: GPT-4.1, retriever: Oracle (the positive document is in the context)
# Number of documents given to the LLM: 10 (1 positive, 9 negatives)
# Hence the configuration file is A_Multi_GPT_10.yaml
python RAG_Style/scripts/async/async_run_tests.py \
       --config RAG_Style/experiment_configs/oracle_retriever/A_Multi_GPT_10.yaml

Evaluating the RAG_Style results

You can generate the RAG report with the following command:

# Reporting the results:
python RAG_Style/scripts/reporting.py

The results will be stored in the RAG_Style/results folder.


👟 Contributing - Run your own retriever

We welcome external baselines! The quickest path is through two companion notebooks:

Notebook Purpose
📓 notebook.ipynb End‑to‑end evaluation harness for all built‑in retrievers—run this first to verify your setup.
🚀 contribute.ipynb Step‑by‑step template for creating a custom MyRetriever, indexing the corpus, and running the full metric suite.
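Conceptually, a custom retriever only has to index the corpus and rank documents per query. The skeleton below is a hypothetical sketch with a toy token-overlap scorer, not the exact interface contribute.ipynb expects:

```python
# Hypothetical skeleton of a custom retriever; the interface expected by
# contribute.ipynb may differ -- treat this as a sketch.
from typing import List

class MyRetriever:
    def index(self, corpus: List[str]) -> None:
        """Preprocess/embed the documents of one pool."""
        self.corpus = corpus

    def score(self, query: str, doc: str) -> float:
        """Toy scorer: fraction of query tokens found in the document.
        Replace with your own model's similarity."""
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d) / max(len(q), 1)

    def retrieve(self, query: str, k: int = 10) -> List[int]:
        """Return indices of the top-k documents for the query."""
        scores = [self.score(query, doc) for doc in self.corpus]
        return sorted(range(len(scores)),
                      key=scores.__getitem__, reverse=True)[:k]

r = MyRetriever()
r.index(["alpha beta", "beta gamma", "delta"])
print(r.retrieve("beta", k=2))  # [0, 1]
```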

Submit your results

  1. Fork this repository (or clone it locally).
  2. Add code (optional).
    Use the 🚀 contribute.ipynb notebook to structure and export your custom retriever code.
  3. Submit results only (optional).
    Prefer to keep your code private? Run contribute.ipynb, generate the metrics, and verify the output format.
  4. Send it in.
Open a pull request or email the artefacts (results and, optionally, code) plus a short description to zeinabtaghavi1377@gmail.com.
  5. We’ll merge, trigger CI, and add your numbers to Table 2 and the badges — 🥳🎉

Questions? Open an issue or send an email to zeinabtaghavi1377@gmail.com; we're happy to help! 😃


📜 Citation

@inproceedings{taghavi-etal-2025-impliret,
  author    = {Zeinab Sadat Taghavi and Ali Modarressi and Yunpu Ma and Hinrich Sch{\"u}tze},
  title     = {ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)},
  year      = {2025},
  month     = nov,
  address   = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
}
