📚 ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge


ImpliRet (Implicit Fact Retrieval) flips the usual IR setup:
the query is intentionally surface-level (often just a *who / what / when*), while the evidence hides implicitly inside the document and must be inferred rather than string-matched (e.g., that LMU is a university in Germany, as in the GIF below).

| Dimension | Variants | Example in the document | User query asks for |
|---|---|---|---|
| Reasoning type (3) | Arithmetic, Temporal, World-Knowledge | "The 2024 model is 2.5 × last year's price." | "How much does the 2024 model cost?" |
| Discourse style (2) | Multi-speaker forum threads / uni-speaker chat logs | Ten-turn chat by one speaker vs. two-post Q&A | same |
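The arithmetic row above can be made concrete with a toy sketch (the $400 base price is invented for illustration): a lexical matcher finds no price string for the 2024 model, so the answer only exists after the multiplication is performed.

```python
# Toy illustration of the arithmetic variant: the 2024 price is never
# stated; it must be derived from the multiplier and last year's price.
# All values here are invented for illustration.
last_year_price = 400.0          # stated elsewhere in the document
stated_clue = "The 2024 model is 2.5 x last year's price."
multiplier = 2.5                 # parsed from the clue

# A matcher looking for the literal string "2024 model cost" finds no
# price; the answer only exists after reasoning over the document:
answer = multiplier * last_year_price
print(answer)  # 1000.0
```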

Corpus layout

  • 6 document pools = 3 reasoning types × 2 discourse styles
  • 6K documents + 6K queries (1 : 1), 1.5K in each pool
  • Each query has exactly one positive passage; the rest of the pool are hard negatives.
  • Dialogue/forum text is auto-generated by Gemma-3-27B and verified by a second LLM to ensure the implicit clue exists but is never stated explicitly.

Why it matters

The best baseline (ReasonIR-8B) reaches only ≈ 25 % nDCG@10, and even GPT-4.1 falters when asked to choose the right passage from 10 look-alikes—highlighting that document-side reasoning is still an open challenge.

Demo of ImpliRet reasoning



📈 Results

🔬 Retrieval & RAG Results

The table below reports nDCG@10 (↑ higher is better) for our baseline retrievers.

| Retriever | W. Know. | Arithmetic | Temporal | Average |
|---|---|---|---|---|
| **Sparse** | | | | |
| BM25 | 14.69 | 11.06 | 10.98 | 12.24 |
| **Late Interaction** | | | | |
| ColBERT v2 | 15.79 | 14.96 | 11.99 | 14.25 |
| **Dense Encoders** | | | | |
| Contriever | 16.50 | 13.70 | 12.73 | 14.31 |
| Dragon+ | 17.46 | 14.61 | 12.66 | 14.91 |
| ReasonIR-8B | 18.88 | 10.78 | 11.25 | 13.64 |
| **Knowledge-Graph-Augmented** | | | | |
| HippoRAG 2 | 16.62 | 14.13 | 12.83 | 14.53 |

*Table 2. nDCG@10 retrieval performance averaged over uni-speaker and multi-speaker documents.*
🧩 RAG‑style Evaluation

The table below shows ROUGE‑1 recall (R‑1@k) for three long‑context LLM readers when the top‑k retrieved documents (oracle setting) are supplied.

| Experiment | k | W. Know. | Arithmetic | Temporal | Average |
|---|---|---|---|---|---|
| Llama 3.3 70B | 1 | 73.79 | 90.13 | 81.85 | 81.92 |
| | 10 | 27.37 | 16.98 | 25.23 | 23.19 |
| | 30 | 17.43 | 4.42 | 10.29 | 10.71 |
| GPT-4.1 | 1 | 93.24 | 92.12 | 84.90 | 88.05 |
| | 10 | 62.21 | 23.86 | 15.59 | 35.06 |
| | 30 | 53.91 | 9.28 | 6.93 | 22.90 |
| GPT-o4-mini | 1 | 92.34 | 92.45 | 93.44 | 92.74 |
| | 10 | 88.11 | 76.61 | 73.94 | 79.55 |
| | 30 | 75.44 | 76.31 | 14.86 | 55.54 |

*Table 3. ROUGE‑1 recall (R‑1@k), averaged over uni‑speaker and multi‑speaker documents.*
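As a rough sketch of the metric above (not the official ROUGE scorer, which handles tokenization and stemming differently), ROUGE‑1 recall is the fraction of reference unigrams that also appear in the model's answer:

```python
from collections import Counter

def rouge1_recall(reference: str, prediction: str) -> float:
    """Unigram recall: overlapping tokens / reference tokens.

    A simplified sketch of ROUGE-1 recall; the official scorer
    applies its own tokenization and optional stemming.
    """
    ref = Counter(reference.lower().split())
    pred = Counter(prediction.lower().split())
    overlap = sum(min(count, pred[tok]) for tok, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

# 3 of the 6 reference tokens appear in the prediction -> 0.5
print(rouge1_recall("the 2024 model costs 1000 dollars",
                    "it costs 1000 dollars"))  # 0.5
```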


📂 Dataset

You can load the ImpliRet dataset via 🤗 Hugging Face like this:

  • Repository: zeinabTaghavi/ImpliRet
  • Reasoning Categories (split): arithmetic, wknow, temporal
  • Discourse styles (name): multispeaker, unispeaker
from datasets import load_dataset

ds = load_dataset(
    "zeinabTaghavi/ImpliRet",
    name="multispeaker",   # or "unispeaker"
    split="arithmetic"     # wknow | temporal
)

print(ds.features)        # quick schema check
print(ds[0]["question"])  # sanity sample
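Since the six pools are the cross-product of the two `name` configs and three `split`s, they can be enumerated in one loop (a small convenience sketch, not part of the repo):

```python
from itertools import product

NAMES = ("multispeaker", "unispeaker")          # discourse styles
SPLITS = ("arithmetic", "wknow", "temporal")    # reasoning categories

pools = list(product(NAMES, SPLITS))            # the 6 document pools
print(len(pools))  # 6

# Load each pool in turn (requires the `datasets` library):
# from datasets import load_dataset
# for name, split in pools:
#     ds = load_dataset("zeinabTaghavi/ImpliRet", name=name, split=split)
```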

🛠️ Benchmarks

1. Quick setup

# clone & install
$ git clone https://github.com/ZeinabTaghavi/ImpliRet.git
$ cd ImpliRet
$ python -m venv impliret_env && source impliret_env/bin/activate
$ pip install -r requirements.txt

Repository map
├── RAG_Style/
│   ├── experiment_configs   # Configs for RAG with retrievers or the oracle retriever
│   ├── model_configs        # Config for each LLM used in RAG_Style
│   ├── script               # Code for the asynchronous and synchronous experiments
│   ├── results
│   └── reports
├── Retrieval/
│   ├── retrievals           # Code for each retrieval experiment
│   ├── results
│   └── reports
└── README.md

2. Evaluate retrieval baselines

The supported retrievers are BM25s, ColBERTv2, Contriever, DragonPlus, HippoRAG 2, and ReasonIR.

Run a retriever (index creation) and generate its report with bash Retrieval/retrieve.sh, which performs the following steps:

# Running the retrieval for indexing
python ./Retrieval/retrieve_indexing.py  --output_folder ./Retrieval/results/ --category arithmetic --discourse multispeaker --retriever_name bm25

# Reporting
python Retrieval/reporting.py

Indexing results are written to Retrieval/results. Reports (MRR, nDCG@10 …) are stored in Retrieval/reports.
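Because each query has exactly one positive document, nDCG@10 reduces to a simple function of the positive's rank. The sketch below is independent of the repo's reporting code and only illustrates the metric:

```python
import math

def ndcg_at_10(positive_rank: int) -> float:
    """nDCG@10 with a single relevant document.

    positive_rank is 1-based. With one positive, the ideal DCG is 1,
    so nDCG@10 is just the discounted gain at that rank, and 0 if the
    positive falls outside the top 10.
    """
    if positive_rank > 10:
        return 0.0
    return 1.0 / math.log2(positive_rank + 1)

print(ndcg_at_10(1))             # 1.0 (positive ranked first)
print(round(ndcg_at_10(3), 3))   # 0.5 (1 / log2(4))
```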

⚠️  For running HippoRAG 2 and ReasonIR‑8B we used 4× A100 GPUs.


3. Evaluating RAG Style baselines

Here we evaluate both long-context and RAG settings. The experiment configs are in the RAG_Style/experiment_configs folder, and the model configs in RAG_Style/model_configs.

Running the Experiment

You can choose among three setups for running this experiment:

Note: All examples use the Arithmetic category (A in the file name) and the Multi Speaker discourse style (Multi in the file name).

1- Simplest way: load the model locally with vLLM via bash RAG_Style/s_run_tests.sh, which does the following in detail:

# Example:
# LLM: Llama 3.3 70B, retriever: BM25s
# Number of documents given to the LLM: 10
# Hence the configuration file is A_Multi_llama_bm_10.yaml
export HF_HOME=...
export HF_TOKEN=...
python ./RAG_Style/scripts/sync/sync_run_tests.py \
       --config ./RAG_Style/experiment_configs/bm/A_Multi_llama_bm_10.yaml

2- Load vLLM on a server with bash RAG_Style/async_run_multi_llama.sh, which does the following in detail:

export HF_HOME=...
export HF_TOKEN=...
# ------------------------------------------------------------------
# Start vLLM server via helper script (background) and wait for load
# ------------------------------------------------------------------
# run_tests.sh  (top of file)
PROJECT_ROOT=...  # adjust once
source "$PROJECT_ROOT/scripts/async/start_vllm.sh"

# Example:
# LLM: Llama 3.3 70B, retriever: Oracle (the positive document is in the context)
# Number of documents given to the LLM: 10 (1 positive, 9 negatives)
# Hence the configuration file is A_Multi_llama_10.yaml
python ./RAG_Style/scripts/async/async_run_tests.py \
       --config ./RAG_Style/experiment_configs/oracle_retriever/A_Multi_llama_10.yaml


# ------------------------------------------------------------------
# Shut down the vLLM server
# ------------------------------------------------------------------
echo "Stopping vLLM server (PID=$VLLM_PID)"
kill $VLLM_PID
wait $VLLM_PID 2>/dev/null

3- Use other models, such as GPT, that do not need a local vLLM server, with RAG_Style/async_run_multi_GPT.sh or, in detail, as follows. The outputs are hashed and stored in Experiments/evaluation/results.

# Example:
# LLM: GPT-4.1, retriever: Oracle (the positive document is in the context)
# Number of documents given to the LLM: 10 (1 positive, 9 negatives)
# Hence the configuration file is A_Multi_GPT_10.yaml
python RAG_Style/scripts/async/async_run_tests.py \
       --config RAG_Style/experiment_configs/oracle_retriever/A_Multi_GPT_10.yaml

Evaluating the RAG_Style results

You can generate the RAG report with the following command:

# Reporting the results:
python RAG_Style/scripts/reporting.py

The results will be stored in the RAG_Style/results folder.


👟 Contributing - Run your own retriever

We welcome external baselines! The quickest path is through two companion notebooks:

Notebook Purpose
📓 notebook.ipynb End‑to‑end evaluation harness for all built‑in retrievers—run this first to verify your setup.
🚀 contribute.ipynb Step‑by‑step template for creating a custom MyRetriever, indexing the corpus, and running the full metric suite.
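Conceptually, a custom retriever only has to index the corpus and rank documents per query. The skeleton below is a hypothetical sketch with a toy token-overlap scorer, not the exact interface contribute.ipynb expects:

```python
# Hypothetical skeleton of a custom retriever; the interface expected by
# contribute.ipynb may differ -- treat this as a sketch.
from typing import List

class MyRetriever:
    def index(self, corpus: List[str]) -> None:
        """Preprocess/embed the documents of one pool."""
        self.corpus = corpus

    def score(self, query: str, doc: str) -> float:
        """Toy scorer: fraction of query tokens found in the document.
        Replace with your own model's similarity."""
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d) / max(len(q), 1)

    def retrieve(self, query: str, k: int = 10) -> List[int]:
        """Return indices of the top-k documents for the query."""
        scores = [self.score(query, doc) for doc in self.corpus]
        return sorted(range(len(scores)),
                      key=scores.__getitem__, reverse=True)[:k]

r = MyRetriever()
r.index(["alpha beta", "beta gamma", "delta"])
print(r.retrieve("beta", k=2))  # [0, 1]
```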

Submit your results

  1. Fork this repository (or clone it locally).
  2. Add code (optional).
    Use the 🚀 contribute.ipynb notebook to structure and export your custom retriever code.
  3. Submit results only (optional).
    Prefer to keep your code private? Run contribute.ipynb, generate the metrics, and verify the output format.
  4. Send it in.
Open a pull request or email the artefacts (results and, optionally, code) plus a short description to zeinabtaghavi1377@gmail.com.
  5. We’ll merge, trigger CI, and add your numbers to Table 2 and the badges — 🥳🎉

Questions? Open an issue or send an email to zeinabtaghavi1377@gmail.com; we're happy to help! 😃


📜 Citation

@inproceedings{taghavi-etal-2025-impliret,
  author    = {Zeinab Sadat Taghavi and Ali Modarressi and Yunpu Ma and Hinrich Sch{\"u}tze},
  title     = {ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)},
  year      = {2025},
  month     = nov,
  address   = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
}
