This repository contains the official implementation of our SIGIR 2025 paper:
📄 Lightweight and Direct Document Relevance Optimization for Generative IR (DDRO)
- Optimizing Generative Retrieval with Ranking-Aligned Objectives
- Motivation
- What DDRO Does
- Learning Objectives
- 🛠️ Setup & Dependencies
- Steps to Reproduce 🎯
- Preprocessed Data & Model Checkpoints
- 🔬 Evaluate Pre-trained Models from HuggingFace
- Citation
Misalignment in Learning Objectives:
Gen-IR models are typically trained via next-token prediction (cross-entropy loss) over docid tokens.
While effective for language modeling, this objective:
- 🎯 Optimizes token-level generation
- ❌ Not designed for document-level ranking
As a result, Gen-IR models are not directly optimized for learning-to-rank, which is the core requirement in IR systems.
In this work, we ask:
How can Gen-IR models directly learn to rank documents, instead of just predicting the next token?
We propose DDRO:
Lightweight and Direct Document Relevance Optimization for Gen-IR
- Aligns training objective with ranking by using pairwise preference learning
- Trains the model to prefer relevant documents over non-relevant ones
- Bridges the gap between autoregressive training and ranking-based optimization
- Requires no reinforcement learning or reward modeling
We optimize DDRO in two phases:
Phase 1 (SFT): Learn to generate the correct docid sequence for a given query by minimizing the autoregressive token-level cross-entropy loss, i.e., maximizing the likelihood of the correct docid given the query:
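A standard way to write this objective (a sketch in generic notation, not copied verbatim from the paper; $\pi_\theta$ denotes the model being trained, as defined below):

$$
\mathcal{L}_{\text{SFT}} = -\sum_{(q,\,\text{docid}^{+})} \sum_{t=1}^{|\text{docid}^{+}|} \log \pi_\theta\!\left(\text{docid}^{+}_{t} \mid \text{docid}^{+}_{<t},\, q\right)
$$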
Phase 2 (DDRO): This phase improves the ranking quality of generated document identifiers by applying a pairwise learning-to-rank objective inspired by Direct Preference Optimization (DPO).
📄 Rafailov et al., 2023 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model
This Direct Document Relevance Optimization (DDRO) loss guides the model to prefer relevant documents (docid⁺) over non-relevant ones (docid⁻) by comparing how both the current model and a frozen reference model score each document:
- docid⁺: a relevant document for the query q
- docid⁻: a non-relevant or less relevant document
- $\pi_\theta$: the current model being optimized
- $\pi^{\text{ref}}$: a frozen reference model (typically the SFT model from Phase 1)
- β: a temperature-like factor controlling sensitivity
- $\sigma$: the sigmoid function, mapping score differences to the [0, 1] preference space
The objective encourages the model to rank the relevant docid⁺ higher than the non-relevant docid⁻.
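As a sketch, using the notation above and following the standard DPO formulation of Rafailov et al. (2023) (the paper's exact presentation may differ in details):

$$
\mathcal{L}_{\text{DDRO}} = -\,\mathbb{E}_{(q,\,\text{docid}^{+},\,\text{docid}^{-})}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(\text{docid}^{+} \mid q)}{\pi^{\text{ref}}(\text{docid}^{+} \mid q)} - \beta \log \frac{\pi_\theta(\text{docid}^{-} \mid q)}{\pi^{\text{ref}}(\text{docid}^{-} \mid q)}\right)\right]
$$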
The DPO loss is used after the SFT phase to fine-tune the ranking behavior of the model. Instead of just generating docid, the model now learns to rank docid⁺ higher than docid⁻ in a relevance/preference-aligned manner.
- Directly encourages higher generation scores for relevant documents
- Uses contrastive ranking rather than token-level generation
- Avoids reward modeling or RL while remaining efficient and scalable
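For intuition, here is a minimal PyTorch-style sketch of this pairwise loss. It is not the repository's implementation; it assumes you have already computed the summed sequence log-probabilities of each docid under the current policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def ddro_pairwise_loss(policy_pos_logp, policy_neg_logp,
                       ref_pos_logp, ref_neg_logp, beta=0.1):
    """DPO-style pairwise loss over docid sequence log-probabilities.

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of docid+ / docid- under the policy or reference model.
    """
    # Log-ratio of policy vs. frozen reference for each docid
    pos_logratio = policy_pos_logp - ref_pos_logp
    neg_logratio = policy_neg_logp - ref_neg_logp
    # Encourage the relevant docid to have the larger (scaled) log-ratio
    logits = beta * (pos_logratio - neg_logratio)
    return -F.logsigmoid(logits).mean()
```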
While our optimization is inspired by the DPO framework (Rafailov et al., 2023), its adaptation to Generative Document Retrieval is non-trivial:
- In contrast to open-ended preference alignment, our task involves structured docid generation under beam decoding constraints
- Our model uses an encoder-decoder architecture rather than decoder-only
- The objective is document-level ranking, not open-ended preference generation
This required novel integration of preference optimization into retrieval-specific pipelines, making DDRO uniquely suited for GenIR.
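To illustrate the decoding constraint, one common way to restrict beam search to valid docids is Hugging Face `generate`'s `prefix_allowed_tokens_fn` combined with a prefix trie over docid token sequences. The sketch below is a simplified illustration (dict-based trie, special decoder start tokens ignored), not the repo's `utils` trie code:

```python
def build_prefix_trie(docid_token_ids):
    """Map each docid prefix (tuple of token ids) to the set of valid next tokens.

    `docid_token_ids` is a list of docids, each given as a list of token ids.
    """
    trie = {}
    for ids in docid_token_ids:
        for i in range(len(ids)):
            trie.setdefault(tuple(ids[:i]), set()).add(ids[i])
    return trie

def make_prefix_allowed_tokens_fn(trie, eos_token_id):
    """Adapter for transformers' generate(prefix_allowed_tokens_fn=...)."""
    def allowed(batch_id, input_ids):
        next_tokens = trie.get(tuple(input_ids.tolist()), set())
        # Fall back to EOS once a complete docid has been emitted
        return list(next_tokens) if next_tokens else [eos_token_id]
    return allowed

# Usage (assuming `model`, `tokenizer`, and tokenized queries `inputs`):
# outputs = model.generate(
#     **inputs, num_beams=50, num_return_sequences=50,
#     prefix_allowed_tokens_fn=make_prefix_allowed_tokens_fn(trie, tokenizer.eos_token_id),
# )
```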
src/
├── data/ # Data downloading, preprocessing, and docid instance generation
├── pretrain/ # DDRO model training and evaluation logic (incl. ddro)
├── scripts/ # Entry-point shell scripts for SFT, ddro, BM25, and preprocessing
├── utils/ # Core utilities (tokenization, trie, metrics, trainers)
├── ddro.yml # Conda environment (for training DDRO)
├── pyserini.yml # Conda environment (for BM25 retrieval with Pyserini)
├── README.md # You're here!
└── requirements.txt # Additional Python dependencies
🔎 Each subdirectory includes a detailed README.md with instructions.
Clone the repository and create the conda environment:
git clone https://github.com/kidist-amde/ddro.git
cd ddro
conda env create -f ddro_env.yml
conda activate ddro_env
We use the MS MARCO document (top-300k) and Natural Questions (NQ-320k) datasets, and a pretrained T5 model.
To download them, run the following commands from the project root (ddro/):
bash ./src/data/download/download_msmarco_datasets.sh
bash ./src/data/download/download_nq_datasets.sh
python ./src/data/download/download_t5_model.py
📂 For details and download links, refer to: src/data/download/README.md
DDRO is evaluated on both the Natural Questions (NQ) and MS MARCO datasets.
✅ Sample Top-300K MS MARCO Subset
Run the following script to preprocess and extract the top-300K most relevant MS MARCO documents based on qrels:
bash scripts/preprocess/sample_top_docs.sh
📌 This will generate: resources/datasets/processed/msmarco-docs-sents.top.300k.json.gz (sentence-tokenized JSONL format, ranked by relevance frequency)
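A quick sanity check of this output (one JSON object per line, gzip-compressed); the snippet only inspects the file and assumes nothing about its field names:

```python
import gzip
import json

path = "resources/datasets/processed/msmarco-docs-sents.top.300k.json.gz"
with gzip.open(path, "rt", encoding="utf-8") as f:
    first = json.loads(next(f))          # first document record
    print(sorted(first.keys()))          # inspect available fields
    n_docs = 1 + sum(1 for _ in f)       # count remaining lines
print(f"{n_docs} documents")
```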
Once everything is downloaded and processed, your resources/ directory should look like this:
resources/
├── datasets/
│ ├── raw/
│ │ ├── msmarco-data/ # Raw MS MARCO dataset
│ │ └── nq-data/ # Raw Natural Questions dataset
│ └── processed/ # Preprocessed outputs
└── transformer_models/
└── t5-base/ # Local copy of T5 model & tokenizer
🔎 To process and sample both datasets, generate document IDs, and prepare training/evaluation instances, please refer to the corresponding README.
We first train a Supervised Fine-Tuning (SFT) model using next-token prediction across three stages:
- Pretraining on document content (doc → docid)
- Search pretraining on pseudo queries (pseudo query → docid)
- Finetuning on real queries using supervised pairs from qrels with gold docids (query → docid)
This results in a seed model trained to autoregressively generate document identifiers.
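For intuition, here is a minimal sketch of a single supervised (query → docid) update with the downloaded T5 checkpoint. The docid string is a placeholder; the actual scripts below additionally handle docid vocabularies, pseudo queries, and the three training stages:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_dir = "resources/transformer_models/t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = T5ForConditionalGeneration.from_pretrained(model_dir)

# One (query -> docid) pair; the target depends on the docid encoding (PQ or Title+URL)
inputs = tokenizer("what is generative retrieval", return_tensors="pt",
                   max_length=64, truncation=True)
labels = tokenizer("placeholder docid string", return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss   # token-level cross-entropy (Phase 1 objective)
loss.backward()
```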
You can run all stages with a single command:
bash ddro/src/scripts/sft/launch_SFT_training.sh
After training the SFT model (Phase 1), we apply Phase 2: Direct Document Relevance Optimization, which fine-tunes the model with a pairwise ranking objective that trains it to prefer relevant documents over non-relevant ones.
This bridges the gap between autoregressive generation and ranking-based retrieval.
We implement this using a custom version of Hugging Face's DPOTrainer.
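DPO-style trainers consume preference triples; here each training example pairs a query with a relevant and a non-relevant docid. A hypothetical record in the usual prompt/chosen/rejected layout (illustrative only; the repo's custom trainer may use a different schema):

```python
preference_example = {
    "prompt":   "who wrote the declaration of independence",   # query
    "chosen":   "relevant docid string (docid+)",
    "rejected": "non-relevant docid string (docid-)",
}
```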
Run DDRO training and evaluation:
bash scripts/ddro/slurm_submit_ddro_training.sh
bash scripts/ddro/slurm_submit_ddro_eval.sh
You can evaluate our published models directly from HF (no local preprocessing).
Model checkpoints:
- kiyam/ddro-msmarco-pq: MS MARCO (PQ)
- kiyam/ddro-msmarco-tu: MS MARCO (Title+URL)
- kiyam/ddro-nq-pq: Natural Questions (PQ)
- kiyam/ddro-nq-tu: Natural Questions (Title+URL)
- DocID tables: https://huggingface.co/datasets/kiyam/ddro-docids
  - pq_msmarco_docids.txt, tu_msmarco_docids.txt
  - pq_nq_docids.txt, tu_nq_docids.txt
- Eval test sets: https://huggingface.co/datasets/kiyam/ddro-testsets
  - msmarco/test_data_top_300k/query_dev.t5_128_1.{pq|url}.top_300k.json
  - nq/test_data/query_dev.t5_128_1.{pq_nq|url_title_nq}.json
These pairs are mutually consistent; mixing assets across sources will cause ID mismatches.
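If you prefer to fetch these artifacts manually instead of passing hf: URIs to the evaluator, huggingface_hub can download individual files from the dataset repos (filenames as listed above):

```python
from huggingface_hub import hf_hub_download

docid_table = hf_hub_download(
    repo_id="kiyam/ddro-docids",
    filename="pq_msmarco_docids.txt",
    repo_type="dataset",
)
test_set = hf_hub_download(
    repo_id="kiyam/ddro-testsets",
    filename="msmarco/test_data_top_300k/query_dev.t5_128_1.pq.top_300k.json",
    repo_type="dataset",
)
print(docid_table, test_set)
```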
A) Use the launcher (builds HF URIs for you)
# SLURM
sbatch src/pretrain/hf_eval/slurm_submit_hf_eval.sh
# Or run directly:
# Choices: --dataset {msmarco, nq}; --encoding {pq, url}; --scale matches the filenames on HF
python src/pretrain/hf_eval/launch_hf_eval_from_config.py \
--dataset msmarco \
--encoding pq \
--scale top_300k \
--hf_docids_repo kiyam/ddro-docids \
--hf_tests_repo kiyam/ddro-testsets
B) Call the evaluator with HF URIs (no local files)
# Example: NQ + Title+URL (tu)
python src/pretrain/hf_eval/eval_hf_docid_ranking.py \
--per_gpu_batch_size 4 \
--log_path logs/nq/dpo_DDRO_url_title.log \
--pretrain_model_path kiyam/ddro-nq-tu \
--docid_path "hf:dataset:kiyam/ddro-docids:tu_nq_docids.txt" \
--test_file_path "hf:dataset:kiyam/ddro-testsets:nq/test_data/query_dev.t5_128_1.url_title_nq.json" \
--dataset_script_dir src/data/data_scripts \
--dataset_cache_dir ./cache \
--num_beams 50 \
--add_doc_num 6144 \
--max_seq_length 64 \
--max_docid_length 100 \
--use_docid_rank True \
--docid_format nq \
--lookup_fallback True \
--device cuda:0
# Example: MS MARCO + PQ
python src/pretrain/hf_eval/eval_hf_docid_ranking.py \
--per_gpu_batch_size 4 \
--log_path logs/msmarco/dpo_DDRO_pq.log \
--pretrain_model_path kiyam/ddro-msmarco-pq \
--docid_path "hf:dataset:kiyam/ddro-docids:pq_msmarco_docids.txt" \
--test_file_path "hf:dataset:kiyam/ddro-testsets:msmarco/test_data_top_300k/query_dev.t5_128_1.pq.top_300k.json" \
--dataset_script_dir src/data/data_scripts \
--dataset_cache_dir ./cache \
--num_beams 80 \
--add_doc_num 6144 \
--max_seq_length 64 \
--max_docid_length 24 \
--use_docid_rank True \
--docid_format msmarco \
--lookup_fallback True \
--device cuda:0
Notes:
- Tokenizer stack: we recommend transformers==4.37.2, tokenizers==0.15.2.
- DocID namespace:
  - NQ-PQ uses canonical integer docids (pq_nq_docids.txt).
  - NQ-TU uses lowercased url_title strings (tu_nq_docids.txt). Ensure the test set's query_id matches the DocID table's LHS namespace.
- Beams: NQ-PQ (100), NQ-TU (50), and MS MARCO-PQ (80) are good defaults (the launcher sets these).
📂 Evaluation logs and metrics are saved to:
logs/<dataset>/dpo_*.log
logs/<dataset>/dpo_*.csv
We evaluate DDRO on two standard retrieval benchmarks: MS MARCO document (top-300k) and Natural Questions (NQ-320k).
All datasets, pseudo queries, docid encodings, and model checkpoints are available here:
🔗 DDRO Generative IR Collection on Hugging Face 🤗
We gratefully acknowledge the following open-source projects:
This project is licensed under the Apache 2.0 License.
@inproceedings{mekonnen2025lightweight,
title={Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval},
author={Mekonnen, Kidist Amde and Tang, Yubao and de Rijke, Maarten},
booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages={1327--1338},
year={2025}
}
For questions, please open an issue.
© 2025 Kidist Amde Mekonnen · Made with ❤️ at IRLab, University of Amsterdam.




