DDRO: Direct Document Relevance Optimization for Generative Information Retrieval

Paper License HuggingFace

This repository contains the official implementation of our SIGIR 2025 paper:
📄 Lightweight and Direct Document Relevance Optimization for Generative IR (DDRO)

  • Optimizing Generative Retrieval with Ranking-Aligned Objectives

Motivation

Misalignment in Learning Objectives:
Gen-IR models are typically trained via next-token prediction (cross-entropy loss) over docid tokens.
While effective for language modeling, this objective:

  • 🎯 Optimizes token-level generation
  • ❌ Not designed for document-level ranking

As a result, Gen-IR models are not directly optimized for learning-to-rank, which is the core requirement in IR systems.

What DDRO Does

In this work, we ask:

How can Gen-IR models directly learn to rank documents, instead of just predicting the next token?

We propose DDRO:
Lightweight and Direct Document Relevance Optimization for Gen-IR

✅ Key Contributions:

  • Aligns training objective with ranking by using pairwise preference learning
  • Trains the model to prefer relevant documents over non-relevant ones
  • Bridges the gap between autoregressive training and ranking-based optimization
  • Requires no reinforcement learning or reward modeling

DDRO training pipeline overview

Learning Objectives in DDRO

We optimize DDRO in two phases:


📘 Phase 1: Supervised Fine-Tuning (SFT)

Learn to generate the correct docid sequence given a query by minimizing the autoregressive token-level cross-entropy loss, i.e., maximizing the likelihood of the correct docid given the query.
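As a reference point, the standard form of this objective (the exact notation in the paper may differ) is:

$$
\mathcal{L}_{\text{SFT}}(\theta) = - \mathbb{E}_{(q,\ \mathrm{docid})} \left[ \sum_{t=1}^{|\mathrm{docid}|} \log \pi_\theta\left( d_t \mid d_{<t},\ q \right) \right]
$$

where $d_t$ is the $t$-th token of the target docid and $\pi_\theta$ is the model being trained.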

📗 Phase 2: Pairwise Ranking Optimization (DDRO Loss)

This phase improves the ranking quality of generated document identifiers by applying a pairwise learning-to-rank objective inspired by Direct Preference Optimization (DPO).

📄 Rafailov et al., 2023 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model


📖 Description

This Direct Document Relevance Optimization (DDRO) loss guides the model to prefer relevant documents (docid⁺) over non-relevant ones (docid⁻) by comparing how both the current model and a frozen reference model score each document:

  • docid⁺: A relevant document for the query q
  • docid⁻: A non-relevant or less relevant document
  • $\pi_\theta$: The current model being optimized
  • $\pi^{\text{ref}}$: A frozen reference model (typically trained with SFT in Phase 1)
  • β: A temperature-like factor controlling how sharply the preference responds to the score difference
  • $\sigma$: The sigmoid function, which maps the score difference to a [0, 1] preference probability

Encourage the model to rank relevant docid⁺ higher than non-relevant docid⁻.
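Written in standard DPO form with the symbols defined above (the paper's exact notation may differ), the objective is:

$$
\mathcal{L}_{\text{DDRO}}(\theta) = - \mathbb{E}_{(q,\ \mathrm{docid}^{+},\ \mathrm{docid}^{-})} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(\mathrm{docid}^{+} \mid q)}{\pi^{\text{ref}}(\mathrm{docid}^{+} \mid q)} - \beta \log \frac{\pi_\theta(\mathrm{docid}^{-} \mid q)}{\pi^{\text{ref}}(\mathrm{docid}^{-} \mid q)} \right) \right]
$$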

✅ Usage

This DPO-style loss is applied after the SFT phase to fine-tune the ranking behavior of the model. Instead of merely generating a docid, the model now learns to rank docid⁺ higher than docid⁻ in a relevance/preference-aligned manner.
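For intuition, here is a minimal PyTorch-style sketch of the pairwise objective (illustrative only, not the repository's implementation; the function and variable names are hypothetical):

```python
import torch.nn.functional as F

def ddro_pairwise_loss(pi_pos_logp, pi_neg_logp, ref_pos_logp, ref_neg_logp, beta=0.1):
    """All inputs are summed token log-probs log p(docid | query), shape (batch,)."""
    pos_logratio = pi_pos_logp - ref_pos_logp   # log pi_theta / pi_ref for docid+
    neg_logratio = pi_neg_logp - ref_neg_logp   # log pi_theta / pi_ref for docid-
    # Push the policy to score docid+ above docid- relative to the frozen reference.
    return -F.logsigmoid(beta * (pos_logratio - neg_logratio)).mean()
```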


✅ Why It Works

  • Directly encourages higher generation scores for relevant documents
  • Uses contrastive ranking rather than token-level generation
  • Avoids reward modeling or RL while remaining efficient and scalable

💡 Why DDRO is Different from Standard DPO

While our optimization is inspired by the DPO framework (Rafailov et al., 2023), its adaptation to Generative Document Retrieval is non-trivial:

  • In contrast to open-ended preference alignment, our task involves structured docid generation under beam decoding constraints
  • Our model uses an encoder-decoder architecture rather than decoder-only
  • The objective is document-level ranking, not open-ended preference generation

This required a novel integration of preference optimization into retrieval-specific pipelines, making DDRO uniquely suited for GenIR.
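As an illustration of the decoding constraint mentioned above, one common way to restrict beam search to valid docids is a trie over docid token sequences plugged into Hugging Face's `prefix_allowed_tokens_fn` hook. This is a sketch under that assumption, not the repository's own trie or decoding utilities:

```python
def build_docid_trie(docid_token_ids):
    """Nested-dict trie over tokenized docid sequences."""
    root = {}
    for ids in docid_token_ids:
        node = root
        for tok in ids:
            node = node.setdefault(tok, {})
    return root

def make_prefix_allowed_fn(trie, decoder_start_len=1, eos_token_id=1):
    # Restrict each beam to tokens that extend some valid docid prefix.
    def allowed_tokens(batch_id, generated_ids):
        node = trie
        for tok in generated_ids.tolist()[decoder_start_len:]:
            node = node.get(tok, {})
        return list(node.keys()) or [eos_token_id]
    return allowed_tokens

# outputs = model.generate(input_ids, num_beams=50, num_return_sequences=50,
#                          prefix_allowed_tokens_fn=make_prefix_allowed_fn(trie))
```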

📁 Project Structure

src/
├── data/                # Data downloading, preprocessing, and docid instance generation
├── pretrain/            # DDRO model training and evaluation logic (incl. ddro)
├── scripts/             # Entry-point shell scripts for SFT, ddro, BM25, and preprocessing
├── utils/               # Core utilities (tokenization, trie, metrics, trainers)
├── ddro.yml             # Conda environment (for training DDRO)
├── pyserini.yml         # Conda environment (for BM25 retrieval with Pyserini)
├── README.md            # You're here!
└── requirements.txt     # Additional Python dependencies

📌 Important

🔎 Each subdirectory includes a detailed README.md with instructions.


🛠️ Setup & Dependencies

1. Install Environment

Clone the repository and create the conda environment:

git clone https://github.com/kidist-amde/ddro.git
cd ddro
conda env create -f ddro_env.yml
conda activate ddro_env

2. Download Datasets and Pretrained Model

We use MS MARCO document (top-300k) and Natural Questions (NQ-320k) datasets, and a pretrained T5 model.

To download them, run the following commands from the project root (ddro/):

bash   ./src/data/download/download_msmarco_datasets.sh
bash   ./src/data/download/download_nq_datasets.sh
python ./src/data/download/download_t5_model.py

📂 For details and download links, refer to: src/data/download/README.md

3. Data Preparation

DDRO is evaluated on both the Natural Questions (NQ) and MS MARCO datasets.

✅ Sample Top-300K MS MARCO Subset

Run the following script to preprocess and extract the top-300K most relevant MS MARCO documents based on qrels:

bash scripts/preprocess/sample_top_docs.sh

  • 📌 This will generate: resources/datasets/processed/msmarco-docs-sents.top.300k.json.gz (sentence-tokenized JSONL format, ranked by relevance frequency)

Expected Directory Structure

Once everything is downloaded and processed, your resources/ directory should look like this:

resources/
├── datasets/
│   ├── raw/
│   │   ├── msmarco-data/         # Raw MS MARCO dataset 
│   │   └── nq-data/              # Raw Natural Questions dataset
│   └── processed/                # Preprocessed outputs
└── transformer_models/
      └── t5-base/                # Local copy of T5 model & tokenizer

📌 Important

🔎 To process and sample both datasets, generate document IDs, and prepare training/evaluation instances, please refer to the corresponding README:

🔗 src/data/data_prep/README.md


Training Pipeline

📘 Phase 1: Supervised Fine-Tuning (SFT)

We first train a Supervised Fine-Tuning (SFT) model using next-token prediction across three stages:

  1. Pretraining on document content (doc → docid)
  2. Search Pretraining on pseudo queries (pseudoquery → docid)
  3. Finetuning on real queries using supervised pairs from qrels (with gold docids) (query → docid)

This results in a seed model trained to autoregressively generate document identifiers.

You can run all stages with a single command:

bash ddro/src/scripts/sft/launch_SFT_training.sh

📍 The --encoding flag in the script selects the docid format, e.g., pq or url.

🔧 Phase 2: DDRO Training (Pairwise Optimization)

After training the SFT model (Phase 1), we apply Phase 2: Direct Document Relevance Optimization, which fine-tunes the model with a pairwise ranking objective that trains it to prefer relevant documents over non-relevant ones.

This bridges the gap between autoregressive generation and ranking-based retrieval.

We implement this using a custom version of Hugging Face's DPOTrainer.
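The preference data behind this phase can be viewed as (query, docid⁺, docid⁻) triples in the prompt/chosen/rejected layout that TRL's DPOTrainer expects. The record below is a simplified, hypothetical sketch; the repository's customized trainer and data pipeline may differ:

```python
# Hypothetical layout of the pairwise training records (one per query / docid pair).
preference_examples = [
    {
        "prompt":   "what is the capital of france",  # query text
        "chosen":   "D123456",                         # docid of a relevant document
        "rejected": "D654321",                         # docid of a non-relevant document
    },
]
```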

Run DDRO training and evaluation:

bash scripts/ddro/slurm_submit_ddro_training.sh
bash scripts/ddro/slurm_submit_ddro_eval.sh



Model Evaluation

🔬 Evaluate Pre-trained Models from Hugging Face

You can evaluate our published models directly from HF (no local preprocessing).

Available Models

  • kiyam/ddro-msmarco-pq — MS MARCO (PQ)
  • kiyam/ddro-msmarco-tu — MS MARCO (Title+URL)
  • kiyam/ddro-nq-pq — Natural Questions (PQ)
  • kiyam/ddro-nq-tu — Natural Questions (Title+URL)

Required HF datasets (ready-made)

  • kiyam/ddro-docids (docid lookup tables)
  • kiyam/ddro-testsets (preprocessed evaluation query sets)

These model, docid, and test-set assets are mutually consistent; mixing assets from other sources will cause docid mismatches.


⚡ Quick Evaluation (recommended)

A) Use the launcher (builds HF URIs for you)

# SLURM
sbatch src/pretrain/hf_eval/slurm_submit_hf_eval.sh

# Or run directly:
# --dataset: msmarco | nq    --encoding: pq | url    --scale: matches filenames on HF
python src/pretrain/hf_eval/launch_hf_eval_from_config.py \
  --dataset msmarco \
  --encoding pq \
  --scale top_300k \
  --hf_docids_repo kiyam/ddro-docids \
  --hf_tests_repo kiyam/ddro-testsets

B) Call the evaluator with HF URIs (no local files)

# Example: NQ + Title+URL (tu)
python src/pretrain/hf_eval/eval_hf_docid_ranking.py \
  --per_gpu_batch_size 4 \
  --log_path logs/nq/dpo_DDRO_url_title.log \
  --pretrain_model_path kiyam/ddro-nq-tu \
  --docid_path "hf:dataset:kiyam/ddro-docids:tu_nq_docids.txt" \
  --test_file_path "hf:dataset:kiyam/ddro-testsets:nq/test_data/query_dev.t5_128_1.url_title_nq.json" \
  --dataset_script_dir src/data/data_scripts \
  --dataset_cache_dir ./cache \
  --num_beams 50 \
  --add_doc_num 6144 \
  --max_seq_length 64 \
  --max_docid_length 100 \
  --use_docid_rank True \
  --docid_format nq \
  --lookup_fallback True \
  --device cuda:0

# Example: MS MARCO + PQ
python src/pretrain/hf_eval/eval_hf_docid_ranking.py \
  --per_gpu_batch_size 4 \
  --log_path logs/msmarco/dpo_DDRO_pq.log \
  --pretrain_model_path kiyam/ddro-msmarco-pq \
  --docid_path "hf:dataset:kiyam/ddro-docids:pq_msmarco_docids.txt" \
  --test_file_path "hf:dataset:kiyam/ddro-testsets:msmarco/test_data_top_300k/query_dev.t5_128_1.pq.top_300k.json" \
  --dataset_script_dir src/data/data_scripts \
  --dataset_cache_dir ./cache \
  --num_beams 80 \
  --add_doc_num 6144 \
  --max_seq_length 64 \
  --max_docid_length 24 \
  --use_docid_rank True \
  --docid_format msmarco \
  --lookup_fallback True \
  --device cuda:0

🔧 Notes & tips

  • Tokenizer stack: we recommend transformers==4.37.2, tokenizers==0.15.2.

  • DocID namespace:
    • NQ–PQ uses canonical integer docids (pq_nq_docids.txt).
    • NQ–TU uses lowercased url_title strings (tu_nq_docids.txt). Ensure the test set query_id matches the DocID table’s LHS namespace.

  • Beams: NQ-PQ (100), NQ-TU (50), MS MARCO-PQ (80) are good defaults (the launcher sets these).

📂 Evaluation logs and metrics are saved to:

logs/<dataset>/dpo_*.log
logs/<dataset>/dpo_*.csv

📚 Datasets Used

We evaluate DDRO on two standard retrieval benchmarks:

  • MS MARCO Document Ranking (top-300k subset)
  • Natural Questions (NQ-320k)

Preprocessed Data & Model Checkpoints

All datasets, pseudo queries, docid encodings, and model checkpoints are available here:
🔗 DDRO Generative IR Collection on Hugging Face 🤗


🙏 Acknowledgments

We gratefully acknowledge the following open-source projects:


📄 License

This project is licensed under the Apache 2.0 License.


Citation

@inproceedings{mekonnen2025lightweight,
  title={Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval},
  author={Mekonnen, Kidist Amde and Tang, Yubao and de Rijke, Maarten},
  booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages={1327--1338},
  year={2025}
}

📬 Contact

For questions, please open an issue.

© 2025 Kidist Amde Mekonnen · Made with ❤️ at IRLab, University of Amsterdam.