DGPO: Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

Rikuto Kotoge¹   Mai Nishimura²   Jiaxin Ma²

¹The University of Osaka   ²OMRON SINIC X Corporation



📄 Abstract

Reinforcement Learning has emerged as a post-training approach for eliciting agentic RAG behaviors such as search and planning from language models. However, compact language models (e.g., 0.5B parameters) struggle due to poor reasoning ability, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which addresses these challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To systematically evaluate our approach, we introduce Agentic RAG Capabilities (ARCap), a fine-grained metric that analyzes reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, in some cases even outperforming the larger teacher model. DGPO thus makes agentic RAG feasible in compute-constrained environments.

🛠️ Installation

Install Pixi following the official instructions:

```shell
curl -fsSL https://pixi.sh/install.sh | sh
```

Clone the repository and install dependencies:

```shell
git clone https://github.com/omron-sinicx/dgpo.git
cd dgpo
pixi install -a
```

All dependencies are installed into the project-local `.pixi/` directory; no global or system-wide packages are modified.
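To confirm the installation succeeded, a small POSIX-shell check like the following can help. `check_cmd` is a hypothetical helper for illustration, not part of the repository:

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found"
  else
    echo "$1 not found"
  fi
}

# After installing Pixi (restart your shell first if it is not found):
check_cmd pixi
```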

🚀 Available Tasks

List available tasks with `pixi task list`. Each task runs in its associated virtual environment (see `pyproject.toml`).

| Task | Description |
| --- | --- |
| `download` | Download corpus and dataset |
| `download-corpus` | Download retrieval index and corpus (133.66 GB) |
| `download-dataset` | Download NQ/HotpotQA training data |
| `format-toml` | Format TOML files with tombi |
| `login-hf` | Log in to Hugging Face |
| `login-wandb` | Log in to W&B |
| `start` | Start the retrieval server |
| `test` | Test pretrained models |
| `train` | Train a model with DGPO |

⚙️ Configuration

Before running any commands, set the following paths in the `[tool.pixi.activation.env]` section of `pyproject.toml`:

| Variable | Description |
| --- | --- |
| 📁 `RETRIEVER_DATA_PATH` | Directory for retrieval index & corpus (133.66 GB) |
| 🧪 `EXP_ROOT` | Root directory for experiment checkpoints and logs |
| 📊 `DATA_ROOT` | Directory for training dataset (default: `./data`) |

```toml
[tool.pixi.activation.env]
RETRIEVER_DATA_PATH = "/path/to/retriever"  # e5_Flat.index + wiki-18.jsonl
EXP_ROOT = "/path/to/experiments"
DATA_ROOT = "./data"
```
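Before launching long-running tasks, it can be worth checking that these variables resolve to real directories. The helper below is a hypothetical sketch, not part of the repository; run it via `pixi run` so the activation environment from `pyproject.toml` is applied:

```shell
# Hypothetical helper: verify that a named variable is set and points
# to an existing directory; print a diagnostic otherwise.
require_dir_var() {
  name="$1"
  eval "val=\"\${$name:-}\""
  if [ -z "$val" ]; then
    echo "$name is unset"
    return 1
  fi
  if [ ! -d "$val" ]; then
    echo "$name=$val is not a directory"
    return 1
  fi
  echo "$name OK: $val"
}

# Example (inside the pixi environment):
# require_dir_var RETRIEVER_DATA_PATH
```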

📦 Dataset Preparation

Log in to Hugging Face, then download all required data (133.66 GB total):

```shell
pixi run login-hf  # for Hugging Face authorization
pixi run download  # downloads corpus + training dataset
```

To download individually:

  • pixi run download-corpus — retrieval index & corpus
  • pixi run download-dataset — training dataset
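After the download finishes, a quick sanity check can confirm the expected retriever files are in place. This helper is a hypothetical sketch (not part of the repo), assuming the corpus directory should contain `e5_Flat.index` and `wiki-18.jsonl` as noted in the configuration comment:

```shell
# Hypothetical sanity check: verify the retriever files exist under a
# given directory after `pixi run download-corpus`.
check_retriever_data() {
  dir="$1"
  ok=0
  for f in e5_Flat.index wiki-18.jsonl; do
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $f"
      ok=1
    fi
  done
  [ "$ok" -eq 0 ] && echo "retriever data OK"
  return "$ok"
}

# Example:
# check_retriever_data "$RETRIEVER_DATA_PATH"
```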

🧠 Training

Train a compact agentic LLM.

The teacher model (3B) was trained using the Search-R1 repository. Cold-start initialization can be applied with any knowledge distillation method; we used the standard forward-KLD-based implementation provided in the TAID repository.

(1) Launch a local retrieval server:

```shell
pixi run start
```

(2) Run DGPO:

```shell
pixi run login-wandb  # for W&B authorization
pixi run train
pixi run train {data_name} {student} {teacher} {exp_name}
```
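A concrete invocation might look like the following. All argument values here are assumptions for illustration only; the repository documents just the placeholder order, so substitute your own dataset key, model names, and experiment name:

```shell
# Hypothetical arguments -- not documented by the repository.
data_name=nq                    # training set from `pixi run download-dataset`
student=Qwen2.5-0.5B-Instruct   # compact student model (paper targets ~0.5B)
teacher=Qwen2.5-3B-Instruct     # 3B teacher trained with Search-R1
exp_name=dgpo_nq_0.5b

# Run on your machine with the retrieval server already started:
# pixi run train "$data_name" "$student" "$teacher" "$exp_name"
```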

🔍 Inference

You can query the trained model with your own questions.

(1) Launch a local retrieval server:

```shell
pixi run start
```

(2) Run inference:

```shell
pixi run test {data_name} {model} {exp_name}
```
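As with training, a filled-in call might look like this. The argument values are hypothetical illustrations, since the README specifies only the placeholder order:

```shell
# Hypothetical arguments -- names are assumptions for illustration only.
data_name=nq            # evaluation dataset key
model=dgpo-0.5b         # name or path of the trained student checkpoint
exp_name=eval_nq_0.5b   # experiment name for logs under $EXP_ROOT

# Run on your machine with the retrieval server already started:
# pixi run test "$data_name" "$model" "$exp_name"
```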

:octocat: Acknowledgement

Our implementation is built upon Search-R1, veRL, and RAGEN. We sincerely appreciate their contributions to open-source research and development.

This work was supported by JST AIP Acceleration Research JPMJCR23U2 and JST PRESTO, Japan, Grant Number JPMJPR2518.

📝 Citation

```bibtex
@inproceedings{kotoge2026dgpo,
    title = "Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities",
    author = "Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin",
    booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2026",
    publisher = "Association for Computational Linguistics",
}

@inproceedings{kotoge2025democratizing,
    title = "Democratizing Agentic {RAG}: Distillation-Guided Policy Optimization for Compact Language Models",
    author = "Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin",
    booktitle = "NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning",
    year = "2025",
    url = "https://openreview.net/forum?id=CP0H9NAWES",
}
```

About

[ACL 2026 main] DGPO: Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities
