DGPO: Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

Rikuto Kotoge¹   Mai Nishimura²   Jiaxin Ma²

¹The University of Osaka   ²OMRON SINIC X Corporation



📄 Abstract

Reinforcement Learning has emerged as a post-training approach for eliciting agentic RAG behaviors such as search and planning from language models. However, compact language models (e.g., 0.5B parameters) struggle due to poor reasoning ability, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which addresses these challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To systematically evaluate our approach, we introduce Agentic RAG Capabilities (ARCap), a fine-grained metric that analyzes reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, in some cases even outperforming the larger teacher model. DGPO thus makes agentic RAG feasible in compute-constrained environments.

🛠️ Installation

Install Pixi following the official instructions:

```shell
curl -fsSL https://pixi.sh/install.sh | sh
```

Clone the repository and install dependencies:

```shell
git clone https://github.com/omron-sinicx/dgpo.git
cd dgpo
pixi install -a
```

All dependencies are installed into the project-local `.pixi/` directory; no global or system-wide packages are modified.
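To confirm the installation succeeded, a small POSIX-shell check like the following can help. `check_cmd` is a hypothetical helper for illustration, not part of the repository:

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found"
  else
    echo "$1 not found"
  fi
}

# After installing Pixi (restart your shell first if it is not found):
check_cmd pixi
```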

🚀 Available Tasks

List available tasks with `pixi task list`. Each task runs in its associated virtual environment (see `pyproject.toml`).

| Task | Description |
| --- | --- |
| `download` | Download corpus and dataset |
| `download-corpus` | Download retrieval index and corpus (133.66 GB) |
| `download-dataset` | Download NQ/HotpotQA training data |
| `format-toml` | Format TOML files with tombi |
| `login-hf` | Log in to Hugging Face |
| `login-wandb` | Log in to W&B |
| `start` | Start the retrieval server |
| `test` | Test pretrained models |
| `train` | Train a model with DGPO |

⚙️ Configuration

Before running any commands, set the following paths in the `[tool.pixi.activation.env]` section of `pyproject.toml`:

| Variable | Description |
| --- | --- |
| 📁 `RETRIEVER_DATA_PATH` | Directory for retrieval index & corpus (133.66 GB) |
| 🧪 `EXP_ROOT` | Root directory for experiment checkpoints and logs |
| 📊 `DATA_ROOT` | Directory for training dataset (default: `./data`) |

```toml
[tool.pixi.activation.env]
RETRIEVER_DATA_PATH = "/path/to/retriever"  # e5_Flat.index + wiki-18.jsonl
EXP_ROOT = "/path/to/experiments"
DATA_ROOT = "./data"
```
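Before launching long-running tasks, it can be worth checking that these variables resolve to real directories. The helper below is a hypothetical sketch, not part of the repository; run it via `pixi run` so the activation environment from `pyproject.toml` is applied:

```shell
# Hypothetical helper: verify that a named variable is set and points
# to an existing directory; print a diagnostic otherwise.
require_dir_var() {
  name="$1"
  eval "val=\"\${$name:-}\""
  if [ -z "$val" ]; then
    echo "$name is unset"
    return 1
  fi
  if [ ! -d "$val" ]; then
    echo "$name=$val is not a directory"
    return 1
  fi
  echo "$name OK: $val"
}

# Example (inside the pixi environment):
# require_dir_var RETRIEVER_DATA_PATH
```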

📦 Dataset Preparation

Log in to Hugging Face, then download all required data (133.66 GB total):

```shell
pixi run login-hf  # for Hugging Face authorization
pixi run download  # downloads corpus + training dataset
```

To download individually:

  • pixi run download-corpus — retrieval index & corpus
  • pixi run download-dataset — training dataset
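After the download finishes, a quick sanity check can confirm the expected retriever files are in place. This helper is a hypothetical sketch (not part of the repo), assuming the corpus directory should contain `e5_Flat.index` and `wiki-18.jsonl` as noted in the configuration comment:

```shell
# Hypothetical sanity check: verify the retriever files exist under a
# given directory after `pixi run download-corpus`.
check_retriever_data() {
  dir="$1"
  ok=0
  for f in e5_Flat.index wiki-18.jsonl; do
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $f"
      ok=1
    fi
  done
  [ "$ok" -eq 0 ] && echo "retriever data OK"
  return "$ok"
}

# Example:
# check_retriever_data "$RETRIEVER_DATA_PATH"
```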

🧠 Training

Train a compact agentic LLM.

The teacher model (3B) was trained using the Search-R1 repository. Cold-start initialization can be applied with any knowledge distillation method; we used the standard forward-KLD-based implementation provided in the TAID repository.

(1) Launch a local retrieval server:

```shell
pixi run start
```

(2) Run DGPO:

```shell
pixi run login-wandb  # for W&B authorization
pixi run train
pixi run train {data_name} {student} {teacher} {exp_name}
```
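A concrete invocation might look like the following. All argument values here are assumptions for illustration only; the repository documents just the placeholder order, so substitute your own dataset key, model names, and experiment name:

```shell
# Hypothetical arguments -- not documented by the repository.
data_name=nq                    # training set from `pixi run download-dataset`
student=Qwen2.5-0.5B-Instruct   # compact student model (paper targets ~0.5B)
teacher=Qwen2.5-3B-Instruct     # 3B teacher trained with Search-R1
exp_name=dgpo_nq_0.5b

# Run on your machine with the retrieval server already started:
# pixi run train "$data_name" "$student" "$teacher" "$exp_name"
```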

🔍 Inference

You can query the trained model with your own questions.

(1) Launch a local retrieval server:

```shell
pixi run start
```

(2) Run inference:

```shell
pixi run test {data_name} {model} {exp_name}
```
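As with training, a filled-in call might look like this. The argument values are hypothetical illustrations, since the README specifies only the placeholder order:

```shell
# Hypothetical arguments -- names are assumptions for illustration only.
data_name=nq            # evaluation dataset key
model=dgpo-0.5b         # name or path of the trained student checkpoint
exp_name=eval_nq_0.5b   # experiment name for logs under $EXP_ROOT

# Run on your machine with the retrieval server already started:
# pixi run test "$data_name" "$model" "$exp_name"
```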

:octocat: Acknowledgement

Our implementation is built upon Search-R1, veRL, and RAGEN. We sincerely appreciate their contributions to open-source research and development.

This work was supported by JST AIP Acceleration Research JPMJCR23U2 and JST PRESTO, Japan, Grant Number JPMJPR2518.

📝 Citation

```bibtex
@inproceedings{kotoge2026dgpo,
    title = "Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities",
    author = "Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin",
    booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2026",
    publisher = "Association for Computational Linguistics",
}

@inproceedings{kotoge2025democratizing,
    title = "Democratizing Agentic {RAG}: Distillation-Guided Policy Optimization for Compact Language Models",
    author = "Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin",
    booktitle = "NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning",
    year = "2025",
    url = "https://openreview.net/forum?id=CP0H9NAWES",
}
```

About

[ACL 2026 main] DGPO: Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities
