feat(distillation): add on-policy distillation using RolloutEngine by zahrayousefijamarani · Pull Request #1376 · areal-project/AReaL

zahrayousefijamarani · 2026-05-28T19:39:01Z

Description

Summary

This PR enables on-policy distillation with a dedicated teacher rollout/inference engine (vLLM/SGLang), instead of relying on a train-engine teacher path.
The goal is to reduce memory overhead and provide a clean inference-side token log-prob scoring API used by distillation losses.

Motivation

In on-policy distillation, the teacher is used only for teacher_logp scoring.
A full train-engine teacher can allocate unnecessary training-state memory (optimizer/grad-related structures), while a rollout/inference teacher is lighter and better aligned with the actual use case.

Related Issue

Fixes #1367

Type of Change

Checklist

I have read the Contributing Guide
Pre-commit hooks pass (pre-commit run --all-files)
Relevant tests pass; new tests added for new functionality
Documentation updated (if applicable; built with ./docs/build_all.sh)
Branch is up to date with main
Self-reviewed via /review-pr command
This PR was created by a coding agent via /create-pr
This PR is a breaking change

Breaking Change Details (if applicable):

Additional Context

What changed

1) Teacher config refactor

TeacherConfig no longer inherits train-actor config.
Added explicit teacher fields:
- rollout: InferenceEngineConfig
- path: str
- offload: bool
Retains:
- rl_loss_weight
- distill_loss_weight
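
As a rough illustration of the refactored schema, the fields listed above could be laid out as a standalone dataclass. This is a minimal sketch, not the PR's actual code: the `InferenceEngineConfig` stand-in, its `backend` field, the `float` type of the two weights, and every default value are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InferenceEngineConfig:
    """Stand-in for the real rollout config; its fields are not shown in this PR text."""

    backend: str = "vllm"  # hypothetical field, for illustration only


@dataclass
class TeacherConfig:
    """Teacher settings as listed above, with no train-actor base class."""

    path: str = ""  # teacher model checkpoint path
    rollout: InferenceEngineConfig = field(default_factory=InferenceEngineConfig)
    offload: bool = False  # default is an assumption
    rl_loss_weight: float = 1.0  # type and default are assumptions
    distill_loss_weight: float = 1.0  # type and default are assumptions


# Hypothetical usage: only the teacher path and one weight are overridden.
cfg = TeacherConfig(path="/models/teacher", distill_loss_weight=0.5)
```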

2) Inference scoring API

Added InferenceEngine.compute_logp(...) API.
Extended remote backend protocol with:
- build_score_request(...)
- parse_score_response(...)
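
The split between building a request and parsing a response might look roughly like the sketch below. Only the two method names come from the PR; the argument lists, return types, the `ScoreBackend` and `EchoBackend` names, and the `send` callable are hypothetical and the real signatures may differ.

```python
from typing import Any, Callable, Protocol


class ScoreBackend(Protocol):
    """Hypothetical shape of the two scoring hooks; real signatures may differ."""

    def build_score_request(self, input_ids: list[int]) -> dict[str, Any]:
        ...

    def parse_score_response(self, response: dict[str, Any]) -> list[float]:
        ...


class EchoBackend:
    """Toy backend used only to exercise the protocol."""

    def build_score_request(self, input_ids: list[int]) -> dict[str, Any]:
        # The payload layout here is invented for the example.
        return {"prompt": list(input_ids)}

    def parse_score_response(self, response: dict[str, Any]) -> list[float]:
        return [float(x) for x in response["logprobs"]]


def compute_logp(
    backend: ScoreBackend,
    input_ids: list[int],
    send: Callable[[dict[str, Any]], dict[str, Any]],
) -> list[float]:
    """Build a request, send it, and parse the per-token log-probs."""
    request = backend.build_score_request(input_ids)
    return backend.parse_score_response(send(request))


def fake_send(request: dict[str, Any]) -> dict[str, Any]:
    """Pretend server: returns one log-prob of -1.0 per prompt token."""
    return {"logprobs": [-1.0] * len(request["prompt"])}


scores = compute_logp(EchoBackend(), [5, 6, 7], fake_send)
```

Keeping the backend-specific pieces behind two small hooks lets the shared engine code stay identical for vLLM and SGLang.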

3) Remote scoring implementation

Implemented compute_logp in RemoteInfEngine:
- Sends backend score requests
- Parses token log-prob outputs
- Returns per-trajectory tensors aligned to masked token positions

Added passthrough implementations in:
- RemotevLLMEngine
- RemoteSGLangEngine

Backend-specific score request/response logic added for:
- vLLM
- SGLang
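
One possible reading of "aligned to masked token positions" is sketched below: each trajectory keeps its full length, and log-probs outside the loss mask are zeroed. This interpretation, the `align_to_mask` name, and the use of plain lists in place of tensors are all assumptions; the PR may instead drop the unmasked positions.

```python
def align_to_mask(token_logps, loss_mask):
    """Keep log-probs where the mask is set and zero them elsewhere."""
    if len(token_logps) != len(loss_mask):
        raise ValueError("token_logps and loss_mask must have the same length")
    return [lp if m else 0.0 for lp, m in zip(token_logps, loss_mask)]


# Two trajectories of different lengths, each aligned independently.
batch_logps = [[-0.5, -1.5, -0.25], [-2.0, -0.75]]
batch_masks = [[0, 1, 1], [1, 0]]
aligned = [align_to_mask(lp, m) for lp, m in zip(batch_logps, batch_masks)]
```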

4) Controller integration

Added RolloutController.compute_logp(...) so that the trainer can call scoring in controller mode.
Requests are sharded across workers.
Results are merged in input order.
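
The shard-then-merge behaviour described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the controller's real code: workers are modelled as plain callables, sharding is contiguous, and the names `shard` and `compute_logp_sharded` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(items, n_workers):
    """Split items into up to n_workers contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)
        if size:
            chunks.append(items[start:start + size])
        start += size
    return chunks


def compute_logp_sharded(trajectories, workers):
    """Score each chunk on its own worker, then concatenate in input order."""
    chunks = shard(trajectories, len(workers))
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(w, c) for w, c in zip(workers, chunks)]
        # Collect results in submission order, not completion order.
        results = [f.result() for f in futures]
    return [r for chunk in results for r in chunk]


def toy_worker(chunk):
    """Stand-in for a remote worker: returns the sum of each trajectory."""
    return [sum(t) for t in chunk]


merged = compute_logp_sharded([[1], [2, 3], [4], [5, 6], [7]], [toy_worker] * 3)
```

Because the chunks are contiguous and results are read back in submission order, the merged output lines up with the input without any index bookkeeping.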

5) Trainer integration

RLTrainer now supports dedicated teacher rollout initialization via _init_teacher_rollout(...).
Training loop now consumes:
- teacher.compute_logp(rollout_batch)
Attaches:
- teacher_logp
- rl_loss_weight
- distill_loss_weight

Added compatibility guards:
- Ensure teacher teardown in close()
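
The trainer-side step of attaching teacher outputs to the batch might look roughly like this. Only the three attached key names and the `teacher.compute_logp(rollout_batch)` call come from the PR; the dict-shaped batch, the `attach_teacher_outputs` helper, and the `FakeTeacher` class are hypothetical.

```python
def attach_teacher_outputs(rollout_batch, teacher, rl_loss_weight, distill_loss_weight):
    """Return a copy of the batch with teacher log-probs and loss weights added."""
    batch = dict(rollout_batch)  # shallow copy so the caller's batch is untouched
    batch["teacher_logp"] = teacher.compute_logp(rollout_batch)
    batch["rl_loss_weight"] = rl_loss_weight
    batch["distill_loss_weight"] = distill_loss_weight
    return batch


class FakeTeacher:
    """Stand-in teacher that returns a constant log-prob per token."""

    def compute_logp(self, rollout_batch):
        return [[-1.0] * len(ids) for ids in rollout_batch["input_ids"]]


rollout_batch = {"input_ids": [[1, 2, 3], [4, 5]]}
train_batch = attach_teacher_outputs(rollout_batch, FakeTeacher(), 1.0, 0.5)
```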

6) Docs / examples

Updated distillation example config to new schema:
- teacher.path
- teacher.rollout
Added the new plot to the results section.

gemini-code-assist

Code Review

This pull request introduces a rollout-based teacher engine type for inference-only teacher distillation using vLLM or SGLang, deprecating the legacy train-engine teacher path. It implements token log-probability computation across remote engines, controllers, and the trainer. The review feedback identifies a missing pipeline parallel size parameter in the SGLang teacher configuration and recommends adding defensive checks when parsing API responses from vLLM and SGLang servers to prevent potential runtime errors.

HwVanICI · 2026-05-28T21:54:26Z

Small changes to add:

We could add a warning when multiple engine types are given in TeacherConfig, to avoid confusion.
Previously we saw that prompt_logprobs=1 in vLLM may not always return the student token in the first index. Did you verify this? Maybe we can add a check to ensure the student token matches the prompt logprob ID.
Otherwise looks good to me.
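
The verification suggested above could be sketched as a lookup by token ID rather than by position. This is a hedged illustration: it assumes the response holds one entry per prompt token, with None for the first token and otherwise a mapping from token ID (as a string key after JSON decoding) to an object with a `logprob` field. That layout should be checked against the vLLM version in use, and the function name is hypothetical.

```python
def extract_prompt_logps(prompt_token_ids, prompt_logprobs):
    """Return the log-prob of each actual prompt token, looked up by token ID."""
    if len(prompt_token_ids) != len(prompt_logprobs):
        raise ValueError("length mismatch between tokens and prompt_logprobs")
    logps = []
    for pos, (tok, entry) in enumerate(zip(prompt_token_ids, prompt_logprobs)):
        if entry is None:  # assumed: the first token has no conditional log-prob
            continue
        # JSON object keys arrive as strings, so normalise them to int.
        by_id = {int(k): v for k, v in entry.items()}
        if tok not in by_id:
            raise KeyError(f"token {tok} missing from prompt_logprobs at position {pos}")
        logps.append(float(by_id[tok]["logprob"]))
    return logps


# At position 1 the top-ranked token (11) differs from the actual token (7),
# so taking the first entry would silently return the wrong log-prob.
response = [
    None,
    {"11": {"logprob": -0.5}, "7": {"logprob": -2.0}},
    {"9": {"logprob": -0.1}},
]
student_logps = extract_prompt_logps([3, 7, 9], response)
```

Raising on a missing token ID turns a silent misalignment into an explicit failure, which is the behaviour the review comment asks for.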

zahrayousefijamarani added 3 commits May 28, 2026 17:58

feat(distillation): add on-policy distillation using RolloutEngine

40ed831

fix: add build_score_request function to vllm_remote.py

b8c6a07

chore(pre-commit): apply formatting suggestions

3dc2665

zahrayousefijamarani requested review from CormickKneey, HwVanICI, PrometheusComing, fishcrap, garrett4wade, geshi001, guozhihao-224, nuzant, rchardx and sitabulaixizawaluduo as code owners May 28, 2026 19:39

gemini-code-assist Bot reviewed May 28, 2026

Comment thread areal/trainer/rl_trainer.py

Comment thread areal/engine/vllm_remote.py Outdated

Comment thread areal/engine/sglang_remote.py Outdated

fix: pass pp_size to SGLangConfig and add defensive checks for rollout responses

acf01d6

chore(config): warn when multiple teacher engine types are configured

1db1360

zahrayousefijamarani wants to merge 5 commits into areal-project:main from zahrayousefijamarani:on_policy_distillation

2 participants
