[Feature] Use RolloutEngine as distillation teacher to reduce GPU memory vs TrainEngine teacher #1367

@zahrayousefijamarani

Checklist

  • This feature will maintain backward compatibility with the current APIs in
    areal/api/. If not, please raise a refactor issue first.

Background

In on-policy distillation, the teacher is used only for log-prob scoring (teacher_logp), so it needs no optimizer, gradients, or training state.
Instantiating a full TrainEngine for the teacher is memory-heavy and increases resource pressure.
We want a teacher path based on RolloutEngine / InferenceEngine to reduce GPU memory usage.

Current distillation workflows instantiate the teacher as a train-style engine, which can allocate training components (optimizer states, train-time buffers, etc.) that an inference-only teacher never uses.
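To put a rough number on the overhead, here is a back-of-the-envelope sketch. It assumes mixed-precision Adam training (bf16 weights and gradients plus fp32 master weights and two fp32 Adam moments) versus bf16 weights only for inference. These byte counts are an assumption about a typical setup, not measurements from AReaL, and they ignore activations, KV cache, sharding, and offloading.

```python
# Assumed per-parameter byte counts (typical mixed-precision Adam; not measured in AReaL).
BYTES_WEIGHTS_BF16 = 2
BYTES_GRADS_BF16 = 2
BYTES_FP32_MASTER = 4
BYTES_ADAM_MOMENT_1 = 4
BYTES_ADAM_MOMENT_2 = 4

GIB = 2**30


def train_state_gib(num_params: float) -> float:
    """Weights + gradients + optimizer state for a train-style engine."""
    per_param = (
        BYTES_WEIGHTS_BF16
        + BYTES_GRADS_BF16
        + BYTES_FP32_MASTER
        + BYTES_ADAM_MOMENT_1
        + BYTES_ADAM_MOMENT_2
    )
    return num_params * per_param / GIB


def inference_state_gib(num_params: float) -> float:
    """Weights only, as an inference-only teacher would hold."""
    return num_params * BYTES_WEIGHTS_BF16 / GIB


if __name__ == "__main__":
    n = 7e9  # hypothetical 7B-parameter teacher
    print(f"train-style: {train_state_gib(n):.1f} GiB")
    print(f"inference:   {inference_state_gib(n):.1f} GiB")
```

Under these assumptions the model state of a train-style teacher is 8x that of an inference-only one before any sharding is applied; actual savings depend on the backend and parallelism strategy.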

Potential Solution

  • The teacher is configured as an inference rollout engine (vLLM/SGLang).
  • RLTrainer calls teacher.compute_logp(...) on rollout batches.
  • The teacher model path/config is independent of the actor rollout model path.
  • The teacher lifecycle uses rollout/controller semantics (init/offload/onload/destroy) without train-engine overhead.
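The flow above could look roughly like the following sketch. Only the names compute_logp, onload, and offload come from this proposal; the TeacherScorer protocol, the ToyTeacher stand-in, the batch layout, and all signatures are hypothetical and are not the existing AReaL API. A real implementation would delegate to a vLLM/SGLang-backed engine.

```python
import math
from contextlib import contextmanager
from typing import Protocol


class TeacherScorer(Protocol):
    """Hypothetical interface: an inference-only teacher that scores tokens."""

    def onload(self) -> None: ...
    def offload(self) -> None: ...
    def compute_logp(self, token_ids: list[list[int]]) -> list[list[float]]: ...


class ToyTeacher:
    """Stand-in for an inference-backed teacher; uses a fixed unigram table."""

    def __init__(self, vocab_probs: dict[int, float]) -> None:
        self.vocab_probs = vocab_probs
        self.on_device = False

    def onload(self) -> None:
        self.on_device = True

    def offload(self) -> None:
        self.on_device = False

    def compute_logp(self, token_ids: list[list[int]]) -> list[list[float]]:
        if not self.on_device:
            raise RuntimeError("teacher must be onloaded before scoring")
        return [[math.log(self.vocab_probs[t]) for t in seq] for seq in token_ids]


@contextmanager
def loaded(teacher: TeacherScorer):
    """Keep the teacher on the GPU only while it is scoring."""
    teacher.onload()
    try:
        yield teacher
    finally:
        teacher.offload()


def attach_teacher_logp(teacher: TeacherScorer, batch: dict) -> dict:
    """Return a copy of the rollout batch with teacher_logp added."""
    with loaded(teacher) as t:
        logp = t.compute_logp(batch["input_ids"])
    return {**batch, "teacher_logp": logp}


if __name__ == "__main__":
    teacher = ToyTeacher({0: 0.5, 1: 0.25, 2: 0.25})
    scored = attach_teacher_logp(teacher, {"input_ids": [[0, 1], [2]]})
    print(scored["teacher_logp"])
```

Because the trainer depends only on the scoring interface, the teacher needs no optimizer or gradient buffers, and the offload step frees its memory between scoring passes.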

Benefits

  • Lower peak GPU memory for distillation runs.
  • Better stability on limited-memory hardware.
  • Better separation of concerns (teacher scoring vs student training).

Additional Information

Minimal config example

teacher:
  path: <teacher-model-path>
  rollout:
    backend: "vllm:d1p1t1"   # or sglang:d...
  offload: true
  rl_loss_weight: 1.0
  distill_loss_weight: 0.005
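For illustration, one way the two weights in this config might combine is sketched below. This issue does not define the formula, so both the weighted sum and the distillation term (the mean of student_logp - teacher_logp over sampled tokens, a single-sample reverse-KL estimate) are assumptions, and the function names are hypothetical.

```python
def distill_loss(
    student_logp: list[list[float]],
    teacher_logp: list[list[float]],
) -> float:
    """Mean per-token (student_logp - teacher_logp) over on-policy samples.

    Assumed form of the distillation term; not taken from the issue.
    """
    diffs = [
        s - t
        for s_seq, t_seq in zip(student_logp, teacher_logp)
        for s, t in zip(s_seq, t_seq)
    ]
    if not diffs:
        raise ValueError("no tokens to score")
    return sum(diffs) / len(diffs)


def total_loss(
    rl_loss: float,
    d_loss: float,
    rl_loss_weight: float = 1.0,
    distill_loss_weight: float = 0.005,
) -> float:
    """Weighted sum of the RL and distillation terms (assumed combination)."""
    return rl_loss_weight * rl_loss + distill_loss_weight * d_loss


if __name__ == "__main__":
    d = distill_loss([[-1.0, -2.0]], [[-1.5, -2.5]])
    print(total_loss(rl_loss=2.0, d_loss=d))
```

With the default weights from the config, the distillation term acts as a small regularizer on top of the RL objective.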
