Checklist
Background
During RL training, the rollout phase dominates wall-clock time and resource consumption (inference servers, network communication, reward computation). When developers need to iterate on training logic (advantage computation, PPO loss, reward normalization, etc.), re-running full rollouts each time is prohibitively expensive.
Additionally, when a training issue is observed at step N, reproducing the exact conditions requires re-running the entire experiment up to that point. Without a way to capture and replay the exact rollout data, debugging becomes slow and non-deterministic.
We need a trajectory record/replay mechanism with two modes:
- Dump mode: Serialize the complete rollout batch (token IDs, loss masks, logprobs, rewards, etc.) to disk at each training step during normal training.
- Replay mode: Skip rollout and inference engine initialization entirely, loading previously recorded batches from disk to drive the training loop.
This enables:
- Deterministic reproduction: Bugs observed at a specific step can be reliably reproduced by replaying the exact same input data, eliminating non-determinism from rollout.
- Efficient debugging: Isolate training-side issues from rollout-side issues by holding the input data constant.
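The two modes described above can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the helper names `dump_batch` and `replay_batches` and the `step_NNNNNN` file naming are hypothetical, and `pickle` is used as a stand-in serializer so the sketch runs without PyTorch (the proposal itself targets .pt files, which would presumably go through `torch.save` and `torch.load`).

```python
import pickle
import tempfile
from pathlib import Path


def dump_batch(batch: dict, dump_dir: Path, step: int) -> Path:
    """Dump mode: write one step's rollout batch to disk (hypothetical naming)."""
    dump_dir.mkdir(parents=True, exist_ok=True)
    path = dump_dir / f"step_{step:06d}.pkl"
    with path.open("wb") as f:
        pickle.dump(batch, f)  # stand-in for torch.save(batch, path)
    return path


def replay_batches(dump_dir: Path):
    """Replay mode: yield (step, batch) in step order; no rollout is involved."""
    # Zero-padded step numbers make lexicographic order match step order.
    for path in sorted(dump_dir.glob("step_*.pkl")):
        step = int(path.stem.split("_")[1])
        with path.open("rb") as f:
            yield step, pickle.load(f)


def demo() -> list:
    """Record three fake batches, then read them back as a training loop would."""
    with tempfile.TemporaryDirectory() as tmp:
        dump_dir = Path(tmp)
        for step in range(3):
            batch = {
                "input_ids": [step, step + 1],
                "loss_mask": [0, 1],
                "rewards": [0.5 * step],
            }
            dump_batch(batch, dump_dir, step)
        return list(replay_batches(dump_dir))
```

In replay mode the training loop would iterate over `replay_batches(...)` instead of calling the rollout engine, so every run sees byte-identical inputs for a given step.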
Potential Solution
Add a DebugConfig to PPOConfig with two mutually exclusive flags and an optional path:
- dump_rollout_data: Record each step's full tensor batch to disk as .pt files.
- replay_rollout_data: Load batches from disk, bypassing rollout and the inference engine entirely.
- path: Optional custom directory for dump/replay files.
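One possible shape for this config is sketched below. It assumes plain dataclasses, a `debug` attribute name on PPOConfig, `False`/`None` defaults, and a `ValueError` on conflicting flags; none of these details are specified above, and the `PPOConfig` shown is only a stand-in with the new field.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DebugConfig:
    # Record each step's full batch to disk during normal training.
    dump_rollout_data: bool = False
    # Load batches from disk instead of running rollout.
    replay_rollout_data: bool = False
    # Optional custom directory for dump/replay files (default is an assumption).
    path: Optional[str] = None

    def __post_init__(self) -> None:
        # Enforce that the two modes are never enabled together.
        if self.dump_rollout_data and self.replay_rollout_data:
            raise ValueError(
                "dump_rollout_data and replay_rollout_data are mutually exclusive"
            )


@dataclass
class PPOConfig:
    # Stand-in for the real PPOConfig; only the proposed field is shown.
    # The attribute name `debug` is an assumption.
    debug: DebugConfig = field(default_factory=DebugConfig)
```

Validating in `__post_init__` rejects a conflicting configuration at load time, before any inference engine is initialized.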