fix(trainer): skip controller-side CUDA sync in single-controller mode by Adiactive · Pull Request #1377 · areal-project/AReaL

Adiactive · 2026-05-29T23:45:07Z

Description

In single-controller mode the trainer process is a pure orchestrator: it issues RPCs to engine workers and holds no model or local GPU work. RLTrainer nevertheless calls current_platform.synchronize() unconditionally after the cpu_group barriers in _save_hf, _save_recover_checkpoint, _evaluate_fn, _evaluate, and _export_and_commit_stats. On a controller node that has GPUs, this forces a CUDA context onto the controller process. When that node's GPUs are already occupied by other jobs, the call fails with "CUDA error: out of memory" at torch.cuda.synchronize() and aborts training during the first save/eval.

This PR guards all five controller-side syncs with not is_single_controller(), so they remain active in SPMD mode (where the trainer process is itself a GPU worker that must flush its own device) but are skipped in single-controller mode. The dist.barrier(group=self.actor.cpu_group) calls are unchanged, preserving cross-worker coordination; only the gratuitous controller-side CUDA sync is removed (workers synchronize their own devices via RPC).
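The guard pattern described above can be sketched roughly as follows. This is an illustrative stand-in, not the code in areal/trainer/rl_trainer.py: FakePlatform and barrier_then_maybe_sync are hypothetical names, the barrier is passed in as a plain callable instead of dist.barrier(group=self.actor.cpu_group), and the mode is passed as a boolean rather than read from is_single_controller().

```python
from typing import Callable


class FakePlatform:
    """Hypothetical stand-in for current_platform.

    It counts synchronize() calls instead of touching a GPU, so the
    sketch runs on a machine without CUDA.
    """

    def __init__(self) -> None:
        self.sync_calls = 0

    def synchronize(self) -> None:
        self.sync_calls += 1


def barrier_then_maybe_sync(
    barrier: Callable[[], None],
    platform: FakePlatform,
    single_controller: bool,
) -> None:
    """Always run the barrier; sync the device only outside single-controller mode."""
    # The barrier is kept unconditionally to preserve cross-worker coordination.
    barrier()
    # In SPMD mode the trainer process is itself a GPU worker and must flush
    # its own device. In single-controller mode the controller holds no GPU
    # work, so the sync is skipped and no CUDA context is created.
    if not single_controller:
        platform.synchronize()
```

Under these assumptions, calling the helper with single_controller=True runs only the barrier, while single_controller=False runs the barrier and then one device sync.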

Related Issue

N/A

Type of Change

🐛 Bug fix

Checklist

I have read the Contributing Guide
Pre-commit hooks pass (pre-commit run --all-files)
Relevant tests pass; new tests added for new functionality
Documentation updated (if applicable; built with ./docs/build_all.sh)
Branch is up to date with main
Self-reviewed via /review-pr command
This PR was created by a coding agent via /create-pr
This PR is a breaking change

gemini-code-assist

Code Review

This pull request modifies areal/trainer/rl_trainer.py to conditionally execute current_platform.synchronize() only when is_single_controller() is false. This check has been added across several methods, including checkpoint saving, evaluation, and statistics exporting. There are no review comments to address, and I have no additional feedback to provide.

fix(trainer): skip controller-side CUDA sync in single-controller mode

6f44c9c

Adiactive requested review from fishcrap, garrett4wade and rchardx as code owners May 29, 2026 23:45

gemini-code-assist Bot reviewed May 29, 2026

Adiactive wants to merge 1 commit into areal-project:main from Adiactive:fix/controller-skip-cuda-sync
