fix(trainer): skip controller-side CUDA sync in single-controller mode #1377

Open
Adiactive wants to merge 1 commit into areal-project:main from Adiactive:fix/controller-skip-cuda-sync

@Adiactive Adiactive commented May 29, 2026

Description

In single-controller mode the trainer process is a pure orchestrator: it issues RPCs to engine workers and holds no model or local GPU work. RLTrainer nevertheless calls current_platform.synchronize() unconditionally after the cpu_group barriers in _save_hf, _save_recover_checkpoint, _evaluate_fn, _evaluate, and _export_and_commit_stats. On a controller node that has GPUs, this forces a CUDA context onto the controller process. When that node's GPUs are already occupied by other jobs, the call fails with "CUDA error: out of memory" at torch.cuda.synchronize() and aborts training during the first save/eval.

This PR guards all five controller-side syncs with not is_single_controller(), so they remain active in SPMD mode (where the trainer process is itself a GPU worker that must flush its own device) but are skipped in single-controller mode. The dist.barrier(group=self.actor.cpu_group) calls are unchanged, preserving cross-worker coordination; only the gratuitous controller-side CUDA sync is removed (workers synchronize their own devices via RPC).
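The guard described above can be sketched as follows. This is a minimal, self-contained illustration and not the actual diff: FakePlatform, SketchTrainer, _cpu_barrier, and _barrier_and_maybe_sync are hypothetical names, the platform object only counts calls instead of touching CUDA, and is_single_controller is modeled as a method here purely to keep the example runnable without the project installed.

```python
from dataclasses import dataclass, field


@dataclass
class FakePlatform:
    """Hypothetical stand-in for current_platform; counts syncs, never touches CUDA."""

    sync_calls: int = 0

    def synchronize(self) -> None:
        self.sync_calls += 1


@dataclass
class SketchTrainer:
    """Hypothetical, stripped-down trainer that shows only the guard pattern."""

    single_controller: bool
    platform: FakePlatform = field(default_factory=FakePlatform)
    barrier_calls: int = 0

    def is_single_controller(self) -> bool:
        # Assumption: modeled as a method here; the PR text calls it as a function.
        return self.single_controller

    def _cpu_barrier(self) -> None:
        # Stand-in for dist.barrier(group=self.actor.cpu_group); always runs.
        self.barrier_calls += 1

    def _barrier_and_maybe_sync(self) -> None:
        self._cpu_barrier()
        # Only a process that is itself a GPU worker (SPMD mode) flushes its device.
        if not self.is_single_controller():
            self.platform.synchronize()

    def save_hf(self) -> None:
        # The real method would trigger the checkpoint save before this point.
        self._barrier_and_maybe_sync()


spmd = SketchTrainer(single_controller=False)
spmd.save_hf()

controller = SketchTrainer(single_controller=True)
controller.save_hf()
```

In this sketch the SPMD trainer performs one barrier and one device sync, while the single-controller trainer performs the barrier only, so the controller process never needs a CUDA context.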

Related Issue

N/A

Type of Change

  • 🐛 Bug fix

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request modifies areal/trainer/rl_trainer.py to conditionally execute current_platform.synchronize() only when is_single_controller() is false. This check has been added across several methods, including checkpoint saving, evaluation, and statistics exporting. There are no review comments to address, and I have no additional feedback to provide.
