fix(trainer): skip controller-side CUDA sync in single-controller mode #1377
Open
Adiactive wants to merge 1 commit into
Contributor
Code Review
This pull request modifies areal/trainer/rl_trainer.py to conditionally execute current_platform.synchronize() only when is_single_controller() is false. This check has been added across several methods, including checkpoint saving, evaluation, and statistics exporting. There are no review comments to address, and I have no additional feedback to provide.
Description
In single-controller mode the trainer process is a pure orchestrator that issues RPCs to engine workers and holds no model or local GPU work, yet `RLTrainer` still calls `current_platform.synchronize()` unconditionally after the `cpu_group` barriers in `_save_hf`, `_save_recover_checkpoint`, `_evaluate_fn`, `_evaluate`, and `_export_and_commit_stats`. On a controller node that has GPUs, this forces a CUDA context onto the controller process; when that node's GPUs are already occupied by other jobs, the call fails with `CUDA error: out of memory` at `torch.cuda.synchronize()` and aborts training during the first save/eval.

This PR guards all five controller-side syncs with `not is_single_controller()`, so they remain active in SPMD mode (where the trainer process is itself a GPU worker that must flush its own device) but are skipped in single-controller mode. The `dist.barrier(group=self.actor.cpu_group)` calls are unchanged, preserving cross-worker coordination; only the gratuitous controller-side CUDA sync is removed (workers synchronize their own devices via RPC).

Related Issue
N/A
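For reviewers, the guard described in the Description can be sketched roughly as follows. This is an illustrative stand-in, not the code in `areal/trainer/rl_trainer.py`: the helper name `barrier_then_sync` is hypothetical, and the two callables stand in for `dist.barrier(group=self.actor.cpu_group)` and `current_platform.synchronize()` so the snippet runs without torch or a GPU.

```python
from typing import Callable, List


def barrier_then_sync(
    barrier: Callable[[], None],
    synchronize: Callable[[], None],
    single_controller: bool,
) -> None:
    """Hypothetical helper: always run the barrier, sync the device only in SPMD mode."""
    # Cross-worker coordination is kept in both modes.
    barrier()
    # In single-controller mode the process is an orchestrator with no local
    # GPU work, so the device sync (and the CUDA context it creates) is skipped.
    if not single_controller:
        synchronize()


# Stand-ins that record calls instead of touching torch.distributed or CUDA.
calls: List[str] = []

barrier_then_sync(
    barrier=lambda: calls.append("barrier"),
    synchronize=lambda: calls.append("sync"),
    single_controller=True,
)
single_controller_calls = list(calls)

calls.clear()
barrier_then_sync(
    barrier=lambda: calls.append("barrier"),
    synchronize=lambda: calls.append("sync"),
    single_controller=False,
)
spmd_calls = list(calls)
```

In the single-controller run only the barrier is recorded; in the SPMD run the barrier is followed by the sync, which matches the behavior the PR describes for the five guarded call sites.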