fix(ppo): exclude no-eos rows from reward normalization#1351
Open
haoyang9804 wants to merge 2 commits into
Open
fix(ppo): exclude no-eos rows from reward normalization#1351haoyang9804 wants to merge 2 commits into
haoyang9804 wants to merge 2 commits into
Conversation
Contributor
There was a problem hiding this comment.
Code Review
This pull request updates the PPO actor to support masking sequences without an EOS token during reward normalization and includes a new test case for this functionality. A review comment suggests using the max_seqlen variable instead of attn_mask.shape[1] for better consistency and readability within the _compute_advantages method.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When
mask_no_eos_with_zero=True, AReaL suppresses task rewards for generationsthat filled the response length without EOS. However,
PPOActor._compute_advantagesnormalizes scalar rewards before applying that no-EOS mask. A no-EOS row with an
outlier raw reward can therefore shift the normalized reward and advantage for
another valid EOS row, even though the no-EOS task reward is later zeroed.
This patch computes the no-EOS mask before reward normalization and passes it to
reward_norm, so rows that will not receive task reward also do not contribute toreward normalization statistics.
Concrete Minimal Example
Configuration:
Batch:
Expected invariant:
Before this patch, the no-EOS outlier changes the valid row:
{ "observed_advantage_diff": [[-49.5, 0.0, 0.0], [0.0, 0.0, 0.0]], "quiet_training_signal_bug": true }After this patch, the invariant holds:
{ "observed_advantage_diff": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], "quiet_training_signal_bug": false }Root Cause
The old order was:
So a no-EOS row was excluded from task reward assignment, but not from
normalization statistics. With batch mean normalization,
[1.0, 100.0]became[-49.5, 49.5]; the valid EOS row then received-49.5instead of matching theclean
[1.0, 1.0]case.Fix
The patch computes
seq_no_eos_maskbefore reward normalization. Whenmask_no_eos_with_zero=True, it passes~seq_no_eos_maskas the normalizationmask:
This preserves existing behavior when
mask_no_eos_with_zero=False.Validation Recipe
{ "bug_id": "AREAL-REWARD-NORM-NO-EOS-LEAK", "validation_mode": "actual_areal_ppo_actor_compute_advantages_boundary_hook", "hooked_boundary": "areal.trainer.ppo.actor.PPOActor._compute_advantages", "constructed_scenario": { "reward_norm": {"mean_level": "batch", "std_level": null}, "mask_no_eos_with_zero": true, "baseline_rewards": [1.0, 1.0], "outlier_rewards": [1.0, 100.0], "attention_mask": [[1, 1, 0], [1, 1, 1]], "loss_mask_before_roll": [[0, 1, 0], [0, 1, 1]], "row_1_is_no_eos": true }, "expected_invariant": "A no-EOS row whose task reward is zeroed must not affect reward normalization or valid EOS-row advantages." }Runner script:
Validation hook:
Validation
Results:
Hook output on unpatched main:
{ "observed_advantage_diff": [[-49.5, 0.0, 0.0], [0.0, 0.0, 0.0]], "quiet_training_signal_bug": true }Hook output on this branch:
{ "observed_advantage_diff": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], "quiet_training_signal_bug": false }Duplicate Check
BUG_FINDINGS.md: no existing AReaL entry for no-EOS rows contaminatingreward normalization before
mask_no_eos_with_zero.myARealbranches: checked reward, norm, and eos branch names; existingfix/norm-mask-invalid-values,fix/gspo-2d-masked-advantages, andfix/ppo-adv-mask-invalid-logprobscover different paths.duplicate.
areal-project/AReaLPRs/issues: searchedreward_norm mask_no_eos_with_zero no EOSandPPOActor reward_norm no eos advantages;no duplicate issue or PR found.
actor.py/docs/config references tomask_no_eos_with_zeroandreward_normappeared; no existing fix path wasfound.