Skip to content

[algo] fix rollout rejected rows in group advantages#6452

Open
haoyang9804 wants to merge 1 commit into
verl-project:mainfrom
haoyang9804:fix/rollout-rs-group-advantage-mask
Open

[algo] fix rollout rejected rows in group advantages#6452
haoyang9804 wants to merge 1 commit into
verl-project:mainfrom
haoyang9804:fix/rollout-rs-group-advantage-mask

Conversation

@haoyang9804
Copy link
Copy Markdown
Contributor

@haoyang9804 haoyang9804 commented May 23, 2026

What does this PR do?

This is a reopen of both #6375. It fixes a quiet GRPO advantage leak when rollout rejection sampling fully rejects one response in a same-prompt group. Different from #6375, this PR uses a more general and understandable patch to fix this bug.

The bug is triggered when all of the following are true:

  • algorithm.adv_estimator=grpo
  • rollout correction rejection sampling is enabled, for example algorithm.rollout_correction.rollout_rs=seq_sum_k1
  • the rollout batch contains multiple responses with the same uid
  • one response is fully rejected, so its response_mask becomes all zero
  • that rejected response still has non-zero token_level_rewards

In this case, the rejected response no longer trains itself, but it can still change the GRPO baseline for accepted sibling responses from the same prompt.

How is this bug introduced?

The implementation of GRPO uses scores = token_level_rewards.sum(dim=-1) to aggregate token-level rewards to response-level reward. Then it calculates advantage:

# compute_grpo_outcome_advantage
bsz = scores.shape[0]
for i in range(bsz):
    id2score[index[i]].append(scores[i])
for idx in id2score:
    if len(id2score[idx]) == 1:
        id2mean[idx] = torch.tensor(0.0)
        id2std[idx] = torch.tensor(1.0)
    elif len(id2score[idx]) > 1:
        scores_tensor = torch.stack(id2score[idx])
        id2mean[idx] = torch.mean(scores_tensor)
        id2std[idx] = torch.std(scores_tensor)
    else:
        raise ValueError(f"no score in prompt index: {idx}")
for i in range(bsz):
    if norm_adv_by_std_in_grpo:
        scores[i] = (scores[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
    else:
        scores[i] = scores[i] - id2mean[index[i]]
scores = scores.unsqueeze(-1) * response_mask

where the last statement scores = scores.unsqueeze(-1) * response_mask masks off tokens.

In SeparateRayPPOTrainer's fit_step, the code first calculates the reward (batch = self._fit_compute_reward(batch)), then calls self._fit_compute_advantage(batch), where compute_rollout_correction_and_add_to_batch may happen before compute_advantage, which finally calls compute_grpo_outcome_advantage.

compute_rollout_correction_and_add_to_batch performs rollout correction by rejecting stale tokens or responses. If it rejects a response, it masks off all tokens from it, resulting a all-zero tensor.

However, there is a gap between all-zero mask and removing it from policy update.
Suppose a response (or sequence) is totally masked off, scores = scores.unsqueeze(-1) * response_mask from compute_grpo_outcome_advantage turns its score to all zeros, but it still influences the in-group advantage in the following code:

if norm_adv_by_std_in_grpo:
    scores[i] = (scores[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
else:
    scores[i] = scores[i] - id2mean[index[i]]

In other words, its reward still changes other responses' advantages though it has been dropped off by compute_grpo_outcome_advantage.

This bug is not excludedly owned by GRPO. Outcome-style group estimators (GRPO,vectorized GRPO,PassK,RF++ baseline,RLOO,OPO,GPG,vectorized RLOO) all have the same bug. This PR is trying to fix them all.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces logic to handle fully masked or rejected responses across various outcome advantage estimators in the PPO trainer. It adds a helper function _valid_response_rows to identify valid sequences and updates estimators like GRPO, RLOO, and others to exclude invalid samples from group statistics calculations. Additionally, comprehensive tests were added to verify that masked sequences do not affect advantage computations. Feedback highlights a behavioral inconsistency between the non-vectorized and vectorized GRPO implementations when a group contains only one valid response, suggesting that the non-vectorized version should use the sample's own score as a baseline to ensure consistency and stability.

Comment on lines 323 to 331
if len(id2score[idx]) == 1:
id2mean[idx] = torch.tensor(0.0)
id2std[idx] = torch.tensor(1.0)
id2mean[idx] = scores.new_tensor(0.0)
id2std[idx] = scores.new_tensor(1.0)
elif len(id2score[idx]) > 1:
scores_tensor = torch.stack(id2score[idx])
id2mean[idx] = torch.mean(scores_tensor)
id2std[idx] = torch.std(scores_tensor)
else:
raise ValueError(f"no score in prompt index: {idx}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a behavioral inconsistency between the non-vectorized and vectorized implementations of GRPO when a group contains only one valid response (either because the group size is 1 or because other members were rejected).

In this non-vectorized implementation, if len(id2score[idx]) == 1, the baseline is set to 0 and std to 1, resulting in an advantage equal to the raw score. However, in the vectorized version (compute_grpo_vectorized_outcome_advantage), group_mean_std will compute the mean as the score itself, leading to an advantage of 0 (since score - mean = 0).

For group-based estimators, an advantage of 0 is generally more appropriate when no comparison samples are available. To ensure consistency and avoid training instability from un-baselined single samples, the non-vectorized version should also use the sample's own score as the baseline.

        for idx in id2score:
            scores_tensor = torch.stack(id2score[idx])
            id2mean[idx] = torch.mean(scores_tensor)
            if len(id2score[idx]) > 1:
                id2std[idx] = torch.std(scores_tensor)
            else:
                id2std[idx] = scores.new_tensor(0.0)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before my patch, non-vectorized and vectorized implementations of GRPO is inconsistent. I'm not sure changing it in this pr is suitable or not. @wuxibin89 Hi Xibin, what do you think of this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant