[algo] fix rollout rejected rows in group advantages by haoyang9804 · Pull Request #6452 · verl-project/verl

haoyang9804 · 2026-05-23T09:58:10Z

What does this PR do?

This is a reopen of both #6375. It fixes a quiet GRPO advantage leak when rollout rejection sampling fully rejects one response in a same-prompt group. Different from #6375, this PR uses a more general and understandable patch to fix this bug.

The bug is triggered when all of the following are true:

algorithm.adv_estimator=grpo
rollout correction rejection sampling is enabled, for example algorithm.rollout_correction.rollout_rs=seq_sum_k1
the rollout batch contains multiple responses with the same uid
one response is fully rejected, so its response_mask becomes all zero
that rejected response still has non-zero token_level_rewards

In this case, the rejected response no longer trains itself, but it can still change the GRPO baseline for accepted sibling responses from the same prompt.

How is this bug introduced?

The implementation of GRPO uses scores = token_level_rewards.sum(dim=-1) to aggregate token-level rewards to response-level reward. Then it calculates advantage:

# compute_grpo_outcome_advantage
bsz = scores.shape[0]
for i in range(bsz):
    id2score[index[i]].append(scores[i])
for idx in id2score:
    if len(id2score[idx]) == 1:
        id2mean[idx] = torch.tensor(0.0)
        id2std[idx] = torch.tensor(1.0)
    elif len(id2score[idx]) > 1:
        scores_tensor = torch.stack(id2score[idx])
        id2mean[idx] = torch.mean(scores_tensor)
        id2std[idx] = torch.std(scores_tensor)
    else:
        raise ValueError(f"no score in prompt index: {idx}")
for i in range(bsz):
    if norm_adv_by_std_in_grpo:
        scores[i] = (scores[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
    else:
        scores[i] = scores[i] - id2mean[index[i]]
scores = scores.unsqueeze(-1) * response_mask

where the last statement scores = scores.unsqueeze(-1) * response_mask masks off tokens.

In SeparateRayPPOTrainer's fit_step, the code first calculates the reward (batch = self._fit_compute_reward(batch)), then calls self._fit_compute_advantage(batch), where compute_rollout_correction_and_add_to_batch may happen before compute_advantage, which finally calls compute_grpo_outcome_advantage.

compute_rollout_correction_and_add_to_batch performs rollout correction by rejecting stale tokens or responses. If it rejects a response, it masks off all tokens from it, resulting a all-zero tensor.

However, there is a gap between all-zero mask and removing it from policy update.
Suppose a response (or sequence) is totally masked off, scores = scores.unsqueeze(-1) * response_mask from compute_grpo_outcome_advantage turns its score to all zeros, but it still influences the in-group advantage in the following code:

if norm_adv_by_std_in_grpo:
    scores[i] = (scores[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
else:
    scores[i] = scores[i] - id2mean[index[i]]

In other words, its reward still changes other responses' advantages though it has been dropped off by compute_grpo_outcome_advantage.

This bug is not excludedly owned by GRPO. Outcome-style group estimators (GRPO,vectorized GRPO,PassK,RF++ baseline,RLOO,OPO,GPG,vectorized RLOO) all have the same bug. This PR is trying to fix them all.

gemini-code-assist

Code Review

This pull request introduces logic to handle fully masked or rejected responses across various outcome advantage estimators in the PPO trainer. It adds a helper function _valid_response_rows to identify valid sequences and updates estimators like GRPO, RLOO, and others to exclude invalid samples from group statistics calculations. Additionally, comprehensive tests were added to verify that masked sequences do not affect advantage computations. Feedback highlights a behavioral inconsistency between the non-vectorized and vectorized GRPO implementations when a group contains only one valid response, suggesting that the non-vectorized version should use the sample's own score as a baseline to ensure consistency and stability.

gemini-code-assist · 2026-05-23T10:04:58Z

            if len(id2score[idx]) == 1:
-                id2mean[idx] = torch.tensor(0.0)
-                id2std[idx] = torch.tensor(1.0)
+                id2mean[idx] = scores.new_tensor(0.0)
+                id2std[idx] = scores.new_tensor(1.0)
            elif len(id2score[idx]) > 1:
                scores_tensor = torch.stack(id2score[idx])
                id2mean[idx] = torch.mean(scores_tensor)
                id2std[idx] = torch.std(scores_tensor)
            else:
                raise ValueError(f"no score in prompt index: {idx}")


There is a behavioral inconsistency between the non-vectorized and vectorized implementations of GRPO when a group contains only one valid response (either because the group size is 1 or because other members were rejected).

In this non-vectorized implementation, if len(id2score[idx]) == 1, the baseline is set to 0 and std to 1, resulting in an advantage equal to the raw score. However, in the vectorized version (compute_grpo_vectorized_outcome_advantage), group_mean_std will compute the mean as the score itself, leading to an advantage of 0 (since score - mean = 0).

For group-based estimators, an advantage of 0 is generally more appropriate when no comparison samples are available. To ensure consistency and avoid training instability from un-baselined single samples, the non-vectorized version should also use the sample's own score as the baseline.

for idx in id2score: scores_tensor = torch.stack(id2score[idx]) id2mean[idx] = torch.mean(scores_tensor) if len(id2score[idx]) > 1: id2std[idx] = torch.std(scores_tensor) else: id2std[idx] = scores.new_tensor(0.0)

Before my patch, non-vectorized and vectorized implementations of GRPO is inconsistent. I'm not sure changing it in this pr is suitable or not. @wuxibin89 Hi Xibin, what do you think of this issue

[algo] fix rollout rejected rows in group advantages

265c924

haoyang9804 requested review from PeterSH6, eric-haibin-lin, tongyx361 and vermouth1992 as code owners May 23, 2026 09:58

gemini-code-assist Bot reviewed May 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[algo] fix rollout rejected rows in group advantages#6452

[algo] fix rollout rejected rows in group advantages#6452
haoyang9804 wants to merge 1 commit into
verl-project:mainfrom
haoyang9804:fix/rollout-rs-group-advantage-mask

haoyang9804 commented May 23, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 23, 2026

Uh oh!

haoyang9804 May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

haoyang9804 commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

How is this bug introduced?

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

haoyang9804 May 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

haoyang9804 commented May 23, 2026 •

edited

Loading