Skip to content

[reward, worker] fix: guard reward_tensor index -1 when valid_response_length == 0#6508

Open
donghyq wants to merge 1 commit into
verl-project:mainfrom
donghyq:fix/reward-manager-zero-length-index
Open

[reward, worker] fix: guard reward_tensor index -1 when valid_response_length == 0#6508
donghyq wants to merge 1 commit into
verl-project:mainfrom
donghyq:fix/reward-manager-zero-length-index

Conversation

@donghyq
Copy link
Copy Markdown

@donghyq donghyq commented May 27, 2026

What does this PR do?

Fixes #6476

Fixes two bugs in the reward managers triggered when valid_response_length == 0 (e.g. an aborted or zero-token generation).

Bug 1 — silent index -1 write (naive.py, prime.py, dapo.py)

All three reward managers contain:

reward_tensor[i, valid_response_length - 1] = reward

When valid_response_length == 0, this evaluates to reward_tensor[i, -1], which PyTorch silently treats as the last token position rather than raising an error. The reward is written to a wrong position, corrupting the training signal for that sample.

Fix: skip the reward assignment when valid_response_length == 0.

if valid_response_length > 0:
reward_tensor[i, valid_response_length - 1] = reward

Bug 2 — AttributeError when overlong_buffer_cfg=None (dapo.py)

DapoRewardManager.call accesses self.overlong_buffer_cfg.enable unconditionally. Since overlong_buffer_cfg=None is the default, this raises AttributeError: 'NoneType' object has no attribute 'enable' for any user who does not configure overlong penalty.

Fix: add a None guard consistent with the one already present in init.

if self.overlong_buffer_cfg is not None and self.overlong_buffer_cfg.enable:

Checklist Before Starting

Tests

Added CPU-only regression tests in tests/workers/reward_manager/test_reward_manager_zero_length_on_cpu.py

covering all three managers:

  • Zero-length response → reward_tensor stays all-zero (bug 1 regression)
  • Last token position not written for zero-length response
  • Normal response still receives reward at the correct position (regression)
  • Mixed batch (zero-length + normal) — samples do not interfere
  • overlong_buffer_cfg=None does not raise AttributeError (bug 2 regression)

API and Usage Example

No API changes. The fix is purely defensive — behavior is unchanged for valid inputs.

Design & Code Changes

  • verl/workers/reward_manager/naive.py: wrap reward assignment in if valid_response_length > 0
  • verl/workers/reward_manager/prime.py: same guard for batched path
  • verl/workers/reward_manager/dapo.py: same guard + add self.overlong_buffer_cfg is not None and before .enable
  • tests/workers/reward_manager/test_reward_manager_zero_length_on_cpu.py: new CPU regression test file

Checklist Before Submitting

  • Read the Contribute Guide.
  • Applied pre-commit checks — all passed.
  • Documentation: no doc changes needed (bug fix only).
  • Unit tests added (test_reward_manager_zero_length_on_cpu.py), picked up automatically by cpu_unit_tests CI.
  • Send message in ci-request channel / Feishu group after PR is open.
  • Not related to recipe submodule.

…e_length == 0

  When a generation is aborted or truncated to zero tokens,
  valid_response_length is 0 and reward_tensor[i, valid_response_length - 1]
  silently writes the reward to index -1 (last token), corrupting training
  signal for neighbouring tokens.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a bug where an empty or aborted generation (resulting in a valid response length of 0) causes silent reward tensor corruption by writing to index -1. The fix guards reward assignments in dapo.py, naive.py, and prime.py with a check for valid_response_length > 0. Additionally, it prevents an AttributeError in dapo.py when overlong_buffer_cfg is None and introduces regression tests. The reviewer's feedback suggests optimizing the loop in prime.py by caching the .item() call of the tensor to avoid redundant host-device synchronization.

Comment on lines +177 to +178
if valid_response_length[i].item() > 0:
reward_tensor[i, valid_response_length[i].item() - 1] = scores[i]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Calling .item() on a GPU tensor inside a loop triggers host-device synchronization, which blocks the CPU and can significantly degrade performance. Since valid_response_length[i].item() is used twice here, we should retrieve it once and store it in a local variable to minimize synchronization overhead.

Suggested change
if valid_response_length[i].item() > 0:
reward_tensor[i, valid_response_length[i].item() - 1] = scores[i]
val_len = valid_response_length[i].item()
if val_len > 0:
reward_tensor[i, val_len - 1] = scores[i]

@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


donghanyu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Reward manager writes reward to index -1 when valid_response_length == 0

2 participants