[reward, worker] fix: guard reward_tensor index -1 when valid_response_length == 0 by donghyq · Pull Request #6508 · verl-project/verl

donghyq · 2026-05-27T18:33:23Z

What does this PR do?

Fixes two bugs in the reward managers triggered when valid_response_length == 0 (e.g. an aborted or zero-token generation).

Bug 1 — silent index -1 write (naive.py, prime.py, dapo.py)

All three reward managers contain:

reward_tensor[i, valid_response_length - 1] = reward

When valid_response_length == 0, this evaluates to reward_tensor[i, -1], which PyTorch silently treats as the last token position rather than raising an error. The reward is written to a wrong position, corrupting the training signal for that sample.

Fix: skip the reward assignment when valid_response_length == 0.

if valid_response_length > 0:
reward_tensor[i, valid_response_length - 1] = reward

Bug 2 — AttributeError when overlong_buffer_cfg=None (dapo.py)

DapoRewardManager.call accesses self.overlong_buffer_cfg.enable unconditionally. Since overlong_buffer_cfg=None is the default, this raises AttributeError: 'NoneType' object has no attribute 'enable' for any user who does not configure overlong penalty.

Fix: add a None guard consistent with the one already present in init.

if self.overlong_buffer_cfg is not None and self.overlong_buffer_cfg.enable:

Checklist Before Starting

Search for similar PRs: https://github.com/volcengine/verl/pulls?q=valid_response_length — no existing PR addresses these
two bugs.
PR title formatted as [reward, worker] fix: ... ✅

Tests

Added CPU-only regression tests in tests/workers/reward_manager/test_reward_manager_zero_length_on_cpu.py

covering all three managers:

Zero-length response → reward_tensor stays all-zero (bug 1 regression)
Last token position not written for zero-length response
Normal response still receives reward at the correct position (regression)
Mixed batch (zero-length + normal) — samples do not interfere
overlong_buffer_cfg=None does not raise AttributeError (bug 2 regression)

API and Usage Example

No API changes. The fix is purely defensive — behavior is unchanged for valid inputs.

Design & Code Changes

verl/workers/reward_manager/naive.py: wrap reward assignment in if valid_response_length > 0
verl/workers/reward_manager/prime.py: same guard for batched path
verl/workers/reward_manager/dapo.py: same guard + add self.overlong_buffer_cfg is not None and before .enable
tests/workers/reward_manager/test_reward_manager_zero_length_on_cpu.py: new CPU regression test file

Checklist Before Submitting

Read the Contribute Guide.
Applied pre-commit checks — all passed.
Documentation: no doc changes needed (bug fix only).
Unit tests added (test_reward_manager_zero_length_on_cpu.py), picked up automatically by cpu_unit_tests CI.
Send message in ci-request channel / Feishu group after PR is open.
Not related to recipe submodule.

…e_length == 0 When a generation is aborted or truncated to zero tokens, valid_response_length is 0 and reward_tensor[i, valid_response_length - 1] silently writes the reward to index -1 (last token), corrupting training signal for neighbouring tokens.

gemini-code-assist

Code Review

This pull request addresses a bug where an empty or aborted generation (resulting in a valid response length of 0) causes silent reward tensor corruption by writing to index -1. The fix guards reward assignments in dapo.py, naive.py, and prime.py with a check for valid_response_length > 0. Additionally, it prevents an AttributeError in dapo.py when overlong_buffer_cfg is None and introduces regression tests. The reviewer's feedback suggests optimizing the loop in prime.py by caching the .item() call of the tensor to avoid redundant host-device synchronization.

gemini-code-assist · 2026-05-27T18:35:32Z

+            if valid_response_length[i].item() > 0:
+                reward_tensor[i, valid_response_length[i].item() - 1] = scores[i]


Calling .item() on a GPU tensor inside a loop triggers host-device synchronization, which blocks the CPU and can significantly degrade performance. Since valid_response_length[i].item() is used twice here, we should retrieve it once and store it in a local variable to minimize synchronization overhead.

Suggested change

if valid_response_length[i].item() > 0:

reward_tensor[i, valid_response_length[i].item() - 1] = scores[i]

val_len = valid_response_length[i].item()

if val_len > 0:

reward_tensor[i, val_len - 1] = scores[i]

CLAassistant · 2026-05-27T18:36:14Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

donghanyu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

gemini-code-assist Bot reviewed May 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[reward, worker] fix: guard reward_tensor index -1 when valid_response_length == 0#6508

[reward, worker] fix: guard reward_tensor index -1 when valid_response_length == 0#6508
donghyq wants to merge 1 commit into
verl-project:mainfrom
donghyq:fix/reward-manager-zero-length-index

donghyq commented May 27, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 27, 2026

Uh oh!

CLAassistant commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if valid_response_length[i].item() > 0:
		reward_tensor[i, valid_response_length[i].item() - 1] = scores[i]

-            if valid_response_length[i].item() > 0:
-                reward_tensor[i, valid_response_length[i].item() - 1] = scores[i]
+            val_len = valid_response_length[i].item()
+            if val_len > 0:
+                reward_tensor[i, val_len - 1] = scores[i]

Conversation

donghyq commented May 27, 2026

What does this PR do?

Checklist Before Starting

Tests

API and Usage Example

Checklist Before Submitting

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

CLAassistant commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants