[reward_manager] fix: guard against empty responses and None overlong_buffer_cfg #6484

Open
imitater-dou wants to merge 2 commits into verl-project:main from imitater-dou:fix/dapo-overlong-buffer-none-check

@imitater-dou

Problem

Two silent bugs found via static analysis in verl/workers/reward_manager/:

Bug 1 — DAPORewardManager: AttributeError when overlong_buffer_cfg=None

__call__ accessed self.overlong_buffer_cfg.enable (line 121) without first checking for None. Since overlong_buffer_cfg defaults to None, any call without this argument crashes at runtime:

AttributeError: 'NoneType' object has no attribute 'enable'

The __init__ method correctly guards with if self.overlong_buffer_cfg is not None:, but __call__ did not.
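
The guard described above can be sketched in isolation. This is a minimal illustration, not the code in dapo.py: the class RewardManagerSketch, the method overlong_buffer_enabled, and the SimpleNamespace config are hypothetical stand-ins; only the attribute names overlong_buffer_cfg and enable come from the PR.

```python
from types import SimpleNamespace
from typing import Optional


class RewardManagerSketch:
    """Hypothetical stand-in for a reward manager; not verl's actual class."""

    def __init__(self, overlong_buffer_cfg: Optional[SimpleNamespace] = None):
        # Defaults to None, as described for DAPORewardManager.
        self.overlong_buffer_cfg = overlong_buffer_cfg

    def overlong_buffer_enabled(self) -> bool:
        cfg = self.overlong_buffer_cfg
        # Check for None before touching .enable, so a missing config
        # means "disabled" instead of raising AttributeError.
        return cfg is not None and bool(cfg.enable)


default_manager = RewardManagerSketch()
configured_manager = RewardManagerSketch(SimpleNamespace(enable=True))
```

With this shape, default_manager.overlong_buffer_enabled() returns False rather than raising, and the configured manager still reports the buffer as enabled.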

Bug 2 — All three managers: reward written to index -1 for empty responses

NaiveRewardManager, DAPORewardManager, and PrimeRewardManager all compute:

reward_tensor[i, valid_response_length - 1] = reward

When valid_response_length == 0 (aborted generation, fully-padded response), this becomes reward_tensor[i, -1] — Python/PyTorch silently writes the reward to the last position of the row, corrupting training signal with no error raised.

Related: PR #5613 fixes the same pattern in core_algos.py for GDPO. This PR addresses the same class of bug in the three reward_manager classes, which are separate code paths not covered by #5613.

Fix

  1. Add is not None guard before .enable access in dapo.py.
  2. Add if valid_response_length > 0: guard before reward placement in all three managers.
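
The effect of the second guard can be reproduced without PyTorch. The sketch below uses nested Python lists as a stand-in for reward_tensor, since lists follow the same negative-index convention; the function place_rewards and its guard flag are hypothetical names for illustration and do not appear in the PR.

```python
from typing import List


def place_rewards(
    lengths: List[int], rewards: List[float], width: int, guard: bool
) -> List[List[float]]:
    """Write each reward at the last valid response position of its row."""
    rows = [[0.0] * width for _ in lengths]
    for i, (length, reward) in enumerate(zip(lengths, rewards)):
        if guard and length <= 0:
            continue  # empty response: leave the whole row at zero
        # Without the guard, length == 0 gives index -1, the last column.
        rows[i][length - 1] = reward
    return rows


# Row 0 is an empty response (length 0); row 1 has 3 valid tokens.
unguarded = place_rewards([0, 3], [1.0, 2.0], width=5, guard=False)
guarded = place_rewards([0, 3], [1.0, 2.0], width=5, guard=True)
print(unguarded[0])
print(guarded[0])
print(guarded[1])
```

The unguarded call puts the reward for the empty response in the last column of row 0, while the guarded call leaves that row at zero and still places the reward for the normal response at index 2.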

Validation

Confirmed with a minimal Python script (no GPU required):

[Bug 1] dapo.py: overlong_buffer_cfg=None → AttributeError
  BEFORE: CRASH confirmed → 'NoneType' object has no attribute 'enable'
  AFTER fix: No crash ✓

[Bug 2] reward placement with valid_response_length=0
  BEFORE: row0=[0.0, 0.0, 0.0, 0.0, 1.0]  ← reward silently placed at index -1 (WRONG)
  AFTER:  row0=[0.0, 0.0, 0.0, 0.0, 0.0]  ← empty response skipped (CORRECT)
          row1=[0.0, 0.0, 2.0, 0.0, 0.0]  ← normal response unaffected (CORRECT)

[PASS] Both bugs confirmed and fixes validated ✓

Checklist

  • Not duplicating an existing PR (related to [gdpo] Fix reward misplacement for fully-padded responses #5613 which covers core_algos.py only; this PR covers reward_manager/ files)
  • All changed lines reviewed
  • Validation script run and passing
  • AI assistance used (Claude) for static analysis and fix generation; all changes reviewed and verified by human submitter

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces safety checks to prevent out-of-bounds indexing errors when assigning rewards in the dapo, naive, and prime reward managers by ensuring valid_response_length is greater than zero. It also adds a null check for overlong_buffer_cfg in dapo.py. The review feedback correctly points out that using PyTorch scalar tensors directly in conditional statements and indexing inside a loop can cause implicit CPU-GPU synchronization overhead. It is recommended to convert these tensors to Python integers using .item() to improve performance.

Comment thread verl/workers/reward_manager/dapo.py Outdated
Comment on lines +132 to +133
    if valid_response_length > 0:
        reward_tensor[i, valid_response_length - 1] = reward
Contributor

high

Using a PyTorch scalar tensor directly in a Python conditional statement (if valid_response_length > 0) and as an index (valid_response_length - 1) forces implicit CPU-GPU synchronization if the tensor resides on the GPU. Since this occurs inside a loop over the batch, it can introduce significant performance overhead. Converting the scalar tensor to a Python integer using .item() avoids this synchronization and ensures cleaner indexing.

Suggested change
- if valid_response_length > 0:
-     reward_tensor[i, valid_response_length - 1] = reward
+ if valid_response_length.item() > 0:
+     reward_tensor[i, valid_response_length.item() - 1] = reward

Author

Fixed in 89b81a9 — added .item() at both the conditional and the index site in dapo.py and naive.py. prime.py already used .item() in the same pattern, so all three managers are now consistent. Thanks for catching this.
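
One possible variant of this pattern reads the length into a Python int once per sample and reuses it for both the check and the index. Note that .item() itself copies the scalar to the host, so on a GPU tensor each call is a synchronization point; a single read keeps that to one per sample. The sketch below is an assumption about how this could look, not the code in commit 89b81a9; the function name place_rewards_single_read and its arguments are hypothetical, and it runs on CPU.

```python
import torch


def place_rewards_single_read(
    reward_tensor: torch.Tensor, valid_lengths: torch.Tensor, rewards: list
) -> torch.Tensor:
    """Place each reward at the last valid token, skipping empty responses."""
    for i in range(reward_tensor.shape[0]):
        # Single host read per sample; reuse the int for the check and the index.
        length = int(valid_lengths[i].item())
        if length > 0:
            reward_tensor[i, length - 1] = rewards[i]
    return reward_tensor


reward_tensor = torch.zeros(2, 5)
valid_lengths = torch.tensor([0, 3])  # row 0 is an empty response
result = place_rewards_single_read(reward_tensor, valid_lengths, [1.0, 2.0])
```

If the per-sample reads ever matter for performance, calling valid_lengths.tolist() once before the loop would move all lengths to the host in a single transfer.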

Comment thread verl/workers/reward_manager/naive.py Outdated
Comment on lines +100 to +101
    if valid_response_length > 0:
        reward_tensor[i, valid_response_length - 1] = reward
Contributor

high

Using a PyTorch scalar tensor directly in a Python conditional statement (if valid_response_length > 0) and as an index (valid_response_length - 1) forces implicit CPU-GPU synchronization if the tensor resides on the GPU. Since this occurs inside a loop over the batch, it can introduce significant performance overhead. Converting the scalar tensor to a Python integer using .item() avoids this synchronization and ensures cleaner indexing.

Suggested change
- if valid_response_length > 0:
-     reward_tensor[i, valid_response_length - 1] = reward
+ if valid_response_length.item() > 0:
+     reward_tensor[i, valid_response_length.item() - 1] = reward

Author

Fixed in 89b81a9 — same as the dapo.py change above.

imitater-dou pushed a commit to imitater-dou/verl that referenced this pull request May 26, 2026
…eward placement

Address Gemini code-review feedback on PR verl-project#6484:
valid_response_length is a scalar PyTorch tensor; using it directly in a
Python conditional and as a tensor index triggers implicit device-to-host
synchronization inside the per-sample loop. Convert to Python int with
.item() at the two usage sites in dapo.py and naive.py.

prime.py already used .item() in the same pattern; now all three managers
are consistent.

Co-authored-by: Claude
Signed-off-by: imitater-dou <ikun3.1415927@gmail.com>
…_buffer_cfg

Two silent bugs in reward manager __call__ methods:

1. DAPORewardManager.__call__ accessed self.overlong_buffer_cfg.enable without
   checking for None first. Since overlong_buffer_cfg defaults to None, any call
   without this argument raised AttributeError at runtime.

2. NaiveRewardManager, DAPORewardManager and PrimeRewardManager all computed
   reward_tensor[i, valid_response_length - 1] without guarding against
   valid_response_length == 0. PyTorch/Python negative indexing silently wrote
   the reward to the last position of the tensor row, corrupting training signal
   for aborted or fully-padded responses.

Both issues are confirmed by static analysis and reproduced with a minimal
Python script (no GPU required).

Related: verl-project#5613 fixes the same index-(-1) pattern in core_algos.py for GDPO;
this PR addresses the same class of bug in the three reward manager classes.

Co-authored-by: Claude
Signed-off-by: imitater-dou <ikun3.1415927@gmail.com>
@imitater-dou imitater-dou force-pushed the fix/dapo-overlong-buffer-none-check branch from 89b81a9 to 31d143a Compare May 26, 2026 11:26