[reward_manager] fix: guard against empty responses and None overlong_buffer_cfg#6484
[reward_manager] fix: guard against empty responses and None overlong_buffer_cfg#6484imitater-dou wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces safety checks to prevent out-of-bounds indexing errors when assigning rewards in the dapo, naive, and prime reward managers by ensuring valid_response_length is greater than zero. It also adds a null check for overlong_buffer_cfg in dapo.py. The review feedback correctly points out that using PyTorch scalar tensors directly in conditional statements and indexing inside a loop can cause implicit CPU-GPU synchronization overhead. It is recommended to convert these tensors to Python integers using .item() to improve performance.
| if valid_response_length > 0: | ||
| reward_tensor[i, valid_response_length - 1] = reward |
There was a problem hiding this comment.
Using a PyTorch scalar tensor directly in a Python conditional statement (if valid_response_length > 0) and as an index (valid_response_length - 1) forces implicit CPU-GPU synchronization if the tensor resides on the GPU. Since this occurs inside a loop over the batch, it can introduce significant performance overhead. Converting the scalar tensor to a Python integer using .item() avoids this synchronization and ensures cleaner indexing.
| if valid_response_length > 0: | |
| reward_tensor[i, valid_response_length - 1] = reward | |
| if valid_response_length.item() > 0: | |
| reward_tensor[i, valid_response_length.item() - 1] = reward |
There was a problem hiding this comment.
Fixed in 89b81a9 — added .item() at both the conditional and the index site in dapo.py and naive.py. prime.py already used .item() in the same pattern, so all three managers are now consistent. Thanks for catching this.
| if valid_response_length > 0: | ||
| reward_tensor[i, valid_response_length - 1] = reward |
There was a problem hiding this comment.
Using a PyTorch scalar tensor directly in a Python conditional statement (if valid_response_length > 0) and as an index (valid_response_length - 1) forces implicit CPU-GPU synchronization if the tensor resides on the GPU. Since this occurs inside a loop over the batch, it can introduce significant performance overhead. Converting the scalar tensor to a Python integer using .item() avoids this synchronization and ensures cleaner indexing.
| if valid_response_length > 0: | |
| reward_tensor[i, valid_response_length - 1] = reward | |
| if valid_response_length.item() > 0: | |
| reward_tensor[i, valid_response_length.item() - 1] = reward |
There was a problem hiding this comment.
Fixed in 89b81a9 — same as the dapo.py change above.
…eward placement Address Gemini code-review feedback on PR verl-project#6484: valid_response_length is a scalar PyTorch tensor; using it directly in a Python conditional and as a tensor index triggers implicit device-to-host synchronization inside the per-sample loop. Convert to Python int with .item() at the two usage sites in dapo.py and naive.py. prime.py already used .item() in the same pattern; now all three managers are consistent. Co-authored-by: Claude Signed-off-by: imitater-dou <ikun3.1415927@gmail.com>
…_buffer_cfg Two silent bugs in reward manager __call__ methods: 1. DAPORewardManager.__call__ accessed self.overlong_buffer_cfg.enable without checking for None first. Since overlong_buffer_cfg defaults to None, any call without this argument raised AttributeError at runtime. 2. NaiveRewardManager, DAPORewardManager and PrimeRewardManager all computed reward_tensor[i, valid_response_length - 1] without guarding against valid_response_length == 0. PyTorch/Python negative indexing silently wrote the reward to the last position of the tensor row, corrupting training signal for aborted or fully-padded responses. Both issues are confirmed by static analysis and reproduced with a minimal Python script (no GPU required). Related: verl-project#5613 fixes the same index-(-1) pattern in core_algos.py for GDPO; this PR addresses the same class of bug in the three reward manager classes. Co-authored-by: Claude Signed-off-by: imitater-dou <ikun3.1415927@gmail.com>
…eward placement Address Gemini code-review feedback on PR verl-project#6484: valid_response_length is a scalar PyTorch tensor; using it directly in a Python conditional and as a tensor index triggers implicit device-to-host synchronization inside the per-sample loop. Convert to Python int with .item() at the two usage sites in dapo.py and naive.py. prime.py already used .item() in the same pattern; now all three managers are consistent. Co-authored-by: Claude Signed-off-by: imitater-dou <ikun3.1415927@gmail.com>
89b81a9 to
31d143a
Compare
Problem
Two silent bugs found via static analysis in
verl/workers/reward_manager/:Bug 1 —
DAPORewardManager:AttributeErrorwhenoverlong_buffer_cfg=None__call__accessedself.overlong_buffer_cfg.enable(line 121) without first checking forNone. Sinceoverlong_buffer_cfgdefaults toNone, any call without this argument crashes at runtime:The
__init__method correctly guards withif self.overlong_buffer_cfg is not None:, but__call__did not.Bug 2 — All three managers: reward written to index
-1for empty responsesNaiveRewardManager,DAPORewardManager, andPrimeRewardManagerall compute:When
valid_response_length == 0(aborted generation, fully-padded response), this becomesreward_tensor[i, -1]— Python/PyTorch silently writes the reward to the last position of the row, corrupting training signal with no error raised.Fix
is not Noneguard before.enableaccess indapo.py.if valid_response_length > 0:guard before reward placement in all three managers.Validation
Confirmed with a minimal Python script (no GPU required):
Checklist
core_algos.pyonly; this PR coversreward_manager/files)