[reward, worker] fix: guard reward_tensor index -1 when valid_response_length == 0 #6508
donghyq wants to merge 1 commit into
When a generation is aborted or truncated to zero tokens, valid_response_length is 0 and reward_tensor[i, valid_response_length - 1] silently writes the reward to index -1 (the last token), corrupting the training signal for neighbouring tokens.
Code Review
This pull request addresses a bug where an empty or aborted generation (resulting in a valid response length of 0) causes silent reward tensor corruption by writing to index -1. The fix guards reward assignments in dapo.py, naive.py, and prime.py with a check for valid_response_length > 0. Additionally, it prevents an AttributeError in dapo.py when overlong_buffer_cfg is None and introduces regression tests. The reviewer's feedback suggests optimizing the loop in prime.py by caching the .item() call of the tensor to avoid redundant host-device synchronization.
    if valid_response_length[i].item() > 0:
        reward_tensor[i, valid_response_length[i].item() - 1] = scores[i]
Calling .item() on a GPU tensor inside a loop triggers host-device synchronization, which blocks the CPU and can significantly degrade performance. Since valid_response_length[i].item() is used twice here, we should retrieve it once and store it in a local variable to minimize synchronization overhead.
Suggested change:
-    if valid_response_length[i].item() > 0:
-        reward_tensor[i, valid_response_length[i].item() - 1] = scores[i]
+    val_len = valid_response_length[i].item()
+    if val_len > 0:
+        reward_tensor[i, val_len - 1] = scores[i]
What does this PR do?
Fixes #6476
Fixes two bugs in the reward managers triggered when valid_response_length == 0 (e.g. an aborted or zero-token generation).

Bug 1: silent index -1 write (naive.py, prime.py, dapo.py)
All three reward managers contain an unguarded assignment of the form:
    reward_tensor[i, valid_response_length - 1] = reward
When valid_response_length == 0, this evaluates to reward_tensor[i, -1], which PyTorch silently treats as the last token position rather than raising an error. The reward is written to a wrong position, corrupting the training signal for that sample.
Fix: skip the reward assignment when valid_response_length == 0.
    if valid_response_length > 0:
        reward_tensor[i, valid_response_length - 1] = reward
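The failure mode and the guard can be reproduced without the reward managers. The sketch below is a minimal illustration, assuming plain Python lists as a stand-in for the reward tensor (lists share the negative-index behavior described above). The helper names assign_unguarded and assign_guarded are hypothetical and not part of this PR.

```python
def assign_unguarded(row, valid_response_length, reward):
    # Mirrors the original pattern: a length of 0 yields index -1,
    # which addresses the last slot instead of raising an error.
    row[valid_response_length - 1] = reward
    return row


def assign_guarded(row, valid_response_length, reward):
    # Mirrors the fix: skip the write entirely for zero-length responses.
    if valid_response_length > 0:
        row[valid_response_length - 1] = reward
    return row


if __name__ == "__main__":
    print(assign_unguarded([0.0] * 4, 0, 1.0))  # reward lands in the last slot
    print(assign_guarded([0.0] * 4, 0, 1.0))    # row stays all zeros
    print(assign_guarded([0.0] * 4, 2, 1.0))    # reward lands at index 1
```

With the guard in place, a zero-length sample simply receives no reward, which matches the intent stated in the fix.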
Bug 2: AttributeError when overlong_buffer_cfg=None (dapo.py)
DapoRewardManager.__call__ accesses self.overlong_buffer_cfg.enable unconditionally. Since overlong_buffer_cfg=None is the default, this raises AttributeError: 'NoneType' object has no attribute 'enable' for any user who does not configure the overlong penalty.
Fix: add a None guard consistent with the one already present in __init__.
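A None guard of this kind might look like the sketch below. It is an illustration only: SketchRewardManager and overlong_penalty_enabled are hypothetical names, and SimpleNamespace stands in for whatever config object the real manager receives. Only the overlong_buffer_cfg attribute and its enable field come from the description above.

```python
from types import SimpleNamespace


class SketchRewardManager:
    """Hypothetical stand-in, not the real DapoRewardManager."""

    def __init__(self, overlong_buffer_cfg=None):
        # None is the default, meaning no overlong penalty is configured.
        self.overlong_buffer_cfg = overlong_buffer_cfg

    def overlong_penalty_enabled(self):
        # Check for None before touching .enable to avoid AttributeError.
        cfg = self.overlong_buffer_cfg
        return cfg is not None and bool(cfg.enable)
```

For example, SketchRewardManager().overlong_penalty_enabled() returns False instead of raising, while passing SimpleNamespace(enable=True) returns True.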
Tests
Added CPU-only regression tests in tests/workers/reward_manager/test_reward_manager_zero_length_on_cpu.py covering all three managers.
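The test file itself is not reproduced here. As a rough sketch of what a zero-length regression check could look like, assume a hypothetical fill_rewards helper that applies the guarded assignment over a batch, with nested lists standing in for the reward tensor. None of these names are taken from the PR.

```python
import unittest


def fill_rewards(lengths, scores, max_len):
    # Hypothetical helper: one row of zeros per sample, with the reward
    # written at the last valid token of each non-empty response.
    rewards = [[0.0] * max_len for _ in lengths]
    for i, (length, score) in enumerate(zip(lengths, scores)):
        if length > 0:  # the guard under test
            rewards[i][length - 1] = score
    return rewards


class ZeroLengthRewardTest(unittest.TestCase):
    def test_zero_length_row_stays_zero(self):
        rewards = fill_rewards([0, 3], [1.0, 0.5], max_len=4)
        self.assertEqual(rewards[0], [0.0, 0.0, 0.0, 0.0])

    def test_valid_row_gets_reward_at_last_token(self):
        rewards = fill_rewards([0, 3], [1.0, 0.5], max_len=4)
        self.assertEqual(rewards[1], [0.0, 0.0, 0.5, 0.0])
```

A file containing this sketch can be run with python -m unittest. Mixing a zero-length sample with a valid one in the same batch checks both that the empty row is untouched and that valid inputs behave as before.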
API and Usage Example
No API changes. The fix is purely defensive — behavior is unchanged for valid inputs.