fix(wzn): fix format reward contamination from prompt template in gsm8k/geo3k #52
Merged
puyuan1996 merged 1 commit into opendilab:main on Mar 17, 2026
• Root cause: reward scoring during training used the full decoded sequence (prompt + response). The system prompt contains <think>...</think> and \boxed{} examples, which could inflate format_reward.
• Fix: extract the assistant completion before reward scoring, then run the format and accuracy checks on the extracted response only.
puyuan1996
requested changes
Mar 16, 2026
    return s

    # Qwen-style chat markers (keep only the last assistant segment)
    assistant_marker = "<|im_start|>assistant"
Collaborator
- Is this applicable to all models (e.g., Qwen, Qwen-VL, DeepSeek)?
-    acc_r = geo3k_accuracy_reward_fn(sol, gt)
-    fmt_r = geo3k_format_reward_fn(sol)
+    acc_r = geo3k_accuracy_reward_fn(sol_completion, gt)
+    fmt_r = geo3k_format_reward_fn(sol_completion)
Collaborator
- Before this fix, format-reward was always reported as 1. Does that mean the actual responses may not have been following the format?
• Problem: reward scoring during training used the full decoded sequence (system_prompt + response). If the system_prompt contains "<think>...</think>" and "\boxed{}" examples, format_reward_fn can match the prompt portion, so format_reward appears near-constant at 1.0 even when the model's output format is wrong.
• Fix: extract the assistant completion before reward scoring, then run the format and accuracy checks on the extracted response only.
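The contamination described above can be reproduced with a small sketch. The regex, the `format_reward` helper, and the example prompt below are hypothetical stand-ins for illustration; they are not the repository's actual `format_reward_fn` or system prompt.

```python
import re

# Hypothetical format check (an assumption, not the project's real reward fn):
# requires a <think>...</think> block followed by a \boxed{...} answer.
FORMAT_RE = re.compile(r"<think>.*?</think>.*?\\boxed\{.*?\}", re.DOTALL)


def format_reward(text: str) -> float:
    """Return 1.0 if the text contains the expected format, else 0.0."""
    return 1.0 if FORMAT_RE.search(text) else 0.0


# Illustrative system prompt that itself contains format examples.
system_prompt = (
    "Reason inside <think>...</think> and put the final answer in \\boxed{}."
)
# A response that does NOT follow the required format.
bad_response = "The answer is 42."

# Scoring the full decoded sequence lets the prompt satisfy the check.
full_sequence = system_prompt + "\n" + bad_response
print(format_reward(full_sequence))  # 1.0 (matched inside the prompt)

# Scoring only the response gives the intended result.
print(format_reward(bad_response))  # 0.0
```

With any check of this shape, a prompt that demonstrates the format is enough to saturate the reward, which would be consistent with the near-constant 1.0 reported here.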
📝 Summary
🏷️ Type of Change
🔗 Related Issues
🛠️ Key Changes
Extract the assistant completion from the decoded transcript first (Qwen-style <|im_start|>assistant ... <|im_end|> markers are supported), then run both the format and accuracy checks on the extracted completion only.
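A minimal sketch of such an extraction step, assuming Qwen-style chat markers. The function name `extract_assistant_completion` and the fallback of returning the input unchanged when no marker is found are assumptions for illustration and may differ from the merged code.

```python
ASSISTANT_MARKER = "<|im_start|>assistant"
END_MARKER = "<|im_end|>"


def extract_assistant_completion(decoded: str) -> str:
    """Return the last assistant segment of a decoded chat transcript."""
    # Keep only the text after the LAST assistant marker, so earlier turns
    # and the system prompt cannot leak into reward scoring.
    idx = decoded.rfind(ASSISTANT_MARKER)
    if idx == -1:
        # Assumed fallback: no marker found, return the text unchanged.
        return decoded
    completion = decoded[idx + len(ASSISTANT_MARKER):]
    # Drop the end-of-turn marker and anything after it, if present.
    end = completion.find(END_MARKER)
    if end != -1:
        completion = completion[:end]
    return completion.strip()


transcript = (
    "<|im_start|>system\nUse <think>...</think> and \\boxed{}.<|im_end|>\n"
    "<|im_start|>user\nWhat is 1+1?<|im_end|>\n"
    "<|im_start|>assistant\nThe answer is 2.<|im_end|>"
)
print(extract_assistant_completion(transcript))  # The answer is 2.
```

Because the markers are specific to one chat template, a transcript from a model with different markers would hit the fallback branch and still be scored with its prompt included.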
🧪 Testing
Environment:
Command(s):
# Paste the command used to run tests
Results:
✅ Checklist
Run make format and make fcheck to ensure code style compliance.