
fix(wzn): fix format reward contamination from prompt template in gsm8k/geo3k #52

Merged
puyuan1996 merged 1 commit into opendilab:main from zunian-wan:dev-geo3k-fmt_reward-fix
Mar 17, 2026

Conversation

@zunian-wan
Contributor

@zunian-wan zunian-wan commented Mar 14, 2026

• Problem: reward scoring in training used the decoded full sequence (system_prompt + response). Since the system_prompt contains <think>...</think> and \boxed{} examples, format_reward_fn could match the prompt portion, making format_reward appear near-constant at 1.0 even when the model's output format was wrong.

• Fix: extract assistant completion before reward scoring, then run format/accuracy checks on the extracted response only.

📝 Summary

🏷️ Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • 🎨 Refactoring (code style, formatting, local variables)
  • Performance (improvements to code performance)
  • Testing (adding or fixing tests)
  • 📚 Documentation (updates to documentation)
  • 💥 Breaking change (fix or feature that causes existing functionality to fail)

🔗 Related Issues

  • Fixes #
  • Related to #

🛠️ Key Changes

Fix by first extracting the assistant completion from the decoded transcript (Qwen-style <|im_start|>assistant ... <|im_end|> markers supported), then running both the format and accuracy checks on the extracted completion only.
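A minimal sketch of the extraction step described above, assuming Qwen-style chat markers. The helper name is hypothetical; the actual function in the PR may be named and structured differently:

```python
def extract_assistant_completion(decoded: str) -> str:
    """Keep only the last assistant segment of a Qwen-style chat transcript.

    Hypothetical illustration of the fix: strip everything before the final
    ``<|im_start|>assistant`` marker and everything after the next
    ``<|im_end|>``, so reward functions never see the system prompt.
    """
    assistant_marker = "<|im_start|>assistant"
    end_marker = "<|im_end|>"

    idx = decoded.rfind(assistant_marker)
    if idx == -1:
        # No chat markers found: fall back to scoring the full text.
        return decoded

    completion = decoded[idx + len(assistant_marker):]
    end = completion.find(end_marker)
    if end != -1:
        completion = completion[:end]
    return completion.strip()


transcript = (
    "<|im_start|>system\nUse <think>...</think> and \\boxed{}.<|im_end|>\n"
    "<|im_start|>assistant\n<think>2 + 2 = 4</think>\\boxed{4}<|im_end|>"
)
print(extract_assistant_completion(transcript))
# → <think>2 + 2 = 4</think>\boxed{4}
```

With this in place, only the model's own completion is handed to the format and accuracy reward functions, so prompt examples can no longer trigger a match.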

🧪 Testing

Environment:

  • Python:
  • PyTorch:
  • CUDA:

Command(s):

# Paste the command used to run tests

Results:

  • Tests passed locally

✅ Checklist

  • I have run make format and make fcheck to ensure code style compliance.
  • I have added or updated tests for this change.
  • I have updated the documentation (if applicable).
  • I have documented any breaking changes (if applicable).
  • This PR is ready for review.

…8k/geo3k

• Root cause: reward scoring in training used the decoded full sequence (prompt + response), and the system prompt contains <think>...</think> and \boxed{} examples, which could inflate format_reward.
• Fix: extract assistant completion before reward scoring, then run format/accuracy checks on the extracted response only.
@PaParaZz1 PaParaZz1 added the bug Something isn't working label Mar 14, 2026
return s

# Qwen-style chat markers (keep only the last assistant segment)
assistant_marker = "<|im_start|>assistant"
Collaborator

  • Is this applicable to all models (e.g., Qwen, Qwen-VL, DeepSeek)?

- acc_r = geo3k_accuracy_reward_fn(sol, gt)
- fmt_r = geo3k_format_reward_fn(sol)
+ acc_r = geo3k_accuracy_reward_fn(sol_completion, gt)
+ fmt_r = geo3k_format_reward_fn(sol_completion)
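The contamination this change addresses can be reproduced with a toy format check. The regex below is a hypothetical stand-in for geo3k_format_reward_fn (the PR's actual pattern is not shown here); it rewards text containing a <think>...</think> block followed by \boxed{...}:

```python
import re

def format_reward(text: str) -> float:
    """Toy format check: 1.0 iff <think>...</think> precedes a \\boxed{...}.

    Hypothetical sketch of a format reward, not the repo's real pattern.
    """
    pattern = r"<think>.*?</think>.*?\\boxed\{.*?\}"
    return 1.0 if re.search(pattern, text, re.DOTALL) else 0.0

system_prompt = "Answer inside <think>...</think> and put the result in \\boxed{}."
bad_response = "The answer is 4."  # violates the required format

full_sequence = system_prompt + "\n" + bad_response
print(format_reward(full_sequence))  # 1.0 — prompt examples contaminate the score
print(format_reward(bad_response))   # 0.0 — scoring the completion alone is correct
```

Scoring the full decoded sequence lets the pattern match inside the prompt, which is exactly why format_reward sat near 1.0 before the fix.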
Collaborator


  • I noticed format-reward was always 1 before this fix. Were there cases where the actual response did not follow the format constraints?

@puyuan1996 puyuan1996 merged commit 8afaa75 into opendilab:main Mar 17, 2026
1 check passed

Labels

bug Something isn't working

3 participants