
fix(wzn): fix format reward contamination from prompt template in gsm8k/geo3k #52

Merged
puyuan1996 merged 1 commit into opendilab:main from zunian-wan:dev-geo3k-fmt_reward-fix
Mar 17, 2026

Conversation

@zunian-wan
Contributor

@zunian-wan zunian-wan commented Mar 14, 2026

• Problem: reward scoring in training used the decoded full sequence (system_prompt + response). Since the system_prompt contains <think>...</think> and \boxed{} examples, format_reward_fn could match the prompt portion, making format_reward appear near-constant at 1.0 even when the model's output format was wrong.

• Fix: extract assistant completion before reward scoring, then run format/accuracy checks on the extracted response only.

📝 Summary

🏷️ Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • 🎨 Refactoring (code style, formatting, local variables)
  • Performance (improvements to code performance)
  • Testing (adding or fixing tests)
  • 📚 Documentation (updates to documentation)
  • 💥 Breaking change (fix or feature that causes existing functionality to fail)

🔗 Related Issues

  • Fixes #
  • Related to #

🛠️ Key Changes

Fix by first extracting the assistant completion from the decoded transcript (Qwen-style <|im_start|>assistant ... <|im_end|> markers supported), then running both the format and accuracy checks on the extracted completion only.
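A minimal sketch of the extraction step described above, assuming Qwen-style chat markers. The helper name is hypothetical; the actual function in the PR may be named and structured differently:

```python
def extract_assistant_completion(decoded: str) -> str:
    """Keep only the last assistant segment of a Qwen-style chat transcript.

    Hypothetical illustration of the fix: strip everything before the final
    ``<|im_start|>assistant`` marker and everything after the next
    ``<|im_end|>``, so reward functions never see the system prompt.
    """
    assistant_marker = "<|im_start|>assistant"
    end_marker = "<|im_end|>"

    idx = decoded.rfind(assistant_marker)
    if idx == -1:
        # No chat markers found: fall back to scoring the full text.
        return decoded

    completion = decoded[idx + len(assistant_marker):]
    end = completion.find(end_marker)
    if end != -1:
        completion = completion[:end]
    return completion.strip()


transcript = (
    "<|im_start|>system\nUse <think>...</think> and \\boxed{}.<|im_end|>\n"
    "<|im_start|>assistant\n<think>2 + 2 = 4</think>\\boxed{4}<|im_end|>"
)
print(extract_assistant_completion(transcript))
# → <think>2 + 2 = 4</think>\boxed{4}
```

With this in place, only the model's own completion is handed to the format and accuracy reward functions, so prompt examples can no longer trigger a match.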

🧪 Testing

Environment:

  • Python:
  • PyTorch:
  • CUDA:

Command(s):

# Paste the command used to run tests

Results:

  • Tests passed locally

✅ Checklist

  • I have run make format and make fcheck to ensure code style compliance.
  • I have added or updated tests for this change.
  • I have updated the documentation (if applicable).
  • I have documented any breaking changes (if applicable).
  • This PR is ready for review.

…8k/geo3k

• Root cause: reward scoring in training used the decoded full sequence (prompt + response), and the system prompt contains <think>...</think> and \boxed{} examples, which could inflate format_reward.
• Fix: extract assistant completion before reward scoring, then run format/accuracy checks on the extracted response only.
@PaParaZz1 PaParaZz1 added the bug Something isn't working label Mar 14, 2026
return s

# Qwen-style chat markers (keep only the last assistant segment)
assistant_marker = "<|im_start|>assistant"
Collaborator

  • Is this applicable to all models (e.g., Qwen, Qwen-VL, DeepSeek)?

- acc_r = geo3k_accuracy_reward_fn(sol, gt)
- fmt_r = geo3k_format_reward_fn(sol)
+ acc_r = geo3k_accuracy_reward_fn(sol_completion, gt)
+ fmt_r = geo3k_format_reward_fn(sol_completion)
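The contamination this change addresses can be reproduced with a toy format check. The regex below is a hypothetical stand-in for geo3k_format_reward_fn (the PR's actual pattern is not shown here); it rewards text containing a <think>...</think> block followed by \boxed{...}:

```python
import re

def format_reward(text: str) -> float:
    """Toy format check: 1.0 iff <think>...</think> precedes a \\boxed{...}.

    Hypothetical sketch of a format reward, not the repo's real pattern.
    """
    pattern = r"<think>.*?</think>.*?\\boxed\{.*?\}"
    return 1.0 if re.search(pattern, text, re.DOTALL) else 0.0

system_prompt = "Answer inside <think>...</think> and put the result in \\boxed{}."
bad_response = "The answer is 4."  # violates the required format

full_sequence = system_prompt + "\n" + bad_response
print(format_reward(full_sequence))  # 1.0 — prompt examples contaminate the score
print(format_reward(bad_response))   # 0.0 — scoring the completion alone is correct
```

Scoring the full decoded sequence lets the pattern match inside the prompt, which is exactly why format_reward sat near 1.0 before the fix.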
Collaborator


  • I noticed format-reward was always 1 before this fix. Were there cases where the actual response did not follow the format constraints?

@puyuan1996 puyuan1996 merged commit 8afaa75 into opendilab:main Mar 17, 2026
1 check passed

Labels

bug Something isn't working

3 participants