Fix Qwen MTP batched target-verify drift by Blaizzy · Pull Request #1210 · Blaizzy/mlx-vlm

Blaizzy · 2026-05-21T10:34:59Z

Summary

Fixes Qwen3.5/Qwen3.6 MTP batch drift by making target-verify paths match singleton numerics for left-padded and mixed-length batches.

Changes include:

preserve left-padding position offsets during Qwen batched prefill
make Qwen target-verify fallback projections singleton-exact per row and per time step when the dense helper cannot be used
pass the sliced attention mask through target-verify attention chunks
extend speculative tests to cover multi-row target-verify projections
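To illustrate the first item, here is a minimal sketch of singleton-equivalent position handling for a left-padded batch. It uses NumPy as a stand-in for MLX, and `left_padded_positions` is a hypothetical helper name, not a function from this PR. The idea is that real tokens in every row are numbered from 0, as they would be in a singleton run, regardless of how much padding precedes them.

```python
import numpy as np


def left_padded_positions(attention_mask):
    """Per-row position ids for a left-padded batch.

    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding.
    Real tokens are numbered 0, 1, 2, ... within each row; padding slots
    are clamped to 0 (they are masked out of attention anyway).
    """
    mask = np.asarray(attention_mask, dtype=np.int64)
    # The running count of real tokens, minus one, is the singleton position.
    positions = np.cumsum(mask, axis=1) - 1
    return np.where(mask == 1, positions, 0)


# Row 0 has two padding slots on the left; row 1 has none.
mask = np.array([[0, 0, 1, 1, 1],
                 [1, 1, 1, 1, 1]])
positions = left_padded_positions(mask)
```

With these offsets, the short row sees positions 0, 1, 2 for its three real tokens, the same values it would get if it were processed alone.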

Root Cause

The remaining batch drift came from target-verify fallback projections that were only split by time, not by row. Small GDN projections fell back to batched GEMM across rows, which changed recurrent GDN state numerics for mixed batches. Left-padded batched prefill also needed singleton-equivalent position handling so no-drafter and MTP prefill states stay aligned.
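The row-and-time split can be sketched as follows. This is an illustration in NumPy, not the PR's MLX code, and `rowwise_timewise_linear` is a hypothetical name. Each (row, time) vector goes through the same matrix-vector product it would see in a singleton run, so a row's output does not depend on which other rows share the batch. A batched GEMM over the same data is numerically close but is not guaranteed to be bit-identical.

```python
import numpy as np


def rowwise_timewise_linear(x, weight):
    """Apply y = weight @ x[b, t] one (row, time) vector at a time.

    x: (batch, time, in_dim); weight: (out_dim, in_dim).
    """
    batch, time, _ = x.shape
    out = np.empty((batch, time, weight.shape[0]), dtype=x.dtype)
    for b in range(batch):
        for t in range(time):
            # Same matvec a singleton run would issue for this vector.
            out[b, t] = weight @ x[b, t]
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4, 8)).astype(np.float32)
w = rng.standard_normal((5, 8)).astype(np.float32)

batched = rowwise_timewise_linear(x, w)
# Run each row on its own, as a batch of one, and compare bit for bit.
singletons = [rowwise_timewise_linear(x[b:b + 1], w)[0] for b in range(3)]
rows_match = all(np.array_equal(batched[b], singletons[b]) for b in range(3))
```

The loop trades throughput for exactness, which fits its role here as a fallback for when the dense helper cannot be used.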

Validation

Focused tests:

PYTHONPATH=/tmp/codex-mlx-lm-target:. pytest \
  mlx_vlm/tests/test_speculative.py::test_qwen_target_verify_linear_matches_singleton_dense_gemv \
  mlx_vlm/tests/test_speculative.py::test_qwen_target_verify_small_projection_matches_singleton_dense_gemv \
  mlx_vlm/tests/test_speculative.py::test_qwen_target_verify_gated_norm_matches_singleton_path \
  mlx_vlm/tests/test_speculative.py::test_qwen_gdn_verify_conv_matches_singleton_windows -q

Result: 4 passed.

Qwen3.6-35B-A3B AIME 2026 ids 1-4, max_tokens=256, temperature=0, seed=42, thinking enabled:

Mode	Batch	Wall time	Tokens	Tok/s	Exactness
No drafter	singleton x4	27.60s	1024	37.10	reference
No drafter	4	25.34s	1024	40.42	exact vs singleton
MTP	singleton x4	11.97s	1024	85.52	reference
MTP	4	9.02s	1024	113.48	exact vs singleton

MTP batch-4 is 2.81x faster than no-drafter batch-4 for this short run, and 1.33x faster than sequential singleton MTP.
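The two ratios follow directly from the Tok/s column of the table above. A short check, using only the reported throughput figures:

```python
# Throughput (tokens per second) copied from the table above.
tok_per_s = {
    ("no_drafter", "singleton x4"): 37.10,
    ("no_drafter", "batch 4"): 40.42,
    ("mtp", "singleton x4"): 85.52,
    ("mtp", "batch 4"): 113.48,
}

mtp_batch = tok_per_s[("mtp", "batch 4")]
speedup_vs_no_drafter_batch = mtp_batch / tok_per_s[("no_drafter", "batch 4")]
speedup_vs_mtp_singleton = mtp_batch / tok_per_s[("mtp", "singleton x4")]

print(f"{speedup_vs_no_drafter_batch:.2f}x vs no-drafter batch-4")
print(f"{speedup_vs_mtp_singleton:.2f}x vs sequential singleton MTP")
```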

Qwen3.5 9B 5-bit Temperature Sweep

AIME 2026 prompts, max_tokens=2048, seed=42, thinking enabled. All runs below were token-identical vs their no-drafter reference.

Batch 4, first 4 prompts. Before is the uniform sampled-walk fallback; After is the positioned ragged sampled path.

Temp	No-drafter tok/s	Before Match	Before MTP tok/s	Before Speedup	Before Accept	After Match	After MTP tok/s	After Speedup	After Accept	MTP tok/s Δ
0.0	50.83	4/4	122.37	2.38x	2.72	4/4	121.01	2.38x	2.72	-1.1%
0.2	50.69	4/4	105.43	2.04x	2.21	4/4	118.85	2.34x	2.72	+12.7%
0.6	50.45	4/4	101.52	1.97x	2.12	4/4	114.04	2.26x	2.60	+12.3%
1.0	50.77	4/4	96.70	1.88x	1.98	4/4	103.09	2.03x	2.34	+6.6%

Current positioned ragged sampled path at additional batch sizes:

Batch	Temp	No-drafter tok/s	MTP tok/s	Speedup	Match	Accept	Rounds
2	0.0	55.88	100.64	1.80x	2/2	2.75	754
2	0.2	55.73	97.14	1.74x	2/2	2.75	752
2	0.6	56.57	88.35	1.56x	2/2	2.62	820
2	1.0	56.25	85.95	1.53x	2/2	2.52	820
8	0.0	52.18	135.93	2.60x	8/8	2.76	770
8	0.2	52.59	134.35	2.55x	8/8	2.76	775
8	0.6	52.32	126.78	2.42x	8/8	2.63	829
8	1.0	52.66	118.08	2.24x	8/8	2.34	935

Use uniform deferred verification for non-greedy batched MTP so target sampling consumes RNG in the same lockstep order as no-drafter batches. Keep ragged acceptance enabled for greedy decoding, where argmax has no RNG-order drift and preserves the faster batch path.
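For context, greedy ragged acceptance can be sketched as follows. This shows the general speculative-decoding technique in NumPy, not the PR's code, and `greedy_accept_lengths` is a hypothetical name. Each row accepts the longest prefix of draft tokens that matches the target's argmax tokens. Rows can therefore accept different counts, which is what makes the batch ragged. Argmax draws no random numbers, so there is no RNG order to disturb.

```python
import numpy as np


def greedy_accept_lengths(draft_tokens, target_tokens):
    """Number of accepted draft tokens per row under greedy verification.

    draft_tokens:  (batch, k) tokens proposed by the drafter.
    target_tokens: (batch, k) target argmax tokens at the same positions.
    """
    matches = np.asarray(draft_tokens) == np.asarray(target_tokens)
    # The cumulative product stays 1 only while every earlier position matched.
    prefix_ok = np.cumprod(matches.astype(np.int64), axis=1)
    return prefix_ok.sum(axis=1)


draft = np.array([[5, 7, 9],
                  [5, 8, 9],
                  [1, 2, 3]])
target = np.array([[5, 7, 9],
                   [5, 7, 9],
                   [4, 2, 3]])
accepted = greedy_accept_lengths(draft, target)
```

In this example the rows accept 3, 1, and 0 draft tokens respectively. In the usual scheme the target then supplies one further token after each accepted prefix.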

Add a positioned target sampler so no-drafter and MTP consume deterministic per-position target draws instead of relying on global RNG order. This keeps sampled batched decoding exact while allowing Qwen MTP to use the ragged acceptance path.
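A minimal sketch of the per-position idea, assuming the random stream is keyed on (seed, row, position). This is not the PR's `_PositionedTargetSampler`; the function name, the keying scheme, and the Gumbel-max trick are illustrative choices, and NumPy stands in for MLX. Every draw uses its own generator, so the sampled token does not depend on how many other draws happened before it.

```python
import numpy as np


def positioned_sample(logits, seed, row, position, temperature):
    """Sample a token using randomness that depends only on (seed, row, position)."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        # Greedy decoding needs no randomness at all.
        return int(np.argmax(logits))
    # A fresh generator per (seed, row, position): call order is irrelevant.
    rng = np.random.default_rng([seed, row, position])
    gumbel = rng.gumbel(size=logits.shape)
    # Gumbel-max trick: argmax of (scaled logits + Gumbel noise) is a
    # categorical sample from softmax(logits / temperature).
    return int(np.argmax(logits / temperature + gumbel))


logits = np.array([2.0, 1.5, 0.3, -1.0])
first = positioned_sample(logits, seed=42, row=1, position=7, temperature=0.6)
# An unrelated draw in between does not shift the stream for (row=1, position=7).
_ = positioned_sample(logits, seed=42, row=0, position=3, temperature=0.6)
second = positioned_sample(logits, seed=42, row=1, position=7, temperature=0.6)
```

Under this scheme a no-drafter run and an MTP run that reach the same row and position with the same logits draw the same token, even if they verify positions in different orders or in ragged chunks.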


lucasnewman · 2026-06-01T19:45:24Z

-                return mx.random.categorical(logprobs * (1 / args.temperature))
-
-        return sampler
+        return _PositionedTargetSampler(


Should this only be used when there's a draft model? I'm wondering if the vmap-based sampler might be a bit slower...

lucasnewman

LGTM

Blaizzy added 19 commits May 21, 2026 07:37

Fix Qwen3.5 batched left-padding drift

b1ec70d

Fix Qwen target verify batch drift

ac4f2d5

Fix Qwen batch parity for padded vision rows

256757b

Fix exact batched Qwen MTP verification

f8f570c

Fix ragged Qwen3.5 MTP batch parity

7722c96

Enable ragged Qwen MTP for sampled batches

11eb394


Fix server sampler reuse across idle batches

3aaba7b

Route server MTP singleton through batch path

744372a

Fix seeded Qwen MTP CLI parity

c28fa1a

Speed up exact positioned MTP sampling

8169230

Use exact qmatvec for Qwen MTP verifier logits

c3581f2

Fuse Qwen GDN accepted state scatter

9b25b25

Fix singleton batch generator cache performance

a94e83c

Route server MTP through batch generator

774c69e

Speed up quantized Qwen batch decode

e4471e0

Use true batched Qwen3.5 server decode

7d62403

Add exact ragged Qwen3.5 decode attention

35fd30e

Avoid slow mixed ragged attention dispatch

1a08841

cropduster mentioned this pull request May 25, 2026

TypeError: _build_replacement_call got an unexpected keyword argument 'target_verify' in Qwen3.5/3.6 MTP models — PR #1210 does not resolve #1219

Closed

Blaizzy added 4 commits May 25, 2026 21:43

Improve Qwen3.5 batched decode scaling

ceb9049

Avoid MTP rollback syncs

159bf24

Remove slow Qwen decode qmv path

8662366

Reduce Qwen3.5 batched decode sync overhead

a8e73d2

Blaizzy linked an issue May 27, 2026 that may be closed by this pull request

TypeError: _build_replacement_call got an unexpected keyword argument 'target_verify' in Qwen3.5/3.6 MTP models — PR #1210 does not resolve #1219

Closed

Blaizzy added 5 commits May 27, 2026 09:39

Merge remote-tracking branch 'origin/main' into pc/qwen-mtp-batch-drift

1bfecb6

# Conflicts:
#   mlx_vlm/generate.py
#   mlx_vlm/models/qwen3_5/language.py
#   mlx_vlm/server/generation.py
#   mlx_vlm/tests/test_generate.py
#   mlx_vlm/tests/test_server.py
#   mlx_vlm/tests/test_speculative.py

Support PoolingCache in batched cache creation

c807a22

Apply pre-commit formatting

0f726e0

Merge main into qwen MTP batch drift

367a90e

Improve Qwen batch decode stability

859d724

Blaizzy marked this pull request as ready for review June 1, 2026 19:35

Merge remote-tracking branch 'origin/main' into pc/qwen-mtp-batch-drift

b938a64

# Conflicts:
#   mlx_vlm/generate/ar.py

lucasnewman reviewed Jun 1, 2026


lucasnewman approved these changes Jun 1, 2026


Blaizzy merged commit eb7537b into main Jun 1, 2026
1 check passed

Blaizzy deleted the pc/qwen-mtp-batch-drift branch June 1, 2026 19:50

Blaizzy merged 29 commits into main from pc/qwen-mtp-batch-drift
