fix(server): place image tokens inside the user turn (gemma4 vision grounding) (#140) by Pushkinist · Pull Request #144 · Pushkinist/rMLX

Pushkinist · 2026-06-18T15:16:06Z

Summary

Fixes #140.

The ticket reported gemma4 e4b QAT-fp4 (mxfp4/nvfp4) image inference producing degenerate, image-independent output, and hypothesized a dequant bug in the QAT vision embedder/projector. Diagnostics falsified that hypothesis and found the real cause — image-token placement.

Diagnosis (evidence, not the ticket's guess)

Vision-tower weights are bit-identical between e4b-mxfp8 (works) and e4b-qat-bf16; the vision soft-token embeddings (vision_tower_out / after-projection) are bit-identical between mxfp8 and qat-mxfp4 — no NaN/Inf. So the vision pipeline + projector + fp4 dequant are all correct.
The qat-bf16 control is also broken, ruling out fp4 dequant entirely. Decisively, the "working" mxfp8 control also fails on prompts like "Describe this image in detail" — it only passed by luck of phrasing.
Root cause: crates/rmlx-server/src/engine/image.rs spliced the per-image token block after the leading BOS but before the user-turn opener <|turn|>user\n, leaving the image outside the user message. HF/mlx-vlm substitute the image placeholder inside the user turn.

Fix (general — token placement, independent of weight quant)

New free fn splice_image_block(prompt_tokens, blocks, turn_opener) inserts image blocks after the last occurrence of the user-turn opener (the final user turn), with an after-BOS fallback when the opener is empty/absent. Applied to Gemma4 + Gemma4Unified (GEMMA4_USER_TURN_OPENER).
Qwen3-VL unified onto the same fn (QWEN3VL_USER_TURN_OPENER), replacing its hand-rolled first-match splice — single-turn byte-identical, multi-turn now correctly targets the final user turn.
This also fixes latent flakiness for all Gemma4 image requests (mxfp8 included), not just the QAT case.

Changes

crates/rmlx-server/src/engine/image.rs — splice_image_block free fn + per-arch opener consts; Gemma4/Unified + Qwen3-VL routed through it; old inline/hand-rolled splices removed; panic-guard reason documented.
crates/rmlx-server/src/engine/image_tests.rs — in-turn placement, last-turn selection, multi-image order/contiguity, after-BOS fallbacks, Qwen3-VL single-turn equivalence.
docs/MODELS.md — Gemma4 image-prompt placement note.

Verification

make lint clean; cargo test -p rmlx-server --lib engine::image 6/6; cargo test -p rmlx-models -p rmlx-quant 653 pass.
Real-model proof (HELLO / PURPLE 42 test image):
- e4b qat-mxfp4: all 4 previously-failing prompts now read the image ("HELLO", "PURPLE 42").
- e4b qat-nvfp4: reads "H E L L O / P U R P L E / 42".
- e4b mxfp8 (regression): still reads, now robust on the "Describe…" prompt that previously failed by luck.
- Qwen3-VL-30B-4bit (refactored splice path): small image, no errors.

Notes

Gemma3 keeps the after-BOS fallback (its opener ids weren't snapshot-verifiable here); revisit if Gemma3 image grounding is later found flaky.

🤖 Generated with Claude Code

The per-image soft-token block (<boi> + N image tokens + <eoi>) was spliced after the leading BOS token, before the user-turn opener. This put the image outside the user turn, so the model frequently failed to ground on it and produced image-independent, repetitive output (e.g. "you have not provided an image"). The effect was prompt-phrasing-dependent and hit low-precision QAT-fp4 e4b vision snapshots hardest (degenerate r/r/r), while the same image on mxfp8 sometimes happened to work. The vision tower, multimodal projector, fp4 dequant, and soft-token scatter were all correct (soft tokens are bit-identical between mxfp8 and qat-mxfp4) — the bug was purely token placement. Match the HF/mlx-vlm processor, which substitutes the <|image|> placeholder in place within the user message: splice the block right after the user-turn opener <|turn>user\n. Use the last opener so multi-turn prompts target the final user turn; fall back to after-BOS when the opener is absent (Gemma3 keeps that path). Extracts the splice into a unit-testable free fn with a sibling test file. Closes #140

…ti-image + Qwen3-VL tests - Hoist the Qwen3-VL user-turn opener to QWEN3VL_USER_TURN_OPENER const (mirrors GEMMA4_USER_TURN_OPENER) and route run_qwen3vl_image through splice_image_block, replacing the hand-rolled position()-based splice. Single-turn behavior is byte-identical (first == last with one opener); multi-turn now correctly targets the final user turn instead of the first. - Add multi_image_blocks_appear_in_order_and_contiguous test: two blocks, asserts ordering, contiguity, and correct trailing token. - Add qwen3vl_single_turn_splice_matches_last_opener test: confirms the new routing produces the same insert index as the previous code for a single-turn prompt. - Extend splice_image_block's indexing_slicing allow-reason to document that is_empty() guards windows(0) and the length check guards the oversized-opener case.

Pushkinist added 2 commits June 18, 2026 22:04

Pushkinist self-assigned this Jun 18, 2026

Pushkinist merged commit 0051389 into main Jun 18, 2026
2 checks passed

Pushkinist deleted the fix/140-e4b-qat-fp4-vision branch June 18, 2026 15:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(server): place image tokens inside the user turn (gemma4 vision grounding) (#140)#144

fix(server): place image tokens inside the user turn (gemma4 vision grounding) (#140)#144
Pushkinist merged 2 commits into
mainfrom
fix/140-e4b-qat-fp4-vision

Pushkinist commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Pushkinist commented Jun 18, 2026

Summary

Diagnosis (evidence, not the ticket's guess)

Fix (general — token placement, independent of weight quant)

Changes

Verification

Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant