Skip to content

fix(server): place image tokens inside the user turn (gemma4 vision grounding) (#140)#144

Merged
Pushkinist merged 2 commits into
mainfrom
fix/140-e4b-qat-fp4-vision
Jun 18, 2026
Merged

fix(server): place image tokens inside the user turn (gemma4 vision grounding) (#140)#144
Pushkinist merged 2 commits into
mainfrom
fix/140-e4b-qat-fp4-vision

Conversation

@Pushkinist

Copy link
Copy Markdown
Owner

Summary

Fixes #140.

The ticket reported gemma4 e4b QAT-fp4 (mxfp4/nvfp4) image inference producing degenerate, image-independent output, and hypothesized a dequant bug in the QAT vision embedder/projector. Diagnostics falsified that hypothesis and found the real cause — image-token placement.

Diagnosis (evidence, not the ticket's guess)

  • Vision-tower weights are bit-identical between e4b-mxfp8 (works) and e4b-qat-bf16; the vision soft-token embeddings (vision_tower_out / after-projection) are bit-identical between mxfp8 and qat-mxfp4 — no NaN/Inf. So the vision pipeline + projector + fp4 dequant are all correct.
  • The qat-bf16 control is also broken, ruling out fp4 dequant entirely. Decisively, the "working" mxfp8 control also fails on prompts like "Describe this image in detail" — it only passed by luck of phrasing.
  • Root cause: crates/rmlx-server/src/engine/image.rs spliced the per-image token block after the leading BOS but before the user-turn opener <|turn|>user\n, leaving the image outside the user message. HF/mlx-vlm substitute the image placeholder inside the user turn.

Fix (general — token placement, independent of weight quant)

  • New free fn splice_image_block(prompt_tokens, blocks, turn_opener) inserts image blocks after the last occurrence of the user-turn opener (the final user turn), with an after-BOS fallback when the opener is empty/absent. Applied to Gemma4 + Gemma4Unified (GEMMA4_USER_TURN_OPENER).
  • Qwen3-VL unified onto the same fn (QWEN3VL_USER_TURN_OPENER), replacing its hand-rolled first-match splice — single-turn byte-identical, multi-turn now correctly targets the final user turn.
  • This also fixes latent flakiness for all Gemma4 image requests (mxfp8 included), not just the QAT case.

Changes

  • crates/rmlx-server/src/engine/image.rssplice_image_block free fn + per-arch opener consts; Gemma4/Unified + Qwen3-VL routed through it; old inline/hand-rolled splices removed; panic-guard reason documented.
  • crates/rmlx-server/src/engine/image_tests.rs — in-turn placement, last-turn selection, multi-image order/contiguity, after-BOS fallbacks, Qwen3-VL single-turn equivalence.
  • docs/MODELS.md — Gemma4 image-prompt placement note.

Verification

  • make lint clean; cargo test -p rmlx-server --lib engine::image 6/6; cargo test -p rmlx-models -p rmlx-quant 653 pass.
  • Real-model proof (HELLO / PURPLE 42 test image):
    • e4b qat-mxfp4: all 4 previously-failing prompts now read the image ("HELLO", "PURPLE 42").
    • e4b qat-nvfp4: reads "H E L L O / P U R P L E / 42".
    • e4b mxfp8 (regression): still reads, now robust on the "Describe…" prompt that previously failed by luck.
    • Qwen3-VL-30B-4bit (refactored splice path): small image, no errors.

Notes

Gemma3 keeps the after-BOS fallback (its opener ids weren't snapshot-verifiable here); revisit if Gemma3 image grounding is later found flaky.

🤖 Generated with Claude Code

The per-image soft-token block (<boi> + N image tokens + <eoi>) was spliced
after the leading BOS token, before the user-turn opener. This put the image
outside the user turn, so the model frequently failed to ground on it and
produced image-independent, repetitive output (e.g. "you have not provided an
image"). The effect was prompt-phrasing-dependent and hit low-precision QAT-fp4
e4b vision snapshots hardest (degenerate r/r/r), while the same image on mxfp8
sometimes happened to work.

The vision tower, multimodal projector, fp4 dequant, and soft-token scatter were
all correct (soft tokens are bit-identical between mxfp8 and qat-mxfp4) — the
bug was purely token placement. Match the HF/mlx-vlm processor, which
substitutes the <|image|> placeholder in place within the user message: splice
the block right after the user-turn opener <|turn>user\n. Use the last opener so
multi-turn prompts target the final user turn; fall back to after-BOS when the
opener is absent (Gemma3 keeps that path).

Extracts the splice into a unit-testable free fn with a sibling test file.

Closes #140
…ti-image + Qwen3-VL tests

- Hoist the Qwen3-VL user-turn opener to QWEN3VL_USER_TURN_OPENER const
  (mirrors GEMMA4_USER_TURN_OPENER) and route run_qwen3vl_image through
  splice_image_block, replacing the hand-rolled position()-based splice.
  Single-turn behavior is byte-identical (first == last with one opener);
  multi-turn now correctly targets the final user turn instead of the first.
- Add multi_image_blocks_appear_in_order_and_contiguous test: two blocks,
  asserts ordering, contiguity, and correct trailing token.
- Add qwen3vl_single_turn_splice_matches_last_opener test: confirms the
  new routing produces the same insert index as the previous code for a
  single-turn prompt.
- Extend splice_image_block's indexing_slicing allow-reason to document
  that is_empty() guards windows(0) and the length check guards the
  oversized-opener case.
@Pushkinist Pushkinist self-assigned this Jun 18, 2026
@Pushkinist Pushkinist merged commit 0051389 into main Jun 18, 2026
2 checks passed
@Pushkinist Pushkinist deleted the fix/140-e4b-qat-fp4-vision branch June 18, 2026 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gemma4 e4b QAT (mxfp4/nvfp4) image inference degenerate — vision conditioning dropped (text + mxfp8 vision fine; 26b QAT fine)

1 participant