fix(server): place image tokens inside the user turn (gemma4 vision grounding) (#140)#144
Merged
Merged
Conversation
The per-image soft-token block (<boi> + N image tokens + <eoi>) was spliced after the leading BOS token, before the user-turn opener. This put the image outside the user turn, so the model frequently failed to ground on it and produced image-independent, repetitive output (e.g. "you have not provided an image"). The effect was prompt-phrasing-dependent and hit low-precision QAT-fp4 e4b vision snapshots hardest (degenerate r/r/r), while the same image on mxfp8 sometimes happened to work. The vision tower, multimodal projector, fp4 dequant, and soft-token scatter were all correct (soft tokens are bit-identical between mxfp8 and qat-mxfp4) — the bug was purely token placement. Match the HF/mlx-vlm processor, which substitutes the <|image|> placeholder in place within the user message: splice the block right after the user-turn opener <|turn>user\n. Use the last opener so multi-turn prompts target the final user turn; fall back to after-BOS when the opener is absent (Gemma3 keeps that path). Extracts the splice into a unit-testable free fn with a sibling test file. Closes #140
…ti-image + Qwen3-VL tests - Hoist the Qwen3-VL user-turn opener to QWEN3VL_USER_TURN_OPENER const (mirrors GEMMA4_USER_TURN_OPENER) and route run_qwen3vl_image through splice_image_block, replacing the hand-rolled position()-based splice. Single-turn behavior is byte-identical (first == last with one opener); multi-turn now correctly targets the final user turn instead of the first. - Add multi_image_blocks_appear_in_order_and_contiguous test: two blocks, asserts ordering, contiguity, and correct trailing token. - Add qwen3vl_single_turn_splice_matches_last_opener test: confirms the new routing produces the same insert index as the previous code for a single-turn prompt. - Extend splice_image_block's indexing_slicing allow-reason to document that is_empty() guards windows(0) and the length check guards the oversized-opener case.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #140.
The ticket reported gemma4 e4b QAT-fp4 (mxfp4/nvfp4) image inference producing degenerate, image-independent output, and hypothesized a dequant bug in the QAT vision embedder/projector. Diagnostics falsified that hypothesis and found the real cause — image-token placement.
Diagnosis (evidence, not the ticket's guess)
vision_tower_out/ after-projection) are bit-identical between mxfp8 and qat-mxfp4 — no NaN/Inf. So the vision pipeline + projector + fp4 dequant are all correct.crates/rmlx-server/src/engine/image.rsspliced the per-image token block after the leading BOS but before the user-turn opener<|turn|>user\n, leaving the image outside the user message. HF/mlx-vlm substitute the image placeholder inside the user turn.Fix (general — token placement, independent of weight quant)
splice_image_block(prompt_tokens, blocks, turn_opener)inserts image blocks after the last occurrence of the user-turn opener (the final user turn), with an after-BOS fallback when the opener is empty/absent. Applied to Gemma4 + Gemma4Unified (GEMMA4_USER_TURN_OPENER).QWEN3VL_USER_TURN_OPENER), replacing its hand-rolled first-match splice — single-turn byte-identical, multi-turn now correctly targets the final user turn.Changes
crates/rmlx-server/src/engine/image.rs—splice_image_blockfree fn + per-arch opener consts; Gemma4/Unified + Qwen3-VL routed through it; old inline/hand-rolled splices removed; panic-guard reason documented.crates/rmlx-server/src/engine/image_tests.rs— in-turn placement, last-turn selection, multi-image order/contiguity, after-BOS fallbacks, Qwen3-VL single-turn equivalence.docs/MODELS.md— Gemma4 image-prompt placement note.Verification
make lintclean;cargo test -p rmlx-server --lib engine::image6/6;cargo test -p rmlx-models -p rmlx-quant653 pass.Notes
Gemma3 keeps the after-BOS fallback (its opener ids weren't snapshot-verifiable here); revisit if Gemma3 image grounding is later found flaky.
🤖 Generated with Claude Code