fix(qwen3-vl): size KV from --max-ctx + chunk prefill (#138) #145
Merged
…138) Qwen3-VL image/text generate built per-layer KV caches with the bare KV_MAX_SEQ_DEFAULT=4096 and never bracketed prefill, so a prompt longer than 4096 (a large image tiles to thousands of soft tokens) overflowed the fixed decode buffer: `slice_update: [broadcast_shapes] (1,4,6776,128) vs (1,4,4096,128)`. The KV stayed at 4096 regardless of --max-ctx.

Both paths now resolve (initial_max_seq, ceiling) from the effective --max-ctx via kv_max_seq_and_ceiling (same chain as Gemma4 / Qwen3.5-MoE) and build the ring with with_quant_max_seq(...).with_max_seq_ceiling(...). The ring grows lazily up to the ceiling so a prompt up to --max-ctx fits; an over-cap prompt is rejected with a clean KvCeilingExceeded -> context_overflow instead of a cryptic slice_update broadcast. Plumbs max_ctx_override through run_qwen3vl_image -> generate_image.

Both prefills are also chunked (per-arch prefill_chunk=512) so a long prompt is not one multi-thousand-token Metal command buffer (the ~10s GPU watchdog): text uses the shared chunked_prefill; image uses a new forward_embeds_chunked with per-chunk deepstack slicing. The ViT now evaluates per block + materializes its merger/deepstack outputs so the vision graph does not fold into the first prefill chunk.

Proven on mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit: a >4096-token text prompt completes with coherent output (KV grows 4096->8192), an over-cap prompt returns context_length_exceeded, and a small image is unaffected. A very large image (tens of thousands of ViT patches, O(n^2) full attention) can still hit the GPU watchdog on memory-constrained Apple Silicon, a vision-tower scaling limit independent of the KV-sizing fix; documented in docs/MODELS.md.
Summary
Fixes #138.
Qwen3-VL image inference ignored `--max-ctx`: the KV cache stayed at 4096, so a prompt over 4096 tokens (a large image tiles to thousands of soft tokens) hit `mlx: slice_update [broadcast_shapes] (1,4,N,128) vs (1,4,4096,128)`.

Root cause
`crates/rmlx-models/src/qwen3_vl_moe/generate.rs` built per-layer KV via `KvCache::with_quant(kv_quant)`, which defaults `max_seq` to `KV_MAX_SEQ_DEFAULT=4096`, and never called `with_max_seq_ceiling` or bracketed prefill with `enter_prefill()`/`exit_prefill()`. With `in_prefill=false`, the None-path prefill went to the fixed 4096 decode buffer instead of the lazy-grow ring. Every other arch (gemma4, qwen3_5_moe, qwen3) sizes KV from `kv_max_seq_and_ceiling(max_ctx_override, mpe)` and brackets prefill; Qwen3-VL did neither, and the image path never even received `max_ctx_override`.

Fix (general: same chain as the other arches, no model-name special-casing)
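To make the contrast between a fixed 4096-slot buffer and a ceiling-bounded, lazily growing ring concrete, here is a minimal sketch. `resolve_bounds` and `grow_capacity` are hypothetical stand-ins, not the repository's `kv_max_seq_and_ceiling` or `KvCache`. The fallback to `mpe` when no override is given, the doubling growth policy, and the error fields are all assumptions made for illustration; the doubling was chosen only because it is consistent with the 4096 to 8192 growth reported in this PR.

```rust
/// Illustrative default initial KV length (the PR names KV_MAX_SEQ_DEFAULT=4096).
const KV_MAX_SEQ_DEFAULT: usize = 4096;

/// Hypothetical error type; it only mirrors the idea of `KvCeilingExceeded`.
#[derive(Debug, PartialEq)]
enum KvError {
    CeilingExceeded { needed: usize, ceiling: usize },
}

/// Resolve (initial_max_seq, ceiling) from an optional --max-ctx override.
/// Assumption: without an override, the model's max position embeddings
/// (`mpe`) acts as the ceiling.
fn resolve_bounds(max_ctx_override: Option<usize>, mpe: usize) -> (usize, usize) {
    let ceiling = max_ctx_override.unwrap_or(mpe);
    (KV_MAX_SEQ_DEFAULT.min(ceiling), ceiling)
}

/// Grow `current` by doubling (assumed policy) until it holds `needed`
/// tokens, never exceeding `ceiling`. An over-cap request is a clean error
/// instead of a shape mismatch further down the stack.
fn grow_capacity(current: usize, needed: usize, ceiling: usize) -> Result<usize, KvError> {
    if needed > ceiling {
        return Err(KvError::CeilingExceeded { needed, ceiling });
    }
    let mut cap = current.max(1);
    while cap < needed {
        cap = cap.saturating_mul(2).min(ceiling);
    }
    Ok(cap)
}

fn main() {
    // 32768 is a placeholder for `mpe`, not a value taken from the model config.
    let (initial, ceiling) = resolve_bounds(Some(16384), 32768);
    println!("{:?}", grow_capacity(initial, 4914, ceiling)); // Ok(8192)

    let (initial, ceiling) = resolve_bounds(Some(4000), 32768);
    println!("{:?}", grow_capacity(initial, 4913, ceiling)); // Err(CeilingExceeded { .. })
}
```

With a fixed buffer, the first call would have no way to succeed; with a ceiling, the second call fails early with a message the server can map to a context-overflow response.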
- Both paths resolve `(initial_max_seq, ceiling)` via `kv_max_seq_and_ceiling(max_ctx_override, mpe)` and build caches with `with_quant_max_seq(...).with_max_seq_ceiling(...)`, a byte-identical call shape to gemma4/qwen3_5_moe. `max_ctx_override` is plumbed through `run_qwen3vl_image` → `generate_image`.
- Both prefills are chunked (text via the shared `chunked_prefill`; image via new `forward_embeds_chunked` with per-chunk deepstack slicing) under a `qwen3_vl_moe` `prefill_chunk=512` row, so a long prompt isn't one oversized Metal command buffer. ViT now evals per block + materializes merger/deepstack outputs.
- An over-cap prompt hits `ensure_prefill_capacity`'s ceiling → `KvCeilingExceeded` (already classified Fatal → `context_overflow`), instead of the migratable `slice_update` broadcast that was retried 3× delivering 0 tokens.
- Removed the dead `forward_embeds`.

Changes
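As a rough picture of what chunked image prefill with per-chunk deepstack slicing involves, the sketch below splits a sequence into chunks of at most 512 positions and works out which deepstack rows each chunk needs. It is not the code of `forward_embeds_chunked`: `chunk_ranges` and `deepstack_rows` are hypothetical helpers, and it assumes the image tokens form one contiguous span with deepstack row `i` belonging to position `image.start + i`.

```rust
use std::ops::Range;

/// Split `seq_len` positions into consecutive chunks of at most `chunk` tokens.
fn chunk_ranges(seq_len: usize, chunk: usize) -> Vec<Range<usize>> {
    assert!(chunk > 0, "chunk size must be non-zero");
    (0..seq_len)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(seq_len))
        .collect()
}

/// Rows of the deepstack tensor that one chunk needs, or None when the chunk
/// contains no image tokens.
/// Assumption: `image` is the single contiguous span of image positions.
fn deepstack_rows(chunk: &Range<usize>, image: &Range<usize>) -> Option<Range<usize>> {
    let lo = chunk.start.max(image.start);
    let hi = chunk.end.min(image.end);
    if lo < hi {
        Some((lo - image.start)..(hi - image.start))
    } else {
        None
    }
}

fn main() {
    // Illustrative numbers: 1300 positions, image tokens at 10..1210.
    let image = 10..1210;
    for chunk in chunk_ranges(1300, 512) {
        println!("chunk {:?} -> deepstack rows {:?}", chunk, deepstack_rows(&chunk, &image));
    }
}
```

Each chunk can then be submitted as its own unit of GPU work, which is the property the 512-token chunking relies on to stay under the watchdog.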
- `crates/rmlx-models/src/qwen3_vl_moe/generate.rs`: both paths get ceiling-sized caches + chunked prefill; `max_ctx_override` used (was `_`-prefixed) + added to `generate_image`.
- `crates/rmlx-models/src/qwen3_vl_moe/model.rs`: new `forward_embeds_chunked` (per-chunk deepstack-row slicing + watchdog flush); dead `forward_embeds` removed.
- `crates/rmlx-models/src/qwen3_vl_moe/vision.rs`: per-block `h.eval()` + materialize merger/deepstack outputs (ViT watchdog mitigation; numerically identical).
- `crates/rmlx-models/src/qwen3_vl_moe/generate_tests.rs`: NEW. >4096 prefill fits, ceiling honored (not 4096), over-cap → `KvCeilingExceeded` not broadcast, small image unaffected.
- `crates/rmlx-models/src/prefill_chunk.rs` (+tests): `qwen3_vl_moe` → 512 row.
- `crates/rmlx-server/src/engine/{image.rs,arch_generator.rs}`: thread `max_ctx_override` into the image path.
- `docs/MODELS.md`: Qwen3-VL `--max-ctx` KV-sizing + chunking + ViT-watchdog note.

Verification
- `make lint` clean; `cargo test -p rmlx-models -p rmlx-server` all pass (new `generate_tests` 4/4, `prefill_chunk` 8/8). Rust review: no findings.
- >4096-token text prompt (`--max-ctx 16384`): `kv_alloc cause="grow" offset=4914 max_seq=8192`.
- Over-cap prompt (`--max-ctx 4000`) → `{"error":{"message":"prompt has 4913 tokens, max_ctx is 4000","type":"context_length_exceeded"}}`: clean, no panic.
- No `slice_update [broadcast_shapes]` across the whole run; the 4096-broadcast symptom is gone.

Known separate limit (not this bug)
On a watchdog-constrained Apple-Silicon GPU, a Qwen3-VL image whose ViT exceeds ~1300 soft tokens (≈>5200 patches, O(n²) full-attention) can overrun the ~10s Metal GPU watchdog in the vision tower, before text decode — a pre-existing vision-tower scaling limit independent of KV sizing. The KV fix is what completes such images on a GPU with sufficient watchdog headroom (the original reporter's machine cleared the ViT and failed only at the now-fixed KV slice_update). Tracking ViT streaming/tiling as a follow-up.
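A small back-of-the-envelope calculation shows why the vision tower scales so much worse than the KV cache. The 4:1 patch-to-soft-token ratio is inferred from the figures above (about 5200 patches for about 1300 soft tokens) and is an assumption, as are the helper names; the count is query-key pairs per attention head per layer under full attention.

```rust
/// Soft tokens produced from `patches`, assuming `merge` patches per soft token.
fn soft_tokens(patches: u64, merge: u64) -> u64 {
    patches / merge
}

/// Query-key pairs scored by full (O(n^2)) attention over `patches` tokens,
/// per head and per layer.
fn attention_pairs(patches: u64) -> u64 {
    patches * patches
}

fn main() {
    let patches = 5200;
    println!("soft tokens: {}", soft_tokens(patches, 4)); // 1300
    println!("pairs at 5200 patches: {}", attention_pairs(patches)); // 27040000
    // Doubling the patch count quadruples the attention work.
    println!("pairs at 10400 patches: {}", attention_pairs(2 * patches)); // 108160000
}
```

The language model only ever sees the merged soft tokens, so the quadratic cost sits entirely in the ViT and is unaffected by how the KV cache is sized.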
🤖 Generated with Claude Code