Summary
Separate, pre-existing limit surfaced while fixing #138 (KV-cache sizing, now resolved in #145).
On a watchdog-constrained Apple-Silicon GPU, a Qwen3-VL (Qwen3VLMoeForConditionalGeneration) image whose vision tower processes more than ~1300 soft tokens (≈ >5200 patches) overruns the ~10s Metal GPU watchdog inside the ViT full-attention (O(n²) per block), aborting the process before text decode. This is distinct from #138 — that bug was the KV cache capping at 4096 and is fixed; this is vision-tower compute time.
Evidence (post-#145, branch was fix/138-qwen3vl-maxctx-kv)
Real-model sweep on mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit, --max-ctx 16384:
| Image |
soft tokens |
Outcome |
| 448×448 |
196 |
completes, coherent |
| 896×896 |
784 |
completes, coherent |
| 1152×1152 |
1296 |
Metal GPU Timeout (ViT stage) |
| 2560×2560 |
6400 |
ViT completes (~13s), then Metal GPU Timeout at KV prefill |
No slice_update [broadcast_shapes] anywhere (the #138 symptom is gone). The failure is now a Metal command-buffer GPU-timeout in the vision tower / prefill.
Impact
Large, full-resolution images (which tile to thousands of soft tokens) cannot complete on memory/watchdog-constrained GPUs, even though the KV cache now sizes correctly. The original #138 reporter's machine had enough watchdog headroom to clear the ViT, so for them #145 is sufficient; lower-headroom machines need this.
Likely direction
Stream / tile the ViT attention (and the image-prefill command buffers) so no single Metal command buffer exceeds the watchdog budget — split the vision-tower attention into windowed/tiled passes with intermediate eval() barriers, analogous to the prefill_chunk chunking already added for the text/KV prefill in #145. Keep it reference-faithful to mlx-vlm numerics.
Out of scope
KV-cache sizing (done in #145). This issue is purely the vision-tower / prefill GPU-time scaling.
Summary
Separate, pre-existing limit surfaced while fixing #138 (KV-cache sizing, now resolved in #145).
On a watchdog-constrained Apple-Silicon GPU, a Qwen3-VL (
Qwen3VLMoeForConditionalGeneration) image whose vision tower processes more than ~1300 soft tokens (≈ >5200 patches) overruns the ~10s Metal GPU watchdog inside the ViT full-attention (O(n²) per block), aborting the process before text decode. This is distinct from #138 — that bug was the KV cache capping at 4096 and is fixed; this is vision-tower compute time.Evidence (post-#145, branch was
fix/138-qwen3vl-maxctx-kv)Real-model sweep on
mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit,--max-ctx 16384:No
slice_update [broadcast_shapes]anywhere (the #138 symptom is gone). The failure is now a Metal command-buffer GPU-timeout in the vision tower / prefill.Impact
Large, full-resolution images (which tile to thousands of soft tokens) cannot complete on memory/watchdog-constrained GPUs, even though the KV cache now sizes correctly. The original #138 reporter's machine had enough watchdog headroom to clear the ViT, so for them #145 is sufficient; lower-headroom machines need this.
Likely direction
Stream / tile the ViT attention (and the image-prefill command buffers) so no single Metal command buffer exceeds the watchdog budget — split the vision-tower attention into windowed/tiled passes with intermediate
eval()barriers, analogous to theprefill_chunkchunking already added for the text/KV prefill in #145. Keep it reference-faithful to mlx-vlm numerics.Out of scope
KV-cache sizing (done in #145). This issue is purely the vision-tower / prefill GPU-time scaling.