Qwen3-VL large-image ViT full-attention overruns Metal GPU watchdog (>~1300 soft tokens) #146

@Pushkinist

Summary

Separate, pre-existing limit surfaced while fixing #138 (KV-cache sizing, now resolved in #145).

On a watchdog-constrained Apple-Silicon GPU, a Qwen3-VL (Qwen3VLMoeForConditionalGeneration) image whose vision tower processes more than ~1300 soft tokens (roughly 5200 patches or more) overruns the ~10s Metal GPU watchdog inside the ViT full attention, whose cost is O(n²) per block. The process aborts before text decode. This is distinct from #138: that bug was the KV cache capping at 4096 and is fixed, whereas this one is vision-tower compute time.

Evidence (post-#145, branch was fix/138-qwen3vl-maxctx-kv)

Real-model sweep on mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit, --max-ctx 16384:

| Image | Soft tokens | Outcome |
|---|---|---|
| 448×448 | 196 | completes, coherent |
| 896×896 | 784 | completes, coherent |
| 1152×1152 | 1296 | Metal GPU Timeout (ViT stage) |
| 2560×2560 | 6400 | ViT completes (~13s), then Metal GPU Timeout at KV prefill |
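
For reference, the soft-token counts in the sweep can be reproduced with the small sketch below. It assumes a 16-pixel patch and a 2×2 spatial merge. Both values are inferred from the numbers above, not read from the model config, and the helper name `vit_token_counts` is hypothetical. It also assumes the image is not resized before patching and that full attention runs over the unmerged patches, so the per-head score matrix grows with the square of the patch count.

```python
def vit_token_counts(width: int, height: int,
                     patch_size: int = 16, merge_size: int = 2):
    """Return (patches, soft_tokens, score_entries) for one image.

    patch_size and merge_size are assumptions inferred from the sweep
    table, not values read from the model configuration.
    """
    patches = (width // patch_size) * (height // patch_size)
    # Each merge_size x merge_size group of patches becomes one soft token.
    soft_tokens = patches // (merge_size ** 2)
    # Full attention builds a patches x patches score matrix per head.
    score_entries = patches * patches
    return patches, soft_tokens, score_entries


if __name__ == "__main__":
    for side in (448, 896, 1152, 2560):
        patches, soft, scores = vit_token_counts(side, side)
        print(f"{side}x{side}: {patches} patches, {soft} soft tokens, "
              f"{scores} score entries per head")
```

Under these assumptions, going from 896×896 to 1152×1152 raises the patch count from 3136 to 5184 and the score-matrix size by about 2.7×, which is consistent with the quadratic scaling described in the summary.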

No slice_update [broadcast_shapes] error appears anywhere (the #138 symptom is gone). The failure is now a Metal command-buffer GPU timeout in the vision tower / prefill.

Impact

Large, full-resolution images (which tile to thousands of soft tokens) cannot complete on memory/watchdog-constrained GPUs, even though the KV cache now sizes correctly. The original #138 reporter's machine had enough watchdog headroom to clear the ViT, so for them #145 is sufficient; lower-headroom machines need this.

Likely direction

Stream / tile the ViT attention (and the image-prefill command buffers) so that no single Metal command buffer exceeds the watchdog budget: split the vision-tower attention into windowed/tiled passes with intermediate eval() barriers, analogous to the prefill_chunk chunking already added for the text/KV prefill in #145. Keep it reference-faithful to mlx-vlm numerics.
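
A minimal pure-Python sketch of one possible tiling, query-row chunking, is shown below. It is not the project's implementation: the names `attention`, `chunked_attention`, `q_chunk` and `barrier` are hypothetical, and `barrier` only stands in for whatever forces evaluation between chunks (in MLX that would presumably be an `mx.eval` call on the chunk output). Softmax is computed per query row, so attending a slice of queries against all keys gives the same rows as the full computation. This is why this form of tiling can in principle stay faithful to the reference numerics, although a real GPU kernel may still differ in rounding.

```python
import math
from typing import Callable, List, Optional

Matrix = List[List[float]]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """softmax(q k^T / sqrt(d)) v for the given query rows over all keys."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row))
                  for k_row in k]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
                    for j in range(len(v[0]))])
    return out


def chunked_attention(q: Matrix, k: Matrix, v: Matrix, q_chunk: int,
                      barrier: Optional[Callable[[Matrix], None]] = None
                      ) -> Matrix:
    """Process query rows in slices of q_chunk, calling barrier per slice.

    Each slice still attends to every key, so the result matches full
    attention; only the amount of work per submission shrinks.
    """
    out: Matrix = []
    for start in range(0, len(q), q_chunk):
        block = attention(q[start:start + q_chunk], k, v)
        if barrier is not None:
            barrier(block)  # hypothetical stand-in for an eval() barrier
        out.extend(block)
    return out
```

With n query rows and a chunk size c, each pass touches a c×n slice of the score matrix instead of the full n×n, so the per-submission work drops by roughly n/c while the total work is unchanged. Whether that is enough to stay under the watchdog budget would need to be measured on the affected hardware.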

Out of scope

KV-cache sizing (done in #145). This issue is purely the vision-tower / prefill GPU-time scaling.

Labels: bug
