Add ViLT (dandelin/vilt-b32-finetuned-vqa) visual-question-answering support #951
ssss141414 wants to merge 1 commit into
…support

Adds OnnxConfig + ModelPatcher for ViLT visual-question-answering since vendor optimum coverage is absent and stock ViltEmbeddings.visual_embed is not ONNX-traceable (Python iteration over tensor shapes, torch.multinomial, per-row nonzero loops). Patcher swaps in a static-shape replacement using nn.functional.interpolate for spatial position embeddings and a synthesized all-ones token mask. H/W axes are pinned static; pixel_mask is intentionally dropped since the patched path does not reference it.

Validated on dandelin/vilt-b32-finetuned-vqa @ CPU fp32:
- L0 build: 62.9s, 449.2 MB optimized ONNX
- L1 perf: p50=65.83ms, throughput=14.82 samples/sec (20 iters, warmup 3)
- L2 numerics: cos=1.000000, max_abs_diff=4.2e-5, top-class match (3129-way head)
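For readers unfamiliar with the `align_corners=True` convention mentioned above, here is a minimal plain-Python sketch of bilinear resampling of a position grid. It is an illustrative stand-in, not the code in `vilt.py`: `bilinear_resize_align_corners` is a hypothetical helper that works on nested lists, whereas the patcher calls `nn.functional.interpolate` on tensors.

```python
def bilinear_resize_align_corners(grid, out_h, out_w):
    """Resize a 2-D grid (list of rows) with bilinear interpolation.

    Follows the align_corners=True convention: the corner samples of the
    input land exactly on the corners of the output.
    """
    in_h, in_w = len(grid), len(grid[0])

    def src_coord(dst, in_size, out_size):
        # With align_corners=True the endpoints of input and output coincide.
        return 0.0 if out_size == 1 else dst * (in_size - 1) / (out_size - 1)

    out = []
    for i in range(out_h):
        y = src_coord(i, in_h, out_h)
        y0 = min(int(y), in_h - 1)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = src_coord(j, in_w, out_w)
            x0 = min(int(x), in_w - 1)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Blend along the row first, then between the two rows.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Toy 2x2 "position embedding" grid, resized to 3x3.
pos = [[0.0, 1.0], [2.0, 3.0]]
resized = bilinear_resize_align_corners(pos, 3, 3)
print(resized[1][1])  # 1.5
```

Because the output size is passed in explicitly, the same routine works for any target grid; in the patched path the target would be the patch grid derived from the actual `pixel_values` height and width.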
Reviewer verification: OV cpu / gpu / npu, branch `shzhen/add-vilt-vqa`

Commands

```powershell
# config
uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering -o temp/verify_pr951_vilt_config.json

# build (OV CPU, fp32, using recipe)
uv run winml build -c examples/recipes/dandelin_vilt-b32-finetuned-vqa/visual-question-answering_config.json -m dandelin/vilt-b32-finetuned-vqa -o temp/verify_pr951_vilt_build --ep openvino --device cpu --precision fp32 --no-quant --no-compile --rebuild

# perf: cpu / gpu / npu (from built ONNX, 5 iters + 2 warmup)
uv run winml perf -m temp/verify_pr951_vilt_build/model.onnx --ep openvino --device cpu --iterations 5 --warmup 2 --skip-build -f json

# eval schema check
uv run winml eval --schema --task visual-question-answering
```

Results
Notes:
Validation results (2026-06-25) for PR #951 on this Windows ARM64 host.

Scope
Main branch baseline (before PR)
PR #951 branch
Conclusion
ADDENDUM: main branch baseline (NO support)

On current `main` @ HEAD:

Conclusion: This PR introduces ViLT support. The `vilt.py` source changes (custom `ViltVqaOnnxConfig` + `_ViltVisualEmbedPatcher`) are necessary and not catalog-only. All OV devices now pass config/build/perf validation.
"input_ids": {0: "batch_size", 1: "sequence_length"},
"attention_mask": {0: "batch_size", 1: "sequence_length"},
"token_type_ids": {0: "batch_size", 1: "sequence_length"},
"pixel_values": {0: "batch_size"},
Static pixel_values H/W silently restricts the export to square images, and the justification here is inaccurate.
I checked ViltImageProcessor against the runtime. The default is size={"shortest_edge": 384}, size_divisor=32 — it pins the shorter edge to 384 and preserves aspect ratio (longer edge ≈ up to int(1333/800*384), floored to a multiple of 32), then pads to the per-batch max. It does not pad to 384×384. Empirically (batch=1, defaults):
| input H×W | pixel_values | pixel_mask all-ones |
|---|---|---|
| 384×384 | (1,3,384,384) | ✅ |
| 480×640 | (1,3,384,512) | ✅ |
| 640×480 | (1,3,512,384) | ✅ |
| 800×600 | (1,3,512,384) | ✅ |
So the all-ones pixel_mask assumption is correct (good), but "always 384×384 via ViltProcessor" (docstring L183) is not — only square inputs land on 384×384. Because inputs marks only batch_size dynamic, the exported ONNX accepts only 384×384; a standard non-square ViltProcessor output (e.g. 384×512) fails at session.run, or forces callers to square-resize (distorts aspect ratio → VQA-accuracy risk). L2 numerics likely passed because validation used a 384×384 input.
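The table can be reproduced with a small sketch of the resize rule described above. It assumes the processor scales the shorter edge to `shortest_edge`, caps the longer edge at `int(1333 / 800 * shortest_edge)`, rounds to the nearest integer, and floors each side to a multiple of `size_divisor`. `vilt_resize_hw` is a hypothetical name, and the rounding details are an assumption rather than a copy of the library code.

```python
def vilt_resize_hw(height, width, shortest_edge=384, size_divisor=32):
    """Approximate the (H, W) that ViLT preprocessing produces for one image."""
    longest_edge = int(1333 / 800 * shortest_edge)  # 639 when shortest_edge=384

    # Scale so the shorter side equals shortest_edge, keeping aspect ratio.
    scale = shortest_edge / min(height, width)
    new_h, new_w = height * scale, width * scale

    # Shrink again if the longer side exceeds the cap.
    if max(new_h, new_w) > longest_edge:
        shrink = longest_edge / max(new_h, new_w)
        new_h, new_w = new_h * shrink, new_w * shrink

    # Round to nearest integer, then floor to a multiple of size_divisor.
    new_h, new_w = int(new_h + 0.5), int(new_w + 0.5)
    return (new_h // size_divisor * size_divisor,
            new_w // size_divisor * size_divisor)


for h, w in [(384, 384), (480, 640), (640, 480), (800, 600)]:
    print((h, w), "->", vilt_resize_hw(h, w))
```

Only the square input maps to 384×384 under this rule, which is the case the static export covers.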
Also: the cited reason for pinning — "dynamic symbols trip Resize shape-inference (H:12 W:12 → H:0 W:0)" (L181–183) — describes the original visual_embed (pixel_mask.sum()→0 under tracing), not the patched path. The patch replaced that with a static-grid bilinear interpolate on real dims, so the 0×0 premise no longer applies and dynamic H/W may export fine now.
Suggest either (a) make H/W dynamic — the patch already interpolates to actual x.shape[2], x.shape[3], so re-test the Resize export — or (b) keep it static but document the real constraint honestly (square-384-only + the preprocessing/accuracy caveat) instead of asserting the processor always emits 384×384.
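As a rough illustration of the difference between the two options, the sketch below models which runtime shapes a graph input would accept under each axis mapping. The axis names `"height"` and `"width"`, the traced shape, and the helper `shape_is_accepted` are all assumptions for illustration; whether the export actually succeeds with dynamic H/W still has to be re-tested.

```python
# Shape assumed to be used when tracing the export (hypothetical).
TRACED_SHAPE = (1, 3, 384, 384)

# Current mapping from the diff: only the batch axis is dynamic.
static_hw_axes = {0: "batch_size"}

# Option (a): also mark H and W dynamic. Axis names are illustrative.
dynamic_hw_axes = {0: "batch_size", 2: "height", 3: "width"}


def shape_is_accepted(shape, axes, traced_shape=TRACED_SHAPE):
    """Return True if every non-dynamic axis matches the traced dimension."""
    if len(shape) != len(traced_shape):
        return False
    return all(
        index in axes or dim == traced
        for index, (dim, traced) in enumerate(zip(shape, traced_shape))
    )


non_square = (1, 3, 384, 512)  # processor output for a 480x640 image
print(shape_is_accepted(non_square, static_hw_axes))   # False
print(shape_is_accepted(non_square, dynamic_hw_axes))  # True
```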
    VisionDecoderIOConfig as _VisionDecoderIOConfig,  # triggers registration
)
from .vision_encoder_decoder import VisionEncoderIOConfig as _VisionEncoderIOConfig
from .vilt import MODEL_CLASS_MAPPING as _VILT_CLASS_MAPPING
This will fail CI lint (ruff I001). The two new .vilt imports are placed after .vision_encoder_decoder, but vilt sorts before vision_encoder_decoder (vil < vis). Ruff's I rule is enabled and this file's per-file-ignores (D104, E402, F401, F403) don't exempt I, so ruff check errors with I001 Import block is un-sorted (confirmed against this branch). Per the repo CLAUDE.md ("Run uv run ruff check --fix after revising Python code"), running that reorders the block and resolves it.
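The ordering claim is easy to check with plain string comparison. This is only an approximation of Ruff's isort ordering (it ignores sections and any custom sort settings), but it is enough for two sibling relative imports:

```python
# Module names as they appear in the relative imports.
modules = ["vision_encoder_decoder", "vilt"]

# Lexicographic order: "vil" < "vis" because "l" < "s".
expected_order = sorted(modules)
print(expected_order)  # ['vilt', 'vision_encoder_decoder']
```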
1,
40
],
"value_range": [
Recipe contradicts the PR's own reviewer note on mask value_range. The PR description says "Recipe value_range for mask-of-ones inputs must be [1, 2] not [0, 1] because randint high is exclusive." But here attention_mask (and token_type_ids below) use [0, 2]. It's functionally harmless for export — tracing doesn't depend on input values and the PT-vs-ORT check feeds identical inputs to both — but the note and the file disagree and will mislead anyone re-deriving this recipe. Please reconcile (fix the note or the values).
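To make the three candidate ranges concrete, the sketch below enumerates which values an exclusive-high integer sampler can produce. It assumes the recipe's `value_range` is passed as `(low, high)` to such a sampler (as with `torch.randint` or `numpy.random.randint`); `randint_support` is a hypothetical helper, not part of the recipe tooling. Note that the standard-library `random.randint` is inclusive and would behave differently.

```python
def randint_support(low, high):
    """Values an exclusive-high integer sampler can return for [low, high)."""
    return set(range(low, high))


print(randint_support(0, 1))  # {0}: a "mask" that is always zero
print(randint_support(1, 2))  # {1}: always one, as the PR note requires
print(randint_support(0, 2))  # {0, 1}: what the recipe file currently uses
```

So `[0, 2]` yields a mix of zeros and ones rather than a mask of ones, which is the mismatch between the note and the file.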
Summary
Adds first-class support for ViLT under the `visual-question-answering` task, validated on `dandelin/vilt-b32-finetuned-vqa`.

ViLT has no vendor optimum coverage, and its stock `ViltEmbeddings.visual_embed` is fundamentally not ONNX-traceable (Python iteration over tensor shapes, `torch.multinomial`, per-row `nonzero()` loops). Eager works because the loops resolve concretely; tracing fails. This PR therefore ships:

- `ViltVqaOnnxConfig(OnnxConfig)` registered via `@register_onnx_overwrite("vilt", "visual-question-answering")`.
- `_ViltVisualEmbedPatcher(ModelPatcher)` that swaps `visual_embed` for a static-shape replacement using `nn.functional.interpolate(spatial_pos, size=(H, W), mode='bilinear', align_corners=True)` and a synthesized all-ones token mask.
- Static H/W axes on `pixel_values`; `pixel_mask` is intentionally omitted from the export signature since the patched path doesn't read it (leaving it in would create a dead graph input).
- Registration in `models/hf/__init__.py`.

Files changed
- `src/winml/modelkit/models/hf/vilt.py`
- `src/winml/modelkit/models/hf/__init__.py`
- `examples/recipes/dandelin_vilt-b32-finetuned-vqa/visual-question-answering_config.json`
- `examples/recipes/README.md`

Validation (dandelin/vilt-b32-finetuned-vqa @ CPU fp32)
Notes for reviewers
- Inputs: `input_ids`, `attention_mask`, `token_type_ids` (dynamic `batch_size` / `sequence_length`), `pixel_values` (only `batch_size` dynamic, H/W=384 static).
- Output: `logits` (3129-way), `batch_size` dynamic.
- Recipe `value_range` for mask-of-ones inputs must be `[1, 2]` not `[0, 1]` because `randint` `high` is exclusive; relevant if anyone re-derives this recipe.