Add ViLT (dandelin/vilt-b32-finetuned-vqa) visual-question-answering support by ssss141414 · Pull Request #951 · microsoft/winml-cli

ssss141414 · 2026-06-24T04:13:52Z

Summary

Adds first-class support for ViLT under the visual-question-answering task, validated on dandelin/vilt-b32-finetuned-vqa.

ViLT has no vendor optimum coverage, and its stock ViltEmbeddings.visual_embed is fundamentally not ONNX-traceable (Python iteration over tensor shapes, torch.multinomial, per-row nonzero() loops). Eager works because the loops resolve concretely; tracing fails. This PR therefore ships:

A from-scratch ViltVqaOnnxConfig(OnnxConfig) registered via @register_onnx_overwrite("vilt", "visual-question-answering").
A _ViltVisualEmbedPatcher(ModelPatcher) that swaps visual_embed for a static-shape replacement using nn.functional.interpolate(spatial_pos, size=(H, W), mode='bilinear', align_corners=True) and a synthesized all-ones token mask.
Pinned static H/W on pixel_values; pixel_mask is intentionally omitted from the export signature since the patched path doesn't read it (leaving it in would create a dead graph input).
Recipe + README row + model class mapping wired into models/hf/__init__.py.
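To make the resampling step concrete, here is a pure-Python sketch of what bilinear interpolation with align_corners=True computes on a tiny grid. It is not the code in vilt.py: the helper name `bilinear_align_corners`, the 2×2 toy grid, and the 3×3 target size are all illustrative assumptions, and the real patcher calls nn.functional.interpolate on the position-embedding tensor.

```python
def bilinear_align_corners(grid, out_h, out_w):
    """Resample a 2-D list of floats to (out_h, out_w), align_corners=True style."""
    in_h, in_w = len(grid), len(grid[0])

    def src_coord(dst_index, in_size, out_size):
        # With align_corners=True, the first and last output samples land
        # exactly on the first and last input samples.
        if out_size == 1:
            return 0.0
        return dst_index * (in_size - 1) / (out_size - 1)

    out = []
    for i in range(out_h):
        y = src_coord(i, in_h, out_h)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = src_coord(j, in_w, out_w)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Blend horizontally on the two neighbouring rows, then vertically.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Hypothetical 2x2 "position grid" resized to a 3x3 patch grid.
pos = [[0.0, 1.0], [2.0, 3.0]]
resized = bilinear_align_corners(pos, 3, 3)

# Synthesized all-ones token mask: one entry per patch, no data-dependent masking.
token_mask = [1] * (len(resized) * len(resized[0]))
```

Because the output size depends only on the input shape and the mask is constant, nothing in this path branches on tensor values, which is the property the tracer needs.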

Files changed

File	Kind
`src/winml/modelkit/models/hf/vilt.py`	NEW (190 LOC)
`src/winml/modelkit/models/hf/__init__.py`	+3 (wiring)
`examples/recipes/dandelin_vilt-b32-finetuned-vqa/visual-question-answering_config.json`	NEW
`examples/recipes/README.md`	+1 row

Validation (dandelin/vilt-b32-finetuned-vqa @ CPU fp32)

Gate	Result
L0 build	✅ Build complete in 62.9s (Export 29.8s, Optimize 32.2s); 449.2 MB optimized ONNX
L1 perf	✅ mean=67.49 ms, p50=65.83 ms, p90=76.52 ms, throughput=14.82 samples/sec, std=5.92 ms (20 iters, warmup 3)
L2 numerics (PT vs ORT)	✅ cos=1.000000, max_abs_diff=4.2e-5, top class match (3129-way head)
Patched-vs-original PT parity	✅ cos=1.000000, max_abs_diff=1.2e-5, same argmax
L3 dataset eval	⏭ skipped (no default VQA dataset wired)
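For readers re-deriving the L2 numbers, the three reported quantities (cosine similarity, max absolute difference, top-class match) can be sketched in plain Python as below. How the gate computes them internally is not shown in this PR, so the function names and the sample logits here are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def argmax(values):
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits standing in for the PyTorch and ONNX Runtime outputs.
pt_logits = [0.10, 2.50, -1.30, 0.70]
ort_logits = [0.10001, 2.50002, -1.30001, 0.69999]

cos = cosine_similarity(pt_logits, ort_logits)
diff = max_abs_diff(pt_logits, ort_logits)
same_top_class = argmax(pt_logits) == argmax(ort_logits)
```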

Notes for reviewers

Inputs declared: input_ids, attention_mask, token_type_ids (dynamic batch_size/sequence_length), pixel_values (only batch_size dynamic, H/W=384 static).
Output: logits (3129-way), batch_size dynamic.
Opset 17, fp32, CPU/auto resolution.
Recipe value_range for mask-of-ones inputs must be [1, 2] not [0, 1] because randint high is exclusive — relevant if anyone re-derives this recipe.
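The half-open range point in the last note can be illustrated with a short sketch. It assumes the recipe's value_range is fed to a randint-style sampler whose upper bound is exclusive (as in torch.randint and numpy.random.randint); the helper `sample_ints` is hypothetical. Python's own random.randint includes both endpoints, so the sketch uses randrange to mimic the exclusive bound.

```python
import random


def sample_ints(value_range, count, seed=0):
    """Draw `count` integers from the half-open interval [low, high)."""
    low, high = value_range
    rng = random.Random(seed)
    # randrange excludes `high`, matching the exclusive upper bound the note describes.
    return [rng.randrange(low, high) for _ in range(count)]


all_ones = sample_ints([1, 2], 8)    # only the value 1 is possible
all_zeros = sample_ints([0, 1], 8)   # only the value 0 is possible
mixed = sample_ints([0, 2], 8)       # zeros and ones are both possible
```

Under that assumption, [0, 1] would produce a mask of zeros rather than ones, while [0, 2] produces a mix of both values.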

…support

Adds OnnxConfig + ModelPatcher for ViLT visual-question-answering since vendor optimum coverage is absent and stock ViltEmbeddings.visual_embed is not ONNX-traceable (Python iteration over tensor shapes, torch.multinomial, per-row nonzero loops).

Patcher swaps in a static-shape replacement using nn.functional.interpolate for spatial position embeddings and a synthesized all-ones token mask. H/W axes are pinned static; pixel_mask is intentionally dropped since the patched path does not reference it.

Validated on dandelin/vilt-b32-finetuned-vqa @ CPU fp32:
- L0 build: 62.9s, 449.2 MB optimized ONNX
- L1 perf: p50=65.83ms, throughput=14.82 samples/sec (20 iters, warmup 3)
- L2 numerics: cos=1.000000, max_abs_diff=4.2e-5, top-class match (3129-way head)

ssss141414 · 2026-06-25T03:31:26Z

Reviewer verification: OV cpu / gpu / npu, branch `shzhen/add-vilt-vqa`

Commands

```powershell
# config
uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering -o temp/verify_pr951_vilt_config.json

# build (OV CPU, fp32, using recipe)
uv run winml build -c examples/recipes/dandelin_vilt-b32-finetuned-vqa/visual-question-answering_config.json -m dandelin/vilt-b32-finetuned-vqa -o temp/verify_pr951_vilt_build --ep openvino --device cpu --precision fp32 --no-quant --no-compile --rebuild

# perf: cpu / gpu / npu (from built ONNX, 5 iters + 2 warmup)
uv run winml perf -m temp/verify_pr951_vilt_build/model.onnx --ep openvino --device cpu --iterations 5 --warmup 2 --skip-build -f json
uv run winml perf -m temp/verify_pr951_vilt_build/model.onnx --ep openvino --device gpu --iterations 5 --warmup 2 --skip-build -f json
uv run winml perf -m temp/verify_pr951_vilt_build/model.onnx --ep openvino --device npu --iterations 5 --warmup 2 --skip-build -f json

# eval schema check
uv run winml eval --schema --task visual-question-answering
```

Results

Command	cpu	gpu	npu
config	✅ PASS	—	—
build	✅ PASS (94s, 449.2 MB, autoconf converged in 2 iters)	—	—
perf mean	✅ 213 ms/iter	✅ 7.9 ms/iter	✅ 21 ms/iter
perf throughput	4.69 samples/s	125.83 samples/s	46.95 samples/s
eval	❌ CLI-UNSUPPORTED	❌ CLI-UNSUPPORTED	❌ CLI-UNSUPPORTED

Notes:

`config` / `build` / `perf` pass cleanly on all three OV devices. OV sessions created successfully for cpu, gpu, and npu.
`eval` returns "Task 'visual-question-answering' is not supported by `winml eval`." This is a CLI limitation, not an OV EP limitation. VQA task is not yet wired into the eval pipeline.
ONNX artifact: 954 nodes, opset 17, fp32, inputs: `input_ids[1,40]`, `attention_mask[1,40]`, `token_type_ids[1,40]`, `pixel_values[1,3,384,384]`, output: `logits[1,3129]`.

ssss141414 · 2026-06-25T03:32:32Z

Validation results (2026-06-25) for PR #951 on this Windows ARM64 host.

Scope

Compare main vs PR branch behavior
Verify winml config/build on QNN NPU/GPU where applicable

Main branch baseline (before PR)

Command: uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering --ep cpu --device cpu
Result: FAIL
Error: No OnnxConfig registered for model_type='vilt' with task='visual-question-answering'

PR #951 branch

CPU config: PASS
- uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering --ep cpu --device cpu -o temp/vilt_config_test.json
QNN NPU config: PASS
- uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering --ep qnn --device npu -o temp/vilt_qnn_npu_config.json
- Resolved to Device=NPU, EP=QNNExecutionProvider
QNN NPU build: PASS
- uv run winml build -c temp/vilt_qnn_npu_config.json -m dandelin/vilt-b32-finetuned-vqa -o temp/vilt_qnn_npu_build --rebuild
- Build complete in 172.8s
QNN GPU config: PASS
- uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering --ep qnn --device gpu
- (build/perf on QNN GPU not completed in this run)

Conclusion

Confirmed: this PR introduces real support for ViLT visual-question-answering (main fails, PR passes), and QNN NPU path builds successfully.

ssss141414 · 2026-06-25T03:34:12Z

ADDENDUM: main branch baseline (NO support)

On current `main` @ HEAD:
```powershell
uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering
```
Returns:
```
Error: No OnnxConfig registered for model_type='vilt' with task='visual-question-answering'.
vilt is not supported yet for transformers.
```

Conclusion: This PR introduces vilt support. The `vilt.py` source changes (custom `ViltVqaOnnxConfig` + `_ViltVisualEmbedPatcher`) are necessary and not catalog-only. All OV devices now pass config/build/perf validation.

xieofxie · 2026-06-25T06:12:26Z

+            "input_ids": {0: "batch_size", 1: "sequence_length"},
+            "attention_mask": {0: "batch_size", 1: "sequence_length"},
+            "token_type_ids": {0: "batch_size", 1: "sequence_length"},
+            "pixel_values": {0: "batch_size"},


Static pixel_values H/W silently restricts the export to square images, and the justification here is inaccurate.

I checked ViltImageProcessor against the runtime. The default is size={"shortest_edge": 384}, size_divisor=32 — it pins the shorter edge to 384 and preserves aspect ratio (longer edge ≈ up to int(1333/800*384), floored to a multiple of 32), then pads to the per-batch max. It does not pad to 384×384. Empirically (batch=1, defaults):

input H×W	`pixel_values`	`pixel_mask` all-ones
384×384	(1,3,384,384)	✅
480×640	(1,3,384,512)	✅
640×480	(1,3,512,384)	✅
800×600	(1,3,512,384)	✅

So the all-ones pixel_mask assumption is correct (good), but "always 384×384 via ViltProcessor" (docstring L183) is not — only square inputs land on 384×384. Because inputs marks only batch_size dynamic, the exported ONNX accepts only 384×384; a standard non-square ViltProcessor output (e.g. 384×512) fails at session.run, or forces callers to square-resize (distorts aspect ratio → VQA-accuracy risk). L2 numerics likely passed because validation used a 384×384 input.

Also: the cited reason for pinning — "dynamic symbols trip Resize shape-inference (H:12 W:12 → H:0 W:0)" (L181–183) — describes the original visual_embed (pixel_mask.sum()→0 under tracing), not the patched path. The patch replaced that with a static-grid bilinear interpolate on real dims, so the 0×0 premise no longer applies and dynamic H/W may export fine now.

Suggest either (a) make H/W dynamic — the patch already interpolates to actual x.shape[2], x.shape[3], so re-test the Resize export — or (b) keep it static but document the real constraint honestly (square-384-only + the preprocessing/accuracy caveat) instead of asserting the processor always emits 384×384.
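The resize rule described in this comment can be sketched in plain Python to reproduce the shapes in the table. This is a reading of the behavior stated above (shortest edge pinned to 384, longer edge capped at int(1333/800*384), both floored to a multiple of 32), not a copy of the ViltImageProcessor source; the function names are hypothetical and the rounding step is an assumption.

```python
def vilt_resize_shape(height, width, shortest_edge=384, size_divisor=32):
    """Return the (H, W) a single image is resized to under the rule above."""
    longest_edge = int(1333 / 800 * shortest_edge)  # 639 for shortest_edge=384

    # Scale so the shorter side equals shortest_edge, preserving aspect ratio.
    scale = shortest_edge / min(height, width)
    new_h, new_w = height * scale, width * scale

    # Shrink again if the longer side exceeds the cap.
    if max(new_h, new_w) > longest_edge:
        shrink = longest_edge / max(new_h, new_w)
        new_h, new_w = new_h * shrink, new_w * shrink

    # Round to nearest integer (assumed), then floor to a multiple of size_divisor.
    new_h, new_w = int(new_h + 0.5), int(new_w + 0.5)
    return (new_h // size_divisor * size_divisor,
            new_w // size_divisor * size_divisor)


def fits_static_export(shape, static_hw=(384, 384)):
    """True only when the processed image matches the pinned H/W."""
    return shape == static_hw


shapes = {hw: vilt_resize_shape(*hw)
          for hw in [(384, 384), (480, 640), (640, 480), (800, 600)]}
accepted = {hw: fits_static_export(shape) for hw, shape in shapes.items()}
```

With these inputs only the square image maps to 384×384, which is the restriction option (b) would need to document.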

xieofxie · 2026-06-25T06:12:29Z

    VisionDecoderIOConfig as _VisionDecoderIOConfig,  # triggers registration
 )
 from .vision_encoder_decoder import VisionEncoderIOConfig as _VisionEncoderIOConfig
+from .vilt import MODEL_CLASS_MAPPING as _VILT_CLASS_MAPPING


This will fail CI lint (ruff I001). The two new .vilt imports are placed after .vision_encoder_decoder, but vilt sorts before vision_encoder_decoder (vil < vis). Ruff's I rule is enabled and this file's per-file-ignores (D104, E402, F401, F403) don't exempt I, so ruff check errors with I001 Import block is un-sorted (confirmed against this branch). Per the repo CLAUDE.md ("Run uv run ruff check --fix after revising Python code"), running that reorders the block and resolves it.
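The ordering claim can be checked with plain string comparison. This sketch assumes that, for these two lowercase module names, the import sorter's order matches Python's lexicographic order; it does not invoke ruff.

```python
# Module names as they appear in the relative imports (leading dot dropped).
modules_in_pr_order = ["vision_encoder_decoder", "vilt"]

# Lexicographic sort: "vil" < "vis" because "l" < "s" at the third character.
expected_order = sorted(modules_in_pr_order)
is_sorted = modules_in_pr_order == expected_order
```

Here `is_sorted` comes out False, consistent with the I001 report.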

xieofxie · 2026-06-25T06:12:31Z

+          1,
+          40
+        ],
+        "value_range": [


Recipe contradicts the PR's own reviewer note on mask value_range. The PR description says "Recipe value_range for mask-of-ones inputs must be [1, 2] not [0, 1] because randint high is exclusive." But here attention_mask (and token_type_ids below) use [0, 2]. It's functionally harmless for export — tracing doesn't depend on input values and the PT-vs-ORT check feeds identical inputs to both — but the note and the file disagree and will mislead anyone re-deriving this recipe. Please reconcile (fix the note or the values).

xieofxie reviewed Jun 25, 2026

ssss141414 wants to merge 1 commit into main from shzhen/add-vilt-vqa
