Add ViLT (dandelin/vilt-b32-finetuned-vqa) visual-question-answering support #951

Draft
ssss141414 wants to merge 1 commit into main from shzhen/add-vilt-vqa
@ssss141414

Contributor

Summary

Adds first-class support for ViLT under the visual-question-answering task, validated on dandelin/vilt-b32-finetuned-vqa.

ViLT has no vendor optimum coverage, and its stock ViltEmbeddings.visual_embed is fundamentally not ONNX-traceable (Python iteration over tensor shapes, torch.multinomial, per-row nonzero() loops). Eager works because the loops resolve concretely; tracing fails. This PR therefore ships:

  1. A from-scratch ViltVqaOnnxConfig(OnnxConfig) registered via @register_onnx_overwrite("vilt", "visual-question-answering").
  2. A _ViltVisualEmbedPatcher(ModelPatcher) that swaps visual_embed for a static-shape replacement using nn.functional.interpolate(spatial_pos, size=(H, W), mode='bilinear', align_corners=True) and a synthesized all-ones token mask.
  3. Pinned static H/W on pixel_values; pixel_mask is intentionally omitted from the export signature since the patched path doesn't read it (leaving it in would create a dead graph input).
  4. Recipe + README row + model class mapping wired into models/hf/__init__.py.
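
To make item 2 concrete, here is a minimal pure-Python sketch of what bilinear interpolation with align_corners=True computes on a small 2-D grid. It is not the patcher code from this PR; the function name `resize_bilinear_align_corners` is hypothetical, and the real implementation calls nn.functional.interpolate on tensors. The point is that the output size is given up front, so nothing depends on data-dependent loops.

```python
def resize_bilinear_align_corners(grid, out_h, out_w):
    """Resize a 2-D grid (list of rows) bilinearly, align_corners=True semantics.

    Hypothetical helper for illustration; the PR itself uses
    nn.functional.interpolate on the position-embedding tensor.
    """
    in_h, in_w = len(grid), len(grid[0])

    def source_coord(dst_index, in_size, out_size):
        # align_corners=True maps the first and last output samples exactly
        # onto the first and last input samples.
        if out_size == 1:
            return 0.0
        return dst_index * (in_size - 1) / (out_size - 1)

    out = []
    for i in range(out_h):
        y = source_coord(i, in_h, out_h)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = source_coord(j, in_w, out_w)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Blend along x on the two neighbouring rows, then along y.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Toy 2x2 "position grid" upsampled to 3x3; corner values are preserved.
resized = resize_bilinear_align_corners([[0.0, 1.0], [2.0, 3.0]], 3, 3)
for row in resized:
    print(row)
```

Because the target (H, W) is a plain argument, the same arithmetic applies to any grid size without inspecting tensor contents.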

Files changed

| File | Kind |
| --- | --- |
| src/winml/modelkit/models/hf/vilt.py | NEW (190 LOC) |
| src/winml/modelkit/models/hf/__init__.py | +3 (wiring) |
| examples/recipes/dandelin_vilt-b32-finetuned-vqa/visual-question-answering_config.json | NEW |
| examples/recipes/README.md | +1 row |

Validation (dandelin/vilt-b32-finetuned-vqa @ CPU fp32)

| Gate | Result |
| --- | --- |
| L0 build | ✅ Build complete in 62.9s (Export 29.8s, Optimize 32.2s); 449.2 MB optimized ONNX |
| L1 perf | ✅ mean=67.49 ms, p50=65.83 ms, p90=76.52 ms, throughput=14.82 samples/sec, std=5.92 ms (20 iters, warmup 3) |
| L2 numerics (PT vs ORT) | ✅ cos=1.000000, max_abs_diff=4.2e-5, top class match (3129-way head) |
| Patched-vs-original PT parity | ✅ cos=1.000000, max_abs_diff=1.2e-5, same argmax |
| L3 dataset eval | ⏭ skipped (no default VQA dataset wired) |
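
For readers unfamiliar with the three numerics figures (cos, max_abs_diff, top class match), the sketch below shows one plausible way to compute them for a pair of logit vectors. It is an assumption about what the gate measures, not the validation code used for this PR; the function names and the sample values are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def same_argmax(a, b):
    """True when both vectors pick the same top class."""
    top_a = max(range(len(a)), key=a.__getitem__)
    top_b = max(range(len(b)), key=b.__getitem__)
    return top_a == top_b


# Made-up logits standing in for the PyTorch and ONNX Runtime outputs.
reference = [0.10, 2.50, -1.20, 0.30]
candidate = [0.10001, 2.49998, -1.20002, 0.30001]

print(f"cos={cosine_similarity(reference, candidate):.6f}")
print(f"max_abs_diff={max_abs_diff(reference, candidate):.1e}")
print(f"same argmax={same_argmax(reference, candidate)}")
```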

Notes for reviewers

  • Inputs declared: input_ids, attention_mask, token_type_ids (dynamic batch_size/sequence_length), pixel_values (only batch_size dynamic, H/W=384 static).
  • Output: logits (3129-way), batch_size dynamic.
  • Opset 17, fp32, CPU/auto resolution.
  • Recipe value_range for mask-of-ones inputs must be [1, 2] not [0, 1] because randint high is exclusive — relevant if anyone re-derives this recipe.
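
The half-open range mentioned in the last bullet can be checked with a short sketch. It uses the standard library's random.randrange as a stand-in, since that function shares the "high is exclusive" convention; how the recipe sampler is actually implemented is an assumption, and `sample_mask` is a hypothetical helper. (Python's random.randint is inclusive on both ends, so it is deliberately not used here.)

```python
import random


def sample_mask(length, value_range, seed=0):
    """Draw `length` integers from [low, high), i.e. high is excluded."""
    low, high = value_range
    rng = random.Random(seed)
    return [rng.randrange(low, high) for _ in range(length)]


# [1, 2) can only ever produce 1, so this yields a mask of ones.
print(set(sample_mask(40, (1, 2))))

# [0, 1) can only ever produce 0, so it would yield an all-zero mask.
print(set(sample_mask(40, (0, 1))))

# [0, 2) mixes zeros and ones.
print(sorted(set(sample_mask(200, (0, 2)))))
```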

…support

Adds OnnxConfig + ModelPatcher for ViLT visual-question-answering since vendor optimum coverage is absent and stock ViltEmbeddings.visual_embed is not ONNX-traceable (Python iteration over tensor shapes, torch.multinomial, per-row nonzero loops). Patcher swaps in a static-shape replacement using nn.functional.interpolate for spatial position embeddings and a synthesized all-ones token mask. H/W axes are pinned static; pixel_mask is intentionally dropped since the patched path does not reference it.

Validated on dandelin/vilt-b32-finetuned-vqa @ CPU fp32:
- L0 build: 62.9s, 449.2 MB optimized ONNX
- L1 perf: p50=65.83ms, throughput=14.82 samples/sec (20 iters, warmup 3)
- L2 numerics: cos=1.000000, max_abs_diff=4.2e-5, top-class match (3129-way head)
@ssss141414

Contributor Author

Reviewer verification: OV cpu / gpu / npu, branch `shzhen/add-vilt-vqa`

Commands

```powershell
# config
uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering -o temp/verify_pr951_vilt_config.json

# build (OV CPU, fp32, using recipe)
uv run winml build -c examples/recipes/dandelin_vilt-b32-finetuned-vqa/visual-question-answering_config.json -m dandelin/vilt-b32-finetuned-vqa -o temp/verify_pr951_vilt_build --ep openvino --device cpu --precision fp32 --no-quant --no-compile --rebuild

# perf: cpu / gpu / npu (from built ONNX, 5 iters + 2 warmup)
uv run winml perf -m temp/verify_pr951_vilt_build/model.onnx --ep openvino --device cpu --iterations 5 --warmup 2 --skip-build -f json
uv run winml perf -m temp/verify_pr951_vilt_build/model.onnx --ep openvino --device gpu --iterations 5 --warmup 2 --skip-build -f json
uv run winml perf -m temp/verify_pr951_vilt_build/model.onnx --ep openvino --device npu --iterations 5 --warmup 2 --skip-build -f json

# eval schema check
uv run winml eval --schema --task visual-question-answering
```

Results

| Command | cpu | gpu | npu |
| --- | --- | --- | --- |
| config | ✅ PASS | | |
| build | ✅ PASS (94s, 449.2 MB, autoconf converged in 2 iters) | | |
| perf mean | ✅ 213 ms/iter | ✅ 7.9 ms/iter | ✅ 21 ms/iter |
| perf throughput | 4.69 samples/s | 125.83 samples/s | 46.95 samples/s |
| eval | ❌ CLI-UNSUPPORTED | ❌ CLI-UNSUPPORTED | ❌ CLI-UNSUPPORTED |

Notes:

  • `config` / `build` / `perf` pass cleanly on all three OV devices. OV sessions were created successfully for cpu, gpu, and npu.
  • `eval` returns "Task 'visual-question-answering' is not supported by `winml eval`." This is a CLI limitation, not an OV EP limitation: the VQA task is not yet wired into the eval pipeline.
  • ONNX artifact: 954 nodes, opset 17, fp32; inputs: `input_ids[1,40]`, `attention_mask[1,40]`, `token_type_ids[1,40]`, `pixel_values[1,3,384,384]`; output: `logits[1,3129]`.
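
As a sanity check on the perf rows, throughput at batch size 1 should be roughly the reciprocal of the mean latency. The sketch below assumes that relationship and a batch size of 1 (consistent with the `[1,40]` input shapes above); whether `winml perf` derives throughput exactly this way is not confirmed here, and small gaps are expected because the reported means are rounded.

```python
def throughput_from_mean_latency(mean_ms, batch_size=1):
    """Samples per second implied by a mean per-iteration latency in milliseconds."""
    return batch_size * 1000.0 / mean_ms


# (device, reported mean ms/iter, reported samples/s) copied from the results table.
reported = [("cpu", 213, 4.69), ("gpu", 7.9, 125.83), ("npu", 21, 46.95)]

for device, mean_ms, samples_per_s in reported:
    implied = throughput_from_mean_latency(mean_ms)
    print(f"{device}: implied {implied:.2f} samples/s vs reported {samples_per_s}")
```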

@ssss141414

Contributor Author

Validation results (2026-06-25) for PR #951 on this Windows ARM64 host.

Scope

  • Compare main vs PR branch behavior
  • Verify winml config/build on QNN NPU/GPU where applicable

Main branch baseline (before PR)

  • Command: uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering --ep cpu --device cpu
  • Result: FAIL
  • Error: No OnnxConfig registered for model_type='vilt' with task='visual-question-answering'
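
The baseline failure reads like a registry lookup miss keyed on (model_type, task). The toy sketch below illustrates that pattern only; every name in it (`_REGISTRY`, `register`, `lookup`, `ToyViltVqaConfig`) is hypothetical and is not the winml or optimum API.

```python
# Toy registry keyed by (model_type, task); purely illustrative.
_REGISTRY = {}


def register(model_type, task):
    """Class decorator that records a config class under (model_type, task)."""
    def decorator(cls):
        _REGISTRY[(model_type, task)] = cls
        return cls
    return decorator


def lookup(model_type, task):
    """Return the registered class or raise with a descriptive message."""
    try:
        return _REGISTRY[(model_type, task)]
    except KeyError:
        raise KeyError(
            f"No config registered for model_type={model_type!r} with task={task!r}"
        ) from None


# Before registration the lookup fails, mirroring the behaviour on main.
try:
    lookup("vilt", "visual-question-answering")
except KeyError as exc:
    print("before:", exc)


@register("vilt", "visual-question-answering")
class ToyViltVqaConfig:
    """Stand-in for a real export config."""


# After registration the same lookup succeeds, mirroring the PR branch.
print("after:", lookup("vilt", "visual-question-answering").__name__)
```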

PR #951 branch

  • CPU config: PASS
    • uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering --ep cpu --device cpu -o temp/vilt_config_test.json
  • QNN NPU config: PASS
    • uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering --ep qnn --device npu -o temp/vilt_qnn_npu_config.json
    • Resolved to Device=NPU, EP=QNNExecutionProvider
  • QNN NPU build: PASS
    • uv run winml build -c temp/vilt_qnn_npu_config.json -m dandelin/vilt-b32-finetuned-vqa -o temp/vilt_qnn_npu_build --rebuild
    • Build complete in 172.8s
  • QNN GPU config: PASS
    • uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering --ep qnn --device gpu
    • (build/perf on QNN GPU not completed in this run)

Conclusion

  • Confirmed: this PR introduces real support for ViLT visual-question-answering (main fails, PR passes), and QNN NPU path builds successfully.

@ssss141414

Contributor Author

ADDENDUM: main branch baseline (NO support)

On current `main` @ HEAD:
```powershell
uv run winml config -m dandelin/vilt-b32-finetuned-vqa --task visual-question-answering
```
Returns:
```
Error: No OnnxConfig registered for model_type='vilt' with task='visual-question-answering'.
vilt is not supported yet for transformers.
```

Conclusion: This PR introduces vilt support. The `vilt.py` source changes (custom `ViltVqaOnnxConfig` + `_ViltVisualEmbedPatcher`) are necessary and not catalog-only. All OV devices now pass config/build/perf validation.

"input_ids": {0: "batch_size", 1: "sequence_length"},
"attention_mask": {0: "batch_size", 1: "sequence_length"},
"token_type_ids": {0: "batch_size", 1: "sequence_length"},
"pixel_values": {0: "batch_size"},

Contributor

Static pixel_values H/W silently restricts the export to square images, and the justification here is inaccurate.

I checked ViltImageProcessor against the runtime. The default is size={"shortest_edge": 384}, size_divisor=32 — it pins the shorter edge to 384 and preserves aspect ratio (longer edge ≈ up to int(1333/800*384), floored to a multiple of 32), then pads to the per-batch max. It does not pad to 384×384. Empirically (batch=1, defaults):

| input H×W | pixel_values | pixel_mask all-ones |
| --- | --- | --- |
| 384×384 | (1,3,384,384) | |
| 480×640 | (1,3,384,512) | |
| 640×480 | (1,3,512,384) | |
| 800×600 | (1,3,512,384) | |

So the all-ones pixel_mask assumption is correct (good), but "always 384×384 via ViltProcessor" (docstring L183) is not — only square inputs land on 384×384. Because inputs marks only batch_size dynamic, the exported ONNX accepts only 384×384; a standard non-square ViltProcessor output (e.g. 384×512) fails at session.run, or forces callers to square-resize (distorts aspect ratio → VQA-accuracy risk). L2 numerics likely passed because validation used a 384×384 input.
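
The resize rule described above can be approximated in a few lines. This is a sketch reconstructed from the description in this comment, not a copy of the ViltImageProcessor source: the function name `vilt_like_resize` is hypothetical, and the round-to-nearest step before flooring to a multiple of 32 is an assumption. It reproduces the four shapes in the table.

```python
def vilt_like_resize(height, width, shorter=384, size_divisor=32):
    """Approximate output (H, W) for a shortest-edge resize with a longer-edge cap.

    Hypothetical helper: pins the shorter edge to `shorter`, caps the longer
    edge at int(1333 / 800 * shorter), then floors both to a multiple of
    `size_divisor`. Rounding to nearest before flooring is an assumption.
    """
    longer = int(1333 / 800 * shorter)
    scale = shorter / min(height, width)
    new_h, new_w = height * scale, width * scale
    if max(new_h, new_w) > longer:
        shrink = longer / max(new_h, new_w)
        new_h, new_w = new_h * shrink, new_w * shrink
    new_h, new_w = int(new_h + 0.5), int(new_w + 0.5)
    return (new_h // size_divisor * size_divisor,
            new_w // size_divisor * size_divisor)


for h, w in [(384, 384), (480, 640), (640, 480), (800, 600)]:
    print(f"{h}x{w} -> {vilt_like_resize(h, w)}")
```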

Also: the cited reason for pinning — "dynamic symbols trip Resize shape-inference (H:12 W:12 → H:0 W:0)" (L181–183) — describes the original visual_embed (pixel_mask.sum()→0 under tracing), not the patched path. The patch replaced that with a static-grid bilinear interpolate on real dims, so the 0×0 premise no longer applies and dynamic H/W may export fine now.

Suggest either (a) make H/W dynamic — the patch already interpolates to actual x.shape[2], x.shape[3], so re-test the Resize export — or (b) keep it static but document the real constraint honestly (square-384-only + the preprocessing/accuracy caveat) instead of asserting the processor always emits 384×384.
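
To illustrate the difference between the two options, the sketch below checks concrete pixel_values shapes against a static spec and a dynamic one. The spec tuples and the `shape_matches` helper are hypothetical, and the symbolic axis names "height" and "width" are placeholders; this is not how ONNX Runtime validates inputs internally, only a model of the constraint.

```python
# Integers are fixed dimensions; strings are symbolic (dynamic) dimensions.
STATIC_SPEC = ("batch_size", 3, 384, 384)            # option (b): square-384 only
DYNAMIC_SPEC = ("batch_size", 3, "height", "width")  # option (a): H/W dynamic


def shape_matches(shape, spec):
    """True if a concrete shape satisfies the spec."""
    if len(shape) != len(spec):
        return False
    return all(isinstance(want, str) or got == want
               for got, want in zip(shape, spec))


square = (1, 3, 384, 384)
landscape = (1, 3, 384, 512)  # shape reported above for a 480x640 input

print("static  accepts square:   ", shape_matches(square, STATIC_SPEC))
print("static  accepts landscape:", shape_matches(landscape, STATIC_SPEC))
print("dynamic accepts landscape:", shape_matches(landscape, DYNAMIC_SPEC))
```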

VisionDecoderIOConfig as _VisionDecoderIOConfig, # triggers registration
)
from .vision_encoder_decoder import VisionEncoderIOConfig as _VisionEncoderIOConfig
from .vilt import MODEL_CLASS_MAPPING as _VILT_CLASS_MAPPING

Contributor

This will fail CI lint (ruff I001). The two new .vilt imports are placed after .vision_encoder_decoder, but vilt sorts before vision_encoder_decoder (vil < vis). Ruff's I rule is enabled and this file's per-file-ignores (D104, E402, F401, F403) don't exempt I, so ruff check errors with I001 Import block is un-sorted (confirmed against this branch). Per the repo CLAUDE.md ("Run uv run ruff check --fix after revising Python code"), running that reorders the block and resolves it.
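
The ordering claim is easy to verify with plain string comparison. This sketch only shows the character-level ordering of the two module names; ruff's import sorting applies further rules (sections, relative-import handling) that are not modelled here.

```python
modules = ["vision_encoder_decoder", "vilt"]

# "vil" < "vis" because "l" sorts before "s" at the third character.
print(sorted(modules))
print("vilt" < "vision_encoder_decoder")
```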

1,
40
],
"value_range": [

Contributor

Recipe contradicts the PR's own reviewer note on mask value_range. The PR description says "Recipe value_range for mask-of-ones inputs must be [1, 2] not [0, 1] because randint high is exclusive." But here attention_mask (and token_type_ids below) use [0, 2]. It's functionally harmless for export — tracing doesn't depend on input values and the PT-vs-ORT check feeds identical inputs to both — but the note and the file disagree and will mislead anyone re-deriving this recipe. Please reconcile (fix the note or the values).
