This repository provides the ONNX-converted version of the qwen3-vl-2b multimodal model, optimized for efficient image-to-text generation. The model supports inference on single images with a fixed input resolution of 224×224 and outputs descriptive text based on visual content.
- Model Type: ONNX-exported multimodal large language model (vision-language)
- Input Specification: Single RGB image (224×224 resolution, 3 channels)
- Output: High-quality natural language description of the input image
- Conversion Source: Original qwen3-vl-2b (PyTorch) → ONNX format
- Qwen3.5-VL ONNX: Qwen3.5 can also be converted to ONNX, but its inference speed is slower than PyTorch's. This is mainly because the `torch_chunk_gated_delta_rule` function in Qwen3.5 uses a large number of dynamic slicing operations and Python loops, so tracing produces a very large static computation graph in ONNX.
- Example input: Single RGB image (224×224) of a lemon.
- Prompt: Describe this image.
This image shows a single, yellow, spherical object that appears to be a small, smooth, and rounded lemon. It is placed on a light-colored, possibly white or off-white, surface with a wood grain texture. The lemon has a rounded, slightly flattened top and a smooth surface. The lighting is even, and the object is the central focus of the image.
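The graph-size blow-up noted above for Qwen3.5 comes from trace-based export unrolling Python loops: each loop iteration becomes a separate set of static nodes in the exported graph. The sketch below is purely illustrative (the op count per chunk is a made-up constant, not the real `torch_chunk_gated_delta_rule` graph); it only shows how the node count scales with sequence length when a per-chunk loop is unrolled at export time.

```python
# Illustrative sketch only: why tracing a per-chunk Python loop at export
# time yields a graph whose size grows with sequence length. OPS_PER_CHUNK
# is hypothetical, not the real torch_chunk_gated_delta_rule op count.

OPS_PER_CHUNK = 6  # hypothetical: slice, matmul, gate, delta update, ...

def traced_graph_size(seq_len: int, chunk_size: int) -> int:
    """Node count of a graph whose chunk loop is unrolled at trace time."""
    num_chunks = (seq_len + chunk_size - 1) // chunk_size
    # Trace-based ONNX export replays each loop iteration as separate
    # static nodes, so the graph grows linearly with the number of chunks.
    return num_chunks * OPS_PER_CHUNK

# Doubling the sequence length doubles the unrolled graph:
print(traced_graph_size(1024, 64))  # 16 chunks * 6 ops = 96
print(traced_graph_size(2048, 64))  # 32 chunks * 6 ops = 192
```

In eager PyTorch the same loop re-executes a small amount of code per chunk, which is why the exported static graph can be much slower to build and run than the original model.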
- Support images of different sizes
- Compare Torch and ONNX inference speed
- Convert ONNX to TensorRT to further improve inference speed
- Convert more models from Torch to ONNX
```shell
conda create -n onnx python=3.10 -y
conda activate onnx
pip install -r requirements.txt
```
```shell
git clone https://github.com/garlic-byte/Qwen3_VL_Export_ONNX_and_TensorRT.git
cd Qwen3_VL_Export_ONNX_and_TensorRT

# Download the model
mkdir qwen3-vl-2b
hf download Qwen/Qwen3-VL-2B-Instruct --local-dir=qwen3-vl-2b/

# Export to ONNX
python qwen3_vl_export_onnx.py

# Run ONNX inference
python inference_onnx.py
```
- CUDA 12.8
- TensorRT Debian local repo: nv-tensorrt-local-repo-ubuntu2404-10.9.0-cuda-12.8_1.0-1_amd64.deb
- Python TensorRT wheel: tensorrt-10.9.0.34
```shell
bash build_engine.sh
python inference_trt.py
```
- The ONNX model is exported from the original PyTorch implementation of qwen3-vl-2b.
- Input resolution is fixed at 224×224 (consistent with the model's training configuration).
- For optimal performance, use ONNX Runtime with GPU acceleration (install `onnxruntime-gpu` instead of `onnxruntime`).
- The model retains the original qwen3-vl-2b's visual understanding and text generation capabilities.
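Because the input resolution is fixed at 224×224, images of other sizes must be resized before inference. A minimal preprocessing sketch is shown below; the resize here is nearest-neighbor index sampling as a stand-in for a real resizer, and the float32 NCHW layout in [0, 1] is an assumption — the actual normalization used by `inference_onnx.py` may differ.

```python
# Minimal preprocessing sketch for the fixed 224x224 input. Assumption:
# the exported model expects a float32 NCHW tensor; check inference_onnx.py
# for the exact normalization used by the repo.
import numpy as np

def preprocess(image: np.ndarray, size: int = 224) -> np.ndarray:
    """Resize an HxWx3 uint8 image to a (1, 3, size, size) float32 array in [0, 1]."""
    h, w, _ = image.shape
    # Nearest-neighbor resize via index sampling (stand-in for a real resizer).
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image[rows[:, None], cols[None, :]]          # (size, size, 3)
    chw = resized.transpose(2, 0, 1).astype(np.float32) / 255.0  # HWC -> CHW
    return chw[None]                                       # add batch dim -> NCHW

img = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
x = preprocess(img)
print(x.shape, x.dtype)  # (1, 3, 224, 224) float32
```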
| Metric | Type | Value |
|---|---|---|
| Latency (1000 runs) | Torch (fp32) | 44.46 sec |
| Latency (1000 runs) | ONNX (fp32) | 26.78 sec |
| Latency (1000 runs) | ONNX (fp16) | 18.13 sec |
| Latency (1000 runs) | TensorRT (fp16) | 13.77 sec |
| Generation speed (10 runs, fp16) | Qwen3-vl (Torch) | 19.378385 tokens/sec (1103 tokens generated) |
| Generation speed (10 runs, fp16) | ONNX | 38.667467 tokens/sec (1062 tokens generated) |
| Generation speed (10 runs, fp16) | TensorRT | 66.579019 tokens/sec (842 tokens generated) |
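From the benchmark numbers above, the relative speedups work out as follows (simple arithmetic on the reported values, nothing measured here):

```python
# Speedups derived from the benchmark table above.
torch_latency, trt_latency = 44.46, 13.77  # sec per 1000 runs
torch_tps, onnx_tps, trt_tps = 19.378385, 38.667467, 66.579019  # tokens/sec

print(f"Latency speedup (TensorRT fp16 vs Torch fp32): {torch_latency / trt_latency:.2f}x")
print(f"Generation speedup (ONNX vs Torch):            {onnx_tps / torch_tps:.2f}x")
print(f"Generation speedup (TensorRT vs Torch):        {trt_tps / torch_tps:.2f}x")
```

That is, TensorRT fp16 cuts latency to roughly a third of Torch fp32, and roughly triples generation throughput.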
The model is licensed under the same license as the original qwen3-vl-2b (see Qwen Official Repository for details).
- Original qwen3-vl-2b model developed by Alibaba Cloud.
- ONNX conversion leverages PyTorch's `torch.onnx.export` API and ONNX Runtime for inference optimization.
