Skip to content

Latest commit

 

History

History
139 lines (110 loc) · 6.98 KB

File metadata and controls

139 lines (110 loc) · 6.98 KB
sidebar-title Profile Vision Language Models with AIPerf

Profile Vision Language Models with AIPerf

AIPerf supports benchmarking Vision Language Models (VLMs) that process both text and images.

This guide covers profiling vision models using OpenAI-compatible chat completions endpoints with vLLM.


Start a vLLM Server

Launch a vLLM server with a vision language model:

docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 -e HF_TOKEN vllm/vllm-openai:latest \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --enforce-eager

Verify the server is ready:

timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen2-VL-2B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }

Profile with Synthetic Images

AIPerf can generate synthetic images for benchmarking. By default, images are generated as random noise at the requested dimensions — no on-disk assets required, and the pool is effectively unbounded so servers cannot dedupe on identical inputs. Pass --image-source assets to instead sample and resize the 4 bundled natural images (smaller payload bytes), or --image-source <path> to sample from your own directory.

aiperf profile \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --endpoint-type chat \
    --image-width-mean 512 \
    --image-height-mean 512 \
    --synthetic-input-tokens-mean 100 \
    --streaming \
    --url localhost:8000 \
    --request-count 20 \
    --concurrency 4

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     Generating synthetic images (512x512 px)
INFO     AIPerf System is PROFILING

Profiling: 20/20 |████████████████████████| 100% [01:45<00:00]

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/Qwen_Qwen2-VL-2B-Instruct-chat-concurrency4/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃                      Metric ┃     avg ┃     min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 2345.67 │ 1890.34 │ 2987.12 │ 2923.45 │ 2312.89 │
│    Time to First Token (ms) │  456.78 │  378.90 │  598.45 │  578.23 │  445.67 │
│    Inter Token Latency (ms) │   18.90 │   14.56 │   25.34 │   24.12 │   18.45 │
│ Output Token Count (tokens) │  150.00 │  120.00 │  180.00 │  178.00 │  148.00 │
│  Request Throughput (req/s) │    4.89 │       - │       - │       - │       - │
└─────────────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/Qwen_Qwen2-VL-2B-Instruct-chat-concurrency4/profile_export_aiperf.json

Profile with Custom Input File

Create a JSONL file with text prompts and image URLs:

cat <<EOF > inputs.jsonl
{"texts": ["Describe this image in detail."], "images": ["https://picsum.photos/512/512?random=1"], "output_length": 200}
{"texts": ["What objects are visible in this image?"], "images": ["https://picsum.photos/512/512?random=2"]}
{"texts": ["Analyze the composition of this photo."], "images": ["https://picsum.photos/512/512?random=3"], "output_length": 300}
{"texts": ["What is the main subject of this image?"], "images": ["https://picsum.photos/512/512?random=4"]}
{"texts": ["Provide a caption for this image."], "images": ["https://picsum.photos/512/512?random=5"], "output_length": 50}
EOF

Run AIPerf using the custom input file:

aiperf profile \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --endpoint-type chat \
    --input-file inputs.jsonl \
    --custom-dataset-type single_turn \
    --streaming \
    --url localhost:8000 \
    --request-count 5

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     Loaded 5 entries from inputs.jsonl
INFO     Using single_turn dataset type with custom images
INFO     AIPerf System is PROFILING

Profiling: 5/5 |████████████████████████| 100% [00:25<00:00]

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/Qwen_Qwen2-VL-2B-Instruct-chat-concurrency1/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃                      Metric ┃     avg ┃     min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 2456.89 │ 2012.45 │ 2890.34 │ 2890.34 │ 2398.12 │
│    Time to First Token (ms) │  478.90 │  398.23 │  567.89 │  567.89 │  467.34 │
│    Inter Token Latency (ms) │   19.45 │   15.67 │   24.12 │   24.12 │   19.01 │
│ Output Token Count (tokens) │  156.00 │  128.00 │  185.00 │  185.00 │  154.00 │
│  Request Throughput (req/s) │    2.34 │       - │       - │       - │       - │
└─────────────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/Qwen_Qwen2-VL-2B-Instruct-chat-concurrency1/profile_export_aiperf.json