See the project wiki at https://github.com/notaDestroyer/vllm-benchmark-suite/wiki for benchmark results.
Comprehensive benchmarking tool for evaluating vLLM inference performance with automatic backend detection, advanced performance metrics, real-time monitoring, and interactive configuration.
- System Information Collection: Python version, platform, GPU model, VRAM, CUDA version, driver version
- vLLM Server Discovery: Automatic detection of vLLM version, attention backend (FlashInfer/FlashAttention), quantization format, tensor/pipeline parallelism, max batch size, prefix caching status, KV cache usage, and max context length
- Backend Inference: Automatically identifies FP8/AWQ/GPTQ/INT8/INT4/FP16 quantization from model names
- Rich Terminal UI: Beautiful panels, tables, progress bars, and live dashboards powered by the Rich library
- Interactive Configuration: CLI prompts for max context length (up to 1M tokens), concurrent users (up to 100), and output length selection
- Live Test Dashboard: Real-time display of current test progress, GPU metrics, and remaining test queue
- Comprehensive Summaries: Post-benchmark analysis with performance highlights and energy efficiency metrics
- Latency Percentiles: P50, P90, P99 for detailed latency distribution analysis
- Inter-Token Latency (ITL): Average time between generated tokens
- Prefill/Decode Separation: Estimated time breakdown for prefill vs decode phases
- Energy Efficiency: Tokens per watt, watts per token per user, normalized efficiency metrics (per 1K context)
- Energy Consumption: Total watt-hours consumed during benchmark execution
- High-Frequency GPU Polling: 0.1s intervals (vs 1s in v1) for more granular data
- Energy Metrics: Real-time power consumption tracking and efficiency calculations
- System Context: Full hardware and software configuration captured in results
- Model Warmup: Pre-benchmark inference to initialize GPU kernels and caches
- Output Management: Organized results in `./outputs` directory with timestamped files
- Enhanced Metadata: Complete system info, server config, and test parameters in JSON output
- Optional Detailed Reports: Post-benchmark detailed summary tables on demand
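The warmup step listed above amounts to sending one small throwaway completion before timing begins, so that GPU kernels and caches are initialized. A minimal sketch against vLLM's OpenAI-compatible completions endpoint (`build_warmup_payload` and `warmup_model` are illustrative helper names, not the script's actual API):

```python
import json
import urllib.request


def build_warmup_payload(model: str, max_tokens: int = 16) -> dict:
    """Tiny deterministic request whose only job is to warm the server."""
    return {
        "model": model,
        "prompt": "Warmup request.",
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def warmup_model(base_url: str, model: str) -> bool:
    """POST one small completion; True if the server answered 200."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_warmup_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status == 200
```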
The benchmark suite consists of:
- SystemInfo: Collects Python, platform, GPU, CUDA, and driver information
- VLLMServerInfo: Queries vLLM endpoints for configuration and capabilities
- GPUMonitor: High-frequency polling of nvidia-smi for real-time GPU metrics
- Request Generator: Concurrent HTTP request handling with thread pools
- Metrics Collector: Statistical analysis including percentiles, ITL, and energy metrics
- Visualization Engine: matplotlib/seaborn-based chart generation with 15+ performance graphs
- Interactive CLI: Rich-powered terminal interface for configuration and live monitoring
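The GPUMonitor component's polling loop can be sketched roughly as follows. The query fields and CSV layout follow nvidia-smi's `--query-gpu` interface; `GPU_FIELDS`, `parse_gpu_sample`, and `poll_gpu` are illustrative names, not the script's actual API:

```python
import subprocess

# Fields matching the metrics the suite records: utilization, VRAM,
# temperature, power draw, and core/memory clocks.
GPU_FIELDS = [
    "utilization.gpu", "memory.used", "temperature.gpu",
    "power.draw", "clocks.gr", "clocks.mem",
]


def parse_gpu_sample(csv_line: str) -> dict:
    """Turn one 'csv,noheader,nounits' output line into a metrics dict."""
    values = [float(v) for v in csv_line.split(",")]
    return dict(zip(
        ["util_pct", "vram_mb", "temp_c", "power_w", "core_mhz", "mem_mhz"],
        values,
    ))


def poll_gpu() -> dict:
    """Run nvidia-smi once and parse the first GPU's metrics."""
    out = subprocess.check_output(
        ["nvidia-smi",
         f"--query-gpu={','.join(GPU_FIELDS)}",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_sample(out.strip().splitlines()[0])
```

In the real suite this poll runs on a background thread every 0.1s while requests are in flight.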
- NVIDIA GPU with CUDA support (tested on RTX Pro 6000 Blackwell, RTX 5090)
- Minimum 8GB VRAM (16GB+ recommended for large models)
- Linux operating system (Ubuntu 22.04+ recommended)
- Python 3.10 or higher
- NVIDIA drivers with nvidia-smi available
- vLLM server running and accessible
```bash
git clone https://github.com/notaDestroyer/vllm-benchmark-suite.git
cd vllm-benchmark-suite
```

Using uv (recommended):

```bash
uv venv venv --python 3.12
source venv/bin/activate.fish  # for fish shell
# or
source venv/bin/activate       # for bash/zsh
```

Using standard Python:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Requirements:

```
requests>=2.31.0
matplotlib>=3.8.0
seaborn>=0.13.0
pandas>=2.1.0
numpy>=1.26.0
rich>=13.7.0
```
Before running benchmarks, start your vLLM server:

```bash
vllm serve MODEL_NAME --port 8000 --max-model-len 262144 --gpu-memory-utilization 0.95
```

Example configurations:

Qwen3-30B with FlashInfer:

```bash
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
  --port 8000 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --enable-prefix-caching
```

GLM-4.5-Air with AWQ:

```bash
vllm serve THUDM/GLM-4.5-Air-AWQ-4bit \
  --port 8000 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --quantization awq
```

Basic usage (interactive mode):

```bash
python vllm_benchmark_suitev2.py
```

The interactive CLI will guide you through:
- System and server information display
- Max context length selection (32K to 1M tokens)
- Max concurrent users selection (1 to 100)
- Output length selection (short/standard/long/custom)
- Configuration confirmation
- Automatic model warmup
- Live benchmark execution with real-time monitoring
- Comprehensive results and visualizations
The benchmark generates files in the ./outputs directory:
- `benchmark_MODEL_TIMESTAMP.json`: Complete performance data with metadata
- System information (Python, GPU, CUDA, driver versions)
- Server configuration (vLLM version, backend, quantization, parallelism)
- Test parameters (context lengths, users, output tokens)
- Detailed metrics for each test (latency, throughput, GPU, energy)
- `benchmark_MODEL_TIMESTAMP.png`: Comprehensive visualization (300 DPI, 15+ charts)
- Throughput vs context length
- Latency distribution with percentiles
- Throughput heatmap
- Efficiency metrics (tokens/s per user)
- Request throughput
- TTFT estimates
- Context scaling impact
- Success rates
- GPU utilization and VRAM usage
- Power consumption and temperature
- Clock frequencies
- Energy efficiency metrics
- Average Latency: Mean request duration
- Standard Deviation: Latency variance across requests
- Min/Max Latency: Best and worst case performance
- P50/P90/P99: Latency percentiles for distribution analysis
- TTFT (Time to First Token): Estimated prefill latency
- ITL (Inter-Token Latency): Average time between tokens
- Tokens/Second: Overall generation throughput
- Requests/Second: Request processing rate
- Tokens/Second/User: Per-user efficiency metric
- Utilization: GPU compute usage percentage
- VRAM Usage: Memory consumption in GB
- Temperature: GPU thermal state in Celsius
- Power Draw: Instantaneous power consumption in watts
- Clock Frequencies: GPU core and memory clocks in MHz
- Tokens/Watt: Throughput per watt of power consumed
- Watts/Token/User: Energy efficiency per concurrent user
- Normalized Efficiency: Watts per token per user per 1K context
- Total Energy: Watt-hours consumed during test execution
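The latency percentiles and ITL above can be computed from per-request timings along these lines (nearest-rank percentiles and a TTFT-based ITL estimate; function names are illustrative, not the script's exact API):

```python
import math


def percentile(latencies: list, q: float) -> float:
    """Nearest-rank percentile, e.g. q=90 for P90."""
    ranked = sorted(latencies)
    idx = max(0, math.ceil(q / 100 * len(ranked)) - 1)
    return ranked[idx]


def inter_token_latency(total_latency_s: float, ttft_s: float,
                        output_tokens: int) -> float:
    """Average gap between generated tokens after the first one."""
    return (total_latency_s - ttft_s) / max(1, output_tokens - 1)
```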
```
┌─ vLLM Performance Benchmark Suite ─┐
│ Enhanced Edition v2.0 - Interactive │
└─────────────────────────────────────┘
Initializing system detection...
Querying vLLM server...
╭─ System Information ──────────────╮
│ Python       3.12.1               │
│ Platform     Linux-6.8.0-49-generic│
│ GPU          NVIDIA RTX Pro 6000  │
│              (96GB)               │
│ Driver       570.00               │
│ CUDA         12.8                 │
╰───────────────────────────────────╯
╭─ vLLM Configuration ──────────────────╮
│ Model              Qwen3-30B-FP8      │
│ vLLM Version       0.6.8              │
│ Attention Backend  FlashInfer         │
│ Quantization       FP8                │
│ Max Context        262,144 tokens     │
│ GPU Mem Util       95.0%              │
│ Prefix Caching     Enabled            │
╰───────────────────────────────────────╯
••• Benchmark Configuration •••
Select maximum context length:
  [1] 32K
  [2] 64K
  [3] 128K
  [4] 256K
  [5] 512K
  [6] 1024K (1M)
Select max context: 3
Total tests: 28 (7 contexts × 4 user levels)
Estimated time: 25-35 minutes
╭─ Current Test (12/28) ─────────────╮
│ Context   96K tokens               │
│ Users     5                        │
│ Elapsed   18.3s                    │
│ GPU       94%                      │
│ Status    RUNNING                  │
╰────────────────────────────────────╯
╭─ Queue (16 remaining) ─────────────╮
│ Remaining tests:                   │
│                                    │
│ 1. 96K × 10 users                  │
│ 2. 128K × 1 users                  │
│ 3. 128K × 2 users                  │
│ ...                                │
╰────────────────────────────────────╯
════════════════════════════════════════
Benchmark Complete
════════════════════════════════════════
╔════════════════════════════════════════╗
║        Performance Highlights          ║
╠════════════════════════════════════════╣
║ Peak Throughput │ 87.3 tok/s           ║
║                 │ 10 users @ 64K       ║
╠════════════════════════════════════════╣
║ Best Efficiency │ 43.6 tok/s/user      ║
║                 │ 2 users @ 32K        ║
╠════════════════════════════════════════╣
║ Lowest Latency  │ 11.47s               ║
║                 │ 1 users @ 10K        ║
╠════════════════════════════════════════╣
║ Peak GPU Util   │ 97.8%                ║
║                 │ 10 users @ 128K      ║
╚════════════════════════════════════════╝
╔════════════════════════════════════════╗
║      Energy Efficiency Analysis        ║
╠════════════════════════════════════════╣
║ Best Energy     │ 0.23 tok/W           ║
║ Efficiency      │ 5 users @ 64K        ║
╠════════════════════════════════════════╣
║ Total Energy    │ 15.34 Wh             ║
║ Used            │ All tests combined   ║
╚════════════════════════════════════════╝
Outputs:
  * Results: ./outputs/benchmark_qwen3_20251017_143052.json
  * Charts:  ./outputs/benchmark_qwen3_20251017_143052.png
Total Benchmark Time: 28.3 minutes (1,698 seconds)
Test execution: 1,620s | Overhead (pauses, etc): 78s
```
Modify constants at the top of the script:

```python
API_BASE_URL = "http://localhost:8000"
REQUEST_TIMEOUT = 900        # seconds
GPU_POLL_INTERVAL = 0.1      # seconds (high frequency)
TEST_PAUSE_DURATION = 5      # seconds between tests
OUTPUT_DIR = "./outputs"     # output directory
```

The CLI provides pre-configured options and custom settings:
Context Lengths:
- 32K: 1K, 10K, 32K
- 64K: 1K, 10K, 32K, 64K
- 128K: 1K, 10K, 32K, 64K, 96K, 128K
- 256K: All above + 160K, 192K, 224K, 256K
- 512K: All above + 384K, 512K
- 1024K: All above + 768K, 1024K
Concurrent Users:
- Standard: 1, 2, 5, 10, 20, 50
- Custom: Any value
Output Tokens:
- Short summaries: 150 tokens
- Standard responses: 500 tokens
- Long reports: 1500 tokens
- Custom: Any value
Determine optimal configuration for expected workload:
- Context length requirements and scaling behavior
- Concurrent user capacity and batching efficiency
- Hardware utilization targets and bottlenecks
- Energy consumption and cost analysis
Benchmark different models or configurations:
- Quantization formats (FP8 vs AWQ vs GPTQ vs INT4)
- Model sizes (7B vs 30B vs 72B parameters)
- Attention backends (FlashInfer vs FlashAttention)
- MoE architectures vs dense models
Evaluate hardware and configuration changes:
- GPU memory allocation strategies
- Batch size and KV cache tuning
- Prefix caching impact
- Chunked prefill effectiveness
- Tensor parallelism scaling
Optimize for power consumption:
- Tokens per watt across configurations
- Power-limited vs compute-limited scenarios
- Efficiency vs throughput trade-offs
- Cost per token analysis
Track performance across vLLM versions:
- Version upgrade validation
- Performance regression detection
- Optimization verification
- Backend comparison
v2 automatically queries multiple vLLM endpoints:
Model Information (/v1/models):
- Model name and ID
- Creation timestamp
Version Information (/version):
- vLLM version string
Metrics Endpoint (/metrics, Prometheus format):
- KV cache usage percentage
- Number of running requests
- Various internal metrics
Configuration Inference:
- Quantization format from model name
- Backend detection from server response headers
- Tensor/pipeline parallelism from configuration
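The quantization-from-model-name heuristic mentioned above can be sketched as a simple substring scan. The marker table and precedence here are illustrative, not the script's exact logic:

```python
# Ordered markers: earlier entries win when a name matches several.
QUANT_MARKERS = [
    ("fp8", "FP8"), ("awq", "AWQ"), ("gptq", "GPTQ"),
    ("int8", "INT8"), ("int4", "INT4"), ("4bit", "INT4"),
]


def infer_quantization(model_name: str, default: str = "FP16") -> str:
    """Guess the quantization format from a model identifier."""
    lowered = model_name.lower()
    for marker, fmt in QUANT_MARKERS:
        if marker in lowered:
            return fmt
    return default
```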
Tokens per Watt: Instantaneous throughput efficiency

```
tokens_per_watt = tokens_per_second / avg_power_draw
```

Watts per Token per User: Normalized energy cost

```
watts_per_token_per_user = avg_power_draw / (tokens_per_second / concurrent_users)
```

Normalized Efficiency: Context-length adjusted metric

```
normalized_efficiency = watts_per_token_per_user / (context_length / 1000)
```

Total Energy Consumption: Watt-hours for test

```
energy_wh = (avg_power_draw * test_duration) / 3600
```
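In code, the four formulas above reduce to a few one-line functions (names and parameter spellings are illustrative):

```python
def tokens_per_watt(tokens_per_second: float, avg_power_draw: float) -> float:
    """Throughput per watt of average power draw."""
    return tokens_per_second / avg_power_draw


def watts_per_token_per_user(avg_power_draw: float, tokens_per_second: float,
                             concurrent_users: int) -> float:
    """Energy cost normalized by per-user throughput."""
    return avg_power_draw / (tokens_per_second / concurrent_users)


def normalized_efficiency(wptpu: float, context_length: int) -> float:
    """Watts per token per user, scaled per 1K tokens of context."""
    return wptpu / (context_length / 1000)


def energy_wh(avg_power_draw: float, test_duration_s: float) -> float:
    """Total watt-hours consumed over a test of the given duration."""
    return (avg_power_draw * test_duration_s) / 3600
```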
High-frequency polling (100ms) captures:
- GPU utilization percentage
- VRAM usage (used/total in MB)
- GPU temperature (Celsius)
- Power draw (watts)
- GPU clock frequency (MHz)
- Memory clock frequency (MHz)
Statistics computed:
- Mean, max, min for all metrics
- Per-test aggregation
- Full timeline data saved in results
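The per-test aggregation described above can be sketched as a fold over the polled samples (assuming each sample is a flat dict of numeric metrics; `summarize_samples` is an illustrative name):

```python
def summarize_samples(samples: list) -> dict:
    """Collapse per-poll metric dicts into mean/max/min per metric."""
    if not samples:
        return {}
    stats = {}
    for key in samples[0]:
        values = [s[key] for s in samples]
        stats[key] = {
            "mean": sum(values) / len(values),
            "max": max(values),
            "min": min(values),
        }
    return stats
```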
```
[ERROR] Failed to query model name: Connection refused
```

Solution: Ensure the vLLM server is running:

```bash
curl http://localhost:8000/v1/models
# or
curl http://localhost:8000/health
```

```
[WARNING] GPU monitoring error: nvidia-smi not found
```

Solution: Install NVIDIA drivers or add nvidia-smi to PATH:

```bash
nvidia-smi --version
which nvidia-smi
```

```
[ERROR] All requests failed!
Error: HTTP 500 (CUDA out of memory)
```

Solutions:
- Reduce `--gpu-memory-utilization` (try 0.85 or 0.80)
- Reduce `--max-model-len`
- Lower concurrent users in the benchmark
- Enable `--enable-chunked-prefill` for large contexts

```
[ERROR] Error: ('Connection aborted.', timeout())
```

Solutions:
- Increase `REQUEST_TIMEOUT` in the script (default: 900s)
- Reduce output tokens for faster completion
- Check whether the server is under heavy load
If terminal output is garbled:

```bash
# Set TERM environment variable
export TERM=xterm-256color

# Or disable live display by modifying script
# Comment out Live display sections if needed
```

Cap power draw for efficiency testing:

```bash
sudo nvidia-smi -pl 450  # Set 450W power limit
sudo nvidia-smi -pl 300  # Set 300W power limit
```

Reset to default:

```bash
sudo nvidia-smi -pl <default_power>  # Check nvidia-smi -q for default
```

Example vLLM server configuration tuned for throughput:

```bash
vllm serve MODEL \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 256 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --dtype auto
```

A more conservative configuration:

```bash
vllm serve MODEL \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 128 \
  --enable-prefix-caching
```

Disable CPU frequency scaling:

```bash
sudo cpupower frequency-set -g performance
```

Set GPU persistence mode:

```bash
sudo nvidia-smi -pm 1
```

v1 Features:
- Basic benchmarking
- GPU monitoring (1s intervals)
- 12 visualization charts
- JSON output
- Console summary
v2 Enhancements:
- Automatic system and server detection
- Interactive CLI with Rich UI
- Live test dashboard
- High-frequency GPU monitoring (0.1s intervals)
- Advanced metrics (P50/P90/P99, ITL, prefill/decode)
- Energy efficiency analysis
- Model warmup phase
- Organized output directory
- 15+ visualization charts
- Enhanced metadata and summaries
- Optional detailed reports
Migration from v1: v2 is backward compatible. Existing scripts work with v2, but interactive mode provides better UX.
Contributions are welcome:
- Fork the repository
- Create a feature branch (`git checkout -b feature/improvement`)
- Commit changes (`git commit -am 'Add energy efficiency metrics'`)
- Push to the branch (`git push origin feature/improvement`)
- Open a Pull Request
Areas for Contribution:
- Additional visualization types
- Multi-GPU benchmarking
- Streaming latency metrics
- Cost analysis features
- Cloud provider integration
- Automated performance regression detection
MIT License - see LICENSE file for details
If you use this benchmark suite in your research or testing:
```bibtex
@software{vllm_benchmark_suite_v2,
  title  = {vLLM Performance Benchmark Suite v2.0},
  author = {amit},
  year   = {2025},
  url    = {https://github.com/notaDestroyer/vllm-benchmark-suite}
}
```

- vLLM Team for the inference engine
- FlashInfer for attention kernels
- Community contributors and testers
- Complete UI overhaul with Rich library
- Automatic system and server detection
- Interactive configuration mode
- Energy efficiency metrics
- Advanced latency analysis (P50/P90/P99, ITL)
- High-frequency GPU monitoring (0.1s)
- Model warmup phase
- Enhanced visualizations
- Initial release
- Basic benchmarking functionality
- GPU monitoring
- Visualization suite
- JSON output
For vLLM-specific questions:
- vLLM Documentation: https://docs.vllm.ai/
- vLLM GitHub: https://github.com/vllm-project/vllm