| sidebar-title | Profile with SPEED-Bench Dataset |
|---|---|
AIPerf supports benchmarking using SPEED-Bench (SPEculative Evaluation Dataset), a benchmark designed for evaluating speculative decoding across diverse semantic domains and input sequence lengths.
This guide covers profiling speculative-decoding-enabled inference servers using SPEED-Bench prompts and collecting server-side acceptance rate metrics per category.
The following aggregate datasets each load all of their categories combined into a single dataset:
| Dataset Name | Samples | Description |
|---|---|---|
| speed_bench_qualitative | 880 | All 11 semantic domains combined |
| speed_bench_throughput_1k | 1,536 | ~1K input tokens, all 3 entropy tiers |
| speed_bench_throughput_2k | 1,536 | ~2K input tokens, all 3 entropy tiers |
| speed_bench_throughput_8k | 1,536 | ~8K input tokens, all 3 entropy tiers |
| speed_bench_throughput_16k | 1,536 | ~16K input tokens, all 3 entropy tiers |
| speed_bench_throughput_32k | 1,536 | ~32K input tokens, all 3 entropy tiers |

For per-category acceptance rate measurement, each of the 11 qualitative domains is registered separately:
| Dataset Name | Category |
|---|---|
| speed_bench_coding | Code generation and programming |
| speed_bench_humanities | History, philosophy, liberal arts |
| speed_bench_math | Mathematical reasoning |
| speed_bench_multilingual | Tasks across 23 languages |
| speed_bench_qa | Question answering |
| speed_bench_rag | Retrieval-augmented generation |
| speed_bench_reasoning | Logical and analytical reasoning |
| speed_bench_roleplay | Creative roleplay and dialogue |
| speed_bench_stem | Science, technology, engineering |
| speed_bench_summarization | Text summarization |
| speed_bench_writing | Creative and technical writing |

Each throughput ISL bucket is also available filtered by entropy tier:
| Pattern | Example Tasks | Description |
|---|---|---|
| speed_bench_throughput_{ISL}_low_entropy | Code, sorting | Predictable output patterns |
| speed_bench_throughput_{ISL}_mixed | Needle-in-a-haystack, exams | Moderate unpredictability |
| speed_bench_throughput_{ISL}_high_entropy | Creative writing, dialogue | Highly unpredictable output |

Where {ISL} is one of: 1k, 2k, 8k, 16k, 32k.
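
The naming pattern above can be expanded programmatically, which is handy when scripting a sweep. The following is a minimal sketch; the helper name `throughput_dataset_names` is hypothetical and not part of AIPerf, and it only assembles strings from the ISL buckets and tier suffixes listed in the tables above.

```python
from itertools import product

# ISL buckets and entropy-tier suffixes, copied from the tables above.
ISL_BUCKETS = ["1k", "2k", "8k", "16k", "32k"]
ENTROPY_TIERS = ["low_entropy", "mixed", "high_entropy"]


def throughput_dataset_names(include_combined: bool = True) -> list[str]:
    """Build SPEED-Bench throughput dataset names from the naming pattern.

    Hypothetical helper for illustration; it does not query AIPerf.
    """
    names = []
    if include_combined:
        # Combined datasets: all three tiers at one ISL.
        names += [f"speed_bench_throughput_{isl}" for isl in ISL_BUCKETS]
    # Tier-filtered datasets: one per (ISL, tier) pair.
    names += [
        f"speed_bench_throughput_{isl}_{tier}"
        for isl, tier in product(ISL_BUCKETS, ENTROPY_TIERS)
    ]
    return names


for name in throughput_dataset_names():
    print(name)
```

Each printed name can be passed to `--public-dataset`.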
Launch an inference server with speculative decoding enabled. For example, with vLLM:
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5

Verify the server is ready:
curl -s localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"test"}],"max_tokens":1}'

AIPerf auto-discovers the Prometheus endpoint at {url}/metrics. If your server uses a different path, pass it explicitly with --server-metrics:
| Server Type | Metrics Path | Flag Needed |
|---|---|---|
| Standalone vLLM / SGLang | /metrics (default) | None (auto-discovered) |
| NIM-LLM containers | /v1/metrics | --server-metrics http://localhost:8000/v1/metrics |

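
Before a long run, it can help to confirm that the metrics endpoint actually exposes speculative decoding series. The sketch below parses Prometheus text exposition output and keeps only the matching metrics. It is an illustration, not AIPerf's own parser: the function name is hypothetical, and the metric names in `SAMPLE` are assumptions, so check your server's real `/metrics` output for the exact names.

```python
def parse_prometheus_text(text: str, name_filter: str = "spec_decode") -> dict[str, float]:
    """Sum sample values per metric name, keeping names that contain name_filter."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # "<name>{labels} <value> [timestamp]"
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1 :].split()
        else:
            # "<name> <value> [timestamp]"
            name, *fields = line.split()
        if name_filter not in name or not fields:
            continue
        totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


# Assumed sample payload; real metric names depend on the server and version.
SAMPLE = """\
# TYPE vllm:spec_decode_num_drafts_total counter
vllm:spec_decode_num_drafts_total{model_name="m"} 100.0
vllm:spec_decode_num_accepted_tokens_total{model_name="m"} 80.0
vllm:num_requests_running{model_name="m"} 3.0
"""

print(parse_prometheus_text(SAMPLE))
```

To use it against a live server, fetch the body of the metrics URL from the table above with any HTTP client and pass the response text to the function. An empty result suggests that speculative decoding is not enabled or that the metrics path is wrong.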
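For standard (non-reasoning) models, use temperature=0 and cap the output length at 4K tokens (--osl 4096). The --model value must match the model name the server exposes; the commands below use meta/llama-3.1-8b-instruct, so substitute meta-llama/Llama-3.1-8B-Instruct if you launched the vLLM example above as written: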
aiperf profile \
--model meta/llama-3.1-8b-instruct \
--endpoint-type chat \
--streaming \
--url localhost:8000 \
--public-dataset speed_bench_coding \
--osl 4096 \
--extra-inputs temperature:0 \
--concurrency 16

Do not set ignore_eos; let the model stop naturally at its end-of-sequence token.
For reasoning models (e.g., DeepSeek-R1, QwQ), follow the model card's recommended settings for temperature, top_p, and output length. Reasoning models typically require higher output limits and specific sampling parameters.
To measure acceptance rates per category (matching the SPEED-Bench paper methodology), run each category separately. Each run collects speculative decoding metrics from the server's Prometheus endpoint.
aiperf profile \
--model meta/llama-3.1-8b-instruct \
--endpoint-type chat \
--streaming \
--url localhost:8000 \
--public-dataset speed_bench_coding \
--server-metrics http://localhost:8000/metrics \
--osl 4096 \
--extra-inputs temperature:0 \
--concurrency 16 \
--output-artifact-dir ./artifacts/speed_bench_coding

Loop through all categories, then assemble results into a per-category matrix:
CATEGORIES="coding humanities math multilingual qa rag reasoning roleplay stem summarization writing"
MODEL="meta/llama-3.1-8b-instruct"
for cat in $CATEGORIES; do
  echo "=== Running category: $cat ==="
  aiperf profile \
    --model "$MODEL" \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --public-dataset "speed_bench_${cat}" \
    --server-metrics http://localhost:8000/metrics \
    --osl 4096 \
    --extra-inputs temperature:0 \
    --concurrency 16 \
    --output-artifact-dir "./artifacts/speed_bench_${cat}"
done

# Assemble the matrix report
aiperf speed-bench-report ./artifacts/ --format both

This produces a CSV (speed_bench_report.csv) and console table:
SPEED-Bench Acceptance Length Report
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Model ┃ coding ┃ humanities ┃ math ┃ writing ┃ Overall ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ meta/llama-3.1-8b-instruct │ 1.80 │ 1.84 │ 1.78 │ 1.76 │ 1.78 │
└────────────────────────────┴────────┴────────────┴──────┴─────────┴─────────┘
The report script computes acceptance length from vLLM counter metrics as 1 + (accepted_tokens / num_drafts): each draft step yields its accepted draft tokens plus one token produced by the target model. It also supports SGLang's direct spec_accept_length gauge.
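
As a worked illustration of that formula, the sketch below derives acceptance length from two counter scrapes. It is not the report script's code: the function names are hypothetical, and it assumes the per-run values are the differences between a scrape taken at the end of the run and one taken at the start, since Prometheus counters are cumulative.

```python
def counter_delta(start: float, end: float) -> float:
    """Increase of a cumulative counter over one run (assumes no server restart)."""
    if end < start:
        raise ValueError("counter decreased; the server may have restarted")
    return end - start


def acceptance_length(accepted_tokens: float, num_drafts: float) -> float:
    """Mean tokens emitted per draft step: 1 + accepted_tokens / num_drafts."""
    if num_drafts <= 0:
        raise ValueError("num_drafts must be positive")
    return 1.0 + accepted_tokens / num_drafts


# Made-up counter values for illustration only.
accepted = counter_delta(start=1000, end=1800)  # 800 accepted draft tokens
drafts = counter_delta(start=1200, end=2200)    # 1000 draft steps
print(f"acceptance length: {acceptance_length(accepted, drafts):.2f}")
```

With these made-up numbers, 800 accepted tokens over 1,000 draft steps give an acceptance length of 1.80. A value of 1.0 would mean no draft tokens were ever accepted.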
Additional report metrics:
# Acceptance rate matrix (accepted / draft tokens)
aiperf speed-bench-report ./artifacts/ --metric accept_rate
# Throughput matrix (output tokens/sec per category)
aiperf speed-bench-report ./artifacts/ --metric throughput

To run all 880 prompts in a single benchmark (without per-category breakdown):
aiperf profile \
--model meta/llama-3.1-8b-instruct \
--endpoint-type chat \
--streaming \
--url localhost:8000 \
--public-dataset speed_bench_qualitative \
--server-metrics http://localhost:8000/metrics \
--concurrency 16

The throughput splits benchmark end-to-end performance at fixed input sequence lengths:
aiperf profile \
--model meta/llama-3.1-8b-instruct \
--endpoint-type chat \
--streaming \
--url localhost:8000 \
--public-dataset speed_bench_throughput_1k \
--server-metrics http://localhost:8000/metrics \
--concurrency 64 \
--benchmark-duration 120

Replace speed_bench_throughput_1k with any throughput variant (_2k, _8k, _16k, _32k) to test at different input lengths.
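
To sweep every input length without editing the command by hand, the arguments can be generated in a script. This is a minimal sketch using only the flags shown in the command above; the helper `throughput_command` is hypothetical, and the defaults for model, URL, concurrency, and duration simply mirror this guide's example.

```python
import shlex

ISL_BUCKETS = ["1k", "2k", "8k", "16k", "32k"]


def throughput_command(
    isl: str,
    model: str = "meta/llama-3.1-8b-instruct",
    url: str = "localhost:8000",
    concurrency: int = 64,
    duration_s: int = 120,
) -> list[str]:
    """Build the argv for one throughput run (hypothetical helper)."""
    if isl not in ISL_BUCKETS:
        raise ValueError(f"unknown ISL bucket: {isl!r}")
    return [
        "aiperf", "profile",
        "--model", model,
        "--endpoint-type", "chat",
        "--streaming",
        "--url", url,
        "--public-dataset", f"speed_bench_throughput_{isl}",
        "--server-metrics", f"http://{url}/metrics",
        "--concurrency", str(concurrency),
        "--benchmark-duration", str(duration_s),
    ]


# Print each command; pass the list to subprocess.run(cmd, check=True) to execute it.
for isl in ISL_BUCKETS:
    print(shlex.join(throughput_command(isl)))
```

Building the command as a list avoids shell quoting problems if the model name or artifact paths ever contain spaces.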
To isolate entropy effects on acceptance rate at a given ISL:
for tier in low_entropy mixed high_entropy; do
  echo "=== Running throughput_1k tier: $tier ==="
  aiperf profile \
    --model meta/llama-3.1-8b-instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --public-dataset "speed_bench_throughput_1k_${tier}" \
    --server-metrics http://localhost:8000/metrics \
    --concurrency 64 \
    --benchmark-duration 60
done

Server metrics collection is enabled by default. To disable it:
aiperf profile \
--model meta/llama-3.1-8b-instruct \
--endpoint-type chat \
--streaming \
--url localhost:8000 \
--public-dataset speed_bench_qualitative \
--no-server-metrics \
--concurrency 16

AIPerf automatically downloads and caches the dataset on first use. To pre-download for container builds or air-gapped environments:
huggingface-cli download nvidia/SPEED-Bench --repo-type dataset

Or selectively download specific splits:
from datasets import load_dataset

# Download each subset into the local Hugging Face cache.
for subset in ["qualitative", "throughput_1k", "throughput_2k",
               "throughput_8k", "throughput_16k", "throughput_32k"]:
    load_dataset("nvidia/SPEED-Bench", name=subset, split="test",
                 trust_remote_code=False)

Set HF_HOME to control the cache location (e.g., ENV HF_HOME=/opt/hf_cache in a Dockerfile).