
docs: Add comprehensive vLLM speculative decoding documentation #8774

Open

dzier wants to merge 4 commits into main from cursor/add-vllm-speculative-decoding-docs-1dcd

Conversation

dzier (Contributor) commented May 11, 2026

What does the PR do?

This PR adds comprehensive documentation and examples for using speculative decoding with the vLLM backend in Triton Inference Server. The documentation was created in response to user questions about speculative decoding support with vLLM (GitHub issue #8699).

Key additions:

  1. vLLM Speculative Decoding Tutorial (docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md)

    • Detailed explanation of speculative decoding with vLLM
    • Configuration examples using model.json (see the sketch after this list)
    • Multiple example configurations (basic, tensor parallelism, n-gram lookup)
    • Model repository structure guidelines
    • Step-by-step running instructions
    • Draft model selection guidelines
    • Performance tuning recommendations
    • Troubleshooting section
  2. Speculative Decoding Overview (docs/tutorials/Feature_Guide/Speculative_Decoding/README.md)

    • High-level explanation of speculative decoding
    • Benefits and use cases
    • Backend comparison (vLLM vs TensorRT-LLM)
    • Quick start guide
    • Model selection guidelines with recommended pairs
    • Performance tuning strategies
    • Monitoring and metrics guide
  3. TRT-LLM Reference (docs/tutorials/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md)

    • Placeholder linking to TensorRT-LLM backend documentation
  4. Example Configuration (qa/L0_vllm_speculative_decoding/)

    • Complete example model repository
    • Configuration files demonstrating speculative decoding setup
    • Test instructions
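
To make the configuration concrete, here is a minimal model.json sketch of the kind the tutorial demonstrates. The field names follow vLLM's AsyncEngineArgs, but the speculative-decoding arguments have changed spelling across vLLM releases (older versions take top-level speculative_model and num_speculative_tokens, newer ones a speculative_config dict), and both model names are illustrative placeholders rather than the exact pair used in this PR. model.json does not allow comments, so all caveats live here in the prose:

```json
{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "gpu_memory_utilization": 0.9,
    "tensor_parallel_size": 1,
    "speculative_config": {
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "num_speculative_tokens": 5
    }
}
```

Raising tensor_parallel_size in the same file gives the tensor-parallelism variant the tutorial also covers.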

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated the GitHub labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging.
  • All template sections are filled out.

Commit Type:

  • docs

Related PRs:

None

Where should the reviewer start?

  1. Start with docs/tutorials/Feature_Guide/Speculative_Decoding/README.md for the overview
  2. Review docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md for the detailed vLLM guide
  3. Check example configuration in qa/L0_vllm_speculative_decoding/

Test plan:

  • Documentation follows existing format and structure
  • All links and references are valid
  • Example configurations use correct syntax
  • Copyright headers are present on all new files
  • Manual verification: the example configuration can be used to deploy a model with speculative decoding (requires a GPU environment with the vLLM backend; see the sketch after this list)
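
For reviewers reproducing the manual step, a hypothetical smoke test could look like the sketch below. The repository path and model directory name are placeholders, not paths from this PR; the request uses Triton's standard HTTP generate extension:

```bash
# Start Triton against the example model repository (path is a placeholder).
tritonserver --model-repository=/path/to/qa/L0_vllm_speculative_decoding/model_repository &

# After the server logs the model as READY, send a generate request.
# "vllm_model" stands in for the example model's actual directory name.
curl -s -X POST localhost:8000/v2/models/vllm_model/generate \
  -d '{"text_input": "What is speculative decoding?", "parameters": {"max_tokens": 64}}'
```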

The documentation provides clear, actionable guidance for users wanting to use speculative decoding with vLLM. The example configurations are based on vLLM's documented parameters and follow established Triton model repository patterns.

Caveats:

  • The TRT-LLM tutorial is a placeholder that links to the TensorRT-LLM backend repository, as detailed TRT-LLM speculative decoding documentation already exists there
  • Examples use publicly available Llama models; gated variants require Hugging Face authentication
  • Performance numbers cited are approximate and will vary based on hardware, model, and workload

Background

Issue #8699 asked whether speculative decoding is possible with the vLLM backend and requested examples. While the documentation structure referenced speculative decoding tutorials for vLLM (docs/llm_features/speculative_decoding.rst), the actual tutorial files did not exist.

vLLM has supported speculative decoding for some time through its AsyncEngineArgs parameters, which can be configured via the model.json file in Triton's vLLM backend. However, this capability was never covered in Triton's own documentation, leading to user confusion.
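
As an illustration of that mapping, an n-gram (prompt-lookup) configuration, which needs no separate draft model, might look like the following. The key names mirror recent vLLM speculative_config conventions (earlier releases used ngram_prompt_lookup_max with speculative_model set to "[ngram]"), so treat the exact spelling as version-dependent rather than authoritative:

```json
{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "speculative_config": {
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4
    }
}
```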

This PR fills that documentation gap by providing comprehensive tutorials with working examples.

Related Issues:

Linear Issue: TGH-101


…xamples

- Added detailed vLLM speculative decoding tutorial with configuration examples
- Created overview documentation explaining speculative decoding feature
- Added example model repository configurations for testing
- Included performance tuning guidelines and troubleshooting tips
- Documented supported backends and model selection guidelines

Closes #8699

Co-authored-by: David Zier <dzier@users.noreply.github.com>
dzier requested review from whoisj and yinggeh May 11, 2026 17:42
…ders

- Remove trailing whitespace from markdown files
- Update copyright headers to include (c) per style guide
- No handler needed for model.json files (consistent with existing files)

Co-authored-by: David Zier <dzier@users.noreply.github.com>
dzier marked this pull request as ready for review May 11, 2026 18:08
dzier requested a review from tanmayv25 May 11, 2026 18:08
cursoragent and others added 2 commits May 11, 2026 18:17
Improves navigation by linking to the speculative decoding overview
from the vLLM tutorial for better user experience.

Co-authored-by: David Zier <dzier@users.noreply.github.com>
Changed copyright headers from 2025-2026 to 2026 across all
documentation and configuration files.

Co-authored-by: David Zier <dzier@users.noreply.github.com>