docs: Add comprehensive vLLM speculative decoding documentation #8774
Open
dzier wants to merge 4 commits into
Conversation
…xamples

- Added detailed vLLM speculative decoding tutorial with configuration examples
- Created overview documentation explaining speculative decoding feature
- Added example model repository configurations for testing
- Included performance tuning guidelines and troubleshooting tips
- Documented supported backends and model selection guidelines

Closes #8699

Co-authored-by: David Zier <dzier@users.noreply.github.com>
…ders

- Remove trailing whitespace from markdown files
- Update copyright headers to include (c) per style guide
- No handler needed for model.json files (consistent with existing files)

Co-authored-by: David Zier <dzier@users.noreply.github.com>
krishung5 reviewed May 11, 2026
Improves navigation by linking to the speculative decoding overview from the vLLM tutorial for better user experience. Co-authored-by: David Zier <dzier@users.noreply.github.com>
Changed copyright headers from 2025-2026 to 2026 across all documentation and configuration files. Co-authored-by: David Zier <dzier@users.noreply.github.com>
What does the PR do?
This PR adds comprehensive documentation and examples for using speculative decoding with the vLLM backend in Triton Inference Server. The documentation was created in response to user questions about speculative decoding support with vLLM (GitHub issue #8699).
Key additions:

- vLLM Speculative Decoding Tutorial (docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md), including model.json configuration examples
- Speculative Decoding Overview (docs/tutorials/Feature_Guide/Speculative_Decoding/README.md)
- TRT-LLM Reference (docs/tutorials/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md)
- Example Configuration (qa/L0_vllm_speculative_decoding/)

Checklist
- PR title follows the <commit_type>: <Title> format

Commit Type:
Related PRs:
None
Where should the reviewer start?
- docs/tutorials/Feature_Guide/Speculative_Decoding/README.md for the overview
- docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md for the detailed vLLM guide
- qa/L0_vllm_speculative_decoding/ for the example configuration

Test plan:
The documentation provides clear, actionable guidance for users wanting to use speculative decoding with vLLM. The example configurations are based on vLLM's documented parameters and follow established Triton model repository patterns.
Caveats:
Background
Issue #8699 asked whether speculative decoding is possible with the vLLM backend and requested examples. While the documentation structure referenced speculative decoding tutorials for vLLM (docs/llm_features/speculative_decoding.rst), the actual tutorial files did not exist.

vLLM has supported speculative decoding for some time through its AsyncEngineArgs parameters, which can be configured via the model.json file in Triton's vLLM backend. However, this capability was not documented in the Triton documentation, leading to user confusion.

This PR fills that documentation gap by providing comprehensive tutorials with working examples.
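For context, since the vLLM backend forwards model.json keys to vLLM's AsyncEngineArgs, a draft-model speculative decoding setup might look roughly like the sketch below. The model names and parameter values here are illustrative assumptions, not taken from this PR, and the exact speculative-decoding argument names vary across vLLM versions (older releases exposed top-level args such as `speculative_model`; newer ones group them differently), so check the vLLM version shipped in your Triton container.

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "gpu_memory_utilization": 0.8,
  "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
  "num_speculative_tokens": 5
}
```

The draft model proposes `num_speculative_tokens` tokens per step, which the target model then verifies in a single forward pass; a small draft model from the same family is the usual choice.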
Related Issues:
Linear Issue: TGH-101