
docs: Add comprehensive vLLM speculative decoding documentation #8774

Open

dzier wants to merge 4 commits into main from cursor/add-vllm-speculative-decoding-docs-1dcd

Conversation

dzier (Contributor) commented May 11, 2026

What does the PR do?

This PR adds comprehensive documentation and examples for using speculative decoding with the vLLM backend in Triton Inference Server. The documentation was created in response to user questions about speculative decoding support with vLLM (GitHub issue #8699).

Key additions:

  1. vLLM Speculative Decoding Tutorial (docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md)

    • Detailed explanation of speculative decoding with vLLM
    • Configuration examples using model.json (see the sketch after this list)
    • Multiple example configurations (basic, tensor parallelism, n-gram lookup)
    • Model repository structure guidelines
    • Step-by-step running instructions
    • Draft model selection guidelines
    • Performance tuning recommendations
    • Troubleshooting section
  2. Speculative Decoding Overview (docs/tutorials/Feature_Guide/Speculative_Decoding/README.md)

    • High-level explanation of speculative decoding
    • Benefits and use cases
    • Backend comparison (vLLM vs TensorRT-LLM)
    • Quick start guide
    • Model selection guidelines with recommended pairs
    • Performance tuning strategies
    • Monitoring and metrics guide
  3. TRT-LLM Reference (docs/tutorials/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md)

    • Placeholder linking to TensorRT-LLM backend documentation
  4. Example Configuration (qa/L0_vllm_speculative_decoding/)

    • Complete example model repository
    • Configuration files demonstrating speculative decoding setup
    • Test instructions
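
To make the configuration concrete, here is a minimal model.json sketch of the kind the tutorial demonstrates. The field names follow vLLM's AsyncEngineArgs, but the speculative-decoding arguments have changed spelling across vLLM releases (older versions take top-level speculative_model and num_speculative_tokens, newer ones a speculative_config dict), and both model names are illustrative placeholders rather than the exact pair used in this PR. model.json does not allow comments, so all caveats live here in the prose:

```json
{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "gpu_memory_utilization": 0.9,
    "tensor_parallel_size": 1,
    "speculative_config": {
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "num_speculative_tokens": 5
    }
}
```

Raising tensor_parallel_size in the same file gives the tensor-parallelism variant the tutorial also covers.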

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated the GitHub labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging.
  • All template sections are filled out.

Commit Type:

  • docs

Related PRs:

None

Where should the reviewer start?

  1. Start with docs/tutorials/Feature_Guide/Speculative_Decoding/README.md for the overview
  2. Review docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md for the detailed vLLM guide
  3. Check example configuration in qa/L0_vllm_speculative_decoding/

Test plan:

  • Documentation follows existing format and structure
  • All links and references are valid
  • Example configurations use correct syntax
  • Copyright headers are present on all new files
  • Manual verification: the example configuration can be used to deploy a model with speculative decoding (requires a GPU environment with the vLLM backend; see the sketch after this list)
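
For reviewers reproducing the manual step, a hypothetical smoke test could look like the sketch below. The repository path and model directory name are placeholders, not paths from this PR; the request uses Triton's standard HTTP generate extension:

```bash
# Start Triton against the example model repository (path is a placeholder).
tritonserver --model-repository=/path/to/qa/L0_vllm_speculative_decoding/model_repository &

# After the server logs the model as READY, send a generate request.
# "vllm_model" stands in for the example model's actual directory name.
curl -s -X POST localhost:8000/v2/models/vllm_model/generate \
  -d '{"text_input": "What is speculative decoding?", "parameters": {"max_tokens": 64}}'
```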

The documentation provides clear, actionable guidance for users wanting to use speculative decoding with vLLM. The example configurations are based on vLLM's documented parameters and follow established Triton model repository patterns.

Caveats:

  • The TRT-LLM tutorial is a placeholder that links to the TensorRT-LLM backend repository, as detailed TRT-LLM speculative decoding documentation already exists there
  • Examples use publicly available Llama models; gated variants require Hugging Face authentication
  • Performance numbers cited are approximate and will vary based on hardware, model, and workload

Background

Issue #8699 asked whether speculative decoding is possible with the vLLM backend and requested examples. While the documentation structure referenced speculative decoding tutorials for vLLM (docs/llm_features/speculative_decoding.rst), the actual tutorial files did not exist.

vLLM has supported speculative decoding for some time through its AsyncEngineArgs parameters, which can be configured via the model.json file in Triton's vLLM backend. However, this capability was never covered in Triton's own documentation, leading to user confusion.
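
As an illustration of that mapping, an n-gram (prompt-lookup) configuration, which needs no separate draft model, might look like the following. The key names mirror recent vLLM speculative_config conventions (earlier releases used ngram_prompt_lookup_max with speculative_model set to "[ngram]"), so treat the exact spelling as version-dependent rather than authoritative:

```json
{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "speculative_config": {
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4
    }
}
```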

This PR fills that documentation gap by providing comprehensive tutorials with working examples.

Related Issues:

Linear Issue: TGH-101


…xamples

- Added detailed vLLM speculative decoding tutorial with configuration examples
- Created overview documentation explaining speculative decoding feature
- Added example model repository configurations for testing
- Included performance tuning guidelines and troubleshooting tips
- Documented supported backends and model selection guidelines

Closes #8699

Co-authored-by: David Zier <dzier@users.noreply.github.com>
dzier requested review from whoisj and yinggeh May 11, 2026 17:42
…ders

- Remove trailing whitespace from markdown files
- Update copyright headers to include (c) per style guide
- No handler needed for model.json files (consistent with existing files)

Co-authored-by: David Zier <dzier@users.noreply.github.com>
dzier marked this pull request as ready for review May 11, 2026 18:08
dzier requested a review from tanmayv25 May 11, 2026 18:08
cursoragent and others added 2 commits May 11, 2026 18:17
Improves navigation by linking to the speculative decoding overview
from the vLLM tutorial for better user experience.

Co-authored-by: David Zier <dzier@users.noreply.github.com>
Changed copyright headers from 2025-2026 to 2026 across all
documentation and configuration files.

Co-authored-by: David Zier <dzier@users.noreply.github.com>