docs/tutorials/Feature_Guide/Speculative_Decoding/README.md

<!--
Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Speculative Decoding

## Overview

Speculative decoding (also known as speculative sampling or assisted generation) is an inference optimization technique that accelerates Large Language Model (LLM) text generation without compromising output quality. It leverages a smaller, faster "draft" model to propose candidate tokens, which are then verified in parallel by the larger "target" model.

## Why Use Speculative Decoding?

### Key Benefits

1. **Reduced Latency**: Generate tokens faster by using a lightweight draft model for initial proposals
2. **Identical Outputs**: Maintains mathematically identical outputs to standard autoregressive decoding
3. **Lower Cost**: Reduces inference costs by decreasing GPU time for generation
4. **Better GPU Utilization**: Improves batch processing efficiency

### Typical Performance Improvements

- **1.5-3x speedup** in generation latency (varies by model and use case)
- **No quality degradation** - outputs are identical to standard decoding
- **Most effective** for generation-heavy workloads (longer output sequences)

## How It Works

The speculative decoding process follows these steps:

```
┌─────────────────────────────────────────────────────────────┐
│ 1. Draft Model generates K candidate tokens                 │
│    (Fast, small model - e.g., 1B parameters)                │
└────────────────────┬────────────────────────────────────────┘
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Target Model verifies all K tokens in PARALLEL           │
│    (Slower, large model - e.g., 70B parameters)             │
└────────────────────┬────────────────────────────────────────┘
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Accept matching tokens, correct first mismatch           │
│    Continue from corrected position                         │
└─────────────────────────────────────────────────────────────┘
```

### Example

Suppose we want to generate the sentence: "The cat sat on the mat."

1. **Draft Model proposes**: "The cat sat on the"
2. **Target Model verifies**:
- ✓ "The" - Accept
- ✓ "cat" - Accept
- ✓ "sat" - Accept
- ✓ "on" - Accept
- ✓ "the" - Accept
3. **Result**: All 5 tokens accepted, continue from "the"
4. **Next round**: Draft proposes "mat .", Target verifies and accepts

Instead of 7 sequential forward passes through the large model, we only needed 2, reducing latency by ~3.5x.
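
In code, the loop looks roughly like the sketch below. This is a minimal greedy-decoding variant for illustration only; production backends verify against the target model's probability distribution with rejection sampling, and `draft_generate` / `target_forward` are hypothetical stand-ins for the two models.

```python
# Minimal sketch of the draft-then-verify loop (greedy variant, for
# illustration only). draft_generate() and target_forward() are hypothetical
# stand-ins: the first returns K proposed token IDs, the second scores the
# sequence plus all K proposals in one pass and returns the target model's
# own choice at each of the K+1 positions.

def speculative_decode(prompt_ids, draft_generate, target_forward,
                       num_speculative_tokens=5, max_new_tokens=64):
    tokens = list(prompt_ids)
    while len(tokens) - len(prompt_ids) < max_new_tokens:
        # 1. Draft model proposes K candidate tokens (cheap, sequential).
        draft = draft_generate(tokens, num_speculative_tokens)

        # 2. One target-model forward pass verifies all K candidates at once.
        target_choices = target_forward(tokens, draft)  # length K + 1

        # 3. Accept draft tokens while they match; on the first mismatch,
        #    take the target model's token instead and resume from there.
        for proposed, chosen in zip(draft, target_choices):
            if proposed == chosen:
                tokens.append(proposed)
            else:
                tokens.append(chosen)  # correction from the target model
                break
        else:
            # Every draft token was accepted, so the verification pass also
            # yields a free "bonus" token at position K + 1.
            tokens.append(target_choices[-1])
    return tokens
```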

## Supported Backends

Triton Inference Server supports speculative decoding with multiple backends:

| Backend | Support Status | Documentation |
|---------|---------------|---------------|
| **vLLM** | ✅ Supported | [vLLM Guide](vLLM/README.md) |
| **TensorRT-LLM** | ✅ Supported | [TRT-LLM Guide](TRT-LLM/README.md) |
| **Python Backend** | ⚠️ Custom Implementation | Manual integration required |

## Choosing a Backend

### vLLM Backend

**Best for:**
- Quick deployment and prototyping
- Dynamic batching with speculative decoding
- Flexible model support (any HuggingFace model)
- Easier configuration via JSON

**Considerations:**
- Slightly higher memory usage
- Less optimized for specific hardware compared to TRT-LLM

[→ See vLLM Speculative Decoding Guide](vLLM/README.md)

### TensorRT-LLM Backend

**Best for:**
- Maximum performance on NVIDIA GPUs
- Production deployments requiring lowest latency
- INT8/FP8 quantization support
- Tightly optimized kernels

**Considerations:**
- Requires engine compilation
- More complex setup process
- Specific model architecture support

[→ See TensorRT-LLM Speculative Decoding Guide](TRT-LLM/README.md)

## Quick Start

Choose your backend and follow the corresponding guide:

### vLLM (Recommended for Getting Started)

```bash
# 1. Create model configuration with speculative decoding
cat > model_repository/my_model/1/model.json << EOF
{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
    "gpu_memory_utilization": 0.9
}
EOF

# 2. Launch Triton with the vLLM backend (publish the HTTP, gRPC, and metrics ports)
docker run --gpus all --rm \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v $(pwd)/model_repository:/models \
    -e HF_TOKEN=$HF_TOKEN \
    nvcr.io/nvidia/tritonserver:26.04-vllm-python-py3 \
    tritonserver --model-repository=/models

# 3. Send an inference request
curl -X POST http://localhost:8000/v2/models/my_model/generate \
    -d '{"text_input": "Explain speculative decoding"}'
```
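
The same request can also be sent from Python. Below is a minimal sketch using the `requests` library against the generate endpoint shown above; it assumes the quick-start server is running locally, the model is named `my_model`, and `text_output` is the vLLM backend's default output field.

```python
# Minimal Python client for the generate endpoint used in the quick start.
# Assumes the server is reachable on localhost:8000 and the model is named
# "my_model"; "text_output" is the vLLM backend's default output field.
import requests

url = "http://localhost:8000/v2/models/my_model/generate"
payload = {"text_input": "Explain speculative decoding"}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["text_output"])
```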

[→ Full vLLM Setup Guide](vLLM/README.md)

### TensorRT-LLM

TensorRT-LLM requires building engines for both target and draft models. See the [TensorRT-LLM guide](TRT-LLM/README.md) for detailed instructions.

## Model Selection Guidelines

### Choosing a Draft Model

For optimal performance, your draft model should be:

1. **From the same model family** as the target model
   - Example: Use Llama 3.2 1B as draft for Llama 3.1 8B/70B
   - Example: Use Mistral 7B as draft for Mixtral 8x7B

2. **Substantially smaller** than the target model (commonly 5-70x; see the table below)
   - Too small: Low acceptance rate
   - Too large: Insufficient speedup

3. **Using the same tokenizer**
   - Ensures token-level compatibility
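
One way to sanity-check the tokenizer requirement is to compare the two tokenizers directly. The rough sketch below uses the `transformers` library and the example pair above; both repositories are gated on Hugging Face, so an access token is needed, and identical encoding of one sample is a quick check rather than a guarantee.

```python
# Rough tokenizer-compatibility check for a target/draft pair, using the
# example models above. Both repositories are gated on Hugging Face, so an
# access token with permission is required. Identical encoding of one sample
# is a quick sanity check, not a full guarantee.
from transformers import AutoTokenizer

target_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
draft_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

sample = "Speculative decoding verifies draft tokens in parallel."
print("vocab sizes match:", target_tok.vocab_size == draft_tok.vocab_size)
print("sample encodes identically:",
      target_tok.encode(sample) == draft_tok.encode(sample))
```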

### Recommended Model Pairs

| Target Model | Draft Model | Size Ratio | Expected Speedup |
|--------------|-------------|------------|------------------|
| Llama-3.1-70B | Llama-3.2-1B | 70x | 2.0-2.8x |
| Llama-3.1-8B | Llama-3.2-1B | 8x | 1.5-2.2x |
| Mixtral-8x7B | Mistral-7B-v0.1 | ~8x | 1.6-2.4x |
| CodeLlama-34B | CodeLlama-7B | 4.8x | 1.4-2.0x |
| Qwen2.5-72B | Qwen2.5-7B | 10x | 1.8-2.5x |

## Performance Tuning

### Key Parameters

1. **num_speculative_tokens** (vLLM) / **num_draft_tokens** (TRT-LLM)
   - Controls how many tokens the draft model generates per iteration
   - Higher values: Better speedup potential, more memory usage
   - Typical range: 3-10
   - Start with 5 and adjust based on results (see the sketch after this list)

2. **gpu_memory_utilization**
   - Both models must fit in GPU memory simultaneously
   - Reduce if encountering OOM errors
   - Typical range: 0.8-0.9

3. **Acceptance Rate**
   - Higher is better (target: >70%)
   - Low acceptance (<50%) indicates the draft model is too different from the target
   - Monitor via the metrics endpoint
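
To build intuition for how draft length and acceptance rate interact, the expected number of tokens produced per target-model pass can be estimated with the standard formula from the speculative decoding literature (Leviathan et al., 2023), assuming each draft token is accepted independently with probability `alpha`. The sketch below is an estimate only and ignores the draft model's own latency.

```python
# Rough estimate of how many tokens each target-model pass produces, assuming
# every draft token is accepted independently with probability alpha
# (Leviathan et al., 2023). Ignores draft-model latency, so real speedups
# are lower than these numbers suggest.

def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    if alpha >= 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for k in (3, 5, 7, 10):
    row = ", ".join(
        f"alpha={a:.1f}: {expected_tokens_per_target_pass(a, k):.2f}"
        for a in (0.5, 0.7, 0.9)
    )
    print(f"K={k:>2}  ->  {row}")
```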

### Memory Optimization

```
Total GPU Memory = Target Model + Draft Model + KV Cache + Activations
```

If memory is constrained:
- Use a smaller draft model
- Reduce `max_model_len` or `max_batch_size`
- Enable tensor parallelism for target model only
- Use quantization (INT8/FP8) where supported
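
As a rough feasibility check, the weight footprint of both models can be estimated from parameter count and precision. The back-of-envelope sketch below covers weights only; the KV cache, activations, and framework overhead from the equation above come on top, and the parameter counts, GPU size, and utilization figure are illustrative assumptions.

```python
# Back-of-envelope GPU memory check: model weights only. KV cache,
# activations, and framework overhead come on top. Parameter counts, the
# GPU size, and the utilization figure are illustrative assumptions.

BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weights_gib(params_billions: float, dtype: str = "fp16") -> float:
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 2**30

target = weights_gib(8)   # e.g. an 8B target in fp16  -> ~15 GiB
draft = weights_gib(1)    # e.g. a 1B draft in fp16    -> ~2 GiB
budget = 80 * 0.9         # e.g. one 80 GiB GPU at gpu_memory_utilization=0.9

print(f"target ~{target:.1f} GiB, draft ~{draft:.1f} GiB, "
      f"left for KV cache and activations ~{budget - target - draft:.1f} GiB")
```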

## Monitoring and Metrics

Triton exposes metrics for speculative decoding performance:

```bash
# Access metrics endpoint
curl http://localhost:8002/metrics
```

### Key Metrics to Monitor

- **`speculative_acceptance_rate`**: Percentage of draft tokens accepted (target: >70%)
- **`inter_token_latency`**: Time between tokens (should decrease significantly)
- **`time_to_first_token`**: Should remain similar to non-speculative mode
- **`throughput_tokens_per_second`**: Overall tokens/sec (should increase)
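
Because exact metric names differ across backends and versions, a practical approach is to scrape the Prometheus endpoint and filter for speculative-decoding-related entries. In the sketch below, the keyword list is an assumption; adjust it to the names your backend actually reports.

```python
# Scrape Triton's Prometheus metrics endpoint and print entries related to
# speculative decoding and token latency. Metric names vary by backend and
# version, so this filters by keyword rather than by exact name.
import requests

METRICS_URL = "http://localhost:8002/metrics"
KEYWORDS = ("spec", "accept", "draft", "token")

for line in requests.get(METRICS_URL, timeout=10).text.splitlines():
    if line.startswith("#"):
        continue  # skip Prometheus HELP/TYPE comment lines
    if any(keyword in line.lower() for keyword in KEYWORDS):
        print(line)
```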

## When NOT to Use Speculative Decoding

Speculative decoding may not be beneficial in these scenarios:

1. **Very short outputs**: Overhead may exceed benefits for <10 token generations
2. **Extremely high batch sizes**: May be memory-constrained with both models
3. **No suitable draft model**: If no compatible smaller model exists
4. **Maximum batch utilization**: When the GPU is already fully utilized with standard decoding

## Troubleshooting

### Low or No Speedup

**Possible causes:**
- Draft model too different from target model
- `num_speculative_tokens` too low (try increasing to 7-10)
- Acceptance rate too low (<50%)
- Batch size too small to hide overhead

**Solutions:**
- Use a draft model from the same family
- Increase speculative token count
- Monitor acceptance rate metrics
- Adjust batch size

### Out of Memory Errors

**Solutions:**
- Reduce `gpu_memory_utilization` to 0.8 or lower
- Use a smaller draft model
- Decrease `max_model_len`
- Enable tensor parallelism for target model
- Reduce `num_speculative_tokens`

### Different Outputs Than Expected

**Note:** Speculative decoding should produce **identical outputs** to standard decoding. If outputs differ:
- Check that both models loaded successfully
- Verify that the deployed model versions match what you expect
- Review logs for errors or warnings
- This may indicate a bug - please report it

## Additional Resources

### Research Papers

- [Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192) (Leviathan et al., 2023)
- [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318) (Chen et al., 2023)
- [SpecInfer: Accelerating LLM Serving with Tree-based Speculative Inference](https://arxiv.org/abs/2305.09781) (Miao et al., 2023)

### Documentation

- [vLLM Backend Repository](https://github.com/triton-inference-server/vllm_backend)
- [TensorRT-LLM Backend Repository](https://github.com/triton-inference-server/tensorrtllm_backend)
- [Triton Server Documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/)

### Community

- [NVIDIA Developer Forums](https://forums.developer.nvidia.com/)
- [Triton GitHub Discussions](https://github.com/triton-inference-server/server/discussions)
- [Report Issues](https://github.com/triton-inference-server/server/issues)

## Next Steps

1. **Choose your backend**: [vLLM](vLLM/README.md) or [TensorRT-LLM](TRT-LLM/README.md)
2. **Follow the setup guide** for your chosen backend
3. **Experiment with different draft models** and parameters
4. **Monitor metrics** to optimize performance
5. **Deploy to production** once you've validated the configuration

---

**Questions or feedback?** Please open an issue on the [Triton GitHub repository](https://github.com/triton-inference-server/server/issues).
docs/tutorials/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md

<!--
Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Speculative Decoding with TensorRT-LLM Backend

## Overview

The TensorRT-LLM backend provides highly optimized speculative decoding support for NVIDIA GPUs. This guide covers configuring and deploying speculative decoding with the TensorRT-LLM backend.

## Documentation

For comprehensive documentation on speculative decoding with TensorRT-LLM, please refer to the official TensorRT-LLM backend documentation:

[TensorRT-LLM Backend Decoding Documentation](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#decoding)

## Key Features

- Maximum performance on NVIDIA GPUs with optimized kernels
- Support for INT8/FP8 quantization
- Advanced scheduling and batching
- Medusa and standard speculative decoding modes

## Quick Reference

For speculative decoding setup with TensorRT-LLM:

1. Build TensorRT-LLM engines for both target and draft models
2. Configure the model repository with appropriate parameters
3. Deploy using Triton with TensorRT-LLM backend

See the [TensorRT-LLM backend documentation](https://github.com/triton-inference-server/tensorrtllm_backend) for detailed instructions.

## See Also

- [Speculative Decoding Overview](../README.md)
- [vLLM Speculative Decoding](../vLLM/README.md) (alternative backend)
- [TensorRT-LLM Backend](https://github.com/triton-inference-server/tensorrtllm_backend)