diff --git a/docs/tutorials/Feature_Guide/Speculative_Decoding/README.md b/docs/tutorials/Feature_Guide/Speculative_Decoding/README.md new file mode 100644 index 0000000000..8afc4b7d05 --- /dev/null +++ b/docs/tutorials/Feature_Guide/Speculative_Decoding/README.md @@ -0,0 +1,312 @@ + + +# Speculative Decoding + +## Overview + +Speculative decoding (also known as speculative sampling or assisted generation) is an inference optimization technique that accelerates Large Language Model (LLM) text generation without compromising output quality. It leverages a smaller, faster "draft" model to propose candidate tokens, which are then verified in parallel by the larger "target" model. + +## Why Use Speculative Decoding? + +### Key Benefits + +1. **Reduced Latency**: Generate tokens faster by using a lightweight draft model for initial proposals +2. **Identical Outputs**: Maintains mathematically identical outputs to standard autoregressive decoding +3. **Lower Cost**: Reduces inference costs by decreasing GPU time for generation +4. **Better GPU Utilization**: Improves batch processing efficiency + +### Typical Performance Improvements + +- **1.5-3x speedup** in generation latency (varies by model and use case) +- **No quality degradation** - outputs are identical to standard decoding +- **Most effective** for generation-heavy workloads (longer output sequences) + +## How It Works + +The speculative decoding process follows these steps: + +``` +┌─────────────────────────────────────────────────────────────┐ +│ 1. Draft Model generates K candidate tokens │ +│ (Fast, small model - e.g., 1B parameters) │ +└────────────────────┬────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ 2. Target Model verifies all K tokens in PARALLEL │ +│ (Slower, large model - e.g., 70B parameters) │ +└────────────────────┬────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ 3. Accept matching tokens, correct first mismatch │ +│ Continue from corrected position │ +└─────────────────────────────────────────────────────────────┘ +``` + +### Example + +Suppose we want to generate the sentence: "The cat sat on the mat." + +1. **Draft Model proposes**: "The cat sat on the" +2. **Target Model verifies**: + - ✓ "The" - Accept + - ✓ "cat" - Accept + - ✓ "sat" - Accept + - ✓ "on" - Accept + - ✓ "the" - Accept +3. **Result**: All 5 tokens accepted, continue from "the" +4. **Next round**: Draft proposes "mat .", Target verifies and accepts + +Instead of 7 sequential forward passes through the large model, we only needed 2, reducing latency by ~3.5x. 
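In code, the accept/verify cycle looks roughly like the sketch below. This is a minimal, framework-agnostic illustration of the greedy (temperature 0) variant: `draft_next` and `target_next` are hypothetical callables standing in for the small and large models, not Triton or vLLM APIs, and a real engine verifies all candidate positions in a single batched forward pass rather than a Python loop.

```python
# Minimal sketch of draft-then-verify (greedy) speculative decoding.
# `draft_next(prefix)` and `target_next(prefix)` are hypothetical callables
# that return the next token id for a given prefix; they stand in for the
# draft and target models and are NOT Triton or vLLM APIs.
def speculative_decode(prompt_ids, draft_next, target_next, k=5, max_new_tokens=64):
    out = list(prompt_ids)
    while len(out) - len(prompt_ids) < max_new_tokens:
        # 1. The draft model proposes k candidate tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))

        # 2. The target model checks every candidate position. A production
        #    engine scores all k positions in ONE forward pass; the loop here
        #    is only for clarity.
        n_accepted, correction = k, None
        for i in range(k):
            expected = target_next(out + draft[:i])
            if expected != draft[i]:
                n_accepted, correction = i, expected
                break

        # 3. Keep the matching prefix, then append either the target's
        #    correction or the "bonus" token the verification pass yields
        #    when all k candidates match.
        out.extend(draft[:n_accepted])
        out.append(correction if correction is not None else target_next(out))
    return out
```

With sampling enabled, production implementations replace the exact-match test with a rejection-sampling rule so that the output distribution still matches the target model; the research papers linked at the end of this guide describe that verification rule.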
+ +## Supported Backends + +Triton Inference Server supports speculative decoding with multiple backends: + +| Backend | Support Status | Documentation | +|---------|---------------|---------------| +| **vLLM** | ✅ Supported | [vLLM Guide](vLLM/README.md) | +| **TensorRT-LLM** | ✅ Supported | [TRT-LLM Guide](TRT-LLM/README.md) | +| **Python Backend** | ⚠️ Custom Implementation | Manual integration required | + +## Choosing a Backend + +### vLLM Backend + +**Best for:** +- Quick deployment and prototyping +- Dynamic batching with speculative decoding +- Flexible model support (any HuggingFace model) +- Easier configuration via JSON + +**Considerations:** +- Slightly higher memory usage +- Less optimized for specific hardware compared to TRT-LLM + +[→ See vLLM Speculative Decoding Guide](vLLM/README.md) + +### TensorRT-LLM Backend + +**Best for:** +- Maximum performance on NVIDIA GPUs +- Production deployments requiring lowest latency +- INT8/FP8 quantization support +- Tightly optimized kernels + +**Considerations:** +- Requires engine compilation +- More complex setup process +- Specific model architecture support + +[→ See TensorRT-LLM Speculative Decoding Guide](TRT-LLM/README.md) + +## Quick Start + +Choose your backend and follow the corresponding guide: + +### vLLM (Recommended for Getting Started) + +```bash +# 1. Create model configuration with speculative decoding +cat > model_repository/my_model/1/model.json << EOF +{ + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "speculative_model": "meta-llama/Llama-3.2-1B-Instruct", + "num_speculative_tokens": 5, + "gpu_memory_utilization": 0.9 +} +EOF + +# 2. Launch Triton with vLLM backend +docker run --gpus all --rm \ + -v $(pwd)/model_repository:/models \ + -e HF_TOKEN=$HF_TOKEN \ + nvcr.io/nvidia/tritonserver:26.04-vllm-python-py3 \ + tritonserver --model-repository=/models + +# 3. Send inference request +curl -X POST http://localhost:8000/v2/models/my_model/generate \ + -d '{"text_input": "Explain speculative decoding"}' +``` + +[→ Full vLLM Setup Guide](vLLM/README.md) + +### TensorRT-LLM + +TensorRT-LLM requires building engines for both target and draft models. See the [TensorRT-LLM guide](TRT-LLM/README.md) for detailed instructions. + +## Model Selection Guidelines + +### Choosing a Draft Model + +For optimal performance, your draft model should be: + +1. **From the same model family** as the target model + - Example: Use Llama 3.2 1B as draft for Llama 3.1 8B/70B + - Example: Use Mistral 7B as draft for Mixtral 8x7B + +2. **10-50x smaller** than the target model + - Too small: Low acceptance rate + - Too large: Insufficient speedup + +3. **Using the same tokenizer** + - Ensures token-level compatibility + +### Recommended Model Pairs + +| Target Model | Draft Model | Size Ratio | Expected Speedup | +|--------------|-------------|------------|------------------| +| Llama-3.1-70B | Llama-3.2-1B | 70x | 2.0-2.8x | +| Llama-3.1-8B | Llama-3.2-1B | 8x | 1.5-2.2x | +| Mixtral-8x7B | Mistral-7B-v0.1 | ~8x | 1.6-2.4x | +| CodeLlama-34B | CodeLlama-7B | 4.8x | 1.4-2.0x | +| Qwen2.5-72B | Qwen2.5-7B | 10x | 1.8-2.5x | + +## Performance Tuning + +### Key Parameters + +1. **num_speculative_tokens** (vLLM) / **num_draft_tokens** (TRT-LLM) + - Controls how many tokens the draft model generates per iteration + - Higher values: Better speedup potential, more memory usage + - Typical range: 3-10 + - Start with 5 and adjust based on results + +2. 
**gpu_memory_utilization** + - Both models must fit in GPU memory simultaneously + - Reduce if encountering OOM errors + - Typical range: 0.8-0.9 + +3. **Acceptance Rate** + - Higher is better (target: >70%) + - Low acceptance (<50%) indicates draft model is too different + - Monitor via metrics endpoint + +### Memory Optimization + +``` +Total GPU Memory = Target Model + Draft Model + KV Cache + Activations +``` + +If memory is constrained: +- Use a smaller draft model +- Reduce `max_model_len` or `max_batch_size` +- Enable tensor parallelism for target model only +- Use quantization (INT8/FP8) where supported + +## Monitoring and Metrics + +Triton exposes metrics for speculative decoding performance: + +```bash +# Access metrics endpoint +curl http://localhost:8002/metrics +``` + +### Key Metrics to Monitor + +- **`speculative_acceptance_rate`**: Percentage of draft tokens accepted (target: >70%) +- **`inter_token_latency`**: Time between tokens (should decrease significantly) +- **`time_to_first_token`**: Should remain similar to non-speculative mode +- **`throughput_tokens_per_second`**: Overall tokens/sec (should increase) + +## When NOT to Use Speculative Decoding + +Speculative decoding may not be beneficial in these scenarios: + +1. **Very short outputs**: Overhead may exceed benefits for <10 token generations +2. **Extremely high batch sizes**: May be memory-constrained with both models +3. **No suitable draft model**: If no compatible smaller model exists +4. **Maximum batch utilization**: When already fully utilizing GPU with standard decoding + +## Troubleshooting + +### Low or No Speedup + +**Possible causes:** +- Draft model too different from target model +- `num_speculative_tokens` too low (try increasing to 7-10) +- Acceptance rate too low (<50%) +- Batch size too small to hide overhead + +**Solutions:** +- Use a draft model from the same family +- Increase speculative token count +- Monitor acceptance rate metrics +- Adjust batch size + +### Out of Memory Errors + +**Solutions:** +- Reduce `gpu_memory_utilization` to 0.8 or lower +- Use a smaller draft model +- Decrease `max_model_len` +- Enable tensor parallelism for target model +- Reduce `num_speculative_tokens` + +### Different Outputs Than Expected + +**Note:** Speculative decoding should produce **identical outputs** to standard decoding. 
If outputs differ: +- Check that both models loaded successfully +- Verify model versions match expected +- Review logs for errors or warnings +- This may indicate a bug - please report it + +## Additional Resources + +### Research Papers + +- [Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192) (Leviathan et al., 2023) +- [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318) (Chen et al., 2023) +- [SpecInfer: Accelerating LLM Serving with Tree-based Speculative Inference](https://arxiv.org/abs/2305.09781) (Miao et al., 2023) + +### Documentation + +- [vLLM Backend Repository](https://github.com/triton-inference-server/vllm_backend) +- [TensorRT-LLM Backend Repository](https://github.com/triton-inference-server/tensorrtllm_backend) +- [Triton Server Documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/) + +### Community + +- [NVIDIA Developer Forums](https://forums.developer.nvidia.com/) +- [Triton GitHub Discussions](https://github.com/triton-inference-server/server/discussions) +- [Report Issues](https://github.com/triton-inference-server/server/issues) + +## Next Steps + +1. **Choose your backend**: [vLLM](vLLM/README.md) or [TensorRT-LLM](TRT-LLM/README.md) +2. **Follow the setup guide** for your chosen backend +3. **Experiment with different draft models** and parameters +4. **Monitor metrics** to optimize performance +5. **Deploy to production** once you've validated the configuration + +--- + +**Questions or feedback?** Please open an issue on the [Triton GitHub repository](https://github.com/triton-inference-server/server/issues). diff --git a/docs/tutorials/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md b/docs/tutorials/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md new file mode 100644 index 0000000000..51b2435549 --- /dev/null +++ b/docs/tutorials/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md @@ -0,0 +1,62 @@ + + +# Speculative Decoding with TensorRT-LLM Backend + +## Overview + +TensorRT-LLM backend provides highly optimized speculative decoding support for NVIDIA GPUs. This guide covers configuration and deployment of speculative decoding using the TensorRT-LLM backend. + +## Documentation + +For comprehensive documentation on speculative decoding with TensorRT-LLM, please refer to the official TensorRT-LLM backend documentation: + +[TensorRT-LLM Backend Decoding Documentation](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#decoding) + +## Key Features + +- Maximum performance on NVIDIA GPUs with optimized kernels +- Support for INT8/FP8 quantization +- Advanced scheduling and batching +- Medusa and standard speculative decoding modes + +## Quick Reference + +For speculative decoding setup with TensorRT-LLM: + +1. Build TensorRT-LLM engines for both target and draft models +2. Configure the model repository with appropriate parameters +3. Deploy using Triton with TensorRT-LLM backend + +See the [TensorRT-LLM backend documentation](https://github.com/triton-inference-server/tensorrtllm_backend) for detailed instructions. 
+ +## See Also + +- [Speculative Decoding Overview](../README.md) +- [vLLM Speculative Decoding](../vLLM/README.md) (alternative backend) +- [TensorRT-LLM Backend](https://github.com/triton-inference-server/tensorrtllm_backend) diff --git a/docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md b/docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md new file mode 100644 index 0000000000..046cc4fd20 --- /dev/null +++ b/docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md @@ -0,0 +1,349 @@ + + +# Speculative Decoding with vLLM Backend + +## Overview + +Speculative decoding is an inference optimization technique that accelerates text generation by using a smaller, faster "draft" model to propose tokens, which are then verified by the larger "target" model. This approach can significantly reduce latency while maintaining the same output quality as standard decoding. + +The vLLM backend in Triton Inference Server supports speculative decoding, allowing you to leverage this optimization technique for your LLM deployments. + +## How It Works + +1. A small draft model quickly generates multiple candidate tokens +2. The larger target model verifies these candidates in parallel +3. Accepted tokens are returned; rejected tokens are corrected by the target model +4. This process continues until the full response is generated + +The speedup comes from the fact that the draft model is much faster than the target model, and verification can be done in parallel for multiple tokens. + +## Prerequisites + +- Triton Inference Server with vLLM backend support +- A target model (the main model you want to serve) +- A draft model (a smaller, compatible model from the same family) +- Docker with NVIDIA Container Runtime +- Access to HuggingFace models (HF_TOKEN if using gated models) + +## Configuration + +Speculative decoding with vLLM is configured through the `model.json` file in your model repository. The vLLM backend passes the configuration parameters directly to vLLM's `AsyncEngineArgs`. 
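As a rough mental model (a simplification, not the backend's actual source code), the backend reads `model.json` and forwards its fields as keyword arguments when it constructs the engine, along the lines of the sketch below. The file path is illustrative; the important point is that every key in `model.json` must be a valid engine argument for the vLLM release bundled in your Triton container.

```python
# Illustrative sketch of how model.json fields reach vLLM; this is NOT the
# actual implementation of the Triton vLLM backend.
import json

from vllm import AsyncEngineArgs, AsyncLLMEngine

# Hypothetical path; in Triton the file lives under <model_repository>/<model>/1/.
with open("model_repository/llama-3.1-70b-speculative/1/model.json") as f:
    vllm_config = json.load(f)

# Keys such as "speculative_model" and "num_speculative_tokens" are accepted
# only if the bundled vLLM release exposes them as engine arguments.
engine_args = AsyncEngineArgs(**vllm_config)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

In practice this means the set of accepted keys and their defaults track the vLLM release shipped with your container, so consult vLLM's engine-argument reference for that version if a field is rejected at startup.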
### Key Parameters

The following parameters in `model.json` control speculative decoding:

- **`speculative_model`**: The name or path of the draft model to use
- **`num_speculative_tokens`**: Number of tokens to generate speculatively per iteration (set explicitly; 5 is a common starting point)
- **`speculative_draft_tensor_parallel_size`**: Tensor parallelism size for the draft model (optional)
- **`ngram_prompt_lookup_max`**: Maximum n-gram size for n-gram based speculation (an alternative to a draft model)
- **`ngram_prompt_lookup_min`**: Minimum n-gram size for prompt lookup

### Example 1: Basic Speculative Decoding Configuration

Here's a simple example using Llama models:

```json
{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
    "gpu_memory_utilization": 0.9
}
```

### Example 2: Speculative Decoding with Tensor Parallelism

For larger deployments with multi-GPU setups:

```json
{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "tensor_parallel_size": 4,
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
    "speculative_draft_tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.85
}
```

### Example 3: N-gram Prompt Lookup (Alternative Approach)

Instead of using a separate draft model, you can use n-gram based speculation:

```json
{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "ngram_prompt_lookup_max": 4,
    "ngram_prompt_lookup_min": 1,
    "gpu_memory_utilization": 0.9
}
```

## Model Repository Structure

Your model repository should follow this structure:

```
model_repository/
└── llama-3.1-70b-speculative/
    ├── config.pbtxt
    └── 1/
        └── model.json
```

### config.pbtxt

```
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# (License text omitted for brevity)

name: "llama-3.1-70b-speculative"
backend: "vllm"
max_batch_size: 0
model_transaction_policy {
  decoupled: True
}

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
```

## Running the Example

### Step 1: Launch Container

```bash
docker run -it --net=host --gpus all --rm \
    -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
    -v ${PWD}/model_repository:/model_repository \
    -e HF_TOKEN \
    nvcr.io/nvidia/tritonserver:26.04-vllm-python-py3
```

### Step 2: Start Triton Server

For the native vLLM backend:

```bash
tritonserver --model-repository=/model_repository
```

Or using the OpenAI-compatible frontend:

```bash
cd /opt/tritonserver/python/openai
python3 openai_frontend/main.py \
    --model-repository /model_repository \
    --tokenizer meta-llama/Meta-Llama-3.1-70B-Instruct
```

### Step 3: Send Inference Requests

Using the OpenAI API:

```bash
MODEL="llama-3.1-70b-speculative"
curl -s http://localhost:9000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "'${MODEL}'",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in simple terms."}
        ],
        "max_tokens": 256
    }' | jq
```

Or using the native Triton API:

```bash
curl -X POST http://localhost:8000/v2/models/llama-3.1-70b-speculative/generate \
    -H 'Content-Type: application/json' \
    -d '{
        "text_input": "Explain quantum computing in simple terms.",
        "parameters": {
            "max_tokens": 256,
            "temperature": 0.7
        }
    }'
```

## Choosing the Right Draft Model

For best results:

1. **Same Model Family**: Use a draft model from the same family as your target model (e.g., Llama 3.2 1B for Llama 3.1 70B)
2. **Size Ratio**: Aim for a draft model that is 10-50x smaller than the target model
3. **Architecture Compatibility**: Ensure the draft model has a compatible architecture (same tokenizer, similar attention mechanisms)

### Popular Model Combinations

| Target Model | Draft Model | Expected Speedup |
|--------------|-------------|------------------|
| Llama-3.1-70B | Llama-3.2-1B | 1.5-2.5x |
| Llama-3.1-8B | Llama-3.2-1B | 1.3-2.0x |
| Mixtral-8x7B | Mistral-7B-v0.1 | 1.4-2.2x |

> **Note**: Actual speedup depends on hardware, batch size, sequence length, and the similarity between draft and target model outputs.

## Performance Tuning

### Adjusting num_speculative_tokens

- **Higher values (5-10)**: Better speedup potential but higher memory usage
- **Lower values (2-4)**: More conservative, lower memory overhead
- **Start with 5** and adjust based on your specific use case

### Memory Considerations

Speculative decoding requires loading both models into GPU memory. Adjust `gpu_memory_utilization` accordingly:

```json
{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
    "gpu_memory_utilization": 0.85
}
```

If you encounter OOM errors, try:
1. Reducing `gpu_memory_utilization` to 0.8 or lower
2. Decreasing `num_speculative_tokens`
3. Using a smaller draft model
4. Enabling tensor parallelism for the target model
## Monitoring and Debugging

### Check Model Loading

When Triton starts, you should see log messages indicating both models are loaded:

```
I0511 00:00:00.000000 1 llm_engine.py:123] Initializing an LLM engine with config: ...
I0511 00:00:00.000000 1 llm_engine.py:456] Using speculative decoding with draft model: meta-llama/Llama-3.2-1B-Instruct
```

### Metrics

Monitor these metrics to evaluate speculative decoding performance:

- **Acceptance Rate**: Percentage of draft tokens accepted (higher is better)
- **Time to First Token (TTFT)**: Should be similar to non-speculative mode
- **Inter-Token Latency**: Should be significantly lower
- **Throughput**: Overall tokens/second should increase

Access metrics at `http://localhost:8002/metrics` (or `:9000/metrics` for the OpenAI frontend).

## Troubleshooting

### Common Issues

**Issue**: Model fails to load with OOM error
```
Solution: Reduce gpu_memory_utilization or use a smaller draft model
```

**Issue**: No speedup observed
```
Solution:
- Ensure the draft model is from the same family as the target model
- Check that num_speculative_tokens > 0
- Verify both models loaded successfully in logs
- Try increasing num_speculative_tokens
```

**Issue**: Different outputs compared to non-speculative mode
```
Solution: This should not happen - speculative decoding guarantees identical outputs.
Check vLLM backend logs for errors. This may indicate a configuration issue.
```

**Issue**: Draft model not found or fails to load
```
Solution:
- Verify the speculative_model path/name is correct
- Ensure HF_TOKEN is set if using gated models
- Check that the draft model is cached or can be downloaded
```

## Limitations

1. **Memory Overhead**: Requires loading both target and draft models
2. **Model Compatibility**: Draft model must be compatible with the target model
3. **Batch Size**: Effectiveness may vary with different batch sizes
4. **Sequence Length**: Longer sequences may see different speedup characteristics

## Additional Resources

- [Speculative Decoding Overview](../README.md) - High-level guide and backend comparison
- [vLLM Backend Documentation](https://github.com/triton-inference-server/vllm_backend)
- [vLLM Speculative Decoding](https://docs.vllm.ai/)
- [Speculative Decoding Paper](https://arxiv.org/abs/2211.17192)
- [TRT-LLM Speculative Decoding](../TRT-LLM/README.md) (alternative backend)

## References

- Chen, C., et al. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling"
- Leviathan, Y., et al. (2023). "Fast Inference from Transformers via Speculative Decoding"

## Feedback and Support

For issues or questions:
- [Triton GitHub Issues](https://github.com/triton-inference-server/server/issues)
- [vLLM Backend Issues](https://github.com/triton-inference-server/vllm_backend/issues)
- [NVIDIA Developer Forums](https://forums.developer.nvidia.com/)
diff --git a/qa/L0_vllm_speculative_decoding/README.md b/qa/L0_vllm_speculative_decoding/README.md
new file mode 100644
index 0000000000..2c2a6b80db
--- /dev/null
+++ b/qa/L0_vllm_speculative_decoding/README.md
@@ -0,0 +1,88 @@

# vLLM Speculative Decoding Test

This directory contains test configurations for the vLLM speculative decoding feature.
## Model Configuration

The example model `llama-speculative` demonstrates how to configure speculative decoding:

```json
{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
    "gpu_memory_utilization": 0.9,
    "max_model_len": 2048
}
```

## Running the Test

```bash
# Set your HuggingFace token if using gated models
export HF_TOKEN="your_token_here"

# Launch container
docker run -it --net=host --gpus all --rm \
    -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
    -v $(pwd)/model_repository:/model_repository \
    -e HF_TOKEN \
    nvcr.io/nvidia/tritonserver:26.04-vllm-python-py3

# Inside container, start Triton
tritonserver --model-repository=/model_repository
```

## Sending Test Requests

```bash
# Using Triton's native API
curl -X POST http://localhost:8000/v2/models/llama-speculative/generate \
    -H 'Content-Type: application/json' \
    -d '{
        "text_input": "What is speculative decoding?",
        "parameters": {
            "max_tokens": 100,
            "temperature": 0.7
        }
    }'
```

## Expected Behavior

- Both target and draft models should load successfully
- Inference should complete with reduced latency compared to non-speculative mode
- Output quality should be identical to standard decoding
- Server logs should show speculative decoding is enabled

## See Also

- [vLLM Speculative Decoding Tutorial](../../docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md)
diff --git a/qa/L0_vllm_speculative_decoding/model_repository/llama-speculative/1/model.json b/qa/L0_vllm_speculative_decoding/model_repository/llama-speculative/1/model.json
new file mode 100644
index 0000000000..a3084504f0
--- /dev/null
+++ b/qa/L0_vllm_speculative_decoding/model_repository/llama-speculative/1/model.json
@@ -0,0 +1,7 @@
{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
    "gpu_memory_utilization": 0.9,
    "max_model_len": 2048
}
diff --git a/qa/L0_vllm_speculative_decoding/model_repository/llama-speculative/config.pbtxt b/qa/L0_vllm_speculative_decoding/model_repository/llama-speculative/config.pbtxt
new file mode 100644
index 0000000000..708c1337b4
--- /dev/null
+++ b/qa/L0_vllm_speculative_decoding/model_repository/llama-speculative/config.pbtxt
@@ -0,0 +1,74 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "llama-speculative" +backend: "vllm" +max_batch_size: 0 + +model_transaction_policy { + decoupled: True +} + +input [ + { + name: "text_input" + data_type: TYPE_STRING + dims: [ -1 ] + }, + { + name: "stream" + data_type: TYPE_BOOL + dims: [ 1 ] + optional: true + }, + { + name: "sampling_parameters" + data_type: TYPE_STRING + dims: [ -1 ] + optional: true + }, + { + name: "exclude_input_in_output" + data_type: TYPE_BOOL + dims: [ 1 ] + optional: true + } +] + +output [ + { + name: "text_output" + data_type: TYPE_STRING + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_MODEL + } +]