diff --git a/docs/tutorials/Feature_Guide/Speculative_Decoding/README.md b/docs/tutorials/Feature_Guide/Speculative_Decoding/README.md new file mode 100644 index 0000000000..8afc4b7d05 --- /dev/null +++ b/docs/tutorials/Feature_Guide/Speculative_Decoding/README.md @@ -0,0 +1,312 @@ + + +# Speculative Decoding + +## Overview + +Speculative decoding (also known as speculative sampling or assisted generation) is an inference optimization technique that accelerates Large Language Model (LLM) text generation without compromising output quality. It leverages a smaller, faster "draft" model to propose candidate tokens, which are then verified in parallel by the larger "target" model. + +## Why Use Speculative Decoding? + +### Key Benefits + +1. **Reduced Latency**: Generate tokens faster by using a lightweight draft model for initial proposals +2. **Identical Outputs**: Maintains mathematically identical outputs to standard autoregressive decoding +3. **Lower Cost**: Reduces inference costs by decreasing GPU time for generation +4. **Better GPU Utilization**: Improves batch processing efficiency + +### Typical Performance Improvements + +- **1.5-3x speedup** in generation latency (varies by model and use case) +- **No quality degradation** - outputs are identical to standard decoding +- **Most effective** for generation-heavy workloads (longer output sequences) + +## How It Works + +The speculative decoding process follows these steps: + +``` +┌─────────────────────────────────────────────────────────────┐ +│ 1. Draft Model generates K candidate tokens │ +│ (Fast, small model - e.g., 1B parameters) │ +└────────────────────┬────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ 2. Target Model verifies all K tokens in PARALLEL │ +│ (Slower, large model - e.g., 70B parameters) │ +└────────────────────┬────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ 3. Accept matching tokens, correct first mismatch │ +│ Continue from corrected position │ +└─────────────────────────────────────────────────────────────┘ +``` + +### Example + +Suppose we want to generate the sentence: "The cat sat on the mat." + +1. **Draft Model proposes**: "The cat sat on the" +2. **Target Model verifies**: + - ✓ "The" - Accept + - ✓ "cat" - Accept + - ✓ "sat" - Accept + - ✓ "on" - Accept + - ✓ "the" - Accept +3. **Result**: All 5 tokens accepted, continue from "the" +4. **Next round**: Draft proposes "mat .", Target verifies and accepts + +Instead of 7 sequential forward passes through the large model, we only needed 2, reducing latency by ~3.5x. 
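In code, the accept/verify cycle looks roughly like the sketch below. This is a minimal, framework-agnostic illustration of the greedy (temperature 0) variant: `draft_next` and `target_next` are hypothetical callables standing in for the small and large models, not Triton or vLLM APIs, and a real engine verifies all candidate positions in a single batched forward pass rather than a Python loop.

```python
# Minimal sketch of draft-then-verify (greedy) speculative decoding.
# `draft_next(prefix)` and `target_next(prefix)` are hypothetical callables
# that return the next token id for a given prefix; they stand in for the
# draft and target models and are NOT Triton or vLLM APIs.
def speculative_decode(prompt_ids, draft_next, target_next, k=5, max_new_tokens=64):
    out = list(prompt_ids)
    while len(out) - len(prompt_ids) < max_new_tokens:
        # 1. The draft model proposes k candidate tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))

        # 2. The target model checks every candidate position. A production
        #    engine scores all k positions in ONE forward pass; the loop here
        #    is only for clarity.
        n_accepted, correction = k, None
        for i in range(k):
            expected = target_next(out + draft[:i])
            if expected != draft[i]:
                n_accepted, correction = i, expected
                break

        # 3. Keep the matching prefix, then append either the target's
        #    correction or the "bonus" token the verification pass yields
        #    when all k candidates match.
        out.extend(draft[:n_accepted])
        out.append(correction if correction is not None else target_next(out))
    return out
```

With sampling enabled, production implementations replace the exact-match test with a rejection-sampling rule so that the output distribution still matches the target model; the research papers linked at the end of this guide describe that verification rule.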
+ +## Supported Backends + +Triton Inference Server supports speculative decoding with multiple backends: + +| Backend | Support Status | Documentation | +|---------|---------------|---------------| +| **vLLM** | ✅ Supported | [vLLM Guide](vLLM/README.md) | +| **TensorRT-LLM** | ✅ Supported | [TRT-LLM Guide](TRT-LLM/README.md) | +| **Python Backend** | ⚠️ Custom Implementation | Manual integration required | + +## Choosing a Backend + +### vLLM Backend + +**Best for:** +- Quick deployment and prototyping +- Dynamic batching with speculative decoding +- Flexible model support (any HuggingFace model) +- Easier configuration via JSON + +**Considerations:** +- Slightly higher memory usage +- Less optimized for specific hardware compared to TRT-LLM + +[→ See vLLM Speculative Decoding Guide](vLLM/README.md) + +### TensorRT-LLM Backend + +**Best for:** +- Maximum performance on NVIDIA GPUs +- Production deployments requiring lowest latency +- INT8/FP8 quantization support +- Tightly optimized kernels + +**Considerations:** +- Requires engine compilation +- More complex setup process +- Specific model architecture support + +[→ See TensorRT-LLM Speculative Decoding Guide](TRT-LLM/README.md) + +## Quick Start + +Choose your backend and follow the corresponding guide: + +### vLLM (Recommended for Getting Started) + +```bash +# 1. Create model configuration with speculative decoding +cat > model_repository/my_model/1/model.json << EOF +{ + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "speculative_model": "meta-llama/Llama-3.2-1B-Instruct", + "num_speculative_tokens": 5, + "gpu_memory_utilization": 0.9 +} +EOF + +# 2. Launch Triton with vLLM backend +docker run --gpus all --rm \ + -v $(pwd)/model_repository:/models \ + -e HF_TOKEN=$HF_TOKEN \ + nvcr.io/nvidia/tritonserver:26.04-vllm-python-py3 \ + tritonserver --model-repository=/models + +# 3. Send inference request +curl -X POST http://localhost:8000/v2/models/my_model/generate \ + -d '{"text_input": "Explain speculative decoding"}' +``` + +[→ Full vLLM Setup Guide](vLLM/README.md) + +### TensorRT-LLM + +TensorRT-LLM requires building engines for both target and draft models. See the [TensorRT-LLM guide](TRT-LLM/README.md) for detailed instructions. + +## Model Selection Guidelines + +### Choosing a Draft Model + +For optimal performance, your draft model should be: + +1. **From the same model family** as the target model + - Example: Use Llama 3.2 1B as draft for Llama 3.1 8B/70B + - Example: Use Mistral 7B as draft for Mixtral 8x7B + +2. **10-50x smaller** than the target model + - Too small: Low acceptance rate + - Too large: Insufficient speedup + +3. **Using the same tokenizer** + - Ensures token-level compatibility + +### Recommended Model Pairs + +| Target Model | Draft Model | Size Ratio | Expected Speedup | +|--------------|-------------|------------|------------------| +| Llama-3.1-70B | Llama-3.2-1B | 70x | 2.0-2.8x | +| Llama-3.1-8B | Llama-3.2-1B | 8x | 1.5-2.2x | +| Mixtral-8x7B | Mistral-7B-v0.1 | ~8x | 1.6-2.4x | +| CodeLlama-34B | CodeLlama-7B | 4.8x | 1.4-2.0x | +| Qwen2.5-72B | Qwen2.5-7B | 10x | 1.8-2.5x | + +## Performance Tuning + +### Key Parameters + +1. **num_speculative_tokens** (vLLM) / **num_draft_tokens** (TRT-LLM) + - Controls how many tokens the draft model generates per iteration + - Higher values: Better speedup potential, more memory usage + - Typical range: 3-10 + - Start with 5 and adjust based on results + +2. 
**gpu_memory_utilization** + - Both models must fit in GPU memory simultaneously + - Reduce if encountering OOM errors + - Typical range: 0.8-0.9 + +3. **Acceptance Rate** + - Higher is better (target: >70%) + - Low acceptance (<50%) indicates draft model is too different + - Monitor via metrics endpoint + +### Memory Optimization + +``` +Total GPU Memory = Target Model + Draft Model + KV Cache + Activations +``` + +If memory is constrained: +- Use a smaller draft model +- Reduce `max_model_len` or `max_batch_size` +- Enable tensor parallelism for target model only +- Use quantization (INT8/FP8) where supported + +## Monitoring and Metrics + +Triton exposes metrics for speculative decoding performance: + +```bash +# Access metrics endpoint +curl http://localhost:8002/metrics +``` + +### Key Metrics to Monitor + +- **`speculative_acceptance_rate`**: Percentage of draft tokens accepted (target: >70%) +- **`inter_token_latency`**: Time between tokens (should decrease significantly) +- **`time_to_first_token`**: Should remain similar to non-speculative mode +- **`throughput_tokens_per_second`**: Overall tokens/sec (should increase) + +## When NOT to Use Speculative Decoding + +Speculative decoding may not be beneficial in these scenarios: + +1. **Very short outputs**: Overhead may exceed benefits for <10 token generations +2. **Extremely high batch sizes**: May be memory-constrained with both models +3. **No suitable draft model**: If no compatible smaller model exists +4. **Maximum batch utilization**: When already fully utilizing GPU with standard decoding + +## Troubleshooting + +### Low or No Speedup + +**Possible causes:** +- Draft model too different from target model +- `num_speculative_tokens` too low (try increasing to 7-10) +- Acceptance rate too low (<50%) +- Batch size too small to hide overhead + +**Solutions:** +- Use a draft model from the same family +- Increase speculative token count +- Monitor acceptance rate metrics +- Adjust batch size + +### Out of Memory Errors + +**Solutions:** +- Reduce `gpu_memory_utilization` to 0.8 or lower +- Use a smaller draft model +- Decrease `max_model_len` +- Enable tensor parallelism for target model +- Reduce `num_speculative_tokens` + +### Different Outputs Than Expected + +**Note:** Speculative decoding should produce **identical outputs** to standard decoding. 
If outputs differ: +- Check that both models loaded successfully +- Verify model versions match expected +- Review logs for errors or warnings +- This may indicate a bug - please report it + +## Additional Resources + +### Research Papers + +- [Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192) (Leviathan et al., 2023) +- [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318) (Chen et al., 2023) +- [SpecInfer: Accelerating LLM Serving with Tree-based Speculative Inference](https://arxiv.org/abs/2305.09781) (Miao et al., 2023) + +### Documentation + +- [vLLM Backend Repository](https://github.com/triton-inference-server/vllm_backend) +- [TensorRT-LLM Backend Repository](https://github.com/triton-inference-server/tensorrtllm_backend) +- [Triton Server Documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/) + +### Community + +- [NVIDIA Developer Forums](https://forums.developer.nvidia.com/) +- [Triton GitHub Discussions](https://github.com/triton-inference-server/server/discussions) +- [Report Issues](https://github.com/triton-inference-server/server/issues) + +## Next Steps + +1. **Choose your backend**: [vLLM](vLLM/README.md) or [TensorRT-LLM](TRT-LLM/README.md) +2. **Follow the setup guide** for your chosen backend +3. **Experiment with different draft models** and parameters +4. **Monitor metrics** to optimize performance +5. **Deploy to production** once you've validated the configuration + +--- + +**Questions or feedback?** Please open an issue on the [Triton GitHub repository](https://github.com/triton-inference-server/server/issues). diff --git a/docs/tutorials/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md b/docs/tutorials/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md new file mode 100644 index 0000000000..51b2435549 --- /dev/null +++ b/docs/tutorials/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md @@ -0,0 +1,62 @@ + + +# Speculative Decoding with TensorRT-LLM Backend + +## Overview + +TensorRT-LLM backend provides highly optimized speculative decoding support for NVIDIA GPUs. This guide covers configuration and deployment of speculative decoding using the TensorRT-LLM backend. + +## Documentation + +For comprehensive documentation on speculative decoding with TensorRT-LLM, please refer to the official TensorRT-LLM backend documentation: + +[TensorRT-LLM Backend Decoding Documentation](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#decoding) + +## Key Features + +- Maximum performance on NVIDIA GPUs with optimized kernels +- Support for INT8/FP8 quantization +- Advanced scheduling and batching +- Medusa and standard speculative decoding modes + +## Quick Reference + +For speculative decoding setup with TensorRT-LLM: + +1. Build TensorRT-LLM engines for both target and draft models +2. Configure the model repository with appropriate parameters +3. Deploy using Triton with TensorRT-LLM backend + +See the [TensorRT-LLM backend documentation](https://github.com/triton-inference-server/tensorrtllm_backend) for detailed instructions. 
+ +## See Also + +- [Speculative Decoding Overview](../README.md) +- [vLLM Speculative Decoding](../vLLM/README.md) (alternative backend) +- [TensorRT-LLM Backend](https://github.com/triton-inference-server/tensorrtllm_backend) diff --git a/docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md b/docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md new file mode 100644 index 0000000000..046cc4fd20 --- /dev/null +++ b/docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md @@ -0,0 +1,349 @@ + + +# Speculative Decoding with vLLM Backend + +## Overview + +Speculative decoding is an inference optimization technique that accelerates text generation by using a smaller, faster "draft" model to propose tokens, which are then verified by the larger "target" model. This approach can significantly reduce latency while maintaining the same output quality as standard decoding. + +The vLLM backend in Triton Inference Server supports speculative decoding, allowing you to leverage this optimization technique for your LLM deployments. + +## How It Works + +1. A small draft model quickly generates multiple candidate tokens +2. The larger target model verifies these candidates in parallel +3. Accepted tokens are returned; rejected tokens are corrected by the target model +4. This process continues until the full response is generated + +The speedup comes from the fact that the draft model is much faster than the target model, and verification can be done in parallel for multiple tokens. + +## Prerequisites + +- Triton Inference Server with vLLM backend support +- A target model (the main model you want to serve) +- A draft model (a smaller, compatible model from the same family) +- Docker with NVIDIA Container Runtime +- Access to HuggingFace models (HF_TOKEN if using gated models) + +## Configuration + +Speculative decoding with vLLM is configured through the `model.json` file in your model repository. The vLLM backend passes the configuration parameters directly to vLLM's `AsyncEngineArgs`. 
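As a rough mental model (a simplification, not the backend's actual source code), the backend reads `model.json` and forwards its fields as keyword arguments when it constructs the engine, along the lines of the sketch below. The file path is illustrative; the important point is that every key in `model.json` must be a valid engine argument for the vLLM release bundled in your Triton container.

```python
# Illustrative sketch of how model.json fields reach vLLM; this is NOT the
# actual implementation of the Triton vLLM backend.
import json

from vllm import AsyncEngineArgs, AsyncLLMEngine

# Hypothetical path; in Triton the file lives under <model_repository>/<model>/1/.
with open("model_repository/llama-3.1-70b-speculative/1/model.json") as f:
    vllm_config = json.load(f)

# Keys such as "speculative_model" and "num_speculative_tokens" are accepted
# only if the bundled vLLM release exposes them as engine arguments.
engine_args = AsyncEngineArgs(**vllm_config)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

In practice this means the set of accepted keys and their defaults track the vLLM release shipped with your container, so consult vLLM's engine-argument reference for that version if a field is rejected at startup.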
### Key Parameters

The following parameters in `model.json` control speculative decoding:

- **`speculative_model`**: The name or path of the draft model to use
- **`num_speculative_tokens`**: Number of tokens to generate speculatively per iteration (set explicitly; 5 is a common starting point)
- **`speculative_draft_tensor_parallel_size`**: Tensor parallelism size for the draft model (optional)
- **`ngram_prompt_lookup_max`**: Maximum n-gram size for n-gram based speculation (an alternative to a draft model)
- **`ngram_prompt_lookup_min`**: Minimum n-gram size for prompt lookup

### Example 1: Basic Speculative Decoding Configuration

Here's a simple example using Llama models:

```json
{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
    "gpu_memory_utilization": 0.9
}
```

### Example 2: Speculative Decoding with Tensor Parallelism

For larger deployments with multi-GPU setups:

```json
{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "tensor_parallel_size": 4,
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
    "speculative_draft_tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.85
}
```

### Example 3: N-gram Prompt Lookup (Alternative Approach)

Instead of using a separate draft model, you can use n-gram based speculation:

```json
{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "ngram_prompt_lookup_max": 4,
    "ngram_prompt_lookup_min": 1,
    "gpu_memory_utilization": 0.9
}
```

## Model Repository Structure

Your model repository should follow this structure:

```
model_repository/
└── llama-3.1-70b-speculative/
    ├── config.pbtxt
    └── 1/
        └── model.json
```

### config.pbtxt

```
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# (License text omitted for brevity)

name: "llama-3.1-70b-speculative"
backend: "vllm"
max_batch_size: 0
model_transaction_policy {
  decoupled: True
}

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
```

## Running the Example

### Step 1: Launch Container

```bash
docker run -it --net=host --gpus all --rm \
    -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
    -v ${PWD}/model_repository:/model_repository \
    -e HF_TOKEN \
    nvcr.io/nvidia/tritonserver:26.04-vllm-python-py3
```

### Step 2: Start Triton Server

For the native vLLM backend:

```bash
tritonserver --model-repository=/model_repository
```

Or using the OpenAI-compatible frontend:

```bash
cd /opt/tritonserver/python/openai
python3 openai_frontend/main.py \
    --model-repository /model_repository \
    --tokenizer meta-llama/Meta-Llama-3.1-70B-Instruct
```

### Step 3: Send Inference Requests

Using the OpenAI API:

```bash
MODEL="llama-3.1-70b-speculative"
curl -s http://localhost:9000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "'${MODEL}'",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in simple terms."}
        ],
        "max_tokens": 256
    }' | jq
```

Or using the native Triton API:

```bash
curl -X POST http://localhost:8000/v2/models/llama-3.1-70b-speculative/generate \
    -H 'Content-Type: application/json' \
    -d '{
        "text_input": "Explain quantum computing in simple terms.",
        "parameters": {
            "max_tokens": 256,
            "temperature": 0.7
        }
    }'
```

## Choosing the Right Draft Model

For best results:

1. **Same Model Family**: Use a draft model from the same family as your target model (e.g., Llama 3.2 1B for Llama 3.1 70B)
2. **Size Ratio**: Aim for a draft model that is 10-50x smaller than the target model
3. **Architecture Compatibility**: Ensure the draft model has a compatible architecture (same tokenizer, similar attention mechanisms)

### Popular Model Combinations

| Target Model | Draft Model | Expected Speedup |
|--------------|-------------|------------------|
| Llama-3.1-70B | Llama-3.2-1B | 1.5-2.5x |
| Llama-3.1-8B | Llama-3.2-1B | 1.3-2.0x |
| Mixtral-8x7B | Mistral-7B-v0.1 | 1.4-2.2x |

> **Note**: Actual speedup depends on hardware, batch size, sequence length, and the similarity between draft and target model outputs.

## Performance Tuning

### Adjusting num_speculative_tokens

- **Higher values (5-10)**: Better speedup potential but higher memory usage
- **Lower values (2-4)**: More conservative, lower memory overhead
- **Start with 5** and adjust based on your specific use case

### Memory Considerations

Speculative decoding requires loading both models into GPU memory. Adjust `gpu_memory_utilization` accordingly:

```json
{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
    "gpu_memory_utilization": 0.85
}
```

If you encounter OOM errors, try:
1. Reducing `gpu_memory_utilization` to 0.8 or lower
2. Decreasing `num_speculative_tokens`
3. Using a smaller draft model
4. Enabling tensor parallelism for the target model
## Monitoring and Debugging

### Check Model Loading

When Triton starts, you should see log messages indicating both models are loaded:

```
I0511 00:00:00.000000 1 llm_engine.py:123] Initializing an LLM engine with config: ...
I0511 00:00:00.000000 1 llm_engine.py:456] Using speculative decoding with draft model: meta-llama/Llama-3.2-1B-Instruct
```

### Metrics

Monitor these metrics to evaluate speculative decoding performance:

- **Acceptance Rate**: Percentage of draft tokens accepted (higher is better)
- **Time to First Token (TTFT)**: Should be similar to non-speculative mode
- **Inter-Token Latency**: Should be significantly lower
- **Throughput**: Overall tokens/second should increase

Access metrics at `http://localhost:8002/metrics` (or `:9000/metrics` for the OpenAI frontend).

## Troubleshooting

### Common Issues

**Issue**: Model fails to load with OOM error
```
Solution: Reduce gpu_memory_utilization or use a smaller draft model
```

**Issue**: No speedup observed
```
Solution:
- Ensure the draft model is from the same family as the target model
- Check that num_speculative_tokens > 0
- Verify both models loaded successfully in logs
- Try increasing num_speculative_tokens
```

**Issue**: Different outputs compared to non-speculative mode
```
Solution: This should not happen - speculative decoding guarantees identical outputs.
Check vLLM backend logs for errors. This may indicate a configuration issue.
```

**Issue**: Draft model not found or fails to load
```
Solution:
- Verify the speculative_model path/name is correct
- Ensure HF_TOKEN is set if using gated models
- Check that the draft model is cached or can be downloaded
```

## Limitations

1. **Memory Overhead**: Requires loading both target and draft models
2. **Model Compatibility**: Draft model must be compatible with the target model
3. **Batch Size**: Effectiveness may vary with different batch sizes
4. **Sequence Length**: Longer sequences may see different speedup characteristics

## Additional Resources

- [Speculative Decoding Overview](../README.md) - High-level guide and backend comparison
- [vLLM Backend Documentation](https://github.com/triton-inference-server/vllm_backend)
- [vLLM Speculative Decoding](https://docs.vllm.ai/)
- [Speculative Decoding Paper](https://arxiv.org/abs/2211.17192)
- [TRT-LLM Speculative Decoding](../TRT-LLM/README.md) (alternative backend)

## References

- Chen, C., et al. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling"
- Leviathan, Y., et al. (2023). "Fast Inference from Transformers via Speculative Decoding"

## Feedback and Support

For issues or questions:
- [Triton GitHub Issues](https://github.com/triton-inference-server/server/issues)
- [vLLM Backend Issues](https://github.com/triton-inference-server/vllm_backend/issues)
- [NVIDIA Developer Forums](https://forums.developer.nvidia.com/)
diff --git a/qa/L0_vllm_speculative_decoding/README.md b/qa/L0_vllm_speculative_decoding/README.md
new file mode 100644
index 0000000000..2c2a6b80db
--- /dev/null
+++ b/qa/L0_vllm_speculative_decoding/README.md
@@ -0,0 +1,88 @@

# vLLM Speculative Decoding Test

This directory contains test configurations for the vLLM speculative decoding feature.
## Model Configuration

The example model `llama-speculative` demonstrates how to configure speculative decoding:

```json
{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
    "gpu_memory_utilization": 0.9,
    "max_model_len": 2048
}
```

## Running the Test

```bash
# Set your HuggingFace token if using gated models
export HF_TOKEN="your_token_here"

# Launch container
docker run -it --net=host --gpus all --rm \
    -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
    -v $(pwd)/model_repository:/model_repository \
    -e HF_TOKEN \
    nvcr.io/nvidia/tritonserver:26.04-vllm-python-py3

# Inside container, start Triton
tritonserver --model-repository=/model_repository
```

## Sending Test Requests

```bash
# Using Triton's native API
curl -X POST http://localhost:8000/v2/models/llama-speculative/generate \
    -H 'Content-Type: application/json' \
    -d '{
        "text_input": "What is speculative decoding?",
        "parameters": {
            "max_tokens": 100,
            "temperature": 0.7
        }
    }'
```

## Expected Behavior

- Both target and draft models should load successfully
- Inference should complete with reduced latency compared to non-speculative mode
- Output quality should be identical to standard decoding
- Server logs should show speculative decoding is enabled

## See Also

- [vLLM Speculative Decoding Tutorial](../../docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.md)
diff --git a/qa/L0_vllm_speculative_decoding/model_repository/llama-speculative/1/model.json b/qa/L0_vllm_speculative_decoding/model_repository/llama-speculative/1/model.json
new file mode 100644
index 0000000000..a3084504f0
--- /dev/null
+++ b/qa/L0_vllm_speculative_decoding/model_repository/llama-speculative/1/model.json
@@ -0,0 +1,7 @@
{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "speculative_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 5,
    "gpu_memory_utilization": 0.9,
    "max_model_len": 2048
}
diff --git a/qa/L0_vllm_speculative_decoding/model_repository/llama-speculative/config.pbtxt b/qa/L0_vllm_speculative_decoding/model_repository/llama-speculative/config.pbtxt
new file mode 100644
index 0000000000..708c1337b4
--- /dev/null
+++ b/qa/L0_vllm_speculative_decoding/model_repository/llama-speculative/config.pbtxt
@@ -0,0 +1,74 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "llama-speculative" +backend: "vllm" +max_batch_size: 0 + +model_transaction_policy { + decoupled: True +} + +input [ + { + name: "text_input" + data_type: TYPE_STRING + dims: [ -1 ] + }, + { + name: "stream" + data_type: TYPE_BOOL + dims: [ 1 ] + optional: true + }, + { + name: "sampling_parameters" + data_type: TYPE_STRING + dims: [ -1 ] + optional: true + }, + { + name: "exclude_input_in_output" + data_type: TYPE_BOOL + dims: [ 1 ] + optional: true + } +] + +output [ + { + name: "text_output" + data_type: TYPE_STRING + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_MODEL + } +]