This document describes the configuration fields available for creating LLM inference server templates. Configurations are written in JSON format and support three inference engines: SGLang, vLLM, and Custom.
All configurations have a common top-level structure:
{
"engine": "sglang|vllm|custom",
"model": "model-identifier",
"trust_remote_code": true,
"quantization": "fp8",
"gpu_types": ["h100", "h200"],
"load_format": "auto",
"model_loader_extra_config": {
"enable_multithread_load": true,
"num_threads": 8
},
"seed": 42,
"tokenizer": "path/to/tokenizer",
"image_tag": "v0.3.5",
"sglang": { /* SGLang-specific options */ },
"vllm": { /* vLLM-specific options */ },
"custom": { /* Custom engine options */ }
}

These fields are available at the top level of all configurations.
Specifies which inference engine to use.
- Valid values:
"sglang", "vllm", "custom"
- Example:
"engine": "sglang"
The model identifier or path to load.
- Format: HuggingFace path (e.g., "meta-llama/Llama-3.1-70B-Instruct") or local path
- Example:
"model": "deepseek-ai/DeepSeek-V3.2"
Human-readable display title for the template. Used as the selection label when rendering a list of templates for users to choose from.
- Use case: Short, scannable title shown in template pickers and UI
- Example:
"name": "DeepSeek R1 · SGLang · FP8"
Detailed explanation of the configuration's purpose, use case, and key features.
- Use case: Document the configuration for reference and understanding
- Example:
"explanation": "Llama 3.1 70B with basic configuration using 4-way tensor parallelism. This is the baseline setup with 32k context length."
Brief one-line description of the configuration.
- Use case: Quick reference for configuration summary
- Example:
"short_explanation": "Llama 3.1 70B basic configuration"
Whether to trust and execute remote code from model repositories. Used by SGLang and vLLM.
- Default:
false
- Warning: Only enable for trusted models
- Example:
"trust_remote_code": true
Quantization method to reduce model size and memory usage. Used by SGLang and vLLM.
- Valid values:
"awq", "fp8", "fp4", "int8", "int4", etc.
- Example:
"quantization": "fp8"
List of compatible GPU types that can run this model configuration.
- Valid values:
"h100", "h200", "b200", "a100", "l40s", "v100", "rtx6000-ada", "rtx6000-pro"
- Purpose: Documents hardware requirements for deployment planning
- Example:
"gpu_types": ["h100", "h200", "b200"]
Format for loading model weights. Used by SGLang and vLLM.
- Valid values:
  - "auto" - Auto-detect format (default)
  - "pt" - PyTorch checkpoint
  - "safetensors" - SafeTensors format
  - "runai_streamer" - Run:ai model streaming
  - "runai_streamer_sharded" - Sharded Run:ai streaming
  - "tensorizer" - Tensorized format
  - "gguf" - GGUF format
  - "bitsandbytes" - BitsAndBytes quantized format
  - Other values: "npcache", "dummy", "sharded_state", "mistral"
- Example:
"load_format": "runai_streamer"
Additional configuration for the model loader. Used by SGLang and vLLM.
- Common options:
  - enable_multithread_load: Enable multi-threaded model loading
  - num_threads: Number of threads for loading (SGLang)
  - concurrency: Number of concurrent download streams (vLLM with Run:ai)
  - memory_limit: Memory limit in bytes (vLLM with Run:ai)
- Example for multi-threading:
"model_loader_extra_config": { "enable_multithread_load": true, "num_threads": 64 }
- Example for Run:ai streaming:
"model_loader_extra_config": { "concurrency": 16, "memory_limit": 5368709120 }
Random seed for reproducibility. Used by SGLang and vLLM.
- Use case: Ensure deterministic generation across runs
- Example:
"seed": 42
Override the default tokenizer. Used by SGLang and vLLM.
- Format: HuggingFace path or local path
- Use case: Use a different tokenizer than the model's default
- SGLang flag:
--tokenizer-path
- vLLM flag:
--tokenizer
- Example:
"tokenizer": "meta-llama/Llama-3.1-70B-Instruct"
Indicates the reasoning capability of the model. Used as metadata for tooling and deployment decisions — not passed as a CLI flag.
- Valid values:
"thinking", "hybrid", "instruct"
  - "thinking": model always produces reasoning/chain-of-thought tokens (e.g. DeepSeek-R1, Qwen3 thinking variant)
  - "hybrid": model can operate in both thinking and non-thinking modes depending on the prompt (e.g. DeepSeek-V3, Qwen3 base)
  - "instruct": standard instruction-tuned model with no reasoning tokens (e.g. Llama, Mistral)
- Example:
"model_mode": "thinking"
Image tag to use for the container image, overriding the default tag.
- Use case: Specify a particular version or variant of the inference engine image
- Example:
"image_tag": "v0.3.5"
- Note: For custom engines with the image field, either the image must include a tag (e.g., "image:tag") or this field must be provided
SGLang-specific configuration options are nested under the "sglang" key.
Tensor parallelism: splits the model across multiple GPUs horizontally.
- Use when: Model is too large for a single GPU
- Typical values: 1, 2, 4, 8
- Example:
"tp": 8 (split across 8 GPUs)
Data parallelism: creates multiple replicas to process different requests in parallel.
- Use when: Need higher throughput
- Requires: enable_dp_attention set to true
- Example:
"dp": 4
Expert parallelism: distributes experts in MoE (Mixture of Experts) models across GPUs.
- Use when: Running MoE models like DeepSeek, Qwen3, GLM
- Example:
"ep": 8
Fraction of GPU memory to allocate statically for model weights and KV cache.
- Range: 0.0 to 1.0
- Default: Engine default (typically ~0.85)
- Use case: Increase for throughput, decrease if running out of memory
- Example:
"mem_fraction_static": 0.90
Data type for the key-value cache in attention layers.
- Valid values:
"fp8_e4m3","bf16","fp16" - Trade-off:
fp8_e4m3uses less memory but may reduce quality slightly - Example:
"kv_cache_dtype": "fp8_e4m3"
Maximum context window length in tokens.
- Use when: Need to override model's default context length
- Example:
"context_length": 32768
Maximum number of requests the server will process concurrently.
- Use case: Limit concurrency to control memory usage
- Example:
"max_running_requests": 64
Enable distributed attention for data parallelism.
- Required when: Using dp > 1
- Example:
"enable_dp_attention": true
Parser for extracting reasoning/thinking tokens from model outputs.
- Valid values:
"deepseek-v3", "deepseek-r1", "qwen3", "nano_v3", "glm45", "kimi"
- Use with: Models that support chain-of-thought reasoning
- Example:
"reasoning_parser": "deepseek-v3"
Parser for extracting tool/function calls from model outputs.
- Valid values:
"deepseekv32", "qwen3_coder", "glm45", "kimi"
- Use with: Models that support function calling
- Example:
"tool_call_parser": "deepseekv32"
Path to a custom Jinja chat template file.
- Use when: Need to override the model's default chat format
- Example:
"chat_template": "/path/to/template.jinja"
Disable optimization for shared experts in MoE models.
- Use when: Experiencing issues with expert fusion
- Example:
"disable_shared_experts_fusion": true
Algorithm for speculative decoding to improve latency.
- Valid values:
"EAGLE"
- Example:
"speculative_algorithm": "EAGLE"
Number of speculative decoding steps per forward pass.
- Typical value: 3
- Example:
"speculative_num_steps": 3
Top-k parameter for EAGLE speculative decoding.
- Example:
"speculative_eagle_topk": 4
Number of draft tokens to generate in speculative decoding.
- Example:
"speculative_num_draft_tokens": 4
Path or identifier for the draft model used in speculative decoding.
- Use case: Specify a smaller, faster model to generate draft tokens
- Example:
"speculative_draft_model_path": "lmsys/sglang-EAGLE3-Llama-4-Scout-17B-16E-Instruct-v1"
Pipeline parallelism: splits model layers vertically across GPUs.
- Use when: Need additional parallelism beyond tensor parallelism
- SGLang flag:
--pp
- Example:
"pipeline_parallel_size": 2
Scheduling strategy for request processing.
- Valid values:
"fcfs" (first come first served), "lpm" (longest prefix match), "dfs-weight"
- Default:
"fcfs"
- Example:
"schedule_policy": "lpm"
Disable radix cache (prefix caching) optimization.
- Use when: Troubleshooting caching issues
- Default:
false (caching enabled)
- Example:
"disable_radix_cache": true
Disable CUDA graph optimization.
- Use when: Debugging or when CUDA graphs cause issues
- Default:
false (CUDA graphs enabled)
- Example:
"disable_cuda_graph": true
vLLM-specific configuration options are nested under the "vllm" key.
Tensor parallelism: splits the model across multiple GPUs horizontally.
- Use when: Model is too large for a single GPU
- Typical values: 1, 2, 4, 8
- Example:
"tensor_parallel_size": 4
Pipeline parallelism: splits the model across multiple GPUs vertically (by layers).
- Use when: Need additional parallelism beyond tensor parallelism
- Less common: Most deployments use only tensor parallelism
- Example:
"pipeline_parallel_size": 2
Data type for model weights and computation.
- Valid values:
"auto", "bfloat16", "float16", "float32"
- Default:
"auto" (infers from model)
- Trade-offs: bfloat16 has better numerical stability than float16
- Example:
"dtype": "bfloat16"
Maximum context length in tokens.
- Use when: Need to override model's default max length or limit memory usage
- Example:
"max_model_len": 32768
Fraction of GPU memory to allocate for model execution.
- Range: 0.0 to 1.0
- Default: 0.90
- Trade-off: Higher values allow more concurrent requests but risk OOM
- Example:
"gpu_memory_utilization": 0.95
Maximum number of sequences (requests) to process in parallel.
- Use case: Control concurrency and memory usage
- Example:
"max_num_seqs": 256
Cache and reuse computation for common prompt prefixes.
- Use when: Many requests share the same system prompt or prefix
- Benefit: Reduces latency and improves throughput
- Example:
"enable_prefix_caching": true
Process long prompts in chunks to reduce latency for first token.
- Use when: Serving requests with very long prompts
- Benefit: Better time-to-first-token (TTFT)
- Example:
"enable_chunked_prefill": true
Force eager execution mode instead of using CUDA graphs.
- Use when: Debugging or when CUDA graphs cause issues
- Trade-off: Slower but more flexible
- Example:
"enforce_eager": true
Override the model name returned by the API.
- Use case: Expose a different model name to clients
- Example:
"served_model_name": "my-custom-model"
Number of data parallel replicas.
- Use when: Need to scale throughput with multiple model replicas
- Example:
"data_parallel_size": 2
Backend for distributed execution.
- Valid values:
"ray", "mp" (multiprocessing), "uni", "external_launcher"
- Use case: Choose distributed computing framework
- Example:
"distributed_executor_backend": "ray"
CPU swap space per GPU in GiB.
- Use when: Need additional memory beyond GPU VRAM
- Default: 4
- Example:
"swap_space": 8
Maximum number of tokens processed per iteration.
- Use case: Control batch size and memory usage
- Example:
"max_num_batched_tokens": 8192
Request scheduling policy.
- Valid values:
"fcfs" (first come first served), "priority"
- Default:
"fcfs"
- Example:
"scheduling_policy": "priority"
Parser for extracting reasoning/thinking tokens from model outputs.
- Valid values:
"deepseek_r1", "granite", and other vLLM-supported parsers
- Use with: Models that support chain-of-thought reasoning
- vLLM flag:
--reasoning-parser
- Example:
"reasoning_parser": "deepseek_r1"
Custom engine configuration is for running inference with non-standard or proprietary engines. Options are nested under the "custom" key.
The base command to execute your custom inference engine.
- Required: Must be specified when using "engine": "custom"
- Example:
"base_command": "python -m my_inference_engine.serve"
The command-line flag used to specify the model path.
- Default:
"--model"
- Use when: Your engine uses a different flag for the model argument
- Example:
"model_flag": "--model-path"
The full container image to use for the custom inference engine.
- Format: Image name with optional registry and tag (e.g., "myregistry.io/my-engine:v1.0")
- Use when: Specifying the complete container image for your custom engine
- Validation: Either the image must include a tag (e.g., "image:tag") or the top-level image_tag field must be provided
- Example:
"image": "myregistry.io/custom-inference:latest"
List of additional command-line arguments to pass to your engine.
- Format: Each argument as a string (can include both flag and value in one string)
- Example:
"args": [ "--workers 4", "--batch-size 32", "--max-tokens 2048" ]
Key-value pairs for command-line arguments.
- Format: Key is the flag, value is the argument value
- Use when: Prefer structured key-value format over string arrays
- Example:
"kv_args": { "--timeout": 60, "--max-tokens": 2048, "--temperature": 0.7 }
{
"engine": "sglang",
"model": "meta-llama/Llama-3.1-8B-Instruct",
"sglang": {
"tp": 1
}
}

{
"engine": "sglang",
"model": "deepseek-ai/DeepSeek-V3.2",
"gpu_types": ["h200", "b200"],
"model_loader_extra_config": {
"enable_multithread_load": true,
"num_threads": 64
},
"sglang": {
"tp": 8,
"dp": 4,
"ep": 2,
"mem_fraction_static": 0.90,
"kv_cache_dtype": "fp8_e4m3",
"enable_dp_attention": true,
"reasoning_parser": "deepseek-v3",
"tool_call_parser": "deepseekv32"
}
}

{
"engine": "vllm",
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"vllm": {
"tensor_parallel_size": 1
}
}

{
"engine": "vllm",
"model": "meta-llama/Llama-3.1-70B-Instruct",
"quantization": "awq",
"gpu_types": ["h100", "h200", "b200"],
"load_format": "runai_streamer",
"model_loader_extra_config": {
"concurrency": 8
},
"vllm": {
"tensor_parallel_size": 4,
"dtype": "bfloat16",
"max_model_len": 32768,
"gpu_memory_utilization": 0.95,
"max_num_seqs": 512,
"enable_prefix_caching": true,
"enable_chunked_prefill": true
}
}

{
"engine": "sglang",
"model": "deepseek-ai/DeepSeek-V3",
"trust_remote_code": true,
"gpu_types": ["h200", "b200"],
"model_loader_extra_config": {
"enable_multithread_load": true,
"num_threads": 64
},
"sglang": {
"tp": 8
}
}

{
"engine": "vllm",
"model": "meta-llama/Llama-3.1-70B-Instruct",
"gpu_types": ["h100", "h200"],
"load_format": "runai_streamer",
"model_loader_extra_config": {
"concurrency": 16,
"memory_limit": 5368709120
},
"vllm": {
"tensor_parallel_size": 4
}
}

{
"engine": "vllm",
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"gpu_types": ["h100", "h200", "b200", "a100", "l40s", "v100"],
"model_loader_extra_config": {
"enable_multithread_load": true
},
"vllm": {
"tensor_parallel_size": 1
}
}

{
"engine": "custom",
"model": "my-org/my-model",
"gpu_types": ["h100", "h200", "b200"],
"image_tag": "v1.0",
"custom": {
"base_command": "python -m my_engine.serve",
"model_flag": "--model-path",
"image": "myregistry.io/custom-inference",
"args": [
"--workers 4",
"--batch-size 32"
],
"kv_args": {
"--timeout": 60,
"--max-tokens": 2048
}
}
}

{
"engine": "sglang",
"model": "Kimi-K2-Instruct",
"gpu_types": ["h200", "b200"],
"model_loader_extra_config": {
"enable_multithread_load": true,
"num_threads": 32
},
"sglang": {
"tp": 8,
"dp": 4,
"enable_dp_attention": true,
"ep": 4,
"reasoning_parser": "kimi",
"tool_call_parser": "kimi"
}
}

Use cmd-gen.go to generate inference server startup commands from your configuration:

go run cmd-gen.go examples/deepseek-sglang.json

This will output the complete command-line invocation for starting the inference server with all configured options.
- Start with examples: Use the provided example configurations as templates
- Match GPU types: Ensure your gpu_types field matches your available hardware
- Test incrementally: Start with minimal configs, then add optimizations
- Memory management: Adjust mem_fraction_static or gpu_memory_utilization based on your memory constraints
- Parallelism: Use tp (tensor parallelism) for model size, dp for throughput
- Model-specific parsers: Enable reasoning_parser and tool_call_parser only for models that support them