Configuration Specification

This document describes the configuration fields available for creating LLM inference server templates. Configurations are written in JSON format and support three inference engines: SGLang, vLLM, and Custom.

Configuration Structure

All configurations have a common top-level structure:

{
  "engine": "sglang|vllm|custom",
  "model": "model-identifier",
  "trust_remote_code": true,
  "quantization": "fp8",
  "gpu_types": ["h100", "h200"],
  "load_format": "auto",
  "model_loader_extra_config": {
    "enable_multithread_load": true,
    "num_threads": 8
  },
  "seed": 42,
  "tokenizer": "path/to/tokenizer",
  "image_tag": "v0.3.5",

  "sglang": { /* SGLang-specific options */ },
  "vllm": { /* vLLM-specific options */ },
  "custom": { /* Custom engine options */ }
}

Field Reference

Common Fields (All Engines)

These fields are available at the top level of all configurations.

`engine` (string, required)

Specifies which inference engine to use.

Valid values: "sglang", "vllm", "custom"
Example: "engine": "sglang"

`model` (string, required)

The model identifier or path to load.

Format: HuggingFace path (e.g., "meta-llama/Llama-3.1-70B-Instruct") or local path
Example: "model": "deepseek-ai/DeepSeek-V3.2"

`name` (string, optional)

Human-readable display title for the template. Used as the selection label when rendering a list of templates for users to choose from.

Use case: Short, scannable title shown in template pickers and UI
Example: "name": "DeepSeek R1 · SGLang · FP8"

`explanation` (string, optional)

Detailed explanation of the configuration's purpose, use case, and key features.

Use case: Document the configuration for reference and understanding
Example: "explanation": "Llama 3.1 70B with basic configuration using 4-way tensor parallelism. This is the baseline setup with 32k context length."

`short_explanation` (string, optional)

Brief one-line description of the configuration.

Use case: Quick reference for configuration summary
Example: "short_explanation": "Llama 3.1 70B basic configuration"

`trust_remote_code` (boolean, optional)

Whether to trust and execute remote code from model repositories. Used by SGLang and vLLM.

Default: false
Warning: Only enable for trusted models
Example: "trust_remote_code": true

`quantization` (string, optional)

Quantization method to reduce model size and memory usage. Used by SGLang and vLLM.

Valid values: "awq", "fp8", "fp4", "int8", "int4", etc.
Example: "quantization": "fp8"

`gpu_types` (array of strings, optional)

List of compatible GPU types that can run this model configuration.

Valid values: "h100", "h200", "b200", "a100", "l40s", "v100", "rtx6000-ada", "rtx6000-pro"
Purpose: Documents hardware requirements for deployment planning
Example: "gpu_types": ["h100", "h200", "b200"]

`load_format` (string, optional)

Format for loading model weights. Used by SGLang and vLLM.

Valid values:
- "auto" - Auto-detect format (default)
- "pt" - PyTorch checkpoint
- "safetensors" - SafeTensors format
- "runai_streamer" - Run:ai model streaming
- "runai_streamer_sharded" - Sharded Run:ai streaming
- "tensorizer" - Tensorized format
- "gguf" - GGUF format
- "bitsandbytes" - BitsAndBytes quantized format
- Other values: "npcache", "dummy", "sharded_state", "mistral"
Example: "load_format": "runai_streamer"

`model_loader_extra_config` (object, optional)

Additional configuration for the model loader. Used by SGLang and vLLM.

Common options:
- enable_multithread_load: Enable multi-threaded model loading
- num_threads: Number of threads for loading (SGLang)
- concurrency: Number of concurrent download streams (vLLM with Run:ai)
- memory_limit: Memory limit in bytes (vLLM with Run:ai)

Example for multi-threading:

"model_loader_extra_config": {
  "enable_multithread_load": true,
  "num_threads": 64
}

Example for Run:ai streaming:

"model_loader_extra_config": {
  "concurrency": 16,
  "memory_limit": 5368709120
}

`seed` (integer, optional)

Random seed for reproducibility. Used by SGLang and vLLM.

Use case: Ensure deterministic generation across runs
Example: "seed": 42

`tokenizer` (string, optional)

Override the default tokenizer. Used by SGLang and vLLM.

Format: HuggingFace path or local path
Use case: Use a different tokenizer than the model's default
SGLang flag: --tokenizer-path
vLLM flag: --tokenizer
Example: "tokenizer": "meta-llama/Llama-3.1-70B-Instruct"

`model_mode` (string, optional)

Indicates the reasoning capability of the model. Used as metadata for tooling and deployment decisions — not passed as a CLI flag.

Valid values: "thinking", "hybrid", "instruct"
- "thinking" — model always produces reasoning/chain-of-thought tokens (e.g. DeepSeek-R1, Qwen3 thinking variant)
- "hybrid" — model can operate in both thinking and non-thinking modes depending on the prompt (e.g. DeepSeek-V3, Qwen3 base)
- "instruct" — standard instruction-tuned model with no reasoning tokens (e.g. Llama, Mistral)
Example: "model_mode": "thinking"

`image_tag` (string, optional)

Image tag to use for the container image, overriding the default tag.

Use case: Specify a particular version or variant of the inference engine image
Example: "image_tag": "v0.3.5"
Note: For custom engines with the image field, either the image must include a tag (e.g., "image:tag") or this field must be provided

SGLang Fields

SGLang-specific configuration options are nested under the "sglang" key.

Parallelism Options

`tp` (integer, optional)

Tensor parallelism: splits the model across multiple GPUs horizontally.

Use when: Model is too large for a single GPU
Typical values: 1, 2, 4, 8
Example: "tp": 8 (split across 8 GPUs)

`dp` (integer, optional)

Data parallelism: creates multiple replicas to process different requests in parallel.

Use when: Need higher throughput
Requires: enable_dp_attention set to true
Example: "dp": 4

`ep` (integer, optional)

Expert parallelism: distributes experts in MoE (Mixture of Experts) models across GPUs.

Use when: Running MoE models like DeepSeek, Qwen3, GLM
Example: "ep": 8

Memory Management

`mem_fraction_static` (float, optional)

Fraction of GPU memory to allocate statically for model weights and KV cache.

Range: 0.0 to 1.0
Default: Engine default (typically ~0.85)
Use case: Increase for throughput, decrease if running out of memory
Example: "mem_fraction_static": 0.90

`kv_cache_dtype` (string, optional)

Data type for the key-value cache in attention layers.

Valid values: "fp8_e4m3", "bf16", "fp16"
Trade-off: fp8_e4m3 uses less memory but may reduce quality slightly
Example: "kv_cache_dtype": "fp8_e4m3"

Model Configuration

`context_length` (integer, optional)

Maximum context window length in tokens.

Use when: Need to override model's default context length
Example: "context_length": 32768

`max_running_requests` (integer, optional)

Maximum number of requests the server will process concurrently.

Use case: Limit concurrency to control memory usage
Example: "max_running_requests": 64

Features

`enable_dp_attention` (boolean, optional)

Enable distributed attention for data parallelism.

Required when: Using dp > 1
Example: "enable_dp_attention": true

`reasoning_parser` (string, optional)

Parser for extracting reasoning/thinking tokens from model outputs.

Valid values: "deepseek-v3", "deepseek-r1", "qwen3", "nano_v3", "glm45", "kimi"
Use with: Models that support chain-of-thought reasoning
Example: "reasoning_parser": "deepseek-v3"

`tool_call_parser` (string, optional)

Parser for extracting tool/function calls from model outputs.

Valid values: "deepseekv32", "qwen3_coder", "glm45", "kimi"
Use with: Models that support function calling
Example: "tool_call_parser": "deepseekv32"

`chat_template` (string, optional)

Path to a custom Jinja chat template file.

Use when: Need to override the model's default chat format
Example: "chat_template": "/path/to/template.jinja"

`disable_shared_experts_fusion` (boolean, optional)

Disable optimization for shared experts in MoE models.

Use when: Experiencing issues with expert fusion
Example: "disable_shared_experts_fusion": true

Speculative Decoding (Multi-Token Prediction)

`speculative_algorithm` (string, optional)

Algorithm for speculative decoding to improve latency.

Valid values: "EAGLE"
Example: "speculative_algorithm": "EAGLE"

`speculative_num_steps` (integer, optional)

Number of speculative decoding steps per forward pass.

Typical value: 3
Example: "speculative_num_steps": 3

`speculative_eagle_topk` (integer, optional)

Top-k parameter for EAGLE speculative decoding.

Example: "speculative_eagle_topk": 4

`speculative_num_draft_tokens` (integer, optional)

Number of draft tokens to generate in speculative decoding.

Example: "speculative_num_draft_tokens": 4

`speculative_draft_model_path` (string, optional)

Path or identifier for the draft model used in speculative decoding.

Use case: Specify a smaller, faster model to generate draft tokens
Example: "speculative_draft_model_path": "lmsys/sglang-EAGLE3-Llama-4-Scout-17B-16E-Instruct-v1"

Pipeline & Scheduling

`pipeline_parallel_size` (integer, optional)

Pipeline parallelism: splits model layers vertically across GPUs.

Use when: Need additional parallelism beyond tensor parallelism
SGLang flag: --pp
Example: "pipeline_parallel_size": 2

`schedule_policy` (string, optional)

Scheduling strategy for request processing.

Valid values: "fcfs" (first come first served), "lpm" (longest prompt first), "dfs-weight"
Default: "fcfs"
Example: "schedule_policy": "lpm"

Cache & Optimization

`disable_radix_cache` (boolean, optional)

Disable radix cache (prefix caching) optimization.

Use when: Troubleshooting caching issues
Default: false (caching enabled)
Example: "disable_radix_cache": true

`disable_cuda_graph` (boolean, optional)

Disable CUDA graph optimization.

Use when: Debugging or when CUDA graphs cause issues
Default: false (CUDA graphs enabled)
Example: "disable_cuda_graph": true

vLLM Fields

vLLM-specific configuration options are nested under the "vllm" key.

Parallelism Options

`tensor_parallel_size` (integer, optional)

Tensor parallelism: splits the model across multiple GPUs horizontally.

Use when: Model is too large for a single GPU
Typical values: 1, 2, 4, 8
Example: "tensor_parallel_size": 4

`pipeline_parallel_size` (integer, optional)

Pipeline parallelism: splits the model across multiple GPUs vertically (by layers).

Use when: Need additional parallelism beyond tensor parallelism
Less common: Most deployments use only tensor parallelism
Example: "pipeline_parallel_size": 2

Model Configuration

`dtype` (string, optional)

Data type for model weights and computation.

Valid values: "auto", "bfloat16", "float16", "float32"
Default: "auto" (infers from model)
Trade-offs: bfloat16 has better numerical stability than float16
Example: "dtype": "bfloat16"

`max_model_len` (integer, optional)

Maximum context length in tokens.

Use when: Need to override model's default max length or limit memory usage
Example: "max_model_len": 32768

Memory Management

`gpu_memory_utilization` (float, optional)

Fraction of GPU memory to allocate for model execution.

Range: 0.0 to 1.0
Default: 0.90
Trade-off: Higher values allow more concurrent requests but risk OOM
Example: "gpu_memory_utilization": 0.95

`max_num_seqs` (integer, optional)

Maximum number of sequences (requests) to process in parallel.

Use case: Control concurrency and memory usage
Example: "max_num_seqs": 256

Features

`enable_prefix_caching` (boolean, optional)

Cache and reuse computation for common prompt prefixes.

Use when: Many requests share the same system prompt or prefix
Benefit: Reduces latency and improves throughput
Example: "enable_prefix_caching": true

`enable_chunked_prefill` (boolean, optional)

Process long prompts in chunks to reduce latency for first token.

Use when: Serving requests with very long prompts
Benefit: Better time-to-first-token (TTFT)
Example: "enable_chunked_prefill": true

`enforce_eager` (boolean, optional)

Force eager execution mode instead of using CUDA graphs.

Use when: Debugging or when CUDA graphs cause issues
Trade-off: Slower but more flexible
Example: "enforce_eager": true

`served_model_name` (string, optional)

Override the model name returned by the API.

Use case: Expose a different model name to clients
Example: "served_model_name": "my-custom-model"

Parallelism & Distribution

`data_parallel_size` (integer, optional)

Number of data parallel replicas.

Use when: Need to scale throughput with multiple model replicas
Example: "data_parallel_size": 2

`distributed_executor_backend` (string, optional)

Backend for distributed execution.

Valid values: "ray", "mp" (multiprocessing), "uni", "external_launcher"
Use case: Choose distributed computing framework
Example: "distributed_executor_backend": "ray"

Memory & Performance

`swap_space` (integer, optional)

CPU swap space per GPU in GiB.

Use when: Need additional memory beyond GPU VRAM
Default: 4
Example: "swap_space": 8

`max_num_batched_tokens` (integer, optional)

Maximum number of tokens processed per iteration.

Use case: Control batch size and memory usage
Example: "max_num_batched_tokens": 8192

Scheduling

`scheduling_policy` (string, optional)

Request scheduling policy.

Valid values: "fcfs" (first come first served), "priority"
Default: "fcfs"
Example: "scheduling_policy": "priority"

`reasoning_parser` (string, optional)

Parser for extracting reasoning/thinking tokens from model outputs.

Valid values: "deepseek_r1", "granite", and other vLLM-supported parsers
Use with: Models that support chain-of-thought reasoning
vLLM flag: --reasoning-parser
Example: "reasoning_parser": "deepseek_r1"

Custom Engine Fields

Custom engine configuration is for running inference with non-standard or proprietary engines. Options are nested under the "custom" key.

`base_command` (string, required)

The base command to execute your custom inference engine.

Required: Must be specified when using "engine": "custom"
Example: "base_command": "python -m my_inference_engine.serve"

`model_flag` (string, optional)

The command-line flag used to specify the model path.

Default: "--model"
Use when: Your engine uses a different flag for the model argument
Example: "model_flag": "--model-path"

`image` (string, optional)

The full container image to use for the custom inference engine.

Format: Image name with optional registry and tag (e.g., "myregistry.io/my-engine:v1.0")
Use when: Specifying the complete container image for your custom engine
Validation: Either the image must include a tag (e.g., "image:tag") or the top-level image_tag field must be provided
Example: "image": "myregistry.io/custom-inference:latest"

`args` (array of strings, optional)

List of additional command-line arguments to pass to your engine.

Format: Each argument as a string (can include both flag and value in one string)

Example:

"args": [
  "--workers 4",
  "--batch-size 32",
  "--max-tokens 2048"
]

`kv_args` (object, optional)

Key-value pairs for command-line arguments.

Format: Key is the flag, value is the argument value
Use when: Prefer structured key-value format over string arrays

Example:

"kv_args": {
  "--timeout": 60,
  "--max-tokens": 2048,
  "--temperature": 0.7
}

JSON Examples

SGLang - Minimal

{
  "engine": "sglang",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "sglang": {
    "tp": 1
  }
}

SGLang - Full Featured

{
  "engine": "sglang",
  "model": "deepseek-ai/DeepSeek-V3.2",
  "gpu_types": ["h200", "b200"],
  "model_loader_extra_config": {
    "enable_multithread_load": true,
    "num_threads": 64
  },
  "sglang": {
    "tp": 8,
    "dp": 4,
    "ep": 2,
    "mem_fraction_static": 0.90,
    "kv_cache_dtype": "fp8_e4m3",
    "enable_dp_attention": true,
    "reasoning_parser": "deepseek-v3",
    "tool_call_parser": "deepseekv32"
  }
}

vLLM - Minimal

{
  "engine": "vllm",
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "vllm": {
    "tensor_parallel_size": 1
  }
}

vLLM - Full Featured

{
  "engine": "vllm",
  "model": "meta-llama/Llama-3.1-70B-Instruct",
  "quantization": "awq",
  "gpu_types": ["h100", "h200", "b200"],
  "load_format": "runai_streamer",
  "model_loader_extra_config": {
    "concurrency": 8
  },
  "vllm": {
    "tensor_parallel_size": 4,
    "dtype": "bfloat16",
    "max_model_len": 32768,
    "gpu_memory_utilization": 0.95,
    "max_num_seqs": 512,
    "enable_prefix_caching": true,
    "enable_chunked_prefill": true
  }
}

SGLang - With Multithread Model Loading

{
  "engine": "sglang",
  "model": "deepseek-ai/DeepSeek-V3",
  "trust_remote_code": true,
  "gpu_types": ["h200", "b200"],
  "model_loader_extra_config": {
    "enable_multithread_load": true,
    "num_threads": 64
  },
  "sglang": {
    "tp": 8
  }
}

vLLM - With RunAI Model Streamer

{
  "engine": "vllm",
  "model": "meta-llama/Llama-3.1-70B-Instruct",
  "gpu_types": ["h100", "h200"],
  "load_format": "runai_streamer",
  "model_loader_extra_config": {
    "concurrency": 16,
    "memory_limit": 5368709120
  },
  "vllm": {
    "tensor_parallel_size": 4
  }
}

vLLM - With Multithread Model Loading

{
  "engine": "vllm",
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "gpu_types": ["h100", "h200", "b200", "a100", "l40s", "v100"],
  "model_loader_extra_config": {
    "enable_multithread_load": true
  },
  "vllm": {
    "tensor_parallel_size": 1
  }
}

Custom Engine

{
  "engine": "custom",
  "model": "my-org/my-model",
  "gpu_types": ["h100", "h200", "b200"],
  "image_tag": "v1.0",
  "custom": {
    "base_command": "python -m my_engine.serve",
    "model_flag": "--model-path",
    "image": "myregistry.io/custom-inference",
    "args": [
      "--workers 4",
      "--batch-size 32"
    ],
    "kv_args": {
      "--timeout": 60,
      "--max-tokens": 2048
    }
  }
}

KimiK2 - Reasoning Model

{
  "engine": "sglang",
  "model": "Kimi-K2-Instruct",
  "gpu_types": ["h200", "b200"],
  "model_loader_extra_config": {
    "enable_multithread_load": true,
    "num_threads": 32
  },
  "sglang": {
    "tp": 8,
    "dp": 4,
    "ep": 4,
    "reasoning_parser": "kimi",
    "tool_call_parser": "kimi"
  }
}

Usage

Generating Commands

Use cmd-gen.go to generate inference server startup commands from your configuration:

go run cmd-gen.go examples/deepseek-sglang.json

This will output the complete command-line invocation for starting the inference server with all configured options.

Configuration Best Practices

Start with examples: Use the provided example configurations as templates
Match GPU types: Ensure your gpu_types field matches your available hardware
Test incrementally: Start with minimal configs, then add optimizations
Memory management: Adjust mem_fraction_static or gpu_memory_utilization based on your memory constraints
Parallelism: Use tp (tensor parallelism) for model size, dp for throughput
Model-specific parsers: Enable reasoning_parser and tool_call_parser only for models that support them

FilesExpand file tree

SPEC.md

Latest commit

History

SPEC.md

File metadata and controls

Configuration Specification

Configuration Structure

Field Reference

Common Fields (All Engines)

engine (string, required)

model (string, required)

name (string, optional)

explanation (string, optional)

short_explanation (string, optional)

trust_remote_code (boolean, optional)

quantization (string, optional)

gpu_types (array of strings, optional)

load_format (string, optional)

model_loader_extra_config (object, optional)

seed (integer, optional)

tokenizer (string, optional)

model_mode (string, optional)

image_tag (string, optional)

SGLang Fields

Parallelism Options

tp (integer, optional)

dp (integer, optional)

ep (integer, optional)

Memory Management

mem_fraction_static (float, optional)

kv_cache_dtype (string, optional)

Model Configuration

context_length (integer, optional)

max_running_requests (integer, optional)

Features

enable_dp_attention (boolean, optional)

reasoning_parser (string, optional)

tool_call_parser (string, optional)

chat_template (string, optional)

disable_shared_experts_fusion (boolean, optional)

Speculative Decoding (Multi-Token Prediction)

speculative_algorithm (string, optional)

speculative_num_steps (integer, optional)

speculative_eagle_topk (integer, optional)

speculative_num_draft_tokens (integer, optional)

speculative_draft_model_path (string, optional)

Pipeline & Scheduling

pipeline_parallel_size (integer, optional)

schedule_policy (string, optional)

Cache & Optimization

disable_radix_cache (boolean, optional)

disable_cuda_graph (boolean, optional)

vLLM Fields

Parallelism Options

tensor_parallel_size (integer, optional)

pipeline_parallel_size (integer, optional)

Model Configuration

dtype (string, optional)

max_model_len (integer, optional)

Memory Management

gpu_memory_utilization (float, optional)

max_num_seqs (integer, optional)

Features

enable_prefix_caching (boolean, optional)

enable_chunked_prefill (boolean, optional)

enforce_eager (boolean, optional)

served_model_name (string, optional)

Parallelism & Distribution

data_parallel_size (integer, optional)

distributed_executor_backend (string, optional)

Memory & Performance

swap_space (integer, optional)

max_num_batched_tokens (integer, optional)

Scheduling

scheduling_policy (string, optional)

reasoning_parser (string, optional)

Custom Engine Fields

base_command (string, required)

model_flag (string, optional)

`engine` (string, required)

`model` (string, required)

`name` (string, optional)

`explanation` (string, optional)

`short_explanation` (string, optional)

`trust_remote_code` (boolean, optional)

`quantization` (string, optional)

`gpu_types` (array of strings, optional)

`load_format` (string, optional)

`model_loader_extra_config` (object, optional)

`seed` (integer, optional)

`tokenizer` (string, optional)

`model_mode` (string, optional)

`image_tag` (string, optional)

`tp` (integer, optional)

`dp` (integer, optional)

`ep` (integer, optional)

`mem_fraction_static` (float, optional)

`kv_cache_dtype` (string, optional)

`context_length` (integer, optional)

`max_running_requests` (integer, optional)

`enable_dp_attention` (boolean, optional)

`reasoning_parser` (string, optional)

`tool_call_parser` (string, optional)

`chat_template` (string, optional)

`disable_shared_experts_fusion` (boolean, optional)

`speculative_algorithm` (string, optional)

`speculative_num_steps` (integer, optional)

`speculative_eagle_topk` (integer, optional)

`speculative_num_draft_tokens` (integer, optional)

`speculative_draft_model_path` (string, optional)

`pipeline_parallel_size` (integer, optional)

`schedule_policy` (string, optional)

`disable_radix_cache` (boolean, optional)

`disable_cuda_graph` (boolean, optional)

`tensor_parallel_size` (integer, optional)

`pipeline_parallel_size` (integer, optional)

`dtype` (string, optional)

`max_model_len` (integer, optional)

`gpu_memory_utilization` (float, optional)

`max_num_seqs` (integer, optional)

`enable_prefix_caching` (boolean, optional)

`enable_chunked_prefill` (boolean, optional)

`enforce_eager` (boolean, optional)

`served_model_name` (string, optional)

`data_parallel_size` (integer, optional)

`distributed_executor_backend` (string, optional)

`swap_space` (integer, optional)

`max_num_batched_tokens` (integer, optional)

`scheduling_policy` (string, optional)

`reasoning_parser` (string, optional)

`base_command` (string, required)

`model_flag` (string, optional)

`image` (string, optional)

`args` (array of strings, optional)

`kv_args` (object, optional)