title	Configuration

Configuration Reference

Complete configuration reference for tuning SMG behavior.

Configuration Methods

SMG can be configured through the following sources, listed from highest to lowest priority:

Command-line arguments (highest priority)
Environment variables
Default values (lowest priority)
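
The precedence order can be pictured as a simple lookup chain. The following is a minimal sketch in Python, not SMG's actual implementation; the `resolve` helper and its parameters are hypothetical names used only for illustration.

```python
import os

def resolve(cli_value, env_name, default, environ=None):
    """Return the effective setting: CLI argument, then environment variable, then default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        # 1. A command-line argument always wins.
        return cli_value
    if env_name is not None and env_name in environ:
        # 2. Environment variable, for options that define one.
        return environ[env_name]
    # 3. Fall back to the built-in default.
    return default

# Illustration using --redis-retention-days / REDIS_RETENTION_DAYS (default 30).
env = {"REDIS_RETENTION_DAYS": "7"}
print(resolve(None, "REDIS_RETENTION_DAYS", "30", env))  # environment variable is used
print(resolve("14", "REDIS_RETENTION_DAYS", "30", env))  # CLI argument overrides it
print(resolve(None, None, "30000", env))                 # option without env var: default
```

Options listed with `-` in the Environment row have no environment variable, so only the first and last steps apply to them.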

Worker Configuration

Host

Network interface to bind to.

Option	`--host`
Environment	-
Default	`0.0.0.0`

Value	Description
`127.0.0.1`	Localhost only
`0.0.0.0`	All IPv4 interfaces
`::`	All IPv6 interfaces
`::1`	IPv6 localhost

Port

Port for the main API server.

Option	`--port`
Environment	-
Default	`30000`

Worker URLs

List of worker URLs to route requests to.

Option	`--worker-urls`
Environment	-
Default	Empty
Format	Space-separated URLs

Examples:

--worker-urls http://worker1:8000 http://worker2:8000
--worker-urls http://[::1]:8000 http://192.168.1.1:8000  # IPv6 and IPv4
--worker-urls grpc://worker1:50051  # gRPC mode

Routing Policy Configuration

Load Balancing Policy

Controls how requests are distributed across workers.

Option	`--policy`
Environment	-
Default	`cache_aware`
Values	`random`, `round_robin`, `cache_aware`, `power_of_two`, `prefix_hash`, `consistent_hashing`, `bucket`, `manual`

Policy Comparison:

Policy	Use Case	KV Cache	Load Balance
`random`	Simple deployments	Poor	Fair
`round_robin`	Uniform workloads	Poor	Good
`power_of_two`	Variable workloads	Poor	Excellent
`cache_aware`	LLM inference	Excellent	Good
`prefix_hash`	Consistent routing by prefix	Good	Good
`consistent_hashing`	Session affinity via hash ring	Good	Good
`bucket`	Load balancing with bucket boundaries	Poor	Excellent
`manual`	Sticky sessions with LRU eviction	Good	Manual

Recommendation: Use `cache_aware` for LLM workloads to maximize KV cache hit rates.

Cache-Aware Policy Options

Option	Description	Default
`--cache-threshold`	Cache threshold (0.0-1.0) for cache-aware routing	`0.3`
`--balance-abs-threshold`	Absolute threshold for load balancing trigger	`64`
`--balance-rel-threshold`	Relative threshold for load balancing trigger	`1.5`
`--eviction-interval`	Interval in seconds between cache eviction operations	`120`
`--max-tree-size`	Maximum size of the approximation tree	`67108864`
`--block-size`	KV cache block size for event-driven cache-aware routing	`16`

Prefix Hash Policy Options

Option	Description	Default
`--prefix-token-count`	Number of prefix tokens to use for hashing	`256`
`--prefix-hash-load-factor`	Load factor threshold for rebalancing	`1.25`

Manual Policy Options

Option	Description	Default
`--max-idle-secs`	Maximum idle time before eviction	`14400` (4 hours)
`--assignment-mode`	Mode for new routing key assignment	`random`

Assignment Modes:

random - Assign to a random worker
min_load - Assign to worker with fewest active requests
min_group - Assign to worker with fewest routing keys
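
The three assignment modes can be sketched as follows. This is an illustrative Python model under stated assumptions, not SMG's code: `assign_worker`, `active_requests`, and `routing_keys` are hypothetical names, and ties are broken by dictionary order here.

```python
import random

def assign_worker(mode, active_requests, routing_keys, rng=random):
    """Pick a worker for a new routing key.

    active_requests: worker -> number of in-flight requests (hypothetical input)
    routing_keys:    worker -> number of routing keys already assigned (hypothetical input)
    """
    workers = list(active_requests)
    if mode == "random":
        return rng.choice(workers)
    if mode == "min_load":
        # Worker with the fewest active requests.
        return min(workers, key=lambda w: active_requests[w])
    if mode == "min_group":
        # Worker with the fewest routing keys.
        return min(workers, key=lambda w: routing_keys[w])
    raise ValueError(f"unknown assignment mode: {mode!r}")

active = {"w1": 5, "w2": 2}
keys = {"w1": 1, "w2": 9}
print(assign_worker("min_load", active, keys))   # w2
print(assign_worker("min_group", active, keys))  # w1
```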

Advanced Routing Options

Option	Description	Default
`--dp-aware`	Enable data parallelism aware scheduling	`false`
`--enable-igw`	Enable IGW (Inference Gateway) mode for multi-model support	`false`
`--dp-minimum-tokens-scheduler`	Enable minimum tokens scheduler for data parallel group	`false`
`--load-monitor-interval`	Interval in seconds between load monitor checks for PowerOfTwo routing	`10`
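
For intuition about why `power_of_two` depends on periodically refreshed load data, here is a sketch of the generic power-of-two-choices technique. It assumes a `loads` mapping that a load monitor would refresh; the function name and inputs are hypothetical, and SMG's implementation may differ.

```python
import random

def power_of_two_choice(loads, rng=random):
    """Sample two distinct workers and return the one with the lower reported load."""
    workers = list(loads)
    if len(workers) == 1:
        return workers[0]
    first, second = rng.sample(workers, 2)
    return first if loads[first] <= loads[second] else second

# Loads as last reported by a (hypothetical) load monitor.
loads = {"w1": 12, "w2": 3, "w3": 7}
print(power_of_two_choice(loads))
```

Because the comparison uses the most recently observed loads, a shorter monitor interval gives fresher data at the cost of more frequent polling.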

PD Disaggregation Configuration

Prefill-Decode disaggregated mode separates prefill and decode operations across different workers.

Enable PD Mode

Option	`--pd-disaggregation`
Environment	-
Default	`false`

Prefill Servers

Option	`--prefill`
Format	`URL [BOOTSTRAP_PORT]`
Multiple	Yes (specify multiple times)

Examples:

--prefill http://prefill1:30001 9001 \
--prefill http://prefill2:30002 9002 \
--prefill http://prefill3:30003 none  # No bootstrap port

Decode Servers

Option	`--decode`
Format	URL
Multiple	Yes (specify multiple times)

Example:

--decode http://decode1:30003 \
--decode http://decode2:30004

PD-Specific Policies

Option	Description	Default
`--prefill-policy`	Specific policy for prefill nodes	Uses main `--policy`
`--decode-policy`	Specific policy for decode nodes	Uses main `--policy`

Worker Startup Configuration

Option	Description	Default
`--worker-startup-timeout-secs`	Timeout for worker startup and registration	`1800` (30 min)
`--worker-startup-check-interval`	Interval between worker startup checks	`30`

Service Discovery (Kubernetes)

Enable Service Discovery

Option	`--service-discovery`
Environment	-
Default	`false`

Note: Enabling service discovery automatically enables IGW mode.

Label Selector

Option	`--selector`
Format	`key=value` (space-separated for multiple)

Example:

--selector app=sglang-worker tier=inference

Namespace

Option	`--service-discovery-namespace`
Environment	-
Default	All namespaces

Worker Port

Option	`--service-discovery-port`
Environment	-
Default	`80`

PD Service Discovery Selectors

Option	Description
`--prefill-selector`	Label selector for prefill server pods
`--decode-selector`	Label selector for decode server pods

HA Mesh Router Discovery

Option	Description
`--router-selector`	Label selector for router pod discovery in HA mesh mode (format: `key=value`)

Per-Worker Model ID Override

Option	Description
`--model-id-from`	Override each worker's `model_id` from pod metadata. Accepted values: `namespace`, `label:<key>`, or `annotation:<key>`.

Tokenizer Configuration

Model Path

Option	`--model-path`
Environment	-
Default	None
Description	HuggingFace model ID or local path for loading tokenizer

Tokenizer Path

Option	`--tokenizer-path`
Environment	-
Default	None
Description	Explicit tokenizer path (overrides the tokenizer resolved from `--model-path`)

Chat Template

Option	`--chat-template`
Environment	-
Default	None
Description	Path to chat template file

Disable Tokenizer Autoload

Option	`--disable-tokenizer-autoload`
Environment	-
Default	`false`
Description	Disable automatic tokenizer loading at startup and during worker registration. Useful when tokenizers are loaded on-demand via the API.

Tokenizer Cache (L0 - Exact Match)

Option	Description	Default
`--tokenizer-cache-enable-l0`	Enable L0 exact match cache	`false`
`--tokenizer-cache-l0-max-entries`	Maximum entries in L0 cache	`10000`

Tokenizer Cache (L1 - Prefix Matching)

Option	Description	Default
`--tokenizer-cache-enable-l1`	Enable L1 prefix matching cache	`false`
`--tokenizer-cache-l1-max-memory`	Maximum memory for L1 cache (bytes)	`52428800` (50MB)

Parser Configuration

Reasoning Parser

Option	`--reasoning-parser`
Environment	-
Default	None
Values	`deepseek-r1`, `qwen3`, etc.
Description	Parser for reasoning models with thinking tokens

Tool Call Parser

Option	`--tool-call-parser`
Environment	-
Default	None
Values	`json`, `qwen`, etc.
Description	Parser for tool-call/function-calling interactions

MCP Configuration

MCP Config Path

Option	`--mcp-config-path`
Environment	-
Default	None
Description	Path to MCP (Model Context Protocol) server configuration file

Backend Configuration

Backend Runtime

Option	`--backend`
Environment	-
Default	None (auto-detected)
Values	`sglang`, `vllm`, `trtllm`, `openai`, `anthropic`, `gemini`

History Backend

Option	`--history-backend`
Environment	-
Default	`memory`
Values	`memory`, `none`, `oracle`, `postgres`, `redis`

Storage Configuration

Oracle Database

Option	Environment	Description
`--oracle-wallet-path`	`ATP_WALLET_PATH`	Path to Oracle ATP wallet directory
`--oracle-tns-alias`	`ATP_TNS_ALIAS`	Oracle TNS alias from tnsnames.ora
`--oracle-dsn`	`ATP_DSN`	Oracle connection descriptor/DSN
`--oracle-user`	`ATP_USER`	Oracle database username
`--oracle-password`	`ATP_PASSWORD`	Oracle database password
`--oracle-external-auth`	`ATP_EXTERNAL_AUTH`	Enable Oracle external authentication (default: `false`)
`--oracle-pool-min`	`ATP_POOL_MIN`	Minimum connection pool size (default: 1)
`--oracle-pool-max`	`ATP_POOL_MAX`	Maximum connection pool size (default: 16)
`--oracle-pool-timeout-secs`	`ATP_POOL_TIMEOUT_SECS`	Pool timeout in seconds (default: 30)

PostgreSQL Database

Option	Environment	Description	Default
`--postgres-db-url`	`POSTGRES_DB_URL`	PostgreSQL connection URL	-
`--postgres-pool-max-size`	`POSTGRES_POOL_MAX`	Maximum pool size	`16`

Redis Database

Option	Environment	Description	Default
`--redis-url`	`REDIS_URL`	Redis connection URL	-
`--redis-pool-max-size`	`REDIS_POOL_MAX`	Maximum pool size	`16`
`--redis-retention-days`	`REDIS_RETENTION_DAYS`	Data retention in days (`-1` keeps data indefinitely)	`30`

WASM Configuration

Enable WebAssembly

Option	`--enable-wasm`
Environment	-
Default	`false`
Description	Enable WebAssembly support

Storage Hook WASM Component

Option	`--storage-hook-wasm-path`
Environment	-
Default	None
Description	Path to a WASM component implementing storage hooks. When set, wraps all storage backends with hook-based interceptors.

Schema Config File

Option	`--schema-config`
Environment	-
Default	None
Description	Path to a YAML schema config file for storage table/column remapping.

WebRTC Configuration

Option	Description	Default
`--webrtc-bind-addr`	Bind address for WebRTC UDP sockets (client-facing ICE candidate IP). Set to `127.0.0.1` for local development on the same machine.	`0.0.0.0` (auto-detect via routing table)
`--webrtc-stun-server`	STUN server for ICE candidate gathering (`host:port`). Set to your own STUN server for enterprise deployments that restrict outbound traffic to external STUN servers.	`stun.l.google.com:19302`

Mesh Server Configuration

High-availability mesh networking for multi-router coordination.

Option	Description	Default
`--enable-mesh`	Enable mesh server for HA multi-router coordination. Requires at least two SMG instances.	`false`
`--mesh-server-name`	Name for this mesh node. If not set, a random name is generated (e.g., `Mesh_a1b2`).	Auto-generated
`--mesh-host`	Bind address for the mesh server.	`0.0.0.0`
`--mesh-advertise-host`	Routable address advertised to other mesh peers. Required when `--mesh-host` is an unspecified bind address such as `0.0.0.0`.	`--mesh-host`
`--mesh-port`	Port for the mesh server.	`39527`
`--mesh-peer-urls`	Peer mesh node addresses to join (format: `host:port`). Used for initial cluster formation.	(none)

Example:

smg \
  --enable-mesh \
  --mesh-server-name router-1 \
  --mesh-advertise-host 192.168.1.10 \
  --mesh-port 39527 \
  --mesh-peer-urls 192.168.1.11:39527

Request Handling Configuration

Request Timeout

Option	`--request-timeout-secs`
Environment	-
Default	`1800` (30 minutes)
Description	Maximum time for request processing

Shutdown Grace Period

Option	`--shutdown-grace-period-secs`
Environment	-
Default	`180` (3 minutes)
Description	Time to wait for in-flight requests during shutdown

Maximum Payload Size

Option	`--max-payload-size`
Environment	-
Default	`536870912` (512MB)
Description	Maximum request payload size in bytes

CORS Configuration

Option	`--cors-allowed-origins`
Environment	-
Default	Empty
Format	Space-separated URLs

Example:

--cors-allowed-origins http://localhost:3000 https://example.com

Request ID Headers

Option	`--request-id-headers`
Environment	-
Default	None (uses common defaults)
Description	Custom HTTP headers to check for request IDs

Example:

--request-id-headers x-request-id x-trace-id x-correlation-id

Storage Context Headers

Option	`--storage-context-headers`
Environment	-
Default	Empty
Format	Space-separated `header=context_key` entries
Description	Maps request headers into storage hook request context

Example:

--storage-context-headers x-tenant-id=tenant_id x-user-id=user_id

This lets storage hooks read values such as `tenant_id` and `user_id` from the request context without hard-coding specific headers in the gateway.

Only map headers that are injected or sanitized by a trusted upstream. Client-supplied headers can otherwise spoof storage hook request context values.
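
The `header=context_key` mapping can be sketched as below. This is a hypothetical Python illustration rather than the gateway's code: `parse_mappings` and `build_context` are made-up helper names, and header names are matched case-insensitively because HTTP header names are case-insensitive.

```python
def parse_mappings(entries):
    """Turn ['x-tenant-id=tenant_id', ...] into {'x-tenant-id': 'tenant_id', ...}."""
    mapping = {}
    for entry in entries:
        header, sep, context_key = entry.partition("=")
        if not sep or not header or not context_key:
            raise ValueError(f"expected header=context_key, got {entry!r}")
        mapping[header.lower()] = context_key
    return mapping

def build_context(mapping, request_headers):
    """Copy mapped headers that are present on the request into a context dict."""
    lowered = {name.lower(): value for name, value in request_headers.items()}
    return {key: lowered[header] for header, key in mapping.items() if header in lowered}

mapping = parse_mappings(["x-tenant-id=tenant_id", "x-user-id=user_id"])
print(build_context(mapping, {"X-Tenant-Id": "acme", "Accept": "*/*"}))
# {'tenant_id': 'acme'}
```

Unmapped headers such as `Accept` never reach the context, and a mapped header that is absent from the request is simply omitted.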

Rate Limiting Configuration

Concurrent Request Limit

Option	`--max-concurrent-requests`
Environment	-
Default	`-1` (unlimited)
Range	`-1` or `1+`

Sizing Guide:

max_concurrent_requests = num_workers * requests_per_worker_capacity

Worker GPU Memory	Suggested per Worker
16GB	4-8
40GB	8-16
80GB	16-32
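
As a worked example of the sizing formula, the sketch below computes a range for four workers with 80GB GPUs, using the suggested 16-32 requests per worker. The helper name is hypothetical, and the per-worker figures are starting points to validate under real load.

```python
def max_concurrent_requests(num_workers: int, requests_per_worker: int) -> int:
    """Apply the sizing formula: workers multiplied by per-worker capacity."""
    if num_workers < 1 or requests_per_worker < 1:
        raise ValueError("both arguments must be at least 1")
    return num_workers * requests_per_worker

low = max_concurrent_requests(4, 16)   # conservative end of the 80GB range
high = max_concurrent_requests(4, 32)  # aggressive end of the 80GB range
print(low, high)  # 64 128
```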

Queue Configuration

Option	Description	Default
`--queue-size`	Maximum number of requests that can wait in the queue once the limit is reached	`100`
`--queue-timeout-secs`	Maximum time a request can wait in queue	`60`

Token Bucket Rate Limiting

Option	`--rate-limit-tokens-per-second`
Environment	-
Default	Same as `--max-concurrent-requests`
Description	Token bucket refill rate
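
To illustrate what a refill rate means, here is a generic token bucket sketch. It is not SMG's implementation: the class, the `capacity` parameter, and the explicit `now` timestamps are assumptions made so the example is deterministic.

```python
class TokenBucket:
    """Generic token bucket: each request consumes one token; tokens refill over time."""

    def __init__(self, refill_per_second: float, capacity: float, now: float = 0.0):
        self.refill_per_second = refill_per_second
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.last = now

    def try_acquire(self, now: float) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(refill_per_second=2.0, capacity=2.0)
burst = [bucket.try_acquire(0.0) for _ in range(3)]  # three requests at t=0
later = bucket.try_acquire(0.5)                      # one token refilled after 0.5 s
print(burst, later)  # [True, True, False] True
```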

Retry Configuration

Retry Options

Option	Description	Default
`--retry-max-retries`	Maximum retry attempts	`5`
`--retry-initial-backoff-ms`	Initial backoff delay (ms)	`50`
`--retry-max-backoff-ms`	Maximum backoff delay (ms)	`30000`
`--retry-backoff-multiplier`	Exponential backoff multiplier	`1.5`
`--retry-jitter-factor`	Jitter factor (0.0-1.0)	`0.2`
`--disable-retries`	Disable automatic retries	`false`

Backoff Formula:

delay = min(initial_backoff * multiplier^attempt, max_backoff) * (1 + random(0, jitter_factor))
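
The formula can be evaluated directly. The sketch below uses the default values from the table; it assumes the first retry is `attempt = 0` and that `random(0, jitter_factor)` is a uniform draw, neither of which is stated above. The function name is hypothetical.

```python
import random

def backoff_delay_ms(attempt, initial_ms=50, max_ms=30000,
                     multiplier=1.5, jitter_factor=0.2, rng=random.random):
    """delay = min(initial * multiplier^attempt, max) * (1 + random(0, jitter_factor))"""
    base = min(initial_ms * multiplier ** attempt, max_ms)
    return base * (1 + rng() * jitter_factor)

# Jitter disabled to show the deterministic part of the schedule.
for attempt in range(5):
    print(attempt, backoff_delay_ms(attempt, jitter_factor=0.0))
# 0 50.0
# 1 75.0
# 2 112.5
# 3 168.75
# 4 253.125
```

With the default jitter factor of `0.2`, each delay is stretched by up to 20 percent, which spreads out retries from clients that failed at the same moment.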

Circuit Breaker Configuration

Option	Description	Default
`--cb-failure-threshold`	Failures before circuit opens	`10`
`--cb-success-threshold`	Successes needed to close in half-open state	`3`
`--cb-timeout-duration-secs`	Time before attempting recovery	`60`
`--cb-window-duration-secs`	Sliding window for tracking failures	`120`
`--disable-circuit-breaker`	Disable circuit breaker	`false`

Circuit Breaker States:

Closed: Normal operation, tracking failures
Open: All requests fail fast, circuit tripped
Half-Open: Testing if service recovered
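
The state transitions can be modeled with a small state machine. This is a simplified sketch, not SMG's implementation: the class and method names are hypothetical, timestamps are passed in explicitly, and the parameters mirror the `--cb-*` options above.

```python
from collections import deque

class CircuitBreaker:
    """Illustrative three-state breaker: closed -> open -> half_open -> closed."""

    def __init__(self, failure_threshold=10, success_threshold=3,
                 timeout_secs=60.0, window_secs=120.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_secs = timeout_secs
        self.window_secs = window_secs
        self.state = "closed"
        self.failure_times = deque()  # timestamps of failures inside the window
        self.half_open_successes = 0
        self.opened_at = 0.0

    def allow(self, now):
        """Return True if a request may be sent at time `now` (seconds)."""
        if self.state == "open" and now - self.opened_at >= self.timeout_secs:
            self.state = "half_open"  # timeout elapsed: allow probe requests
            self.half_open_successes = 0
        return self.state != "open"

    def record(self, ok, now):
        """Record the outcome of a request that was allowed through."""
        if self.state == "half_open":
            if not ok:
                self._trip(now)  # probe failed: reopen
                return
            self.half_open_successes += 1
            if self.half_open_successes >= self.success_threshold:
                self.state = "closed"
                self.failure_times.clear()
        elif self.state == "closed" and not ok:
            self.failure_times.append(now)
            # Drop failures that fell out of the sliding window.
            while self.failure_times and self.failure_times[0] <= now - self.window_secs:
                self.failure_times.popleft()
            if len(self.failure_times) >= self.failure_threshold:
                self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self.opened_at = now

cb = CircuitBreaker(failure_threshold=2, success_threshold=1)
cb.record(False, now=0.0)
cb.record(False, now=1.0)
print(cb.state, cb.allow(10.0))  # open False
print(cb.allow(61.0), cb.state)  # True half_open
cb.record(True, now=62.0)
print(cb.state)                  # closed
```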

Health Check Configuration

Option	Description	Default
`--health-failure-threshold`	Failures before marking unhealthy	`3`
`--health-success-threshold`	Successes before marking healthy	`2`
`--health-check-timeout-secs`	Timeout for health check requests	`5`
`--health-check-interval-secs`	Interval between health checks	`60`
`--health-check-endpoint`	Health check endpoint path	`/health`
`--disable-health-check`	Disable all health checks	`false`
`--remove-unhealthy-workers`	Remove workers from the registry when marked unhealthy by health checks. Useful for ephemeral worker pools where failed workers should be deregistered.	`false`
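
The interaction between the two thresholds can be sketched as follows. This Python model assumes that the thresholds count consecutive probe results and that workers start out healthy; neither is stated above, and `HealthTracker` is a hypothetical name.

```python
class HealthTracker:
    """Flip a worker's health only after enough consecutive probe results."""

    def __init__(self, failure_threshold=3, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = True  # assumption: workers start healthy
        self.failures = 0
        self.successes = 0

    def observe(self, ok: bool) -> bool:
        """Record one health check result and return the current health state."""
        if ok:
            self.successes += 1
            self.failures = 0
            if not self.healthy and self.successes >= self.success_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.healthy and self.failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy

tracker = HealthTracker()
print([tracker.observe(ok) for ok in (False, False, False, True, True)])
# [True, True, False, False, True]
```

With the default 60-second interval, this model would take three failed probes, or roughly three minutes, before a worker is marked unhealthy.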

Prometheus Metrics Configuration

Metrics Server

Option	Description	Default
`--prometheus-port`	Port for Prometheus metrics endpoint	`29000`
`--prometheus-host`	Host for Prometheus metrics server	`0.0.0.0`
`--prometheus-duration-buckets`	Custom histogram buckets	Default buckets

Example:

--prometheus-duration-buckets 0.001 0.005 0.01 0.025 0.05 0.1 0.25 0.5 1.0 2.5 5.0 10.0

OpenTelemetry Configuration

Enable Tracing

Option	`--enable-trace`
Environment	-
Default	`false`

OTLP Endpoint

Option	`--otlp-traces-endpoint`
Environment	-
Default	`localhost:4317`
Format	`host:port`

Example:

smg --enable-trace --otlp-traces-endpoint jaeger:4317

TLS/mTLS Security Configuration

Server TLS

For HTTPS on the gateway:

Option	Description
`--tls-cert-path`	Path to server certificate (PEM format)
`--tls-key-path`	Path to server private key (PEM format)

Client mTLS

For secure communication to workers (Python bindings):

Option	Description
`--client-cert-path`	Path to client certificate
`--client-key-path`	Path to client private key
`--ca-cert-paths`	Path(s) to CA certificate(s)

Control Plane Authentication

API Key (Worker Authorization)

Option	`--api-key`
Environment	-
Default	None
Description	API key for worker authorization (useful with dp-aware scheduling)

Control Plane API Keys

Option	`--control-plane-api-keys`
Environment	`CONTROL_PLANE_API_KEYS`
Format	`id:name:role:key`
Multiple	Yes

Example:

--control-plane-api-keys 'key1:Admin:admin:secret123' 'key2:ReadOnly:user:secret456'
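
The `id:name:role:key` format splits into four fields. The sketch below is a hypothetical parser for illustration only; it assumes that any additional colons belong to the key itself, which the format description does not specify.

```python
from typing import NamedTuple

class ControlPlaneKey(NamedTuple):
    id: str
    name: str
    role: str
    key: str

def parse_control_plane_key(entry: str) -> ControlPlaneKey:
    """Split 'id:name:role:key' into its four fields."""
    parts = entry.split(":", 3)  # at most 4 parts; extra colons stay in the key
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected id:name:role:key, got {entry!r}")
    return ControlPlaneKey(*parts)

parsed = parse_control_plane_key("key1:Admin:admin:secret123")
print(parsed.id, parsed.name, parsed.role)  # key1 Admin admin
```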

JWT/OIDC Authentication

Option	Environment	Description
`--jwt-issuer`	`JWT_ISSUER`	OIDC issuer URL
`--jwt-audience`	`JWT_AUDIENCE`	Expected audience claim
`--jwt-jwks-uri`	`JWT_JWKS_URI`	Explicit JWKS URI (auto-discovered if not set)
`--jwt-role-claim`	-	JWT claim containing role (default: `roles`)
`--jwt-role-mapping`	-	Role mapping from IDP to gateway role

JWT Role Mapping Example:

--jwt-role-mapping 'Gateway.Admin=admin' 'Gateway.User=user'

Audit Logging

Option	`--disable-audit-logging`
Environment	-
Default	`false` (audit logging enabled)

Logging Configuration

Log Level

Option	`--log-level`
Environment	`RUST_LOG`
Default	`info`
Values	`debug`, `info`, `warn`, `error`

Per-Module Logging:

RUST_LOG=smg=debug,hyper=warn smg ...

Log Directory

Option	`--log-dir`
Environment	-
Default	None (console only)
Description	Directory to store log files

JSON Logs

Option	`--log-json`
Environment	-
Default	`false`
Description	Output logs as JSON (structured). Defaults to human-readable text logs.

Configuration Examples

Minimal Configuration

smg --worker-urls http://localhost:8000

High-Throughput Configuration

smg \
  --worker-urls http://w1:8000 http://w2:8000 http://w3:8000 http://w4:8000 \
  --policy cache_aware \
  --max-concurrent-requests 200 \
  --queue-size 400 \
  --queue-timeout-secs 60 \
  --retry-max-retries 3

Low-Latency Configuration

smg \
  --worker-urls http://w1:8000 http://w2:8000 \
  --policy power_of_two \
  --max-concurrent-requests 50 \
  --queue-size 25 \
  --queue-timeout-secs 5 \
  --health-check-interval-secs 5 \
  --request-timeout-secs 30

PD Disaggregated Mode

smg \
  --pd-disaggregation \
  --prefill http://prefill1:30001 9001 \
  --prefill http://prefill2:30002 9002 \
  --decode http://decode1:30003 \
  --decode http://decode2:30004 \
  --prefill-policy cache_aware \
  --decode-policy round_robin

Kubernetes Service Discovery

smg \
  --service-discovery \
  --selector app=sglang-worker \
  --service-discovery-namespace inference \
  --service-discovery-port 8000 \
  --policy cache_aware

High-Availability Mesh

# Router 1
smg \
  --enable-mesh \
  --mesh-server-name router-1 \
  --mesh-advertise-host 192.168.1.10 \
  --mesh-port 39527 \
  --mesh-peer-urls 192.168.1.11:39527 \
  --worker-urls http://worker1:8000

# Router 2
smg \
  --enable-mesh \
  --mesh-server-name router-2 \
  --mesh-advertise-host 192.168.1.11 \
  --mesh-port 39527 \
  --mesh-peer-urls 192.168.1.10:39527 \
  --worker-urls http://worker2:8000

Secure Production Configuration

smg \
  --service-discovery \
  --selector app=sglang-worker \
  --service-discovery-namespace inference \
  --policy cache_aware \
  --max-concurrent-requests 100 \
  --tls-cert-path /etc/certs/server.crt \
  --tls-key-path /etc/certs/server.key \
  --jwt-issuer https://login.microsoftonline.com/tenant/v2.0 \
  --jwt-audience api://smg-gateway \
  --jwt-role-mapping 'Gateway.Admin=admin' 'Gateway.User=user' \
  --enable-trace \
  --otlp-traces-endpoint jaeger:4317 \
  --host 0.0.0.0 \
  --port 443

With Tokenizer and Parsers

smg \
  --worker-urls http://localhost:8000 \
  --model-path meta-llama/Llama-3-8B-Instruct \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-l0-max-entries 50000 \
  --reasoning-parser deepseek-r1 \
  --tool-call-parser json

With Database Backend

# PostgreSQL
smg \
  --worker-urls http://localhost:8000 \
  --history-backend postgres \
  --postgres-db-url "postgres://user:pass@localhost:5432/smg" \
  --postgres-pool-max-size 32

# Redis
smg \
  --worker-urls http://localhost:8000 \
  --history-backend redis \
  --redis-url "redis://localhost:6379" \
  --redis-pool-max-size 32 \
  --redis-retention-days 7

Environment Variable Reference

Environment Variable	CLI Option	Description
`RUST_LOG`	`--log-level`	Log level
`ATP_WALLET_PATH`	`--oracle-wallet-path`	Oracle wallet path
`ATP_TNS_ALIAS`	`--oracle-tns-alias`	Oracle TNS alias
`ATP_DSN`	`--oracle-dsn`	Oracle DSN
`ATP_USER`	`--oracle-user`	Oracle username
`ATP_PASSWORD`	`--oracle-password`	Oracle password
`ATP_EXTERNAL_AUTH`	`--oracle-external-auth`	Enable Oracle external authentication
`ATP_POOL_MIN`	`--oracle-pool-min`	Oracle min pool size
`ATP_POOL_MAX`	`--oracle-pool-max`	Oracle max pool size
`ATP_POOL_TIMEOUT_SECS`	`--oracle-pool-timeout-secs`	Oracle pool timeout
`POSTGRES_DB_URL`	`--postgres-db-url`	PostgreSQL URL
`POSTGRES_POOL_MAX`	`--postgres-pool-max-size`	PostgreSQL max pool
`REDIS_URL`	`--redis-url`	Redis URL
`REDIS_POOL_MAX`	`--redis-pool-max-size`	Redis max pool
`REDIS_RETENTION_DAYS`	`--redis-retention-days`	Redis retention
`JWT_ISSUER`	`--jwt-issuer`	JWT issuer URL
`JWT_AUDIENCE`	`--jwt-audience`	JWT audience
`JWT_JWKS_URI`	`--jwt-jwks-uri`	JWKS URI
`CONTROL_PLANE_API_KEYS`	`--control-plane-api-keys`	Control plane API keys
