
Deploying Inference Graphs to Kubernetes

High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.

1. Install Platform First

# 1. Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# 3. Install Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace

For more details or customization options (including multinode deployments), see Installation Guide for Dynamo Kubernetes Platform.
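Before moving on, it's worth confirming the platform came up cleanly. A minimal sanity check (pod and CRD names vary by release; these commands assume the install above):

```shell
# Platform pods should reach Running/Ready
kubectl get pods -n ${NAMESPACE}

# The Dynamo CRDs should be registered with the cluster
kubectl get crd | grep dynamo
```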

2. Choose Your Backend

Each backend has deployment examples and configuration options:

| Backend | Available Configurations |
| --- | --- |
| vLLM | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner, Disaggregated Multi-node |
| SGLang | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node |
| TensorRT-LLM | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated Multi-node |

3. Deploy Your First Model

# Choose a namespace for your model deployments
export NAMESPACE=dynamo-cloud
kubectl create namespace ${NAMESPACE}

# To pull models from Hugging Face, store your token in a secret
export HF_TOKEN=<Token-Here>
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="$HF_TOKEN" \
  -n ${NAMESPACE}

# Deploy an example (here: vLLM serving a Qwen model with aggregated serving)
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}

# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}

# Test it
kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models

What's a DynamoGraphDeployment?

It's a Kubernetes Custom Resource that defines your inference pipeline:

  • Model configuration
  • Resource allocation (GPUs, memory)
  • Scaling policies
  • Frontend/backend connections

Refer to the API Reference and Documentation for more details.
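Once applied, the resource can be inspected like any other Kubernetes object. A sketch (`vllm-agg` is assumed to be the name used by the agg.yaml example above; substitute the name from your manifest):

```shell
# List all inference graph deployments in the namespace
kubectl get dynamographdeployments -n ${NAMESPACE}

# Show the full spec, status, and recent events for one deployment
kubectl describe dynamographdeployment vllm-agg -n ${NAMESPACE}
```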

📖 API Reference & Documentation

For detailed technical specifications of Dynamo's Kubernetes resources:

  • API Reference - Complete CRD field specifications for DynamoGraphDeployment and DynamoComponentDeployment
  • Operator Guide - Dynamo operator configuration and management
  • Create Deployment - Step-by-step deployment creation examples

Choosing Your Architecture Pattern

When creating a deployment, select the architecture pattern that best fits your use case:

  • Development / Testing - Use agg.yaml as the base configuration
  • Production with Load Balancing - Use agg_router.yaml to enable scalable, load-balanced inference
  • High Performance / Disaggregated - Use disagg_router.yaml for maximum throughput and modular scalability
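For the vLLM backend, the patterns above map directly onto the example manifests (a sketch; the directory layout follows the agg.yaml path shown earlier and differs per backend):

```shell
# Development / testing: single aggregated worker
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}

# Production with load balancing
kubectl apply -f components/backends/vllm/deploy/agg_router.yaml -n ${NAMESPACE}

# High performance: disaggregated prefill/decode with routing
kubectl apply -f components/backends/vllm/deploy/disagg_router.yaml -n ${NAMESPACE}
```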

Frontend and Worker Components

You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:

  • Provides OpenAI-compatible /v1/chat/completions endpoint
  • Auto-discovers backend workers via etcd
  • Routes requests and handles load balancing
  • Validates and preprocesses requests
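With the port-forward from the quickstart active, the OpenAI-compatible endpoint can be exercised directly. A sketch (the model name is an assumption taken from the vLLM example later in this guide; use a name returned by /v1/models):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```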

Customizing Your Deployment

Example structure:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: your-image
    VllmDecodeWorker:  # or SGLangDecodeWorker, TrtllmDecodeWorker
      dynamoNamespace: dynamo-dev
      componentType: worker
      replicas: 1
      envFromSecret: hf-token-secret  # for HuggingFace models
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]

Worker command examples per backend:

# vLLM worker
args:
  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

# SGLang worker
args:
  - >-
    python3 -m dynamo.sglang
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --tp 1
    --trust-remote-code

# TensorRT-LLM worker
args:
  - >-
    python3 -m dynamo.trtllm
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --extra-engine-args engine_configs/agg.yaml

Key customization points include:

  • Model Configuration: Specify model in the args command
  • Resource Allocation: Configure GPU requirements under resources.limits
  • Scaling: Set replicas for number of worker instances
  • Routing Mode: Enable KV-cache routing by setting DYN_ROUTER_MODE=kv in Frontend envs
  • Worker Specialization: Add --is-prefill-worker flag for disaggregated prefill workers
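As a sketch, the routing and worker-specialization points above look like this in a manifest fragment (field names follow the example structure earlier; the `envs` list is an assumption about how Frontend environment variables are set):

```yaml
services:
  Frontend:
    componentType: frontend
    envs:                        # assumed field for container env vars
      - name: DYN_ROUTER_MODE
        value: kv                # enable KV-cache routing
  VllmPrefillWorker:             # hypothetical name for a prefill-only worker
    componentType: worker
    extraPodSpec:
      mainContainer:
        command: ["/bin/sh", "-c"]
        args:
          - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --is-prefill-worker
```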

Additional Resources