# Disaggregated Multi-Node Inference with NVIDIA Dynamo on A4X GKE

This document outlines the steps to deploy and serve Large Language Models (LLMs) using the [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo) disaggregated inference platform on [A4X GKE node pools](https://cloud.google.com/kubernetes-engine).

Dynamo provides a disaggregated architecture that separates prefill and decode operations for optimized inference performance, and it supports both single-node and multi-node configurations; this recipe uses a multi-node (16 GPU) deployment. Dynamo also supports multiple inference framework backends, such as [vLLM](https://docs.nvidia.com/dynamo/latest/components/backends/vllm/README.html) and [SGLang](https://docs.nvidia.com/dynamo/latest/components/backends/sglang/README.html). In this recipe, we focus on serving with the SGLang backend.

<a name="table-of-contents"></a>
## Table of Contents

* [1. Test Environment](#test-environment)
* [2. Environment Setup (One-Time)](#environment-setup)
* [2.1. Clone the Repository](#clone-repo)
* [2.2. Configure Environment Variables](#configure-vars)
* [2.3. Connect to your GKE Cluster](#connect-cluster)
* [2.4. Create Secrets](#create-secrets)
* [2.5. Install Dynamo Platform](#install-platform)
* [3. Deploy with SGLang Backend](#deploy-sglang)
* [3.1. Multi-Node SGLang Deployment (16 GPUs)](#sglang-multi-node)
* [4. Inference Request](#inference-request)
* [5. Monitoring and Troubleshooting](#monitoring)
* [6. Cleanup](#cleanup)

<a name="test-environment"></a>
## 1. Test Environment

[Back to Top](#table-of-contents)

This recipe has been tested with the following configuration:

* **GKE Cluster**:
* GPU node pools with [a4x-highgpu-4g](https://docs.cloud.google.com/compute/docs/gpus#gb200-gpus) machines:
* For multi-node deployment: 4 machines with 4 GPUs each (16 GPUs total)
* [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled
* [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled

> [!IMPORTANT]
> To prepare the required environment, see the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a4x.md).

<a name="environment-setup"></a>
## 2. Environment Setup (One-Time)

[Back to Top](#table-of-contents)

<a name="clone-repo"></a>
### 2.1. Clone the Repository

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=$(pwd)
export RECIPE_ROOT=$REPO_ROOT/inference/a4x/disaggregated-serving/dynamo
```

<a name="configure-vars"></a>
### 2.2. Configure Environment Variables

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<REGION_of_your_cluster>
export CLUSTER_NAME=<YOUR_GKE_CLUSTER_NAME>
export NAMESPACE=dynamo-cloud
export NGC_API_KEY=<YOUR_NGC_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
export RELEASE_VERSION=0.7.0

# Set the project for gcloud commands
gcloud config set project $PROJECT_ID
```

Replace the following values:

| Variable | Description | Example |
| -------- | ----------- | ------- |
| `PROJECT_ID` | Your Google Cloud Project ID | `gcp-project-12345` |
| `CLUSTER_REGION` | The GCP region where your GKE cluster is located | `us-central1` |
| `CLUSTER_NAME` | The name of your GKE cluster | `a4x-cluster` |
| `NGC_API_KEY` | Your NVIDIA NGC API key (get from [NGC](https://ngc.nvidia.com)) | `nvapi-xxx...` |
| `HF_TOKEN` | Your Hugging Face access token | `hf_xxx...` |

<a name="connect-cluster"></a>
### 2.3. Connect to your GKE Cluster

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```
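To confirm the credentials work and that your A4X nodes are visible, you can list them. This is a generic sanity check; it assumes the standard `node.kubernetes.io/instance-type` node label that GKE applies:

```bash
# List only the A4X nodes in the cluster.
kubectl get nodes -l node.kubernetes.io/instance-type=a4x-highgpu-4g
```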

<a name="create-secrets"></a>
### 2.4. Create Secrets

Create the namespace:
```bash
kubectl create namespace ${NAMESPACE}
kubectl config set-context --current --namespace=$NAMESPACE
```

Create the Docker registry secret for NVIDIA Container Registry:
```bash
kubectl create secret docker-registry nvcr-secret \
  --namespace=${NAMESPACE} \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=${NGC_API_KEY}
```

Create the secret for the Hugging Face token:
```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```
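Verify that both secrets exist before proceeding:

```bash
kubectl get secrets nvcr-secret hf-token-secret -n ${NAMESPACE}
```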

<a name="install-platform"></a>
### 2.5. Install Dynamo Platform

Add the NVIDIA Helm repository:
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
  --username='$oauthtoken' --password=${NGC_API_KEY}
helm repo update
```

Fetch the Dynamo Helm charts:
```bash
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
```

Install the Dynamo CRDs:
```bash
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz \
  --namespace default \
  --wait \
  --atomic
```
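You can confirm that the Dynamo CRDs registered; the exact CRD names may vary between releases:

```bash
kubectl get crds | grep -i dynamo
```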

Install the Dynamo Platform with the Grove and KAI Scheduler components enabled:
```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE} \
  --set grove.enabled=true \
  --set kai-scheduler.enabled=true
```

Verify the installation:
```bash
kubectl get pods -n ${NAMESPACE}
```

Wait until all pods show a `Running` status before proceeding.
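Alternatively, you can block until the platform pods are ready instead of polling manually. This assumes all pods in the namespace are long-running services; adjust the timeout to your environment:

```bash
kubectl wait --for=condition=Ready pods --all -n ${NAMESPACE} --timeout=10m
```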

<a name="deploy-sglang"></a>
## 3. Deploy with SGLang Backend

[Back to Top](#table-of-contents)

Deploy Dynamo with SGLang backend for high-performance inference.

<a name="sglang-multi-node"></a>
### 3.1. Multi-Node SGLang Deployment (16 GPUs)

Multi-node deployment uses 16 GPUs across 4 A4X machines, providing increased capacity for larger models or higher throughput.

#### DeepSeek-R1 671B Model

Deploy DeepSeek-R1 671B across multiple nodes for production workloads. Note the use of `--set-file prefill_serving_config` and `--set-file decode_serving_config`, which point to the model configuration files for a multi-node deployment scenario:

```bash
cd $RECIPE_ROOT
helm install -f values.yaml \
  --set-file prefill_serving_config=$REPO_ROOT/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-multi-node-prefill.yaml \
  --set-file decode_serving_config=$REPO_ROOT/src/frameworks/a4x/dynamo-configs/deepseekr1-fp8-multi-node-decode.yaml \
  $USER-dynamo-a4x-multi-node \
  $REPO_ROOT/src/helm-charts/a4x/inference-templates/dynamo-deployment
```
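The prefill and decode workers can take several minutes to schedule and pull images. A simple, chart-agnostic way to follow progress:

```bash
kubectl get pods -n ${NAMESPACE} --watch
```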

<a name="inference-request"></a>
## 4. Inference Request

[Back to Top](#table-of-contents)

To test the server, first run a health check against it using `curl`:

```bash
kubectl exec -it -n ${NAMESPACE} deployment/$USER-dynamo-a4x-multi-node-frontend -- curl -s http://localhost:8000/health | jq
```

You should see a server status like the following. While the workers are still starting up, the status is `unhealthy`; wait for it to report `healthy` before sending requests.

```json
{
  "instances": [
    {
      "component": "backend",
      "endpoint": "load_metrics",
      "instance_id": 3994861215823793160,
      "namespace": "dynamo",
      "transport": {
        "nats_tcp": "dynamo_backend.load_metrics-3770991c30298c08"
      }
    },
    {
      "component": "prefill",
      "endpoint": "clear_kv_blocks",
      "instance_id": 3994861215823793153,
      "namespace": "dynamo",
      "transport": {
        "nats_tcp": "dynamo_prefill.clear_kv_blocks-3770991c30298c01"
      }
    },
    {
      "component": "prefill",
      "endpoint": "generate",
      "instance_id": 3994861215823793153,
      "namespace": "dynamo",
      "transport": {
        "nats_tcp": "dynamo_prefill.generate-3770991c30298c01"
      }
    }
  ],
  "message": "No endpoints available",
  "status": "unhealthy"
}
```
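Once the status is `healthy`, you can send a quick smoke-test request. This sketch assumes the frontend exposes Dynamo's OpenAI-compatible `/v1/chat/completions` route on port 8000, and that the frontend deployment is named after the Helm release; adjust the name to what `kubectl get pods` shows:

```bash
kubectl exec -it -n ${NAMESPACE} deployment/$USER-dynamo-a4x-multi-node-frontend -- \
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/DeepSeek-R1",
      "messages": [{"role": "user", "content": "Say hello in one sentence."}],
      "max_tokens": 64
    }'
```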

To measure throughput, you can then run SGLang's serving benchmark against the frontend:

```bash
kubectl exec -n ${NAMESPACE} deployment/$USER-dynamo-a4x-multi-node-frontend -- \
  python3 -u -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --base-url http://localhost:8000 \
    --model "deepseek-ai/DeepSeek-R1" \
    --tokenizer /data/model/deepseek-ai/DeepSeek-R1 \
    --dataset-name random \
    --num-prompts 2048 \
    --random-input-len 2048 \
    --random-output-len 512 \
    --max-concurrency 512
```

<a name="monitoring"></a>
## 5. Monitoring and Troubleshooting

[Back to Top](#table-of-contents)

View logs for the different components, replacing the deployment names below with yours. You can find the exact pod and deployment names with:
```bash
kubectl get pods -n ${NAMESPACE}
```

Frontend logs:
```bash
kubectl logs -f deployment/$USER-dynamo-a4x-multi-node-frontend -n ${NAMESPACE}
```

Decode worker logs:
```bash
kubectl logs -f deployment/$USER-dynamo-a4x-multi-node-decode-worker -n ${NAMESPACE}
```

Prefill worker logs:
```bash
kubectl logs -f deployment/$USER-dynamo-a4x-multi-node-prefill-worker -n ${NAMESPACE}
```

Common issues:

* **Pods stuck in `Pending`**: Check whether the nodes have sufficient free resources, especially for multi-node deployments; see the diagnostic commands below
* **Slow model download**: Large models such as DeepSeek-R1 671B can take 30 minutes or more to download
* **Multi-node issues**: Verify network connectivity between nodes and the subnet configuration
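For the first two issues, `kubectl describe` and the namespace events usually reveal the cause (insufficient GPUs, image pull progress, and so on):

```bash
# Inspect a pod's scheduling decisions and container status.
kubectl describe pod <POD_NAME> -n ${NAMESPACE}

# Show recent events in the namespace, oldest first.
kubectl get events -n ${NAMESPACE} --sort-by=.metadata.creationTimestamp
```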

<a name="cleanup"></a>
## 6. Cleanup

[Back to Top](#table-of-contents)

List deployed releases:
```bash
helm list -n ${NAMESPACE} --filter $USER-dynamo-
```

Uninstall specific deployments:
```bash
helm uninstall $USER-dynamo-a4x-multi-node -n ${NAMESPACE}
```

Uninstall Dynamo platform (if no longer needed):
```bash
helm uninstall dynamo-platform -n ${NAMESPACE}
helm uninstall dynamo-crds -n default
```

Delete namespace and secrets:
```bash
kubectl delete namespace ${NAMESPACE}
```

Clean up downloaded charts:
```bash
rm -f dynamo-crds-${RELEASE_VERSION}.tgz
rm -f dynamo-platform-${RELEASE_VERSION}.tgz
```
