| copyright | |
|---|---|
| lastupdated | 2026-05-20 |
| keywords | kubernetes, gpu, nvidia, driver, migration, 1.36 |
| subcollection | containers |
{{site.data.keyword.attribute-definition-list}}
{: #gpu-migrate-136}
Starting with Kubernetes version 1.36, {{site.data.keyword.containerlong_notm}} no longer automatically installs NVIDIA GPU drivers on GPU worker nodes. You must install and manage GPU drivers yourself to run GPU workloads.
{: shortdesc}
{: #gpu-migrate-what-changed}
Version 1.36 and later
: GPU drivers are not preinstalled on new or replaced GPU worker nodes. You must install and maintain the following components:
- NVIDIA kernel driver
- Container runtime components (such as nvidia-container-toolkit)
- Kubernetes device plugin
Versions 1.35 and earlier
: GPU drivers are automatically installed and managed by IBM on all GPU worker nodes.
{: #gpu-migrate-impact}
Pods that request GPU resources remain in Pending state until you install the required GPU drivers on the worker node. After the drivers are installed, pending pods automatically transition to Running state.
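To see which pods are affected, you can filter the pod list for the `Pending` status. The following is a minimal sketch rather than an IBM-provided tool: `pending_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods -o wide` for a single namespace (with `--all-namespaces`, an extra leading column shifts the fields).

```shell
# Hypothetical helper: print the names of pods whose STATUS column is Pending.
# Reads the output of `kubectl get pods -o wide` on stdin.
# Assumes NAME is column 1 and STATUS is column 3 (single-namespace layout).
pending_pods() {
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}

# Usage (run against your own cluster):
#   kubectl get pods -o wide | pending_pods
```

A pod can also be `Pending` for reasons unrelated to GPU drivers, so check the events in `kubectl describe pod` before you conclude that a missing driver is the cause.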
{: #gpu-migrate-process}
The migration to self-managed GPU drivers follows a specific sequence to ensure your GPU workloads continue running during the upgrade:
-
Pre-installation phase: You install the NVIDIA GPU Operator on your cluster while it's still running version 1.35 or earlier. During installation, you label the existing GPU nodes to prevent the operator from deploying driver resources, which would conflict with the pre-installed drivers.
-
Control plane upgrade: You upgrade the cluster control plane to version 1.36.
-
Worker node upgrade: You replace each worker node to upgrade it to version 1.36. Because each replacement is a new node, the labels that prevented driver deployment are removed automatically, which allows the NVIDIA GPU Operator to deploy its driver stack on the new nodes.
-
Automatic workload recovery: After the operator installs drivers on a replaced node, any pending GPU workloads automatically transition to `Running` state.
{: #gpu-migrate-prepare}
You can complete the pre-installation steps before version 1.36 is released to prepare your cluster for a smoother migration:
-
Label your existing GPU worker nodes to prevent the operator from deploying resources that would conflict with pre-installed drivers.
kubectl label node/<node_name> nvidia.com/gpu.deploy.operands=false
kubectl label node/<node_name> nvidia.com/gpu.deploy.driver=false
{: pre}
-
Install the NVIDIA GPU Operator following the NVIDIA GPU Operator installation guide{: external}.
-
Verify that the operator is installed but not deploying driver resources on your labeled nodes.
kubectl get pods -n gpu-operator -o wide
{: pre}
By completing these preparation steps early, you reduce the work required during the actual upgrade to version 1.36. When version 1.36 becomes available, you only need to upgrade the control plane and replace the worker nodes.
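If you have several GPU worker nodes, the labeling step can be wrapped in a small loop. This is only a sketch: `label_gpu_nodes` is a hypothetical function name, you pass the node names yourself, and the `--overwrite` flag is added here (it is not part of the commands above) so that the function can be rerun safely.

```shell
# Hypothetical helper: apply both "do not deploy" labels to every node given.
# Node names are the values in the NAME column of `kubectl get nodes`.
label_gpu_nodes() {
  for node in "$@"; do
    # --overwrite lets the command succeed if the label is already set.
    kubectl label "node/${node}" nvidia.com/gpu.deploy.operands=false --overwrite
    kubectl label "node/${node}" nvidia.com/gpu.deploy.driver=false --overwrite
  done
}

# Usage with the example node IPs from this topic:
#   label_gpu_nodes 10.240.0.64 10.240.0.66
```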
{: #gpu-migrate-prereqs}
- Review the NVIDIA GPU Operator documentation{: external}.
- Ensure you have cluster administrator access.
- Plan your upgrade strategy based on your cluster configuration (single GPU node vs. multiple GPU nodes).
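One way to approximate the administrator access check from the command line is `kubectl auth can-i`. The sketch below assumes that Kubernetes RBAC permission on all resources is what you need. It does not check IBM Cloud IAM roles, and `check_cluster_admin` is a hypothetical name.

```shell
# Hypothetical helper: succeed only if the current kubeconfig user may
# perform any verb on any resource in all namespaces.
check_cluster_admin() {
  if kubectl auth can-i '*' '*' --all-namespaces >/dev/null 2>&1; then
    echo "cluster-admin access: ok"
  else
    echo "cluster-admin access: missing" >&2
    return 1
  fi
}

# Usage:
#   check_cluster_admin || echo "Request additional access before you continue."
```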
{: #gpu-migrate-examples}
The following examples demonstrate how to migrate your cluster based on the number of GPU nodes.
{: #gpu-migrate-single-node}
This example demonstrates migrating a cluster with a single GPU node. Because the sole GPU node will be unavailable during the upgrade, you add a temporary second GPU worker to maintain capacity.
{: #gpu-migrate-single-initial-state}
-
Check the cluster control plane version.
ibmcloud ks cluster get -c <cluster_name>
{: pre}
Example output showing version 1.35:
Master Status: Ready State: deployed Health: normal Version: 1.35.4_1528
{: screen}
-
Check the worker node version.
ibmcloud ks worker ls -c <cluster_name>
{: pre}
Example output showing a single GPU node:
ID                                                       Primary IP    Flavor         State    Status   Zone         Version       Operating System
test-d8397vk20kb65iocenn0-btspstggput-default-000001e4   10.240.0.64   gx3.16x80.l4   normal   Ready    us-south-1   1.35.4_1528   UBUNTU_24_64
{: screen}
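When a cluster has more than a few workers, it can help to list only the workers that are not yet on the target version. The following sketch parses the table printed by `ibmcloud ks worker ls`. The function name `workers_not_on` is hypothetical, and the parsing assumes that each data row has the version in the seventh whitespace-separated field, as in the example output above. Rows whose State or Status contain spaces would not be parsed correctly.

```shell
# Hypothetical helper: print "ID VERSION" for every worker whose version
# does not start with the given major.minor value (for example 1.36).
# Reads the output of `ibmcloud ks worker ls` on stdin.
workers_not_on() {
  target="$1"
  awk -v t="$target" '
    # Keep only data rows: field 7 must look like a version number.
    $7 ~ /^[0-9]+\.[0-9]+\./ && index($7, t ".") != 1 { print $1, $7 }
  '
}

# Usage (CLUSTER_NAME is a placeholder):
#   ibmcloud ks worker ls -c CLUSTER_NAME | workers_not_on 1.36
```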
{: #gpu-migrate-single-install-operator}
If you followed the steps in Preparing for migration before version 1.36 is available, you might have already completed this step.
-
Label the existing GPU worker node to prevent the operator from deploying resources that would conflict with pre-installed drivers.
kubectl label node/10.240.0.64 nvidia.com/gpu.deploy.operands=false
kubectl label node/10.240.0.64 nvidia.com/gpu.deploy.driver=false
{: pre}
-
Add the NVIDIA Helm repository and install the GPU operator.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
{: pre}
-
Verify that the GPU operator pods are running. Note that driver installer, device plugin, container toolkit, and DCGM exporter should NOT be running on the labeled node.
kubectl get pods -n gpu-operator -o wide
{: pre}
Example output:
NAME                                                              READY   STATUS    RESTARTS   AGE     IP             NODE          NOMINATED NODE   READINESS GATES
gpu-operator-1778819096-node-feature-discovery-gc-84d98bd6nqw2z   1/1     Running   0          2m21s   172.17.64.94   10.240.0.64   <none>           <none>
gpu-operator-1778819096-node-feature-discovery-master-6b6cpnm9w   1/1     Running   0          2m21s   172.17.64.93   10.240.0.64   <none>           <none>
gpu-operator-1778819096-node-feature-discovery-worker-hm7ft       1/1     Running   0          2m21s   172.17.64.91   10.240.0.64   <none>           <none>
gpu-operator-76c686b9df-kn4dw                                     1/1     Running   0          2m21s   172.17.64.92   10.240.0.64   <none>           <none>
{: screen}
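The check that no driver components run on a labeled node can also be scripted. The sketch below is an illustration under stated assumptions: `operands_on_node` is a hypothetical name, the pod name prefixes are taken from the example outputs in this topic and might differ in your operator version, and the node is read from the seventh column of `kubectl get pods -o wide`.

```shell
# Hypothetical helper: list driver-stack pods that run on the given node.
# Reads `kubectl get pods -n gpu-operator -o wide` output on stdin.
# Returns 0 when the node is clean and 1 when any such pod is found.
operands_on_node() {
  node="$1"
  awk -v n="$node" '
    $7 == n && $1 ~ /^nvidia-(driver|device-plugin|container-toolkit)-daemonset|^nvidia-dcgm-exporter/ {
      print $1
      found = 1
    }
    END { exit (found ? 1 : 0) }
  '
}

# Usage:
#   kubectl get pods -n gpu-operator -o wide | operands_on_node 10.240.0.64 \
#     && echo "No driver components on the labeled node."
```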
{: #gpu-migrate-single-upgrade-master}
-
Upgrade the cluster control plane to version 1.36.
ibmcloud ks cluster master update --cluster <cluster_name> --version 1.36.0
{: pre}
-
Verify the control plane upgrade.
ibmcloud ks cluster get -c <cluster_name>
{: pre}
Example output:
Master Status: Ready State: deployed Health: normal Version: 1.36.0_1506
{: screen}
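Because the control plane update takes time, you might prefer to poll for the new version instead of rerunning the command by hand. The following is a rough sketch: `master_version` and `wait_for_master` are hypothetical names, the 60-second interval is arbitrary, and the parsing assumes only that the output contains a `Version:` field like the example above. It does not confirm that the master status is `Ready`, so still review the full output.

```shell
# Hypothetical helper: extract the first version value that follows "Version:".
# Reads `ibmcloud ks cluster get` output on stdin.
master_version() {
  sed -n 's/.*Version:[[:space:]]*\([0-9][0-9._]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: poll until the reported version starts with the
# wanted major.minor value, for example 1.36.
wait_for_master() {
  cluster="$1"
  want="$2"
  while :; do
    current="$(ibmcloud ks cluster get -c "$cluster" | master_version)"
    case "$current" in
      "$want"*) echo "control plane is at $current"; return 0 ;;
    esac
    sleep 60
  done
}

# Usage (CLUSTER_NAME is a placeholder):
#   wait_for_master CLUSTER_NAME 1.36
```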
{: #gpu-migrate-single-add-temp}
-
Add a temporary second GPU worker to the cluster with Kubernetes version 1.36.
ibmcloud ks worker-pool create vpc-gen2 --name temp-gpu-pool --cluster <cluster_name> --flavor gx3.16x80.l4 --size-per-zone 1 --zone us-south-1
{: pre}
-
Verify the temporary node is ready.
ibmcloud ks worker ls -c <cluster_name>
{: pre}
Example output showing both the original and temporary nodes:
ID                                                       Primary IP    Flavor         State    Status   Zone         Version       Operating System
test-d8397vk20kb65iocenn0-btspstggput-default-000001e4   10.240.0.64   gx3.16x80.l4   normal   Ready    us-south-1   1.35.4_1528   UBUNTU_24_64
test-d8397vk20kb65iocenn0-tempgpupool-default-00000371   10.240.0.72   gx3.16x80.l4   normal   Ready    us-south-1   1.36.0_1507   UBUNTU_24_64
{: screen}
-
Verify GPU operator pods are running on the new temporary node.
kubectl get pods -n gpu-operator -o wide
{: pre}
Example output showing driver and device plugin running on the temporary node:
NAME                                       READY   STATUS      RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
gpu-feature-discovery-mw5km                1/1     Running     0          4m36s   172.17.116.201   10.240.0.72   <none>           <none>
nvidia-container-toolkit-daemonset-ns6p8   1/1     Running     0          4m36s   172.17.116.199   10.240.0.72   <none>           <none>
nvidia-cuda-validator-vgj45                0/1     Completed   0          96s     172.17.116.203   10.240.0.72   <none>           <none>
nvidia-dcgm-exporter-52cwh                 1/1     Running     0          4m36s   172.17.116.204   10.240.0.72   <none>           <none>
nvidia-device-plugin-daemonset-2ql7x       1/1     Running     0          4m36s   172.17.116.202   10.240.0.72   <none>           <none>
nvidia-driver-daemonset-zql6m              1/1     Running     0          5m29s   172.17.116.197   10.240.0.72   <none>           <none>
nvidia-operator-validator-44xtz            1/1     Running     0          4m36s   172.17.116.200   10.240.0.72   <none>           <none>
{: screen}
-
Verify GPU readiness on the temporary node.
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable}'
{: pre}
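The `jq` command prints every allocatable resource. If you only want the GPU count per node, a narrower view is possible with `kubectl` custom columns. This sketch assumes the device plugin advertises the `nvidia.com/gpu` resource, which is the usual name but is not stated in this topic, and `gpu_allocatable` is a hypothetical function name.

```shell
# Hypothetical helper: show each node with its allocatable nvidia.com/gpu count.
# The backslash escapes the dot inside the resource name for the column path.
gpu_allocatable() {
  kubectl get nodes -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Usage:
#   gpu_allocatable
```

A node whose drivers and device plugin are not ready yet typically shows `<none>` in the GPU column.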
{: #gpu-migrate-single-upgrade-original}
-
Migrate your GPU workloads to the temporary 1.36 worker node. You can use node selectors, taints, or manual pod deletion to move workloads.
-
Replace the original GPU node.
ibmcloud ks worker replace -w test-d8397vk20kb65iocenn0-btspstggput-default-000001e4 -c <cluster_name> --update
{: pre}
-
Verify the node upgrade.
ibmcloud ks worker ls -c <cluster_name>
{: pre}
Example output showing both nodes now on version 1.36:
ID                                                       Primary IP    Flavor         State    Status   Zone         Version       Operating System
test-d8397vk20kb65iocenn0-btspstggput-default-00000485   10.240.0.68   gx3.16x80.l4   normal   Ready    us-south-1   1.36.0_1507   UBUNTU_24_64
test-d8397vk20kb65iocenn0-tempgpupool-default-00000371   10.240.0.72   gx3.16x80.l4   normal   Ready    us-south-1   1.36.0_1507   UBUNTU_24_64
{: screen}
-
Verify GPU operator pods are running on the upgraded original node.
kubectl get pods -n gpu-operator -o wide
{: pre}
-
Verify all GPU workloads are running.
kubectl get pods -o wide
{: pre}
Example output:
NAME             READY   STATUS    RESTARTS   AGE    IP               NODE          NOMINATED NODE   READINESS GATES
gpu-burn-46pml   1/1     Running   0          6m8s   172.17.116.205   10.240.0.72   <none>           <none>
gpu-burn-z6xnh   1/1     Running   0          44m    172.17.121.28    10.240.0.68   <none>           <none>
{: screen}
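For the workload migration at the start of this step, one possible approach to manual pod deletion is to cordon the original node first so that recreated pods cannot land on it again. The sketch below assumes that your GPU pods are managed by a controller such as a Deployment, which recreates them after deletion. `move_gpu_pods` and the `app=gpu-burn` selector are hypothetical.

```shell
# Hypothetical helper: cordon a node, then delete the matching pods on it
# so that their controller reschedules them on another schedulable node.
move_gpu_pods() {
  node="$1"                  # node name from `kubectl get nodes`
  selector="$2"              # label selector for your GPU pods (assumption)
  namespace="${3:-default}"  # namespace of the GPU pods
  kubectl cordon "$node" || return 1
  kubectl delete pod -n "$namespace" -l "$selector" \
    --field-selector "spec.nodeName=${node}"
}

# Usage with the original node from this example:
#   move_gpu_pods 10.240.0.64 app=gpu-burn
```

Pods that are not managed by a controller are not recreated, so recreate those yourself on the temporary node.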
{: #gpu-migrate-single-cleanup}
After the original node is healthy and workloads are stable, you can optionally remove the temporary GPU node.
-
Delete the temporary worker pool.
ibmcloud ks worker-pool rm --cluster <cluster_name> --worker-pool temp-gpu-pool
{: pre}
-
Verify only the original node remains.
ibmcloud ks worker ls -c <cluster_name>
{: pre}
{: #gpu-migrate-multiple-nodes}
This example demonstrates migrating a cluster with two GPU nodes from Kubernetes version 1.35 to 1.36. With multiple nodes, you can upgrade nodes one at a time while maintaining GPU capacity.
{: #gpu-migrate-multiple-initial-state}
-
Check the cluster control plane version.
ibmcloud ks cluster get -c <cluster_name>
{: pre}
Example output showing version 1.35:
Master Status: Ready State: deployed Health: normal Version: 1.35.4_1528
{: screen}
-
Check the worker node versions.
ibmcloud ks worker ls -c <cluster_name>
{: pre}
Example output showing two GPU nodes:
ID                                                       Primary IP    Flavor         State    Status   Zone         Version       Operating System
test-d8397vk20kb65iocenn0-btspstggput-default-000001e4   10.240.0.64   gx3.16x80.l4   normal   Ready    us-south-1   1.35.4_1528   UBUNTU_24_64
test-d8397vk20kb65iocenn0-btspstggput-default-0000024b   10.240.0.66   gx3.16x80.l4   normal   Ready    us-south-1   1.35.4_1528   UBUNTU_24_64
{: screen}
{: #gpu-migrate-multiple-install-operator}
If you followed the steps in Preparing for migration before version 1.36 is available, you might have already completed this step.
-
Label the existing GPU worker nodes to prevent the operator from deploying resources that would conflict with pre-installed drivers.
kubectl label node/10.240.0.64 nvidia.com/gpu.deploy.operands=false
kubectl label node/10.240.0.64 nvidia.com/gpu.deploy.driver=false
kubectl label node/10.240.0.66 nvidia.com/gpu.deploy.operands=false
kubectl label node/10.240.0.66 nvidia.com/gpu.deploy.driver=false
{: pre}
-
Add the NVIDIA Helm repository and install the GPU operator.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
{: pre}
-
Verify that the GPU operator pods are running. Note that driver installer, device plugin, container toolkit, and DCGM exporter should NOT be running on the labeled nodes.
kubectl get pods -n gpu-operator -o wide
{: pre}
Example output:
NAME                                                              READY   STATUS    RESTARTS   AGE     IP              NODE          NOMINATED NODE   READINESS GATES
gpu-operator-1778819096-node-feature-discovery-gc-84d98bd6nqw2z   1/1     Running   0          2m21s   172.17.64.94    10.240.0.64   <none>           <none>
gpu-operator-1778819096-node-feature-discovery-master-6b6cpnm9w   1/1     Running   0          2m21s   172.17.64.93    10.240.0.64   <none>           <none>
gpu-operator-1778819096-node-feature-discovery-worker-hm7ft       1/1     Running   0          2m21s   172.17.64.91    10.240.0.64   <none>           <none>
gpu-operator-1778819096-node-feature-discovery-worker-xh4qz       1/1     Running   0          2m21s   172.17.121.29   10.240.0.66   <none>           <none>
gpu-operator-76c686b9df-kn4dw                                     1/1     Running   0          2m21s   172.17.64.92    10.240.0.64   <none>           <none>
{: screen}
{: #gpu-migrate-multiple-upgrade-master}
-
Upgrade the cluster control plane to version 1.36.
ibmcloud ks cluster master update --cluster <cluster_name> --version 1.36.0
{: pre}
-
Verify the control plane upgrade.
ibmcloud ks cluster get -c <cluster_name>
{: pre}
Example output:
Master Status: Ready State: deployed Health: normal Version: 1.36.0_1506
{: screen}
{: #gpu-migrate-multiple-upgrade-first-worker}
-
Replace the first worker node.
ibmcloud ks worker replace -w test-d8397vk20kb65iocenn0-btspstggput-default-000001e4 -c <cluster_name> --update
{: pre}
-
Verify the node upgrade.
ibmcloud ks worker ls -c <cluster_name>
{: pre}
Example output:
ID                                                       Primary IP    Flavor         State    Status   Zone         Version       Operating System
test-d8397vk20kb65iocenn0-btspstggput-default-0000024b   10.240.0.66   gx3.16x80.l4   normal   Ready    us-south-1   1.35.4_1528   UBUNTU_24_64
test-d8397vk20kb65iocenn0-btspstggput-default-00000371   10.240.0.72   gx3.16x80.l4   normal   Ready    us-south-1   1.36.0_1507   UBUNTU_24_64
{: screen}
-
Check the GPU workload status. GPU workloads that are scheduled on the new node remain in `Pending` state until the driver is installed.
kubectl get pods -o wide
{: pre}
Example output:
NAME             READY   STATUS    RESTARTS   AGE   IP              NODE          NOMINATED NODE   READINESS GATES
gpu-burn-46pml   0/1     Pending   0          18s   <none>          <none>        <none>           <none>
gpu-burn-z6xnh   1/1     Running   0          38m   172.17.121.28   10.240.0.66   <none>           <none>
{: screen}
-
Verify GPU operator pods are running on the new node.
kubectl get pods -n gpu-operator -o wide
{: pre}
Example output showing driver and device plugin running on the new node:
NAME                                       READY   STATUS      RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
gpu-feature-discovery-mw5km                1/1     Running     0          4m36s   172.17.116.201   10.240.0.72   <none>           <none>
nvidia-container-toolkit-daemonset-ns6p8   1/1     Running     0          4m36s   172.17.116.199   10.240.0.72   <none>           <none>
nvidia-cuda-validator-vgj45                0/1     Completed   0          96s     172.17.116.203   10.240.0.72   <none>           <none>
nvidia-dcgm-exporter-52cwh                 1/1     Running     0          4m36s   172.17.116.204   10.240.0.72   <none>           <none>
nvidia-device-plugin-daemonset-2ql7x       1/1     Running     0          4m36s   172.17.116.202   10.240.0.72   <none>           <none>
nvidia-driver-daemonset-zql6m              1/1     Running     0          5m29s   172.17.116.197   10.240.0.72   <none>           <none>
nvidia-operator-validator-44xtz            1/1     Running     0          4m36s   172.17.116.200   10.240.0.72   <none>           <none>
{: screen}
-
Verify the GPU workload scheduled on the new node.
kubectl get pods -o wide
{: pre}
Example output:
NAME             READY   STATUS    RESTARTS   AGE    IP               NODE          NOMINATED NODE   READINESS GATES
gpu-burn-46pml   1/1     Running   0          6m8s   172.17.116.205   10.240.0.72   <none>           <none>
gpu-burn-z6xnh   1/1     Running   0          44m    172.17.121.28    10.240.0.66   <none>           <none>
{: screen}
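Instead of listing pods repeatedly while the driver installs, you could block until the operator's daemon sets finish rolling out. This is a sketch under assumptions: the daemon set names are inferred from the pod names in the example output (`nvidia-driver-daemonset`, `nvidia-device-plugin-daemonset`) and might differ in your operator version, the 15-minute timeout is arbitrary, and `wait_for_gpu_stack` is a hypothetical name.

```shell
# Hypothetical helper: wait for the driver and device plugin daemon sets
# in the gpu-operator namespace to report a completed rollout.
wait_for_gpu_stack() {
  kubectl rollout status daemonset/nvidia-driver-daemonset \
    -n gpu-operator --timeout=15m &&
  kubectl rollout status daemonset/nvidia-device-plugin-daemonset \
    -n gpu-operator --timeout=15m
}

# Usage:
#   wait_for_gpu_stack && kubectl get pods -o wide
```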
{: #gpu-migrate-multiple-upgrade-remaining}
-
Repeat Step 4 for each remaining GPU node, upgrading one node at a time.
-
After all nodes are upgraded, verify all worker nodes are running version 1.36.
ibmcloud ks worker ls -c <cluster_name>
{: pre}
Example output:
ID                                                       Primary IP    Flavor         State    Status   Zone         Version       Operating System
test-d8397vk20kb65iocenn0-btspstggput-default-00000371   10.240.0.72   gx3.16x80.l4   normal   Ready    us-south-1   1.36.0_1507   UBUNTU_24_64
test-d8397vk20kb65iocenn0-btspstggput-default-0000048c   10.240.0.73   gx3.16x80.l4   normal   Ready    us-south-1   1.36.0_1507   UBUNTU_24_64
{: screen}
-
Verify all GPU operator pods are running.
kubectl get pods -n gpu-operator -o wide
{: pre}
-
Verify all GPU workloads are running.
kubectl get pods -o wide
{: pre}
Example output:
NAME             READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
gpu-burn-46pml   1/1     Running   0          18m     172.17.116.205   10.240.0.72   <none>           <none>
gpu-burn-ttcbt   1/1     Running   0          6m46s   172.17.75.77     10.240.0.73   <none>           <none>
{: screen}
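The one-node-at-a-time pattern for the remaining workers can be sketched as a loop that pauses after each replacement so that you can verify drivers and workloads before you continue. `replace_workers` is a hypothetical helper, and the manual pause is a design choice for this sketch. The `ibmcloud ks worker replace` command might also ask for its own confirmation.

```shell
# Hypothetical helper: replace the given workers one at a time.
# First argument: cluster name or ID. Remaining arguments: worker IDs.
replace_workers() {
  cluster="$1"
  shift
  for worker in "$@"; do
    ibmcloud ks worker replace -w "$worker" -c "$cluster" --update || return 1
    # Pause so that you can check the gpu-operator pods and your workloads.
    printf 'Replaced %s. Verify the new node, then press Enter: ' "$worker"
    read -r reply || return 1
  done
}

# Usage (CLUSTER_NAME is a placeholder):
#   replace_workers CLUSTER_NAME test-d8397vk20kb65iocenn0-btspstggput-default-0000024b
```

Pausing between nodes keeps at least one GPU node serving workloads while the next one is replaced.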
{: #gpu-migrate-next-steps}
- Monitor your GPU workloads to ensure they are running correctly.
- Review the NVIDIA GPU Operator documentation{: external} for advanced configuration options.
- Set up monitoring for GPU metrics using the NVIDIA DCGM exporter.
{: #gpu-migrate-related}