## Problem
When using `autoscaler.min: 0` on a node pool, the node is provisioned on scale-up but is never removed when idle. Tested over 8+ cluster rebuilds across two days, with and without GPU Operator, KubeAI, and taints; scale-down never triggered in any test.
## Possible Root Cause
The cluster-autoscaler enters a permanently stuck state after the initial scale-up. During node provisioning, the autoscaler reports "readiness not found" for the new node group. After this phase, it stops producing meaningful logs, emitting only node_instances_cache refreshes. It never evaluates scale-down candidates, even when the node has zero workload pods.
Cluster-autoscaler logs (stuck after 16:29, never recovers):

```
16:29:20Z clusterstate.go:700 Readiness for node group gpu-gcp-eexjr4t not found
16:29:20Z orchestrator.go:623 Node group gpu-gcp-eexjr4t is not ready for scaleup - unhealthy
16:29:30Z clusterstate.go:495 Failed to find readiness information for gpu-gcp-eexjr4t
... then only cache refreshes forever, no scale-down evaluation
```
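The suspected mechanism can be sketched as a toy model (illustrative only; this is not the actual cluster-autoscaler clusterstate code): if the readiness entry for a node group is never populated, health checks keep treating the group as unhealthy, and the loop bails out before ever reaching scale-down evaluation.

```go
package main

import "fmt"

// Toy model of readiness gating (hypothetical simplification of the
// behavior seen in the logs above, not the real implementation).
type clusterState struct {
	readiness map[string]bool // node group name -> last known readiness
}

func (c *clusterState) isHealthy(group string) bool {
	ready, found := c.readiness[group]
	if !found {
		// "Readiness for node group ... not found" => treated as unhealthy.
		return false
	}
	return ready
}

func (c *clusterState) runLoop(groups []string) string {
	for _, g := range groups {
		if !c.isHealthy(g) {
			// An unhealthy group short-circuits the loop, so
			// scale-down candidates are never considered.
			return fmt.Sprintf("skip scale-down: %s unhealthy", g)
		}
	}
	return "evaluate scale-down candidates"
}

func main() {
	// The readiness entry for the new group is never populated.
	cs := &clusterState{readiness: map[string]bool{}}
	fmt.Println(cs.runLoop([]string{"gpu-gcp-eexjr4t"}))
	// -> skip scale-down: gpu-gcp-eexjr4t unhealthy
}
```

This would match the observed symptom: the group stays "unhealthy" forever, and no scale-down evaluation is ever logged.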
The adapter sidecar is functional: when it starts successfully, it handles Refresh, NodeGroups, NodeGroupForNode, and NodeGroupTargetSize requests. The problem is that the cluster-autoscaler process itself gets stuck.

Additional issue: the adapter has a startup race condition. It takes ~6 min to bind gRPC port 50000 (it waits for the Manager), and on pod restart it sometimes never binds at all.
## Tested Configurations
All produced the same result (no scale-down):
| Test | GPU Operator | KubeAI | Taints | Result |
|---|---|---|---|---|
| 1 | Yes | Yes (Ollama) | Yes | No scale-down |
| 2 | Yes | Yes (vLLM) | Yes | No scale-down |
| 3 | Yes | No | Yes | No scale-down |
| 4 | No | No | No | No scale-down |
Test 4 is the minimal reproduction: a bare cluster with no GPU Operator, no KubeAI, and no taints. A simple busybox pod was scheduled on the GPU node, then deleted. The GPU node stayed indefinitely.
## Reproduction (Minimal)
- Apply the InputManifest below
- Wait for cluster to build (control node only, GPU pool at 0)
- Remove the control plane taint:

  ```sh
  kubectl taint nodes -l node-role.kubernetes.io/control-plane node-role.kubernetes.io/control-plane-
  ```

- Deploy a pod targeting the GPU node:
  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-test
  spec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-role.kubernetes.io/control-plane
                  operator: DoesNotExist
    containers:
      - name: sleep
        image: busybox
        command: ["sleep", "3600"]
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
  ```

- Wait for the GPU node to provision and the pod to become Running (~10 min)
- Delete the pod:

  ```sh
  kubectl delete pod gpu-test
  ```

- Wait 20+ min: the GPU node is never removed
## InputManifest
```yaml
apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: inference-cluster
spec:
  providers:
    - name: hetzner-control
      providerType: hetzner
      templates:
        repository: "https://github.com/berops/claudie-config"
        tag: v0.10.0
        path: "templates/terraformer/hetzner"
      secretRef:
        name: hetzner-secret
        namespace: e2e-secrets
    - name: gcp-gpu
      providerType: gcp
      templates:
        repository: "https://github.com/berops/claudie-config"
        tag: v0.10.0
        path: "templates/terraformer/gcp"
      secretRef:
        name: gcp-secret
        namespace: e2e-secrets
  nodePools:
    dynamic:
      - name: control-hzn
        providerSpec:
          name: hetzner-control
          region: hel1
        count: 1
        serverType: cx33
        image: ubuntu-24.04
      - name: gpu-gcp
        providerSpec:
          name: gcp-gpu
          region: us-central1
          zone: us-central1-a
        autoscaler:
          min: 0
          max: 1
        serverType: g2-standard-8
        image: ubuntu-2404-noble-amd64-v20251001
        machineSpec:
          nvidiaGpuCount: 1
          nvidiaGpuType: nvidia-l4
  kubernetes:
    clusters:
      - name: inference
        version: v1.32.0
        network: 192.168.10.0/24
        pools:
          control:
            - control-hzn
          compute:
            - gpu-gcp
```

## Autoscaler Pod Details
```
containers:
  - name: cluster-autoscaler  (connects to localhost:50000, watches inference cluster API)
  - name: autoscaler-adapter  (gRPC server on :50000, bridges to Claudie Manager)
args:
  --cloud-provider=externalgrpc
  --ignore-daemonsets-utilization=true
  --balance-similar-node-groups=true
```