Bug: GPU node never scales down to 0 #2023

@samuelstolicny

Description

Problem

When using autoscaler.min: 0 on a node pool, the node is provisioned on scale-up but never removed when idle. Tested over 8+ cluster rebuilds across two days, with and without GPU Operator, KubeAI, and taints. Scale-down never triggered in any test.

Possible Root Cause

The cluster-autoscaler enters a permanently stuck state after the initial scale-up. During node provisioning, the autoscaler reports "readiness not found" for the new node group. After this phase, it stops producing any meaningful logs, only node_instances_cache refreshes. It never evaluates scale-down candidates, even when the node has zero workload pods.

Cluster-autoscaler logs (stuck after 16:29, never recovers):

```
16:29:20Z  clusterstate.go:700  Readiness for node group gpu-gcp-eexjr4t not found
16:29:20Z  orchestrator.go:623  Node group gpu-gcp-eexjr4t is not ready for scaleup - unhealthy
16:29:30Z  clusterstate.go:495  Failed to find readiness information for gpu-gcp-eexjr4t
... then only cache refreshes forever, no scale-down evaluation
```

The adapter sidecar is functional: once it starts successfully, it serves Refresh, NodeGroups, NodeGroupForNode, and NodeGroupTargetSize requests. The problem is that the cluster-autoscaler process itself gets stuck.
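One way to confirm the stuck state from captured logs: a healthy cluster-autoscaler periodically logs scale-down evaluation, and the excerpt above contains none after the readiness errors. A minimal sketch (the excerpt is inlined into a temporary file here purely for illustration; against a live cluster you would pipe `kubectl logs` of the cluster-autoscaler container instead, and the grep pattern is an assumption about the relevant log keywords):

```shell
# Inline the captured excerpt; against a live cluster, use:
#   kubectl logs <autoscaler-pod> -c cluster-autoscaler
cat > /tmp/ca.log <<'EOF'
16:29:20Z  clusterstate.go:700  Readiness for node group gpu-gcp-eexjr4t not found
16:29:20Z  orchestrator.go:623  Node group gpu-gcp-eexjr4t is not ready for scaleup - unhealthy
16:29:30Z  clusterstate.go:495  Failed to find readiness information for gpu-gcp-eexjr4t
EOF

# A working autoscaler logs scale-down evaluation; this excerpt has none.
if ! grep -qiE 'scale[ _-]?down|unneeded' /tmp/ca.log; then
  echo "no scale-down activity logged"
fi
```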

Additional issue: the adapter has a startup race condition. It takes ~6 min to bind gRPC port 50000 (it waits for the Manager), and on pod restart it sometimes never binds at all.
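The startup race could be mitigated by gating readiness on the gRPC port. A hedged sketch for the autoscaler-adapter container, assuming the adapter serves (or could serve) the standard `grpc_health_v1` health service; Kubernetes-native gRPC probes are available since v1.24:

```yaml
# On the autoscaler-adapter container (assumes the standard gRPC health
# service is exposed on :50000; thresholds sized for the ~6 min Manager wait)
readinessProbe:
  grpc:
    port: 50000
  periodSeconds: 10
startupProbe:
  grpc:
    port: 50000
  periodSeconds: 10
  failureThreshold: 60   # allow up to 10 min before restarting the container
```

This would keep the cluster-autoscaler from dialing localhost:50000 before the adapter is actually listening, rather than failing into the unhealthy state.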

Tested Configurations

All produced the same result (no scale-down):

| Test | GPU Operator | KubeAI       | Taints | Result        |
|------|--------------|--------------|--------|---------------|
| 1    | Yes          | Yes (Ollama) | Yes    | No scale-down |
| 2    | Yes          | Yes (vLLM)   | Yes    | No scale-down |
| 3    | Yes          | No           | Yes    | No scale-down |
| 4    | No           | No           | No     | No scale-down |

Test 4 is the minimal reproduction: a bare cluster with no GPU Operator, no KubeAI, and no taints. A simple busybox pod was scheduled on the GPU node, then deleted; the GPU node stayed indefinitely.

Reproduction (Minimal)

1. Apply the InputManifest below.
2. Wait for the cluster to build (control node only, GPU pool at 0).
3. Remove the control-plane taint:

   ```
   kubectl taint nodes -l node-role.kubernetes.io/control-plane node-role.kubernetes.io/control-plane-
   ```

4. Deploy a pod targeting the GPU node:

   ```yaml
   apiVersion: v1
   kind: Pod
   metadata:
     name: gpu-test
   spec:
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
             - matchExpressions:
                 - key: node-role.kubernetes.io/control-plane
                   operator: DoesNotExist
     containers:
       - name: sleep
         image: busybox
         command: ["sleep", "3600"]
         resources:
           requests:
             cpu: "1"
             memory: "1Gi"
   ```

5. Wait for the GPU node to provision and the pod to become Running (~10 min).
6. Delete the pod: `kubectl delete pod gpu-test`
7. Wait 20+ min; the GPU node is never removed.
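The 20+ minute wait in the last step can be scripted instead of watched by hand. A minimal sketch of a generic poll helper; the `kubectl` expression in the trailing comment is the intended check (the node-pool label is a hypothetical placeholder, adjust to however your GPU nodes are labeled) and is not executed by the helper itself:

```shell
#!/bin/sh
# wait_until CMD TIMEOUT_SECONDS INTERVAL_SECONDS
# Repeatedly evaluates CMD until it succeeds (exit 0) or TIMEOUT elapses.
wait_until() {
  cmd=$1; timeout=$2; interval=$3; elapsed=0
  while ! eval "$cmd"; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Intended use against the cluster (hypothetical label selector):
# wait_until '[ "$(kubectl get nodes -l nodepool=gpu-gcp --no-headers | wc -l)" -eq 0 ]' 1800 30
```

If `wait_until` returns 1 after 30 minutes, the GPU node was never removed and the bug reproduced.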

InputManifest

```yaml
apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: inference-cluster
spec:
  providers:
    - name: hetzner-control
      providerType: hetzner
      templates:
        repository: "https://github.com/berops/claudie-config"
        tag: v0.10.0
        path: "templates/terraformer/hetzner"
      secretRef:
        name: hetzner-secret
        namespace: e2e-secrets
    - name: gcp-gpu
      providerType: gcp
      templates:
        repository: "https://github.com/berops/claudie-config"
        tag: v0.10.0
        path: "templates/terraformer/gcp"
      secretRef:
        name: gcp-secret
        namespace: e2e-secrets

  nodePools:
    dynamic:
      - name: control-hzn
        providerSpec:
          name: hetzner-control
          region: hel1
        count: 1
        serverType: cx33
        image: ubuntu-24.04

      - name: gpu-gcp
        providerSpec:
          name: gcp-gpu
          region: us-central1
          zone: us-central1-a
        autoscaler:
          min: 0
          max: 1
        serverType: g2-standard-8
        image: ubuntu-2404-noble-amd64-v20251001
        machineSpec:
          nvidiaGpuCount: 1
          nvidiaGpuType: nvidia-l4

  kubernetes:
    clusters:
      - name: inference
        version: v1.32.0
        network: 192.168.10.0/24
        pools:
          control:
            - control-hzn
          compute:
            - gpu-gcp
```

Autoscaler Pod Details

```
containers:
  - name: cluster-autoscaler   # connects to localhost:50000, watches inference cluster API
  - name: autoscaler-adapter   # gRPC server on :50000, bridges to Claudie Manager

args: --cloud-provider=externalgrpc
      --ignore-daemonsets-utilization=true
      --balance-similar-node-groups=true
```
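Note that none of the flags above touch scale-down behavior, so cluster-autoscaler defaults should apply (`--scale-down-enabled=true`, `--scale-down-unneeded-time=10m`, `--scale-down-delay-after-add=10m`). A hedged sketch of flags worth adding when reproducing, to rule out configuration and make the scale-down loop visible in the logs:

```
--v=4                           # verbose: logs scale-down candidate evaluation
--scale-down-enabled=true       # make the default explicit
--scale-down-unneeded-time=2m   # shorten the idle window for faster testing
--scale-down-delay-after-add=2m
```

Even with these, the bug as described should reproduce, since the autoscaler never reaches scale-down evaluation at all after the readiness errors.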

Labels: bug (Something isn't working)