Support static node role transitions (compute <-> control-plane) #1965

@samuelstolicny

Summary

Claudie currently does not support transitioning static nodes between compute (worker) and control-plane roles. When a user moves a static node from a compute pool to a control-plane pool (or vice versa), KubeOne fails with a validation error because the node is simply re-added without first being properly removed from its previous role.

Goal

Implement support for static node role transitions (compute <-> control-plane). When a static node's role is changed in the InputManifest, Claudie should handle the transition gracefully by:

  1. Draining the node from its current role (with a configurable timeout, e.g. 10 minutes, to allow running workloads to complete)
  2. Removing the node from the cluster
  3. Re-adding the node in its new role (control-plane or compute)

Steps to Reproduce (current broken behavior)

  1. Apply an InputManifest with a static node in the compute pool:
nodePools:
  static:
    - name: static-pool-02
      nodes:
        - endpoint: 78.47.18.140
          secretRef:
            name: static-nodes-key
            namespace: e2e-secrets

kubernetes:
  clusters:
    - name: hybrid-cluster
      pools:
        control:
          - aws-ctrl-nodes
        compute:
          - static-pool-02  # static node used as worker
  2. Wait for the cluster to be fully provisioned

  3. Apply an updated InputManifest that moves the same static node to the control-plane:

nodePools:
  static:
    - name: static-pool-01
      nodes:
        - endpoint: 78.47.18.140  # Same IP, now in control pool
          secretRef:
            name: static-nodes-key
            namespace: e2e-secrets

kubernetes:
  clusters:
    - name: hybrid-cluster
      pools:
        control:
          - static-pool-01  # static node now as control-plane
        compute:
          - aws-cmpt-nodes
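
Detecting the role change between the two manifests above could look roughly like the following. This is a minimal Python sketch, not Claudie code: the function names `endpoint_roles` and `role_transitions` are hypothetical, the manifests are reduced to the fields shown in the excerpts (`secretRef` is omitted), and a node is identified purely by its endpoint.

```python
from typing import Dict, List, Tuple

# The two manifest excerpts from the reproduction steps, as plain dicts.
OLD = {
    "nodePools": {"static": [
        {"name": "static-pool-02", "nodes": [{"endpoint": "78.47.18.140"}]},
    ]},
    "kubernetes": {"clusters": [{
        "name": "hybrid-cluster",
        "pools": {"control": ["aws-ctrl-nodes"], "compute": ["static-pool-02"]},
    }]},
}
NEW = {
    "nodePools": {"static": [
        {"name": "static-pool-01", "nodes": [{"endpoint": "78.47.18.140"}]},
    ]},
    "kubernetes": {"clusters": [{
        "name": "hybrid-cluster",
        "pools": {"control": ["static-pool-01"], "compute": ["aws-cmpt-nodes"]},
    }]},
}


def endpoint_roles(manifest: dict, cluster_name: str) -> Dict[str, str]:
    """Map each static node endpoint to "control" or "compute" (hypothetical helper)."""
    # Index static pools by name so cluster pool references can be resolved.
    static_pools = {
        pool["name"]: [node["endpoint"] for node in pool.get("nodes", [])]
        for pool in manifest.get("nodePools", {}).get("static", [])
    }
    cluster = next(
        c for c in manifest["kubernetes"]["clusters"] if c["name"] == cluster_name
    )
    roles: Dict[str, str] = {}
    for role in ("control", "compute"):
        for pool_name in cluster["pools"].get(role, []):
            # Non-static pools (e.g. aws-ctrl-nodes) are not indexed and are skipped.
            for endpoint in static_pools.get(pool_name, []):
                roles[endpoint] = role
    return roles


def role_transitions(old: dict, new: dict, cluster_name: str) -> List[Tuple[str, str, str]]:
    """Return (endpoint, old_role, new_role) for static nodes whose role changed."""
    before = endpoint_roles(old, cluster_name)
    after = endpoint_roles(new, cluster_name)
    return sorted(
        (endpoint, before[endpoint], after[endpoint])
        for endpoint in before.keys() & after.keys()
        if before[endpoint] != after[endpoint]
    )
```

Matching on the endpoint rather than the pool name matters here, because the node pool is renamed (`static-pool-02` to `static-pool-01`) between the two manifests while the IP stays the same.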

Current Behavior

The configuration is accepted and processed until it reaches Kube-Eleven, where KubeOne fails with:

Error: configuration validation
staticWorkers.hosts[0]: Invalid value: "PublicAddress": public IP address "78.47.18.140" already used for control-plane node
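
The message suggests that the same public address ended up in both the control-plane host list and the static worker host list handed to KubeOne. A pre-flight check along the following lines could surface that conflict before KubeOne runs. This is only an illustrative Python sketch: `overlapping_addresses` is a hypothetical helper, not part of KubeOne or Claudie, and it does not reproduce KubeOne's own validation.

```python
from typing import Iterable, List


def overlapping_addresses(
    control_plane_hosts: Iterable[str],
    static_worker_hosts: Iterable[str],
) -> List[str]:
    """Return public addresses that appear in both host lists (hypothetical helper)."""
    return sorted(set(control_plane_hosts) & set(static_worker_hosts))


# The situation described by the error: the node was added as a control-plane
# host while still being listed as a static worker.
conflicts = overlapping_addresses(["78.47.18.140"], ["78.47.18.140"])
```

A non-empty result would indicate that a node has to be removed from its old role before being added in the new one.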

Expected Behavior

Claudie should detect that a static node's role has changed and perform a safe transition:

  1. Drain the node with a timeout (e.g. 10 minutes) to allow workloads to gracefully terminate
  2. Remove the node from the cluster in its old role
  3. Re-add the node to the cluster in its new role

This should work for both directions: compute -> control-plane and control-plane -> compute.
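
The ordering of these steps can be expressed as a small planning function that is symmetric in both directions. The sketch below is in Python for brevity and is not Claudie's implementation; `Step` and `plan_transition` are hypothetical names, and the role labels follow the `control` / `compute` keys used in the manifest.

```python
from dataclasses import dataclass
from typing import List

ROLES = ("compute", "control")


@dataclass(frozen=True)
class Step:
    action: str    # "drain", "remove", or "add"
    endpoint: str  # address identifying the static node
    role: str      # role the node holds while this step runs


def plan_transition(endpoint: str, old_role: str, new_role: str) -> List[Step]:
    """Return the ordered steps for moving a static node between roles."""
    if old_role not in ROLES or new_role not in ROLES:
        raise ValueError(f"unknown role: {old_role!r} -> {new_role!r}")
    if old_role == new_role:
        return []  # no role change, nothing to do
    return [
        Step("drain", endpoint, old_role),   # let workloads terminate first
        Step("remove", endpoint, old_role),  # leave the cluster in the old role
        Step("add", endpoint, new_role),     # rejoin in the new role
    ]
```

Because the plan depends only on the old and new role, `plan_transition("78.47.18.140", "compute", "control")` and the reverse call yield the same three actions with the roles swapped.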

Additional Context

  • The same transition logic should apply in both directions (worker to control-plane and control-plane to worker)
  • A drain timeout should be enforced to prevent indefinite hangs on workloads that refuse to terminate
  • The workflow should be: detect role change -> drain -> remove -> re-add in new role
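
One way to enforce the drain timeout is to rely on the `--timeout` flag of `kubectl drain` and add an outer guard in case the process itself hangs. The Python sketch below assumes the drain is performed by shelling out to `kubectl`, which this issue does not state; the function names, the 30-second grace period, and the node name in the usage note are hypothetical.

```python
import subprocess
from typing import List


def drain_command(node_name: str, timeout_seconds: int = 600) -> List[str]:
    """Build a `kubectl drain` invocation with a bounded wait (default 10 minutes)."""
    return [
        "kubectl", "drain", node_name,
        "--ignore-daemonsets",            # DaemonSet pods cannot be evicted
        "--delete-emptydir-data",         # allow evicting pods that use emptyDir
        f"--timeout={timeout_seconds}s",  # kubectl gives up after this long
    ]


def drain(node_name: str, timeout_seconds: int = 600) -> bool:
    """Run the drain and report success; never block much longer than the timeout."""
    try:
        result = subprocess.run(
            drain_command(node_name, timeout_seconds),
            timeout=timeout_seconds + 30,  # assumed grace period on top of kubectl's own timeout
            check=False,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

A caller would run `drain("static-node-1")` and then decide, when it returns `False`, whether to abort the transition or to proceed with removal anyway; that policy is left open here.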
