Summary
Claudie currently does not support transitioning static nodes between compute (worker) and control-plane roles. When a user moves a static node from a compute pool to a control-plane pool (or vice versa), KubeOne fails with a validation error because the node is simply re-added without first being properly removed from its previous role.
Goal
Implement support for static node role transitions (compute <-> control-plane). When a static node's role is changed in the InputManifest, Claudie should handle the transition gracefully by:
- Draining the node from its current role (with a configurable timeout, e.g. 10 minutes, to allow running workloads to complete)
- Removing the node from the cluster
- Re-adding the node in its new role (control-plane or compute)
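Detecting that a static node changed role amounts to diffing the endpoint-to-role mapping between the previous and the desired manifest. A minimal sketch of that diff, using simplified stand-in types (`Manifest`, `Role`, `transitions` are all hypothetical, not Claudie's actual structures):

```go
package main

import "fmt"

// Role of a static node within the cluster.
type Role string

const (
	Control Role = "control"
	Compute Role = "compute"
)

// Manifest is a simplified stand-in for Claudie's InputManifest:
// pool name -> node endpoints, plus which pools are control-plane.
type Manifest struct {
	Pools        map[string][]string
	ControlPools map[string]bool
}

// rolesByEndpoint maps each static node endpoint to its role.
func rolesByEndpoint(m Manifest) map[string]Role {
	roles := map[string]Role{}
	for pool, endpoints := range m.Pools {
		role := Compute
		if m.ControlPools[pool] {
			role = Control
		}
		for _, ep := range endpoints {
			roles[ep] = role
		}
	}
	return roles
}

// transitions returns endpoints whose role changed between the previous
// and desired manifests; these need the drain -> remove -> re-add flow.
func transitions(prev, desired Manifest) map[string][2]Role {
	prevRoles := rolesByEndpoint(prev)
	changed := map[string][2]Role{}
	for ep, newRole := range rolesByEndpoint(desired) {
		if oldRole, ok := prevRoles[ep]; ok && oldRole != newRole {
			changed[ep] = [2]Role{oldRole, newRole}
		}
	}
	return changed
}

func main() {
	prev := Manifest{
		Pools:        map[string][]string{"static-pool-02": {"78.47.18.140"}},
		ControlPools: map[string]bool{},
	}
	desired := Manifest{
		Pools:        map[string][]string{"static-pool-01": {"78.47.18.140"}},
		ControlPools: map[string]bool{"static-pool-01": true},
	}
	// Prints the node(s) needing a transition and their old/new roles.
	fmt.Println(transitions(prev, desired))
}
```

Keying the diff on the node endpoint rather than the pool name matters here: in the reproduction below the pool is renamed (`static-pool-02` to `static-pool-01`) while the node's IP stays the same.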
Steps to Reproduce (current broken behavior)
- Apply an InputManifest with a static node in the compute pool:

```yaml
nodePools:
  static:
    - name: static-pool-02
      nodes:
        - endpoint: 78.47.18.140
          secretRef:
            name: static-nodes-key
            namespace: e2e-secrets
kubernetes:
  clusters:
    - name: hybrid-cluster
      pools:
        control:
          - aws-ctrl-nodes
        compute:
          - static-pool-02 # static node used as worker
```

- Wait for the cluster to be fully provisioned
- Apply an updated InputManifest that moves the same static node to the control-plane:
```yaml
nodePools:
  static:
    - name: static-pool-01
      nodes:
        - endpoint: 78.47.18.140 # Same IP, now in control pool
          secretRef:
            name: static-nodes-key
            namespace: e2e-secrets
kubernetes:
  clusters:
    - name: hybrid-cluster
      pools:
        control:
          - static-pool-01 # static node now as control-plane
        compute:
          - aws-cmpt-nodes
```

Current Behavior
The configuration is accepted and processed until it reaches Kube-Eleven, where KubeOne fails with:

```
Error: configuration validation
staticWorkers.hosts[0]: Invalid value: "PublicAddress": public IP address "78.47.18.140" already used for control-plane node
```
Expected Behavior
Claudie should detect that a static node's role has changed and perform a safe transition:
- Drain the node with a timeout (e.g. 10 minutes) to allow workloads to gracefully terminate
- Remove the node from the cluster in its old role
- Re-add the node to the cluster in its new role
This should work for both directions: compute -> control-plane and control-plane -> compute.
Additional Context
- The same transition logic should apply in both directions (worker to control-plane and control-plane to worker)
- A drain timeout should be enforced to prevent indefinite hangs on workloads that refuse to terminate
- The workflow should be: detect role change -> drain -> remove -> re-add in new role