I tried to restore an etcd backup taken from a freshly deployed k0s cluster and perform a disaster recovery on a different cluster deployed from the same template. There appear to be several issues, which I'd like to showcase here. Unfortunately, no manual intervention got the cluster back to a healthy state afterwards, beyond a partially working single-controller instance.
Here is a list of issues I observed:
Steps performed:
Restore step completed without issues:
➜ mke3 git:(main) ✗ kubectl get ns
NAME STATUS AGE
calico-apiserver Active 11m
calico-system Active 11m
default Active 12m
k0rdent Active 11m
k0s-autopilot Active 11m
k0s-system Active 11m
kube-node-lease Active 12m
kube-public Active 12m
kube-system Active 12m
mgmt Active 6m46s
mke Active 6m32s
projectsveltos Active 7m24s
tigera-operator Active 11m
➜ mke3 git:(main) ✗ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-0-125.eu-central-1.compute.internal Ready control-plane 11m v1.32.6+k0s
ip-172-31-0-233.eu-central-1.compute.internal Ready <none> 11m v1.32.6+k0s
ip-172-31-0-32.eu-central-1.compute.internal Ready control-plane 11m v1.32.6+k0s
ip-172-31-0-60.eu-central-1.compute.internal Ready control-plane 12m v1.32.6+k0s
ip-172-31-0-68.eu-central-1.compute.internal Ready <none> 11m v1.32.6+k0s
ip-172-31-0-88.eu-central-1.compute.internal Ready <none> 11m v1.32.6+k0s
but the overall apply eventually failed while trying to join the second controller instance.
0.5. Old nodes present
It seems that the etcd restore preserves the previous nodes even in the k0sctl apply scenario. Shouldn't these always be removed, since apply will join them again with up-to-date IP addresses and configuration?
➜ mke3 git:(main) ✗ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-0-125.eu-central-1.compute.internal Ready control-plane 11m v1.32.6+k0s
ip-172-31-0-233.eu-central-1.compute.internal Ready <none> 11m v1.32.6+k0s
ip-172-31-0-32.eu-central-1.compute.internal Ready control-plane 11m v1.32.6+k0s
ip-172-31-0-60.eu-central-1.compute.internal Ready control-plane 12m v1.32.6+k0s
ip-172-31-0-68.eu-central-1.compute.internal Ready <none> 11m v1.32.6+k0s
ip-172-31-0-88.eu-central-1.compute.internal Ready <none> 11m v1.32.6+k0s
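For anyone who needs to clean up such leftover Node objects by hand in the meantime, the comparison could be sketched roughly as below. The helper name stale_nodes and the approach of passing the expected node names as arguments are my own assumptions for illustration, not something k0sctl provides.

```shell
#!/usr/bin/env bash
# stale_nodes (hypothetical helper): reads node names on stdin, one per line,
# and prints those that are NOT among the expected names given as arguments.
stale_nodes() {
  local name expected found
  while IFS= read -r name; do
    # Skip empty lines.
    if [ -z "$name" ]; then continue; fi
    found=0
    for expected in "$@"; do
      if [ "$name" = "$expected" ]; then found=1; break; fi
    done
    if [ "$found" -eq 0 ]; then printf '%s\n' "$name"; fi
  done
}

# Example usage (needs cluster access; the node names are placeholders):
#   kubectl get nodes -o name | sed 's|^node/||' \
#     | stale_nodes new-controller-1 new-worker-1
# Each printed name could then be removed with: kubectl delete node <name>
```

The function only prints candidates and deletes nothing itself, so the list can be reviewed before any node is removed.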
1. Kube-proxy errors
Here is the state I was able to observe by manually cancelling the restore before the second node joined:
The newly added node did not move to a Ready state.
Cause: Calico did not roll out properly because of kube-proxy. No amount of waiting led to the issue resolving itself.
Here are the logs from the kube-proxy pod:
➜ mke3 git:(main) ✗ kubectl logs kube-proxy-dcmfv -n kube-system --insecure-skip-tls-verify-backend=true
...
2025-10-27T13:40:22.146156841Z stderr F E1027 13:40:22.146060 1 event_broadcaster.go:279] "Unable to write event (may retry after sleeping)" err="Post \"https://84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com:6443/apis/events.k8s.io/v1/namespaces/default/events\": dial tcp: lookup 84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com on 172.31.0.2:53: no such host"
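The log suggests kube-proxy is still trying to reach a load balancer hostname that does not resolve. A small sketch for pulling such hostnames out of the logs is below; the helper name unresolved_hosts is hypothetical, and the pattern only assumes the "lookup <host> on <resolver>: no such host" wording seen in the line above.

```shell
#!/usr/bin/env bash
# unresolved_hosts (hypothetical helper): reads log lines on stdin and prints
# each distinct hostname that appears in a "lookup <host> on <resolver>: no such host" error.
unresolved_hosts() {
  sed -n 's/.*lookup \([^ ]*\) on [^ ]*: no such host.*/\1/p' | sort -u
}

# Example usage (needs cluster access):
#   kubectl logs kube-proxy-dcmfv -n kube-system --insecure-skip-tls-verify-backend=true \
#     | unresolved_hosts
```

Comparing the printed hostname with the API address configured for the new cluster should show whether kube-proxy is using an endpoint carried over from the backup.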
Restarting the kube-proxy pod helps:
➜ mke3 git:(main) ✗ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-0-60.eu-central-1.compute.internal Ready control-plane 30m v1.32.6+k0s
➜ mke3 git:(main) ✗ kubectl get pods -n calico-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-7669678ddb-z9z7t 1/1 Running 0 10m
calico-node-5gc9h 1/1 Running 0 9m36s
calico-typha-6b975b87d6-2js7b 1/1 Running 0 10m
csi-node-driver-rvb95 2/2 Running 0 32m
goldmane-77b796bd9-vtg2t 1/1 Running 0 10m
whisker-5d756b79c5-bdw9x 2/2 Running 0 10m
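For reference, the manual restart can be done for all kube-proxy pods at once. This is a minimal sketch, assuming the DaemonSet is named kube-proxy in kube-system (which matches the pod name above); the function name and the 120s timeout are arbitrary choices of mine.

```shell
#!/usr/bin/env bash
# KUBECTL can be overridden, e.g. KUBECTL="k0s kubectl" on a controller node.
KUBECTL="${KUBECTL:-kubectl}"

# restart_kube_proxy (hypothetical helper): recreates all kube-proxy pods
# and waits, with a bounded timeout, for the rollout to finish.
restart_kube_proxy() {
  # Trigger a rolling restart through the DaemonSet controller.
  $KUBECTL -n kube-system rollout restart daemonset/kube-proxy &&
    # Block until the new pods are ready or the timeout expires.
    $KUBECTL -n kube-system rollout status daemonset/kube-proxy --timeout=120s
}

# Example usage (needs cluster access):
#   restart_kube_proxy
```

A rollout restart avoids having to look up the individual pod names, which change after every restart.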
2. k0s kc logs - kubelet certificate issues
k0s kc logs reports a certificate error and only works with the --insecure-skip-tls-verify-backend=true flag.
➜ mke3 git:(main) ✗ kubectl logs kube-proxy-dcmfv -n kube-system
Error from server: Get "https://172.31.0.51:10250/containerLogs/kube-system/kube-proxy-dcmfv/kube-proxy": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes-ca")
➜ mke3 git:(main) ✗ kubectl logs kube-proxy-dcmfv -n kube-system --insecure-skip-tls-verify-backend=true
...
2025-10-27T13:40:22.146156841Z stderr F E1027 13:40:22.146060 1 event_broadcaster.go:279] "Unable to write event (may retry after sleeping)" err="Post \"https://84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com:6443/apis/events.k8s.io/v1/namespaces/default/events\": dial tcp: lookup 84kcma-mke4-lb-0c6d75d0c71c7034.elb.eu-central-1.amazonaws.com on 172.31.0.2:53: no such host"
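One way to check whether the kubelet's serving certificate was signed by a different CA than the one restored from the backup is to verify it against the CA file directly. This is only a sketch: the helper name is hypothetical, and the CA path /var/lib/k0s/pki/ca.crt assumes the default k0s data directory.

```shell
#!/usr/bin/env bash
# OPENSSL can be overridden if the binary lives elsewhere.
OPENSSL="${OPENSSL:-openssl}"

# verify_kubelet_cert (hypothetical helper): checks whether a PEM certificate
# chains to the given CA file. Exit status 0 means verification succeeded.
verify_kubelet_cert() {
  local ca_file="$1" cert_file="$2"
  $OPENSSL verify -CAfile "$ca_file" "$cert_file"
}

# Example usage on a controller (the address is taken from the error above,
# the file paths are assumptions):
#   openssl s_client -connect 172.31.0.51:10250 </dev/null 2>/dev/null \
#     | openssl x509 -outform PEM > /tmp/kubelet-serving.crt
#   verify_kubelet_cert /var/lib/k0s/pki/ca.crt /tmp/kubelet-serving.crt
```

If verification fails while the issuer name is still kubernetes-ca, that would be consistent with the node's certificate having been issued by a CA with the same name but a different key.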
3. ETCD cluster ID mismatch on second node join
After the second controller node joins, while waiting for the node to become ready, etcd on the initial node starts to time out. It appears that during the restore the new controller node assumes that it is the etcd leader and the only member:
Logs from k0scontroller on the second node report the following:
Questions
Kube-proxy startup issues → Requires a manual pod restart to recover.
Is it possible to automate this step in k0sctl?
Certificate verification errors → Logs are only accessible with --insecure-skip-tls-verify-backend=true.
Could kubelet apiServer certificates be restored/regenerated to include newly added nodes? No other certificates seemed to cause an issue at this stage.
ETCD cluster ID mismatch on controller join → Causes etcd timeouts and a broken multi-controller state.
Should restored clusters assume the current cluster ID, use the restored cluster ID, or manage etcd members in some other way?