Description
When an eks-rolling-update job fails, the previous cluster state is not automatically recovered; manual intervention is required instead:
2023-10-09 10:25:13,138 INFO InstanceId i-026bce300ffa7d8d0 is node ip-10-208-33-228.eu-central-1.compute.internal in kubernetes land
2023-10-09 10:25:13,138 INFO Draining worker node with kubectl drain ip-10-208-33-228.eu-central-1.compute.internal --ignore-daemonsets --delete-emptydir-data --timeout=300s...
node/ip-10-208-33-228.eu-central-1.compute.internal already cordoned
error: unable to drain node "ip-10-208-33-228.eu-central-1.compute.internal" due to error:cannot delete Pods declare no controller (use --force to override): gitlab-runner/runner-8e3ydbhn-project-1998-concurrent-8-ek1wq7qg, continuing command...
There are pending nodes to be drained:
ip-10-208-33-228.eu-central-1.compute.internal
cannot delete Pods declare no controller (use --force to override): gitlab-runner/runner-8e3ydbhn-project-1998-concurrent-8-ek1wq7qg
2023-10-09 10:25:13,990 INFO Node not drained properly. Exiting
2023-10-09 10:25:13,990 ERROR ('Rolling update on ASG failed', 'ci-runner-kas-20230710121010942300000012')
2023-10-09 10:25:13,990 ERROR *** Rolling update of ASG has failed. Exiting ***
2023-10-09 10:25:13,990 ERROR AWS Auto Scaling Group processes will need resuming manually
2023-10-09 10:25:13,990 ERROR Kubernetes Cluster Autoscaler will need resuming manually
Most notably, the AWS Auto Scaling Group processes remain suspended and the Cluster Autoscaler stays scaled down to 0 replicas. This is an issue because our workloads (especially CI) heavily depend on functioning auto-scaling.
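For reference, the manual recovery currently boils down to resuming the suspended ASG scaling processes, scaling the Cluster Autoscaler deployment back up, and uncordoning the node left behind by the failed drain. A minimal sketch of that clean-up, assuming boto3 and the official kubernetes Python client are available and that the autoscaler runs as a `cluster-autoscaler` deployment in `kube-system` (names below are assumptions, adjust for your cluster):

```python
# Sketch of the manual clean-up after a failed eks-rolling-update run.
# Assumes: AWS credentials and a kubeconfig are available; the Cluster Autoscaler
# runs as deployment "cluster-autoscaler" in "kube-system" (assumption, not from the log).
import boto3
from kubernetes import client, config

ASG_NAME = "ci-runner-kas-20230710121010942300000012"  # ASG from the log above
AUTOSCALER_DEPLOYMENT = "cluster-autoscaler"            # assumption
AUTOSCALER_NAMESPACE = "kube-system"                    # assumption

# 1. Resume the suspended AWS Auto Scaling Group processes.
autoscaling = boto3.client("autoscaling")
autoscaling.resume_processes(AutoScalingGroupName=ASG_NAME)

# 2. Scale the Kubernetes Cluster Autoscaler back up to 1 replica.
config.load_kube_config()
apps = client.AppsV1Api()
apps.patch_namespaced_deployment_scale(
    name=AUTOSCALER_DEPLOYMENT,
    namespace=AUTOSCALER_NAMESPACE,
    body={"spec": {"replicas": 1}},
)

# 3. Uncordon the node that stayed cordoned after the failed drain.
core = client.CoreV1Api()
core.patch_node(
    "ip-10-208-33-228.eu-central-1.compute.internal",
    {"spec": {"unschedulable": False}},
)
```

Ideally the job would perform this clean-up itself whenever it exits with an error, instead of leaving it to the operator.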