fix: change priorityclass default for nllb pod #7787
Signed-off-by: Case Wylie <cmwylie19@gmail.com>
|
Does the priority actually matter? We concluded in #6958, where this was considered before, that it probably doesn't, since nllb is running as static pods and the pod resource doesn't actually influence scheduling and execution of these pods. |
// The Envoy Pod is the worker's load-balanced path to the control
// plane, so it must outlive ordinary workloads during graceful node
// shutdown and be protected from node-pressure eviction.
PriorityClassName: "system-node-critical",
We need the same for the Traefik pod, as well, then.
AFAIK we need to use the numerical priorities only for static pods as kubelet does NOT understand the priority classes directly. It's the api-server / controller-manager that maps the prio class to numerical value at pod creation time.
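To make the name-to-number mapping concrete, here is a minimal Go sketch of the resolution step that happens on the API side and that the kubelet does not do for static pods. The two values are those of the built-in classes; the helper name `builtinPriority` is hypothetical and is not part of k0s or Kubernetes.

```go
package main

import "fmt"

// builtinPriority is a hypothetical helper, not a k0s or Kubernetes API.
// It mirrors the name-to-value resolution that happens at Pod creation
// time, limited to the two built-in priority classes.
func builtinPriority(className string) (int32, bool) {
	switch className {
	case "system-node-critical":
		return 2000001000, true
	case "system-cluster-critical":
		return 2000000000, true
	default:
		// Custom classes are API objects and cannot be resolved offline.
		return 0, false
	}
}

func main() {
	if p, ok := builtinPriority("system-node-critical"); ok {
		// A static Pod manifest would carry this number in spec.priority.
		fmt.Println("spec.priority:", p)
	}
}
```

Custom priority classes are deliberately left unresolved here, since their values are only known to the cluster.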
Gotcha, TY for the feedback. I will update this to adjust to priority only!
✅ @jnummelin let me know if you or anyone else wants to see any more changes
|
hmm, interesting. Based on this I personally think eviction/drain cases are treated the same as for other pods. I think we really need to do some testing on this. Also, I don't think we can use |
|
I think more importantly, priority affects the order of things in the case of a graceful node shutdown. Ref: https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/
Gotcha, I am going to add some debugging notes @Josh-Tracy from testing.

First, the config out of the box:

> k get cm -n kube-system worker-config-default-1.32 -o yaml | grep Grace
\"shutdownGracePeriod\":\"0s\",\"shutdownGracePeriodCriticalPods\":\"0s\"
> systemd-inhibit --list | grep kubelet # no output, good.

Updated config for this test:

> cat /etc/k0s/k0s.yaml
apiVersion: k0s.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0s
spec:
  featureGates:
    - name: ImageVolume
      enabled: true
  network:
    nodeLocalLoadBalancing:
      enabled: true
      type: EnvoyProxy
  workerProfiles:
    - name: default
      values:
        shutdownGracePeriod: 30s
        shutdownGracePeriodCriticalPods: 15s
> systemctl restart k0sworker
> systemctl restart k0scontroller

What the worker logs:

> journalctl -u k0sworker -f
The system will power off now!
Jun 11 14:34:17 canes-da6a k0s[1346]: time="2026-06-11 14:34:17" level=info msg="I0611 14:34:17.128688 1722 nodeshutdown_manager_linux.go:236] \"Shutdown manager detected new shutdown event, isNodeShuttingDownNow\" event=true" component=kubelet stream=stderr
Jun 11 14:34:17 canes-da6a k0s[1346]: time="2026-06-11 14:34:17" level=info msg="I0611 14:34:17.128729 1722 nodeshutdown_manager_linux.go:244] \"Shutdown manager detected new shutdown event\" event=\"shutdown\"" component=kubelet stream=stderr
Jun 11 14:34:17 canes-da6a k0s[1346]: time="2026-06-11 14:34:17" level=info msg="I0611 14:34:17.128761 1722 nodeshutdown_manager_linux.go:293] \"Shutdown manager processing shutdown event\"" component=kubelet stream=stderr
Jun 11 14:34:17 canes-da6a k0s[1346]: time="2026-06-11 14:34:17" level=info msg="I0611 14:34:17.129848 1722 setters.go:602] \"Node became not ready\" node=\"canes-da6a\" condition={\"type\":\"Ready\",\"status\":\"False\",\"lastHeartbeatTime\":\"2026-06-11T14:34:17Z\",\"lastTransitionTime\":\"2026-06-11T14:34:17Z\",\"reason\":\"KubeletNotReady\",\"message\":\"node is shutting down\"}" component=kubelet stream=stderr
Jun 11 14:34:17 canes-da6a k0s[1346]: time="2026-06-11 14:34:17" level=info msg="I0611 14:34:17.130053 1722 nodeshutdown_manager.go:153] \"Shutdown manager killing pod with gracePeriod\" pod=\"kube-system/coredns-9f76996f-vlrq2\" gracePeriod=15" component=kubelet stream=stderr
Jun 11 14:34:17 canes-da6a k0s[1346]: time="2026-06-11 14:34:17" level=info msg="I0611 14:34:17.130068 1722 nodeshutdown_manager.go:153] \"Shutdown manager killing pod with gracePeriod\" pod=\"kube-system/nllb-canes-da6a\" gracePeriod=15" component=kubelet stream=stderr

Result:
Failed to update status for pod ... Patch "https://[::1]:7443/...": unexpected EOF
... dial tcp [::1]:7443: connect: connection refused

Based on your link, would it be reasonable to bump/adjust the priority of the Traefik and NLLB pods? One enhancement we made on our end was to adjust the workerProfile, but we are unable to assign a priority to these critical workloads, which is why our shutdowns are not happening gracefully.

workerProfiles:
  - name: default
    values:
      shutdownGracePeriodByPodPriority:
        - priority: 0
          shutdownGracePeriodSeconds: 15
        - priority: 1000000 # postgres-operator-pod
          shutdownGracePeriodSeconds: 20
        - priority: 1000000000 # kubevirt-cluster-critical
          shutdownGracePeriodSeconds: 30
        - priority: 2000000000 # system-cluster-critical
          shutdownGracePeriodSeconds: 20
        - priority: 2000001000 # system-node-critical
          shutdownGracePeriodSeconds: 20
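For illustration, here is a rough Go sketch of how a pod's numeric priority could be matched against a list like the one above. It assumes, per my reading of the upstream node-shutdown docs, that a pod falls into the highest bucket whose priority does not exceed its own, and that buckets are processed from lowest to highest priority. The type and function names are made up for this sketch and are not the kubelet's own.

```go
package main

import (
	"fmt"
	"sort"
)

// gracePeriodBucket is a local stand-in for one entry of
// shutdownGracePeriodByPodPriority; it is not the kubelet's own type.
type gracePeriodBucket struct {
	Priority int32
	Seconds  int64
}

// bucketFor returns the bucket a Pod of the given priority falls into,
// assuming each Pod is assigned to the highest bucket whose priority does
// not exceed its own, or to the lowest bucket if none qualifies.
// buckets must be sorted by ascending Priority and must not be empty.
func bucketFor(buckets []gracePeriodBucket, podPriority int32) gracePeriodBucket {
	i := sort.Search(len(buckets), func(i int) bool {
		return buckets[i].Priority >= podPriority
	})
	switch {
	case i == len(buckets):
		// Above the highest bucket: use the highest one.
		i = len(buckets) - 1
	case i > 0 && buckets[i].Priority > podPriority:
		// Between two buckets: use the lower one.
		i--
	}
	return buckets[i]
}

func main() {
	buckets := []gracePeriodBucket{
		{0, 15}, {1000000, 20}, {1000000000, 30}, {2000000000, 20}, {2000001000, 20},
	}
	// A static Pod without a numeric priority is treated as priority 0 here.
	fmt.Println(bucketFor(buckets, 0).Priority)
	// The same Pod with the system-node-critical value lands in the last bucket.
	fmt.Println(bucketFor(buckets, 2000001000).Priority)
}
```

Under these assumptions, a pod with no numeric priority ends up in the first bucket, which is why the nllb and Traefik pods cannot currently be kept alive past ordinary workloads.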
|
Want to clarify that shutdown does happen gracefully as expected. The status of the workloads that were running just never updates as expected. |
Yes, that is my understanding. I have to say, the upstream docs are not really clear on this topic for static pods though. |
Signed-off-by: Case Wylie <cmwylie19@gmail.com>
Signed-off-by: Case Wylie <cmwylie19@gmail.com>
Had to add |
Signed-off-by: Case Wylie <cmwylie19@gmail.com>
|
Could you please revert the timeout changes? They won't make the integration tests less flaky. We already have that on the radar. I'll just retrigger them until they pass. That's kinda annoying, but some of the required fixes depend on upstream commits and aren't that trivial.
Signed-off-by: Case Wylie <cmwylie19@gmail.com>
perfect, done! |
|
Successfully created backport PR for |
Description
The nllb pod has a very low priority class, so during a graceful shutdown the kubelet removes it first. That severs the worker's path to the kas (kube-apiserver) before the other pods can drain or report their statuses.
This was a suggestion from @jnummelin. I still think we need a way to add graceful shutdown too (#7783), but @jnummelin suggests there may be another way.
Noticed an error in the tests: setting only the numeric Priority on the static Pod broke the check-nllb jobs. The kubelet runs the static Pod fine locally, but for every static Pod it also creates a read-only "mirror" Pod in the API so the Pod is visible to kubectl in the cluster, and that create call goes through normal admission. When it registers the mirror Pod in the kas, the Priority admission controller rejects an explicit integer priority that doesn't match a priorityClassName (priority admission controller computed 0 from the given PriorityClass name). So the Pod never became ready and the tests timed out. That is why I needed to add the priorityClassName too.

Fixes #7786
Relates to #7783
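The mirror Pod rejection described above can be sketched in Go roughly as follows. This is a simplified stand-in, not the actual admission plugin code: `admitPriority` is a hypothetical name, and the sketch assumes the computed value is 0 when no priorityClassName is set and no default class exists.

```go
package main

import (
	"errors"
	"fmt"
)

// admitPriority is a hypothetical, simplified stand-in for the check the
// Priority admission controller applies when a Pod is created.
// resolved is the value computed from priorityClassName; this sketch
// assumes it is 0 when no class name is set and no default class exists.
func admitPriority(specPriority *int32, resolved int32) error {
	if specPriority != nil && *specPriority != resolved {
		return errors.New("explicit priority does not match the value computed from priorityClassName")
	}
	return nil
}

func main() {
	p := int32(2000001000)
	// Numeric priority only: the computed value is 0, so the mirror Pod is rejected.
	fmt.Println(admitPriority(&p, 0))
	// Numeric priority plus a matching class name: the mirror Pod is admitted.
	fmt.Println(admitPriority(&p, 2000001000))
}
```

Setting both fields keeps the kubelet (which reads the number) and the admission check (which resolves the name) in agreement.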
Type of change
How Has This Been Tested?
Checklist