Skip to content

fix: change priorityclass default for nllb pod#7787

Merged
twz123 merged 5 commits into
k0sproject:mainfrom
cmwylie19:7786-nlln-pc
Jun 19, 2026
Merged

fix: change priorityclass default for nllb pod#7787
twz123 merged 5 commits into
k0sproject:mainfrom
cmwylie19:7786-nlln-pc

Conversation

@cmwylie19

@cmwylie19 cmwylie19 commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Description

The nllb pod has a very low priority class, during a graceful shutdown kubelet removes it first which results in a severed path from the worker to the kas (kube-apiserver) before the other pods can drain or report statuses.

This was a suggestion from @jnummelin. I still think we need a way to add graceful shutdown too (7783) but @jnummelin suggests there may be another way.

Noticed an error in the tests when setting only the numeric Priority on the static Pod broke the check-nllb jobs. Tthe kubelet runs the static Pod fine locally, but for every static Pod it also creates a read-only "mirror" Pod in the API so the Pod is visible to kubectl in the cluster, and that create call goes through normal admission. When it registers the mirror Pod in the kas, the Priority admission controller rejects an explicit integer priority that doesn't match a priorityClassName (priority admission controller computed 0 from the given PriorityClass name). So the Pod never became ready and the tests timed out. So that is why i needed to add the priorityClassName too

Fixes #7786
Relates to #7783

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • [n/a] Auto test added

Checklist

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Signed-off-by: Case Wylie <cmwylie19@gmail.com>
@cmwylie19 cmwylie19 requested review from a team as code owners June 11, 2026 22:03
@cmwylie19 cmwylie19 requested review from jnummelin and kke June 11, 2026 22:03
@pschichtel

pschichtel commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Does the priority actually matter? We concluded in #6958, where this was considered before, that it probably doesn't, since nllb is running as static pods and the pod resource doesn't actually influence scheduling and execution of these pods.

// The Envoy Pod is the worker's load-balanced path to the control
// plane, so it must outlive ordinary workloads during graceful node
// shutdown and be protected from node-pressure eviction.
PriorityClassName: "system-node-critical",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the same for the Traefik pod, as well, then.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK we need to use the numerical priorities only for static pods as kubelet does NOT understand the priority classes directly. It's the api-server / controller-manager that maps the prio class to numerical value at pod creation time.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, TY for the feedback. I will update this to adjust to priority only!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jnummelin let me know if you or anyone else wants to see any more changes

@jnummelin

Copy link
Copy Markdown
Member

hmm, interesting. Based on this I personally think eviction/drain cases are treated the same as for other pods.

I think we really need to do some testing on this.

Also, I don't think we can use priorityClass on a static pod. That's something the api-server uses to map into a priority number. So AFAIK kubelet doesn't really understand the priorityClass per se at all, only the numerical values.

@jnummelin

Copy link
Copy Markdown
Member

I think more importantly, priority affects the order of things in the case of a gracefull node shutdown:

During a graceful shutdown, kubelet terminates pods in two phases:

Terminate regular pods running on the node.
Terminate [critical pods](https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical) running on the node.

ref: https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/

@cmwylie19

cmwylie19 commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

hmm, interesting. Based on this I personally think eviction/drain cases are treated the same as for other pods.

I think we really need to do some testing on this.

Also, I don't think we can use priorityClass on a static pod. That's something the api-server uses to map into a priority number. So AFAIK kubelet doesn't really understand the priorityClass per se at all, only the numerical values.

Gotcha, I am going to add some debugging notes @Josh-Tracy from testing.

First config out of the box

> k get cm -n kube-system worker-config-default-1.32 -o yaml | grep Grace
\"shutdownGracePeriod\":\"0s\",\"shutdownGracePeriodCriticalPods\":\"0s\"

> systemd-inhibit --list | grep kubelet #no output, good.

Updated config for this test - 30s gives use 15 seconds for normal pods and 15 seconds for critical pods.

> cat /etc/k0s/k0s.yaml 
apiVersion: k0s.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0s
spec:
  featureGates:
    - name: ImageVolume
      enabled: true
  network:
    nodeLocalLoadBalancing:
      enabled: true
      type: EnvoyProxy
  workerProfiles:
    - name: default
      values:
        shutdownGracePeriod: 30s
        shutdownGracePeriodCriticalPods: 15s
> systemctl restart k0sworker
> systemctl restart k0scontroller

What the worker logs

> journalctl -u k0sworker -f
The system will power off now!

Jun 11 14:34:17 canes-da6a k0s[1346]: time="2026-06-11 14:34:17" level=info msg="I0611 14:34:17.128688    1722 nodeshutdown_manager_linux.go:236] \"Shutdown manager detected new shutdown event, isNodeShuttingDownNow\" event=true" component=kubelet stream=stderr
Jun 11 14:34:17 canes-da6a k0s[1346]: time="2026-06-11 14:34:17" level=info msg="I0611 14:34:17.128729    1722 nodeshutdown_manager_linux.go:244] \"Shutdown manager detected new shutdown event\" event=\"shutdown\"" component=kubelet stream=stderr
Jun 11 14:34:17 canes-da6a k0s[1346]: time="2026-06-11 14:34:17" level=info msg="I0611 14:34:17.128761    1722 nodeshutdown_manager_linux.go:293] \"Shutdown manager processing shutdown event\"" component=kubelet stream=stderr
Jun 11 14:34:17 canes-da6a k0s[1346]: time="2026-06-11 14:34:17" level=info msg="I0611 14:34:17.129848    1722 setters.go:602] \"Node became not ready\" node=\"canes-da6a\" condition={\"type\":\"Ready\",\"status\":\"False\",\"lastHeartbeatTime\":\"2026-06-11T14:34:17Z\",\"lastTransitionTime\":\"2026-06-11T14:34:17Z\",\"reason\":\"KubeletNotReady\",\"message\":\"node is shutting down\"}" component=kubelet stream=stderr
Jun 11 14:34:17 canes-da6a k0s[1346]: time="2026-06-11 14:34:17" level=info msg="I0611 14:34:17.130053    1722 nodeshutdown_manager.go:153] \"Shutdown manager killing pod with gracePeriod\" pod=\"kube-system/coredns-9f76996f-vlrq2\" gracePeriod=15" component=kubelet stream=stderr
Jun 11 14:34:17 canes-da6a k0s[1346]: time="2026-06-11 14:34:17" level=info msg="I0611 14:34:17.130068    1722 nodeshutdown_manager.go:153] \"Shutdown manager killing pod with gracePeriod\" pod=\"kube-system/nllb-canes-da6a\" gracePeriod=15" component=kubelet stream=stderr

Result:

the pods just sat in running, pending, or unknown when watching from the control node and the worker node's k0sworker was reporting the following for the pods. Might need to fix this? I think its the nllb pod being terminated immediately meaning the worker cannot communicate with the controllers API server and report status due to nllb having a priorityClass of 0

Failed to update status for pod ... Patch "https://[::1]:7443/...": unexpected EOF
... dial tcp [::1]:7443: connect: connection refused

Based on your link, would it be reasonable to bump/adjust the priority of Traefik and NLLB pod?

Once enhancement we did on our end was to adjust the workerProfile, but we are unable to assign priority to these critical workloads which is why our shutdowns are not happening gracefully.

  workerProfiles:
    - name: default
      values:
        shutdownGracePeriodByPodPriority:
          - priority: 0
            shutdownGracePeriodSeconds: 15
          - priority: 1000000           # postgres-operator-pod
            shutdownGracePeriodSeconds: 20
          - priority: 1000000000        # kubevirt-cluster-critical
            shutdownGracePeriodSeconds: 30
          - priority: 2000000000        # system-cluster-critical
            shutdownGracePeriodSeconds: 20
          - priority: 2000001000        # system-node-critical
            shutdownGracePeriodSeconds: 20

@Josh-Tracy

Copy link
Copy Markdown

Want to clarify that shutdown does happen gracefully as expected. The status of the workloads that were running just never updates as expected.

@jnummelin

Copy link
Copy Markdown
Member

Based on your link, would it be reasonable to bump/adjust the priority of Traefik and NLLB pod?

Yes, that is my understanding. I have to say, the upstream docs are not really clear on this topic for static pods though.

Signed-off-by: Case Wylie <cmwylie19@gmail.com>
twz123
twz123 previously approved these changes Jun 16, 2026

@twz123 twz123 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@twz123 twz123 added bug Something isn't working area/worker backport/release-1.36 PR that needs to be backported/cherrypicked to the release-1.36 branch labels Jun 16, 2026
Signed-off-by: Case Wylie <cmwylie19@gmail.com>
@cmwylie19

Copy link
Copy Markdown
Contributor Author

Thanks!

Had to add priorityClassName to satisfy the kas requirements for the priority admission controller of the static pod because the priority did not match an actual priority class. Didn't notice until i saw failing tests

Signed-off-by: Case Wylie <cmwylie19@gmail.com>
@twz123

twz123 commented Jun 18, 2026

Copy link
Copy Markdown
Member

Could you please revert the timeout changes? They won't make the integration tests less flaky. We already have that on the radar. I'll just retrigger them until they pass. That's kinda annoying, but some of the requred fixes require upstream commits and aren't that trivial.

Signed-off-by: Case Wylie <cmwylie19@gmail.com>
@cmwylie19

Copy link
Copy Markdown
Contributor Author

Could you please revert the timeout changes? They won't make the integration tests less flaky. We already have that on the radar. I'll just retrigger them until they pass. That's kinda annoying, but some of the requred fixes require upstream commits and aren't that trivial.

perfect, done!

@twz123 twz123 merged commit 955832c into k0sproject:main Jun 19, 2026
217 of 220 checks passed
@k0s-bot

k0s-bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Successfully created backport PR for release-1.36:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/worker backport/release-1.36 PR that needs to be backported/cherrypicked to the release-1.36 branch bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NLLB Envoy Pod should default its priorityClassName to system-node-critical

6 participants