Make applyingUpdate reconciler idempotent #7706

Open
bainaichang wants to merge 1 commit into k0sproject:main from bainaichang:fix-apply-update-idempotent
Conversation

@bainaichang

Description

When the client.Update call fails after successfully renaming k0s.tmp to k0s
(e.g. due to a resourceVersion conflict), the reconciler retries but fails
because k0s.tmp no longer exists. This results in an infinite error loop that
leaves the node stuck in ApplyingUpdate status.

Fix this by checking if the target k0s binary is already in place before
attempting the file rename. If it is, skip the file operations and proceed
directly to updating the signaling status to Restart.

Fixes: #7703

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added
  • go build passes
  • Existing unit tests pass (make check-unit)

The logic follows the same pattern as the fix in PR #6994 for schedulable.go,
which addressed a similar non-idempotent reconciler issue.

Checklist

  • My code follows the style of this project
  • My commit messages are in the proper format
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@jnummelin jnummelin marked this pull request as ready for review May 30, 2026 21:00
@jnummelin jnummelin requested review from a team as code owners May 30, 2026 21:00
@jnummelin jnummelin requested review from makhov and ncopa May 30, 2026 21:00
@jnummelin jnummelin marked this pull request as draft May 30, 2026 21:00
@bainaichang

Author

Thanks for taking a look! The 6 failing checks appear to be pre-existing
flaky tests unrelated to this change (calico-ipv6, kuberouter-ipv6, etc.).
Let me know if any changes are needed.

@twz123

twz123 commented Jun 1, 2026

Member

Thanks for having a look! I guess there's a fundamental logic error in your current version (and I am a bit appalled that none of the Autopilot integration tests failed). The target k0s executable always exists, so renaming will never happen and the update is never applied ...

What about the following idea: Using hard links to keep the executable around, even after replacing the target k0s path. In the apply phase, we could:

  1. Create a hard link from k0s.tmp to k0s.new, overwriting k0s.new if it already exists (or removing it first; I think there's no way for link to overwrite the target path).
  2. Do the rename from k0s.new to k0s.
  3. Update the resource status.
  4. Remove k0s.tmp.

A note on the integration tests not failing: the autopilot upgrade tests from v1.35.4 to this PR's version start the cluster with the old executable, so this PR's update logic is never executed, and the other self-update integration tests can't detect this, since the old and the new executable are the same anyway.

// By checking the target first, we make the reconciler idempotent.
k0sBinaryFilename := filepath.Join(r.k0sBinaryDir, "k0s")

if _, err := os.Stat(k0sBinaryFilename); err != nil && !errors.Is(err, os.ErrNotExist) {

Contributor

But won't it also detect the old k0s binary as existing and do nothing?

Also I think this form is easier to read:

if _, err := os.Stat(k0sBinaryFilename); err != nil {
    if !errors.Is(err, os.ErrNotExist) {
        return cr.Result{}, fmt.Errorf("unable to stat k0s binary '%s': %w", k0sBinaryFilename, err)
    }
    // Target binary not in place: proceed with the normal update flow
    ...
} else {
}

// By checking the target first, we make the reconciler idempotent.
k0sBinaryFilename := filepath.Join(r.k0sBinaryDir, "k0s")

if _, err := os.Stat(k0sBinaryFilename); err != nil && !errors.Is(err, os.ErrNotExist) {

I had pre-empted this earlier in #7703 (comment).

Wouldn't it be better to compare the binary with the expected one using the checksum? The info is already available in the signal annotation:

...v1.34.7+k0s.0-amd64","version":"v1.34.7+k0s.0","sha256":"f9e1335e2c4cc6e1cea3970d38bd5282d6382f4aea5050924ff50c520194619f"}...

@twz123 twz123 Jun 1, 2026

Member

I'd rather avoid trying to figure out if the replace already happened. I'd rather simply run the replacement again, unconditionally. My favorite is currently the hard link approach. No I/O, just a single syscall.

@bainaichang

Author

Thank you for catching this bug. I was indeed mistaken. I think it's definitely a good idea to go with the solution you proposed. I'll update the code accordingly. Thank you!

@bainaichang bainaichang force-pushed the fix-apply-update-idempotent branch 2 times, most recently from 0d2b2d2 to 29da209 Compare June 2, 2026 04:46
When client.Update fails after os.Rename(k0s.tmp, k0s), the reconciler
retries but k0s.tmp is gone, causing an infinite error loop that leaves
the node stuck in ApplyingUpdate status.

Replace the single rename with a hard link + rename sequence: create
k0s.new as a hard link to k0s.tmp, then rename k0s.new to k0s. Since
both k0s.tmp and k0s.new share the same inode, k0s.tmp survives the
rename. If client.Update fails and the reconciler re-triggers, the
checks on k0s.tmp correctly detect the pending work and the whole
sequence can be safely replayed.

Fixes: k0sproject#7703

Signed-off-by: bainaichang <3215903958@qq.com>
@bainaichang bainaichang force-pushed the fix-apply-update-idempotent branch from 29da209 to 31520d1 Compare June 2, 2026 04:53
@bainaichang

Author

Oh! I pushed a bit too many times; I didn’t expect the PR to update automatically!

@bainaichang bainaichang marked this pull request as ready for review June 3, 2026 07:01
@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

This pull request has merge conflicts that need to be resolved.

Development

Successfully merging this pull request may close these issues.

Autopilot applyingUpdate reconciler is non-idempotent — workers wedge forever after client.Update conflict

4 participants