Make applyingUpdate reconciler idempotent#7706
Conversation
|
Thanks for taking a look! The 6 failing checks appear to be pre-existing |
|
Thanks for having a look! I guess there's a fundamental logic error in your current version (and I am a bit appalled that none of the Autopilot integration tests failed). The target k0s executable always exists, so renaming will never happen and the update is never applied ... What about the following idea: Using hard links to keep the executable around, even after replacing the target k0s path. In the apply phase, we could:
A note on the integration tests not failing: The autopilot upgrade tests from v1.35.4 to this PR's version start the cluster with the old executable so this PR's update logic is never executed, and the other self-update integration tests can't detect this, since the old and the new executable are the same anyways. |
| // By checking the target first, we make the reconciler idempotent. | ||
| k0sBinaryFilename := filepath.Join(r.k0sBinaryDir, "k0s") | ||
|
|
||
| if _, err := os.Stat(k0sBinaryFilename); err != nil && !errors.Is(err, os.ErrNotExist) { |
There was a problem hiding this comment.
But won't it detect the old k0s binary as well as existing and do nothing?
Also I think this form is easier to read:
if _, err := os.Stat(k0sBinaryFilename); err != nil {
if !errors.Is(err, os.ErrNotExist) {
return cr.Result{}, fmt.Errorf("unable to stat k0s binary '%s': %w", k0sBinaryFilename, err)
}
// Target binary not in place — proceed with the normal update flow
...
} else {
}
| // By checking the target first, we make the reconciler idempotent. | ||
| k0sBinaryFilename := filepath.Join(r.k0sBinaryDir, "k0s") | ||
|
|
||
| if _, err := os.Stat(k0sBinaryFilename); err != nil && !errors.Is(err, os.ErrNotExist) { |
There was a problem hiding this comment.
I had pre-empted earlier in #7703 (comment).
Won't it be better to compare the binary with the expected one using the checksum? The info is available in the signal annotation already
...v1.34.7+k0s.0-amd64","version":"v1.34.7+k0s.0","sha256":"f9e1335e2c4cc6e1cea3970d38bd5282d6382f4aea5050924ff50c520194619f"}...
There was a problem hiding this comment.
I'd rather avoid trying to figure out if the replace already happened. I'd rather simply run the replacement again, unconditionally. My favorite is currently the hard link approach. No I/O, just a single syscall.
|
Thank you for catching this bug. I was indeed mistaken. I think it's definitely a good idea to go with the solution you proposed. I'll update the code accordingly. Thank you! |
0d2b2d2 to
29da209
Compare
When client.Update fails after os.Rename(k0s.tmp, k0s), the reconciler retries but k0s.tmp is gone, causing an infinite error loop that leaves the node stuck in ApplyingUpdate status. Replace the single rename with a hard link + rename sequence: create k0s.new as a hard link to k0s.tmp, then rename k0s.new to k0s. Since both k0s.tmp and k0s.new share the same inode, k0s.tmp survives the rename. If client.Update fails and the reconciler re-triggers, the checks on k0s.tmp correctly detect the pending work and the whole sequence can be safely replayed. Fixes: k0sproject#7703 Signed-off-by: bainaichang <3215903958@qq.com>
29da209 to
31520d1
Compare
|
Oh! I pushed a bit too many times; I didn’t expect the PR to update automatically! |
|
This pull request has merge conflicts that need to be resolved. |
Description
When the client.Update call fails after successfully renaming k0s.tmp to k0s
(e.g. due to a resourceVersion conflict), the reconciler retries but fails
because k0s.tmp no longer exists. This results in an infinite error loop that
leaves the node stuck in ApplyingUpdate status.
Fix this by checking if the target k0s binary is already in place before
attempting the file rename. If it is, skip the file operations and proceed
directly to updating the signaling status to Restart.
Fixes: #7703
Type of change
How Has This Been Tested?
go buildpassesmake check-unit)The logic follows the same pattern as the fix in PR #6994 for
schedulable.go,which addressed a similar non-idempotent reconciler issue.
Checklist