
Adding checkpointing improvement #1062

Open
visheshtanksale wants to merge 1 commit into kubernetes-sigs:main from visheshtanksale:checkpoint-improvement


@visheshtanksale
Contributor

@visheshtanksale visheshtanksale commented Apr 18, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

  • Adds DeepCopy methods to all the types that are checkpointed.
  • Adds const CheckpointVersion = "v2" and a Version string field (JSON: "version,omitempty") to Checkpoint. MarshalCheckpoint now always sets cp.Version = CheckpointVersion, so every written checkpoint carries an explicit version tag.
  • Adds an other map[string]json.RawMessage field to Checkpoint for downgrade safety.

Which issue(s) this PR is related to:

Fixes #507

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE


Additional documentation (design docs, usage docs, etc.):


Checklist

  • make check test passes locally
  • make check-generate passes if api/ changed (CRDs, deepcopy, informers, listers, clientset)
  • make check-modules passes if go.mod / go.sum changed
  • Tests added or updated for the change
  • Helm chart (deployments/helm) updated if flags, RBAC, or defaults changed

@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 18, 2026
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 18, 2026
@visheshtanksale visheshtanksale force-pushed the checkpoint-improvement branch from fcd236f to 550cd46 Compare April 18, 2026 00:22
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 18, 2026
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
@visheshtanksale visheshtanksale force-pushed the checkpoint-improvement branch from 550cd46 to 192cff2 Compare April 18, 2026 00:24
@k8s-ci-robot
Contributor

@visheshtanksale: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-dra-driver-nvidia-gpu-e2e-lambda-gpu | 192cff2 | link | false | /test pull-dra-driver-nvidia-gpu-e2e-lambda-gpu

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@k8s-triage-robot

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-contributor-experience at kubernetes/community.

/check-cla
/easycla

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 18, 2026
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Agreeing with the intent (explicit envelope version + unknown-field preservation is the right direction), but there's a load-bearing bug and the PR needs tests before it can merge.

  • Blocker: ToLatestVersion discards the other map. The function constructs latest := &Checkpoint{} and copies only V2; MarshalCheckpoint unconditionally calls cp = cp.ToLatestVersion() on entry, and updateCheckpoint in device_state.go does the same before the mutate callback runs. Result: unknown fields parsed into cp.other are dropped before the next write, defeating the preservation mechanism on every read-modify-write. Same bug in both cmd/gpu-kubelet-plugin/checkpoint.go and cmd/compute-domain-kubelet-plugin/checkpoint.go. Inline on the GPU-plugin copy.
  • Blocker: no tests. A PR whose entire value proposition is schema/downgrade/round-trip safety needs: (1) round-trip marshal→unmarshal→re-marshal byte-equality; (2) downgrade round-trip — unmarshal a fixture with an unknown top-level field, re-marshal, assert it survives (this test fails against current code, which is the point); (3) V1 back-compat golden file under testdata/; (4) DeepCopy no-alias per type, especially deepCopyDevice (catches a regression where someone shallow-copies the Requests/CDIDeviceIDs slices).
  • Follow-up (not blocking): checkpoint.go / checkpointv.go / deepcopy.go are substantively duplicated between the two plugins. Worth an issue to lift the envelope (Checkpoint, Version, other, V1/V2 checksum plumbing) into a shared pkg/ package. Related: the atomic-write guarantee (tmp + Sync() + Rename()) lives in k8s.io/kubernetes/pkg/kubelet/util/store/filestore.go, not here — a one-line comment in createCheckpoint pointing at that dependency would help a future refactor avoid silently losing atomicity.
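
The downgrade round-trip check in item (2) of the test list could be sketched as below. Everything here is hypothetical scaffolding rather than code from the PR: `checkUnknownFieldSurvives`, the `futureField` fixture key, and the two round-trip functions exist only to show a lossy implementation failing the check and a preserving one passing it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// knownOnly models a checkpoint type that decodes only the fields it knows.
type knownOnly struct {
	Version string          `json:"version,omitempty"`
	V2      json.RawMessage `json:"v2,omitempty"`
}

// lossyRoundTrip drops unknown top-level fields, like the behavior the
// review describes as a blocker.
func lossyRoundTrip(in []byte) ([]byte, error) {
	var cp knownOnly
	if err := json.Unmarshal(in, &cp); err != nil {
		return nil, err
	}
	return json.Marshal(cp)
}

// preservingRoundTrip keeps every top-level key as raw JSON.
func preservingRoundTrip(in []byte) ([]byte, error) {
	var all map[string]json.RawMessage
	if err := json.Unmarshal(in, &all); err != nil {
		return nil, err
	}
	return json.Marshal(all)
}

// checkUnknownFieldSurvives feeds a fixture with an unknown top-level
// field through roundTrip and reports an error if the field is lost.
func checkUnknownFieldSurvives(roundTrip func([]byte) ([]byte, error)) error {
	fixture := []byte(`{"version":"v2","v2":{},"futureField":{"a":1}}`)
	out, err := roundTrip(fixture)
	if err != nil {
		return err
	}
	var got map[string]json.RawMessage
	if err := json.Unmarshal(out, &got); err != nil {
		return err
	}
	if string(got["futureField"]) != `{"a":1}` {
		return fmt.Errorf("unknown field lost: %s", out)
	}
	return nil
}

func main() {
	fmt.Println("lossy:", checkUnknownFieldSurvives(lossyRoundTrip))
	fmt.Println("preserving:", checkUnknownFieldSurvives(preservingRoundTrip))
}
```

In a real test file the same check would presumably take the plugin's own unmarshal and marshal path as the round-trip function and load the fixture from testdata/.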

Scoping question: what exact failure mode does #507 describe? If it's the downgrade-safety story, that's currently broken by the first blocker. If it's DeepCopy-to-avoid-aliasing, what's the observable bug the deep copy prevents?


func (cp *Checkpoint) MarshalCheckpoint() ([]byte, error) {
	cp = cp.ToLatestVersion()
	cp.Version = CheckpointVersion

cp = cp.ToLatestVersion() on the line above returns a fresh &Checkpoint{} that only copies V2 — it drops Version and other. You then re-stamp Version here, but other is already gone. This defeats the whole unknown-field preservation mechanism on every write: an unknown top-level field read via UnmarshalJSON is captured into cp.other, survives until MarshalCheckpoint, and then gets discarded by ToLatestVersion before the final json.Marshal(cp). Same call pattern in updateCheckpoint in device_state.go, so the drop also happens before the mutate callback runs. Fix: inside ToLatestVersion, copy cp.other (and arguably cp.Version) into latest. Same bug exists in cmd/compute-domain-kubelet-plugin/checkpoint.go — please patch both. A round-trip test asserting an unknown field survives would have caught this.

"k8s.io/kubernetes/pkg/kubelet/checkpointmanager/checksum"
)

const CheckpointVersion = "v2"

"v2" collides conceptually with the inner CheckpointV2 type embedded as json:"v2". Is the envelope version tied 1:1 to the inner payload, or is it an independent envelope-schema version? A future reader has to re-derive which "v2" is which. Prefer an integer (Version int + CheckpointEnvelopeVersion = 2) or a non-colliding string ("envelope.v2", "1.0").
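
The integer variant might look like the sketch below. `CheckpointEnvelopeVersion` is the name proposed in this comment; `envelopeHeader` and `envelopeVersion` are hypothetical helpers, and treating a missing version as 0 is an assumption about how files written before the field existed would be read.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CheckpointEnvelopeVersion is the envelope-schema version this build
// writes, independent of the inner "v1"/"v2" payload keys.
const CheckpointEnvelopeVersion = 2

// envelopeHeader decodes only the envelope version from a checkpoint.
type envelopeHeader struct {
	Version int `json:"version,omitempty"`
}

// envelopeVersion returns the envelope version stored in data.
// A file with no "version" key yields 0.
func envelopeVersion(data []byte) (int, error) {
	var h envelopeHeader
	if err := json.Unmarshal(data, &h); err != nil {
		return 0, err
	}
	return h.Version, nil
}

func main() {
	docs := []string{
		`{"v1":{}}`,
		`{"version":2,"v2":{}}`,
		`{"version":3,"v2":{}}`,
	}
	for _, doc := range docs {
		v, err := envelopeVersion([]byte(doc))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		// Integers can be ordered directly, unlike tags such as "v2".
		fmt.Printf("version=%d newerThanSupported=%t\n", v, v > CheckpointEnvelopeVersion)
	}
}
```

One consequence worth noting: a file that already carries the string form ("version":"v2") would fail to decode into an int field, so switching types after a release would need a migration path.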


cp.Checksum = 0
out, err := json.Marshal(*cp)
type v1View struct {

v1View locks in the V1 checksum byte shape, but has no compile-time link to CheckpointV1. If someone adds a field to CheckpointV1 later, the checksum view silently diverges from the on-disk V1 payload and the error manifests only as "old driver cannot verify new file" in the field. Either add a comment pointing at CheckpointV1 as the source of truth, or derive the view from CheckpointV1 via struct tags so the compiler enforces the mirror.


func (c PreparedClaimV1) DeepCopy() PreparedClaimV1 {
	var status resourceapi.ResourceClaimStatus
	if s := c.Status.DeepCopy(); s != nil {

c.Status is a value, so c.Status.DeepCopy() should not return nil — the defensive if s != nil { status = *s } implies uncertainty. Either simplify to Status: *c.Status.DeepCopy(), or leave a comment explaining the defensive path (e.g., what condition would make DeepCopy return nil here).
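
The simplification can be sketched with a stand-in type. `Status` below is hypothetical and only imitates the usual generated pattern (pointer receiver, nil check, fresh copy); it is not resourceapi.ResourceClaimStatus, and its `Devices` field is made up for the example.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for resourceapi.ResourceClaimStatus.
type Status struct {
	Devices []string
}

// DeepCopy follows the common generated shape: it returns nil only when
// the receiver itself is nil.
func (in *Status) DeepCopy() *Status {
	if in == nil {
		return nil
	}
	out := new(Status)
	if in.Devices != nil {
		out.Devices = make([]string, len(in.Devices))
		copy(out.Devices, in.Devices)
	}
	return out
}

// PreparedClaimV1 holds Status by value, as in the reviewed code.
type PreparedClaimV1 struct {
	Status Status
}

// DeepCopy dereferences directly: c.Status is an addressable value, so
// the receiver &c.Status is never nil and DeepCopy cannot return nil.
func (c PreparedClaimV1) DeepCopy() PreparedClaimV1 {
	return PreparedClaimV1{Status: *c.Status.DeepCopy()}
}

func main() {
	orig := PreparedClaimV1{Status: Status{Devices: []string{"gpu-0"}}}
	dup := orig.DeepCopy()
	dup.Status.Devices[0] = "changed"
	// The original is untouched because the slice was copied, not shared.
	fmt.Println(orig.Status.Devices[0], dup.Status.Devices[0])
}
```

The same mutate-the-copy pattern in `main` is the shape a no-alias test per type would take.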

Contributor

@rajatchopra rajatchopra left a comment

/lgtm
Some non-blocking suggestions.

	V1 *CheckpointV1 `json:"v1,omitempty"`
	V2 *CheckpointV2 `json:"v2,omitempty"`
	// other holds unknown fields from a newer checkpoint format
	other map[string]json.RawMessage

map[string]any maybe? Why put the encoding nuance in the type.
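
One practical difference between the two types, as a small sketch: encoding/json decodes JSON numbers held in an `any` as float64, so a large integer in an unknown field can change value on a round trip, while json.RawMessage keeps the original bytes. The helper names and the `future` key are hypothetical; whether this matters for the checkpoint depends on what a newer format would actually store.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTripAny decodes into map[string]any and re-encodes.
// Numbers pass through float64 on the way.
func roundTripAny(in []byte) (string, error) {
	var m map[string]any
	if err := json.Unmarshal(in, &m); err != nil {
		return "", err
	}
	out, err := json.Marshal(m)
	return string(out), err
}

// roundTripRaw decodes into map[string]json.RawMessage and re-encodes.
// Each value is carried as the bytes that were read.
func roundTripRaw(in []byte) (string, error) {
	var m map[string]json.RawMessage
	if err := json.Unmarshal(in, &m); err != nil {
		return "", err
	}
	out, err := json.Marshal(m)
	return string(out), err
}

func main() {
	// 2^53 + 1 cannot be represented exactly as a float64.
	in := []byte(`{"future":9007199254740993}`)
	viaAny, err := roundTripAny(in)
	if err != nil {
		panic(err)
	}
	viaRaw, err := roundTripRaw(in)
	if err != nil {
		panic(err)
	}
	fmt.Println("any:", viaAny)
	fmt.Println("raw:", viaRaw)
}
```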

// MarshalJSON implements json.Marshaler, merging known fields with any
// unknown fields captured from a newer checkpoint format.
func (cp *Checkpoint) MarshalJSON() ([]byte, error) {
	type Alias struct {

This could be type Alias Checkpoint itself: since 'other' is an unexported field, the marshaling would work just fine. Also, any changes to the Checkpoint struct would always stay in sync.

	if err := json.Unmarshal(known, &merged); err != nil {
		return nil, err
	}
	for k, v := range cp.other {

We should do a quick unit test around this maybe.

@k8s-ci-robot
Contributor

@rajatchopra: changing LGTM is restricted to collaborators

Details

In response to this:

/lgtm
Some non-blocking suggestions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rajatchopra, visheshtanksale
Once this PR has been reviewed and has the lgtm label, please assign dims for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/bug Categorizes issue or PR as related to a bug. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

Improve checkpointing semantics and reliability across new versions

5 participants