GPU plugin: fix upgrade errors as of newly added fields, add tests#1119

Merged
k8s-ci-robot merged 3 commits into
kubernetes-sigs:mainfrom
jgehrcke:jp/fix-gpu-upgrade-errors-add-test
May 12, 2026

@jgehrcke
Contributor

Fixes #1080.

  • We introduced ShareID and Metadata on kubeletplugin.Device via a library bump in #994 (Bug fixes for Passthrough-Support feature). This broke upgrades from 25.12.0 for all non-empty checkpoints. As a fix, this patch adds a lightweight wrapper type that lets us annotate the offending fields with omitempty. The test suite caught this specific upgrade breakage, and I've confirmed that with the patch applied the simple GPU workload upgrade test passes again.
  • We introduced ParentPCIBusID and PciBusID in #994 (Bug fixes for Passthrough-Support feature). This breaks upgrades from 25.12.0 when running a DynamicMIG or VfioDevice workload. This patch fixes that by adding the corresponding omitempty annotations as well. I also added a new test that covers upgrading under a dynamic MIG workload, and confirmed that this test fails without the fix and succeeds with the patch applied.
  • To make checkpoint upgrade failures faster to debug, the log output now includes a diff, emitted just before the otherwise hard-to-interpret message Error: error creating driver: unable to get checkpoint: checkpoint is corrupted. The example below shows the case where "parentPCIBusID": "" breaks checksum verification.
  • I also added a test that covers the emit-diff-via-logging logic.
  • I took the liberty of acting on #1079 (Helm chart: keep supporting nvidia-dra-driver-gpu-component label for now). I still don't see a downside, and it makes it easy to test general upgradeability by covering the downstream-25-12-0-on-NGC->upstream-0-4-0-dev upgrade path.
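
The effect behind the first two bullets can be reproduced in isolation. The following is a minimal sketch, not the driver's actual code: `deviceV1`, `deviceV2Broken`, `deviceV2Fixed`, and `roundTrip` are hypothetical names, and only the JSON keys `parentUUID` and `parentPCIBusID` are taken from the log output in this PR. It shows that unmarshaling an old checkpoint into a struct with a newly added field and marshaling it again changes the bytes unless the new field carries `omitempty`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deviceV1 stands in for the struct shape that wrote the old checkpoint
// (hypothetical, for illustration only).
type deviceV1 struct {
	ParentUUID string `json:"parentUUID"`
}

// deviceV2Broken adds a field without omitempty: its zero value is
// serialized as "parentPCIBusID": "".
type deviceV2Broken struct {
	ParentUUID     string `json:"parentUUID"`
	ParentPCIBusID string `json:"parentPCIBusID"`
}

// deviceV2Fixed adds the same field with omitempty: the zero value is
// left out of the JSON output.
type deviceV2Fixed struct {
	ParentUUID     string `json:"parentUUID"`
	ParentPCIBusID string `json:"parentPCIBusID,omitempty"`
}

// roundTrip unmarshals onDisk into dst and marshals dst again, which is
// what a newer binary effectively does before comparing checksums.
func roundTrip(onDisk []byte, dst any) (string, error) {
	if err := json.Unmarshal(onDisk, dst); err != nil {
		return "", err
	}
	out, err := json.Marshal(dst)
	return string(out), err
}

func main() {
	onDisk, err := json.Marshal(deviceV1{ParentUUID: "GPU-example"})
	if err != nil {
		panic(err)
	}
	broken, err := roundTrip(onDisk, &deviceV2Broken{})
	if err != nil {
		panic(err)
	}
	fixed, err := roundTrip(onDisk, &deviceV2Fixed{})
	if err != nil {
		panic(err)
	}
	fmt.Println("on-disk:", string(onDisk))
	fmt.Println("broken: ", broken, "same bytes:", broken == string(onDisk))
	fmt.Println("fixed:  ", fixed, "same bytes:", fixed == string(onDisk))
}
```

Any checksum computed over the re-marshaled bytes can only match the stored one in the `deviceV2Fixed` case, because that is the only case in which the bytes are unchanged.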

The new log output for better upgrade failure debuggability:

E0511 14:24:30.306830       1 device_state.go:640] checkpoint failed checksum verification; unified diff (on-disk vs re-marshaled by current binary):
--- on-disk
+++ re-marshaled
@@ -51,7 +51,8 @@
                     "giId": 7,
                     "ciId": 0,
                     "migUUID": "GPU-df52ace4-7143-c242-81d8-819a6b3819b4",
-                    "parentUUID": "GPU-df52ace4-7143-c242-81d8-819a6b3819b4"
+                    "parentUUID": "GPU-df52ace4-7143-c242-81d8-819a6b3819b4",
+                    "parentPCIBusID": ""
                   },
                   "device": {
                     "Requests": [
@@ -127,7 +128,8 @@
                     "giId": 7,
                     "ciId": 0,
                     "migUUID": "GPU-df52ace4-7143-c242-81d8-819a6b3819b4",
-                    "parentUUID": "GPU-df52ace4-7143-c242-81d8-819a6b3819b4"
+                    "parentUUID": "GPU-df52ace4-7143-c242-81d8-819a6b3819b4",
+                    "parentPCIBusID": ""
                   },
                   "device": {
                     "Requests": [
I0511 14:24:30.307024       1 main.go:224] shutdown
Error: error creating driver: unable to get checkpoint: checkpoint is corrupted
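
For readers who want to reproduce this kind of output outside the driver, here is a rough sketch of comparing an on-disk document with its re-marshaled form. It is a simplification under stated assumptions: it reports only the lines that the re-marshaled side adds rather than producing a real unified diff, the helper names `prettyJSON` and `addedLines` are hypothetical, and the driver's own checksum and diff logic in device_state.go is not shown here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// prettyJSON re-indents raw JSON so that two documents can be compared
// line by line.
func prettyJSON(raw []byte) (string, error) {
	var buf bytes.Buffer
	if err := json.Indent(&buf, raw, "", "  "); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// addedLines returns the lines of remarshaled that have no counterpart in
// onDisk. Lines are compared after trimming whitespace and a trailing
// comma, so a line that merely gained a comma is not reported.
func addedLines(onDisk, remarshaled string) []string {
	norm := func(s string) string {
		return strings.TrimSuffix(strings.TrimSpace(s), ",")
	}
	remaining := map[string]int{}
	for _, l := range strings.Split(onDisk, "\n") {
		remaining[norm(l)]++
	}
	var added []string
	for _, l := range strings.Split(remarshaled, "\n") {
		k := norm(l)
		if remaining[k] > 0 {
			remaining[k]--
			continue
		}
		added = append(added, k)
	}
	return added
}

func main() {
	onDisk, err := prettyJSON([]byte(`{"parentUUID":"GPU-example"}`))
	if err != nil {
		panic(err)
	}
	remarshaled, err := prettyJSON([]byte(`{"parentUUID":"GPU-example","parentPCIBusID":""}`))
	if err != nil {
		panic(err)
	}
	for _, l := range addedLines(onDisk, remarshaled) {
		fmt.Println("+", l)
	}
}
```

A line-set comparison like this loses ordering and context, which is why a proper unified diff, as in the log above, is the more useful thing to emit in practice.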

@k8s-ci-robot k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label May 11, 2026
@netlify

netlify Bot commented May 11, 2026

Deploy Preview for dra-driver-nvidia-gpu ready!

Name Link
🔨 Latest commit aacccea
🔍 Latest deploy log https://app.netlify.com/projects/dra-driver-nvidia-gpu/deploys/6a03092b97b19b00085d2a7a
😎 Deploy Preview https://deploy-preview-1119--dra-driver-nvidia-gpu.netlify.app

@k8s-ci-robot k8s-ci-robot requested review from dims and guptaNswati May 11, 2026 17:04
@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 11, 2026
@jgehrcke
Contributor Author

jgehrcke commented May 11, 2026

On this branch, all tests passed when manually running make bats-gpu:

27 tests, 0 failures in 232 seconds

// rejected by the new binary. Circumvent that by adding the `omitempty`
// annotations here for `ShareID` and `Metadata` which were added as part of the
// k8s 1.36 release cycle. Also see issue 1080.
type CheckpointedDevice kubeletplugin.Device
Contributor

This fixes the immediate kubeletplugin.Device cases, but the same upgrade fragility still exists for other checkpointed upstream API structs, especially resourceapi.ResourceClaimStatus, correct?

For example, if Kubernetes later adds a non-omitempty field anywhere under ResourceClaimStatus -> AllocationResult -> DeviceAllocationResult -> ..., re-marshaling an old checkpoint can again produce new zero-value JSON and fail checksum verification.

Going forward, we need to avoid checkpointing live upstream API types directly. A more robust pattern would be to define explicit checkpoint-only objects for every persisted type we own, like you did in this case, with stable JSON shape and omitempty on forward-added/optional fields, then convert to/from the current Kubernetes types at the boundary. That makes the checkpoint schema versioned and under our control instead of inheriting upstream serialization changes.

Contributor

I agree with Shiva and this resonates with my proposal at #1080 (comment)

Contributor

I am working on a separate PR to decouple upstream structures from the Checkpoint

Contributor Author

@jgehrcke jgehrcke May 12, 2026

> I agree with Shiva and this resonates with my proposal at #1080 (comment)

@shengnuo I forgot to previously recognize that you actually followed and understood the depth of the technical issue in our discussion in #1080 -- I appreciated that, and your push for a generic solution.

> still exists for other checkpointed upstream API structs, especially resourceapi.ResourceClaimStatus correct?

Of course!

We can treat individual problems separately, though. That is sometimes the better choice when weighing overall impact against effort.

What we all know but what still needs to be said: this type of breakage never happens spontaneously, only when we update vendored libraries. Hence, we have the opportunity to detect it in tests and can respond accordingly.

Also: some types are probably already rather stable, and we shouldn't anticipate change (let alone frequent change) in all DRA-related types.

I don't have a strong opinion here, but maybe we should not rush into preparing for changes in all types yet for the 0.4.0 release, because there is no user-facing problem to be solved.

I really do appreciate your work and thinking and exploration @shengnuo for the generic solution in #1120; as something that should grow into the next bigger push.

@shivamerla
Contributor

@jgehrcke can you squash all commits. We can track changes related to resourceapi.ResourceClaimStatus in a separate issue.

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
@jgehrcke jgehrcke force-pushed the jp/fix-gpu-upgrade-errors-add-test branch from 6871b8d to 683fb55 Compare May 12, 2026 11:01
jgehrcke added 2 commits May 12, 2026 11:02
tests: work towards reactivating GPU upgrade test
tests: log last-stable checkpoint before upgrade
tests: add upgrade test with DynMIG workload
tests: add test confirming checkpoint diff in log output

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
…abiity

Add CheckpointedDevice, fix ShareID/Metadata upgrade errors
GPU plugin: log checkpoint diff on checksum error
GPU plugin: fix upgrade errors for MigLiveTuple and VfioDeviceInfo

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
@jgehrcke jgehrcke force-pushed the jp/fix-gpu-upgrade-errors-add-test branch from 683fb55 to aacccea Compare May 12, 2026 11:04
@jgehrcke
Contributor Author

Thanks for the feedback, @shengnuo and @shivamerla.

> @jgehrcke can you squash all commits.

I have squashed commits now, ack.

> We can track changes related to resourceapi.ResourceClaimStatus in a separate issue.

We should do that, ack.

@jgehrcke
Contributor Author

jgehrcke commented May 12, 2026

Tested again:

$ make image-build-and-copy-to-nodes
make -f deployments/container/Makefile build
[...]
export/compress took: 1.34 s
-rw------- 1 jgehrcke jgehrcke 78M May 12 11:08 ./dra-driver-dev-img.v9252FB.tar.gz
[...]

$ TEST_CHART_LOCAL=1 make bats-gpu
make -f tests/bats/Makefile tests-gpu
[...]
27 tests, 0 failures in 237 seconds

mock-gpu-e2e failed with a transient error (log):

#8 [internal] load metadata for nvcr.io/nvidia/k8s/container-toolkit:v1.19.0
#8 ERROR: failed to do request: Head "https://nvcr.io/v2/nvidia/k8s/container-toolkit/manifests/v1.19.0": net/http: TLS handshake timeout

@shivamerla
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 12, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: e901de21091ca801dfae7e78f813ed39a4a13593

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jgehrcke, shivamerla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [jgehrcke,shivamerla]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shivamerla
Contributor

/retest

@shivamerla
Contributor

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 12, 2026
@shivamerla
Contributor

/skip-tests

@k8s-ci-robot k8s-ci-robot merged commit fefaeea into kubernetes-sigs:main May 12, 2026
17 of 18 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in DRA Driver for NVIDIA GPUs May 12, 2026
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug] GPU plugin upgrade fails with checkpoint is corrupted

4 participants