GPU plugin: fix upgrade errors as of newly added fields, add tests#1119
Conversation
✅ Deploy Preview for dra-driver-nvidia-gpu ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
On this branch, all tests passed when manually running |
| // rejected by the new binary. Circumvent that by adding the `omitempty` | ||
| // annotations here for `ShareID` and `Metadata` which were added as part of the | ||
| // k8s 1.36 release cycle. Also see issue 1080. | ||
| type CheckpointedDevice kubeletplugin.Device |
There was a problem hiding this comment.
This fixes the immediate kubeletplugin.Device cases, but the same upgrade fragility still exists for other checkpointed upstream API structs, especially resourceapi.ResourceClaimStatus correct?.
For example, if Kubernetes later adds a non-omitempty field anywhere under ResourceClaimStatus -> AllocationResult -> DeviceAllocationResult -> ..., re-marshaling an old checkpoint can again produce new zero-value JSON and fail checksum verification.
Going forward, we need to avoid checkpointing live upstream API types directly. A more robust pattern would be to define explicit checkpoint-only objects for every persisted type we own, like you did in this case, with stable JSON shape and omitempty on forward-added/optional fields, then convert to/from the current Kubernetes types at the boundary. That makes the checkpoint schema versioned and under our control instead of inheriting upstream serialization changes.
There was a problem hiding this comment.
I agree with Shiva and this resonates with my proposal at #1080 (comment)
There was a problem hiding this comment.
I am working on a separate PR to decouple upstream structures from the Checkpoint
There was a problem hiding this comment.
I agree with Shiva and this resonates with my proposal at #1080 (comment)
@shengnuo I forgot to previously recognize that you actually followed and understood the depth of the technical issue in our discussion in #1080 -- I appreciated that, and your push for a generic solution.
still exists for other checkpointed upstream API structs, especially resourceapi.ResourceClaimStatus correct?
Of course!
We can treat individual problems separately though. That is sometimes better in the overall impact per effort consideration.
What we know but what has to be said: this type of breakage never happens spontaneously, but specifically when we update vendored libraries. Hence, we have the opportunity to detect this in tests, and can respond accordingly.
Also: some types are probably already rather stable, and we shouldn't anticipate (even frequent) change in all DRA-related types.
I don't have a strong opinion here, but maybe we should not rush into preparing for changes in all types yet, for the 0.4.0 release. Because there is no user-facing problem to be solved.
I really do appreciate your work and thinking and exploration @shengnuo for the generic solution in #1120; as something that should grow into the next bigger push.
|
@jgehrcke can you squash all commits. We can track changes related to |
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
6871b8d to
683fb55
Compare
tests: work towards reactivating GPU upgrade test tests: log last-stable checkpoint before upgrade tests: add upgrade test with DynMIG workload tests: add test confirming checkpoint diff in log output Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
…abiity Add CheckpointedDevice, fix ShareID/Metadata upgrade errors GPU plugin: log checkpoint diff on checksum error GPU plugin: fix upgrade errors for MigLiveTuple and VfioDeviceInfo Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
683fb55 to
aacccea
Compare
|
Thanks for feedback @shengnuo and @shivamerla.
I have squashed commits now, ack.
We should do that, ack. |
|
Tested again:
|
|
/lgtm |
|
LGTM label has been added. DetailsGit tree hash: e901de21091ca801dfae7e78f813ed39a4a13593 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jgehrcke, shivamerla The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
|
/release-note-none |
|
/skip-tests |
Fixes #1080.
ShareIDandMetadataonkubeletplugin.Devicevia a library bump in Bug fixes for Passthrough-Support feature #994. This broke upgrades from 25.12.0 for all non-empty checkpoints. As a fix, this patch proposes a lightweight wrapper for us to be able to annotate the offending fields withomitempty. This specific upgrade breakage was caught by the test suite, and I've confirmed that with the patch applied the simple GPU workload upgrade test once again passes.ParentPCIBusIDandPciBusIDin Bug fixes for Passthrough-Support feature #994. This breaks upgrades from 25.12.0 when running a DynamicMIG or VfioDevice workload. This patch proposes to fix that by also adding correspondingomitemptyannotations. I also added a new test that covers upgrading under dynamic MIG workload and confirmed that this new test failed without the fix, and succeeds with the patch applied.Error: error creating driver: unable to get checkpoint: checkpoint is corrupted. Below you can see an example for when"parentPCIBusID": ""would break checksum verification.nvidia-dra-driver-gpu-componentlabel for now #1079, I still don't see a downside and this makes it easy to test the general upgradeability by virtue of covering thedownstream-25-12-0-on-NGC->upstream-0-4-0-devupgrade path.The new log output for better upgrade failure debuggability: