Gpu health check#545
Conversation
Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Swati Gupta <swatig@nvidia.com>
|
current log: resourceclaim status update is still broken. |
There was a problem hiding this comment.
Pull Request Overview
Adds preliminary GPU health monitoring functionality to detect and handle unhealthy GPU devices in the NVIDIA DRA driver. The implementation listens for NVML events (XID errors, ECC errors) and removes unhealthy devices from the allocatable pool.
- Introduces device health status tracking with
Healthy/Unhealthystates - Implements NVML event-based health monitoring for GPU devices
- Updates resource claim status to reflect device health conditions
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| cmd/gpu-kubelet-plugin/nvlib.go | Initialize all devices with Healthy status |
| cmd/gpu-kubelet-plugin/driver.go | Add device health monitor initialization and health notification handling |
| cmd/gpu-kubelet-plugin/device_state.go | Add device health status updates and resource claim status reporting |
| cmd/gpu-kubelet-plugin/device_health.go | New file implementing NVML event-based health monitoring |
| cmd/gpu-kubelet-plugin/allocatable.go | Add health status field and methods to AllocatableDevice |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| if err != nil { | ||
| return nil, fmt.Errorf("start deviceHealthMonitor: %w", err) | ||
| } | ||
| klog.Info("[SWATI DEBUGS] Started device health monitor") |
There was a problem hiding this comment.
There's a typo in the log message: 'DEBUGS' should be 'DEBUG' to match the pattern used in other debug messages.
| klog.Info("[SWATI DEBUGS] Started device health monitor") | |
| klog.Info("[SWATI DEBUG] Started device health monitor") |
| var resourceSlice resourceslice.Slice | ||
| for _, dev := range d.state.allocatable { | ||
| if dev.IsHealthy() { | ||
| klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev) |
There was a problem hiding this comment.
There's a typo in the log message: 'resoureslice' should be 'resourceslice'.
| klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev) | |
| klog.Infof("[SWATI DEBUG] device is healthy, added to resourceslice: %v", dev) |
| } | ||
|
|
||
| // Republish updated resources | ||
| klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices") |
There was a problem hiding this comment.
There's a typo in the log message: 'rebulishing' should be 'republishing'.
| klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices") | |
| klog.Info("[SWATI DEBUG] republishing resourceslice with healthy devices") |
| Config: configapi.DefaultMigDeviceConfig(), | ||
| }) | ||
|
|
||
| // Swati: Add resourceclaim status update |
There was a problem hiding this comment.
The comment should follow proper Go comment conventions and be more descriptive. Consider: '// Add resource claim status update to track device health'.
| // Swati: Add resourceclaim status update | |
| // Add resource claim status update to track device health. |
| // Swati add health check | ||
| klog.Info("[SWATI DEBUG] adding device status") |
There was a problem hiding this comment.
The comment should follow proper Go comment conventions. Consider: '// Add health status to device allocation result'.
| // Swati add health check | |
| klog.Info("[SWATI DEBUG] adding device status") | |
| // Add health status to device allocation result |
| //defer nvdevlib.alwaysShutdown() | ||
|
|
||
| //klog.Info("[SWATI DEBUG] getting all devices..") | ||
| //allocatable, err := nvdevlib.enumerateAllPossibleDevices(config) | ||
| //if err != nil { | ||
| // return nil, fmt.Errorf("error enumerating all possible devices: %w", err) | ||
| //} | ||
|
|
There was a problem hiding this comment.
Commented-out code should be removed. If this code might be needed later, consider documenting why it's commented out or remove it entirely.
| //defer nvdevlib.alwaysShutdown() | |
| //klog.Info("[SWATI DEBUG] getting all devices..") | |
| //allocatable, err := nvdevlib.enumerateAllPossibleDevices(config) | |
| //if err != nil { | |
| // return nil, fmt.Errorf("error enumerating all possible devices: %w", err) | |
| //} |
| } | ||
|
|
||
| func newDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*deviceHealthMonitor, error) { | ||
| klog.Info("[SWATI DEBUG] initializing NVML..") |
There was a problem hiding this comment.
The log message has inconsistent punctuation. Either use 'NVML...' (with proper ellipsis) or 'NVML' (without trailing dots).
| klog.Info("[SWATI DEBUG] initializing NVML..") | |
| klog.Info("[SWATI DEBUG] initializing NVML") |
|
More logs after fixing republish of resourceslice when unhealthy gpu found |
|
need to fix resourceclaim status update: not using the right client api |
|
Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it. |
|
Unless something's changed recently that I'm not aware of, there is no ResourceSlice status. We only have a spec so far as we haven't had a need to add a status yet. |
Yes. this is just to test the e2e flow (which is to report any health events and example action is to republish the slice by the driver). This is just to see if i have setup everything correctly. |
Not resourceslice, but update the resourceclaim status similar to this https://github.com/google/dranet/pull/78/files#diff-e8a7e777d80a14b455bdbf7aae3f28ad8082ffa0a06579e11cc1af741b5f98f7R266 |
|
Got the resourceclaim status to be updated |
Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Swati Gupta <swatig@nvidia.com>
717656d to
d1852f0
Compare
Signed-off-by: Swati Gupta <swatig@nvidia.com>
|
Updated action on health event: update device condition to unhealthy in resourceclaim status |
|
@ArangoGutierrez @klueska can i get a prelim review on this. There are still some tasks but its in working state. |
|
need to check how to enable the DeviceHealth FG from helm. |
| Destination: &flags.healthcheckPort, | ||
| EnvVars: []string{"HEALTHCHECK_PORT"}, | ||
| }, | ||
| &cli.StringFlag{ |
There was a problem hiding this comment.
Is StringSliceFlag a better choice? (see how we use it in https://github.com/search?q=repo%3ANVIDIA%2Fk8s-dra-driver-gpu%20%2FadditionalNamespaces%2F&type=code)
There was a problem hiding this comment.
comma separation will still work i think!
There was a problem hiding this comment.
yes, StringSlice is a good idea. but right now getAdditionalXids() https://github.com/NVIDIA/k8s-dra-driver-gpu/pull/545/files#diff-2c2a0bd3b0d6412f7f1307b6ea3cd23f451fba48563771c9cd8ad697ebf9d6c4R238 expects a comma separated list and handles slicing...
There was a problem hiding this comment.
Keeping it open, in case i want to add this improvement.
|
test of skipped xid: |
Signed-off-by: Swati Gupta <swatig@nvidia.com>
| func (s *DeviceState) MarkDeviceUnhealthy(device *AllocatableDevice) { | ||
| // SWATI: check if a mig device is marked properly | ||
| s.Lock() | ||
| defer s.Unlock() | ||
|
|
||
| device.Health = Unhealthy | ||
| klog.Infof("Marked device:%s unhealthy", device.GetUUID()) | ||
| } |
There was a problem hiding this comment.
Can this take health as a parameter so it can be reused once we have the ability to bring a device back to healthy?
| SearchDeviceGroups: | ||
| for _, group := range preparedClaim.PreparedDevices { | ||
| for _, device := range group.Devices { | ||
| var currentUUID string | ||
| var currentDeviceName string | ||
|
|
||
| if device.Gpu != nil { | ||
| currentUUID = device.Gpu.Info.UUID | ||
| currentDeviceName = device.Gpu.Device.DeviceName | ||
| } else if device.Mig != nil { | ||
| currentUUID = device.Mig.Info.UUID | ||
| currentDeviceName = device.Mig.Device.DeviceName | ||
| } | ||
|
|
||
| if currentUUID == unhealthyDeviceUUID { | ||
| klog.V(6).Infof("found matching device: %v for claim: %s", currentDeviceName, claimUID) | ||
| matchingDeviceName = currentDeviceName | ||
| break SearchDeviceGroups | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
You can avoid the labeled break by putting this in a function and returning at the point that you find what you are looking for.
| } | ||
|
|
||
| func (d *driver) findClaimByUID(ctx context.Context, claimUID types.UID) (*v1beta1.ResourceClaim, error) { | ||
| claimList, err := d.state.config.clientsets.Core.ResourceV1beta1().ResourceClaims("").List(ctx, metav1.ListOptions{}) |
There was a problem hiding this comment.
This needs to work not just for the v1beta1 api, but all of v1, v1beta1, and v1beta2. We have helpers to do that in the staging repo.
There was a problem hiding this comment.
Yes. I was not sure how to do it. link?
| } | ||
|
|
||
| func (d *driver) findClaimByUID(ctx context.Context, claimUID types.UID) (*v1beta1.ResourceClaim, error) { | ||
| claimList, err := d.state.config.clientsets.Core.ResourceV1beta1().ResourceClaims("").List(ctx, metav1.ListOptions{}) |
There was a problem hiding this comment.
I'm not sure how I feel about listing all Resource claims and then searching through them for the matching UID. It feels like we should rather be storing the claim name / namespace in the checkpoint so we can pull it directly and then assert that it has the correct UID.
There was a problem hiding this comment.
Yes yes, this is not efficient, i dint want to make changes to the existing code as we are not sure if we want to take this action or not. This is more of a sample action.
There was a problem hiding this comment.
As discussed offline, we may update checkpoint to add claim name/namespace for faster lookup for claims status update of the unhealthy device, so that there is no need to iterate on all claims and find the needed one.
For another usecase, its already gettimg updated for computedomain.
| if err := d.state.applyClaimDeviceStatuses(ctx, claim.Namespace, claim.Name, ds); err != nil { | ||
| klog.Errorf("Failed to update status for claim %s/%s: %v", claim.Namespace, claim.Name, err) | ||
| } else { | ||
| klog.V(6).Infof("applied unhealthy device status to claim %s/%s", claim.Namespace, claim.Name) | ||
| } |
There was a problem hiding this comment.
I don't think we should just give up here. Its likely we may have a conflict when writing and we need to try again. Instead we should push a task to the workqueue that will keep retrying until it succceeds.
|
|
||
| claim, err := d.findClaimByUID(ctx, types.UID(claimUID)) | ||
| if err != nil { | ||
| klog.Errorf("Failed to find ResourceClaim object for UID %s: %v", claimUID, err) |
There was a problem hiding this comment.
This isn't necessarily an error.
| // Update allocated device health status in a given claim | ||
| result, claimUID, err := d.findDeviceResultAndClaimUID(uuid) | ||
| if err != nil { | ||
| klog.Errorf("Device %s is unhealthy, but no associated claim was found: %v", uuid, err) |
There was a problem hiding this comment.
this is not necessarily an error.
| klog.V(6).Infof("Adding devices health status to claim %s/%s", claim.Namespace, claim.Name) | ||
| if err := s.applyClaimDeviceStatuses(ctx, claim.Namespace, claim.Name, deviceStatuses...); err != nil { | ||
| klog.Warningf("Failed to update devices status for claim %s/%s: %v", claim.Namespace, claim.Name, err) | ||
| } |
There was a problem hiding this comment.
High level question -- who is going to be reading this status from the ResourceClaim and doing anything with it? I know I suggested looking at DRANet and seeing how they were reporting their health status, but does it even make sense to do this in our driver? Why does DRANet need to do it?
/cc @aojea
There was a problem hiding this comment.
DRANET uses the standardized fields for network information on the status, so cni-dra-driver
Using standard data for reporting this information in status allow us to build tooling and applications on top, specially useful for some use cases of multi networking or monitoring
As discussed in the team meeting. We need to bring this back for the intial release of these health checks. The "right" way to marm the GPUs as unhealthy will be with device taints, but those are still an alpha feature, and we need someway to mark them as unhealthy / unschedulable in the interim. |
| d.state.MarkDeviceUnhealthy(device) | ||
|
|
||
| // Update allocated device health status in a given claim | ||
| result, claimUID, err := d.findDeviceResultAndClaimUID(uuid) |
There was a problem hiding this comment.
Does d.findDeviceResultAndClaimUID(uuid) only operate on local file system state (checkpoint data)? Or is any networking interaction or IPC involved?
I would find it helpful to clarify that in the code comment right above that call, or to even reflect that in the method name.
There was a problem hiding this comment.
local checkpoint data.
| continue | ||
| } | ||
|
|
||
| claim, err := d.findClaimByUID(ctx, types.UID(claimUID)) |
There was a problem hiding this comment.
Could we add a code comment here explaining why at this point it is relevant to look up the full claim object?
That code comment would clarify the importance of having that data at all, and having it fresh. It would also clarify
- how much of a problem it is if we do not find the claim
- how much of a problem it is to use an outdated version of the claim object
That's so important to have clarified clearly, and maybe we want to discuss that goal statement (specification) before getting too deep into the weeds of the error handling / retrying behavior implementation review.
For this to work, we also need a reconciliation to bring them back once there is remediation and GPU is healthy again. |
|
Quick Mig test logs: |
Signed-off-by: Swati Gupta <swatig@nvidia.com>
|
Closing this in favor of #689 |
Addressing #360 to add preliminary health check