Gpu health check by guptaNswati · Pull Request #545 · kubernetes-sigs/dra-driver-nvidia-gpu

guptaNswati · 2025-09-06T00:30:40Z

Addressing #360 to add preliminary health check

Signed-off-by: Swati Gupta <swatig@nvidia.com>

copy-pr-bot · 2025-09-06T00:30:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

guptaNswati · 2025-09-06T00:31:32Z

current log:

Sending unhealthy notification for device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 due to event type 8
W0905 23:51:54.358343       1 driver.go:208] Received unhealthy notification for device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
W0905 23:51:54.358366       1 device_state.go:619] Attempted to mark unknown device as unhealthy: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
I0905 23:51:54.358482       1 driver.go:235] Successfully republished resources after marking device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy

resourceclaim status update is still broken.

Copilot

Pull Request Overview

Adds preliminary GPU health monitoring functionality to detect and handle unhealthy GPU devices in the NVIDIA DRA driver. The implementation listens for NVML events (XID errors, ECC errors) and removes unhealthy devices from the allocatable pool.

Introduces device health status tracking with Healthy/Unhealthy states
Implements NVML event-based health monitoring for GPU devices
Updates resource claim status to reflect device health conditions

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

File	Description
cmd/gpu-kubelet-plugin/nvlib.go	Initialize all devices with `Healthy` status
cmd/gpu-kubelet-plugin/driver.go	Add device health monitor initialization and health notification handling
cmd/gpu-kubelet-plugin/device_state.go	Add device health status updates and resource claim status reporting
cmd/gpu-kubelet-plugin/device_health.go	New file implementing NVML event-based health monitoring
cmd/gpu-kubelet-plugin/allocatable.go	Add health status field and methods to AllocatableDevice

Copilot · 2025-09-08T16:04:09Z

+	if err != nil {
+		return nil, fmt.Errorf("start deviceHealthMonitor: %w", err)
+	}
+	klog.Info("[SWATI DEBUGS] Started device health monitor")


There's a typo in the log message: 'DEBUGS' should be 'DEBUG' to match the pattern used in other debug messages.

Suggested change

klog.Info("[SWATI DEBUGS] Started device health monitor")

klog.Info("[SWATI DEBUG] Started device health monitor")

Copilot · 2025-09-08T16:04:09Z

+			var resourceSlice resourceslice.Slice
+			for _, dev := range d.state.allocatable {
+				if dev.IsHealthy() {
+					klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev)


There's a typo in the log message: 'resoureslice' should be 'resourceslice'.

Suggested change

klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev)

klog.Infof("[SWATI DEBUG] device is healthy, added to resourceslice: %v", dev)

Copilot · 2025-09-08T16:04:10Z

+			}
+
+			// Republish updated resources
+			klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices")


There's a typo in the log message: 'rebulishing' should be 'republishing'.

Suggested change

klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices")

klog.Info("[SWATI DEBUG] republishing resourceslice with healthy devices")

Copilot · 2025-09-08T16:04:10Z

 		Config:   configapi.DefaultMigDeviceConfig(),
 	})

+	// Swati: Add resourceclaim status update


The comment should follow proper Go comment conventions and be more descriptive. Consider: '// Add resource claim status update to track device health'.

Suggested change

// Swati: Add resourceclaim status update

// Add resource claim status update to track device health.

Copilot · 2025-09-08T16:04:10Z

+		// Swati add health check
+		klog.Info("[SWATI DEBUG] adding device status")


The comment should follow proper Go comment conventions. Consider: '// Add health status to device allocation result'.

Suggested change

// Swati add health check

klog.Info("[SWATI DEBUG] adding device status")

// Add health status to device allocation result

Copilot · 2025-09-08T16:04:11Z

+	//defer nvdevlib.alwaysShutdown()
+
+	//klog.Info("[SWATI DEBUG] getting all devices..")
+	//allocatable, err := nvdevlib.enumerateAllPossibleDevices(config)
+	//if err != nil {
+	//	return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
+	//}
+


Commented-out code should be removed. If this code might be needed later, consider documenting why it's commented out or remove it entirely.

Suggested change

//defer nvdevlib.alwaysShutdown()

//klog.Info("[SWATI DEBUG] getting all devices..")

//allocatable, err := nvdevlib.enumerateAllPossibleDevices(config)

//if err != nil {

// return nil, fmt.Errorf("error enumerating all possible devices: %w", err)

//}

Copilot · 2025-09-08T16:04:11Z

+}
+
+func newDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*deviceHealthMonitor, error) {
+	klog.Info("[SWATI DEBUG] initializing NVML..")


The log message has inconsistent punctuation. Either use 'NVML...' (with proper ellipsis) or 'NVML' (without trailing dots).

Suggested change

klog.Info("[SWATI DEBUG] initializing NVML..")

klog.Info("[SWATI DEBUG] initializing NVML")

guptaNswati · 2025-09-08T23:14:16Z

More logs after fixing the republish of the resourceslice when an unhealthy GPU is found:

kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-ndv47 -n nvidia-dra-driver-gpu  -c gpus | grep unhealth 
I0908 23:07:58.793308       1 device_health.go:173] Sending unhealthy notification for device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 due to event type 8
W0908 23:07:58.793342       1 driver.go:208] Received unhealthy notification for device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
I0908 23:07:58.793371       1 device_state.go:636] Marked device:GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy
E0908 23:07:58.793381       1 driver.go:220] device:GPU-a4f34abc-7715-3560-dcea-7238b9611a45 with uuid:&{%!s(*main.GpuInfo=&{GPU-a4f34abc-7715-3560-dcea-7238b9611a45 0 0 false 102625181696 NVIDIA GH200 96GB HBM3 Nvidia Hopper 9.0 570.86.15 12.8 0009:01:00.0 {resource.kubernetes.io/pcieRoot {<nil> <nil> 0x4000328130 <nil>}} [0x40008965a0 0x40008965d0 0x4000896600 0x4000896630 0x4000896660 0x4000896690 0x4000896840 0x40008972f0 0x4000897530 0x4000897560]}) %!s(*main.MigDeviceInfo=<nil>) Unhealthy} is unhealthy
I0908 23:07:58.793531       1 driver.go:235] Successfully republished resources after marking device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy

 "ResourceSlice update" logger="ResourceSlice controller" slice="sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5" diff=<
	@@ -3,8 +3,8 @@
	   "name": "sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5",
	   "generateName": "sc-starwars-mab9-b00-gpu.nvidia.com-",
	   "uid": "b5a8727d-b8cd-4073-8817-d3e31147a8bd",
	-  "resourceVersion": "50777207",
	-  "generation": 1,
	+  "resourceVersion": "50777758",
	+  "generation": 2,
	   "creationTimestamp": "2025-09-08T23:05:30Z",
	   "ownerReferences": [
	    {
	@@ -20,7 +20,7 @@
	     "manager": "gpu-kubelet-plugin",
	     "operation": "Update",
	     "apiVersion": "resource.k8s.io/v1beta1",
	-    "time": "2025-09-08T23:05:30Z",
	+    "time": "2025-09-08T23:07:58Z",
	     "fieldsType": "FieldsV1",
	     "fieldsV1": {
	      "f:metadata": {

$ kubectl get resourceslice  sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5  -o yaml 
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  creationTimestamp: "2025-09-08T23:05:30Z"
  generateName: sc-starwars-mab9-b00-gpu.nvidia.com-
  generation: 2
  name: sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: sc-starwars-mab9-b00
    uid: 80ede971-5b44-4a12-a951-a1bebe79209d
  resourceVersion: "50777758"
  uid: b5a8727d-b8cd-4073-8817-d3e31147a8bd
spec:
  devices:
  - basic:
      attributes:
        architecture:
          string: Hopper
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 9.0.0
        cudaDriverVersion:
          version: 12.8.0
        driverVersion:
          version: 570.86.15
        index:
          int: 1
        minor:
          int: 1
        pcieBusID:
          string: "0019:01:00.0"
        productName:
          string: NVIDIA GH200 96GB HBM3
        resource.kubernetes.io/pcieRoot:
          string: pci0019:00
        type:
          string: gpu
        uuid:
          string: GPU-9e6df7cb-64d4-5e53-2b1d-cee9e58aeb94
      capacity:
        memory:
          value: 97871Mi
    name: gpu-1
  driver: gpu.nvidia.com
  nodeName: sc-starwars-mab9-b00
  pool:
    generation: 1
    name: sc-starwars-mab9-b00
    resourceSliceCount: 1

guptaNswati · 2025-09-08T23:14:27Z

Need to fix the resourceclaim status update: it is not using the right client API.

Device gpu-0 is healthy, marking as ready
E0908 23:06:44.085161       1 device_state.go:346] failed to update status for claim gpu-test1/pod1-gpu-zc6s4: not implemented in k8s.io/dynamic-resource-allocation/client

failed to update status for claim gpu-test1/pod2-gpu-q45rg: not implemented in k8s.io/dynamic-resource-allocation/client

klueska · 2025-09-09T03:50:26Z

Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it.

klueska · 2025-09-09T03:55:15Z

Unless something's changed recently that I'm not aware of, there is no ResourceSlice status. We only have a spec so far as we haven't had a need to add a status yet.

guptaNswati · 2025-09-09T17:56:27Z

Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it.

Yes. This is just to test the e2e flow: report any health events and, as an example action, have the driver republish the slice. It is only meant to check that I have set everything up correctly.

guptaNswati · 2025-09-09T17:57:52Z

Unless something's changed recently that I'm not aware of, there is no ResourceSlice status. We only have a spec so far as we haven't had a need to add a status yet.

Not resourceslice, but update the resourceclaim status similar to this https://github.com/google/dranet/pull/78/files#diff-e8a7e777d80a14b455bdbf7aae3f28ad8082ffa0a06579e11cc1af741b5f98f7R266

guptaNswati · 2025-09-09T22:02:08Z

Got the resourceclaim status to be updated

 Device gpu-1 is healthy, marking as ready
I0909 21:53:04.772855       1 round_trippers.go:632] "Response" logger="dra" requestID=7 method="/k8s.io.kubelet.pkg.apis.dra.v1beta1.DRAPlugin/NodePrepareResources" verb="PATCH" url="https://x.x.x.x:443/apis/resource.k8s.io/v1beta1/namespaces/gpu-test1/resourceclaims/pod1-gpu-rrkx5/status?fieldManager=gpu.nvidia.com&force=true" status="200 OK" milliseconds=4
I0909 21:53:04.772960       1 device_state.go:348] updated device status for claim gpu-test1/pod1-gpu-rrkx5

  devices:
  - conditions:
    - lastTransitionTime: "2025-09-09T21:53:04Z"
      message: Device is healthy and ready
      reason: Healthy
      status: "True"
      type: Ready
    data: null
    device: gpu-1
    driver: gpu.nvidia.com
    pool: sc-starwars-mab9-b00

Signed-off-by: Swati Gupta <swatig@nvidia.com>

guptaNswati · 2025-09-19T03:29:18Z

Updated action on health event: update device condition to unhealthy in resourceclaim status

$ kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-m8xsz -n nvidia-dra-driver-gpu -c gpus

1 device_health.go:167] Processing event {Device:{Handle:0xee82742dfef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
W0919 02:59:06.452857       1 device_health.go:170] Critical XID error detected on device: {Device:{Handle:0xee82742dfef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I0919 02:59:06.452874       1 device_health.go:200] Sending unhealthy notification for device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 due to event type 8
W0919 02:59:06.452905       1 driver.go:212] Received unhealthy notification for device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
I0919 02:59:06.452918       1 device_state.go:617] Marked device:GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy
I0919 02:59:06.453547       1 driver.go:298] found matching device to claim: gpu-0
I0919 02:59:06.453556       1 driver.go:312] Found it! Return the result object: gpu-0 and the claim UID: 590e5164-7511-418d-8b8b-77ae0e414dc6
I0919 02:59:06.456314       1 round_trippers.go:632] "Response" verb="GET" url="https://10.96.0.1:443/apis/resource.k8s.io/v1beta1/resourceclaims" status="200 OK" milliseconds=2
I0919 02:59:06.456538       1 driver.go:335] found ResourceClaim with UID 590e5164-7511-418d-8b8b-77ae0e414dc6 not found
I0919 02:59:06.456548       1 driver.go:345] Applying 'Ready=False' condition for device 'gpu-0' in ResourceClaim 'gpu-test1/pod2-gpu-l8rrx'
I0919 02:59:06.460688       1 round_trippers.go:632] "Response" verb="PATCH" url="https://x.x.x.x:443/apis/resource.k8s.io/v1beta1/namespaces/gpu-test1/resourceclaims/pod2-gpu-l8rrx/status?fieldManager=gpu.nvidia.com&force=true" status="200 OK" milliseconds=3

$ kubectl get resourceclaim -n gpu-test1  -o yaml | grep -A 8 condition
    - conditions:
      - lastTransitionTime: "2025-09-09T21:53:04Z"
        message: Device is healthy and ready
        reason: Healthy
        status: "True"
        type: Ready
      data: null
      device: gpu-1
      driver: gpu.nvidia.com
--
    - conditions:
      - lastTransitionTime: "2025-09-19T02:59:06Z"
        message: Device gpu-0 has become unhealthy.
        reason: DeviceUnhealthy
        status: "False"
        type: Ready
      data: null
      device: gpu-0
      driver: gpu.nvidia.com

guptaNswati · 2025-09-19T04:36:28Z

@ArangoGutierrez @klueska Can I get a preliminary review on this? There are still some tasks left, but it's in a working state.

guptaNswati · 2025-09-19T04:38:05Z

Need to check how to enable the DeviceHealth feature gate (FG) from helm.

dims · 2025-09-23T02:08:45Z

 			Destination: &flags.healthcheckPort,
 			EnvVars:     []string{"HEALTHCHECK_PORT"},
 		},
+		&cli.StringFlag{


Is StringSliceFlag a better choice? (see how we use it in https://github.com/search?q=repo%3ANVIDIA%2Fk8s-dra-driver-gpu%20%2FadditionalNamespaces%2F&type=code)

Comma separation will still work, I think!

Yes, StringSlice is a good idea, but right now getAdditionalXids() https://github.com/NVIDIA/k8s-dra-driver-gpu/pull/545/files#diff-2c2a0bd3b0d6412f7f1307b6ea3cd23f451fba48563771c9cd8ad697ebf9d6c4R238 expects a comma-separated list and handles the slicing itself...

Keeping this open in case I want to add this improvement.
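For readers following the flag discussion, here is a minimal sketch of what parsing a comma-separated XID list could look like. `parseXidList` is a hypothetical name used only for illustration; it is not the PR's `getAdditionalXids()`, whose actual behavior is not shown in this thread.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseXidList is a hypothetical helper (not the PR's getAdditionalXids):
// it turns a comma-separated flag value such as "43, 48" into XID numbers.
func parseXidList(raw string) ([]uint64, error) {
	var xids []uint64
	for _, field := range strings.Split(raw, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue // tolerate empty input and trailing commas
		}
		xid, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid XID %q: %w", field, err)
		}
		xids = append(xids, xid)
	}
	return xids, nil
}

func main() {
	xids, err := parseXidList("43, 48,")
	fmt.Println(xids, err) // prints "[43 48] <nil>"
}
```

With a slice-valued flag the splitting would move into the flag library, and only the per-element integer conversion would remain in the driver.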

guptaNswati · 2025-09-24T19:20:01Z

test of skipped xid:

$ helm upgrade nvidia-dra-driver-gpu  deployments/helm/nvidia-dra-driver-gpu --set featureGates.DeviceHealthCheck=true --set kubeletPlugin.gpus.additionalXidsToIgnore="43"

kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-qzplg  -n nvidia-dra-driver-gpu -c gpus | grep event
I0924 18:24:31.947121       1 device_health.go:58] creating NVML events for device health monitor
I0924 18:24:31.947143       1 device_health.go:68] registering NVML events for device health monitor
I0924 18:28:04.610817       1 device_health.go:175] Skipping event {Device:{Handle:0xe44bad2ffef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
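
The skip decision visible in the last log line can be sketched roughly as follows. This is an illustration under assumptions, not the code in device_health.go: the `event` struct is a trimmed-down stand-in for the NVML event data, `shouldMarkUnhealthy` is a hypothetical name, and the sketch assumes that only XID critical error events (event type 8 in the logs) are considered at all.

```go
package main

import "fmt"

// eventTypeXidCriticalError mirrors the EventType value 8 seen in the logs.
const eventTypeXidCriticalError uint64 = 8

// event is a hypothetical stand-in for the NVML event data in the logs.
type event struct {
	EventType uint64
	EventData uint64 // carries the XID number for XID events
}

// shouldMarkUnhealthy reports whether an event should mark its device
// unhealthy. Assumption: non-XID events are skipped, and XIDs listed in
// ignored (for example from additionalXidsToIgnore) are skipped as well.
func shouldMarkUnhealthy(e event, ignored map[uint64]bool) bool {
	if e.EventType != eventTypeXidCriticalError {
		return false
	}
	return !ignored[e.EventData]
}

func main() {
	e := event{EventType: eventTypeXidCriticalError, EventData: 43}
	fmt.Println(shouldMarkUnhealthy(e, map[uint64]bool{43: true})) // false: XID 43 is ignored
	fmt.Println(shouldMarkUnhealthy(e, nil))                       // true: nothing is ignored
}
```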

Signed-off-by: Swati Gupta <swatig@nvidia.com>

klueska · 2025-10-07T09:37:15Z

+func (s *DeviceState) MarkDeviceUnhealthy(device *AllocatableDevice) {
+	// SWATI: check if a mig device is marked properly
+	s.Lock()
+	defer s.Unlock()
+
+	device.Health = Unhealthy
+	klog.Infof("Marked device:%s unhealthy", device.GetUUID())
+}


Can this take health as a parameter so it can be reused once we have the ability to bring a device back to healthy?
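
A sketch of what that suggestion could look like, with simplified stand-in types. The `HealthStatus` type name and the `UpdateDeviceHealthStatus` method name are assumptions for illustration; only the `Healthy`/`Unhealthy` states and the `Health` field appear in the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// HealthStatus is an assumed type name for the Healthy/Unhealthy states.
type HealthStatus string

const (
	Healthy   HealthStatus = "Healthy"
	Unhealthy HealthStatus = "Unhealthy"
)

// AllocatableDevice and DeviceState are simplified stand-ins for the PR's types.
type AllocatableDevice struct {
	UUID   string
	Health HealthStatus
}

type DeviceState struct {
	sync.Mutex
}

// UpdateDeviceHealthStatus sets the health under the state lock, so the same
// method can later mark a remediated device as Healthy again.
func (s *DeviceState) UpdateDeviceHealthStatus(device *AllocatableDevice, health HealthStatus) {
	s.Lock()
	defer s.Unlock()
	device.Health = health
}

func main() {
	state := &DeviceState{}
	dev := &AllocatableDevice{UUID: "GPU-a4f34abc-7715-3560-dcea-7238b9611a45", Health: Healthy}
	state.UpdateDeviceHealthStatus(dev, Unhealthy)
	fmt.Println(dev.Health) // Unhealthy
	state.UpdateDeviceHealthStatus(dev, Healthy)
	fmt.Println(dev.Health) // Healthy
}
```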

klueska · 2025-10-07T09:53:49Z

+	SearchDeviceGroups:
+		for _, group := range preparedClaim.PreparedDevices {
+			for _, device := range group.Devices {
+				var currentUUID string
+				var currentDeviceName string
+
+				if device.Gpu != nil {
+					currentUUID = device.Gpu.Info.UUID
+					currentDeviceName = device.Gpu.Device.DeviceName
+				} else if device.Mig != nil {
+					currentUUID = device.Mig.Info.UUID
+					currentDeviceName = device.Mig.Device.DeviceName
+				}
+
+				if currentUUID == unhealthyDeviceUUID {
+					klog.V(6).Infof("found matching device: %v for claim: %s", currentDeviceName, claimUID)
+					matchingDeviceName = currentDeviceName
+					break SearchDeviceGroups
+				}
+			}
+		}


You can avoid the labeled break by putting this in a function and returning at the point that you find what you are looking for.
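
As a rough illustration of that suggestion, the nested search can live in a helper that returns as soon as it finds a match. The types below are flattened stand-ins (the PR distinguishes `Gpu` and `Mig` variants), and `findDeviceName` is a hypothetical name.

```go
package main

import "fmt"

// preparedDevice and preparedDeviceGroup are simplified stand-ins for the
// PR's prepared-device types.
type preparedDevice struct {
	UUID       string
	DeviceName string
}

type preparedDeviceGroup struct {
	Devices []preparedDevice
}

// findDeviceName returns the name of the device with the given UUID.
// Returning from inside the nested loops replaces the labeled break.
func findDeviceName(groups []preparedDeviceGroup, uuid string) (string, bool) {
	for _, group := range groups {
		for _, device := range group.Devices {
			if device.UUID == uuid {
				return device.DeviceName, true
			}
		}
	}
	return "", false
}

func main() {
	groups := []preparedDeviceGroup{
		{Devices: []preparedDevice{{UUID: "GPU-aaa", DeviceName: "gpu-0"}}},
		{Devices: []preparedDevice{{UUID: "GPU-bbb", DeviceName: "gpu-1"}}},
	}
	name, ok := findDeviceName(groups, "GPU-bbb")
	fmt.Println(name, ok) // gpu-1 true
}
```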

klueska · 2025-10-07T09:57:32Z

+}
+
+func (d *driver) findClaimByUID(ctx context.Context, claimUID types.UID) (*v1beta1.ResourceClaim, error) {
+	claimList, err := d.state.config.clientsets.Core.ResourceV1beta1().ResourceClaims("").List(ctx, metav1.ListOptions{})


This needs to work not just for the v1beta1 api, but all of v1, v1beta1, and v1beta2. We have helpers to do that in the staging repo.

Yes. I was not sure how to do it. link?

klueska · 2025-10-07T10:04:07Z

+}
+
+func (d *driver) findClaimByUID(ctx context.Context, claimUID types.UID) (*v1beta1.ResourceClaim, error) {
+	claimList, err := d.state.config.clientsets.Core.ResourceV1beta1().ResourceClaims("").List(ctx, metav1.ListOptions{})


I'm not sure how I feel about listing all Resource claims and then searching through them for the matching UID. It feels like we should rather be storing the claim name / namespace in the checkpoint so we can pull it directly and then assert that it has the correct UID.

Yes, this is not efficient. I didn't want to make changes to the existing code, as we are not sure whether we want to take this action or not. This is more of a sample action.

As discussed offline, we may update the checkpoint to add the claim name/namespace for faster lookup when updating the claim status of the unhealthy device, so that there is no need to iterate over all claims to find the needed one.

For another use case, it's already getting updated for computedomain.
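
The direct lookup discussed here could be sketched as below. Everything in this block is hypothetical: `claimRef` models a checkpoint entry that stores name, namespace, and UID, and `claimGetter` abstracts whatever client call fetches a single claim (the real driver would use the Kubernetes client rather than an in-memory map).

```go
package main

import (
	"errors"
	"fmt"
)

// claimRef is a hypothetical checkpoint entry identifying a prepared claim.
type claimRef struct {
	Namespace, Name, UID string
}

// claim is a minimal stand-in for a ResourceClaim object.
type claim struct {
	Namespace, Name, UID string
}

// claimGetter abstracts a single-object GET by namespace and name.
type claimGetter func(namespace, name string) (*claim, error)

var (
	errNotFound      = errors.New("claim not found")
	errClaimReplaced = errors.New("claim has a different UID than the checkpointed one")
)

// mapGetter builds an in-memory claimGetter for demonstration purposes.
func mapGetter(store map[string]*claim) claimGetter {
	return func(namespace, name string) (*claim, error) {
		c, ok := store[namespace+"/"+name]
		if !ok {
			return nil, errNotFound
		}
		return c, nil
	}
}

// getCheckpointedClaim fetches the claim directly and asserts that its UID
// still matches the checkpoint, instead of listing and scanning all claims.
func getCheckpointedClaim(get claimGetter, ref claimRef) (*claim, error) {
	c, err := get(ref.Namespace, ref.Name)
	if err != nil {
		return nil, fmt.Errorf("get claim %s/%s: %w", ref.Namespace, ref.Name, err)
	}
	if c.UID != ref.UID {
		return nil, fmt.Errorf("claim %s/%s: %w", ref.Namespace, ref.Name, errClaimReplaced)
	}
	return c, nil
}

func main() {
	get := mapGetter(map[string]*claim{
		"gpu-test1/pod2-gpu-l8rrx": {Namespace: "gpu-test1", Name: "pod2-gpu-l8rrx", UID: "590e5164"},
	})
	c, err := getCheckpointedClaim(get, claimRef{Namespace: "gpu-test1", Name: "pod2-gpu-l8rrx", UID: "590e5164"})
	fmt.Println(c != nil, err)
}
```

The UID assertion guards against a claim that was deleted and recreated under the same name after the checkpoint was written.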

klueska · 2025-10-07T10:07:23Z

+			if err := d.state.applyClaimDeviceStatuses(ctx, claim.Namespace, claim.Name, ds); err != nil {
+				klog.Errorf("Failed to update status for claim %s/%s: %v", claim.Namespace, claim.Name, err)
+			} else {
+				klog.V(6).Infof("applied unhealthy device status to claim %s/%s", claim.Namespace, claim.Name)
+			}


I don't think we should just give up here. It's likely we may have a conflict when writing, and we need to try again. Instead, we should push a task to the workqueue that will keep retrying until it succeeds.

klueska · 2025-10-07T10:07:58Z

+
+			claim, err := d.findClaimByUID(ctx, types.UID(claimUID))
+			if err != nil {
+				klog.Errorf("Failed to find ResourceClaim object for UID %s: %v", claimUID, err)


This isn't necessarily an error.

klueska · 2025-10-07T10:08:29Z

+			// Update allocated device health status in a given claim
+			result, claimUID, err := d.findDeviceResultAndClaimUID(uuid)
+			if err != nil {
+				klog.Errorf("Device %s is unhealthy, but no associated claim was found: %v", uuid, err)


this is not necessarily an error.

klueska · 2025-10-07T10:10:59Z

+		klog.V(6).Infof("Adding devices health status to claim %s/%s", claim.Namespace, claim.Name)
+		if err := s.applyClaimDeviceStatuses(ctx, claim.Namespace, claim.Name, deviceStatuses...); err != nil {
+			klog.Warningf("Failed to update devices status for claim %s/%s: %v", claim.Namespace, claim.Name, err)
+		}


High level question -- who is going to be reading this status from the ResourceClaim and doing anything with it? I know I suggested looking at DRANet and seeing how they were reporting their health status, but does it even make sense to do this in our driver? Why does DRANet need to do it?

/cc @aojea

DRANET uses the standardized fields for network information on the status, so cni-dra-driver

kubernetes/enhancements#4817

Using standard data for reporting this information in status allows us to build tooling and applications on top, which is especially useful for some use cases of multi-networking or monitoring.

klueska · 2025-10-07T10:13:39Z

Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it.

As discussed in the team meeting, we need to bring this back for the initial release of these health checks. The "right" way to mark the GPUs as unhealthy will be with device taints, but those are still an alpha feature, and we need some way to mark them as unhealthy / unschedulable in the interim.
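
The interim approach amounts to leaving unhealthy devices out of what gets published. A minimal sketch, using a hypothetical flattened `device` type rather than the driver's `AllocatableDevice`:

```go
package main

import "fmt"

// device is a hypothetical, flattened stand-in for an allocatable device.
type device struct {
	Name    string
	Healthy bool
}

// publishableDevices returns only the healthy devices, preserving order, so
// that unhealthy ones are not advertised until device taints can be used.
func publishableDevices(all []device) []device {
	var healthy []device
	for _, d := range all {
		if d.Healthy {
			healthy = append(healthy, d)
		}
	}
	return healthy
}

func main() {
	all := []device{{Name: "gpu-0", Healthy: false}, {Name: "gpu-1", Healthy: true}}
	for _, d := range publishableDevices(all) {
		fmt.Println(d.Name) // gpu-1
	}
}
```

This matches the earlier ResourceSlice output in this thread, where only gpu-1 remains listed after gpu-0 was marked unhealthy.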

jgehrcke · 2025-10-09T10:25:38Z

+			d.state.MarkDeviceUnhealthy(device)
+
+			// Update allocated device health status in a given claim
+			result, claimUID, err := d.findDeviceResultAndClaimUID(uuid)


Does d.findDeviceResultAndClaimUID(uuid) only operate on local file system state (checkpoint data)? Or is any networking interaction or IPC involved?

I would find it helpful to clarify that in the code comment right above that call, or to even reflect that in the method name.

local checkpoint data.

jgehrcke · 2025-10-09T10:31:13Z

+				continue
+			}
+
+			claim, err := d.findClaimByUID(ctx, types.UID(claimUID))


Could we add a code comment here explaining why at this point it is relevant to look up the full claim object?

That code comment would clarify the importance of having that data at all, and having it fresh. It would also clarify how much of a problem it is if we do not find the claim, and how much of a problem it is to use an outdated version of the claim object.

That's so important to have clarified clearly, and maybe we want to discuss that goal statement (specification) before getting too deep into the weeds of the error handling / retrying behavior implementation review.

guptaNswati · 2025-10-09T18:20:35Z

Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it.

As discussed in the team meeting, we need to bring this back for the initial release of these health checks. The "right" way to mark the GPUs as unhealthy will be with device taints, but those are still an alpha feature, and we need some way to mark them as unhealthy / unschedulable in the interim.

For this to work, we also need reconciliation to bring them back once the GPU has been remediated and is healthy again.

guptaNswati · 2025-10-14T23:33:05Z

Quick MIG test logs:

I1014 23:30:24.276796       1 device_health.go:179] Processing event {Device:{Handle:0xfe835631fef0} EventType:8 EventData:43 GpuInstanceId:7 ComputeInstanceId:0}
I1014 23:30:24.276920       1 device_health.go:192] Event for mig device: &{<nil> 0x40005a3900 Healthy}
I1014 23:30:24.276949       1 device_health.go:202] Sending unhealthy notification for device MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 due to event type: 8 and event data: 43
W1014 23:30:24.276999       1 driver.go:221] Received unhealthy notification for device: MIG-4d806f22-346a-5a1d-ac01-86b505cdf485
I1014 23:30:24.277025       1 device_state.go:590] Marked device:MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 unhealthy
E1014 23:30:24.277488       1 driver.go:229] Device MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 is unhealthy, but no associated claim was found: unable to find device result and claim uid for MIG-4d806f22-346a-5a1d-ac01-86b505cdf485

Signed-off-by: Swati Gupta <swatig@nvidia.com>

guptaNswati · 2025-10-17T20:46:01Z

Closing this in favor of #689

guptaNswati added 2 commits September 4, 2025 22:39

preliminary device health monitor

5069e1c

Signed-off-by: Swati Gupta <swatig@nvidia.com>

publish health status

a4566a9

Signed-off-by: Swati Gupta <swatig@nvidia.com>

klueska added this to the v25.8.0 milestone Sep 8, 2025

klueska assigned guptaNswati Sep 8, 2025

ArangoGutierrez requested review from ArangoGutierrez and Copilot September 8, 2025 16:02

Copilot AI reviewed Sep 8, 2025

status update fixes

407562d

Signed-off-by: Swati Gupta <swatig@nvidia.com>

klueska added the feature label Sep 11, 2025

handle mig devices

d1852f0

Signed-off-by: Swati Gupta <swatig@nvidia.com>

guptaNswati force-pushed the gpu-health-check branch from 717656d to d1852f0 Compare September 17, 2025 21:46

klueska modified the milestones: v25.8.0, v25.12.0, v25.8.1 Sep 18, 2025

Update device condition in resourceclaim on health event

0e6516d

Signed-off-by: Swati Gupta <swatig@nvidia.com>

guptaNswati changed the title ~~Draft: Gpu health check~~ Gpu health check Sep 19, 2025

guptaNswati requested a review from klueska September 19, 2025 04:36

dims reviewed Sep 23, 2025

refractor device status

260ea86

Signed-off-by: Swati Gupta <swatig@nvidia.com>