Enhanced Error-handling config
Current State
See https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
The NVIDIA GPU Device Plugin
We register for NVML events of type `nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError`.
We treat the following XIDs as non-fatal errors:

| XID | Description |
| --- | --- |
| 13 | Graphics Engine Exception |
| 31 | GPU memory page fault |
| 43 | GPU stopped processing |
| 45 | Preemptive cleanup, due to previous errors |
| 68 | Video processor exception |
| 109 | Context Switch Timeout Error |

We allow additional XIDs to be specified in the `DP_DISABLE_HEALTHCHECKS` envvar with the following logic:
- If the value is `xids` or `all`, we disable health checks entirely.
- Otherwise, the value is a comma-separated list of numeric XIDs to ignore, e.g. `109,68`.
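
The envvar logic above can be sketched in Go as follows. This is a minimal illustration, not the plugin's actual code: the helper name `parseDisableHealthchecks`, the case-insensitive matching, and the choice to skip non-numeric entries are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// defaultIgnoredXIDs mirrors the non-fatal XIDs listed in the table above.
var defaultIgnoredXIDs = []uint64{13, 31, 43, 45, 68, 109}

// parseDisableHealthchecks is a hypothetical helper that interprets the value
// of DP_DISABLE_HEALTHCHECKS. It reports whether health checks are disabled
// entirely and returns the set of XIDs to ignore (defaults plus extras).
func parseDisableHealthchecks(value string) (disabled bool, ignored map[uint64]bool) {
	ignored = make(map[uint64]bool)
	for _, xid := range defaultIgnoredXIDs {
		ignored[xid] = true
	}

	// Assumption: matching of "xids" and "all" is case-insensitive.
	value = strings.ToLower(strings.TrimSpace(value))
	if value == "xids" || value == "all" {
		return true, ignored
	}

	for _, field := range strings.Split(value, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		xid, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			// Assumption: non-numeric entries are skipped rather than rejected.
			continue
		}
		ignored[xid] = true
	}
	return false, ignored
}

func main() {
	disabled, ignored := parseDisableHealthchecks("109,68,79")
	fmt.Println(disabled, ignored[79], ignored[13], ignored[48])
	// prints "false true true false"
}
```

Returning the merged set keeps the defaults and the user-supplied XIDs in one lookup, which is convenient when each incoming event has to be checked.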
The GKE Device Plugin
See https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/0509b1f9f4b9a357b44ba65e7b508ded8bd5ecf0/pkg/gpu/nvidia/health_check/health_checker.go#L41
By default the following error is checked:

| XID | Description |
| --- | --- |
| 48 | Double-bit ECC Error |

The `XID_CONFIG` envvar is used to specify a comma-separated list of additional XIDs to treat as critical.
Proposal
Add the following config section:

    version: v1
    health:
      disabled: false
      eventTypes: [EventTypeXidCriticalError, EventTypeDoubleBitEccError, EventTypeSingleBitEccError]
      ignoredXIDs: [13, 31, 43, 45, 68]
      criticalXIDs: all

GKE defaults:

    version: v1
    health:
      disabled: false
      eventTypes: [EventTypeXidCriticalError]
      ignoredXIDs: []
      criticalXIDs: [48]

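
One way the two lists could interact is sketched below in Go. The proposal does not spell out precedence, so this example assumes that `ignoredXIDs` wins over `criticalXIDs` and that `criticalXIDs: all` marks every non-ignored XID as critical. The `HealthConfig` type, its field names, and `IsCritical` are hypothetical, and YAML decoding is omitted to keep the sketch dependency-free.

```go
package main

import "fmt"

// HealthConfig is a hypothetical in-memory form of the proposed `health` section.
type HealthConfig struct {
	Disabled     bool
	EventTypes   []string
	IgnoredXIDs  []uint64
	CriticalXIDs []uint64
	CriticalAll  bool // true when criticalXIDs is the string "all"
}

// contains reports whether x is present in xs.
func contains(xs []uint64, x uint64) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// IsCritical reports whether an XID should mark the device unhealthy.
// Assumption: ignored XIDs take precedence over critical XIDs.
func (c HealthConfig) IsCritical(xid uint64) bool {
	if c.Disabled || contains(c.IgnoredXIDs, xid) {
		return false
	}
	return c.CriticalAll || contains(c.CriticalXIDs, xid)
}

func main() {
	nvidia := HealthConfig{
		EventTypes:  []string{"EventTypeXidCriticalError", "EventTypeDoubleBitEccError", "EventTypeSingleBitEccError"},
		IgnoredXIDs: []uint64{13, 31, 43, 45, 68},
		CriticalAll: true,
	}
	gke := HealthConfig{
		EventTypes:   []string{"EventTypeXidCriticalError"},
		CriticalXIDs: []uint64{48},
	}

	for _, xid := range []uint64{13, 48, 79} {
		fmt.Printf("XID %d: nvidia=%t gke=%t\n", xid, nvidia.IsCritical(xid), gke.IsCritical(xid))
	}
	// prints:
	// XID 13: nvidia=false gke=false
	// XID 48: nvidia=true gke=true
	// XID 79: nvidia=true gke=false
}
```

Under these assumptions, both sets of defaults reproduce the behavior described in Current State: the NVIDIA plugin treats everything outside its ignore list as fatal, while the GKE plugin treats only XID 48 as fatal.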