fix(device-plugin): allow devices to recover to healthy after xid error#10

Merged
dkeven merged 1 commit into feat/nvshare from fix/xid_recovery
Jan 14, 2026
@dkeven dkeven commented Jan 14, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

In #9, we made devices that were marked unhealthy due to NVML errors able to recover after some time. However, when NVML itself works normally but reports certain Xid errors, the affected devices are still marked unhealthy and cannot recover. Following the same approach as the NVML error recovery mechanism, this PR makes such devices recoverable after a certain time as well.

@dkeven dkeven merged commit 01a49a2 into feat/nvshare Jan 14, 2026
1 check passed
@dkeven dkeven deleted the fix/xid_recovery branch January 14, 2026 09:18