fix(gpu): do not abort plugin start in case no device is found #2671

Merged
eball merged 1 commit into main from gpu/fix/startup_reinit on Mar 12, 2026

@dkeven dkeven commented Mar 12, 2026

  • Background
    Support for GPU hot plug/unplug was added to the GPU device plugin in feat(device-plugin): supports dynamic detection of hot plugged in GPUs HAMi#13.
    In rare cases, the NVIDIA driver loses control of the only GPU (e.g. due to a PCI timeout) and reinitializes it shortly afterwards. If this happens while the device plugin is starting and before the reinitialization has finished, NVML may report that no devices were found, causing the device plugin to abort its startup. As a result, the GPU is still not reported by the device plugin once the reinitialization finishes. This change makes the no-device case also eligible for restarting the device plugin after a delay.

  • Target Version for Merge
    1.12.5, 1.12.6

  • Related Issues
    none

  • PRs Involving Sub-Systems
    fix(device-plugin): do not abort plugin start in case no device is found HAMi#16

  • Other information:
    none
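
The behavior described under Background can be illustrated with a minimal sketch. This is not the code from this PR or from HAMi#16 (the actual device plugin is a separate code base and talks to NVML); it only shows the general idea of treating "no devices found" as a retryable condition instead of a fatal one. All names here (`start_plugin_with_retry`, `discover`, `retry_delay_s`, `max_attempts`, `NoDevicesFound`) are hypothetical, and the discovery callback stands in for whatever enumeration the real plugin performs.

```python
import time
from typing import Callable, List


class NoDevicesFound(Exception):
    """Hypothetical error: discovery never reported a device."""


def start_plugin_with_retry(
    discover: Callable[[], List[str]],
    retry_delay_s: float = 5.0,
    max_attempts: int = 0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll `discover` until it returns at least one device.

    An empty result is treated as transient (for example, the driver may
    still be reinitializing the GPU), so the start is retried after
    `retry_delay_s` seconds instead of being aborted.
    `max_attempts=0` means retry indefinitely (an assumption of this sketch).
    """
    attempt = 0
    while True:
        attempt += 1
        devices = discover()
        if devices:
            # At least one device is visible: startup can proceed.
            return devices
        if max_attempts and attempt >= max_attempts:
            raise NoDevicesFound(f"no devices after {attempt} attempts")
        # No devices yet: wait, then try again rather than giving up.
        sleep(retry_delay_s)


if __name__ == "__main__":
    # Simulated discovery: empty twice (GPU reinitializing), then one GPU.
    results = iter([[], [], ["GPU-0"]])
    print(start_plugin_with_retry(lambda: next(results), retry_delay_s=0.01))
```

The `sleep` parameter is injectable only so that the delay can be replaced in tests. Whether the real plugin retries indefinitely or gives up after some number of attempts is not stated in this PR, so the sketch leaves that configurable.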

vercel bot commented Mar 12, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
olares-docs Ignored Ignored Mar 12, 2026 8:02am

@eball eball merged commit 622d7a3 into main Mar 12, 2026
12 checks passed