Current behavior:
When dynamically switching a node from a default/unpartitioned state to an MPS state via node labels (e.g., kubectl label node <node-name> nvidia.com/device-plugin.config=mps-10x), the config-manager container panics with "index out of range [0] with length 0" and enters a CrashLoopBackOff state.
This issue is observed in GPU Operator v26.3.0 (which uses k8s-device-plugin:v0.19.0). The dynamic hot-reload switching worked perfectly in the previous version (v25.10.1).
Expected behavior:
The config-manager should safely handle the transition. If the mps-control-daemon process is not yet running (because it was sleeping in the unpartitioned state), the config manager should either gracefully start it or handle the empty PID array without panicking, restoring the dynamic switching capability.
Root Cause Analysis:
After reviewing the source code for v0.19.0, I found two related issues causing this regression:
Missing safety check in findPidToSignal:
In cmd/config-manager/main.go, when attempting to send a SIGHUP, the code does not check the length of the pids slice before accessing pids[0]. Because the daemon is sleeping ("Waiting indefinitely") in the unpartitioned state, no matching process is found and the index expression panics.
// cmd/config-manager/main.go (approx line 456)
targetPid := pids[0] // Panics here if len(pids) == 0
Race condition introduced by lastRead pointer change:
The daemon is sleeping in the first place because of a recent change in SyncableConfig.Get(), where lastRead became a pointer (*string).
if m.lastRead != nil && *m.lastRead == m.current {
	m.cond.Wait()
}
During the initial startup, lastRead is nil. This causes the condition to be skipped, preventing the initial Wait(). The config-manager immediately returns an empty config, putting the daemon to sleep before the new node label can be fully processed. When the label is finally detected milliseconds later, it triggers the SIGHUP logic to a non-existent process.
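One possible way to keep the first Get() blocked until a value has actually been set is sketched below. It is an illustration under assumptions, not the project's implementation: the syncableConfig type, the isSet field, and newSyncableConfig are hypothetical, and the real SyncableConfig may have reasons for allowing an unset first read.

```go
package main

import (
	"fmt"
	"sync"
)

// syncableConfig is a hypothetical stand-in for the real SyncableConfig.
type syncableConfig struct {
	mu       sync.Mutex
	cond     *sync.Cond
	current  string
	isSet    bool    // assumption: tracks whether Set has ever been called
	lastRead *string // nil until the first successful Get
}

func newSyncableConfig() *syncableConfig {
	c := &syncableConfig{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// Set stores a new value and wakes any waiting readers.
func (c *syncableConfig) Set(value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.current = value
	c.isSet = true
	c.cond.Broadcast()
}

// Get blocks until a value exists that differs from the last one read.
// The loop re-checks the condition after every wake-up.
func (c *syncableConfig) Get() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	for !c.isSet || (c.lastRead != nil && *c.lastRead == c.current) {
		c.cond.Wait()
	}
	value := c.current
	c.lastRead = &value
	return value
}

func main() {
	cfg := newSyncableConfig()
	got := make(chan string)
	go func() { got <- cfg.Get() }() // blocks until Set is called
	cfg.Set("mps-10x")
	fmt.Println(<-got)
}
```

The explicit isSet flag separates "never set" from "set to an empty string", which a nil check on lastRead alone cannot do.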
Steps to reproduce:
Deploy GPU Operator v26.3.0 without a default MPS configuration (leaving the node in an unpartitioned state).
The mps-control-daemon container logs will show: No devices are configured for MPS sharing; Waiting indefinitely.
Apply the label to trigger the config-manager: kubectl label node <node-name> nvidia.com/device-plugin.config=mps-10x
The config-manager container immediately panics and restarts.
Information to attach (optional if deemed irrelevant)
The config-manager container logs showing the panic:
I0424 00:27:57.648037 1359495 main.go:249] Label change detected: nvidia.com/device-plugin.config=mps-10x
I0424 00:27:57.648118 1359495 main.go:305] Updating to config: mps-10x
I0424 00:27:57.648344 1359495 main.go:320] Successfully updated to config: mps-10x
I0424 00:27:57.648353 1359495 main.go:324] Sending signal 'hangup' to '/usr/bin/mps-control-daemon'
panic: runtime error: index out of range [0] with length 0
goroutine 1 [running]:
main.findPidToSignal(0x24c3d4b51080)
/build/cmd/config-manager/main.go:456 +0x206
main.signalProcess(0x24c3d4b51080)
/build/cmd/config-manager/main.go:435 +0x1c
main.updateConfig({0x24c3d49d6830?, 0x2?}, 0x24c3d4b51080)
/build/cmd/config-manager/main.go:325 +0x354
...