From 1d15075ae05453bd02866aed1ff45ae25239d014 Mon Sep 17 00:00:00 2001
From: Christopher Desiniotis
Date: Tue, 19 May 2026 10:46:47 -0700
Subject: [PATCH] fix: ensure systemd service is restarted if nvidia-smi fails

We run nvidia-smi in an ExecStart statement instead of an ExecCondition
statement so that the unit gets restarted in case of failures. With
ExecCondition, if the command returns an error code between 1 and 254
(inclusive) the remaining commands are skipped and the unit is not marked
as failed, and thus the service is not restarted.

Signed-off-by: Christopher Desiniotis
(cherry picked from commit 57b8d1c75efdee79401ced16b001fe6006b0b164)
---
 deployments/systemd/nvidia-cdi-refresh.service | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/deployments/systemd/nvidia-cdi-refresh.service b/deployments/systemd/nvidia-cdi-refresh.service
index 9a6e0f5b7..7c483c81a 100644
--- a/deployments/systemd/nvidia-cdi-refresh.service
+++ b/deployments/systemd/nvidia-cdi-refresh.service
@@ -28,7 +28,7 @@ Type=oneshot
 Environment=NVIDIA_CTK_CDI_OUTPUT_FILE_PATH=/var/run/cdi/nvidia.yaml
 EnvironmentFile=-/etc/nvidia-container-toolkit/nvidia-cdi-refresh.env
 ExecCondition=/bin/sh -c '/usr/bin/grep -qE "/(nvidia|nvidia-current)[.]ko" /lib/modules/%v/modules.dep || [ -e /dev/dxg ]'
-ExecCondition=/bin/sh -c '/usr/bin/nvidia-smi -L || /usr/sbin/nvidia-smi -L || /usr/lib/wsl/lib/nvidia-smi -L'
+ExecStart=/bin/sh -c '/usr/bin/nvidia-smi -L || /usr/sbin/nvidia-smi -L || /usr/lib/wsl/lib/nvidia-smi -L'
 ExecStart=/usr/bin/nvidia-ctk cdi generate
 CapabilityBoundingSet=CAP_SYS_MODULE CAP_SYS_ADMIN CAP_MKNOD
 # We set the service to restart on failure to ensure that a CDI spec is