[release-1.19] fix: ensure systemd service is restarted if nvidia-smi fails #1837

Merged: cdesiniotis merged 1 commit into release-1.19 from backport-1836-to-release-1.19 on May 19, 2026

@github-actions

🤖 Automated backport of #1836 to release-1.19

✅ Cherry-pick completed successfully with no conflicts.

Original PR: #1836
Original Author: @cdesiniotis

Cherry-picked commits (1):

  • 57b8d1c fix: ensure systemd service is restarted if nvidia-smi fails

This backport was automatically created by the backport bot.

We run nvidia-smi in an ExecStart statement instead of an ExecCondition statement
so that the unit gets restarted in case of failures. With ExecCondition, if the
command returns an error code between 1 and 254 (inclusive) the remaining commands
are skipped and the unit is not marked as failed, and thus the service is not
restarted.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
(cherry picked from commit 57b8d1c)
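
The exit-code behavior described in the commit message can be sketched as a small model. This is an illustration only, not code from the PR: the names `Outcome`, `exec_condition_outcome`, `exec_start_outcome`, and `restart_on_failure` are hypothetical, the model covers plain exit codes (not termination by signal), and it assumes the unit uses a `Restart=on-failure` style policy with no `SuccessExitStatus=` override or `-` prefix on the command. The PR text does not state the unit's actual restart settings.

```python
from enum import Enum


class Outcome(Enum):
    CONTINUE = "continue"  # command succeeded, remaining commands run
    SKIPPED = "skipped"    # remaining commands skipped, unit NOT marked failed
    FAILED = "failed"      # unit marked failed


def _check(exit_code: int) -> None:
    # Process exit codes are limited to the range 0..255.
    if not 0 <= exit_code <= 255:
        raise ValueError(f"exit code out of range: {exit_code}")


def exec_condition_outcome(exit_code: int) -> Outcome:
    """Model how an ExecCondition= command's exit code is treated."""
    _check(exit_code)
    if exit_code == 0:
        return Outcome.CONTINUE
    if exit_code <= 254:
        # 1..254 inclusive: skip the rest, but do not fail the unit.
        return Outcome.SKIPPED
    return Outcome.FAILED  # exit code 255


def exec_start_outcome(exit_code: int) -> Outcome:
    """Model an ExecStart= command (assumes no '-' prefix or SuccessExitStatus=)."""
    _check(exit_code)
    return Outcome.CONTINUE if exit_code == 0 else Outcome.FAILED


def restart_on_failure(outcome: Outcome) -> bool:
    """Assumed policy: only a failed unit is restarted."""
    return outcome is Outcome.FAILED


if __name__ == "__main__":
    # Hypothetical non-zero exit code standing in for an nvidia-smi failure.
    code = 9
    for name, classify in (("ExecCondition", exec_condition_outcome),
                           ("ExecStart", exec_start_outcome)):
        outcome = classify(code)
        print(name, outcome.value, "restart:", restart_on_failure(outcome))
```

Under these assumptions, the same non-zero exit code leaves the unit skipped (and never restarted) when the command runs as an `ExecCondition`, but marks the unit failed (and eligible for restart) when it runs as an `ExecStart`, which is the difference the fix relies on.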

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cdesiniotis merged commit 77d9c4d into release-1.19 on May 19, 2026
1 check passed
cdesiniotis deleted the backport-1836-to-release-1.19 branch on May 19, 2026 17:58