fix mps env var for chroot #1143

Open
guptaNswati wants to merge 1 commit into kubernetes-sigs:main from guptaNswati:fix-mps-env
Conversation

@guptaNswati
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

Special notes for your reviewer:

Found the MPS control daemon crash-looping because of a missing environment variable:

dra-driver-nvidia-gpu   mps-control-daemon-a0d2f2f3-f0e4-42ff-ab94-eec4b82db2ee-26qs7bc   0/1     CrashLoopBackOff    9 (4m24s ago)   21m
gpu-test-mps            test-pod                                                          0/2     ContainerCreating   0               21m

Startup probe failed: cat: can't open '/driver-root/var/log/nvidia-mps/startup.log': No such file or directory

CUDA_MPS_PIPE_DIRECTORY must be set to start or to communicate with MPS control daemon
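
The message above suggests the control daemon was started with `CUDA_MPS_PIPE_DIRECTORY` unset or empty. As a minimal sketch of a pre-flight check that would surface this before the daemon starts, assuming a hypothetical helper named `require_mps_env` (it is not part of the driver):

```shell
#!/bin/sh
# Hypothetical helper (not part of the driver): fail early when any of the
# named variables is unset or empty in the current environment.
require_mps_env() {
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required env var: $name" >&2
            return 1
        fi
    done
    return 0
}
```

A caller could run `require_mps_env CUDA_MPS_PIPE_DIRECTORY || exit 1` just before launching the daemon, so a missing variable shows up as a clear error rather than a failed startup probe.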

Does this PR introduce a user-facing change?

None

Additional documentation (design docs, usage docs, etc.):

Tested on GB10 and A30.

After setting the env var:

nvidia-smi pmon
# gpu         pid   type     sm    mem    enc    dec    jpg    ofa    command 
# Idx           #    C/G      %      %      %      %      %      %    name 
    0     106525     C     95      0      -      -      -      -    nvidia-cuda-mps
    0     126081   M+C      -      -      -      -      -      -    sample         
    0     126162   M+C      -      -      -      -      -      -    sample         
    0     106525     C     95      0      -      -      -      -    nvidia-cuda-mps
    
 
kubectl logs mps-control-daemon-9512d551-1b8e-45b7-9f2e-2596fd262980-26fq46p  -n dra-driver-nvidia-gpu 
50.0
[2026-05-14 19:17:18.513 Control  1372] Starting control daemon using socket /tmp/nvidia-mps/control
[2026-05-14 19:17:18.513 Control  1372] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
[2026-05-14 19:17:18.513 Control  1372] CUDA MPS Control binary version: 13000
[2026-05-14 19:17:18.516 Control  1372] Accepting connection...
[2026-05-14 19:17:18.516 Control  1372] NEW UI
[2026-05-14 19:17:18.516 Control  1372] Cmd:set_default_active_thread_percentage 50
[2026-05-14 19:17:18.516 Control  1372] 50.0
[2026-05-14 19:17:18.516 Control  1372] UI closed
[2026-05-14 19:17:18.518 Control  1372] Accepting connection...
[2026-05-14 19:17:18.518 Control  1372] NEW UI
[2026-05-14 19:17:18.518 Control  1372] Cmd:set_default_device_pinned_mem_limit GPU-a2f2413a-51f3-1b29-11b5-4503f4ad3078 10240M
[2026-05-14 19:17:18.518 Control  1372] set_default_device_pinned_mem_limit GPU-a2f2413a-51f3-1b29-11b5-4503f4ad3078 10240M.
[2026-05-14 19:17:18.518 Control  1372] UI closed
[2026-05-14 19:17:32.347 Control  1372] Accepting connection...
[2026-05-14 19:17:32.347 Control  1372] User did not send valid credentials
[2026-05-14 19:17:32.347 Control  1372] Accepting connection...
[2026-05-14 19:17:32.347 Control  1372] NEW CLIENT 1594 from user 0: Server is not ready, push client to pending list
[2026-05-14 19:17:32.347 Control  1596] Starting new server 1596 for user 0
[2026-05-14 19:17:32.350 Control  1372] Accepting connection...
[2026-05-14 19:17:32.645 Control  1372] NEW SERVER 1596: Ready
[2026-05-14 19:17:32.646 Control  1372] Accepting connection...
[2026-05-14 19:17:32.646 Control  1372] User did not send valid credentials
[2026-05-14 19:17:32.646 Control  1372] Accepting connection...
[2026-05-14 19:17:32.646 Control  1372] NEW CLIENT 1674 from user 0: Server already exists
[2026-05-14 19:17:32.726 Control  1372] Accepting connection...
[2026-05-14 19:17:32.726 Control  1372] NEW CLIENT 1594 from user 0: Server already exists
[2026-05-14 19:17:32.747 Control  1372] Accepting connection...
[2026-05-14 19:17:32.747 Control  1372] NEW CLIENT 1674 from user 0: Server already exists

@guptaNswati guptaNswati added this to the v0.4.1 milestone May 14, 2026
@guptaNswati guptaNswati self-assigned this May 14, 2026
@guptaNswati guptaNswati added the kind/bug Categorizes issue or PR as related to a bug. label May 14, 2026
@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label May 14, 2026
@netlify

netlify Bot commented May 14, 2026

Deploy Preview for dra-driver-nvidia-gpu ready!

Name Link
🔨 Latest commit da1b4bf
🔍 Latest deploy log https://app.netlify.com/projects/dra-driver-nvidia-gpu/deploys/6a06286764cf5d0008bffa28
😎 Deploy Preview https://deploy-preview-1143--dra-driver-nvidia-gpu.netlify.app

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: guptaNswati
Once this PR has been reviewed and has the lgtm label, please assign varunrsekar for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from klueska and tariq1890 May 14, 2026 19:54
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 14, 2026
if [ -x /driver-root/bin/sh ] || [ -x /driver-root/usr/bin/sh ]; then
# Use chroot to avoid library mismatch between container and host
# when driver root is / (default value) or /run/nvidia/driver (default location for driver installation by GPU Operator)
# Export the paths explicitly for the chroot environment
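
The hunk above is shown only partially, so the actual change is not reproduced here. As a rough illustration of the idea in the last comment, the sketch below exports the MPS variables before calling `chroot`, which passes the caller's environment through unchanged. The function name `run_mps_in_driver_root` is hypothetical, and the fallback values are assumptions (the pipe directory matches the daemon log in this PR; the log directory is the documented MPS default), not values taken from the driver's template.

```shell
#!/bin/sh
# Sketch only: run a command inside the driver root with the MPS variables
# exported explicitly. Names and fallback values are assumptions.
run_mps_in_driver_root() {
    driver_root=$1
    shift
    # Keep any value already provided (for example via CDI edits);
    # otherwise fall back to the assumed defaults.
    export CUDA_MPS_PIPE_DIRECTORY="${CUDA_MPS_PIPE_DIRECTORY:-/tmp/nvidia-mps}"
    export CUDA_MPS_LOG_DIRECTORY="${CUDA_MPS_LOG_DIRECTORY:-/var/log/nvidia-mps}"
    # chroot inherits the exported environment of this shell.
    chroot "$driver_root" "$@"
}
```

For example, `run_mps_in_driver_root /driver-root nvidia-cuda-mps-control -d` would start the control daemon under the driver root with both variables set. Because existing values are preserved, the fallback would only apply when nothing else has set them.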
Contributor

@shivamerla shivamerla May 14, 2026

These are set via the CDI edits that we generate. Maybe you have CDI disabled on your system?

Contributor Author

I need to check that; I have lost access to the system.
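
One way to check the CDI question on a node is to search the generated CDI specs for the variable. The sketch below assumes a hypothetical helper named `find_cdi_env` and falls back to `/etc/cdi` and `/var/run/cdi`, the default CDI spec directories; a given runtime or driver may be configured to use different ones.

```shell
#!/bin/sh
# Hypothetical helper: list CDI spec files that mention a given env var.
# Usage: find_cdi_env VAR_NAME [DIR...]
find_cdi_env() {
    var_name=$1
    shift
    # Fall back to the default CDI spec directories when none are given.
    [ "$#" -gt 0 ] || set -- /etc/cdi /var/run/cdi
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # Print matching file names; no match is not treated as an error.
        grep -rl -- "$var_name" "$dir" 2>/dev/null || true
    done
    return 0
}
```

Running `find_cdi_env CUDA_MPS_PIPE_DIRECTORY` and getting no output would be consistent with the variable not being injected through CDI on that node, though it does not by itself show that CDI is disabled.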

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

Status: Backlog

3 participants