What is the version?
4.5.2-4.8.1
What happened?
The dcgm container crashed at start with very few logs (--debug and the dcgm log were enabled):
"the libdcgm.so.4 library was not found".
The same image starts without issues on EKS 1.33.
Investigation revealed that /etc/ld.so.cache was missing a libdcgm.so.4 entry when running on 1.34.
Changing the pod command and arguments to refresh the ldconfig cache before starting the exporter allowed it to start normally:
- args:
  - |
    rm /etc/ld.so.cache; ldconfig; exec /usr/local/dcgm/dcgm-exporter-entrypoint.sh -f /etc/dcgm-exporter/default-counters.csv
  command:
  - /bin/bash
  - -exc
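To confirm the diagnosis before applying the workaround, the cache can be inspected from inside the container. This is only a minimal sketch, assuming the image ships a shell, grep, and ldconfig; `has_lib_entry` is a hypothetical helper name, not part of dcgm-exporter. It only reads the cache and does not modify it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if `ldconfig -p`-style output on stdin
# lists the library name given as $1.
has_lib_entry() {
    # Entries look like: "<tab>libfoo.so.1 (libc6,x86-64) => /path/libfoo.so.1"
    # -F matches the name literally; the trailing space avoids prefix matches
    # such as libfoo.so.10.
    grep -qF -- "$1 "
}

# Report whether the dynamic linker cache currently knows about libdcgm.so.4.
# If ldconfig is unavailable, the pipeline yields no input and the library
# is reported as missing.
if ldconfig -p 2>/dev/null | has_lib_entry "libdcgm.so.4"; then
    echo "libdcgm.so.4 is listed in the ld.so cache"
else
    echo "libdcgm.so.4 is NOT listed in the ld.so cache"
fi
```

Running this on both a 1.33 and a 1.34 node should show whether the cache contents differ between the two environments.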
What did you expect to happen?
The exporter starts normally and serves metrics.
What is the GPU model?
AWS g6.xlarge
What is the environment?
EKS 1.34
bottlerocket-nvidia 1.34
How did you deploy the dcgm-exporter and what is the configuration?
Deployed using helm-chart.
How to reproduce the issue?
Run dcgm-exporter on an EKS 1.34 g6 node using bottlerocket 1.54+.
Anything else we need to know?
- All three images failed (distroless, ubi9 & ubuntu22.04).
- All three images started normally on EKS 1.33 with bottlerocket 1.51.
- Nodes were started using karpenter; the following userData was used:
[settings.boot]
reboot-to-reconcile = true
[settings.boot.kernel-parameters]
"psi" = ["1"]
"nvidia.NVreg_EnableGpuFirmware" = ["1"]
[settings.kubernetes.nvidia.container-runtime]
"visible-devices-as-volume-mounts" = true
"visible-devices-envvar-when-unprivileged" = true
[settings.metrics]
send-metrics = false