Exporter crashes at start when loading libdcgm.so.4 on EKS 1.34 / Bottlerocket 1.54 #636

@mindw

Description

What is the version?

4.5.2-4.8.1

What happened?

The dcgm container crashed at start with very few logs (--debug and DCGM logging were enabled):
"the libdcgm.so.4 library was not found".
The same image starts without issues on EKS 1.33.
Investigation revealed that /etc/ld.so.cache was missing a libdcgm.so.4 entry when running on 1.34.
Changing the pod command and arguments to refresh the ldconfig cache before starting the exporter allowed it to start normally:

      - args:
        - |
          rm /etc/ld.so.cache; ldconfig; exec /usr/local/dcgm/dcgm-exporter-entrypoint.sh -f /etc/dcgm-exporter/default-counters.csv
        command:
        - /bin/bash
        - -exc
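
For anyone trying to confirm the same symptom, here is a minimal sketch of checking for the missing cache entry from a shell inside the container. It assumes the image provides `sh`, `grep` and `ldconfig` (the workaround above already relies on bash and ldconfig being present). The function names `cache_lists` and `ensure_lib_cached` are hypothetical helpers and not part of dcgm-exporter.

```shell
#!/bin/sh
# Hypothetical helper: reads an `ldconfig -p` style listing on stdin and
# succeeds if the given library name appears in it. The trailing space in
# the pattern avoids matching longer names such as libdcgm.so.40.
cache_lists() {
    grep -F -q -- "$1 "
}

# Hypothetical wrapper: rebuilds the linker cache only when the library
# is not listed, instead of deleting /etc/ld.so.cache unconditionally.
ensure_lib_cached() {
    lib="$1"
    if ldconfig -p | cache_lists "$lib"; then
        echo "$lib is present in the linker cache"
    else
        echo "$lib is missing from the linker cache, running ldconfig" >&2
        ldconfig
    fi
}
```

With these definitions sourced, `ensure_lib_cached libdcgm.so.4` could be run before the `exec` of the entrypoint script. This has not been tested on the affected nodes; it only narrows the workaround to the case where the entry is actually absent.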

What did you expect to happen?

The exporter starts normally and serves metrics.

What is the GPU model?

AWS g6.xlarge

What is the environment?

EKS 1.34
bottlerocket-nvidia 1.34

How did you deploy the dcgm-exporter and what is the configuration?

Deployed using helm-chart.

How to reproduce the issue?

Run dcgm-exporter on an EKS 1.34 g6 node using Bottlerocket 1.54+.

Anything else we need to know?

  • All three images failed (distroless, ubi9 & ubuntu22.04).
  • All three images started normally on EKS 1.33 with Bottlerocket 1.51.
  • Nodes were started using Karpenter; the following userData was used:
              [settings.boot]
              reboot-to-reconcile = true

              [settings.boot.kernel-parameters]
              "psi" = ["1"]
              "nvidia.NVreg_EnableGpuFirmware" = ["1"]

              [settings.kubernetes.nvidia.container-runtime]
              "visible-devices-as-volume-mounts" = true
              "visible-devices-envvar-when-unprivileged" = true

              [settings.metrics]
              send-metrics = false

Labels: bug (Something isn't working)
