DCGM Exporter doesn't work with fractional vGPU shapes (nvidia-rtx-pro-6000) #661

@JeffLuoo

What is the version?

4.4.1-4.6.0

What happened?

I have a GKE cluster running DCGM Exporter version 4.4.1-4.6.0. I see the following error log:

msg="Got unexpected return -14 from m_gpmManager.GetLatestSample [/workspaces/dcgm-exporter/dcgmlib/src/DcgmCacheManager.cpp:13088] [DcgmCacheManager::BufferOrCacheLatestGpuValue]" dcgm_level=ERROR"

This appears on G4 nodes that run fractional vGPU shapes (g4-standard-24, g4-standard-12, g4-standard-6). The exporter works as expected on G4 machines with at least one full GPU, i.e. g4-standard-48 and larger.

(G4 machine: https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#g4-series)
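
To compare how often the error above shows up on fractional versus full-GPU nodes, the exporter logs can be tallied by return code. The following is a minimal sketch, not part of DCGM-Exporter: the regular expression is inferred from the single log line quoted above, and `tally_dcgm_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the one sample line in this issue (an assumption,
# not a documented log format): "Got unexpected return <code> from <call>".
PATTERN = re.compile(r"Got unexpected return (-?\d+) from (\S+)")


def tally_dcgm_errors(lines):
    """Count (return_code, call) pairs found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        'msg="Got unexpected return -14 from m_gpmManager.GetLatestSample '
        "[/workspaces/dcgm-exporter/dcgmlib/src/DcgmCacheManager.cpp:13088] "
        '[DcgmCacheManager::BufferOrCacheLatestGpuValue]" dcgm_level=ERROR'
    )
    for (code, call), n in tally_dcgm_errors([sample]).items():
        print(code, call, n)
```

Feeding it the saved log output of one exporter pod per node type gives a quick per-node comparison.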

Only the node with a full GPU reports the metric:

Image

What did you expect to happen?

I expect to see DCGM metrics exported from G4 nodes with fractional vGPU shapes (< 1 GPU).
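
One way to check whether a given node exports any DCGM metrics is to read the exporter's Prometheus endpoint directly and list the metric names it returns. The sketch below assumes the exporter is reachable at `http://localhost:9400/metrics` (for example through a port-forward to the pod on that node). The URL and the helper names `dcgm_metric_names` and `fetch_metrics` are assumptions for illustration.

```python
import urllib.request


def dcgm_metric_names(metrics_text):
    """Return the set of metric names starting with DCGM_ in Prometheus text output."""
    names = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "NAME{labels} VALUE" or "NAME VALUE".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("DCGM_"):
            names.add(name)
    return names


def fetch_metrics(url="http://localhost:9400/metrics", timeout=5):
    """Fetch the raw metrics text; the default URL is an assumed local port-forward."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example usage (requires a reachable exporter):
#   names = dcgm_metric_names(fetch_metrics())
#   print(len(names), sorted(names))
```

An empty set on a fractional-vGPU node alongside a non-empty set on a full-GPU node would match the behavior described in this issue.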

What is the GPU model?

nvidia-rtx-pro-6000

What is the environment?

DCGM-Exporter runs on a GKE cluster as a DaemonSet.

How did you deploy the dcgm-exporter and what is the configuration?

It runs as a DaemonSet. Part of the setup:

      containers:
      - args:
        - --enable-dcgm-log
        - --dcgm-log-level
        - ERROR
        env:
        - name: DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE
          value: device-name
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DISABLE_STARTUP_VALIDATE
          value: "true"

How to reproduce the issue?

No response

Anything else we need to know?

I wonder whether I need to change the DCGM-Exporter configuration to make this work, or whether it simply doesn't work on nodes with a partial nvidia-rtx-pro-6000.

I also tried 4.5.2-4.8.1 but still see the same error.

Labels: bug