What is the version?
4.4.1-4.6.0
What happened?
I have a GKE cluster running DCGM Exporter version 4.4.1-4.6.0. I see the following error log:
msg="Got unexpected return -14 from m_gpmManager.GetLatestSample [/workspaces/dcgm-exporter/dcgmlib/src/DcgmCacheManager.cpp:13088] [DcgmCacheManager::BufferOrCacheLatestGpuValue]" dcgm_level=ERROR"
The error appears on G4 nodes that run fractional vGPU shapes (g4-standard-24, g4-standard-12, g4-standard-6). The exporter works as expected on G4 machines with one or more full GPUs, i.e. g4-standard-48 and larger.
(G4 machine: https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#g4-series)
Only nodes with a full GPU export the metrics.
What did you expect to happen?
DCGM metrics are also exported from G4 nodes with fractional vGPU shapes (< 1 GPU).
What is the GPU model?
nvidia-rtx-pro-6000
What is the environment?
DCGM-Exporter runs on a GKE cluster as a DaemonSet.
How did you deploy the dcgm-exporter and what is the configuration?
It runs as a DaemonSet. Part of the setup:
containers:
- args:
  - --enable-dcgm-log
  - --dcgm-log-level
  - ERROR
  env:
  - name: DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE
    value: device-name
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"
  - name: DISABLE_STARTUP_VALIDATE
    value: "true"
How to reproduce the issue?
No response
Anything else we need to know?
Do I need to change the DCGM-Exporter configuration to make this work, or is DCGM-Exporter simply not supported on nodes with a fractional nvidia-rtx-pro-6000?
I also tried 4.5.2-4.8.1 and still see the same error.
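In case it helps with comparing a fractional-vGPU node against a full-GPU node, here is a minimal sketch of how the exporter output could be checked. It assumes the exporter serves Prometheus text on its default `:9400/metrics` endpoint and that DCGM metric names start with `DCGM_`. The helper names `count_dcgm_samples` and `fetch_metrics` are made up for illustration and are not part of DCGM-Exporter.

```python
import urllib.request


def count_dcgm_samples(metrics_text: str) -> int:
    """Count sample lines (not # HELP / # TYPE comments) whose name starts with DCGM_."""
    count = 0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and Prometheus comment lines
        if line.startswith("DCGM_"):
            count += 1
    return count


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Fetch the raw Prometheus text from an exporter endpoint (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Example usage (hypothetical address, e.g. after port-forwarding the exporter pod):
#   text = fetch_metrics("http://localhost:9400/metrics")
#   print(count_dcgm_samples(text))
```

A count of zero on the fractional-vGPU node and a non-zero count on the g4-standard-48 node would match what I am seeing.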