VEP #254: Guest GPU Metrics via VSOCK #255
Signed-off-by: machadovilaca <machadovilaca@gmail.com>
|
/cc @rthallisey |
| VSOCK (`AF_VSOCK`) provides socket-based communication between guest and host without virtio-serial. | ||
|
|
||
| **Rejected because:** | ||
| - VSOCK requires kernel support that is not universally available, especially on older guests and Windows. |
There was a problem hiding this comment.
Windows support is upcoming, and I would be surprised if Linux guests which are too old for vsock would utilize GPUs
There was a problem hiding this comment.
@enp0s3 said exactly the same at the KubeVirt VEP discussion meeting. If the general opinion is that "Windows support is upcoming" is fine, then I would be okay with it.
I actually tried VSOCK first and was able to get a PoC running, but ditched it because of the Windows support,
and since we already used virtio-serial, it seemed like a good approach.
There was a problem hiding this comment.
updated to use DCGM with VSOCK
|
|
||
| **Rejected because:** | ||
| - VSOCK requires kernel support that is not universally available, especially on older guests and Windows. | ||
| - Virtio-serial is already used by KubeVirt for qemu-guest-agent and downward metrics, making it a proven transport. |
There was a problem hiding this comment.
@kostyanf14 (and maybe @jcanocan ) Do you agree?
There was a problem hiding this comment.
Virtio-serial is being used already by downward metrics and it works fine. However, I don't have any experience in Windows guests.
| **Rejected because:** | ||
| - VSOCK requires kernel support that is not universally available, especially on older guests and Windows. | ||
| - Virtio-serial is already used by KubeVirt for qemu-guest-agent and downward metrics, making it a proven transport. | ||
| - Virtio-serial channels appear as simple character devices in the guest, making the agent trivial to implement on both Linux and Windows. |
There was a problem hiding this comment.
I would reject virtio-serial, because we see some problems with large data transfers: virtio-win/kvm-guest-drivers-windows#1462
There was a problem hiding this comment.
Here the data sent should be small; I shared the structure in the VEP document.
|
@dominikholler @kostyanf14 added vsock as an alternative in the design section |
|
/cc @michalskrivanek added guest-file-read as an alternative in the design section |
Signed-off-by: machadovilaca <machadovilaca@gmail.com>
| ## Motivation | ||
|
|
||
| GPU passthrough and vGPU workloads are increasingly common in KubeVirt for AI/ML training, inference, and media processing. Host-level GPU | ||
| monitoring tools like NVIDIA DCGM exporter are not available in these configurations. The NVIDIA GPU Operator does not deploy this service |
There was a problem hiding this comment.
Should kubevirt be responsible for maintaining a gpu metric system? This seems generally useful for virtualization.
There was a problem hiding this comment.
KubeVirt wouldn't really be maintaining a GPU metric system. The heavy work would be done entirely by DCGM inside the guest, and KubeVirt would handle the communication and exposition of the data.
With that said, NVIDIA DCGM exporter is the de facto way to expose GPU metrics to Prometheus, so it would be ideal and simpler for users if it stayed that way. What I think that would mean:
1- KubeVirt creates a VSOCK for the VMI if any GPU device is configured
2- DCGM runs in the guest, listening on the VSOCK
3- NVIDIA GPU Operator deploys DCGM exporter on nodes configured for vGPU and GPU passthrough
4- DCGM exporter connects to the VMIs via VSOCK to query DCGM data and expose metrics to Prometheus
I would then add some Prometheus recording rules to correlate DCGM exporter metrics with KubeVirt VMIs.
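As a rough illustration of the recording-rule idea in the comment above, the correlation could look like the sketch below. It assumes the Prometheus Operator `PrometheusRule` CRD is installed and that the DCGM exporter series carry `namespace` and `name` labels identifying the VMI. That labeling is an assumption for illustration, not something the exporter is confirmed to provide. The rule name, group name, and recorded metric name are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubevirt-gpu-correlation        # hypothetical name
  namespace: kubevirt
spec:
  groups:
    - name: kubevirt.gpu.rules          # hypothetical group name
      rules:
        # Hypothetical recorded metric: per-VMI, per-GPU utilization.
        - record: kubevirt_vmi_gpu_utilization
          # Assumes DCGM_FI_DEV_GPU_UTIL is labeled with the VMI's
          # namespace and name (an assumption, see the note above).
          # Many GPU series on the left match one VMI series on the right.
          expr: |
            DCGM_FI_DEV_GPU_UTIL
              * on (namespace, name) group_left (node)
            kubevirt_vmi_info
```

Assuming `kubevirt_vmi_info` is a constant-1 info metric, the multiplication keeps the DCGM value, and `group_left (node)` only copies the `node` label from the VMI series onto each GPU series.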
There was a problem hiding this comment.
@machadovilaca I think that adding the VSOCK automatically would be a problematic approach; the correct way to configure VSOCK for a VM is via the KubeVirt API.
|
|
||
| #### VSOCK | ||
|
|
||
| VSOCK (`AF_VSOCK`) is a socket address family for guest-host communication using the virtio-vsock transport. KubeVirt already has VSOCK |
There was a problem hiding this comment.
Using VSOCK can bypass the launcher and go right to the handler or another node-local exporter. I can see some advantages to that.
There was a problem hiding this comment.
I do not understand the full meaning of this comment. I just want to highlight that it might be beneficial to avoid bypassing KubeVirt APIs, e.g. because of #223
There was a problem hiding this comment.
The data path in the vsock approach is simpler, e.g. guest -> exporter.
| Collect GPU metrics from the host using NVIDIA DCGM or the GPU node exporter. | ||
|
|
||
| **Rejected because:** | ||
| - The NVIDIA GPU Operator does not deploy DCGM exporter on nodes where GPUs are configured for passthrough or vGPU, because the host no |
There was a problem hiding this comment.
This is changing. The DCGM team is working on adding vsock support so we can gather metrics from inside guest and share it over vsock. Once implemented, it would make sense to always have DCGM exporter on the host.
There was a problem hiding this comment.
do you have any pointers for that work? I tried proposing the support for vsock on dcgm exporter and they suggested pursuing other approaches
There was a problem hiding this comment.
https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html#features
Host engine
Added support for listening on the VSOCK protocol.
Added support for the following fields:
DCGM_FI_DEV_GET_GPU_RECOVERY_ACTION
DCGM_FI_DEV_GPU_RECOVERY_ACTION
DCGM_FI_DEV_MEMORY_UNREPAIRABLE_FLAG
DCGM_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_PPCNT_IBPC_PORT_XMIT_WAIT
There was a problem hiding this comment.
updated to use DCGM with VSOCK
Signed-off-by: machadovilaca <machadovilaca@gmail.com>
enp0s3
left a comment
There was a problem hiding this comment.
@machadovilaca Hi. Actually I don't see any added value for the vGPU metrics to be maintained by the KubeVirt tree. It will couple DCGM metrics development with KV release cycle. I think that DCGM can manage with its own metrics exporter.
|
@machadovilaca @rthallisey To be more specific, I don't understand the drawbacks of the following approach:
|
|
@rthallisey Another drawback of using VSOCK is that we are willing to graduate it to be namespace confined, which might impose obstacles for current DCGM design, leading to the need of escalating the privileges of DCGM to be able to access every network namespace on the node. |
imo we should look for ways to simplify this |
Fair point. Perhaps though there's a way to solve The two use cases I see are:
Use case 1 is meant for kubevirt admins. Use case 2 can tie into the existing GPU tooling ecosystem. |
+1 for recording rules, IMO it's much simpler than the current approach.
this can be simplified using tools like Helm. |
enp0s3
left a comment
There was a problem hiding this comment.
I would prefer this VEP to eventually converge with VEP 143. If we go with the VSOCK approach, I would like the code to live in a separate monitoring stack.
|
|
||
| - Expose per-VM, per-GPU utilization metrics as Prometheus metrics from virt-handler. | ||
| - Support both GPU passthrough and vGPU devices. | ||
| - Support Linux and Windows guests. |
There was a problem hiding this comment.
How are we going to support Windows? Do we have a way to test this?
|
|
||
| - https://github.com/kubevirt/kubevirt | ||
|
|
||
| ## Design |
There was a problem hiding this comment.
We need to mention that VSOCK had to be requested in the VM spec
There was a problem hiding this comment.
my idea was that if the feature gate is enabled and GPUs are present in the spec, virt-controller would automatically attach a VSOCK device to the domain XML
There was a problem hiding this comment.
@machadovilaca But it's only relevant for NVIDIA GPUs, isn't it?
There was a problem hiding this comment.
Initially my idea was to keep this generic enough to handle other GPUs in the future,
but with some of the work moving to the DCGM exporter supporting VSOCK, and the NVIDIA operator deploying the exporter in all use cases, it is now more focused on NVIDIA, yes.
There was a problem hiding this comment.
IMO it's suboptimal to always attach VSOCK upon GPU resource request.
There was a problem hiding this comment.
I think that often, if a GPU is attached, DCGM will be installed in the guest alongside the drivers, so the DCGM exporter would be able to collect metrics with no additional user action beyond what they do today.
But I can change that. What do you think is better: using the existing attach-VSOCK field, or creating a new one specific to GPU metrics that, if enabled, would create the VSOCK?
There was a problem hiding this comment.
@machadovilaca I would leave the responsibility to attach the VSOCK for the user. In case the VSOCK isn't attached via the VM spec we won't collect the metrics.
There was a problem hiding this comment.
ok
Then, with the new DCGM exporter support for VSOCK,
we should be able to just add some documentation for users to enable VSOCK in the spec, install DCGM in the guest, and start the DCGM service listening on the VSOCK.
Now, NVIDIA needs to update the NVIDIA GPU Operator to deploy the DCGM exporter on nodes configured for vGPU or GPU passthrough, which it currently doesn't.
On the KubeVirt side, we would only need to add the recording rules that correlate the DCGM exporter metrics with the VMIs.
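A minimal sketch of what that user-facing documentation might show, assuming the existing `autoattachVSOCK` field (behind the `VSOCK` feature gate) is the mechanism users opt in with. The VMI name, GPU entry name, and memory request are hypothetical; `nvidia.com/A100` is taken from the example in the VEP. Disks, volumes, and other fields are omitted.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: gpu-vmi                  # hypothetical name
spec:
  domain:
    devices:
      # The user explicitly requests a VSOCK device; nothing is attached automatically.
      autoattachVSOCK: true
      gpus:
        - name: gpu1             # hypothetical entry name
          deviceName: nvidia.com/A100
    resources:
      requests:
        memory: 4Gi              # hypothetical size
```

Installing DCGM in the guest and starting it with its VSOCK listener is a separate, guest-side step that this fragment does not cover.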
| deviceName: nvidia.com/A100 | ||
| ``` | ||
|
|
||
| ## Alternatives |
There was a problem hiding this comment.
@machadovilaca Can you please add to the alternatives why an inline communication method using regular Kubernetes networking was rejected? What are the drawbacks of deploying additional Kubernetes resources in order to expose the metrics service that is running inside the guest?
In addition, what are the drawbacks of using recording rules or other high-level tools to tie a VM to a GPU? Why is explicit instrumentation needed to tie these resources together?
There was a problem hiding this comment.
added DCGM via regular Kubernetes Networking alternative
recording rules are not an issue, even if we expose VSOCK for DCGM exporter to collect the DCGM metrics, we would need the recording rules
There was a problem hiding this comment.
@machadovilaca I would leave the responsibility to attach the VSOCK for the user. In case the VSOCK isn't attached via the VM spec we won't collect the metrics. sorry confused with another thread.
Signed-off-by: machadovilaca <machadovilaca@gmail.com>
Signed-off-by: machadovilaca <machadovilaca@gmail.com>
|
@vladikr Hi, could you please have a look? |
|
@machadovilaca Three more topics we should converge:
|
|
@machadovilaca @enp0s3 To be honest, personally, I didn't have the capacity to review this proposal deep enough to understand whether it fits KubeVirt or not ... |
… to Linux Signed-off-by: machadovilaca <machadovilaca@gmail.com>
|
@vladikr Makes sense. Sorry for the noise. It looks like we need more time. |
|
|
||
| This VEP introduces a mechanism for collecting GPU metrics from inside the guest and exposing them as Prometheus metrics on the host. NVIDIA | ||
| DCGM (Data Center GPU Manager) 4.5.0 added native support for listening on the VSOCK protocol, enabling direct guest-to-host communication | ||
| without a custom guest agent. virt-launcher connects to DCGM inside the guest via VSOCK and exposes the GPU metrics data through the unified |
There was a problem hiding this comment.
The DCGM exporter running on the host will poll for metrics over the socket. DCGM should be running in the guest and listening on the socket. Having dcgm-exporter or DCGM connect to the launcher over VSOCK doesn't sound right to me.
VEP Metadata
Tracking issue: #254
SIG label: /sig observability /sig compute
What this PR does
GPU workloads running inside KubeVirt virtual machines currently lack observability. Cluster administrators and users have no way to monitor GPU utilization, memory usage, temperature, power consumption, or error counts for GPUs passed through to VMs.
This VEP introduces a mechanism for collecting GPU metrics from inside the guest and exposing them as Prometheus metrics on the host.
Special notes for your reviewer