VEP #254: Guest GPU Metrics via VSOCK #255

Open
machadovilaca wants to merge 6 commits into kubevirt:main from machadovilaca:vep-254-guest-gpu-metrics-via-virtio-serial

machadovilaca (Member) commented Apr 10, 2026

VEP Metadata

Tracking issue: #254
SIG label: /sig observability /sig compute

What this PR does

GPU workloads running inside KubeVirt virtual machines currently lack observability. Cluster administrators and users have no way to monitor GPU utilization, memory usage, temperature, power consumption, or error counts for GPUs passed through to VMs.

This VEP introduces a mechanism for collecting GPU metrics from inside the guest and exposing them as Prometheus metrics on the host.

Special notes for your reviewer

Signed-off-by: machadovilaca <machadovilaca@gmail.com>
@kubevirt-bot kubevirt-bot added the dco-signoff: yes Indicates the PR's author has DCO signed all their commits. label Apr 10, 2026
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vladikr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alaypatel07
Contributor

/cc @rthallisey

@kubevirt-bot kubevirt-bot requested a review from rthallisey April 14, 2026 14:31
VSOCK (`AF_VSOCK`) provides socket-based communication between guest and host without virtio-serial.

**Rejected because:**
- VSOCK requires kernel support that is not universally available, especially on older guests and Windows.

Windows support is upcoming, and I would be surprised if Linux guests which are too old for vsock would utilize GPUs

Member Author

@enp0s3 said exactly the same at the KubeVirt VEP discussion meeting. If the general opinion is that "Windows support is upcoming" is fine, then I would be okay with it.

I actually tried VSOCK first and was able to get a PoC running, but ditched it because of Windows support. Since we already use virtio-serial, it seemed like a good approach.


Windows is already supported

Member Author

updated to use DCGM with VSOCK


**Rejected because:**
- VSOCK requires kernel support that is not universally available, especially on older guests and Windows.
- Virtio-serial is already used by KubeVirt for qemu-guest-agent and downward metrics, making it a proven transport.

@kostyanf14 (and maybe @jcanocan ) Do you agree?

jcanocan commented Apr 20, 2026

Virtio-serial is being used already by downward metrics and it works fine. However, I don't have any experience in Windows guests.

**Rejected because:**
- VSOCK requires kernel support that is not universally available, especially on older guests and Windows.
- Virtio-serial is already used by KubeVirt for qemu-guest-agent and downward metrics, making it a proven transport.
- Virtio-serial channels appear as simple character devices in the guest, making the agent trivial to implement on both Linux and Windows.

I would reject virtio-serial because we see some problems with huge data transfers: virtio-win/kvm-guest-drivers-windows#1462

Member Author

here the data sent should be small, i shared the structure in the VEP document

@machadovilaca
Member Author

@dominikholler @kostyanf14 added vsock as an alternative in the design section

@machadovilaca
Member Author

/cc @michalskrivanek

added guest-file-read as an alternative in the design section

Signed-off-by: machadovilaca <machadovilaca@gmail.com>
@machadovilaca machadovilaca force-pushed the vep-254-guest-gpu-metrics-via-virtio-serial branch from 02f5441 to f64ad5c Compare April 15, 2026 10:37
## Motivation

GPU passthrough and vGPU workloads are increasingly common in KubeVirt for AI/ML training, inference, and media processing. Host-level GPU
monitoring tools like NVIDIA DCGM exporter are not available in these configurations. The NVIDIA GPU Operator does not deploy this service

Should kubevirt be responsible for maintaining a gpu metric system? This seems generally useful for virtualization.

Member Author

KubeVirt wouldn't really be maintaining a GPU metric system. The heavy work would be done entirely by DCGM inside the guest, and KubeVirt would handle the communication and exposition of the data.

With that said, NVIDIA DCGM exporter is the de facto way to expose GPU metrics to Prometheus, so it would be ideal and simpler for users if it stayed that way. What I think that would mean:

1. KubeVirt creates a VSOCK for the VMI if any GPU device is configured
2. DCGM runs in the guest, listening on VSOCK
3. NVIDIA GPU Operator deploys DCGM exporter on nodes configured for vGPU and GPU passthrough
4. DCGM exporter connects to the VMIs via VSOCK to query DCGM data and expose metrics to Prometheus

I would then add some Prometheus recording rules to correlate DCGM exporter metrics with KubeVirt VMIs.
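
For illustration only (this is not part of the VEP): the correlation that such recording rules would perform is essentially a label join between DCGM exporter series and a per-VMI info series. The sketch below models that join in plain Python. The join labels `namespace` and `vmi` are hypothetical; which labels the two exporters would actually share is not settled in this thread. `DCGM_FI_DEV_GPU_UTIL` and `kubevirt_vmi_info` are used only as example series names.

```python
from typing import Dict, List, Sequence

# A sample is modelled as {"name": str, "labels": dict, "value": float}.
Sample = Dict[str, object]


def correlate(gpu_samples: List[Sample], vmi_info: List[Sample],
              join_keys: Sequence[str] = ("namespace", "vmi")) -> List[Sample]:
    """Attach VMI labels to every GPU sample that shares all join_keys.

    The default join_keys are an assumption made for this sketch.
    """
    # Index the VMI info series by the values of their join labels.
    index = {}
    for info in vmi_info:
        labels = info["labels"]
        if all(k in labels for k in join_keys):
            index[tuple(labels[k] for k in join_keys)] = labels

    joined = []
    for sample in gpu_samples:
        labels = sample["labels"]
        if not all(k in labels for k in join_keys):
            continue  # this GPU sample cannot be attributed to a VMI
        vmi_labels = index.get(tuple(labels[k] for k in join_keys))
        if vmi_labels is None:
            continue  # no VMI with matching join labels
        # GPU labels win on conflict so the original series identity is kept.
        joined.append({"name": sample["name"],
                       "labels": {**vmi_labels, **labels},
                       "value": sample["value"]})
    return joined


gpu = [{"name": "DCGM_FI_DEV_GPU_UTIL",
        "labels": {"namespace": "ml", "vmi": "trainer-0", "gpu": "0"},
        "value": 87.0}]
vmis = [{"name": "kubevirt_vmi_info",
         "labels": {"namespace": "ml", "vmi": "trainer-0", "node": "worker-1"},
         "value": 1.0}]
result = correlate(gpu, vmis)
```

In an actual deployment this join would more likely be written as a PromQL recording rule; the Python form only makes the matching semantics explicit.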


@machadovilaca I think that adding the VSOCK automatically would be a problematic approach. The correct way to configure VSOCK for a VM is via the KubeVirt API.


#### VSOCK

VSOCK (`AF_VSOCK`) is a socket address family for guest-host communication using the virtio-vsock transport. KubeVirt already has VSOCK

Using VSOCK can bypass the launcher and go right to the handler or another node-local exporter. I can see some advantages to that.


I do not understand the full meaning of this comment. I just want to highlight that it might be beneficial to avoid bypassing KubeVirt APIs, e.g. because of #223


The data path in the vsock approach is simpler, e.g. guest -> exporter.

Collect GPU metrics from the host using NVIDIA DCGM or the GPU node exporter.

**Rejected because:**
- The NVIDIA GPU Operator does not deploy DCGM exporter on nodes where GPUs are configured for passthrough or vGPU, because the host no

This is changing. The DCGM team is working on adding vsock support so we can gather metrics from inside guest and share it over vsock. Once implemented, it would make sense to always have DCGM exporter on the host.

Member Author

do you have any pointers for that work? I tried proposing the support for vsock on dcgm exporter and they suggested pursuing other approaches

NVIDIA/dcgm-exporter#649 (comment)


https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html#features

Host engine

  • Added support for listening on the VSOCK protocol.
  • Added support for the following fields:
      • DCGM_FI_DEV_GET_GPU_RECOVERY_ACTION
      • DCGM_FI_DEV_GPU_RECOVERY_ACTION
      • DCGM_FI_DEV_MEMORY_UNREPAIRABLE_FLAG
      • DCGM_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL
      • DCGM_FI_DEV_NVLINK_PPCNT_IBPC_PORT_XMIT_WAIT


Member Author

updated to use DCGM with VSOCK

Signed-off-by: machadovilaca <machadovilaca@gmail.com>
@machadovilaca machadovilaca changed the title from "VEP #254: Guest GPU Metrics via virtio-serial" to "VEP #254: Guest GPU Metrics via VSOCK" Apr 22, 2026
enp0s3 left a comment

@machadovilaca Hi. Actually I don't see any added value for the vGPU metrics to be maintained by the KubeVirt tree. It will couple DCGM metrics development with KV release cycle. I think that DCGM can manage with its own metrics exporter.

enp0s3 commented Apr 28, 2026

@machadovilaca @rthallisey To be more specific, I don't understand the drawbacks of the following approach:

  1. Aggregate VMI to vGPU metrics on a higher level using Prometheus PromQL
  2. Using the regular VM pod network, attach a Kubernetes service to the VM network and collect the metrics that way, instead of using VSOCK.

enp0s3 commented Apr 28, 2026

@rthallisey Another drawback of using VSOCK is that we are willing to graduate it to be namespace confined, which might impose obstacles for current DCGM design, leading to the need of escalating the privileges of DCGM to be able to access every network namespace on the node.

@machadovilaca
Member Author

> @machadovilaca @rthallisey To be more specific, I don't understand the drawbacks of the following approach:
>
>   1. Aggregate VMI to vGPU metrics on a higher level using Prometheus PromQL
>   2. Using the regular VM pod network, attach a Kubernetes service to the VM network and collect the metrics that way, instead of using VSOCK.

@enp0s3

  1. whether or not we decide to support this VSOCK DCGM approach, if the metrics Prometheus ingests are coming from DCGM directly, we would always need a way to correlate them to VMIs. And even if we end up not taking any action to simplify this metric collection, we can still provide recording rules for the correlations

  2. it is just a question of simplifying the work for the end user. the approach you are suggesting should even work right now, but for each VMI the user needs to: (1) configure the network for the vmi, (2) install dcgm in the vmi, (3) create the k8s service, (4) create a service monitor, (5) correlate gpu and vmi metrics

imo we should look for ways to simplify this

@rthallisey

> Another drawback of using VSOCK is that we are willing to graduate it to be namespace confined

Fair point. Perhaps though there's a way to solve global vsock with some security hardening. I'd like to at least explore that option.

The two use cases I see are:

  1. Advertise kubevirt specific gpu metrics that tie vmi to gpu
  2. Advertise all gpu metrics from a passthrough GPU to the dcgm-exporter

Use case 1 is meant for kubevirt admins. Use case 2 can tie into the existing GPU tooling ecosystem.

enp0s3 commented Apr 28, 2026

> > @machadovilaca @rthallisey To be more specific, I don't understand the drawbacks of the following approach:
> >
> >   1. Aggregate VMI to vGPU metrics on a higher level using Prometheus PromQL
> >   2. Using the regular VM pod network, attach a Kubernetes service to the VM network and collect the metrics that way, instead of using VSOCK.
>
> @enp0s3
>
>   1. whether or not we decide to support this VSOCK DCGM approach, if the metrics Prometheus ingests are coming from DCGM directly, we would always need a way to correlate them to VMIs. And even if we end up not doing any action to simplify this metric collection, we can still provide recording rules for the correlations

+1 for recording rules, IMO it's much simpler than the current approach.

>   2. it is just a question of simplifying the work for the end user, the approach you are suggesting should even work right now. But for each VMI the user now needs to: 1. configure network for the vmi, 2. install dcgm in the vmi, 3. create the k8s service, 4. create a service monitor, 5. correlate gpu and vmi metrics
>
> imo we should look for ways to simplify this

This can be simplified using tools like Helm.

enp0s3 left a comment

I would prefer this VEP to eventually converge with VEP 143. If we go with the VSOCK approach, I would like the code to live in the separate monitoring stack.


- Expose per-VM, per-GPU utilization metrics as Prometheus metrics from virt-handler.
- Support both GPU passthrough and vGPU devices.
- Support Linux and Windows guests.

How are we going to support Windows? Do we have a way to test this?


- https://github.com/kubevirt/kubevirt

## Design

We need to mention that VSOCK has to be requested in the VM spec

Member Author

my idea was that if the feature gate is enabled and GPUs are present in the spec, virt-controller would automatically attach a VSOCK device to the domain XML


@machadovilaca But it's only relevant for NVIDIA GPUs, isn't it?

Member Author

initially my idea was to keep this generic enough to handle other gpus in the future
but with some of the work moving to the dcgm exporter supporting vsock, and nvidia operator deploying the exporter on all use cases, it is now more focused on nvidia, yes


IMO it's suboptimal to always attach VSOCK upon GPU resource request.

Member Author

i think many times if a gpu is attached, dcgm will be installed in the guest alongside the drivers, so the dcgm exporter would be able to collect metrics with no additional user operation beyond what they do today

but i can change that. what do you think is better: using the existing attach vsock field, or creating a new one specific to gpu metrics that, if enabled, would create the vsock?


@machadovilaca I would leave the responsibility to attach the VSOCK for the user. In case the VSOCK isn't attached via the VM spec we won't collect the metrics.

Member Author

ok

then with the new dcgm exporter support for vsock
we should be able to just add some documentation for the users to enable vsock on the spec, install dcgm in the guest, and start the dcgm service listening on the vsock

now, nvidia needs to update nvidia gpu operator to deploy the dcgm exporter on nodes configured for vgpu or gpu passthrough, which it currently doesn't

on kubevirt side, we would only need to add the recording rules that correlate the dcgm exporter metrics with the vmis
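
As a rough sketch of what "connecting to the guest over vsock" looks like from the host side (illustration only, not the DCGM exporter's implementation): Python's standard `socket` module exposes `AF_VSOCK` on Linux, and a peer is addressed by a `(cid, port)` pair. The port value 5555 and the helper names below are assumptions made for this example; the port DCGM actually listens on over VSOCK should be taken from the DCGM documentation.

```python
import socket

# Assumption for this sketch: a placeholder port for the guest-side listener.
DCGM_PORT = 5555


def vsock_address(cid: int, port: int = DCGM_PORT) -> tuple:
    """Validate and build an AF_VSOCK address tuple for a guest.

    CIDs 0-2 are reserved (hypervisor, local loopback, host), so a guest
    context ID is expected to be 3 or greater.
    """
    if cid < 3 or cid > 0xFFFFFFFF:
        raise ValueError(f"not a guest CID: {cid}")
    if not 0 <= port <= 0xFFFFFFFF:
        raise ValueError(f"port out of range: {port}")
    return (cid, port)


def can_reach_guest(cid: int, port: int = DCGM_PORT, timeout: float = 1.0) -> bool:
    """Return True if a stream connection to the guest listener succeeds."""
    addr = vsock_address(cid, port)
    if not hasattr(socket, "AF_VSOCK"):
        return False  # platform without VSOCK support in the socket module
    try:
        with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(addr)
            return True
    except OSError:
        # Covers a missing vsock kernel module, no listener, or a timeout.
        return False
```

A host-side collector would call something like `can_reach_guest(cid)` with the CID assigned to the VMI before attempting to query metrics; how that CID is discovered is outside this sketch.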

deviceName: nvidia.com/A100
```

## Alternatives

@machadovilaca Can you please add in the alternative why inline communication method using regular kubernetes network was rejected? What are the drawbacks of deploying additional kubernetes resources in order to expose the metric service that is running inside the guest?

In addition what are the drawbacks of using recording rules or any other high level tools to tie VM to GPU? Why explicit instrumentation is needed to tie these resources?

Member Author

added DCGM via regular Kubernetes Networking alternative

recording rules are not an issue, even if we expose VSOCK for DCGM exporter to collect the DCGM metrics, we would need the recording rules

enp0s3 commented Apr 29, 2026

@machadovilaca I would leave the responsibility to attach the VSOCK for the user. In case the VSOCK isn't attached via the VM spec we won't collect the metrics. sorry confused with another thread.

Signed-off-by: machadovilaca <machadovilaca@gmail.com>
Signed-off-by: machadovilaca <machadovilaca@gmail.com>
enp0s3 left a comment

/lgtm

@machadovilaca Thank you!

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2026
enp0s3 commented Apr 30, 2026

@vladikr Hi, could you please have a look?

enp0s3 commented Apr 30, 2026

@machadovilaca Three more topics we should converge:

  • Windows support, can we commit to it?
  • Combined consumption of VSOCK, both by the metric collector and by KubeVirt API. We should mention the limitations.
  • The trigger to collect the metrics: there can be a non-NVIDIA GPU with an attached VSOCK, how would we distinguish that case?

vladikr (Member) commented Apr 30, 2026

@machadovilaca @enp0s3 To be honest, personally, I didn't have the capacity to review this proposal deep enough to understand whether it fits KubeVirt or not ...
I am not able to give my approval for this cycle.
Let's defer it to the next one, which will give us enough time to have a proper discussion.

… to Linux

Signed-off-by: machadovilaca <machadovilaca@gmail.com>
@kubevirt-bot kubevirt-bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2026
@kubevirt-bot

New changes are detected. LGTM label has been removed.

enp0s3 commented Apr 30, 2026

@vladikr Makes sense. Sorry for the noise. It looks like we need more time.


This VEP introduces a mechanism for collecting GPU metrics from inside the guest and exposing them as Prometheus metrics on the host. NVIDIA
DCGM (Data Center GPU Manager) 4.5.0 added native support for listening on the VSOCK protocol, enabling direct guest-to-host communication
without a custom guest agent. virt-launcher connects to DCGM inside the guest via VSOCK and exposes the GPU metrics data through the unified

DCGM-exporter running on the host will poll for metrics over the socket. DCGM should be running in the guest and listening on the socket. Having dcgm-exporter or dcgm connect to the launcher over vsock doesn't sound right to me.

Labels

dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/L

9 participants