From cf731e516514a97f8925ba571d8c033230b3d30d Mon Sep 17 00:00:00 2001 From: machadovilaca Date: Fri, 10 Apr 2026 11:13:29 +0100 Subject: [PATCH 1/6] VEP #254: Guest GPU Metrics via virtio-serial Signed-off-by: machadovilaca --- .../gpu-metrics-via-virtio-serial/vep.md | 312 ++++++++++++++++++ 1 file changed, 312 insertions(+) create mode 100644 veps/sig-observability/gpu-metrics-via-virtio-serial/vep.md diff --git a/veps/sig-observability/gpu-metrics-via-virtio-serial/vep.md b/veps/sig-observability/gpu-metrics-via-virtio-serial/vep.md new file mode 100644 index 00000000..4c5e9fe4 --- /dev/null +++ b/veps/sig-observability/gpu-metrics-via-virtio-serial/vep.md @@ -0,0 +1,312 @@ +# VEP #254: Guest GPU Metrics via virtio-serial + +## VEP Status Metadata + +### Target releases + +- This VEP targets alpha for version: +- This VEP targets beta for version: +- This VEP targets GA for version: + +### Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue created, which links to VEP dir in [kubevirt/enhancements] +- [ ] (R) Alpha target version is explicitly mentioned and approved +- [ ] (R) Beta target version is explicitly mentioned and approved +- [ ] (R) GA target version is explicitly mentioned and approved + +## Overview + +GPU workloads running inside KubeVirt virtual machines currently lack observability. Cluster administrators and users have no way to monitor +GPU utilization, memory usage, temperature, power consumption, or error counts for GPUs passed through to VMs. + +This VEP introduces a mechanism for collecting GPU metrics from inside the guest and exposing them as Prometheus metrics on the host. A +lightweight guest agent communicates with the host via a virtio-serial channel, and virt-handler scrapes the agent on each Prometheus +collection cycle to produce `kubevirt_vmi_gpu_*` metrics. 
+ +## Motivation + +GPU passthrough and vGPU workloads are increasingly common in KubeVirt for AI/ML training, inference, and media processing. Host-level GPU +monitoring tools like NVIDIA DCGM exporter are not available in these configurations. The NVIDIA GPU Operator does not deploy this service +on nodes where GPUs are configured for passthrough or vGPU, because the host no longer has direct access to the device. This leaves GPU +workloads inside VMs completely unmonitored. + +By collecting metrics from inside the guest via NVML and forwarding them to the host over virtio-serial, KubeVirt can provide per-VM, +per-GPU observability that is consistent with the existing `kubevirt_vmi_*` metrics namespace, enabling unified dashboards and alerting. + +## Goals + +- Expose per-VM, per-GPU utilization metrics as Prometheus metrics from virt-handler. +- Support both GPU passthrough and vGPU devices. +- Support Linux and Windows guests. +- Use virtio-serial as the transport, avoiding network dependencies inside the guest. +- Keep the guest agent lightweight, stateless, and easy to install. + +## Non Goals + +- Managing GPU drivers or NVML installation inside the guest. +- Supporting non-NVIDIA GPUs (AMD, Intel) in the initial implementation. The protocol is vendor-agnostic, but the first agent implementation +uses NVML. +- Alerting rules or Grafana dashboards (these can be added separately). +- Collecting GPU metrics from the host side (e.g., via DCGM). + +## Definition of Users + +- **Cluster administrators** who need to monitor GPU utilization across VMs for capacity planning, cost allocation, and health monitoring. +- **VM users** running GPU workloads who want to see GPU metrics alongside other VM metrics in existing monitoring infrastructure. +- **Platform teams** building autoscaling or scheduling decisions based on GPU utilization. 
+ +## User Stories + +### GPU Utilization Monitoring +As a cluster administrator, I want to see GPU utilization, memory usage, and temperature for each VM so I can identify underutilized or +overheating GPUs and take action. + +### Capacity Planning +As a platform engineer, I want per-VM GPU metrics in Prometheus so I can build dashboards showing GPU utilization trends across the cluster +and plan capacity. + +### Error Detection +As an operations engineer, I want to be alerted when a GPU inside a VM reports ECC errors so I can proactively migrate the workload before +hardware failure. + +## Repos + +- https://github.com/kubevirt/kubevirt (virtio-serial channel, virt-handler collector) +- https://github.com/kubevirt/gpu-metrics-agent (guest-side agent, separate repo) + +## Design + +The design has four components spanning three processes: + +``` +Guest VM (QEMU) virt-launcher virt-handler ++--------------------------+ +-------------------------------+ +---------------------------+ +| | | | | | +| gpu-metrics-agent | | QEMU | | domainstats scraper | +| | | virtio-serial backend | | | +| - collects NVML metrics | | gpu-metrics.sock (UNIX) | | - GetDomainStats() | +| - responds with JSON | <====> | | | - GetFilesystems() | +| | | DomainManager | | - GetGPUMetrics() | +| /dev/virtio-ports/ | | gpuMetricsCache | | | +| org.kubevirt. | | TimeDefinedCache (3.25s) | | resourceMetrics: | +| gpu-metrics.0 | | scrapeGPUMetrics() | | gpuMetrics.Collect() | +| | | connect → GET\n → JSON | | → kubevirt_vmi_gpu_* | ++--------------------------+ | | | | + | cmd-server (gRPC) | | cmd-client (gRPC) | + | GetGPUMetrics() RPC | <-- | GetGPUMetrics() | + +-------------------------------+ +---------------------------+ +``` + +### 1. Virtio-Serial Channel (virt-launcher) + +When a VMI has GPU devices configured (`spec.domain.devices.gpus`), the domain converter adds a virtio-serial channel to the libvirt domain +XML. 
The virt-launcher process creates the socket directory (`/var/run/kubevirt-private/gpu-metrics-channel/`) during initialization, before +the domain is started. libvirt/QEMU then binds the UNIX socket at that path. + +This produces: +- **Host side (virt-launcher pod)**: UNIX socket at `/var/run/kubevirt-private/gpu-metrics-channel/gpu-metrics.sock` +- **Guest side**: character device at `/dev/virtio-ports/org.kubevirt.gpu-metrics.0` (Linux) or named pipe +`\\.\Global\org.kubevirt.gpu-metrics.0` (Windows) + +### 2. Guest Agent (gpu-metrics-agent) + +A standalone Go binary that runs inside the guest as a systemd service (Linux) or Windows service. On startup it: +1. Opens the virtio-serial device. +2. Initializes NVML (gracefully handles failure and remains running with error responses). +3. Enters a read loop waiting for request lines from the host. +4. On each request, collects GPU metrics via NVML and writes a JSON response. + +The request/response protocol is newline-delimited: +- **Request**: any text line ending with `\n` (e.g., `GET\n`) +- **Response**: a single JSON object followed by `\n` + +The agent handles host disconnects (EOF on the char device when the host closes the socket) by re-entering the read loop on the same file +descriptor, matching virtio-serial reconnection semantics. 
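
The newline-delimited exchange described above can be sketched as follows. This is a minimal illustration, not the agent's or virt-launcher's actual code (both are Go in the proposal). It assumes a connected socket pair standing in for the virtio-serial channel, and `serve_requests`, `scrape_once`, and `fake_collect` are hypothetical names with made-up values.

```python
import json
import socket
import threading


def serve_requests(conn, collect):
    """Toy agent loop: answer every request line with one JSON line."""
    reader = conn.makefile("rb")
    for _line in reader:  # any newline-terminated line counts as a request
        payload = json.dumps(collect()).encode("utf-8") + b"\n"
        conn.sendall(payload)
    # Reaching this point means EOF: the host closed its end of the channel.


def scrape_once(conn, timeout=5.0):
    """Host side: send GET\\n, read until the newline, decode the JSON object."""
    conn.settimeout(timeout)  # a slow agent must not block the scrape
    conn.sendall(b"GET\n")
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = conn.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before a full response line")
        buf += chunk
    return json.loads(buf)


def fake_collect():
    # Hypothetical stand-in for NVML collection; the values are invented.
    return {"version": "1.0.0", "error": None,
            "devices": [{"index": 0, "gpuUtilizationPercent": 75}]}


# A socket pair plays the role of the host socket and the guest device.
host_end, guest_end = socket.socketpair()
agent = threading.Thread(target=serve_requests,
                         args=(guest_end, fake_collect), daemon=True)
agent.start()

response = scrape_once(host_end)
print(response["devices"][0]["gpuUtilizationPercent"])
host_end.close()
```

Framing each response as one JSON object per line lets the host detect the end of a message without a length prefix, provided the encoder never emits a raw newline inside the object.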
+ +Response schema: +```json +{ + "version": "1.0.0", + "error": {"code": 12, "message": "ERROR_LIBRARY_NOT_FOUND"}, + "devices": [ + { + "index": 0, + "uuid": "GPU-abc-123", + "name": "Tesla T4", + "gpuUtilizationPercent": 75, + "memoryUtilizationPercent": 38, + "memoryUsedBytes": 4294967296, + "memoryTotalBytes": 17179869184, + "temperatureCelsius": 54, + "powerUsageMilliwatts": 121180, + "powerLimitMilliwatts": 250000, + "eccErrorsSingleBit": 0, + "eccErrorsDoubleBit": 0, + "encoderUtilizationPercent": 15, + "decoderUtilizationPercent": 5, + "runningProcesses": 3, + "pcieTxBytesPerSecond": 102400, + "pcieRxBytesPerSecond": 204800 + } + ] +} +``` + +If NVML is unavailable, `error` is populated and `devices` is empty. The agent remains running. + +### 3. virt-launcher: DomainManager and gRPC + +Two layers handle GPU metrics on the virt-launcher side: + +**DomainManager (`LibvirtDomainManager`)**: A `gpuMetricsCache` (`TimeDefinedCache[string]`, 3250ms TTL) caches the raw JSON from the guest +agent. The recalculation function (`scrapeGPUMetrics`) connects to the local virtio-serial UNIX socket, sends `GET\n`, and reads the JSON +response. This follows the same caching pattern as `domainStatsCache` for domain stats. + +**cmd-server (`GetGPUMetrics` RPC)**: A new `GetGPUMetrics` method on the `Cmd` gRPC service delegates to `DomainManager.GetGPUMetrics()`, +which returns the cached value. This follows the same pattern as `GetDomainStats`, keeping the cmd-server thin. + +### 4. Prometheus Collector (virt-handler) + +GPU metrics are collected as part of the existing **domainstats** collector, following the same `resourceMetrics` pattern used for CPU, +memory, block, network, and filesystem metrics. No separate collector is needed. + +On each Prometheus scrape, the domainstats scraper: + +1. Connects to each VMI's virt-launcher via its cmd-client socket (same as for domain stats). +2. 
Calls `cli.GetGPUMetrics()` alongside `GetDomainStats()` and `GetFilesystems()` within the same scrape. +3. Parses the JSON response into `GPUMetricsResponse` and stores it in `VirtualMachineInstanceStats.GPUStats`. +4. The `gpuMetrics` resource metrics implementation emits collector results for each GPU device. + +GPU metric scrape failures are logged at warning verbosity and do not block the rest of the domain stats collection. If the guest agent is +not installed or not running, `GPUStats` is nil and no GPU metrics are emitted for that VMI. + +This approach reuses the existing `ConcurrentCollector` infrastructure (concurrency limiting, per-VMI timeouts, socket discovery) rather +than duplicating it. + +### Metrics Emitted + +| Metric | Type | Description | +|--------|------|-------------| +| `kubevirt_vmi_gpu_agent_status` | Gauge | Agent status (0 = OK, non-zero = NVML error code) | +| `kubevirt_vmi_gpu_utilization_percent` | Gauge | GPU compute utilization (0-100) | +| `kubevirt_vmi_gpu_memory_utilization_percent` | Gauge | GPU memory controller utilization (0-100) | +| `kubevirt_vmi_gpu_memory_used_bytes` | Gauge | GPU memory used in bytes | +| `kubevirt_vmi_gpu_memory_total_bytes` | Gauge | GPU total memory in bytes | +| `kubevirt_vmi_gpu_temperature_celsius` | Gauge | GPU temperature in degrees Celsius | +| `kubevirt_vmi_gpu_power_usage_milliwatts` | Gauge | GPU power draw in milliwatts | +| `kubevirt_vmi_gpu_ecc_errors_single_bit_total` | Gauge | Lifetime corrected ECC error count from NVML | +| `kubevirt_vmi_gpu_ecc_errors_double_bit_total` | Gauge | Lifetime uncorrected ECC error count from NVML | +| `kubevirt_vmi_gpu_encoder_utilization_percent` | Gauge | Video encoder utilization (0-100) | +| `kubevirt_vmi_gpu_decoder_utilization_percent` | Gauge | Video decoder utilization (0-100) | +| `kubevirt_vmi_gpu_running_processes` | Gauge | Number of compute processes on the GPU | + +All per-device metrics carry labels: `node`, `namespace`, `name`, 
`gpu_index`, `gpu_uuid`, `gpu_name`, plus VMI labels prefixed with +`kubernetes_vmi_label_`. The `gpu_agent_status` metric carries `version`, `error_code`, and `error_message` labels instead of per-device +labels. + +## API Examples + +No changes to the KubeVirt API are required. The virtio-serial channel is added automatically when GPUs are present in the VMI spec: + +```yaml +apiVersion: kubevirt.io/v1 +kind: VirtualMachineInstance +metadata: + name: gpu-workload +spec: + domain: + devices: + gpus: + - name: gpu1 + deviceName: nvidia.com/A100 +``` + +The GPU metrics channel appears in the domain XML alongside the existing guest agent channel: + +```xml + + + + +``` + +## Alternatives + +### VSOCK Instead of Virtio-Serial + +VSOCK (`AF_VSOCK`) provides socket-based communication between guest and host without virtio-serial. + +**Rejected because:** +- VSOCK requires kernel support that is not universally available, especially on older guests and Windows. +- Virtio-serial is already used by KubeVirt for qemu-guest-agent and downward metrics, making it a proven transport. +- Virtio-serial channels appear as simple character devices in the guest, making the agent trivial to implement on both Linux and Windows. + +### Host-Side GPU Metrics (DCGM / Node Exporter) + +Collect GPU metrics from the host using NVIDIA DCGM or the GPU node exporter. + +**Rejected as the sole approach because:** +- The NVIDIA GPU Operator does not deploy DCGM exporter on nodes where GPUs are configured for passthrough or vGPU, because the host no +longer has direct access to the device. + +### Embedding Metrics in QEMU Guest Agent + +Extend the existing QEMU guest agent to collect GPU metrics. + +**Rejected because:** +- Out-of-scope arbitrary NVIDIA-specific monitoring commands to the QEMU guest agent. + +## Scalability + +- **Per-scrape overhead**: One UNIX socket connection per VMI with GPUs, per Prometheus scrape. Connection is short-lived (connect, request, +read, close). 
A 5-second timeout prevents slow agents from blocking the scrape. +- **Caching**: GPU metrics are cached in virt-launcher with a 3.25-second TTL, so multiple Prometheus scrapes within that window reuse the +same data without reconnecting to the guest agent. +- **Concurrency**: GPU metrics are fetched as part of the existing domainstats scraper, which scrapes all VMIs in parallel using the +`ConcurrentCollector` infrastructure. +- **No persistent connections**: The host does not maintain long-lived connections to guest agents. +- **Scale**: Comparable to the existing domain stats and filesystem stats collection, which already scrape per-VMI data on each Prometheus +collection. + +## Update/Rollback Compatibility + +- The virtio-serial channel is only added when the VMI has GPU devices. Once the `GPUMetrics` feature gate is implemented, disabling it or +rolling back will remove the channel from new VMIs; existing running VMIs retain the channel until they are stopped. +- The guest agent is an opt-in installation. If the agent is not installed, virt-handler logs a connection failure and emits no GPU metrics +for that VMI. +- No API changes; no migration compatibility concerns. + +## Functional Testing Approach + +- **Unit tests**: Test the collector callback with mock socket responses (success, error, timeout, agent not running). +- **Unit tests**: Test virtio-serial channel creation in domain XML converter when GPUs are present vs. absent. +- **Integration tests**: Start a VMI with a mock GPU metrics agent, verify `kubevirt_vmi_gpu_*` metrics are emitted from the virt-handler +metrics endpoint. +- **Guest agent tests**: Tested in the gpu-metrics-agent repo (protocol, NVML mock, reconnection behavior). 
+ +## Implementation History + +## Graduation Requirements + +### Alpha + +- [ ] Feature gate `GPUMetrics` guards all code changes +- [ ] Virtio-serial channel created for VMIs with GPU devices +- [ ] virt-handler collector scrapes guest agent and emits Prometheus metrics +- [ ] Guest agent supports Linux with NVML +- [ ] Unit tests for collector, channel creation, and agent protocol +- [ ] Documentation for installing and running the guest agent + +### Beta + +- [ ] Guest agent supports Windows +- [ ] Integration tests with mock agent in kubevirtci +- [ ] Prometheus recording rules and/or alerts for common GPU failure scenarios +- [ ] Protocol versioning validated (agent version vs. host expectations) + +### GA + +- [ ] Stable for at least two releases with no protocol-breaking changes From f64ad5c6c07bad1a2a0149b4d506c5e820961e21 Mon Sep 17 00:00:00 2001 From: machadovilaca Date: Wed, 15 Apr 2026 11:33:33 +0100 Subject: [PATCH 2/6] VEP #254: Present guest-host communication alternatives Signed-off-by: machadovilaca --- .../gpu-metrics-via-virtio-serial/vep.md | 136 +++++++++--------- 1 file changed, 70 insertions(+), 66 deletions(-) diff --git a/veps/sig-observability/gpu-metrics-via-virtio-serial/vep.md b/veps/sig-observability/gpu-metrics-via-virtio-serial/vep.md index 4c5e9fe4..8e8f0933 100644 --- a/veps/sig-observability/gpu-metrics-via-virtio-serial/vep.md +++ b/veps/sig-observability/gpu-metrics-via-virtio-serial/vep.md @@ -4,7 +4,7 @@ ### Target releases -- This VEP targets alpha for version: +- This VEP targets alpha for version: v1.9 - This VEP targets beta for version: - This VEP targets GA for version: @@ -41,7 +41,6 @@ per-GPU observability that is consistent with the existing `kubevirt_vmi_*` metr - Expose per-VM, per-GPU utilization metrics as Prometheus metrics from virt-handler. - Support both GPU passthrough and vGPU devices. - Support Linux and Windows guests. -- Use virtio-serial as the transport, avoiding network dependencies inside the guest. 
- Keep the guest agent lightweight, stateless, and easy to install. ## Non Goals @@ -74,8 +73,8 @@ hardware failure. ## Repos -- https://github.com/kubevirt/kubevirt (virtio-serial channel, virt-handler collector) -- https://github.com/kubevirt/gpu-metrics-agent (guest-side agent, separate repo) +- https://github.com/kubevirt/kubevirt +- https://github.com/kubevirt/gpu-metrics-agent ## Design @@ -84,47 +83,79 @@ The design has four components spanning three processes: ``` Guest VM (QEMU) virt-launcher virt-handler +--------------------------+ +-------------------------------+ +---------------------------+ -| | | | | | -| gpu-metrics-agent | | QEMU | | domainstats scraper | -| | | virtio-serial backend | | | -| - collects NVML metrics | | gpu-metrics.sock (UNIX) | | - GetDomainStats() | -| - responds with JSON | <====> | | | - GetFilesystems() | -| | | DomainManager | | - GetGPUMetrics() | -| /dev/virtio-ports/ | | gpuMetricsCache | | | -| org.kubevirt. | | TimeDefinedCache (3.25s) | | resourceMetrics: | -| gpu-metrics.0 | | scrapeGPUMetrics() | | gpuMetrics.Collect() | -| | | connect → GET\n → JSON | | → kubevirt_vmi_gpu_* | -+--------------------------+ | | | | - | cmd-server (gRPC) | | cmd-client (gRPC) | - | GetGPUMetrics() RPC | <-- | GetGPUMetrics() | - +-------------------------------+ +---------------------------+ +| | | DomainManager | | | +| | | gpuMetricsCache | | | +| gpu-metrics-agent | | TimeDefinedCache (3.25s) | | | +| - collects NVML metrics | <====> | scrapeGPUMetrics() | | domainstats scraper | +| | | | | - GetDomainStats() | ++--------------------------+ | cmd-server (gRPC) | | - GetFilesystems() | + | GetGPUMetrics() RPC | <-- | - GetGPUMetrics() | + +-------------------------------+ | | + | resourceMetrics: | + | gpuMetrics.Collect() | + | → kubevirt_vmi_gpu_* | + +---------------------------+ ``` -### 1. Virtio-Serial Channel (virt-launcher) +### 1. 
Guest-Host Communication -When a VMI has GPU devices configured (`spec.domain.devices.gpus`), the domain converter adds a virtio-serial channel to the libvirt domain -XML. The virt-launcher process creates the socket directory (`/var/run/kubevirt-private/gpu-metrics-channel/`) during initialization, before -the domain is started. libvirt/QEMU then binds the UNIX socket at that path. +#### Virtio-Serial -This produces: -- **Host side (virt-launcher pod)**: UNIX socket at `/var/run/kubevirt-private/gpu-metrics-channel/gpu-metrics.sock` -- **Guest side**: character device at `/dev/virtio-ports/org.kubevirt.gpu-metrics.0` (Linux) or named pipe -`\\.\Global\org.kubevirt.gpu-metrics.0` (Windows) +Virtio-serial provides bidirectional character-device-based channels between the guest and host. The host side exposes a UNIX socket, and +the guest side exposes a character device (`/dev/virtio-ports/` on Linux) or named pipe (`\\.\Global\` on Windows). +KubeVirt already uses virtio-serial for qemu-guest-agent and downward metrics. -### 2. Guest Agent (gpu-metrics-agent) +Linux has native support and Windows is supported through the virtio-win driver package. + +**Downsides:** +- Known issues with large data transfers: Windows drivers fail WriteFile calls >2MB. +- No flow control. The guest can't detect whether the host has connected or disconnected from the socket. + +#### VSOCK + +VSOCK (`AF_VSOCK`) is a socket address family for guest-host communication using the virtio-vsock transport. KubeVirt already has VSOCK +support with per-VMI CID assignment by virt-controller. + +**Downsides:** +- Requires Linux kernel 4.8+ in the guest; older kernels have no support. +- Windows guests require custom virtio-win drivers and a non-standard socket library (`viosocklib`) instead of native Winsock2. + +#### QEMU Guest Agent guest-file-read + +The guest metrics agent writes GPU metrics to a known file path inside the guest (e.g., `/var/lib/kubevirt/gpu-metrics.json`). 
The host +reads that file via the QEMU Guest Agent's `guest-file-open`, `guest-file-read`, and `guest-file-close` commands, which travel over the +existing qemu-guest-agent virtio-serial channel. No additional virtio-serial channel or transport is needed. + +**Downsides:** +- Each scrape requires three QGA round-trips (open, read, close), adding latency compared to a direct socket connection. +- Reading while the guest agent is writing can produce partial or corrupt JSON. +- KubeVirt does not currently expose `guest-file-read`. Enabling would allow reading arbitrary guest files, therefore would need a careful +security analysis. -A standalone Go binary that runs inside the guest as a systemd service (Linux) or Windows service. On startup it: -1. Opens the virtio-serial device. -2. Initializes NVML (gracefully handles failure and remains running with error responses). -3. Enters a read loop waiting for request lines from the host. -4. On each request, collects GPU metrics via NVML and writes a JSON response. +#### QEMU Guest Agent guest-exec + +The host uses the QGA `guest-exec` command to run a metrics collection binary or script inside the guest on each scrape, and retrieves the +output via `guest-exec-status`. This avoids the need for a persistent guest agent process or additional virtio-serial channels. + +**Downsides:** +- `guest-exec` is disabled by default in many builds (e.g., RHEL/CentOS) due to security concerns, as it allows arbitrary command execution +inside the guest. +- Common issues with SELinux policies blocking executed commands. +- Output is base64-encoded and must be polled via `guest-exec-status`, adding latency. + +#### Embedding Metrics in QEMU Guest Agent + +Extend the existing QEMU guest agent to collect GPU metrics. + +**Downsides:** +- Out-of-scope arbitrary NVIDIA-specific monitoring commands to the QEMU guest agent. +- Push back from QEMU guest agent maintainers team. 
-The request/response protocol is newline-delimited: -- **Request**: any text line ending with `\n` (e.g., `GET\n`) -- **Response**: a single JSON object followed by `\n` +### 2. Guest Agent (gpu-metrics-agent) -The agent handles host disconnects (EOF on the char device when the host closes the socket) by re-entering the read loop on the same file -descriptor, matching virtio-serial reconnection semantics. +A standalone Go binary that runs inside the guest as a systemd service (Linux) or Windows service. +1. Initializes NVML (gracefully handles failure and remains running with error responses). +2. On each request, collects GPU metrics via NVML and writes a JSON response. Response schema: ```json @@ -209,7 +240,7 @@ labels. ## API Examples -No changes to the KubeVirt API are required. The virtio-serial channel is added automatically when GPUs are present in the VMI spec: +No changes to the KubeVirt API are required. The setup is enabled when GPUs are present in the VMI spec: ```yaml apiVersion: kubevirt.io/v1 @@ -224,45 +255,18 @@ spec: deviceName: nvidia.com/A100 ``` -The GPU metrics channel appears in the domain XML alongside the existing guest agent channel: - -```xml - - - - -``` - ## Alternatives -### VSOCK Instead of Virtio-Serial - -VSOCK (`AF_VSOCK`) provides socket-based communication between guest and host without virtio-serial. - -**Rejected because:** -- VSOCK requires kernel support that is not universally available, especially on older guests and Windows. -- Virtio-serial is already used by KubeVirt for qemu-guest-agent and downward metrics, making it a proven transport. -- Virtio-serial channels appear as simple character devices in the guest, making the agent trivial to implement on both Linux and Windows. - ### Host-Side GPU Metrics (DCGM / Node Exporter) Collect GPU metrics from the host using NVIDIA DCGM or the GPU node exporter. 
-**Rejected as the sole approach because:** +**Rejected because:** - The NVIDIA GPU Operator does not deploy DCGM exporter on nodes where GPUs are configured for passthrough or vGPU, because the host no longer has direct access to the device. -### Embedding Metrics in QEMU Guest Agent - -Extend the existing QEMU guest agent to collect GPU metrics. - -**Rejected because:** -- Out-of-scope arbitrary NVIDIA-specific monitoring commands to the QEMU guest agent. - ## Scalability -- **Per-scrape overhead**: One UNIX socket connection per VMI with GPUs, per Prometheus scrape. Connection is short-lived (connect, request, -read, close). A 5-second timeout prevents slow agents from blocking the scrape. - **Caching**: GPU metrics are cached in virt-launcher with a 3.25-second TTL, so multiple Prometheus scrapes within that window reuse the same data without reconnecting to the guest agent. - **Concurrency**: GPU metrics are fetched as part of the existing domainstats scraper, which scrapes all VMIs in parallel using the From 75c58daad6b140a5046d98d1964dad6f57d8842c Mon Sep 17 00:00:00 2001 From: machadovilaca Date: Mon, 20 Apr 2026 21:29:56 +0100 Subject: [PATCH 3/6] VEP #254: Replace custom agent with DCGM VSOCK Signed-off-by: machadovilaca --- .../vep.md | 226 ++++++++---------- 1 file changed, 101 insertions(+), 125 deletions(-) rename veps/sig-observability/{gpu-metrics-via-virtio-serial => gpu-metrics-via-vsock}/vep.md (53%) diff --git a/veps/sig-observability/gpu-metrics-via-virtio-serial/vep.md b/veps/sig-observability/gpu-metrics-via-vsock/vep.md similarity index 53% rename from veps/sig-observability/gpu-metrics-via-virtio-serial/vep.md rename to veps/sig-observability/gpu-metrics-via-vsock/vep.md index 8e8f0933..2e06c102 100644 --- a/veps/sig-observability/gpu-metrics-via-virtio-serial/vep.md +++ b/veps/sig-observability/gpu-metrics-via-vsock/vep.md @@ -1,4 +1,4 @@ -# VEP #254: Guest GPU Metrics via virtio-serial +# VEP #254: Guest GPU Metrics via VSOCK ## VEP 
Status Metadata @@ -22,9 +22,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release* GPU workloads running inside KubeVirt virtual machines currently lack observability. Cluster administrators and users have no way to monitor GPU utilization, memory usage, temperature, power consumption, or error counts for GPUs passed through to VMs. -This VEP introduces a mechanism for collecting GPU metrics from inside the guest and exposing them as Prometheus metrics on the host. A -lightweight guest agent communicates with the host via a virtio-serial channel, and virt-handler scrapes the agent on each Prometheus -collection cycle to produce `kubevirt_vmi_gpu_*` metrics. +This VEP introduces a mechanism for collecting GPU metrics from inside the guest and exposing them as Prometheus metrics on the host. NVIDIA +DCGM (Data Center GPU Manager) 4.5.0 added native support for listening on the VSOCK protocol, enabling direct guest-to-host communication +without a custom guest agent. virt-launcher connects to DCGM inside the guest via VSOCK, and virt-handler scrapes virt-launcher on each +Prometheus collection cycle to produce `kubevirt_vmi_gpu_*` metrics. ## Motivation @@ -33,23 +34,24 @@ monitoring tools like NVIDIA DCGM exporter are not available in these configurat on nodes where GPUs are configured for passthrough or vGPU, because the host no longer has direct access to the device. This leaves GPU workloads inside VMs completely unmonitored. -By collecting metrics from inside the guest via NVML and forwarding them to the host over virtio-serial, KubeVirt can provide per-VM, -per-GPU observability that is consistent with the existing `kubevirt_vmi_*` metrics namespace, enabling unified dashboards and alerting. +NVIDIA DCGM 4.5.0 introduced native VSOCK support, allowing the DCGM daemon inside a guest VM to accept connections from the host over +VSOCK. 
By leveraging this capability, KubeVirt can collect GPU metrics directly from DCGM without maintaining a custom guest agent, providing +per-VM, per-GPU observability that is consistent with the existing `kubevirt_vmi_*` metrics namespace and enabling unified dashboards and +alerting. ## Goals - Expose per-VM, per-GPU utilization metrics as Prometheus metrics from virt-handler. - Support both GPU passthrough and vGPU devices. - Support Linux and Windows guests. -- Keep the guest agent lightweight, stateless, and easy to install. +- Leverage DCGM's native VSOCK support to avoid maintaining a custom guest agent. ## Non Goals -- Managing GPU drivers or NVML installation inside the guest. -- Supporting non-NVIDIA GPUs (AMD, Intel) in the initial implementation. The protocol is vendor-agnostic, but the first agent implementation -uses NVML. +- Managing GPU drivers or DCGM installation inside the guest. +- Supporting non-NVIDIA GPUs (AMD, Intel) in the initial implementation. - Alerting rules or Grafana dashboards (these can be added separately). -- Collecting GPU metrics from the host side (e.g., via DCGM). +- Collecting GPU metrics from the host side (e.g., via DCGM on the host). ## Definition of Users @@ -74,22 +76,21 @@ hardware failure. 
## Repos - https://github.com/kubevirt/kubevirt -- https://github.com/kubevirt/gpu-metrics-agent ## Design -The design has four components spanning three processes: +The design has three components spanning three processes: ``` Guest VM (QEMU) virt-launcher virt-handler +--------------------------+ +-------------------------------+ +---------------------------+ | | | DomainManager | | | | | | gpuMetricsCache | | | -| gpu-metrics-agent | | TimeDefinedCache (3.25s) | | | -| - collects NVML metrics | <====> | scrapeGPUMetrics() | | domainstats scraper | -| | | | | - GetDomainStats() | -+--------------------------+ | cmd-server (gRPC) | | - GetFilesystems() | - | GetGPUMetrics() RPC | <-- | - GetGPUMetrics() | +| DCGM (nv-hostengine) | | TimeDefinedCache (3.25s) | | | +| - collects GPU metrics | <====> | scrapeGPUMetrics() | | domainstats scraper | +| - listens on VSOCK | vsock | | | - GetDomainStats() | +| | | cmd-server (gRPC) | | - GetFilesystems() | ++--------------------------+ | GetGPUMetrics() RPC | <-- | - GetGPUMetrics() | +-------------------------------+ | | | resourceMetrics: | | gpuMetrics.Collect() | @@ -99,102 +100,44 @@ Guest VM (QEMU) virt-launcher virt-h ### 1. Guest-Host Communication -#### Virtio-Serial - -Virtio-serial provides bidirectional character-device-based channels between the guest and host. The host side exposes a UNIX socket, and -the guest side exposes a character device (`/dev/virtio-ports/` on Linux) or named pipe (`\\.\Global\` on Windows). -KubeVirt already uses virtio-serial for qemu-guest-agent and downward metrics. - -Linux has native support and Windows is supported through the virtio-win driver package. - -**Downsides:** -- Known issues with large data transfers: Windows drivers fail WriteFile calls >2MB. -- No flow control. The guest can't detect whether the host has connected or disconnected from the socket. 
- -#### VSOCK +#### VSOCK (chosen) VSOCK (`AF_VSOCK`) is a socket address family for guest-host communication using the virtio-vsock transport. KubeVirt already has VSOCK support with per-VMI CID assignment by virt-controller. -**Downsides:** -- Requires Linux kernel 4.8+ in the guest; older kernels have no support. -- Windows guests require custom virtio-win drivers and a non-standard socket library (`viosocklib`) instead of native Winsock2. - -#### QEMU Guest Agent guest-file-read +DCGM 4.5.0 added native support for listening on the VSOCK protocol. The DCGM daemon (`nv-hostengine`) inside the guest can be configured +to listen on a VSOCK port, allowing virt-launcher on the host to connect and query GPU metrics using DCGM's client protocol. This provides +proper socket semantics including flow control and connection state detection. -The guest metrics agent writes GPU metrics to a known file path inside the guest (e.g., `/var/lib/kubevirt/gpu-metrics.json`). The host -reads that file via the QEMU Guest Agent's `guest-file-open`, `guest-file-read`, and `guest-file-close` commands, which travel over the -existing qemu-guest-agent virtio-serial channel. No additional virtio-serial channel or transport is needed. +**Advantages over virtio-serial:** +- Standard socket API with flow control and connection state detection. +- No data transfer size limitations (virtio-serial Windows drivers fail WriteFile calls >2MB). +- Already supported by KubeVirt with per-VMI CID assignment. +- DCGM natively supports VSOCK, eliminating the need for a custom guest agent. **Downsides:** -- Each scrape requires three QGA round-trips (open, read, close), adding latency compared to a direct socket connection. -- Reading while the guest agent is writing can produce partial or corrupt JSON. -- KubeVirt does not currently expose `guest-file-read`. Enabling would allow reading arbitrary guest files, therefore would need a careful -security analysis. 
- -#### QEMU Guest Agent guest-exec - -The host uses the QGA `guest-exec` command to run a metrics collection binary or script inside the guest on each scrape, and retrieves the -output via `guest-exec-status`. This avoids the need for a persistent guest agent process or additional virtio-serial channels. - -**Downsides:** -- `guest-exec` is disabled by default in many builds (e.g., RHEL/CentOS) due to security concerns, as it allows arbitrary command execution -inside the guest. -- Common issues with SELinux policies blocking executed commands. -- Output is base64-encoded and must be polled via `guest-exec-status`, adding latency. +- Requires Linux kernel 4.8+ in the guest; older kernels have no support. +- Windows guests require virtio-win drivers with VSOCK support. -#### Embedding Metrics in QEMU Guest Agent +### 2. Guest: DCGM with VSOCK -Extend the existing QEMU guest agent to collect GPU metrics. +NVIDIA DCGM runs inside the guest VM as the GPU metrics provider. The DCGM daemon (`nv-hostengine`) is configured to listen on a VSOCK +port, accepting connections from the host. -**Downsides:** -- Out-of-scope arbitrary NVIDIA-specific monitoring commands to the QEMU guest agent. -- Push back from QEMU guest agent maintainers team. - -### 2. Guest Agent (gpu-metrics-agent) - -A standalone Go binary that runs inside the guest as a systemd service (Linux) or Windows service. -1. Initializes NVML (gracefully handles failure and remains running with error responses). -2. On each request, collects GPU metrics via NVML and writes a JSON response. 
- -Response schema: -```json -{ - "version": "1.0.0", - "error": {"code": 12, "message": "ERROR_LIBRARY_NOT_FOUND"}, - "devices": [ - { - "index": 0, - "uuid": "GPU-abc-123", - "name": "Tesla T4", - "gpuUtilizationPercent": 75, - "memoryUtilizationPercent": 38, - "memoryUsedBytes": 4294967296, - "memoryTotalBytes": 17179869184, - "temperatureCelsius": 54, - "powerUsageMilliwatts": 121180, - "powerLimitMilliwatts": 250000, - "eccErrorsSingleBit": 0, - "eccErrorsDoubleBit": 0, - "encoderUtilizationPercent": 15, - "decoderUtilizationPercent": 5, - "runningProcesses": 3, - "pcieTxBytesPerSecond": 102400, - "pcieRxBytesPerSecond": 204800 - } - ] -} -``` +DCGM collects GPU metrics via NVML and exposes them through its client API. The guest only needs DCGM installed and configured to listen +on VSOCK; no additional KubeVirt-specific agent is required. -If NVML is unavailable, `error` is populated and `devices` is empty. The agent remains running. +The metrics collected from DCGM include GPU utilization, memory usage, temperature, power consumption, ECC errors, encoder/decoder +utilization, and running process counts. ### 3. virt-launcher: DomainManager and gRPC Two layers handle GPU metrics on the virt-launcher side: -**DomainManager (`LibvirtDomainManager`)**: A `gpuMetricsCache` (`TimeDefinedCache[string]`, 3250ms TTL) caches the raw JSON from the guest -agent. The recalculation function (`scrapeGPUMetrics`) connects to the local virtio-serial UNIX socket, sends `GET\n`, and reads the JSON -response. This follows the same caching pattern as `domainStatsCache` for domain stats. +**DomainManager (`LibvirtDomainManager`)**: A `gpuMetricsCache` (`TimeDefinedCache[string]`, 3250ms TTL) caches the metrics from DCGM. +The recalculation function (`scrapeGPUMetrics`) connects to the guest's DCGM via VSOCK (using the VMI's CID and a well-known port), queries +GPU metrics through the DCGM client protocol, and returns the response. 
This follows the same caching pattern as `domainStatsCache` for +domain stats. **cmd-server (`GetGPUMetrics` RPC)**: A new `GetGPUMetrics` method on the `Cmd` gRPC service delegates to `DomainManager.GetGPUMetrics()`, which returns the cached value. This follows the same pattern as `GetDomainStats`, keeping the cmd-server thin. @@ -208,11 +151,11 @@ On each Prometheus scrape, the domainstats scraper: 1. Connects to each VMI's virt-launcher via its cmd-client socket (same as for domain stats). 2. Calls `cli.GetGPUMetrics()` alongside `GetDomainStats()` and `GetFilesystems()` within the same scrape. -3. Parses the JSON response into `GPUMetricsResponse` and stores it in `VirtualMachineInstanceStats.GPUStats`. +3. Parses the response into `GPUMetricsResponse` and stores it in `VirtualMachineInstanceStats.GPUStats`. 4. The `gpuMetrics` resource metrics implementation emits collector results for each GPU device. -GPU metric scrape failures are logged at warning verbosity and do not block the rest of the domain stats collection. If the guest agent is -not installed or not running, `GPUStats` is nil and no GPU metrics are emitted for that VMI. +GPU metric scrape failures are logged at warning verbosity and do not block the rest of the domain stats collection. If DCGM is not +installed or not running inside the guest, `GPUStats` is nil and no GPU metrics are emitted for that VMI. This approach reuses the existing `ConcurrentCollector` infrastructure (concurrency limiting, per-VMI timeouts, socket discovery) rather than duplicating it. @@ -221,22 +164,20 @@ than duplicating it. 
| Metric | Type | Description | |--------|------|-------------| -| `kubevirt_vmi_gpu_agent_status` | Gauge | Agent status (0 = OK, non-zero = NVML error code) | | `kubevirt_vmi_gpu_utilization_percent` | Gauge | GPU compute utilization (0-100) | | `kubevirt_vmi_gpu_memory_utilization_percent` | Gauge | GPU memory controller utilization (0-100) | | `kubevirt_vmi_gpu_memory_used_bytes` | Gauge | GPU memory used in bytes | | `kubevirt_vmi_gpu_memory_total_bytes` | Gauge | GPU total memory in bytes | | `kubevirt_vmi_gpu_temperature_celsius` | Gauge | GPU temperature in degrees Celsius | | `kubevirt_vmi_gpu_power_usage_milliwatts` | Gauge | GPU power draw in milliwatts | -| `kubevirt_vmi_gpu_ecc_errors_single_bit_total` | Gauge | Lifetime corrected ECC error count from NVML | -| `kubevirt_vmi_gpu_ecc_errors_double_bit_total` | Gauge | Lifetime uncorrected ECC error count from NVML | +| `kubevirt_vmi_gpu_ecc_errors_single_bit_total` | Gauge | Lifetime corrected ECC error count | +| `kubevirt_vmi_gpu_ecc_errors_double_bit_total` | Gauge | Lifetime uncorrected ECC error count | | `kubevirt_vmi_gpu_encoder_utilization_percent` | Gauge | Video encoder utilization (0-100) | | `kubevirt_vmi_gpu_decoder_utilization_percent` | Gauge | Video decoder utilization (0-100) | | `kubevirt_vmi_gpu_running_processes` | Gauge | Number of compute processes on the GPU | All per-device metrics carry labels: `node`, `namespace`, `name`, `gpu_index`, `gpu_uuid`, `gpu_name`, plus VMI labels prefixed with -`kubernetes_vmi_label_`. The `gpu_agent_status` metric carries `version`, `error_code`, and `error_message` labels instead of per-device -labels. +`kubernetes_vmi_label_`. 
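The label scheme described above can be sketched in Go using only the standard library. This is an illustrative sketch, not the collector's actual code: `gpuDevice`, `gpuLabels`, `sanitizeLabelName`, and `formatGauge` are hypothetical names, and replacing invalid characters in VMI label keys with underscores is an assumption about how keys such as `app.kubernetes.io/name` would be made valid. A real collector would most likely use the Prometheus client library instead of formatting samples by hand.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// gpuDevice is a hypothetical per-GPU identity used only in this sketch.
type gpuDevice struct {
	Index string
	UUID  string
	Name  string
}

// sanitizeLabelName replaces every character that is not a letter, digit,
// or underscore with an underscore (assumed behavior, see lead-in).
func sanitizeLabelName(s string) string {
	var b strings.Builder
	for _, r := range s {
		if (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') || r == '_' {
			b.WriteRune(r)
		} else {
			b.WriteByte('_')
		}
	}
	return b.String()
}

// gpuLabels builds the per-device label set: the fixed labels plus the
// VMI's own labels under the kubernetes_vmi_label_ prefix.
func gpuLabels(node, namespace, name string, dev gpuDevice, vmiLabels map[string]string) map[string]string {
	labels := map[string]string{
		"node":      node,
		"namespace": namespace,
		"name":      name,
		"gpu_index": dev.Index,
		"gpu_uuid":  dev.UUID,
		"gpu_name":  dev.Name,
	}
	for k, v := range vmiLabels {
		labels["kubernetes_vmi_label_"+sanitizeLabelName(k)] = v
	}
	return labels
}

// formatGauge renders one sample in the Prometheus text exposition format,
// with labels sorted by name so the output is deterministic.
func formatGauge(metric string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	// Escape backslash, double quote, and newline in label values.
	esc := strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf(`%s="%s"`, k, esc.Replace(labels[k])))
	}
	return fmt.Sprintf("%s{%s} %g", metric, strings.Join(parts, ","), value)
}

func main() {
	dev := gpuDevice{Index: "0", UUID: "GPU-abc-123", Name: "Tesla T4"}
	labels := gpuLabels("node01", "default", "gpu-workload", dev,
		map[string]string{"app.kubernetes.io/name": "trainer"})
	fmt.Println(formatGauge("kubevirt_vmi_gpu_utilization_percent", labels, 75))
}
```

Sorting the label names is not required by Prometheus; it is done here only to make the rendered line stable for comparison in tests.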
## API Examples @@ -257,6 +198,24 @@ spec: ## Alternatives +### Custom Guest Agent via Virtio-Serial + +A standalone Go binary (`gpu-metrics-agent`) runs inside the guest, collects GPU metrics via NVML, and communicates with the host over a +dedicated virtio-serial channel using a simple text protocol (`GET\n` -> JSON response). + +**Rejected because:** +- Requires maintaining a separate guest agent repository and release lifecycle. +- Virtio-serial lacks flow control and connection state detection. +- Windows virtio-serial drivers have known issues with large data transfers (>2MB). +- DCGM 4.5.0's native VSOCK support makes a custom agent unnecessary. + +### Custom Guest Agent via VSOCK + +Same as above but using VSOCK instead of virtio-serial as the transport. + +**Rejected because:** +- Still requires maintaining a custom guest agent when DCGM can serve metrics directly. + ### Host-Side GPU Metrics (DCGM / Node Exporter) Collect GPU metrics from the host using NVIDIA DCGM or the GPU node exporter. @@ -265,31 +224,49 @@ Collect GPU metrics from the host using NVIDIA DCGM or the GPU node exporter. - The NVIDIA GPU Operator does not deploy DCGM exporter on nodes where GPUs are configured for passthrough or vGPU, because the host no longer has direct access to the device. +### QEMU Guest Agent guest-file-read + +The guest writes GPU metrics to a file, and the host reads it via QGA's `guest-file-open`, `guest-file-read`, and `guest-file-close` +commands. + +**Rejected because:** +- Each scrape requires three QGA round-trips, adding latency. +- Reading while writing can produce partial or corrupt data. +- Enabling `guest-file-read` allows reading arbitrary guest files, requiring careful security analysis. + +### QEMU Guest Agent guest-exec + +The host uses QGA `guest-exec` to run a metrics collection command inside the guest. + +**Rejected because:** +- `guest-exec` is disabled by default in many distributions (e.g., RHEL/CentOS) due to security concerns. 
+- Common SELinux issues blocking executed commands. +- Output is base64-encoded and must be polled, adding latency. + ## Scalability - **Caching**: GPU metrics are cached in virt-launcher with a 3.25-second TTL, so multiple Prometheus scrapes within that window reuse the -same data without reconnecting to the guest agent. +same data without reconnecting to DCGM. - **Concurrency**: GPU metrics are fetched as part of the existing domainstats scraper, which scrapes all VMIs in parallel using the `ConcurrentCollector` infrastructure. -- **No persistent connections**: The host does not maintain long-lived connections to guest agents. +- **No persistent connections**: The host does not maintain long-lived connections to DCGM in the guest. - **Scale**: Comparable to the existing domain stats and filesystem stats collection, which already scrape per-VMI data on each Prometheus collection. ## Update/Rollback Compatibility -- The virtio-serial channel is only added when the VMI has GPU devices. Once the `GPUMetrics` feature gate is implemented, disabling it or -rolling back will remove the channel from new VMIs; existing running VMIs retain the channel until they are stopped. -- The guest agent is an opt-in installation. If the agent is not installed, virt-handler logs a connection failure and emits no GPU metrics -for that VMI. +- VSOCK is enabled per-VMI via KubeVirt's existing VSOCK infrastructure. Once the `GPUMetrics` feature gate is implemented, disabling it or +rolling back will stop GPU metrics collection for new VMIs; existing running VMIs are unaffected. +- DCGM inside the guest is an opt-in installation by the VM user. If DCGM is not installed or not listening on VSOCK, virt-handler logs +a connection failure and emits no GPU metrics for that VMI. - No API changes; no migration compatibility concerns. ## Functional Testing Approach -- **Unit tests**: Test the collector callback with mock socket responses (success, error, timeout, agent not running). 
-- **Unit tests**: Test virtio-serial channel creation in domain XML converter when GPUs are present vs. absent. -- **Integration tests**: Start a VMI with a mock GPU metrics agent, verify `kubevirt_vmi_gpu_*` metrics are emitted from the virt-handler +- **Unit tests**: Test the collector callback with mock VSOCK responses (success, error, timeout, DCGM not running). +- **Unit tests**: Test VSOCK connection setup for VMIs with GPU devices present vs. absent. +- **Integration tests**: Start a VMI with a mock DCGM VSOCK listener, verify `kubevirt_vmi_gpu_*` metrics are emitted from the virt-handler metrics endpoint. -- **Guest agent tests**: Tested in the gpu-metrics-agent repo (protocol, NVML mock, reconnection behavior). ## Implementation History @@ -298,19 +275,18 @@ metrics endpoint. ### Alpha - [ ] Feature gate `GPUMetrics` guards all code changes -- [ ] Virtio-serial channel created for VMIs with GPU devices -- [ ] virt-handler collector scrapes guest agent and emits Prometheus metrics -- [ ] Guest agent supports Linux with NVML -- [ ] Unit tests for collector, channel creation, and agent protocol -- [ ] Documentation for installing and running the guest agent +- [ ] virt-launcher connects to guest DCGM via VSOCK and queries GPU metrics +- [ ] virt-handler collector scrapes virt-launcher and emits Prometheus metrics +- [ ] Unit tests for collector, VSOCK connection, and DCGM protocol handling +- [ ] Documentation for installing and configuring DCGM with VSOCK in the guest ### Beta -- [ ] Guest agent supports Windows -- [ ] Integration tests with mock agent in kubevirtci +- [ ] Windows guest support validated +- [ ] Integration tests with mock DCGM VSOCK listener in kubevirtci - [ ] Prometheus recording rules and/or alerts for common GPU failure scenarios -- [ ] Protocol versioning validated (agent version vs. 
host expectations) +- [ ] DCGM version compatibility validated (minimum version requirements documented) ### GA -- [ ] Stable for at least two releases with no protocol-breaking changes +- [ ] Stable for at least two releases with no breaking changes From 68385bb9aa0ec1950fe1557bc29b7935c9fd50fa Mon Sep 17 00:00:00 2001 From: machadovilaca Date: Wed, 29 Apr 2026 10:37:04 +0100 Subject: [PATCH 4/6] VEP #254: Add DCGM via regular Kubernetes Networking alternative Signed-off-by: machadovilaca --- .../gpu-metrics-via-vsock/vep.md | 25 +++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/veps/sig-observability/gpu-metrics-via-vsock/vep.md b/veps/sig-observability/gpu-metrics-via-vsock/vep.md index 2e06c102..5ce2de0b 100644 --- a/veps/sig-observability/gpu-metrics-via-vsock/vep.md +++ b/veps/sig-observability/gpu-metrics-via-vsock/vep.md @@ -243,6 +243,31 @@ The host uses QGA `guest-exec` to run a metrics collection command inside the gu - Common SELinux issues blocking executed commands. - Output is base64-encoded and must be polled, adding latency. +### Exposing DCGM via regular Kubernetes Networking + +Instead of using VSOCK, DCGM inside the guest could listen on a standard network +interface and be exposed to Prometheus via Kubernetes Services and +ServiceMonitors. + +**Rejected because:** + +- **Additional resource overhead**: Each VM would need a dedicated Service and +ServiceMonitor created and deleted in sync with the VM lifecycle. This scales +with the number of GPU VMs and adds complexity that does not exist with the +VSOCK approach. + +- **Network dependency**: Requires the guest to have a network interface on the +pod network. Not all VM use cases have a usable network configuration: some are +SR-IOV only, isolated via Multus, or have no network connectivity at all. VSOCK is +independent of the networking configuration. + +- **Security**: With VSOCK, communication is scoped to host-guest only and is +managed entirely by KubeVirt.
virt-handler's Service and ServiceMonitor are the +only externally reachable endpoints where this data will be exposed, and their +security is handled by KubeVirt. Exposing DCGM on a network interface shifts +this responsibility to the user, who must secure DCGM against access from other +sources. + ## Scalability - **Caching**: GPU metrics are cached in virt-launcher with a 3.25-second TTL, so multiple Prometheus scrapes within that window reuse the From db608f133fdff39a066c5a740a8c15cc09e3c904 Mon Sep 17 00:00:00 2001 From: machadovilaca Date: Thu, 30 Apr 2026 11:09:55 +0100 Subject: [PATCH 5/6] VEP #254: Integrate GPU metrics with GetVMStats unified RPC Signed-off-by: machadovilaca --- .../gpu-metrics-via-vsock/vep.md | 134 ++++++++++-------- 1 file changed, 73 insertions(+), 61 deletions(-) diff --git a/veps/sig-observability/gpu-metrics-via-vsock/vep.md b/veps/sig-observability/gpu-metrics-via-vsock/vep.md index 5ce2de0b..126d3971 100644 --- a/veps/sig-observability/gpu-metrics-via-vsock/vep.md +++ b/veps/sig-observability/gpu-metrics-via-vsock/vep.md @@ -24,8 +24,9 @@ GPU utilization, memory usage, temperature, power consumption, or error counts f This VEP introduces a mechanism for collecting GPU metrics from inside the guest and exposing them as Prometheus metrics on the host. NVIDIA DCGM (Data Center GPU Manager) 4.5.0 added native support for listening on the VSOCK protocol, enabling direct guest-to-host communication -without a custom guest agent. virt-launcher connects to DCGM inside the guest via VSOCK, and virt-handler scrapes virt-launcher on each -Prometheus collection cycle to produce `kubevirt_vmi_gpu_*` metrics. +without a custom guest agent. virt-launcher connects to DCGM inside the guest via VSOCK and exposes the GPU metrics data through the unified +`GetVMStats` gRPC call ([VEP #143](https://github.com/kubevirt/enhancements/pull/81)). 
virt-handler queries virt-launcher via `GetVMStats`, +and the observability-controller queries virt-handler to produce `kubevirt_vmi_gpu_*` metrics. ## Motivation @@ -41,10 +42,11 @@ alerting. ## Goals -- Expose per-VM, per-GPU utilization metrics as Prometheus metrics from virt-handler. +- Expose per-VM, per-GPU utilization metrics as Prometheus metrics from the observability-controller. - Support both GPU passthrough and vGPU devices. - Support Linux and Windows guests. - Leverage DCGM's native VSOCK support to avoid maintaining a custom guest agent. +- Integrate with the unified `GetVMStats` gRPC call rather than introducing a separate RPC. ## Non Goals @@ -76,34 +78,33 @@ hardware failure. ## Repos - https://github.com/kubevirt/kubevirt +- https://github.com/kubevirt/kubevirt-observability-controller ## Design -The design has three components spanning three processes: +The design integrates GPU metrics into the existing monitoring data pipeline established by +[VEP #143](https://github.com/kubevirt/enhancements/pull/81). GPU metrics become a new data category within the unified `GetVMStats` gRPC +call, following the same pattern as domain stats, guest info, and filesystem data. 
``` -Guest VM (QEMU) virt-launcher virt-handler -+--------------------------+ +-------------------------------+ +---------------------------+ -| | | DomainManager | | | -| | | gpuMetricsCache | | | -| DCGM (nv-hostengine) | | TimeDefinedCache (3.25s) | | | -| - collects GPU metrics | <====> | scrapeGPUMetrics() | | domainstats scraper | -| - listens on VSOCK | vsock | | | - GetDomainStats() | -| | | cmd-server (gRPC) | | - GetFilesystems() | -+--------------------------+ | GetGPUMetrics() RPC | <-- | - GetGPUMetrics() | - +-------------------------------+ | | - | resourceMetrics: | - | gpuMetrics.Collect() | - | → kubevirt_vmi_gpu_* | - +---------------------------+ +Guest VM (QEMU) virt-launcher virt-handler observability-controller ++--------------------------+ +----------------------------+ +---------------------+ +-------------------------+ +| | | | | | | | +| DCGM (nv-hostengine) | | DomainManager | | | | | +| - collects GPU metrics | <====> | gpuMetricsCache | | | | | +| - listens on VSOCK | vsock | TimeDefinedCache | | | | | +| | | scrapeGPUMetrics() | | | | queries virt-handler | ++--------------------------+ | | | GetVMStats() | <-- | for all VMI data | + | cmd-server (gRPC) | | includes gpuStats | | | + | GetVMStats() handler | <-- | | | emits: | + | includes gpuStats | | | | kubevirt_vmi_gpu_* | + +----------------------------+ +---------------------+ +-------------------------+ ``` -### 1. Guest-Host Communication +### 1. Prerequisites: VSOCK Enablement -#### VSOCK (chosen) - -VSOCK (`AF_VSOCK`) is a socket address family for guest-host communication using the virtio-vsock transport. KubeVirt already has VSOCK -support with per-VMI CID assignment by virt-controller. +Users must enable VSOCK on the VM spec to allow guest-host communication. VSOCK (`AF_VSOCK`) is a socket address family for guest-host +communication using the virtio-vsock transport. KubeVirt already has VSOCK support with per-VMI CID assignment by virt-controller. 
DCGM 4.5.0 added native support for listening on the VSOCK protocol. The DCGM daemon (`nv-hostengine`) inside the guest can be configured to listen on a VSOCK port, allowing virt-launcher on the host to connect and query GPU metrics using DCGM's client protocol. This provides @@ -122,7 +123,7 @@ proper socket semantics including flow control and connection state detection. ### 2. Guest: DCGM with VSOCK NVIDIA DCGM runs inside the guest VM as the GPU metrics provider. The DCGM daemon (`nv-hostengine`) is configured to listen on a VSOCK -port, accepting connections from the host. +port, accepting connections from the host. Users are responsible for installing and configuring DCGM in the guest. DCGM collects GPU metrics via NVML and exposes them through its client API. The guest only needs DCGM installed and configured to listen on VSOCK; no additional KubeVirt-specific agent is required. @@ -130,35 +131,33 @@ on VSOCK; no additional KubeVirt-specific agent is required. The metrics collected from DCGM include GPU utilization, memory usage, temperature, power consumption, ECC errors, encoder/decoder utilization, and running process counts. -### 3. virt-launcher: DomainManager and gRPC +### 3. virt-launcher: GPU Metrics in GetVMStats -Two layers handle GPU metrics on the virt-launcher side: +GPU metrics are integrated into the `GetVMStats` gRPC handler introduced by [VEP #143](https://github.com/kubevirt/enhancements/pull/81). +A new `GpuStatsRequest` / `gpuStats` field is added to `VMStatsRequest` / `VMStatsResponse`, following the same pattern as the other data +categories (domain stats, guest info, filesystems, etc.). **DomainManager (`LibvirtDomainManager`)**: A `gpuMetricsCache` (`TimeDefinedCache[string]`, 3250ms TTL) caches the metrics from DCGM. The recalculation function (`scrapeGPUMetrics`) connects to the guest's DCGM via VSOCK (using the VMI's CID and a well-known port), queries GPU metrics through the DCGM client protocol, and returns the response. 
This follows the same caching pattern as `domainStatsCache` for domain stats. -**cmd-server (`GetGPUMetrics` RPC)**: A new `GetGPUMetrics` method on the `Cmd` gRPC service delegates to `DomainManager.GetGPUMetrics()`, -which returns the cached value. This follows the same pattern as `GetDomainStats`, keeping the cmd-server thin. - -### 4. Prometheus Collector (virt-handler) +**GetVMStats handler**: When the caller includes `GpuStatsRequest` in the `VMStatsRequest`, the handler reads from `gpuMetricsCache` and +populates the `gpuStats` field in the response. This keeps GPU metrics collection consistent with the unified monitoring data pipeline. -GPU metrics are collected as part of the existing **domainstats** collector, following the same `resourceMetrics` pattern used for CPU, -memory, block, network, and filesystem metrics. No separate collector is needed. +### 4. virt-handler: Requesting GPU Stats -On each Prometheus scrape, the domainstats scraper: +virt-handler includes `GpuStatsRequest` when calling `GetVMStats` on each virt-launcher. The GPU metrics data is returned alongside domain +stats and other monitoring data in the same `GetVMStats` response. -1. Connects to each VMI's virt-launcher via its cmd-client socket (same as for domain stats). -2. Calls `cli.GetGPUMetrics()` alongside `GetDomainStats()` and `GetFilesystems()` within the same scrape. -3. Parses the response into `GPUMetricsResponse` and stores it in `VirtualMachineInstanceStats.GPUStats`. -4. The `gpuMetrics` resource metrics implementation emits collector results for each GPU device. +### 5. observability-controller: Emitting Prometheus Metrics -GPU metric scrape failures are logged at warning verbosity and do not block the rest of the domain stats collection. If DCGM is not -installed or not running inside the guest, `GPUStats` is nil and no GPU metrics are emitted for that VMI. 
+The observability-controller ([VEP #143](https://github.com/kubevirt/enhancements/pull/81)) queries virt-handler to collect runtime VM data. +GPU metrics are collected as part of this existing flow. The controller parses the `gpuStats` data from `GetVMStats` responses and emits +`kubevirt_vmi_gpu_*` Prometheus metrics. -This approach reuses the existing `ConcurrentCollector` infrastructure (concurrency limiting, per-VMI timeouts, socket discovery) rather -than duplicating it. +The `gpuMetrics` resource metrics implementation emits collector results for each GPU device found in the response. This follows the same +`resourceMetrics` pattern used for CPU, memory, block, network, and filesystem metrics. ### Metrics Emitted @@ -181,7 +180,7 @@ All per-device metrics carry labels: `node`, `namespace`, `name`, `gpu_index`, ` ## API Examples -No changes to the KubeVirt API are required. The setup is enabled when GPUs are present in the VMI spec: +Users must enable VSOCK on the VM spec and have DCGM installed and listening on VSOCK inside the guest: ```yaml apiVersion: kubevirt.io/v1 @@ -191,6 +190,7 @@ metadata: spec: domain: devices: + autoattachVSOCK: true gpus: - name: gpu1 deviceName: nvidia.com/A100 @@ -268,30 +268,41 @@ security is handled by KubeVirt. Exposing DCGM on a network interface shifts this responsibility to the user, who must secure DCGM against access from other sources. +### Dedicated GetGPUMetrics gRPC RPC + +A separate `GetGPUMetrics` RPC on the `Cmd` gRPC service, called by virt-handler alongside `GetDomainStats` and `GetFilesystems`. + +**Rejected because:** +- VEP #143 introduces a unified `GetVMStats` RPC that consolidates all monitoring data into a single call. Adding a separate RPC for GPU +metrics would work against that consolidation goal. +- GPU metrics fit naturally as a new data category within `GetVMStats`, following the same pattern as domain stats, guest info, and +filesystems. 
+ ## Scalability -- **Caching**: GPU metrics are cached in virt-launcher with a 3.25-second TTL, so multiple Prometheus scrapes within that window reuse the -same data without reconnecting to DCGM. -- **Concurrency**: GPU metrics are fetched as part of the existing domainstats scraper, which scrapes all VMIs in parallel using the -`ConcurrentCollector` infrastructure. +- **Caching**: GPU metrics are cached in virt-launcher with a 3.25-second TTL, so multiple scrapes within that window reuse the same data +without reconnecting to DCGM. +- **Unified collection**: GPU metrics are fetched as part of the `GetVMStats` call, adding no additional gRPC round-trips between +virt-handler and virt-launcher. - **No persistent connections**: The host does not maintain long-lived connections to DCGM in the guest. -- **Scale**: Comparable to the existing domain stats and filesystem stats collection, which already scrape per-VMI data on each Prometheus -collection. +- **Scale**: Comparable to the existing domain stats and filesystem stats collection, which already collect per-VMI data as part of +`GetVMStats`. ## Update/Rollback Compatibility -- VSOCK is enabled per-VMI via KubeVirt's existing VSOCK infrastructure. Once the `GPUMetrics` feature gate is implemented, disabling it or -rolling back will stop GPU metrics collection for new VMIs; existing running VMIs are unaffected. -- DCGM inside the guest is an opt-in installation by the VM user. If DCGM is not installed or not listening on VSOCK, virt-handler logs -a connection failure and emits no GPU metrics for that VMI. -- No API changes; no migration compatibility concerns. +- VSOCK must be enabled per-VMI by the user. Once the `GPUMetrics` feature gate is implemented, disabling it or rolling back will stop GPU +metrics collection for new VMIs; existing running VMIs are unaffected. +- This VEP depends on the `GetVMStats` RPC from VEP #143. If VEP #143 is not yet implemented, GPU metrics cannot be collected. 
+- DCGM inside the guest is an opt-in installation by the VM user. If DCGM is not installed or not listening on VSOCK, virt-launcher returns +empty `gpuStats` and no GPU metrics are emitted for that VMI. +- No API changes beyond requiring `autoattachVSOCK: true`; no migration compatibility concerns. ## Functional Testing Approach -- **Unit tests**: Test the collector callback with mock VSOCK responses (success, error, timeout, DCGM not running). -- **Unit tests**: Test VSOCK connection setup for VMIs with GPU devices present vs. absent. -- **Integration tests**: Start a VMI with a mock DCGM VSOCK listener, verify `kubevirt_vmi_gpu_*` metrics are emitted from the virt-handler -metrics endpoint. +- **Unit tests**: Test the GPU stats handler within `GetVMStats` with mock VSOCK responses (success, error, timeout, DCGM not running). +- **Unit tests**: Test VSOCK connection setup for VMIs with GPU devices and VSOCK enabled vs. VMIs where either is absent. +- **Integration tests**: Start a VMI with a mock DCGM VSOCK listener and verify that `kubevirt_vmi_gpu_*` metrics are emitted from the +observability-controller metrics endpoint. ## Implementation History @@ -300,10 +311,11 @@ metrics endpoint.
### Alpha - [ ] Feature gate `GPUMetrics` guards all code changes -- [ ] virt-launcher connects to guest DCGM via VSOCK and queries GPU metrics -- [ ] virt-handler collector scrapes virt-launcher and emits Prometheus metrics -- [ ] Unit tests for collector, VSOCK connection, and DCGM protocol handling -- [ ] Documentation for installing and configuring DCGM with VSOCK in the guest +- [ ] `GpuStatsRequest` / `gpuStats` field added to `GetVMStats` proto messages (depends on VEP #143) +- [ ] virt-launcher connects to guest DCGM via VSOCK and populates `gpuStats` in `GetVMStats` response +- [ ] observability-controller parses GPU stats and emits Prometheus metrics +- [ ] Unit tests for GPU stats collection, VSOCK connection, and DCGM protocol handling +- [ ] Documentation for enabling VSOCK and installing/configuring DCGM in the guest ### Beta From 152ed4da2dd4bcfd0d7bfa37f91e347f4c84ff49 Mon Sep 17 00:00:00 2001 From: machadovilaca Date: Thu, 30 Apr 2026 16:00:47 +0100 Subject: [PATCH 6/6] VEP #254: Add opt-in annotation, VSOCK limitations, and scope to Linux Signed-off-by: machadovilaca --- .../gpu-metrics-via-vsock/vep.md | 66 +++++++++++++++---- 1 file changed, 54 insertions(+), 12 deletions(-) diff --git a/veps/sig-observability/gpu-metrics-via-vsock/vep.md b/veps/sig-observability/gpu-metrics-via-vsock/vep.md index 126d3971..d38bab70 100644 --- a/veps/sig-observability/gpu-metrics-via-vsock/vep.md +++ b/veps/sig-observability/gpu-metrics-via-vsock/vep.md @@ -44,7 +44,7 @@ alerting. - Expose per-VM, per-GPU utilization metrics as Prometheus metrics from the observability-controller. - Support both GPU passthrough and vGPU devices. -- Support Linux and Windows guests. +- Support Linux guests. - Leverage DCGM's native VSOCK support to avoid maintaining a custom guest agent. - Integrate with the unified `GetVMStats` gRPC call rather than introducing a separate RPC. @@ -54,6 +54,7 @@ alerting. - Supporting non-NVIDIA GPUs (AMD, Intel) in the initial implementation. 
 - Alerting rules or Grafana dashboards (these can be added separately).
 - Collecting GPU metrics from the host side (e.g., via DCGM on the host).
+- Windows guest support (see [Future Work](#future-work)).

 ## Definition of Users

@@ -101,10 +102,27 @@ Guest VM (QEMU) virt-launcher virt-hand
 +----------------------------+ +---------------------+ +-------------------------+
 ```

-### 1. Prerequisites: VSOCK Enablement
+### 1. Opting In: Annotation and VSOCK Enablement

-Users must enable VSOCK on the VM spec to allow guest-host communication. VSOCK (`AF_VSOCK`) is a socket address family for guest-host
-communication using the virtio-vsock transport. KubeVirt already has VSOCK support with per-VMI CID assignment by virt-controller.
+Users must annotate the VM to enable GPU metrics collection and enable VSOCK on the VM spec. The annotation
+`kubevirt.io/gpu-metrics-collector` signals to virt-launcher that it should connect to DCGM via VSOCK and collect GPU metrics. Without this
+annotation, virt-launcher does not attempt any GPU metrics collection, regardless of whether the VM has GPUs or VSOCK enabled.
+
+```yaml
+annotations:
+  kubevirt.io/gpu-metrics-collector: "dcgm-vsock"
+```
+
+The annotation serves three purposes:
+
+1. **virt-launcher trigger**: virt-launcher only attempts DCGM VSOCK connections on VMIs carrying this annotation, avoiding unnecessary
+connection attempts and timeouts on VMs with non-NVIDIA GPUs or VMs where DCGM is not installed.
+2. **virt-api VSOCK conflict warning**: virt-api can check for this annotation and warn users when they attempt to use VSOCK via the
+KubeVirt client (`virtctl vsock`) that the VSOCK device is also in use for GPU metrics collection on the DCGM port.
+3. **Explicit opt-in**: Keeps GPU metrics collection strictly opt-in, making it clear which VMs are participating.
+
+Users must also enable VSOCK on the VM spec. VSOCK (`AF_VSOCK`) is a socket address family for guest-host communication using the
+virtio-vsock transport. KubeVirt already has VSOCK support with per-VMI CID assignment by virt-controller.

 DCGM 4.5.0 added native support for listening on the VSOCK protocol. The DCGM daemon (`nv-hostengine`) inside the guest can be configured
 to listen on a VSOCK port, allowing virt-launcher on the host to connect and query GPU metrics using DCGM's client protocol. This provides
@@ -120,6 +138,21 @@ proper socket semantics including flow control and connection state detection.
 - Requires Linux kernel 4.8+ in the guest; older kernels have no support.
 - Windows guests require virtio-win drivers with VSOCK support.

+**Shared VSOCK usage:**
+
+KubeVirt already uses VSOCK for two purposes: an internal gRPC service on VSOCK port 1 for TLS certificate distribution to guests, and
+user-initiated port-forwarding via `virtctl vsock`. GPU metrics collection adds another consumer on the same virtio-vsock device.
+Limitations to be aware of:
+
+- **Bandwidth sharing**: All VSOCK traffic (KubeVirt internal, `virtctl vsock`, and DCGM metrics) shares a single virtio-vsock device per
+VM. There is no QoS or prioritization between consumers.
+- **No connection isolation**: VSOCK multiplexes connections by port number on the same device.
+- **Port management**: DCGM must listen on a VSOCK port that does not conflict with port 1 (reserved by KubeVirt for its internal gRPC
+service) or other guest services listening on VSOCK.
+
+In practice, the GPU metrics payload is small (a few KB per scrape) and collected at most once every 3.25 seconds, so contention is unlikely
+under normal operation.
+
 ### 2. Guest: DCGM with VSOCK

 NVIDIA DCGM runs inside the guest VM as the GPU metrics provider. The DCGM daemon (`nv-hostengine`) is configured to listen on a VSOCK
@@ -137,10 +170,11 @@ GPU metrics are integrated into the `GetVMStats` gRPC handler introduced by [VEP
 A new `GpuStatsRequest` / `gpuStats` field is added to `VMStatsRequest` / `VMStatsResponse`, following the same pattern as the other data
 categories (domain stats, guest info, filesystems, etc.).

-**DomainManager (`LibvirtDomainManager`)**: A `gpuMetricsCache` (`TimeDefinedCache[string]`, 3250ms TTL) caches the metrics from DCGM.
-The recalculation function (`scrapeGPUMetrics`) connects to the guest's DCGM via VSOCK (using the VMI's CID and a well-known port), queries
-GPU metrics through the DCGM client protocol, and returns the response. This follows the same caching pattern as `domainStatsCache` for
-domain stats.
+**DomainManager (`LibvirtDomainManager`)**: When the VMI carries the `kubevirt.io/gpu-metrics-collector: "dcgm-vsock"` annotation,
+a `gpuMetricsCache` (`TimeDefinedCache[string]`, 3250ms TTL) caches the metrics from DCGM. The recalculation function
+(`scrapeGPUMetrics`) connects to the guest's DCGM via VSOCK (using the VMI's CID and a well-known port), queries GPU metrics through the
+DCGM client protocol, and returns the response. This follows the same caching pattern as `domainStatsCache` for domain stats. If the
+annotation is absent, no VSOCK connection is attempted.

 **GetVMStats handler**: When the caller includes `GpuStatsRequest` in the `VMStatsRequest`, the handler reads from `gpuMetricsCache` and
 populates the `gpuStats` field in the response. This keeps GPU metrics collection consistent with the unified monitoring data pipeline.
@@ -187,6 +221,8 @@ apiVersion: kubevirt.io/v1
 kind: VirtualMachineInstance
 metadata:
   name: gpu-workload
+  annotations:
+    kubevirt.io/gpu-metrics-collector: "dcgm-vsock"
 spec:
   domain:
     devices:
@@ -293,9 +329,9 @@ virt-handler and virt-launcher.
 - VSOCK must be enabled per-VMI by the user. Once the `GPUMetrics` feature gate is implemented, disabling it or rolling back will stop GPU
 metrics collection for new VMIs; existing running VMIs are unaffected.
 - This VEP depends on the `GetVMStats` RPC from VEP #143. If VEP #143 is not yet implemented, GPU metrics cannot be collected.
-- DCGM inside the guest is an opt-in installation by the VM user. If DCGM is not installed or not listening on VSOCK, virt-launcher returns
-empty `gpuStats` and no GPU metrics are emitted for that VMI.
-- No API changes beyond requiring `autoattachVSOCK: true`; no migration compatibility concerns.
+- GPU metrics collection is opt-in via the `kubevirt.io/gpu-metrics-collector` annotation. If the annotation is absent, DCGM is not
+installed, or VSOCK is not enabled, virt-launcher returns empty `gpuStats` and no GPU metrics are emitted for that VMI.
+- No API changes beyond the annotation and requiring `autoattachVSOCK: true`; no migration compatibility concerns.

 ## Functional Testing Approach

@@ -304,6 +340,13 @@ empty `gpuStats` and no GPU metrics are emitted for that VMI.
 - **Integration tests**: Start a VMI with a mock DCGM VSOCK listener, verify `kubevirt_vmi_gpu_*` metrics are emitted from the
 observability-controller metrics endpoint.

+## Future Work
+
+### Windows Guest Support
+
+Windows VSOCK driver support was added recently in virtio-win build 285. Once the driver matures and DCGM's VSOCK support on Windows is
+validated, Windows guest support can be added as a follow-up.
+
 ## Implementation History

 ## Graduation Requirements
@@ -319,7 +362,6 @@ observability-controller metrics endpoint.

 ### Beta

-- [ ] Windows guest support validated
 - [ ] Integration tests with mock DCGM VSOCK listener in kubevirtci
 - [ ] Prometheus recording rules and/or alerts for common GPU failure scenarios
 - [ ] DCGM version compatibility validated (minimum version requirements documented)