feat: add csi-volume-device-exporter #1

Open
sradco wants to merge 2 commits into openshift-virtualization:main from sradco:initial-exporter

@sradco sradco commented May 4, 2026

Summary

A Prometheus exporter DaemonSet that maps CSI volumes to their underlying node block devices, enabling correlation of storage path health metrics with Kubernetes workloads.

When a storage path degrades (e.g., a Fibre Channel link drops or an NVMe-oF controller dies), this exporter, combined with new node_exporter collectors, enables alerts that identify which PVs and VMs are affected.

Jira: https://redhat.atlassian.net/browse/CNV-66837

Components

  • Discovery engine - Reads kubelet vol_data.json + /proc/1/mountinfo for universal CSI driver coverage. Also supports driver-specific JSON (Trident, HPE).
  • Prometheus metrics - csi_volume_node_device_info maps each CSI volume to its block device, plus self-monitoring metrics.
  • Alerts:
    • CSIVolumeMultipathDegraded (warning) - PV-backed multipath has non-active paths
    • CSIVolumeMultipathLost (critical) - All multipath paths down, I/O failing
    • CSIVolumeNVMeSubsystemDegraded (warning) - NVMe-oF subsystem has non-live controllers
    • CSIVolumeNVMeSubsystemLost (critical) - All NVMe-oF controllers dead
    • CSIVolumeDeviceExporterDown (warning) - Exporter not scraped
  • Alert unit tests - promtool-based tests in hack/prom-rule-ci/.
  • Deployment manifests - DaemonSet, PodMonitor, SCC for OpenShift.
  • Runbooks - Actionable runbooks for each alert.
  • CI workflow - GitHub Actions verify (test, lint, alert tests).

Security model

  • Non-privileged container (no capabilities, read-only rootfs)
  • Read-only hostPath mounts (/var/lib/kubelet, /proc, /sys)
  • No Kubernetes API access
  • UBI 9 Minimal base image, static binary

Related PRs

Test plan

  • Unit tests pass (make test)
  • Alert unit tests pass (make test-alerts) — covers all 5 alerts
  • Cluster validation on OpenShift 4.21 (Ceph RBD + Cinder)
  • PromQL three-way join validated with synthetic sysfs fixtures
  • Lint (make lint)

@sradco sradco force-pushed the initial-exporter branch 2 times, most recently from 9e21efa to 3893e1d on May 4, 2026 07:54

A Prometheus exporter DaemonSet that maps CSI volumes to their
underlying node block devices, enabling correlation of storage
path health metrics with Kubernetes workloads.

Components:
- Discovery engine (kubelet vol_data.json + mountinfo, Trident, HPE)
- Prometheus metrics (csi_volume_node_device_info + self-monitoring)
- Alerts (CSIVolumeMultipathDegraded, CSIVolumeDeviceExporterDown)
- Alert unit tests (promtool)
- Deployment manifests (DaemonSet, PodMonitor, SCC)
- CI workflow (verify: test, lint, alert tests)

Signed-off-by: Shirly Radco <sradco@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@sradco sradco force-pushed the initial-exporter branch 4 times, most recently from fdaac67 to 6e760c0 on May 4, 2026 08:58

@awels awels left a comment

So what about discovering other vendors like Dell, Portworx, etc.? I am assuming they all have their own JSON format. That will be annoying to maintain.

Comment thread deploy/scc.yaml
@@ -0,0 +1,28 @@
apiVersion: security.openshift.io/v1

Do we really need a special SCC? Can't we use any of the privileged ones that already exist?

sradco (Author):

Commented about it in the code now

sradco (Author):

I considered the built-in SCCs.
The closest is node-exporter, but it's owned by the cluster-monitoring operator - binding our ServiceAccount to it creates a hidden dependency on another operator's internals that could break silently if they change it. A custom least-privilege SCC (as done by virt-handler and other CNV DaemonSets) is the correct pattern here. The justification is now documented in a comment in deploy/scc.yaml

sradco (Author):

@awels do you agree?

Comment thread docs/runbooks/CSIVolumeMultipathLost.md Outdated
1. Check the multipath device state:

```bash
kubectl debug node/$NODE -it --image=registry.access.redhat.com/ubi9/ubi-minimal \
```

This is kind of a mix of U/S and D/S commands and images. I don't think registry.access.redhat.com is publicly visible, maybe point to quay.io instead?

sradco (Author):

Fixed

1. Check the overall multipath state on the affected node:

```bash
kubectl debug node/$NODE -it --image=registry.access.redhat.com/ubi9/ubi-minimal \
```

This is kind of a mix of U/S and D/S commands and images. I don't think registry.access.redhat.com is publicly visible, maybe point to quay.io instead?

sradco (Author):

Fixed

1. Check the NVMe subsystem controller states:

```bash
kubectl debug node/$NODE -it --image=registry.access.redhat.com/ubi9/ubi-minimal \
```

This is kind of a mix of U/S and D/S commands and images. I don't think registry.access.redhat.com is publicly visible, maybe point to quay.io instead?

sradco (Author):

Fixed

1. Check NVMe subsystem state:

```bash
kubectl debug node/$NODE -it --image=registry.access.redhat.com/ubi9/ubi-minimal \
```

This is kind of a mix of U/S and D/S commands and images. I don't think registry.access.redhat.com is publicly visible, maybe point to quay.io instead?

sradco (Author):

Fixed

@akalenyu akalenyu left a comment

Really preliminary pass, interesting stuff

> When a storage path degrades (e.g., a Fibre Channel link drops or an NVMe-oF controller dies), this exporter, combined with new node_exporter collectors, enables alerts that identify which PVs and VMs are affected.

Have you considered leveraging kubevirt's PausedIOError condition to possibly achieve similar alerts? https://kubevirt.io/user-guide/storage/disks_and_volumes/#error-policy

sradco commented May 5, 2026

> So what about discovering other vendors like Dell, Portworx, etc.? I am assuming they all have their own JSON format. That will be annoying to maintain.

The exporter has a universal discovery path that reads kubelet's own CSI metadata (vol_data.json + mountinfo) — this works for every CSI driver without any driver-specific code. The Trident/HPE modules are optional enrichment for cases where those drivers expose extra metadata in their own JSON. We don't need to add per-vendor code for basic functionality.
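The universal discovery path described above can be sketched roughly as follows. This is a minimal illustration in Python, not the exporter's actual code: the function name `discover_csi_volumes`, the returned field names, and the directory layout (`pods/<uid>/volumes/kubernetes.io~csi/<pv>/vol_data.json` with a sibling `mount` directory) are assumptions for this example, as are the `driverName` and `volumeHandle` keys. It covers filesystem-mode volumes only and takes the mount table as an already-parsed mapping of mount point to source device.

```python
import json
from pathlib import Path
from typing import Dict, List


def discover_csi_volumes(kubelet_root: str, mounts: Dict[str, str]) -> List[dict]:
    """Pair each kubelet vol_data.json with the device mounted for that volume.

    `mounts` maps mount point -> source device (e.g. parsed from mountinfo).
    The directory layout and JSON keys below are assumptions for this sketch.
    """
    results = []
    pattern = "pods/*/volumes/kubernetes.io~csi/*/vol_data.json"
    for meta_path in sorted(Path(kubelet_root).glob(pattern)):
        try:
            meta = json.loads(meta_path.read_text())
        except (OSError, ValueError):
            continue  # unreadable or malformed metadata: skip this volume
        # Assumed layout: the volume is mounted at <volume dir>/mount.
        device = mounts.get(str(meta_path.parent / "mount"))
        if device is None:
            continue  # metadata exists but nothing is mounted there
        results.append({
            "pv": meta_path.parent.name,
            "driver": meta.get("driverName", ""),
            "volume_handle": meta.get("volumeHandle", ""),
            "device": device,
        })
    return results
```

Nothing in the sketch is specific to a particular CSI driver, which is the property the comment above relies on.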

sradco commented May 5, 2026

> Really preliminary pass, interesting stuff
>
> > When a storage path degrades (e.g., a Fibre Channel link drops or an NVMe-oF controller dies), this exporter, combined with new node_exporter collectors, enables alerts that identify which PVs and VMs are affected.
>
> Have you considered leveraging kubevirt's PausedIOError condition to possibly achieve similar alerts? https://kubevirt.io/user-guide/storage/disks_and_volumes/#error-policy

+1. PausedIOError fires after I/O has already failed and the VM is paused: it's a "damage done" reactive signal. Our multipath/NVMe alerts fire on path degradation, when some paths are unhealthy but I/O may still be working via surviving paths - that's the early warning window to act before workloads are impacted.

Additionally, our alerts cover any PV-backed workload (not just VMs) and identify the root cause (which FC link, NVMe controller, or fabric segment failed), which PausedIOError doesn't provide.
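
The distinction between the early-warning and "damage done" cases can be illustrated with a small sketch. It is written in Python for brevity and is not the alert rules themselves; the function name `classify_paths` and the returned labels are hypothetical. The healthy state is `active` for multipath paths and would be `live` for NVMe-oF controllers, following the alert descriptions in this PR.

```python
from typing import Iterable


def classify_paths(states: Iterable[str], healthy: str = "active") -> str:
    """Classify a device by the states of its paths.

    - "healthy":  every path is in the healthy state
    - "degraded": some paths are unhealthy, but at least one survives
    - "lost":     no path is healthy, so I/O is likely failing
    - "unknown":  no path information was reported
    """
    states = list(states)
    if not states:
        return "unknown"
    healthy_count = sum(1 for state in states if state == healthy)
    if healthy_count == len(states):
        return "healthy"
    if healthy_count == 0:
        return "lost"
    return "degraded"
```

In these terms, the warning alerts correspond to the "degraded" case and the critical alerts to "lost"; a PausedIOError-style signal would only appear once I/O has already failed.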

sradco commented May 18, 2026

@awels, @akalenyu, do you approve this PR?

akalenyu commented

> @awels, @akalenyu, do you approve this PR?

Not just yet, still need some time. This is not a trivial path to take, so, while the impl. may be sound, I am missing some back and forth on the approach itself (there are some security concerns as well, with hostPID for instance).

Meanwhile, I noticed this PR is linked to a csi-addons change, which sounds interesting. Could you elaborate?

For example, what if we take the csi-addons path, and have each driver implement the RPC and tell us this information?
Here's what it looks like in practice: https://github.com/ceph/ceph-csi/blob/a5474f81497297aa9cd341e12529923b1349084d/internal/csi-addons/rbd/reclaimspace.go#L120-L123

sradco commented May 20, 2026

> > @awels, @akalenyu, do you approve this PR?
>
> Not just yet, still need some time. This is not a trivial path to take, so, while the impl. may be sound, I am missing some back and forth on the approach itself (there are some security concerns as well, with hostPID for instance).

Thank you! This is a very good point.
I will remove hostPID: true from daemonset.yaml, change the host-proc volume from hostPath: /proc to hostPath: /proc/1/mountinfo, update the volume mount path accordingly, and update ParseMountInfo to accept the direct file path instead of a proc directory.
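
To illustrate what "accept the direct file path" means, here is a rough Python sketch of a mountinfo parser that takes the file itself rather than a proc directory. The names `parse_mountinfo` and `read_mountinfo` are hypothetical and this is not the exporter's ParseMountInfo. The field layout follows the mountinfo format documented in proc(5): the mount point is the fifth field, and the filesystem type and mount source follow the `-` separator.

```python
from typing import Dict


def parse_mountinfo(text: str) -> Dict[str, str]:
    """Map mount point -> mount source from mountinfo-formatted text."""
    mounts: Dict[str, str] = {}
    for line in text.splitlines():
        fields = line.split()
        try:
            # Six mandatory fields come first, then zero or more optional
            # fields, then a lone "-" separator.
            sep = fields.index("-", 6)
        except ValueError:
            continue  # malformed or truncated line
        if len(fields) < sep + 3:
            continue  # missing fs type or mount source
        mount_point = fields[4]
        source = fields[sep + 2]
        mounts[mount_point] = source
    return mounts


def read_mountinfo(path: str = "/proc/1/mountinfo") -> Dict[str, str]:
    """Read a single mountinfo file given its full path."""
    with open(path, "r", encoding="utf-8") as handle:
        return parse_mountinfo(handle.read())
```

Because the function only needs one file, the pod can mount that file read-only instead of all of /proc.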

> Meanwhile, I noticed this PR is linked to a csi-addons change, which sounds interesting. Could you elaborate?

@akalenyu, on the csi-addons path: I explored this (hence the linked closed PR).
The conclusion was that it's an architectural mismatch. kubernetes-csi-addons is a CSI API extension project, where every feature adds a gRPC operation that CSI drivers implement via a sidecar.
The exporter has no CRDs, no gRPC, and no sidecar; it's a standalone DaemonSet reading host files.
Adding it there would establish a completely new paradigm in that project, which is not appropriate without a community design discussion first.

More fundamentally: even if each CSI driver exposed block device names via an RPC, that only solves half the problem. The actual path health state comes from node_exporter's multipath/NVMe metrics and the exporter's job is to provide the join key (CSI volume → block device) to correlate those metrics with Kubernetes workloads.

Users today are blind: we have new multipath/NVMe path health metrics in node_exporter but no way to tie them to impacted PVs or VMs. This exporter is the bridge that works for all drivers immediately, with no driver changes required.
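
The "join key" idea can be shown with a small in-memory sketch. It is Python rather than PromQL, and the function name `affected_volumes` and the field names (`pv`, `node`, `device`, `reason`) are hypothetical, not the labels used by the exporter or by node_exporter. One input stands in for the volume-to-device mapping, the other for devices that path health metrics report as unhealthy; the join is on (node, device).

```python
from typing import Dict, List, Tuple


def affected_volumes(volume_devices: List[dict], unhealthy_devices: List[dict]) -> List[dict]:
    """Join unhealthy block devices to the CSI volumes that sit on them."""
    # Index the volume -> device mapping by the join key (node, device).
    index: Dict[Tuple[str, str], List[dict]] = {}
    for row in volume_devices:
        index.setdefault((row["node"], row["device"]), []).append(row)

    hits = []
    for bad in unhealthy_devices:
        for vol in index.get((bad["node"], bad["device"]), []):
            hits.append({
                "pv": vol["pv"],
                "node": vol["node"],
                "device": vol["device"],
                "reason": bad["reason"],
            })
    return hits
```

Without the first input, the second one only says that some device on some node is unhealthy, which is the gap described above.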

Add three new alerts:
- CSIVolumeMultipathLost (critical): all paths to a multipath device
  are down, I/O is likely failing
- CSIVolumeNVMeSubsystemDegraded (warning): NVMe-oF subsystem has
  at least one non-live controller path
- CSIVolumeNVMeSubsystemLost (critical): all NVMe-oF controller
  paths are dead

The NVMe alerts use node_nvmesubsystem_namespace_info to precisely
map NVMe namespace devices (nvme0n1) to their subsystems, enabling
correct correlation even on nodes with multiple NVMe subsystems.

Includes runbooks and promtool unit tests for all new alerts.

Signed-off-by: Shirly Radco <sradco@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@sradco sradco force-pushed the initial-exporter branch from 50af95b to fdb99da on May 20, 2026 10:02

sradco commented May 20, 2026

@akalenyu @awels I updated the code based on the review. Please let me know what you think.
This is a critical-priority item.

sradco commented May 20, 2026

Hi @jan--f, @jsafrane, I would appreciate your review of this PR.
