Skip to content

feat(metrics): emit Prometheus metrics for eBPF load failures#79

Open
Hanu-man12 wants to merge 3 commits into
optiqor:mainfrom
Hanu-man12:feat/bpf-program-metrics
Open

feat(metrics): emit Prometheus metrics for eBPF load failures#79
Hanu-man12 wants to merge 3 commits into
optiqor:mainfrom
Hanu-man12:feat/bpf-program-metrics

Conversation

@Hanu-man12
Copy link
Copy Markdown

What

Adds two per-program Prometheus metrics to track eBPF program load results:
kerno_bpf_program_loaded{program} (GaugeVec) and kerno_bpf_program_load_errors_total{program,reason} (CounterVec).

Why

Fixes #25

How

  • internal/metrics/metrics.go: defined BPFProgramLoaded GaugeVec and BPFProgramLoadErrorsTotal CounterVec; both registered in the custom Prometheus registry.
  • internal/cli/doctor.go: imported the metrics package and wired calls inside buildCollectorsSet(0) + Inc() on load/registration failure, Set(1) on successful load. Reason label reuses the existing classifyLoadError helper.

Testing

  • go build ./... passes

  • go test ./... passes

  • go vet ./... passes

  • golangci-lint run ./... passes

  • Tested locally with: go build + go test ./... via WSL2 (eBPF requires Linux kernel)

  • N/A — pure docs/refactor

  • sudo ./bin/bpf-verify --read 5s confirms 6/6 programs still load

  • ./scripts/verify.sh passes (or specific phase: ./scripts/verify.sh quality)

Checklist

  • PR title follows Conventional Commits (feat(scope): subject)
  • All commits are DCO-signed (git commit -s)
  • No unrelated changes pulled in
  • Documentation updated where user-visible behavior changed
  • Added/updated tests for new code paths
  • If a new doctor rule, paired with a chaos scenario in scripts/verify.sh

Add two new Prometheus metrics for per-program eBPF load tracking:
- kerno_bpf_program_loaded{program} gauge: 1 if loaded, 0 if failed
- kerno_bpf_program_load_errors_total{program,reason} counter

Wire metrics into buildCollectors in doctor.go so each program
reports its load result individually.

Also fix gen_stub.go build tag to use !ebpf instead of architecture
exclusions so stub types compile correctly on amd64/arm64.

Fixes optiqor#25

Signed-off-by: Hanu-man12 <shawharshit116@gmail.com>
- Import internal/metrics package in doctor.go
- Set BPFProgramLoaded{program}=0 and increment
  BPFProgramLoadErrorsTotal{program,reason} on load failure
- Set BPFProgramLoaded{program}=1 on successful load
- Also emit metrics on collector registration failure

Closes optiqor#25

Signed-off-by: Hanu-man12 <shawharshit116@gmail.com>
@Hanu-man12 Hanu-man12 requested a review from btwshivam as a code owner May 15, 2026 18:46
@github-actions
Copy link
Copy Markdown

🚀 First PR — welcome aboard!

A few things to expect:

  1. CI: every PR runs build + race tests + lint + (eventually) the kernel matrix. If something fails, the log will tell you exactly which gate.
  2. DCO: every commit needs Signed-off-by:git commit -s adds it automatically.
  3. Conventional Commits: PR titles like feat(doctor): add new rule or fix(bpf): handle X. We squash-merge by default.
  4. Review: a maintainer will review within 72 hours. Suggestions are conversations, not orders — push back if something doesn't fit your context.

If you get stuck, reply here or jump to Discussions. We want this PR to land.

@github-actions github-actions Bot added the area/bpf eBPF programs and loaders label May 15, 2026
@Hanu-man12 Hanu-man12 changed the title Feat/bpf program metrics feat(metrics): emit Prometheus metrics for eBPF load failures May 15, 2026
Copy link
Copy Markdown
Member

@btwshivam btwshivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the diff has 11 places where em-dashes and a checkmark became literal ??? (encoding corruption), including the user-visible output at line 533 (kerno: ??? all kernel signals nominal). also drop the gen_stub.go build-tag change, it's unrelated to #25.

Comment thread internal/cli/doctor.go Outdated
// Single-line "all clear" CI-friendly.
fmt.Fprintf(os.Stdout, "kerno: all kernel signals nominal (%s window, %d events)\n",
// Single-line "all clear" ??? CI-friendly.
fmt.Fprintf(os.Stdout, "kerno: ??? all kernel signals nominal (%s window, %d events)\n",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was kerno: ✓ all kernel signals nominal on main and now shows literal ??? to users in --quiet mode. the file got re-saved through a non-UTF-8 codepage, replacing every em-dash and the checkmark with ???. 11 sites total: lines 135, 155, 295, 372, 374, 448, 455, 496, 521, 532, 533. line 496 (Interrupted ???) and the two classifyLoadError reasons at lines 372/374 are also user-visible. re-save as UTF-8 and either restore the originals or replace the em-dashes with commas/periods to match project style.

Comment thread internal/bpf/gen_stub.go
// definitions and this file is excluded.

//go:build !386 && !amd64 && !arm && !arm64 && !loong64 && !mips64le && !mipsle && !ppc64le && !riscv64 && !wasm
//go:build !ebpf
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this build-tag change is unrelated to #25 (per-program load metrics). drop it from this PR. send it as its own one-line PR if you want it in, or just leave the long negation list, it already works.

Copy link
Copy Markdown
Member

@btwshivam btwshivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the metric feature is fine, but every in doctor.go got replaced with ??? (11 sites, 4 of them user-facing strings, and a got eaten too). and the gen_stub.go hunk is a rebase artifact since main already has !ebpf there.

Comment thread internal/cli/doctor.go Outdated
// Single-line "all clear" CI-friendly.
fmt.Fprintf(os.Stdout, "kerno: all kernel signals nominal (%s window, %d events)\n",
// Single-line "all clear" ??? CI-friendly.
fmt.Fprintf(os.Stdout, "kerno: ??? all kernel signals nominal (%s window, %d events)\n",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the checkmark here became ???. same pattern across the whole file: every got replaced with ???, 11 sites total. the user-facing ones are this line, the Interrupted ??? at line 496, and the kernel-error hints at lines 372 and 374. set your editor to UTF-8 and re-do the diff, keeping only the metric wiring at lines 320-345.

Comment thread internal/bpf/gen_stub.go
// definitions and this file is excluded.

//go:build !386 && !amd64 && !arm && !arm64 && !loong64 && !mips64le && !mipsle && !ppc64le && !riscv64 && !wasm
//go:build !ebpf
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this hunk is a rebase artifact. main already has //go:build !ebpf here (commit fb79aa5). git fetch origin && git rebase origin/main and the file drops out of the diff entirely.

Help: "Number of eBPF programs currently loaded.",
})

// BPFProgramLoaded tracks per-program load result: 1 = loaded, 0 = failed.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no test added for the new metrics. a small unit test that calls BPFProgramLoaded.WithLabelValues("x").Set(1), gathers the registry, and asserts the metric is present catches duplicate-registration, wrong-cardinality, and metric-name-typo bugs. CI's Build/Test/Lint also haven't run on this PR yet (rollup only shows PR title lint and labels), so the testing checklist in the PR body is unverified.

Re-save doctor.go as UTF-8; replace 11 ??? artifacts (10 em-dashes
and 1 checkmark in --quiet mode output) with the original characters.
Corruption was introduced when the file was re-saved through a
non-UTF-8 codepage during CRLF normalisation.

Also add unit tests for BPFProgramLoaded GaugeVec and
BPFProgramLoadErrorsTotal CounterVec to verify correct registration,
label cardinality, and value semantics.

Signed-off-by: Hanu-man12 <shawharshit116@gmail.com>
@github-actions github-actions Bot added level:intermediate 50-200 lines or 3-5 files (auto-applied) testing Tests and test coverage and removed level:intermediate 50-200 lines or 3-5 files (auto-applied) labels May 17, 2026
Copy link
Copy Markdown
Member

@btwshivam btwshivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the metric is wired into the doctor path, but #25 wants it on the daemon's /metrics where it's never set. move it into the start load path. plus the reason label and a stale gen_stub hunk.

Comment thread internal/cli/doctor.go
logger.Debug("failed to load eBPF program; collector disabled",
"program", r.name, "error", err)
// Emit per-program load failure metrics.
metrics.BPFProgramLoaded.WithLabelValues(r.name).Set(0)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these only get set in buildCollectors, which is the kerno doctor path. doctor is a one-shot that renders a report and exits without ever serving /metrics, so nothing scrapes them. the daemon (kerno start to runStart) uses buildLoaders (start.go:97) and only sets the aggregate BPFProgramsLoaded (start.go:124), never these per-program ones. a vec with no WithLabelValues call exports nothing, so on the daemon's /metrics kerno_bpf_program_loaded and kerno_bpf_program_load_errors_total never appear.

#25 is about the daemon exposing load failures to prometheus, so the metric has to be set in the start load path. wire it into buildLoaders / runStart, or have start go through buildCollectors, not just doctor.

Comment thread internal/cli/doctor.go
"program", r.name, "error", err)
// Emit per-program load failure metrics.
metrics.BPFProgramLoaded.WithLabelValues(r.name).Set(0)
metrics.BPFProgramLoadErrorsTotal.WithLabelValues(r.name, classifyLoadError(err)).Inc()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

classifyLoadError returns a remediation sentence (re-run with sudo, or grant CAP_BPF + CAP_PERFMON to the binary) and "" for anything unmatched. that's a poor label value: long prose, and unknown failures collapse into reason="". use short codes (permission_denied, memlock, btf_missing, verifier_rejected, missing_tracepoint, unknown), which also lines up with the short registration_conflict you already use at line 335. keep the prose for the doctor Hint.

Comment thread internal/bpf/gen_stub.go
// definitions and this file is excluded.

//go:build !386 && !amd64 && !arm && !arm64 && !loong64 && !mips64le && !mipsle && !ppc64le && !riscv64 && !wasm
//go:build !ebpf
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already !ebpf on main (the #60 fix), and it's unrelated to #25. the branch is behind main and re-applies a hunk that's already there. rebase on main and drop this file from the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/bpf eBPF programs and loaders testing Tests and test coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Observability: emit Prometheus metric for eBPF load failures (not just logs)

2 participants