feat(metrics): emit Prometheus metrics for eBPF load failures#79
feat(metrics): emit Prometheus metrics for eBPF load failures#79Hanu-man12 wants to merge 3 commits into
Conversation
Add two new Prometheus metrics for per-program eBPF load tracking:
- kerno_bpf_program_loaded{program} gauge: 1 if loaded, 0 if failed
- kerno_bpf_program_load_errors_total{program,reason} counter
Wire metrics into buildCollectors in doctor.go so each program
reports its load result individually.
Also fix gen_stub.go build tag to use !ebpf instead of architecture
exclusions so stub types compile correctly on amd64/arm64.
Fixes optiqor#25
Signed-off-by: Hanu-man12 <shawharshit116@gmail.com>
- Import internal/metrics package in doctor.go
- Set BPFProgramLoaded{program}=0 and increment
BPFProgramLoadErrorsTotal{program,reason} on load failure
- Set BPFProgramLoaded{program}=1 on successful load
- Also emit metrics on collector registration failure
Closes optiqor#25
Signed-off-by: Hanu-man12 <shawharshit116@gmail.com>
|
🚀 First PR — welcome aboard! A few things to expect:
If you get stuck, reply here or jump to Discussions. We want this PR to land. |
btwshivam
left a comment
There was a problem hiding this comment.
the diff has 11 places where em-dashes and a ✓ checkmark became literal ??? (encoding corruption), including the user-visible output at line 533 (kerno: ??? all kernel signals nominal). also drop the gen_stub.go build-tag change, it's unrelated to #25.
| // Single-line "all clear" — CI-friendly. | ||
| fmt.Fprintf(os.Stdout, "kerno: ✓ all kernel signals nominal (%s window, %d events)\n", | ||
| // Single-line "all clear" ??? CI-friendly. | ||
| fmt.Fprintf(os.Stdout, "kerno: ??? all kernel signals nominal (%s window, %d events)\n", |
There was a problem hiding this comment.
this was kerno: ✓ all kernel signals nominal on main and now shows literal ??? to users in --quiet mode. the file got re-saved through a non-UTF-8 codepage, replacing every em-dash and the checkmark with ???. 11 sites total: lines 135, 155, 295, 372, 374, 448, 455, 496, 521, 532, 533. line 496 (Interrupted ???) and the two classifyLoadError reasons at lines 372/374 are also user-visible. re-save as UTF-8 and either restore the originals or replace the em-dashes with commas/periods to match project style.
| // definitions and this file is excluded. | ||
|
|
||
| //go:build !386 && !amd64 && !arm && !arm64 && !loong64 && !mips64le && !mipsle && !ppc64le && !riscv64 && !wasm | ||
| //go:build !ebpf |
There was a problem hiding this comment.
this build-tag change is unrelated to #25 (per-program load metrics). drop it from this PR. send it as its own one-line PR if you want it in, or just leave the long negation list, it already works.
btwshivam
left a comment
There was a problem hiding this comment.
the metric feature is fine, but every — in doctor.go got replaced with ??? (11 sites, 4 of them user-facing strings, and a ✓ got eaten too). and the gen_stub.go hunk is a rebase artifact since main already has !ebpf there.
| // Single-line "all clear" — CI-friendly. | ||
| fmt.Fprintf(os.Stdout, "kerno: ✓ all kernel signals nominal (%s window, %d events)\n", | ||
| // Single-line "all clear" ??? CI-friendly. | ||
| fmt.Fprintf(os.Stdout, "kerno: ??? all kernel signals nominal (%s window, %d events)\n", |
There was a problem hiding this comment.
the ✓ checkmark here became ???. same pattern across the whole file: every — got replaced with ???, 11 sites total. the user-facing ones are this line, the Interrupted ??? at line 496, and the kernel-error hints at lines 372 and 374. set your editor to UTF-8 and re-do the diff, keeping only the metric wiring at lines 320-345.
| // definitions and this file is excluded. | ||
|
|
||
| //go:build !386 && !amd64 && !arm && !arm64 && !loong64 && !mips64le && !mipsle && !ppc64le && !riscv64 && !wasm | ||
| //go:build !ebpf |
There was a problem hiding this comment.
this hunk is a rebase artifact. main already has //go:build !ebpf here (commit fb79aa5). git fetch origin && git rebase origin/main and the file drops out of the diff entirely.
| Help: "Number of eBPF programs currently loaded.", | ||
| }) | ||
|
|
||
| // BPFProgramLoaded tracks per-program load result: 1 = loaded, 0 = failed. |
There was a problem hiding this comment.
no test added for the new metrics. a small unit test that calls BPFProgramLoaded.WithLabelValues("x").Set(1), gathers the registry, and asserts the metric is present catches duplicate-registration, wrong-cardinality, and metric-name-typo bugs. CI's Build/Test/Lint also haven't run on this PR yet (rollup only shows PR title lint and labels), so the testing checklist in the PR body is unverified.
Re-save doctor.go as UTF-8; replace 11 ??? artifacts (10 em-dashes and 1 checkmark in --quiet mode output) with the original characters. Corruption was introduced when the file was re-saved through a non-UTF-8 codepage during CRLF normalisation. Also add unit tests for BPFProgramLoaded GaugeVec and BPFProgramLoadErrorsTotal CounterVec to verify correct registration, label cardinality, and value semantics. Signed-off-by: Hanu-man12 <shawharshit116@gmail.com>
| logger.Debug("failed to load eBPF program; collector disabled", | ||
| "program", r.name, "error", err) | ||
| // Emit per-program load failure metrics. | ||
| metrics.BPFProgramLoaded.WithLabelValues(r.name).Set(0) |
There was a problem hiding this comment.
these only get set in buildCollectors, which is the kerno doctor path. doctor is a one-shot that renders a report and exits without ever serving /metrics, so nothing scrapes them. the daemon (kerno start to runStart) uses buildLoaders (start.go:97) and only sets the aggregate BPFProgramsLoaded (start.go:124), never these per-program ones. a vec with no WithLabelValues call exports nothing, so on the daemon's /metrics kerno_bpf_program_loaded and kerno_bpf_program_load_errors_total never appear.
#25 is about the daemon exposing load failures to prometheus, so the metric has to be set in the start load path. wire it into buildLoaders / runStart, or have start go through buildCollectors, not just doctor.
| "program", r.name, "error", err) | ||
| // Emit per-program load failure metrics. | ||
| metrics.BPFProgramLoaded.WithLabelValues(r.name).Set(0) | ||
| metrics.BPFProgramLoadErrorsTotal.WithLabelValues(r.name, classifyLoadError(err)).Inc() |
There was a problem hiding this comment.
classifyLoadError returns a remediation sentence (re-run with sudo, or grant CAP_BPF + CAP_PERFMON to the binary) and "" for anything unmatched. that's a poor label value: long prose, and unknown failures collapse into reason="". use short codes (permission_denied, memlock, btf_missing, verifier_rejected, missing_tracepoint, unknown), which also lines up with the short registration_conflict you already use at line 335. keep the prose for the doctor Hint.
| // definitions and this file is excluded. | ||
|
|
||
| //go:build !386 && !amd64 && !arm && !arm64 && !loong64 && !mips64le && !mipsle && !ppc64le && !riscv64 && !wasm | ||
| //go:build !ebpf |
What
Adds two per-program Prometheus metrics to track eBPF program load results:
kerno_bpf_program_loaded{program}(GaugeVec) andkerno_bpf_program_load_errors_total{program,reason}(CounterVec).Why
Fixes #25
How
internal/metrics/metrics.go: definedBPFProgramLoadedGaugeVec andBPFProgramLoadErrorsTotalCounterVec; both registered in the custom Prometheus registry.internal/cli/doctor.go: imported the metrics package and wired calls insidebuildCollectors—Set(0)+Inc()on load/registration failure,Set(1)on successful load. Reason label reuses the existingclassifyLoadErrorhelper.Testing
go build ./...passesgo test ./...passesgo vet ./...passesgolangci-lint run ./...passesTested locally with:
go build+go test ./...via WSL2 (eBPF requires Linux kernel)N/A — pure docs/refactor
sudo ./bin/bpf-verify --read 5sconfirms 6/6 programs still load./scripts/verify.shpasses (or specific phase:./scripts/verify.sh quality)Checklist
feat(scope): subject)git commit -s)scripts/verify.sh