feat(cli): add /healthz and /readyz JSON endpoints to daemon mode#82
Conversation
Signed-off-by: Abhijithmns <abhijithmns07@gmail.com>
|
🚀 First PR — welcome aboard! A few things to expect:
If you get stuck, reply here or jump to Discussions. We want this PR to land. |
btwshivam
left a comment
There was a problem hiding this comment.
the readiness probe returns 503 when zero events have been observed, which takes /metrics offline on an idle host (Prometheus operator stops scraping). plus program_total is a typo for programs_total, and the old programsLoaded/programsTotal got renamed to snake_case without note.
| "version": version.Version, | ||
| "uptime": time.Since(startTime).Seconds(), | ||
| "programs_loaded": loaded, | ||
| "program_total": total, |
There was a problem hiding this comment.
programs_loaded (plural) and program_total (singular) in the same response. typo, should be programs_total. and these field names changed from the previous programsLoaded and programsTotal (camelCase) without a note in the PR, which silently breaks any scrape config or dashboard that consumed the old names. either preserve the old keys alongside the new ones, or call the rename out explicitly so downstream knows.
There was a problem hiding this comment.
ill fix program_total -> programs_total and will note the rename in the PR description. let me know if you'd prefer keeping the old camelCase keys too.
| } | ||
|
|
||
| // Check if collectors are receiving events | ||
| if totalEvents == 0 { |
There was a problem hiding this comment.
this is too strict for a metrics-exporter daemon. on an idle host (low syscall volume, no TCP traffic), no events flow, this returns 503 forever, the Prometheus ServiceMonitor stops scraping kerno, and the operator loses visibility into the very state they need to see. readiness should be 'BPF programs loaded and the HTTP server is accepting connections', not 'kernel activity has been observed'. drop this branch.
There was a problem hiding this comment.
makes sense, ill remove the zero-events check entirely so readyz only depends on bpf programs being loaded. does that sound right?
| w.Header().Set("Content-Type", "application/json") | ||
|
|
||
| // Check if all expected BPF programs are loaded | ||
| if loaded < total { |
There was a problem hiding this comment.
contradicts the daemon's graceful-degradation design (CLAUDE.md invariant 3, 'Failed BPF program logs and skips, doesn't kill the daemon'). runStart deliberately tolerates partial BPF load (e.g. 1 of 6 collectors fails on an older kernel, the rest keep working). returning 503 here means k8s drops the pod from service endpoints whenever any single collector misses, but the doctor pipeline still works on the rest. match the daemon's tolerance, return ready if loaded > 0, or surface a configurable threshold.
There was a problem hiding this comment.
got it, then i'll change to return ready if loaded > 0 to match the daemon's graceful degradation. going with the simpler option for now rather than a configurable threshold. let me know if you'd prefer the threshold approach. also wondering if you'd want readiness to flip to not_ready on SIGTERM for graceful shutdown signaling to k8s ,worth adding or out of scope for this PR?
| b.mu.Lock() | ||
| defer b.mu.Unlock() | ||
|
|
||
| // Count events from the seen map (which tracks cardinality) |
There was a problem hiding this comment.
the comment says 'Count events from the seen map (which tracks cardinality)' then 'We track actual event counts separately for status reporting'. but the implementation just reads seen, which is the cardinality counter (b.seen[metric]++ at line 87 increments per label string, capped at LabelCardinalityLimit). the JSON field events_collected in start.go:280 will report cardinality, not actual event counts, and it caps out on a busy host. either add a real per-collector event counter to Bridge.Start and read that here, or rename the function and the JSON field to LabelCardinality so callers know what they're looking at.
There was a problem hiding this comment.
thats my bad, the comment was misleading seen tracks cardinality not event counts. i'll go with renaming the function and JSON field to 'LabelCardinality' to be accurate rather than pretending it's event counts. unless you'd prefer add a real per-collector counter instead?
| t.Fatalf("response not JSON: %v (body=%q)", err, rec.Body.String()) | ||
| } | ||
| if body["status"] != "ok" { | ||
| t.Errorf("status field = %v, want ok", body["status"]) |
There was a problem hiding this comment.
the assertions on programs_loaded and programs_total got removed from this test, so the field rename and the program_total typo on start.go:243 pass through without a failing test. add back the field checks:
if got := body["programs_loaded"]; got != float64(6) {
t.Errorf("programs_loaded = %v, want 6", got)
}
if got := body["programs_total"]; got != float64(6) {
t.Errorf("programs_total = %v, want 6", got)
}also missing: a TestReadyzHandlerOK for the happy 200 path (loaded == total and events > 0).
There was a problem hiding this comment.
sure ill add those back. also will add TestReadyzHandlerOK ,since we're dropping the zero-events check as per your other comment, for the happy path we can just test loaded > 0. let me know if that's what you had in mind.
|
@btwshivam i've replied to each of the review comments with my proposed approach. let me know if you're fine with the direction and i'll go ahead and push the fixes |
btwshivam
left a comment
There was a problem hiding this comment.
readyz fails closed in two ways that brick a healthy daemon: it 503s on partial bpf load (against the graceful-degradation invariant) and 503s on an idle node with no events. loosen both, details inline.
| w.Header().Set("Content-Type", "application/json") | ||
|
|
||
| // Check if all expected BPF programs are loaded | ||
| if loaded < total { |
There was a problem hiding this comment.
this 503s whenever loaded < total, but graceful degradation is a core invariant: a bpf program that fails to load logs and skips, the daemon keeps running. healthz treats the same partial-load case as 200 and calls it graceful degradation (the comment at line 233), readyz calls it fatal. on any kernel where one of the six programs doesn't load (older or locked-down kernels, a missing tracepoint), the pod stays NotReady forever and gets pulled from service endpoints. readiness shouldn't punish partial load, gate on loaded > 0 or just on the server being up.
| } | ||
|
|
||
| // Check if collectors are receiving events | ||
| if totalEvents == 0 { |
There was a problem hiding this comment.
this 503s when no events have been collected yet. on a quiet or idle node the collectors legitimately see zero events in the window, so the daemon never becomes Ready and kubelet keeps it out of endpoints. for an observer, no-events-yet is not not-ready. #8 just asks for /healthz and /readyz, it doesn't ask readiness to depend on traffic. drop this check, readiness should reflect that the daemon is up and bpf is attached, not that something happened to fire.
| defer b.mu.Unlock() | ||
|
|
||
| // Count events from the seen map (which tracks cardinality) | ||
| // We track actual event counts separately for status reporting |
There was a problem hiding this comment.
this comment says event counts are tracked separately, but there's no separate counter, it reads seen, which is the cardinality-limit map keyed by metric category (syscall, tcp, ...), not by collector. so the returned keys aren't collector names, and you're overloading the same field the cardinality guard uses. if readiness keeps an events signal, add a real per-collector event counter instead of reusing seen, and fix the comment.
| "version": version.Version, | ||
| "uptime": time.Since(startTime).Seconds(), | ||
| "programs_loaded": loaded, | ||
| "program_total": total, |
There was a problem hiding this comment.
program_total is singular next to programs_loaded (plural), and readyz uses a third scheme (bpf_programs_loaded / bpf_programs_total). this also renames the existing healthz fields from programsLoaded/programsTotal, which breaks anyone parsing the current output. pick one naming and keep it consistent across both handlers.
|
direction's good, go ahead. a few of your open questions: don't keep the old camelCase keys, it's a fresh endpoint so there's no schema to preserve. |
|
@btwshivam i guess this commit should fix the issues. |
btwshivam
left a comment
There was a problem hiding this comment.
thanks for working through this, readyz now respects partial load and the events check / CollectorStatus are gone, tests look right.
|
/lgtm |
What
/healthz(liveness) and/readyz(readiness) HTTP endpoints to the Prometheus server in daemon mode, required for Kubernetes probe configuration.Why
Fixes #8
How
readyzHandlerchecks if BPF programs loaded and events are flowing, returns 503 if not readyhealthzHandlerupdated to returnstatus,version,uptime, andprograms_loadedCollectorStatus()toBridgeto expose per-collector event counts to the readiness probeTesting
go build ./...passesgo test ./...passesgo vet ./...passesgolangci-lint run ./...passessudo ./bin/bpf-verify --read 5sconfirms 6/6 programs still load./scripts/verify.shpasses (or specific phase:./scripts/verify.sh quality)Checklist
feat(scope): subject)git commit -s)scripts/verify.sh