feat(cli): add /healthz and /readyz JSON endpoints to daemon mode by Abhijithmns · Pull Request #82 · optiqor/kerno

Abhijithmns · 2026-05-16T14:59:49Z

What

Adds /healthz (liveness) and /readyz (readiness) HTTP endpoints to the Prometheus server in daemon mode. Both are required for Kubernetes probe configuration.

Why

Fixes #8

How

readyzHandler checks whether BPF programs are loaded and events are flowing, and returns 503 if not ready
healthzHandler updated to return status, version, uptime, and programs_loaded
added CollectorStatus() to Bridge to expose per-collector event counts to the readiness probe

Testing

N/A — pure docs/refactor
sudo ./bin/bpf-verify --read 5s confirms 6/6 programs still load
./scripts/verify.sh passes (or specific phase: ./scripts/verify.sh quality)

Checklist

PR title follows Conventional Commits (feat(scope): subject)
All commits are DCO-signed (git commit -s)
No unrelated changes pulled in
Documentation updated where user-visible behavior changed
Added/updated tests for new code paths
If a new doctor rule, paired with a chaos scenario in scripts/verify.sh

Signed-off-by: Abhijithmns <abhijithmns07@gmail.com>

github-actions · 2026-05-16T14:59:56Z

🚀 First PR — welcome aboard!

A few things to expect:

CI: every PR runs build + race tests + lint + (eventually) the kernel matrix. If something fails, the log will tell you exactly which gate.
DCO: every commit needs Signed-off-by: — git commit -s adds it automatically.
Conventional Commits: PR titles like feat(doctor): add new rule or fix(bpf): handle X. We squash-merge by default.
Review: a maintainer will review within 72 hours. Suggestions are conversations, not orders — push back if something doesn't fit your context.

If you get stuck, reply here or jump to Discussions. We want this PR to land.

btwshivam

the readiness probe returns 503 when zero events have been observed, which takes /metrics offline on an idle host (Prometheus operator stops scraping). plus program_total is a typo for programs_total, and the old programsLoaded/programsTotal got renamed to snake_case without note.

btwshivam · 2026-05-16T23:39:02Z

+			"version":         version.Version,
+			"uptime":          time.Since(startTime).Seconds(),
+			"programs_loaded": loaded,
+			"program_total":   total,


programs_loaded (plural) and program_total (singular) in the same response. typo, should be programs_total. and these field names changed from the previous programsLoaded and programsTotal (camelCase) without a note in the PR, which silently breaks any scrape config or dashboard that consumed the old names. either preserve the old keys alongside the new ones, or call the rename out explicitly so downstream knows.

i'll fix program_total -> programs_total and note the rename in the PR description. let me know if you'd prefer keeping the old camelCase keys too.
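One way to keep the key names from drifting is to put them in struct tags instead of a map literal. The following is a minimal sketch, not the code in this PR: the `healthzBody` type and the `healthzJSON` helper are hypothetical names, and the field set is taken from the PR description and the diff above.

```go
package main

import "encoding/json"

// healthzBody is a hypothetical response type. The field set mirrors the
// PR description; the struct itself is an assumption for illustration.
type healthzBody struct {
	Status         string  `json:"status"`
	Version        string  `json:"version"`
	Uptime         float64 `json:"uptime"` // seconds since daemon start
	ProgramsLoaded int     `json:"programs_loaded"`
	ProgramsTotal  int     `json:"programs_total"` // plural, matching programs_loaded
}

// healthzJSON renders the body as a JSON string. The keys come from the
// struct tags, so each key name is spelled in exactly one place.
func healthzJSON(version string, uptimeSeconds float64, loaded, total int) (string, error) {
	b, err := json.Marshal(healthzBody{
		Status:         "ok",
		Version:        version,
		Uptime:         uptimeSeconds,
		ProgramsLoaded: loaded,
		ProgramsTotal:  total,
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

A test can then assert on the exact keys, so a typo like `program_total` fails immediately.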

btwshivam · 2026-05-16T23:39:02Z

+		}
+
+		// Check if collectors are receiving events
+		if totalEvents == 0 {


this is too strict for a metrics-exporter daemon. on an idle host (low syscall volume, no TCP traffic), no events flow, this returns 503 forever, the Prometheus ServiceMonitor stops scraping kerno, and the operator loses visibility into the very state they need to see. readiness should be 'BPF programs loaded and the HTTP server is accepting connections', not 'kernel activity has been observed'. drop this branch.

makes sense, i'll remove the zero-events check entirely so readyz only depends on bpf programs being loaded. does that sound right?

btwshivam · 2026-05-16T23:39:02Z

+		w.Header().Set("Content-Type", "application/json")
+
+		// Check if all expected BPF programs are loaded
+		if loaded < total {


contradicts the daemon's graceful-degradation design (CLAUDE.md invariant 3, 'Failed BPF program logs and skips, doesn't kill the daemon'). runStart deliberately tolerates partial BPF load (e.g. 1 of 6 collectors fails on an older kernel, the rest keep working). returning 503 here means k8s drops the pod from service endpoints whenever any single collector misses, but the doctor pipeline still works on the rest. match the daemon's tolerance, return ready if loaded > 0, or surface a configurable threshold.

got it, then i'll change it to return ready if loaded > 0 to match the daemon's graceful degradation. going with the simpler option for now rather than a configurable threshold; let me know if you'd prefer the threshold approach. also wondering if you'd want readiness to flip to not_ready on SIGTERM for graceful shutdown signaling to k8s. worth adding, or out of scope for this PR?
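The agreed rule (ready whenever at least one BPF program is loaded, regardless of observed events) could look roughly like this. It is a sketch under assumptions: `readiness`, `readyzBody`, and the `counts` accessor are hypothetical names, and the "ready" / "not_ready" status strings and key names are illustrative rather than taken from the merged code.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// readyzBody is a hypothetical response shape for /readyz.
type readyzBody struct {
	Status         string `json:"status"`
	ProgramsLoaded int    `json:"programs_loaded"`
	ProgramsTotal  int    `json:"programs_total"`
}

// readiness applies the rule from the review: partial load is still ready,
// and only "nothing loaded at all" maps to 503. Event counts play no part.
func readiness(loaded, total int) (int, readyzBody) {
	body := readyzBody{Status: "ready", ProgramsLoaded: loaded, ProgramsTotal: total}
	if loaded <= 0 {
		body.Status = "not_ready"
		return http.StatusServiceUnavailable, body
	}
	return http.StatusOK, body
}

// readyzHandler wraps readiness in an HTTP handler. counts is an assumed
// accessor that reports how many programs loaded out of how many expected.
func readyzHandler(counts func() (loaded, total int)) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		code, body := readiness(counts())
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(body)
	}
}
```

Keeping the decision in a pure function makes the partial-load case (for example 1 of 6) trivial to test without starting a server.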

btwshivam · 2026-05-16T23:39:02Z

+	b.mu.Lock()
+	defer b.mu.Unlock()
+
+	// Count events from the seen map (which tracks cardinality)


the comment says 'Count events from the seen map (which tracks cardinality)' then 'We track actual event counts separately for status reporting'. but the implementation just reads seen, which is the cardinality counter (b.seen[metric]++ at line 87 increments per label string, capped at LabelCardinalityLimit). the JSON field events_collected in start.go:280 will report cardinality, not actual event counts, and it caps out on a busy host. either add a real per-collector event counter to Bridge.Start and read that here, or rename the function and the JSON field to LabelCardinality so callers know what they're looking at.

that's my bad, the comment was misleading: seen tracks cardinality, not event counts. i'll go with renaming the function and JSON field to 'LabelCardinality' to be accurate rather than pretending it's event counts, unless you'd prefer adding a real per-collector counter instead?
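For reference, the first option from the review (a real per-collector event counter kept apart from the cardinality guard) might be sketched as below. Everything here is a stand-in: `bridge`, `observe`, `eventCounts`, and the `limit` field are hypothetical and only approximate how the real Bridge and LabelCardinalityLimit behave.

```go
package main

import "sync"

// bridge is a stripped-down, hypothetical stand-in for the PR's Bridge.
type bridge struct {
	mu     sync.Mutex
	limit  int               // stand-in for LabelCardinalityLimit
	seen   map[string]int    // metric -> label sets admitted, capped at limit
	events map[string]uint64 // collector -> events observed, never capped
}

func newBridge(limit int) *bridge {
	return &bridge{
		limit:  limit,
		seen:   make(map[string]int),
		events: make(map[string]uint64),
	}
}

// observe records one event. The event counter always advances; the
// cardinality counter advances only for new label sets and stops at limit.
func (b *bridge) observe(collector, metric string, newLabels bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.events[collector]++
	if newLabels && b.seen[metric] < b.limit {
		b.seen[metric]++
	}
}

// eventCounts returns a copy of the per-collector counters so callers
// cannot mutate the bridge's internal map.
func (b *bridge) eventCounts() map[string]uint64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := make(map[string]uint64, len(b.events))
	for name, n := range b.events {
		out[name] = n
	}
	return out
}
```

With two separate maps, a busy host can saturate the cardinality guard while the event count keeps growing, which is the distinction the review comment is pointing at.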

btwshivam · 2026-05-16T23:39:02Z

 		t.Fatalf("response not JSON: %v (body=%q)", err, rec.Body.String())
 	}
 	if body["status"] != "ok" {
 		t.Errorf("status field = %v, want ok", body["status"])


the assertions on programs_loaded and programs_total got removed from this test, so the field rename and the program_total typo on start.go:243 pass through without a failing test. add back the field checks:

if got := body["programs_loaded"]; got != float64(6) {
	t.Errorf("programs_loaded = %v, want 6", got)
}
if got := body["programs_total"]; got != float64(6) {
	t.Errorf("programs_total = %v, want 6", got)
}

also missing: a TestReadyzHandlerOK for the happy 200 path (loaded == total and events > 0).

sure, i'll add those back. i'll also add TestReadyzHandlerOK. since we're dropping the zero-events check as per your other comment, the happy path can just test loaded > 0. let me know if that's what you had in mind.
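A happy-path test along the lines discussed here could be sketched with `net/http/httptest`. The handler below, `stubReadyz`, is a hypothetical stand-in so the example is self-contained; the real test would exercise the PR's readyzHandler, whose signature is not shown in this thread. The `checkReadyz` helper is also an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
)

// stubReadyz is a stand-in handler implementing the "loaded > 0" rule.
func stubReadyz(loaded, total int) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		status := "ready"
		if loaded == 0 {
			status = "not_ready"
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		_ = json.NewEncoder(w).Encode(map[string]any{
			"status":          status,
			"programs_loaded": loaded,
			"programs_total":  total,
		})
	}
}

// checkReadyz sends GET /readyz to h and verifies the status code and the
// programs_loaded field of the JSON body.
func checkReadyz(h http.Handler, wantCode, wantLoaded int) error {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	if rec.Code != wantCode {
		return fmt.Errorf("status code = %d, want %d", rec.Code, wantCode)
	}
	var body map[string]any
	if err := json.Unmarshal(rec.Body.Bytes(), &body); err != nil {
		return fmt.Errorf("response not JSON: %v (body=%q)", err, rec.Body.String())
	}
	// encoding/json decodes JSON numbers into float64 when the target is any.
	if got := body["programs_loaded"]; got != float64(wantLoaded) {
		return fmt.Errorf("programs_loaded = %v, want %d", got, wantLoaded)
	}
	return nil
}

// TestReadyzHandlerOK covers the happy path: partial load (5 of 6) is ready.
func TestReadyzHandlerOK(t *testing.T) {
	if err := checkReadyz(stubReadyz(5, 6), http.StatusOK, 5); err != nil {
		t.Error(err)
	}
}

// TestReadyzHandlerNoPrograms covers the only 503 case: nothing loaded.
func TestReadyzHandlerNoPrograms(t *testing.T) {
	if err := checkReadyz(stubReadyz(0, 6), http.StatusServiceUnavailable, 0); err != nil {
		t.Error(err)
	}
}
```

Asserting on the JSON fields as well as the status code is what catches key renames and typos, which is the gap the review comment identified.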

Abhijithmns · 2026-05-17T04:24:03Z

@btwshivam i've replied to each of the review comments with my proposed approach. let me know if you're fine with the direction and i'll go ahead and push the fixes

btwshivam

readyz fails closed in two ways that brick a healthy daemon: it 503s on partial bpf load (against the graceful-degradation invariant) and 503s on an idle node with no events. loosen both, details inline.

btwshivam · 2026-05-30T23:52:19Z

+		w.Header().Set("Content-Type", "application/json")
+
+		// Check if all expected BPF programs are loaded
+		if loaded < total {


this 503s whenever loaded < total, but graceful degradation is a core invariant: a bpf program that fails to load logs and skips, the daemon keeps running. healthz treats the same partial-load case as 200 and calls it graceful degradation (the comment at line 233), readyz calls it fatal. on any kernel where one of the six programs doesn't load (older or locked-down kernels, a missing tracepoint), the pod stays NotReady forever and gets pulled from service endpoints. readiness shouldn't punish partial load, gate on loaded > 0 or just on the server being up.

btwshivam · 2026-05-30T23:52:19Z

+		}
+
+		// Check if collectors are receiving events
+		if totalEvents == 0 {


this 503s when no events have been collected yet. on a quiet or idle node the collectors legitimately see zero events in the window, so the daemon never becomes Ready and kubelet keeps it out of endpoints. for an observer, no-events-yet is not not-ready. #8 just asks for /healthz and /readyz, it doesn't ask readiness to depend on traffic. drop this check, readiness should reflect that the daemon is up and bpf is attached, not that something happened to fire.

btwshivam · 2026-05-30T23:52:19Z

+	defer b.mu.Unlock()
+
+	// Count events from the seen map (which tracks cardinality)
+	// We track actual event counts separately for status reporting


this comment says event counts are tracked separately, but there's no separate counter, it reads seen, which is the cardinality-limit map keyed by metric category (syscall, tcp, ...), not by collector. so the returned keys aren't collector names, and you're overloading the same field the cardinality guard uses. if readiness keeps an events signal, add a real per-collector event counter instead of reusing seen, and fix the comment.

btwshivam · 2026-05-30T23:52:19Z

+			"version":         version.Version,
+			"uptime":          time.Since(startTime).Seconds(),
+			"programs_loaded": loaded,
+			"program_total":   total,


program_total is singular next to programs_loaded (plural), and readyz uses a third scheme (bpf_programs_loaded / bpf_programs_total). this also renames the existing healthz fields from programsLoaded/programsTotal, which breaks anyone parsing the current output. pick one naming and keep it consistent across both handlers.

btwshivam · 2026-05-30T23:55:00Z

direction's good, go ahead.

a few of your open questions: don't keep the old camelCase keys, it's a fresh endpoint so there's no schema to preserve. loaded > 0 is the right call, skip the configurable threshold. and since you're dropping the events check, readyz no longer needs CollectorStatus at all, so just remove it rather than renaming. SIGTERM flipping readyz to not_ready is a nice touch but out of scope for #8; do it as a follow-up, or add it in this PR if you like.
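For the follow-up idea, flipping readiness on SIGTERM can be done with an atomic flag that the readiness check consults. This is only a sketch of the technique, not something this PR implements: `draining`, `watchSIGTERM`, and `readyzCode` are hypothetical names.

```go
package main

import (
	"context"
	"net/http"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// draining is set once shutdown has begun (hypothetical name).
var draining atomic.Bool

// watchSIGTERM returns a context that is cancelled when the process
// receives SIGTERM, and marks the daemon as draining once that context is
// done (which also happens if the parent is cancelled or stop is called).
func watchSIGTERM(parent context.Context) (context.Context, context.CancelFunc) {
	ctx, stop := signal.NotifyContext(parent, syscall.SIGTERM)
	go func() {
		<-ctx.Done()
		draining.Store(true)
	}()
	return ctx, stop
}

// readyzCode combines the "loaded > 0" rule with the draining flag:
// a draining daemon reports 503 so Kubernetes stops routing to it.
func readyzCode(loaded int) int {
	if draining.Load() || loaded <= 0 {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}
```

The HTTP server would need to keep serving for a short grace period after the flag flips so the kubelet can observe the 503 before the process exits.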

Abhijithmns · 2026-05-31T13:05:23Z

@btwshivam i guess this commit should fix the issues.

btwshivam

thanks for working through this, readyz now respects partial load and the events check / CollectorStatus are gone, tests look right.

btwshivam · 2026-06-06T17:42:32Z

/lgtm

feat(cli): add healthz readyz endpoints to daemon mode

92170ec

Signed-off-by: Abhijithmns <abhijithmns07@gmail.com>

Abhijithmns requested a review from btwshivam as a code owner May 16, 2026 14:59

github-actions Bot added the testing (Tests and test coverage) and area/ops (Operations, deployment, runtime ergonomics) labels May 16, 2026

btwshivam requested changes May 16, 2026

btwshivam requested changes May 30, 2026

This was referenced May 31, 2026

feat: add startupProbe to Kubernetes DaemonSet for improved startup r… #105

Merged

test(metrics): finish table-driving bridge_test.go #141

Closed

test(helm): integration-test chart deploy on k3s #149

Open

feat(cli): add /healthz and /readyz JSON endpoints to daemon mode (with the changes)

70c21c5

github-actions Bot added the level:critical (Touches BPF, security, or release surfaces; auto-applied) label May 31, 2026

Abhijithmns requested a review from btwshivam June 1, 2026 10:19

btwshivam approved these changes Jun 6, 2026

btwshivam merged commit 47a7bcc into optiqor:main Jun 6, 2026
8 of 9 checks passed

btwshivam added the gssoc:approved (Counted toward leaderboard) label Jun 6, 2026
