
feat(cli): add /healthz and /readyz JSON endpoints to daemon mode #82

Merged
btwshivam merged 2 commits into optiqor:main from Abhijithmns:feat/healthz-readyz-endpoints
Jun 6, 2026

Conversation

@Abhijithmns
Contributor

@Abhijithmns Abhijithmns commented May 16, 2026

What

  • adds /healthz (liveness) and /readyz (readiness) HTTP endpoints to the Prometheus server in daemon mode, required for Kubernetes probe configuration.

Why

Fixes #8

How

  • readyzHandler checks if BPF programs loaded and events are flowing, returns 503 if not ready
  • healthzHandler updated to return status, version, uptime, and programs_loaded
  • added CollectorStatus() to Bridge to expose per-collector event counts to the readiness probe

Testing

  • go build ./... passes
  • go test ./... passes
  • go vet ./... passes
  • golangci-lint run ./... passes
  • Tested locally with:
  • N/A — pure docs/refactor
  • sudo ./bin/bpf-verify --read 5s confirms 6/6 programs still load
  • ./scripts/verify.sh passes (or specific phase: ./scripts/verify.sh quality)

Checklist

  • PR title follows Conventional Commits (feat(scope): subject)
  • All commits are DCO-signed (git commit -s)
  • No unrelated changes pulled in
  • Documentation updated where user-visible behavior changed
  • Added/updated tests for new code paths
  • If a new doctor rule, paired with a chaos scenario in scripts/verify.sh

Signed-off-by: Abhijithmns <abhijithmns07@gmail.com>
@Abhijithmns Abhijithmns requested a review from btwshivam as a code owner May 16, 2026 14:59
@github-actions

🚀 First PR — welcome aboard!

A few things to expect:

  1. CI: every PR runs build + race tests + lint + (eventually) the kernel matrix. If something fails, the log will tell you exactly which gate.
  2. DCO: every commit needs a Signed-off-by: line; git commit -s adds it automatically.
  3. Conventional Commits: PR titles like feat(doctor): add new rule or fix(bpf): handle X. We squash-merge by default.
  4. Review: a maintainer will review within 72 hours. Suggestions are conversations, not orders — push back if something doesn't fit your context.

If you get stuck, reply here or jump to Discussions. We want this PR to land.

@github-actions github-actions Bot added the testing (Tests and test coverage) and area/ops (Operations, deployment, runtime ergonomics) labels May 16, 2026
Member

@btwshivam btwshivam left a comment

the readiness probe returns 503 when zero events have been observed, which takes /metrics offline on an idle host (Prometheus operator stops scraping). plus program_total is a typo for programs_total, and the old programsLoaded/programsTotal got renamed to snake_case without note.

Comment thread internal/cli/start.go Outdated
"version": version.Version,
"uptime": time.Since(startTime).Seconds(),
"programs_loaded": loaded,
"program_total": total,
Member

programs_loaded (plural) and program_total (singular) in the same response. typo, should be programs_total. and these field names changed from the previous programsLoaded and programsTotal (camelCase) without a note in the PR, which silently breaks any scrape config or dashboard that consumed the old names. either preserve the old keys alongside the new ones, or call the rename out explicitly so downstream knows.

Contributor Author

i'll fix program_total -> programs_total and note the rename in the PR description. let me know if you'd prefer keeping the old camelCase keys too.

Comment thread internal/cli/start.go Outdated
}

// Check if collectors are receiving events
if totalEvents == 0 {
Member

this is too strict for a metrics-exporter daemon. on an idle host (low syscall volume, no TCP traffic), no events flow, this returns 503 forever, the Prometheus ServiceMonitor stops scraping kerno, and the operator loses visibility into the very state they need to see. readiness should be 'BPF programs loaded and the HTTP server is accepting connections', not 'kernel activity has been observed'. drop this branch.

Contributor Author

makes sense, i'll remove the zero-events check entirely so readyz only depends on bpf programs being loaded. does that sound right?

Comment thread internal/cli/start.go Outdated
w.Header().Set("Content-Type", "application/json")

// Check if all expected BPF programs are loaded
if loaded < total {
Member

contradicts the daemon's graceful-degradation design (CLAUDE.md invariant 3, 'Failed BPF program logs and skips, doesn't kill the daemon'). runStart deliberately tolerates partial BPF load (e.g. 1 of 6 collectors fails on an older kernel, the rest keep working). returning 503 here means k8s drops the pod from service endpoints whenever any single collector misses, but the doctor pipeline still works on the rest. match the daemon's tolerance, return ready if loaded > 0, or surface a configurable threshold.

Contributor Author

got it, i'll change readyz to return ready if loaded > 0 to match the daemon's graceful degradation. going with the simpler option for now rather than a configurable threshold; let me know if you'd prefer the threshold approach. also wondering if you'd want readiness to flip to not_ready on SIGTERM for graceful shutdown signaling to k8s. worth adding, or out of scope for this PR?

Comment thread internal/metrics/bridge.go Outdated
b.mu.Lock()
defer b.mu.Unlock()

// Count events from the seen map (which tracks cardinality)
Member

the comment says 'Count events from the seen map (which tracks cardinality)' then 'We track actual event counts separately for status reporting'. but the implementation just reads seen, which is the cardinality counter (b.seen[metric]++ at line 87 increments per label string, capped at LabelCardinalityLimit). the JSON field events_collected in start.go:280 will report cardinality, not actual event counts, and it caps out on a busy host. either add a real per-collector event counter to Bridge.Start and read that here, or rename the function and the JSON field to LabelCardinality so callers know what they're looking at.

Contributor Author

that's my bad, the comment was misleading: seen tracks cardinality, not event counts. i'll go with renaming the function and JSON field to 'LabelCardinality' to be accurate rather than pretending it's event counts. unless you'd prefer adding a real per-collector counter instead?

t.Fatalf("response not JSON: %v (body=%q)", err, rec.Body.String())
}
if body["status"] != "ok" {
t.Errorf("status field = %v, want ok", body["status"])
Member

the assertions on programs_loaded and programs_total got removed from this test, so the field rename and the program_total typo on start.go:243 pass through without a failing test. add back the field checks:

if got := body["programs_loaded"]; got != float64(6) {
    t.Errorf("programs_loaded = %v, want 6", got)
}
if got := body["programs_total"]; got != float64(6) {
    t.Errorf("programs_total = %v, want 6", got)
}

also missing: a TestReadyzHandlerOK for the happy 200 path (loaded == total and events > 0).

Contributor Author

@Abhijithmns Abhijithmns May 17, 2026

sure, i'll add those back. i'll also add TestReadyzHandlerOK. since we're dropping the zero-events check as per your other comment, for the happy path we can just test loaded > 0. let me know if that's what you had in mind.

@Abhijithmns
Contributor Author

@btwshivam i've replied to each of the review comments with my proposed approach. let me know if you're fine with the direction and i'll go ahead and push the fixes

Member

@btwshivam btwshivam left a comment

readyz fails closed in two ways that brick a healthy daemon: it 503s on partial bpf load (against the graceful-degradation invariant) and 503s on an idle node with no events. loosen both, details inline.

Comment thread internal/cli/start.go Outdated
w.Header().Set("Content-Type", "application/json")

// Check if all expected BPF programs are loaded
if loaded < total {
Member

this 503s whenever loaded < total, but graceful degradation is a core invariant: a bpf program that fails to load logs and skips, the daemon keeps running. healthz treats the same partial-load case as 200 and calls it graceful degradation (the comment at line 233), readyz calls it fatal. on any kernel where one of the six programs doesn't load (older or locked-down kernels, a missing tracepoint), the pod stays NotReady forever and gets pulled from service endpoints. readiness shouldn't punish partial load, gate on loaded > 0 or just on the server being up.

Comment thread internal/cli/start.go Outdated
}

// Check if collectors are receiving events
if totalEvents == 0 {
Member

this 503s when no events have been collected yet. on a quiet or idle node the collectors legitimately see zero events in the window, so the daemon never becomes Ready and kubelet keeps it out of endpoints. for an observer, no-events-yet is not not-ready. #8 just asks for /healthz and /readyz, it doesn't ask readiness to depend on traffic. drop this check, readiness should reflect that the daemon is up and bpf is attached, not that something happened to fire.

Comment thread internal/metrics/bridge.go Outdated
defer b.mu.Unlock()

// Count events from the seen map (which tracks cardinality)
// We track actual event counts separately for status reporting
Member

this comment says event counts are tracked separately, but there's no separate counter, it reads seen, which is the cardinality-limit map keyed by metric category (syscall, tcp, ...), not by collector. so the returned keys aren't collector names, and you're overloading the same field the cardinality guard uses. if readiness keeps an events signal, add a real per-collector event counter instead of reusing seen, and fix the comment.

Comment thread internal/cli/start.go Outdated
"version": version.Version,
"uptime": time.Since(startTime).Seconds(),
"programs_loaded": loaded,
"program_total": total,
Member

program_total is singular next to programs_loaded (plural), and readyz uses a third scheme (bpf_programs_loaded / bpf_programs_total). this also renames the existing healthz fields from programsLoaded/programsTotal, which breaks anyone parsing the current output. pick one naming and keep it consistent across both handlers.
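
One way to keep the naming consistent is to have both handlers encode the same struct, so the JSON keys are defined in exactly one place. This is only a sketch: probeResponse and encode are hypothetical names, and the field set is an assumption based on the fields mentioned in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// probeResponse is a hypothetical shared payload for /healthz and /readyz.
// Defining the JSON keys once in struct tags keeps both handlers in sync.
type probeResponse struct {
	Status         string  `json:"status"`
	Version        string  `json:"version,omitempty"`
	Uptime         float64 `json:"uptime,omitempty"`
	ProgramsLoaded int     `json:"programs_loaded"`
	ProgramsTotal  int     `json:"programs_total"`
}

// encode marshals a response to its JSON string form.
func encode(r probeResponse) string {
	b, err := json.Marshal(r)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// healthz-style payload: includes version and uptime.
	fmt.Println(encode(probeResponse{
		Status: "ok", Version: "dev", Uptime: 1.5,
		ProgramsLoaded: 5, ProgramsTotal: 6,
	}))
	// readyz-style payload: same keys, optional fields omitted.
	fmt.Println(encode(probeResponse{
		Status: "ready", ProgramsLoaded: 5, ProgramsTotal: 6,
	}))
}
```

With struct tags, a singular/plural slip would have to be made in one declaration rather than in two separate map literals.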

@btwshivam
Member

btwshivam commented May 30, 2026

direction's good, go ahead.

on your open questions: don't keep the old camelCase keys, it's a fresh endpoint so there's no schema to preserve. loaded > 0 is the right call, skip the configurable threshold. and since you're dropping the events check, readyz no longer needs CollectorStatus at all, just remove it rather than renaming. SIGTERM flipping readyz to not_ready is a nice touch but out of scope for #8; do it as a follow-up, or add it in this PR if you like.

@github-actions github-actions Bot added the level:critical Touches BPF, security, or release surfaces (auto-applied) label May 31, 2026
@Abhijithmns
Contributor Author

@btwshivam i guess this commit should fix the issues.

@Abhijithmns Abhijithmns requested a review from btwshivam June 1, 2026 10:19
Member

@btwshivam btwshivam left a comment

thanks for working through this, readyz now respects partial load and the events check / CollectorStatus are gone, tests look right.

@btwshivam
Member

/lgtm

@btwshivam btwshivam merged commit 47a7bcc into optiqor:main Jun 6, 2026
8 of 9 checks passed
@btwshivam btwshivam added the gssoc:approved Counted toward leaderboard label Jun 6, 2026

Labels

  • area/ops: Operations, deployment, runtime ergonomics
  • gssoc:approved: Counted toward leaderboard
  • level:critical: Touches BPF, security, or release surfaces (auto-applied)
  • testing: Tests and test coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add /healthz and /readyz JSON endpoints to daemon mode

2 participants