Skip to content

feat(collector): implement crash-loop safety to keep the daemon alive#95

Open
wazer24 wants to merge 4 commits into
optiqor:mainfrom
wazer24:main
Open

feat(collector): implement crash-loop safety to keep the daemon alive#95
wazer24 wants to merge 4 commits into
optiqor:mainfrom
wazer24:main

Conversation

@wazer24
Copy link
Copy Markdown

@wazer24 wazer24 commented May 21, 2026

What

I've added a safety net so that if one of our background eBPF collectors panics, it doesn't take the whole daemon down with it. It catches the panic, logs a stack trace for us to debug later, and quietly restarts the collector using an exponential backoff.

Why

Fixes #95

While working on the codebase, I noticed a pretty big issue: if a single collector (like the TCP or Memory collector) hits a random snag—say, a corrupt BPF event or a missing cgroup file—the raw goroutine panics. Before this PR, that would either completely crash the kerno daemon (wiping out all of our recent metric histograms) or it would just silently die and the metrics would flatline without anyone noticing. I wanted to make sure our observability tool is actually rock-solid in production!

Main Benefits

  • Keeps the Daemon Alive: One buggy collector won't kill the entire process or leave orphaned BPF programs running in the kernel.
  • Circuit Breaker for Flapping: If a collector gets caught in a true crash-loop (5 panics in 10 minutes), it permanently shuts itself down instead of spinning endlessly and burning CPU. It also fires off a Prometheus alert so we know it's broken.
  • Easy Debugging: You don't have to go digging through journalctl anymore. Full stack traces get saved straight to /var/log/kerno-panics/ whenever a crash happens.
  • Self-Healing: Temporary glitches are automatically handled! The collector just waits a bit (from 1s up to 60s) and tries again.

How

  • I created a centralized PanicHandler in internal/observability/panics.go that tracks crash counts over a 10-minute sliding window to detect flapping.
  • I wrote a RunSafeCollectorGoroutine wrapper and updated all 7 of our existing collectors to use it instead of launching naked goroutines.
  • I hooked up two new Prometheus metrics: kerno_collector_panics_total and kerno_collector_disabled.
  • Finally, I added a top-level defer catch in start.go so that if the core daemon does panic, it cleanly exits with os.Exit(2) so systemd knows exactly how to restart it safely.

Testing

  • go build ./... passes
  • go test ./... passes
  • go vet ./... passes
  • golangci-lint run ./... passes
  • Tested locally with: go test -run TestCollectorPanicRecovery with a mock fault-injecting collector to guarantee the circuit-breaker logic works perfectly.
  • N/A — pure docs/refactor
  • sudo ./bin/bpf-verify --read 5s confirms 6/6 programs still load
  • ./scripts/verify.sh passes (or specific phase: ./scripts/verify.sh quality)

Checklist

  • PR title follows Conventional Commits (feat(scope): subject)
  • All commits are DCO-signed (git commit -s)
  • No unrelated changes pulled in
  • Documentation updated where user-visible behavior changed (Added a section on Panics to CONTRIBUTING.md)
  • Added/updated tests for new code paths
  • If a new doctor rule, paired with a chaos scenario in scripts/verify.sh

wazer24 added 2 commits May 21, 2026 13:31
Signed-off-by: wazer24 <24wazer@rbunagpur.in>
Signed-off-by: wazer24 <24wazer@rbunagpur.in>
@wazer24 wazer24 requested a review from btwshivam as a code owner May 21, 2026 08:27
@github-actions
Copy link
Copy Markdown

🚀 First PR — welcome aboard!

A few things to expect:

  1. CI: every PR runs build + race tests + lint + (eventually) the kernel matrix. If something fails, the log will tell you exactly which gate.
  2. DCO: every commit needs Signed-off-by:git commit -s adds it automatically.
  3. Conventional Commits: PR titles like feat(doctor): add new rule or fix(bpf): handle X. We squash-merge by default.
  4. Review: a maintainer will review within 72 hours. Suggestions are conversations, not orders — push back if something doesn't fit your context.

If you get stuck, reply here or jump to Discussions. We want this PR to land.

@github-actions github-actions Bot added level:critical Touches BPF, security, or release surfaces (auto-applied) documentation Improvements or additions to documentation testing Tests and test coverage area/bpf eBPF programs and loaders area/doctor Diagnostic engine and rules area/ops Operations, deployment, runtime ergonomics area/community Community, contributors, governance labels May 21, 2026
@wazer24
Copy link
Copy Markdown
Author

wazer24 commented May 24, 2026

Hello @btwshivam please review my pr and let me know if there are any changes to make. Thank you .

@btwshivam
Copy link
Copy Markdown
Member

btwshivam commented May 31, 2026

@wazer24 always Make PR with seprate branches.. dont use main.. keep it for rebasing

Copy link
Copy Markdown
Member

@btwshivam btwshivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the branch is based on a stale main (you're PRing from your fork's main), so it reverts merged work: any back to interface{} (#54), the cgroup-memory wiring in Signals (#62), and the slash-commands section in CONTRIBUTING. rebase on upstream main first, then the panic-recovery feature notes inline.

@@ -142,10 +145,66 @@ func (r *Registry) Signals(duration time.Duration) *Signals {
s.FD = v
case *MemorySnapshot:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the case *CgroupMemorySnapshot: s.CgroupMemory = v that lives right after this was deleted. that's the only place the cgroup-memory collector feeds Signals, so dropping it silently breaks the memory_limit_pressure and memory_high_throttling rules (they get nil and never fire). it's not intentional, it's fallout from the stale base, same as the any to interface{} revert on line 37 and across the collectors. rebasing on upstream main makes all of these disappear.

Comment thread internal/cli/start.go
defer func() {
if r := recover(); r != nil {
observability.HandleDaemonPanic(r, logger)
os.Exit(2)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.Exit(2) here violates invariant 4: no os.Exit outside main.go. runStart is library-ish cli code. let the panic propagate or return an error, and let cmd/kerno/main.go decide the exit code. otherwise deferred cleanup (the http server shutdown, closers) is skipped on a daemon panic.

}()

// Run the actual collector loop
fn()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restarting fn() after a recovered panic is risky if the panic fired while a collector held c.mu (e.g. inside record()). recover doesn't release locks, so the restarted goroutine and any Snapshot() caller then deadlock on c.mu forever, which defeats the keep-alive goal. either document that collector loops must not panic under lock, or reset/recreate per-collector state on restart rather than re-entering the same loop with a half-held lock.

Namespace: Namespace,
Name: "collector_panics_total",
Help: "Total panics recovered in collector goroutines.",
}, []string{"collector", "reason"})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason label is the raw panic message, which is unbounded cardinality, panic strings carry addresses, values, file:line, so every distinct panic becomes a new time series. drop reason from the metric (keep it in the log and the panic file), or map to a small fixed set.

)

const (
panicLogDir = "/var/log/kerno-panics"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/var/log/kerno-panics is hardcoded. in the DaemonSet container that path isn't writable or persistent (and isn't mounted), so the MkdirAll/WriteFile just error every time and the forensic logging quietly does nothing in the environment it's meant for. make the dir configurable (flag/env, default off), or rely on the structured slog stack output instead of writing host files from library code.

@wazer24 wazer24 requested a review from btwshivam May 31, 2026 03:29
@wazer24 wazer24 marked this pull request as draft May 31, 2026 03:34
@wazer24 wazer24 marked this pull request as ready for review May 31, 2026 05:23
@btwshivam
Copy link
Copy Markdown
Member

fix conflict

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/bpf eBPF programs and loaders area/community Community, contributors, governance area/doctor Diagnostic engine and rules area/ops Operations, deployment, runtime ergonomics documentation Improvements or additions to documentation level:critical Touches BPF, security, or release surfaces (auto-applied) testing Tests and test coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants