feat(collector): implement crash-loop safety to keep the daemon alive by wazer24 · Pull Request #95 · optiqor/kerno

wazer24 · 2026-05-21T08:27:01Z

What

I've added a safety net so that if one of our background eBPF collectors panics, it doesn't take the whole daemon down with it. It catches the panic, logs a stack trace for us to debug later, and quietly restarts the collector using an exponential backoff.

Why

Fixes #95

While working on the codebase, I noticed a pretty big issue: if a single collector (like the TCP or Memory collector) hits a random snag—say, a corrupt BPF event or a missing cgroup file—the raw goroutine panics. Before this PR, that would either completely crash the kerno daemon (wiping out all of our recent metric histograms) or it would just silently die and the metrics would flatline without anyone noticing. I wanted to make sure our observability tool is actually rock-solid in production!

Main Benefits

Keeps the Daemon Alive: One buggy collector won't kill the entire process or leave orphaned BPF programs running in the kernel.
Circuit Breaker for Flapping: If a collector gets caught in a true crash-loop (5 panics in 10 minutes), it permanently shuts itself down instead of spinning endlessly and burning CPU. It also fires off a Prometheus alert so we know it's broken.
Easy Debugging: You don't have to go digging through journalctl anymore. Full stack traces get saved straight to /var/log/kerno-panics/ whenever a crash happens.
Self-Healing: Temporary glitches are automatically handled! The collector just waits a bit (from 1s up to 60s) and tries again.

How

I created a centralized PanicHandler in internal/observability/panics.go that tracks crash counts over a 10-minute sliding window to detect flapping.
I wrote a RunSafeCollectorGoroutine wrapper and updated all 7 of our existing collectors to use it instead of launching naked goroutines.
I hooked up two new Prometheus metrics: kerno_collector_panics_total and kerno_collector_disabled.
Finally, I added a top-level defer catch in start.go so that if the core daemon does panic, it cleanly exits with os.Exit(2) so systemd knows exactly how to restart it safely.

Testing

go build ./... passes
go test ./... passes
go vet ./... passes
golangci-lint run ./... passes
Tested locally with: go test -run TestCollectorPanicRecovery with a mock fault-injecting collector to guarantee the circuit-breaker logic works perfectly.

N/A — pure docs/refactor
sudo ./bin/bpf-verify --read 5s confirms 6/6 programs still load
./scripts/verify.sh passes (or specific phase: ./scripts/verify.sh quality)

Checklist

PR title follows Conventional Commits (feat(scope): subject)
All commits are DCO-signed (git commit -s)
No unrelated changes pulled in
Documentation updated where user-visible behavior changed (Added a section on Panics to CONTRIBUTING.md)
Added/updated tests for new code paths
If a new doctor rule, paired with a chaos scenario in scripts/verify.sh

Signed-off-by: wazer24 <24wazer@rbunagpur.in>

github-actions · 2026-05-21T08:27:12Z

🚀 First PR — welcome aboard!

A few things to expect:

CI: every PR runs build + race tests + lint + (eventually) the kernel matrix. If something fails, the log will tell you exactly which gate.
DCO: every commit needs Signed-off-by: — git commit -s adds it automatically.
Conventional Commits: PR titles like feat(doctor): add new rule or fix(bpf): handle X. We squash-merge by default.
Review: a maintainer will review within 72 hours. Suggestions are conversations, not orders — push back if something doesn't fit your context.

If you get stuck, reply here or jump to Discussions. We want this PR to land.

wazer24 · 2026-05-24T02:51:49Z

Hello @btwshivam, please review my PR and let me know if there are any changes to make. Thank you.

btwshivam · 2026-05-31T00:07:08Z

@wazer24 always make PRs from separate branches. don't use main; keep it for rebasing.

btwshivam

the branch is based on a stale main (you're PRing from your fork's main), so it reverts merged work: it changes any back to interface{} (#54), drops the cgroup-memory wiring in Signals (#62), and removes the slash-commands section in CONTRIBUTING. rebase on upstream main first. notes on the panic-recovery feature itself are inline.

btwshivam · 2026-05-31T00:09:57Z

@@ -142,10 +145,66 @@ func (r *Registry) Signals(duration time.Duration) *Signals {
 			s.FD = v
 		case *MemorySnapshot:


the case *CgroupMemorySnapshot: s.CgroupMemory = v that lives right after this was deleted. that's the only place the cgroup-memory collector feeds Signals, so dropping it silently breaks the memory_limit_pressure and memory_high_throttling rules (they get nil and never fire). it's not intentional, it's fallout from the stale base, same as the any to interface{} revert on line 37 and across the collectors. rebasing on upstream main makes all of these disappear.

btwshivam · 2026-05-31T00:09:57Z

+	defer func() {
+		if r := recover(); r != nil {
+			observability.HandleDaemonPanic(r, logger)
+			os.Exit(2)


os.Exit(2) here violates invariant 4: no os.Exit outside main.go. runStart is library-ish cli code. let the panic propagate or return an error, and let cmd/kerno/main.go decide the exit code. otherwise deferred cleanup (the http server shutdown, closers) is skipped on a daemon panic.

btwshivam · 2026-05-31T00:09:57Z

+				}()
+
+				// Run the actual collector loop
+				fn()


restarting fn() after a recovered panic is risky if the panic fired while a collector held c.mu (e.g. inside record()). recover doesn't release locks, so the restarted goroutine and any Snapshot() caller then deadlock on c.mu forever, which defeats the keep-alive goal. either document that collector loops must not panic under lock, or reset/recreate per-collector state on restart rather than re-entering the same loop with a half-held lock.
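
The locking hazard can be reproduced in isolation. In this sketch the `collector` type and its methods are hypothetical stand-ins (whether the real collectors defer their unlocks is not shown in this thread), and `sync.Mutex.TryLock` requires Go 1.18 or later.

```go
package main

import (
	"fmt"
	"sync"
)

// collector is a hypothetical stand-in for a real collector's state.
type collector struct {
	mu    sync.Mutex
	count int
}

// recordUnsafe panics while holding mu and has no deferred Unlock,
// so the mutex stays locked after the panic is recovered.
func (c *collector) recordUnsafe() {
	c.mu.Lock()
	panic("corrupt event")
}

// recordSafe defers the Unlock, so a panic releases mu while unwinding.
func (c *collector) recordSafe(fail bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if fail {
		panic("corrupt event")
	}
	c.count++
}

// lockedAfterPanic runs fn, recovers any panic, and reports whether
// c.mu is still held afterwards.
func lockedAfterPanic(c *collector, fn func()) bool {
	func() {
		defer func() { _ = recover() }()
		fn()
	}()
	if c.mu.TryLock() {
		c.mu.Unlock()
		return false
	}
	return true
}

func main() {
	bad := &collector{}
	fmt.Println("manual unlock, still locked:", lockedAfterPanic(bad, bad.recordUnsafe))

	good := &collector{}
	fmt.Println("deferred unlock, still locked:", lockedAfterPanic(good, func() { good.recordSafe(true) }))

	// Recreating state on restart sidesteps a stuck mutex entirely.
	bad = &collector{}
	bad.recordSafe(false)
	fmt.Println("fresh collector count:", bad.count)
}
```

The first case is the deadlock scenario described above; the second and third correspond to the two proposed remedies.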

btwshivam · 2026-05-31T00:09:57Z

+	Namespace: Namespace,
+	Name:      "collector_panics_total",
+	Help:      "Total panics recovered in collector goroutines.",
+}, []string{"collector", "reason"})


the reason label is the raw panic message, which has unbounded cardinality: panic strings carry addresses, values, and file:line, so every distinct panic becomes a new time series. drop reason from the metric (keep it in the log and the panic file), or map it to a small fixed set.

btwshivam · 2026-05-31T00:09:57Z

+)
+
+const (
+	panicLogDir       = "/var/log/kerno-panics"


/var/log/kerno-panics is hardcoded. in the DaemonSet container that path isn't writable or persistent (and isn't mounted), so the MkdirAll/WriteFile just error every time and the forensic logging quietly does nothing in the environment it's meant for. make the dir configurable (flag/env, default off), or rely on the structured slog stack output instead of writing host files from library code.

btwshivam · 2026-06-06T17:49:04Z

fix conflict

wazer24 added 2 commits May 21, 2026 13:31

docs: document panic recovery mechanism

4448f73

Signed-off-by: wazer24 <24wazer@rbunagpur.in>

feat(collector): implement crash-loop safety and panic recovery

5f9eec0

Signed-off-by: wazer24 <24wazer@rbunagpur.in>

wazer24 requested a review from btwshivam as a code owner May 21, 2026 08:27

Merge branch 'optiqor:main' into main

01a0c4f

btwshivam requested changes May 31, 2026


btwshivam mentioned this pull request May 31, 2026

test(cli): daemon graceful shutdown on SIGTERM #152

Open

wazer24 requested a review from btwshivam May 31, 2026 03:29

wazer24 marked this pull request as draft May 31, 2026 03:34

Merge branch 'optiqor:main' into main

cb82466

This was referenced May 31, 2026

Main panic recovery #158

Closed

feat(collector): implement crash-loop safety and panic recovery #159

Open

wazer24 marked this pull request as ready for review May 31, 2026 05:23

wazer24 wants to merge 4 commits into optiqor:main from wazer24:main

2 participants
