feat(collector): implement crash-loop safety to keep the daemon alive#95
wazer24 wants to merge 4 commits into
Conversation
Signed-off-by: wazer24 <24wazer@rbunagpur.in>
|
🚀 First PR — welcome aboard! A few things to expect:
If you get stuck, reply here or jump to Discussions. We want this PR to land. |
|
Hello @btwshivam, please review my PR and let me know if there are any changes to make. Thank you. |
|
@wazer24 always make PRs from separate branches. Don't use main; keep it for rebasing. |
| @@ -142,10 +145,66 @@ func (r *Registry) Signals(duration time.Duration) *Signals { | |||
| s.FD = v | |||
| case *MemorySnapshot: | |||
the case *CgroupMemorySnapshot: s.CgroupMemory = v that lives right after this was deleted. that's the only place the cgroup-memory collector feeds Signals, so dropping it silently breaks the memory_limit_pressure and memory_high_throttling rules (they get nil and never fire). it's not intentional, it's fallout from the stale base, same as the any to interface{} revert on line 37 and across the collectors. rebasing on upstream main makes all of these disappear.
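A minimal sketch of the routing this comment describes, to show why the dropped case matters. The struct fields and the `collect` helper are hypothetical stand-ins, not the real `Registry.Signals` code; only the `*CgroupMemorySnapshot` case and `s.CgroupMemory` come from the diff.

```go
package main

import "fmt"

// Hypothetical stand-ins for the real snapshot types; field names are assumptions.
type MemorySnapshot struct{ RSSBytes uint64 }
type CgroupMemorySnapshot struct{ UsageBytes, LimitBytes uint64 }

// Signals models only the two fields relevant to this comment.
type Signals struct {
	Memory       *MemorySnapshot
	CgroupMemory *CgroupMemorySnapshot
}

// collect routes each snapshot into its Signals field via a type switch.
func collect(snaps []any) *Signals {
	s := &Signals{}
	for _, snap := range snaps {
		switch v := snap.(type) {
		case *MemorySnapshot:
			s.Memory = v
		case *CgroupMemorySnapshot:
			// Without this case the field stays nil, so any rule
			// reading s.CgroupMemory has nothing to evaluate.
			s.CgroupMemory = v
		}
	}
	return s
}

func main() {
	s := collect([]any{
		&MemorySnapshot{RSSBytes: 1},
		&CgroupMemorySnapshot{UsageBytes: 9, LimitBytes: 10},
	})
	fmt.Println("cgroup memory populated:", s.CgroupMemory != nil)
}
```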
| defer func() { | ||
| if r := recover(); r != nil { | ||
| observability.HandleDaemonPanic(r, logger) | ||
| os.Exit(2) |
os.Exit(2) here violates invariant 4: no os.Exit outside main.go. runStart is library-ish cli code. let the panic propagate or return an error, and let cmd/kerno/main.go decide the exit code. otherwise deferred cleanup (the http server shutdown, closers) is skipped on a daemon panic.
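One possible shape for this, as a sketch only: the recovered panic is turned into an error through a named return, and a small mapping picks the exit code for `main` to use. `ErrDaemonPanic`, `exitCode`, and this `runStart` signature are hypothetical and not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDaemonPanic is a hypothetical sentinel marking a recovered daemon panic.
var ErrDaemonPanic = errors.New("daemon panicked")

// runStart stands in for the CLI entry point. It converts a panic into an
// error through the named return value instead of calling os.Exit itself.
func runStart(body func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%w: %v", ErrDaemonPanic, r)
		}
	}()
	return body()
}

// exitCode maps an error to a process exit code. Only main would pass the
// result to os.Exit.
func exitCode(err error) int {
	switch {
	case err == nil:
		return 0
	case errors.Is(err, ErrDaemonPanic):
		return 2
	default:
		return 1
	}
}

func main() {
	err := runStart(func() error { panic("boom") })
	code := exitCode(err)
	fmt.Println(err, code)
	// A real main would end with os.Exit(code) here, after the deferred
	// cleanup in the callee has already run.
}
```

Because `runStart` returns normally, any deferred shutdown registered further up the call chain still runs before the process exits.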
| }() | ||
| // Run the actual collector loop | ||
| fn() |
restarting fn() after a recovered panic is risky if the panic fired while a collector held c.mu (e.g. inside record()). recover doesn't release locks, so the restarted goroutine and any Snapshot() caller then deadlock on c.mu forever, which defeats the keep-alive goal. either document that collector loops must not panic under lock, or reset/recreate per-collector state on restart rather than re-entering the same loop with a half-held lock.
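A small illustration of the locking point, under the assumption that the collector can use `defer c.mu.Unlock()`: a deferred unlock runs while the panic unwinds, so the mutex is free again by the time the wrapper recovers. The `collector` type and `safely` helper here are hypothetical, not the PR's code.

```go
package main

import (
	"fmt"
	"sync"
)

// collector is a hypothetical collector whose state is guarded by mu.
type collector struct {
	mu    sync.Mutex
	count int
}

// record panics on a bad event. Because Unlock is deferred, the mutex is
// released while the panic unwinds, before any outer recover runs.
func (c *collector) record(bad bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if bad {
		panic("corrupt event")
	}
	c.count++
}

// Snapshot reads the state under the same lock.
func (c *collector) Snapshot() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.count
}

// safely runs fn and reports whether it panicked.
func safely(fn func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	fn()
	return false
}

func main() {
	c := &collector{}
	fmt.Println("panicked:", safely(func() { c.record(true) }))
	c.record(false) // would block forever if the lock were still held
	fmt.Println("count:", c.Snapshot())
}
```

If `record` called `c.mu.Unlock()` explicitly at the end instead of deferring it, the same panic would leave the mutex locked and the restart would hang as described above.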
| Namespace: Namespace, | ||
| Name: "collector_panics_total", | ||
| Help: "Total panics recovered in collector goroutines.", | ||
| }, []string{"collector", "reason"}) |
the reason label is the raw panic message, which has unbounded cardinality: panic strings carry addresses, values, and file:line, so every distinct panic becomes a new time series. drop reason from the metric (keep it in the log and the panic file), or map it to a small fixed set.
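If the label is kept, mapping to a fixed set could look roughly like this. The category names and the `classifyPanic` and `capture` helpers are assumptions for illustration; nothing here is from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// classifyPanic maps a recovered value to one of four fixed label values,
// so the metric's label set stays bounded regardless of the panic message.
func classifyPanic(r any) string {
	switch r.(type) {
	case runtime.Error: // nil map writes, index out of range, nil dereference
		return "runtime_error"
	case error:
		return "error"
	case string:
		return "string"
	default:
		return "other"
	}
}

// capture runs fn and returns the recovered panic value, or nil if fn
// returned normally.
func capture(fn func()) (r any) {
	defer func() { r = recover() }()
	fn()
	return nil
}

func main() {
	fmt.Println(classifyPanic(errors.New("read cgroup file: no such file")))
	fmt.Println(classifyPanic("bad event at 0xc000123456"))
	fmt.Println(classifyPanic(capture(func() {
		var m map[string]int
		m["x"] = 1 // write to a nil map panics with a runtime.Error
	})))
}
```

The `runtime.Error` case has to come before `error`, since every runtime error also satisfies the `error` interface.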
| ) | ||
| const ( | ||
| panicLogDir = "/var/log/kerno-panics" |
/var/log/kerno-panics is hardcoded. in the DaemonSet container that path isn't writable or persistent (and isn't mounted), so the MkdirAll/WriteFile just error every time and the forensic logging quietly does nothing in the environment it's meant for. make the dir configurable (flag/env, default off), or rely on the structured slog stack output instead of writing host files from library code.
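A sketch of the "configurable, default off" option. The environment variable name `KERNO_PANIC_DIR` and the `writePanicFile` helper are hypothetical; the real project may prefer a flag or config field.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// panicDirEnv is a hypothetical environment variable name, not an existing
// kerno setting.
const panicDirEnv = "KERNO_PANIC_DIR"

// writePanicFile writes stack to dir/name and returns the resulting path.
// An empty dir disables file output and returns ("", nil), so the caller
// relies on the structured log alone.
func writePanicFile(dir, name string, stack []byte) (string, error) {
	if dir == "" {
		return "", nil
	}
	if err := os.MkdirAll(dir, 0o750); err != nil {
		return "", fmt.Errorf("create panic dir: %w", err)
	}
	path := filepath.Join(dir, name)
	if err := os.WriteFile(path, stack, 0o640); err != nil {
		return "", fmt.Errorf("write panic file: %w", err)
	}
	return path, nil
}

func main() {
	path, err := writePanicFile(os.Getenv(panicDirEnv), "collector.stack", []byte("stack trace"))
	if err != nil {
		// Surface the failure instead of letting forensic logging fail silently.
		fmt.Fprintln(os.Stderr, "panic file not written:", err)
		return
	}
	if path == "" {
		fmt.Println("panic files disabled; stack goes to the structured log only")
		return
	}
	fmt.Println("wrote", path)
}
```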
|
fix conflict |
What
I've added a safety net so that if one of our background eBPF collectors panics, it doesn't take the whole daemon down with it. It catches the panic, logs a stack trace for us to debug later, and quietly restarts the collector using an exponential backoff.
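For readers unfamiliar with the restart policy, a capped exponential backoff can be sketched as follows. The base delay and cap are assumed values chosen for illustration, and `backoff` is a hypothetical helper; the PR's actual constants may differ.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = time.Second     // assumed delay before the first restart
	maxDelay  = 2 * time.Minute // assumed upper bound on any single delay
)

// backoff returns the delay before restart attempt n (0-based). The delay
// doubles with each attempt and is capped at maxDelay.
func backoff(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for attempt := 0; attempt <= 8; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, backoff(attempt))
	}
}
```

Returning as soon as the cap is reached keeps the loop short and avoids overflowing the duration for large attempt counts.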
Why
Fixes #95
While working on the codebase, I noticed a pretty big issue: if a single collector (like the TCP or Memory collector) hits a random snag, say a corrupt BPF event or a missing cgroup file, the raw goroutine panics. Before this PR, that would either completely crash the kerno daemon (wiping out all of our recent metric histograms) or it would just silently die and the metrics would flatline without anyone noticing. I wanted to make sure our observability tool is actually rock-solid in production!
Main Benefits
journalctl anymore. Full stack traces get saved straight to /var/log/kerno-panics/ whenever a crash happens.
How
PanicHandler in internal/observability/panics.go that tracks crash counts over a 10-minute sliding window to detect flapping.
RunSafeCollectorGoroutine wrapper and updated all 7 of our existing collectors to use it instead of launching naked goroutines.
kerno_collector_panics_total and kerno_collector_disabled.
defer catch in start.go so that if the core daemon does panic, it cleanly exits with os.Exit(2) so systemd knows exactly how to restart it safely.
Testing
go build ./... passes
go test ./... passes
go vet ./... passes
golangci-lint run ./... passes
go test -run TestCollectorPanicRecovery with a mock fault-injecting collector to guarantee the circuit-breaker logic works perfectly.
sudo ./bin/bpf-verify --read 5s confirms 6/6 programs still load
./scripts/verify.sh passes (or specific phase: ./scripts/verify.sh quality)
Checklist
feat(scope): subject)
git commit -s)
CONTRIBUTING.md)
scripts/verify.sh
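The 10-minute sliding window mentioned under How could be sketched along these lines. This is not the PR's PanicHandler: the `panicWindow` type, the threshold of 3 panics, and `demoStart` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// demoStart is an arbitrary fixed instant so the example is deterministic.
var demoStart = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// panicWindow tracks panic timestamps inside a sliding window.
type panicWindow struct {
	window time.Duration
	limit  int // assumed flapping threshold, not taken from the PR
	times  []time.Time
}

// newPanicWindow returns a tracker with a 10-minute window and an assumed
// limit of 3 panics.
func newPanicWindow() *panicWindow {
	return &panicWindow{window: 10 * time.Minute, limit: 3}
}

// Record stores a panic at now, drops entries that have aged out of the
// window, and reports whether the collector now counts as flapping.
func (w *panicWindow) Record(now time.Time) bool {
	cutoff := now.Add(-w.window)
	kept := w.times[:0] // filter in place, reusing the backing array
	for _, t := range w.times {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	w.times = append(kept, now)
	return len(w.times) >= w.limit
}

func main() {
	w := newPanicWindow()
	for _, offset := range []time.Duration{0, 5 * time.Minute, 11 * time.Minute, 12 * time.Minute} {
		flapping := w.Record(demoStart.Add(offset))
		fmt.Printf("panic at +%s: flapping=%v\n", offset, flapping)
	}
}
```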