feat(collector): implement crash-loop safety and panic recovery #159

Open
wazer24 wants to merge 3 commits into optiqor:main from wazer24:main-panic-recovery

Conversation

@wazer24

@wazer24 wazer24 commented May 31, 2026

What

Implements crash-loop safety and panic recovery for the collector goroutines, and addresses architectural feedback on graceful daemon shutdown and Prometheus label cardinality. Also restores the missing BPF collector map wiring and updates the package to modern Go syntax.

Why

Fixes #95

How

  • Panic Recovery & Crash-Looping: Updated RunSafeCollectorGoroutine in collector.go to recover from panics, log them, and permanently disable the offending collector. The collector is no longer restarted in an endless loop, which prevents crash loops and deadlocks.
  • Graceful Shutdown: Removed hard os.Exit(2) calls from library code (internal/cli/start.go). Daemon panic errors are now bubbled up gracefully and handled in main.go.
  • Configurable Panic Logs: Introduced KERNO_PANIC_LOG_DIR. Leaving this environment variable unset disables file-based panic logging, defaulting safely to stdout/stderr.
  • Metrics Cardinality Protection: Removed the dynamic reason label from CollectorPanicsTotal to prevent excessive Prometheus memory consumption.
  • Restored Functionality: Restored the CgroupMemory wiring in Registry.Signals that was accidentally dropped in an older base branch.
  • Syntax Cleanup: Refactored all internal/collector/*.go Snapshot methods from interface{} to any.

Testing

  • go build ./... passes
  • go test ./... passes
  • go vet ./... passes
  • golangci-lint run ./... passes
  • Tested locally with: go test -v ./internal/collector -run TestCollectorPanicRecovery
  • N/A — pure docs/refactor
  • sudo ./bin/bpf-verify --read 5s confirms 6/6 programs still load
  • ./scripts/verify.sh passes (or specific phase: ./scripts/verify.sh quality)

Checklist

  • PR title follows Conventional Commits (feat(scope): subject)
  • All commits are DCO-signed (git commit -s)
  • No unrelated changes pulled in
  • Documentation updated where user-visible behavior changed
  • Added/updated tests for new code paths
  • If a new doctor rule, paired with a chaos scenario in scripts/verify.sh

wazer24 added 2 commits May 31, 2026 09:25
Signed-off-by: wazer24 <24wazer@rbunagpur.in>
Signed-off-by: wazer24 <24wazer@rbunagpur.in>
@wazer24 wazer24 requested a review from btwshivam as a code owner May 31, 2026 04:47
@github-actions

🚀 First PR — welcome aboard!

A few things to expect:

  1. CI: every PR runs build + race tests + lint + (eventually) the kernel matrix. If something fails, the log will tell you exactly which gate.
  2. DCO: every commit needs a Signed-off-by: line; git commit -s adds it automatically.
  3. Conventional Commits: PR titles like feat(doctor): add new rule or fix(bpf): handle X. We squash-merge by default.
  4. Review: a maintainer will review within 72 hours. Suggestions are conversations, not orders — push back if something doesn't fit your context.

If you get stuck, reply here or jump to Discussions. We want this PR to land.

@github-actions bot added labels on May 31, 2026: level:critical (Touches BPF, security, or release surfaces; auto-applied), documentation (Improvements or additions to documentation), testing (Tests and test coverage), area/bpf (eBPF programs and loaders), area/doctor (Diagnostic engine and rules), area/ops (Operations, deployment, runtime ergonomics), area/community (Community, contributors, governance)
Signed-off-by: wazer24 <24wazer@rbunagpur.in>