25 changes: 8 additions & 17 deletions CONTRIBUTING.md
@@ -1,4 +1,4 @@
# Contributing to Kerno

Thank you for your interest in contributing to Kerno! This document provides guidelines, full setup instructions, and best practices for contributors.

@@ -21,7 +21,7 @@

## Developer Certificate of Origin (DCO)

All contributions to Kerno must be signed off under the [Developer Certificate of Origin (DCO)](https://developercertificate.org/). This certifies that you wrote or have the right to submit the code you are contributing.

**Every commit must include a `Signed-off-by` line:**

@@ -32,7 +32,7 @@
You can do this automatically by committing with the `-s` flag:

```bash
git commit -s -m "feat: add syscall latency collector"
```

---
@@ -48,15 +48,15 @@
| **Go** | 1.24 | 1.25+ | [install](https://go.dev/doc/install) |
| **make** | 4.0 | - | Build orchestration |

**Optional (for eBPF development and running Kerno):**

| Requirement | Minimum | Recommended | Notes |
|---|---|---|---|
| **Linux kernel** | 5.8 | 6.1+ | Must have `CONFIG_DEBUG_INFO_BTF=y` to run |
| **clang** | 14 | 17+ | For eBPF C compilation |
| **llvm** | 14 | 17+ | `llvm-strip` used by bpf2go |
| **libbpf-dev** | 0.8 | 1.0+ | BPF CO-RE headers |
| **bpftool** | - | latest | BTF inspection and debugging |

### Step 1 - Install System Dependencies (Optional for Go-only dev)

@@ -66,7 +66,7 @@
sudo apt-get update
sudo apt-get install -y \
clang llvm llvm-dev \
libbpf-dev \
linux-headers-$(uname -r) \
linux-tools-$(uname -r) linux-tools-common \
make gcc pkg-config \
@@ -78,9 +78,9 @@
```bash
sudo dnf install -y \
clang llvm llvm-devel \
libbpf-devel \
kernel-headers kernel-devel \
bpftool \
make gcc pkg-config \
git curl
```
@@ -471,6 +471,14 @@
- All new features require tests.
- All new CLI commands require documentation.

### Reliability & Panics

Your collector should not panic, but if it does, Kerno handles it as follows:
- **Crash Recovery**: The panicking goroutine is recovered and a full stack trace is captured.
- **Forensic Logging**: The panic trace is saved to `/var/log/kerno-panics/` for post-mortem analysis.
- **Backoff & Restart**: The collector restarts automatically with exponential backoff, starting at 1 second and capped at 60 seconds.
- **Crash-Loop Safety**: If a collector panics 5 times within 10 minutes, Kerno disables it for the remainder of the daemon's lifetime and emits a `CRITICAL` alert metric to prevent flapping.
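
The crash-loop rule and the capped backoff described above can be sketched roughly as follows. This is a minimal illustration, not Kerno's implementation: `crashTracker`, `nextBackoff`, and `epoch` are hypothetical names, and the sliding-window bookkeeping is only one plausible way to count "5 panics within 10 minutes".

```go
package main

import (
	"fmt"
	"time"
)

// epoch is an arbitrary fixed start time used only by this sketch.
var epoch = time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)

// crashTracker is a hypothetical helper that remembers recent panic
// timestamps and reports when the crash-loop threshold is reached.
type crashTracker struct {
	limit  int           // panics tolerated inside the window (5 in the text above)
	window time.Duration // sliding window length (10 minutes in the text above)
	times  []time.Time   // panic timestamps still inside the window
}

// record notes a panic at time now and reports whether the collector
// should be disabled because limit panics fell inside the window.
func (t *crashTracker) record(now time.Time) bool {
	cutoff := now.Add(-t.window)
	kept := t.times[:0]
	for _, ts := range t.times {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	t.times = append(kept, now)
	return len(t.times) >= t.limit
}

// nextBackoff doubles the current delay and caps it at max.
func nextBackoff(cur, max time.Duration) time.Duration {
	cur *= 2
	if cur > max {
		return max
	}
	return cur
}

func main() {
	tr := &crashTracker{limit: 5, window: 10 * time.Minute}
	backoff := time.Second
	for i := 1; i <= 5; i++ {
		disabled := tr.record(epoch.Add(time.Duration(i) * time.Minute))
		fmt.Printf("panic %d: disabled=%v, backoff before restart=%v\n", i, disabled, backoff)
		backoff = nextBackoff(backoff, 60*time.Second)
	}
}
```

Panics older than the window are dropped before counting, so a collector that fails only occasionally would never reach the threshold under this scheme.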

---

## Pull Request Guidelines
@@ -482,23 +490,6 @@
- At least one maintainer approval is required.
- Squash-merge is preferred for clean history.

### PR Slash Commands

Drive review and merge from PR comments. A bot picks them up within seconds.

| Command | Who | Effect |
|---|---|---|
| `/help` | anyone | List available commands |
| `/retest` | anyone | Re-run any failed CI checks on the current commit |
| `/close` | author or maintainer | Close the PR |
| `/reopen` | author or maintainer | Re-open a closed PR |
| `/lgtm` or `/approve` | maintainer | Record approval, add `lgtm` label |
| `/lgtm cancel` | maintainer | Withdraw approval, remove label |
| `/merge` | maintainer | Squash-merge if green and not held |
| `/hold` | maintainer | Block `/merge` (adds `do-not-merge/hold`) |
| `/unhold` | maintainer | Release the hold |
| `/ok-to-test` | maintainer | Allow CI to run on an external contributor's PR |

## Claiming an Issue

Before starting work, claim the issue so two people don't duplicate effort:
23 changes: 12 additions & 11 deletions internal/bpf/gen_stub.go
@@ -1,19 +1,20 @@
// Copyright 2026 Optiqor contributors
// SPDX-License-Identifier: Apache-2.0

// This file provides placeholder types so `make build` works on a fresh
// clone without clang or libbpf installed.
// This file provides placeholder types for development without running bpf2go.
// When you run `go generate ./internal/bpf/...`, bpf2go creates the real
// *_bpfel.go files that embed compiled eBPF bytecode. Those files will
// override these stubs via the build tag.
//
// Build modes:
// - default (`make build`): the `ebpf` tag is OFF, this stub compiles,
// the bpf2go-generated `*_bpfel.go` files are excluded. No clang
// required. The binary builds but cannot actually load BPF programs.
// - real BPF (`make build-ebpf`): the `ebpf` tag is ON, this stub is
// excluded, the generated files compile. Requires clang + libbpf.
// To build with real eBPF support:
// 1. Install clang + libbpf-dev
// 2. Run: make generate
// 3. Run: make build
//
// `make generate` post-processes each generated file's build tag to
// require `ebpf`, which is what makes the two modes mutually exclusive
// instead of duplicate-declaring on common architectures.
// This stub file is gated to only compile on architectures bpf2go
// does NOT generate bindings for. Once `go generate` has produced the
// _bpfel.go files (on amd64/arm64/...), those provide the real
// definitions and this file is excluded.

//go:build !ebpf

10 changes: 9 additions & 1 deletion internal/cli/start.go
@@ -21,6 +21,7 @@ import (
"github.com/optiqor/kerno/internal/adapter"
"github.com/optiqor/kerno/internal/bpf"
"github.com/optiqor/kerno/internal/metrics"
"github.com/optiqor/kerno/internal/observability"
"github.com/optiqor/kerno/internal/version"
)

@@ -78,6 +79,13 @@ func runStart(ctx context.Context, opts startOpts) error {

logger := slog.Default()

defer func() {
if r := recover(); r != nil {
observability.HandleDaemonPanic(r, logger)
os.Exit(2)

Review comment (Member): `os.Exit(2)` here violates invariant 4: no `os.Exit` outside main.go. `runStart` is library-ish CLI code. Let the panic propagate or return an error, and let cmd/kerno/main.go decide the exit code. Otherwise deferred cleanup (the HTTP server shutdown, closers) is skipped on a daemon panic.

}
}()

logger.Info("starting kerno daemon",
"prometheus", opts.prometheus,
"dashboard", opts.dashboard,
@@ -234,7 +242,7 @@ func healthzHandler(loaded, total int) http.HandlerFunc {
return func(w http.ResponseWriter, _ *http.Request) {
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(map[string]any{
json.NewEncoder(w).Encode(map[string]interface{}{
"status": "ok",
"programsLoaded": loaded,
"programsTotal": total,
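
One way to follow the review comment on `runStart` above is to convert the recovered panic into a returned error through a named return value, leaving the exit code to `main`. The sketch below is an assumption about how that could look, not the project's code: `runDaemon`, `exitCode`, and `errDaemonPanic` are hypothetical names, and the exit code 2 is taken from the diff.

```go
package main

import (
	"errors"
	"fmt"
)

// errDaemonPanic marks errors that originate from a recovered panic
// (hypothetical sentinel, not part of kerno).
var errDaemonPanic = errors.New("daemon panic")

// runDaemon stands in for runStart. The deferred recover turns a panic
// into a returned error, so the caller's deferred cleanup still runs and
// the caller decides how the process exits.
func runDaemon(work func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%w: %v", errDaemonPanic, r)
		}
	}()
	work()
	return nil
}

// exitCode maps an error to a process exit status.
func exitCode(err error) int {
	switch {
	case err == nil:
		return 0
	case errors.Is(err, errDaemonPanic):
		return 2
	default:
		return 1
	}
}

func main() {
	err := runDaemon(func() { panic("boom") })
	fmt.Println("error:", err)
	// A real main would call os.Exit(exitCode(err)) here; the sketch only
	// prints the code so it can run anywhere.
	fmt.Println("exit code:", exitCode(err))
}
```

Because `os.Exit` terminates the process without running deferred functions, keeping it in `main` means everything deferred inside the CLI code gets a chance to finish first.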
65 changes: 62 additions & 3 deletions internal/collector/collector.go
@@ -14,6 +14,9 @@ import (
"log/slog"
"sync"
"time"

"github.com/optiqor/kerno/internal/metrics"
"github.com/optiqor/kerno/internal/observability"
)

// Collector reads raw eBPF events, aggregates them, and produces typed
@@ -31,7 +34,7 @@ type Collector interface {

// Snapshot returns a point-in-time copy of the aggregated signals.
// The returned value is safe for concurrent read by other goroutines.
Snapshot() any
Snapshot() interface{}
}

// Registry manages the lifecycle of multiple collectors.
@@ -142,10 +145,66 @@ func (r *Registry) Signals(duration time.Duration) *Signals {
s.FD = v
case *MemorySnapshot:

Review comment (Member): The `case *CgroupMemorySnapshot: s.CgroupMemory = v` that lives right after this was deleted. That's the only place the cgroup-memory collector feeds `Signals`, so dropping it silently breaks the `memory_limit_pressure` and `memory_high_throttling` rules (they get nil and never fire). It's not intentional; it's fallout from the stale base, same as the `any` to `interface{}` revert on line 37 and across the collectors. Rebasing on upstream main makes all of these disappear.

s.Memory = v
case *CgroupMemorySnapshot:
s.CgroupMemory = v
}
}

return s
}
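
The failure mode raised in the review comment above comes from how a type switch routes snapshots: a snapshot whose type has no matching `case` is dropped without any error, and its field stays nil. The sketch below uses simplified stand-ins for the real types; the field names `UsedBytes` and `LimitBytes` and the `collect` helper are hypothetical.

```go
package main

import "fmt"

// Simplified stand-ins for the snapshot types in the diff.
type MemorySnapshot struct{ UsedBytes uint64 }
type CgroupMemorySnapshot struct{ LimitBytes uint64 }

// Signals mirrors the shape of the aggregate: one pointer per collector.
type Signals struct {
	Memory       *MemorySnapshot
	CgroupMemory *CgroupMemorySnapshot
}

// collect routes each snapshot to its field by dynamic type. Removing a
// case does not cause a compile error; the matching field just stays nil.
func collect(snaps []any) *Signals {
	s := &Signals{}
	for _, snap := range snaps {
		switch v := snap.(type) {
		case *MemorySnapshot:
			s.Memory = v
		case *CgroupMemorySnapshot:
			s.CgroupMemory = v
		}
	}
	return s
}

func main() {
	s := collect([]any{
		&MemorySnapshot{UsedBytes: 1 << 30},
		&CgroupMemorySnapshot{LimitBytes: 2 << 30},
	})
	fmt.Println("memory set:", s.Memory != nil, "cgroup memory set:", s.CgroupMemory != nil)
}
```

Since Go 1.18, `any` is an alias for `interface{}`, so the two spellings seen in the diff are interchangeable to the compiler and differ only in style.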

// RunSafeCollectorGoroutine wraps a collector's core processing loop with panic recovery,
// exponential backoff, and crash-loop safety.
func RunSafeCollectorGoroutine(ctx context.Context, name string, logger *slog.Logger, fn func()) {
go func() {
backoff := 1 * time.Second
for {
if ctx.Err() != nil {
return
}

panicked := true
disabled := false

func() {
defer func() {
if r := recover(); r != nil {
disabled = observability.GlobalHandler.HandlePanic(name, r, logger)
reason := "unknown"
if err, ok := r.(error); ok {
reason = err.Error()
} else if s, ok := r.(string); ok {
reason = s
}
metrics.CollectorPanicsTotal.WithLabelValues(name, reason).Inc()
}
}()

// Run the actual collector loop
fn()

Review comment (Member): Restarting `fn()` after a recovered panic is risky if the panic fired while a collector held `c.mu` (e.g. inside `record()`). `recover` doesn't release locks, so the restarted goroutine and any `Snapshot()` caller then deadlock on `c.mu` forever, which defeats the keep-alive goal. Either document that collector loops must not panic under lock, or reset/recreate per-collector state on restart rather than re-entering the same loop with a half-held lock.

panicked = false // If it returned normally, it didn't panic
}()

if !panicked {
return // Normal exit
}

if disabled {
logger.Error("collector permanently disabled due to crash-looping", "name", name)
metrics.CollectorDisabled.WithLabelValues(name).Set(1)
return // Exit goroutine permanently
}

// Backoff before restarting
logger.Warn("collector panicked, restarting after backoff", "name", name, "backoff", backoff)
select {
case <-time.After(backoff):
case <-ctx.Done():
return
}

backoff *= 2
if backoff > 60*time.Second {
backoff = 60 * time.Second
}
}
}()
}
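
The locking concern from the review comment above depends on how the mutex is released. Deferred calls run while a panic unwinds, so `defer mu.Unlock()` frees the lock before `recover` returns control, whereas an explicit `Unlock` placed after the panicking statement is never reached. The sketch below illustrates the difference with hypothetical names (`counter`, `recordDeferred`, `recordManual`, `lockFreeAfterPanic`); it uses `sync.Mutex.TryLock` (Go 1.18+) so the check itself cannot deadlock.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical collector-like type guarded by a mutex.
type counter struct {
	mu sync.Mutex
	n  int
}

// recordDeferred unlocks via defer, so the mutex is released even when
// the body panics.
func (c *counter) recordDeferred(fail bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if fail {
		panic("bad event")
	}
	c.n++
}

// recordManual unlocks explicitly; a panic before Unlock leaves the
// mutex held after the panic is recovered.
func (c *counter) recordManual(fail bool) {
	c.mu.Lock()
	if fail {
		panic("bad event")
	}
	c.n++
	c.mu.Unlock()
}

// lockFreeAfterPanic forces record to panic, recovers, and reports
// whether the mutex can still be acquired afterwards.
func lockFreeAfterPanic(c *counter, record func(bool)) bool {
	func() {
		defer func() { _ = recover() }()
		record(true)
	}()
	if c.mu.TryLock() {
		c.mu.Unlock()
		return true
	}
	return false
}

func main() {
	a, b := &counter{}, &counter{}
	fmt.Println("deferred unlock, lock free:", lockFreeAfterPanic(a, a.recordDeferred))
	fmt.Println("manual unlock, lock free:", lockFreeAfterPanic(b, b.recordManual))
}
```

Under this reading, a restart is only safe for code paths that release their locks with `defer`; any path that locks and unlocks explicitly would need the state reset the reviewer describes.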
6 changes: 4 additions & 2 deletions internal/collector/disk.go
@@ -60,7 +60,9 @@ func (c *DiskIOCollector) Start(ctx context.Context) error {
return fmt.Errorf("opening disk events: %w", err)
}

go c.consume(runCtx, ch)
RunSafeCollectorGoroutine(runCtx, c.Name(), c.logger, func() {
c.consume(runCtx, ch)
})
return nil
}

@@ -116,7 +118,7 @@ func (c *DiskIOCollector) record(event *bpf.DiskEvent) {
}

// Snapshot implements Collector. Returns *DiskIOSnapshot.
func (c *DiskIOCollector) Snapshot() any {
func (c *DiskIOCollector) Snapshot() interface{} {
c.mu.Lock()
defer c.mu.Unlock()

6 changes: 4 additions & 2 deletions internal/collector/fd.go
@@ -79,7 +79,9 @@ func (c *FDCollector) Start(ctx context.Context) error {
return fmt.Errorf("opening fd events: %w", err)
}

go c.consume(runCtx, ch)
RunSafeCollectorGoroutine(runCtx, c.Name(), c.logger, func() {
c.consume(runCtx, ch)
})
return nil
}

@@ -138,7 +140,7 @@ func (c *FDCollector) record(event *bpf.FDEvent) {
}

// Snapshot implements Collector. Returns *FDSnapshot.
func (c *FDCollector) Snapshot() any {
func (c *FDCollector) Snapshot() interface{} {
c.mu.Lock()
totalOpens := c.totalOpens
totalCloses := c.totalCloses
6 changes: 4 additions & 2 deletions internal/collector/memory.go
@@ -74,7 +74,9 @@ func (c *MemoryCollector) Start(ctx context.Context) error {
c.logger.Warn("initial memory poll failed", "error", err)
}

go c.loop(runCtx)
RunSafeCollectorGoroutine(runCtx, c.Name(), c.logger, func() {
c.loop(runCtx)
})
return nil
}

@@ -173,7 +175,7 @@ func (c *MemoryCollector) poll() error {
}

// Snapshot implements Collector. Returns *MemorySnapshot.
func (c *MemoryCollector) Snapshot() any {
func (c *MemoryCollector) Snapshot() interface{} {
c.mu.Lock()
defer c.mu.Unlock()
if !c.have {
6 changes: 4 additions & 2 deletions internal/collector/oom.go
@@ -54,7 +54,9 @@ func (c *OOMCollector) Start(ctx context.Context) error {
return fmt.Errorf("opening oom events: %w", err)
}

go c.consume(runCtx, ch)
RunSafeCollectorGoroutine(runCtx, c.Name(), c.logger, func() {
c.consume(runCtx, ch)
})
return nil
}

@@ -113,7 +115,7 @@ func (c *OOMCollector) record(event *bpf.OOMEvent) {
}

// Snapshot implements Collector. Returns *OOMSnapshot.
func (c *OOMCollector) Snapshot() any {
func (c *OOMCollector) Snapshot() interface{} {
c.mu.Lock()
defer c.mu.Unlock()

80 changes: 80 additions & 0 deletions internal/collector/panic_test.go
@@ -0,0 +1,80 @@
package collector

import (
"context"
"log/slog"
"os"
"testing"
"time"

"github.com/prometheus/client_golang/prometheus/testutil"

"github.com/optiqor/kerno/internal/metrics"
)

type faultInjectingCollector struct {
logger *slog.Logger
name string
panicCounts int
done chan struct{}
cancelFn context.CancelFunc
}

func (c *faultInjectingCollector) Name() string { return c.name }

func (c *faultInjectingCollector) Start(ctx context.Context) error {
runCtx, cancel := context.WithCancel(ctx)
c.cancelFn = cancel

RunSafeCollectorGoroutine(runCtx, c.name, c.logger, func() {
c.panicCounts++
if c.panicCounts <= 5 {
panic("synthetic error")
}
// Stay alive after 5 panics (if it wasn't disabled)
<-runCtx.Done()
})
return nil
}

func (c *faultInjectingCollector) Stop() {
if c.cancelFn != nil {
c.cancelFn()
}
}

func (c *faultInjectingCollector) Snapshot() interface{} { return nil }

func TestCollectorPanicRecovery(t *testing.T) {
logger := slog.New(slog.NewTextHandler(os.Stdout, nil))
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()

coll := &faultInjectingCollector{
logger: logger,
name: "faulty_collector",
done: make(chan struct{}),
}

metrics.CollectorPanicsTotal.Reset()
metrics.CollectorDisabled.Reset()

err := coll.Start(ctx)
if err != nil {
t.Fatalf("expected nil error, got %v", err)
}

// The restart backoff is 1s, 2s, 4s, ..., so reaching the fifth panic
// (and the disable threshold) takes longer than this test's 2s timeout.
// Only verify that at least one panic was recorded in the metrics.
time.Sleep(1500 * time.Millisecond)

count := testutil.ToFloat64(metrics.CollectorPanicsTotal.WithLabelValues("faulty_collector", "synthetic error"))
if count < 1 {
t.Errorf("expected at least 1 panic logged in metrics, got %v", count)
}

coll.Stop()
}
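
The timing remark in the test comments can be worked out explicitly. The sketch below computes when each panic would occur for a collector that panics immediately on every restart, using the same doubling-and-cap rule as the loop in `RunSafeCollectorGoroutine`. `panicOffsets` is a hypothetical helper, and the calculation assumes that panic handling itself takes negligible time.

```go
package main

import (
	"fmt"
	"time"
)

// panicOffsets returns when each of the first n panics would occur,
// measured from the first one, for a collector that panics immediately
// after every restart. The delay starts at initial, doubles after each
// panic, and is capped at max.
func panicOffsets(n int, initial, max time.Duration) []time.Duration {
	offsets := make([]time.Duration, 0, n)
	var elapsed time.Duration
	backoff := initial
	for i := 0; i < n; i++ {
		offsets = append(offsets, elapsed)
		elapsed += backoff
		backoff *= 2
		if backoff > max {
			backoff = max
		}
	}
	return offsets
}

func main() {
	for i, at := range panicOffsets(5, time.Second, 60*time.Second) {
		fmt.Printf("panic %d at +%v\n", i+1, at)
	}
}
```

With a 1-second initial delay the waits before the fifth panic add up to 1 + 2 + 4 + 8 = 15 seconds, so a test bounded by a 2-second context can observe only the first two panics. Making the initial backoff injectable would be one way to let a test reach the disable threshold quickly.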
6 changes: 4 additions & 2 deletions internal/collector/sched.go
@@ -78,7 +78,9 @@ func (c *SchedCollector) Start(ctx context.Context) error {
return fmt.Errorf("opening sched events: %w", err)
}

go c.consume(runCtx, ch)
RunSafeCollectorGoroutine(runCtx, c.Name(), c.logger, func() {
c.consume(runCtx, ch)
})
return nil
}

@@ -131,7 +133,7 @@ func (c *SchedCollector) record(event *bpf.SchedEvent) {
}

// Snapshot implements Collector. Returns *SchedSnapshot.
func (c *SchedCollector) Snapshot() any {
func (c *SchedCollector) Snapshot() interface{} {
c.mu.Lock()
total := c.total
globalSnap := c.global.Snapshot()
6 changes: 4 additions & 2 deletions internal/collector/syscall.go
@@ -84,7 +84,9 @@ func (c *SyscallCollector) Start(ctx context.Context) error {
return fmt.Errorf("opening syscall events: %w", err)
}

go c.consume(runCtx, ch)
RunSafeCollectorGoroutine(runCtx, c.Name(), c.logger, func() {
c.consume(runCtx, ch)
})
return nil
}

@@ -139,7 +141,7 @@ func (c *SyscallCollector) record(event *bpf.SyscallEvent) {
}

// Snapshot implements Collector. Returns *SyscallSnapshot.
func (c *SyscallCollector) Snapshot() any {
func (c *SyscallCollector) Snapshot() interface{} {
c.mu.Lock()
total := c.totalCount
entries := make([]SyscallEntry, 0, c.keys.Len())