7 changes: 1 addition & 6 deletions .devcontainer/devcontainer.json
@@ -1,7 +1,6 @@
{
"name": "kerno",
"image": "mcr.microsoft.com/devcontainers/go:1-1.25-bookworm",

"features": {
"ghcr.io/devcontainers/features/docker-in-docker:2": {},
"ghcr.io/devcontainers/features/github-cli:1": {},
@@ -10,9 +9,7 @@
"helm": "latest"
}
},

"postCreateCommand": "bash -lc 'sudo apt-get update && sudo apt-get install -y --no-install-recommends clang llvm libbpf-dev linux-headers-generic linux-tools-common bpftool jq make && go install github.com/cilium/ebpf/cmd/bpf2go@latest && curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(go env GOPATH)/bin v1.62.0 || true'",

"customizations": {
"vscode": {
"extensions": [
@@ -44,10 +41,8 @@
}
}
},

"remoteUser": "vscode",

"mounts": [
"source=/sys/kernel/btf/vmlinux,target=/sys/kernel/btf/vmlinux,type=bind,readonly,optional=true"
]
}
}
204 changes: 204 additions & 0 deletions .github/workflows/soak.yml
@@ -0,0 +1,204 @@
# Copyright 2026 Optiqor contributors
# SPDX-License-Identifier: Apache-2.0

name: Soak Test (24h)

on:
schedule:
- cron: '0 2 * * *' # nightly at 02:00 UTC
workflow_dispatch: # allow manual trigger

concurrency:
group: soak
cancel-in-progress: false # never cancel a running soak

permissions:
contents: read

env:
GO_VERSION: "1.25"
SOAK_DURATION: 86400 # 24 h in seconds
POLL_INTERVAL: 300 # scrape every 5 min
METRICS_PORT: 9090
PPROF_PORT: 6060
CSV: soak-results/metrics.csv

jobs:
soak:
name: 24-hour soak
runs-on: ubuntu-22.04
timeout-minutes: 1500 # 25 h hard ceiling (GitHub-hosted runners stop a job at 6 h regardless; a full 24 h run needs a self-hosted runner)

steps:
# ── Checkout ────────────────────────────────────────────────────────
- uses: actions/checkout@v6

- uses: actions/setup-go@v6
with:
go-version: ${{ env.GO_VERSION }}
cache: true

# ── System deps ─────────────────────────────────────────────────────
- name: Install system dependencies
run: |
sudo apt-get update -qq
sudo apt-get install -y --no-install-recommends \
clang llvm libbpf-dev linux-headers-generic \
linux-tools-common bpftool jq make bc

# ── Build ───────────────────────────────────────────────────────────
- name: Build kerno
run: make build

- name: Grant BPF capabilities
run: |
sudo setcap \
'cap_bpf,cap_perfmon,cap_sys_ptrace,cap_sys_admin,cap_net_admin,cap_dac_read_search+ep' \
./bin/kerno

# ── Prepare output dirs ─────────────────────────────────────────────
- name: Prepare output directories
run: |
mkdir -p soak-results/pprof
echo "ts_unix,rss_kb,goroutines,fds,bpf_maps,throughput_eps,doctor_p99_ms" \
> ${{ env.CSV }}

# ── Launch kerno ────────────────────────────────────────────────────
- name: Start kerno daemon
run: |
./bin/kerno start \
--metrics-addr :${{ env.METRICS_PORT }} \
--pprof-addr :${{ env.PPROF_PORT }} \
--log-level info \
> soak-results/kerno.log 2>&1 &
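KERNO_PID=$!
echo "KERNO_PID=$KERNO_PID" >> "$GITHUB_ENV"
sleep 5
# Values written to GITHUB_ENV only reach later steps, so check the local shell variable here.
kill -0 "$KERNO_PID" || { echo "kerno failed to start"; cat soak-results/kerno.log; exit 1; }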

# ── Launch chaos ────────────────────────────────────────────────────
- name: Start chaos load
run: |
./bin/kerno chaos \
--induce cascade \
--duration ${{ env.SOAK_DURATION }}s \
> soak-results/chaos.log 2>&1 &
echo "CHAOS_PID=$!" >> $GITHUB_ENV
sleep 3

# ── Monitoring loop ─────────────────────────────────────────────────
- name: Run monitoring loop
run: |
bash scripts/soak-watch.sh \
--duration ${{ env.SOAK_DURATION }} \
--interval ${{ env.POLL_INTERVAL }} \
--csv ${{ env.CSV }} \
--pprof-port ${{ env.PPROF_PORT }} \
--metrics-port ${{ env.METRICS_PORT }} \
--pprof-dir soak-results/pprof \
--pid ${{ env.KERNO_PID }}

# ── Stop processes ───────────────────────────────────────────────────
- name: Stop kerno and chaos
if: always()
run: |
kill ${{ env.KERNO_PID }} 2>/dev/null || true
kill ${{ env.CHAOS_PID }} 2>/dev/null || true
sleep 2

# ── Panic / Fatal check ──────────────────────────────────────────────
- name: Check for panics and fatals
run: |
echo "=== Scanning logs for panics/fatals ==="
PANIC_COUNT=$(grep -cEi 'panic|fatal|deadline exceeded' soak-results/kerno.log || true)
echo "panic/fatal count: $PANIC_COUNT"
if [ "$PANIC_COUNT" -gt 0 ]; then
echo "FAIL: $PANIC_COUNT panic/fatal/deadline-exceeded lines found"
grep -Ei 'panic|fatal|deadline exceeded' soak-results/kerno.log | tail -40
exit 1
fi
echo "PASS: No panics or fatals"

# ── Assert pass criteria ─────────────────────────────────────────────
- name: Assert soak pass criteria
run: |
python3 - <<'PYEOF'
import csv, sys

rows = []
with open("${{ env.CSV }}") as f:
for r in csv.DictReader(f):
rows.append(r)

if len(rows) < 13: # rows[12], the hour-1 baseline, must exist
print(f"Only {len(rows)} rows — soak may not have run long enough")
sys.exit(1)

def v(row, key):
return float(row[key]) if row[key] not in ('', 'N/A') else None

# Hour-1 baseline = row 12 (index 12, ~60 min at 5-min intervals)
baseline = rows[12]
final = rows[-1]

failures = []

# RSS: final < 1.5x hour-1
rss_base = v(baseline, 'rss_kb')
rss_final = v(final, 'rss_kb')
if rss_base and rss_final:
ratio = rss_final / rss_base
print(f"RSS ratio final/hour1: {ratio:.3f} (limit 1.5)")
if ratio >= 1.5:
failures.append(f"RSS leak: {rss_final:.0f} KB final vs {rss_base:.0f} KB at hour-1 (ratio {ratio:.2f})")

# Goroutines: final <= hour-1 + 50
gor_base = v(baseline, 'goroutines')
gor_final = v(final, 'goroutines')
if gor_base and gor_final:
delta = gor_final - gor_base
print(f"Goroutine delta: {delta:.0f} (limit +50)")
if delta > 50:
failures.append(f"Goroutine leak: {gor_final:.0f} final vs {gor_base:.0f} at hour-1 (delta {delta:.0f})")

# FD: compare warm-up end (row 6, ~30 min) to final
warmup = rows[6]
fd_warm = v(warmup, 'fds')
fd_final = v(final, 'fds')
if fd_warm and fd_final:
fd_delta = fd_final - fd_warm
print(f"FD delta post-warmup: {fd_delta:.0f} (limit 50)")
if fd_delta > 50:
failures.append(f"FD leak: {fd_final:.0f} final vs {fd_warm:.0f} post-warmup (delta {fd_delta:.0f})")

# Throughput stability: stdev/mean <= 0.20 across all rows
tps = [v(r, 'throughput_eps') for r in rows if v(r, 'throughput_eps') is not None]
if len(tps) > 5:
mean = sum(tps) / len(tps)
variance = sum((x - mean)**2 for x in tps) / len(tps)
stdev = variance ** 0.5
cv = stdev / mean if mean > 0 else 0
print(f"Throughput CV: {cv:.3f} (limit 0.20), mean={mean:.1f} eps")
if cv > 0.20:
failures.append(f"Throughput unstable: CV={cv:.3f} > 0.20")

if failures:
print("\nSOAK FAILED:")
for f in failures:
print(f" FAIL: {f}")
sys.exit(1)

print("\nSOAK PASSED: All criteria met.")
PYEOF

# ── Upload artifacts ─────────────────────────────────────────────────
- name: Upload soak artifacts
if: always()
uses: actions/upload-artifact@v4
with:
name: soak-report-${{ github.run_id }}
retention-days: 30
path: |
soak-results/metrics.csv
soak-results/kerno.log
soak-results/chaos.log
soak-results/pprof/
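
The pass criteria asserted by the workflow above can also be exercised locally against synthetic data. The following is a minimal sketch, not part of the PR: `evaluate_soak` and `synthetic_rows` are hypothetical names, and the thresholds are copied from the inline script (RSS ratio below 1.5, goroutine and FD deltas of at most 50, throughput coefficient of variation of at most 0.20). It assumes the CSV columns written in the "Prepare output directories" step.

```python
import statistics


def evaluate_soak(rows, interval_s=300):
    """Return a list of failure strings for CSV-style dict rows (empty list = pass)."""

    def v(row, key):
        raw = row.get(key, "")
        return float(raw) if raw not in ("", "N/A") else None

    per_hour = 3600 // interval_s  # 12 samples per hour at 5-minute polling
    if len(rows) <= per_hour:
        return [f"only {len(rows)} rows; need more than {per_hour}"]

    baseline = rows[per_hour]      # hour-1 sample
    warmup = rows[per_hour // 2]   # ~30-minute sample
    final = rows[-1]
    failures = []

    # RSS: final must stay below 1.5x the hour-1 value.
    rss_base, rss_final = v(baseline, "rss_kb"), v(final, "rss_kb")
    if rss_base and rss_final and rss_final / rss_base >= 1.5:
        failures.append(f"RSS leak: ratio {rss_final / rss_base:.2f}")

    # Goroutines: final may exceed hour-1 by at most 50.
    gor_base, gor_final = v(baseline, "goroutines"), v(final, "goroutines")
    if gor_base and gor_final and gor_final - gor_base > 50:
        failures.append(f"Goroutine leak: delta {gor_final - gor_base:.0f}")

    # File descriptors: final may exceed the post-warm-up value by at most 50.
    fd_warm, fd_final = v(warmup, "fds"), v(final, "fds")
    if fd_warm and fd_final and fd_final - fd_warm > 50:
        failures.append(f"FD leak: delta {fd_final - fd_warm:.0f}")

    # Throughput: population stdev / mean across all rows must not exceed 0.20.
    tps = [x for x in (v(r, "throughput_eps") for r in rows) if x is not None]
    if len(tps) > 5:
        mean = statistics.fmean(tps)
        cv = statistics.pstdev(tps) / mean if mean > 0 else 0.0
        if cv > 0.20:
            failures.append(f"Throughput unstable: CV={cv:.3f}")

    return failures


def synthetic_rows(n, rss_growth_kb=0):
    """Build n fake samples; rss_growth_kb adds a linear RSS leak per sample."""
    return [
        {
            "ts_unix": str(i * 300),
            "rss_kb": str(100_000 + i * rss_growth_kb),
            "goroutines": "40",
            "fds": "30",
            "bpf_maps": "6",
            "throughput_eps": "1000",
            "doctor_p99_ms": "N/A",
        }
        for i in range(n)
    ]
```

Calling `evaluate_soak(synthetic_rows(289))` models a clean 24-hour run at 5-minute polling, while a non-zero `rss_growth_kb` simulates a leak that should trip only the RSS check.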
36 changes: 19 additions & 17 deletions README.md
@@ -1,47 +1,48 @@
<div align="center">
<div align="center">

# KERNO

### The production incident diagnosis engine for Kubernetes

**Your cluster broke. Your dashboards are green. Users are paging.**
**Run `kerno doctor`. 30 seconds. Root cause. Plain English.**

<sub>Same single binary runs on bare metal, VMs, EC2, GCE - wherever Linux lives.</sub>

[![CI](https://github.com/optiqor/kerno/actions/workflows/ci.yml/badge.svg)](https://github.com/optiqor/kerno/actions/workflows/ci.yml)
[![Soak](https://github.com/optiqor/kerno/actions/workflows/soak.yml/badge.svg)](https://github.com/optiqor/kerno/actions/workflows/soak.yml)
[![Go Report Card](https://goreportcard.com/badge/github.com/optiqor/kerno)](https://goreportcard.com/report/github.com/optiqor/kerno)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![Release](https://img.shields.io/github/v/release/optiqor/kerno?include_prereleases)](https://github.com/optiqor/kerno/releases)
[![GHCR](https://img.shields.io/badge/ghcr.io-optiqor%2Fkerno-blue?logo=docker)](https://github.com/optiqor/kerno/pkgs/container/kerno)
![Go Version](https://img.shields.io/github/go-mod/go-version/optiqor/kerno)

[**Quick Start**](#quick-start) · [**How It Works**](#how-it-works) · [**Features**](#features) · [**Kubernetes**](#kubernetes-deployment) · [**Docs**](docs/architecture.md)
[**Quick Start**](#quick-start) · [**How It Works**](#how-it-works) · [**Features**](#features) · [**Kubernetes**](#kubernetes-deployment) · [**Docs**](docs/architecture.md)

<img src="demo.gif" alt="kerno doctor demo" width="900" />

</div>

---

## What is Kerno?

Kerno is a **Kubernetes-native incident diagnosis engine** built on eBPF.
It runs as a DaemonSet on every node, watches the kernel - not your app - and answers a single question on demand:

> *Why is production broken right now?*

```bash
kubectl -n kerno-system exec ds/kerno -- kerno doctor
```

30 seconds later you get a ranked diagnostic report with **plain-English causes, evidence, ETAs, and copy-paste fix steps** - no dashboards to wire, no query language to learn, no agents in your app.

The kernel knows minutes before your APM. Hours before your users. Kerno makes that visible.

**Same binary outside Kubernetes too.** `curl | bash` it onto any bare-metal box, EC2 instance, or systemd VM and `sudo kerno doctor` works exactly the same.

## Why Kerno?

It's 3am. PagerDuty fires. Latency is up, error budget is burning, and every dashboard you own is **green**.

@@ -61,8 +62,8 @@
end

subgraph Tools["WHO WATCHES WHAT"]
APM["Datadog · New Relic<br/>Prometheus · Grafana"]
CRun["Pixie · Tetragon<br/>Inspektor Gadget"]
APM["Datadog · New Relic<br/>Prometheus · Grafana"]
CRun["Pixie · Tetragon<br/>Inspektor Gadget"]
Kerno["<b>KERNO</b><br/><i>eBPF kernel tracing</i>"]
Bare["(nobody)"]
end
@@ -115,7 +116,7 @@

> **Requires:** kernel **5.8+** with BTF (every major managed K8s qualifies: EKS, GKE, AKS, DOKS, Linode, Civo). For raw manifests/Helm you'll need cluster-admin.

### 1 · Kubernetes (primary)
### 1 · Kubernetes (primary)

```bash
helm install kerno ./deploy/helm/kerno \
@@ -140,7 +141,7 @@

---

### 2 · Bare metal · VMs · EC2 · GCE
### 2 · Bare metal · VMs · EC2 · GCE

The same binary, the same command. No Kubernetes required.

@@ -156,7 +157,7 @@
journalctl -u kerno -f
```

### 3 · Docker (ad-hoc, any host with a privileged daemon)
### 3 · Docker (ad-hoc, any host with a privileged daemon)

```bash
docker run --rm --privileged --pid=host \
@@ -223,7 +224,7 @@
| `/sys/kernel/debug` | tracepoints, kprobes |
| `/sys/kernel/btf` | CO-RE type resolution |
| `/sys/fs/bpf` | BPF map pinning |
| `/proc` | PID cgroup pod resolution |
| `/proc` | PID → cgroup → pod resolution |
| `/sys/fs/cgroup` | container resource accounting |
| `/sys/class/net` | per-interface TCP counters |
| `/sys/block` | per-device disk stats |
@@ -316,7 +317,7 @@

```mermaid
flowchart TB
subgraph Kernel["KERNEL SPACE · eBPF Programs"]
subgraph Kernel["KERNEL SPACE · eBPF Programs"]
direction LR
P1["syscall<br/>latency"]
P2["tcp<br/>monitor"]
@@ -328,20 +329,20 @@

RB[("Ring Buffers<br/>256KB per program<br/>zero-copy mmap")]

subgraph UserSpace["USER SPACE · Go"]
subgraph UserSpace["USER SPACE · Go"]
direction TB
Loader["BPF Loaders<br/>cilium/ebpf"]
Collector["Collectors<br/>percentile aggregation"]
Signals[("Signals Snapshot<br/>single source of truth")]
Adapter["Environment Adapter<br/>k8s · systemd · bare metal"]
Adapter["Environment Adapter<br/>k8s · systemd · bare metal"]
end

subgraph Outputs["OUTPUTS"]
direction TB
Doctor["Doctor Engine<br/>11 diagnostic rules"]
AI["AI Layer <i>(optional)</i><br/>root cause analysis"]
Prom["Prometheus<br/>/metrics :9090"]
CLI["Terminal<br/>pretty · JSON"]
CLI["Terminal<br/>pretty · JSON"]
end

P1 & P2 & P3 & P4 & P5 & P6 --> RB
@@ -465,7 +466,7 @@
kubectl -n kerno-system exec ds/kerno -- kerno trace sched --threshold 10ms
```

### Continuous monitoring - "alert me when…"
### Continuous monitoring - "alert me when…"

```bash
# TCP connections with retransmits
@@ -515,9 +516,9 @@

**Environment auto-detection.** Kerno picks one of three adapters and enriches every event - no configuration required:

- **Kubernetes** (in-cluster token present) pod, namespace, node, deployment
- **Systemd** (PID 1 is systemd) unit, slice, scope
- **Bare metal** hostname, cgroup path
- **Kubernetes** (in-cluster token present) → pod, namespace, node, deployment
- **Systemd** (PID 1 is systemd) → unit, slice, scope
- **Bare metal** → hostname, cgroup path
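
The three-way adapter choice listed above can be sketched as follows. This is an illustration in Python, not Kerno's Go implementation: `detect_environment` is a hypothetical name, and the two probes are assumptions based on standard conventions (the in-cluster service account token path, and `/proc/1/comm` reading `systemd` when PID 1 is systemd). The diff does not show which checks Kerno actually performs.

```python
from pathlib import Path


def detect_environment(root=Path("/")):
    """Pick an adapter name by probing the filesystem under `root`.

    `root` exists only so the function can be pointed at a fake tree in tests;
    on a real host it stays at "/".
    """
    # Assumption: Kubernetes mounts the pod's service account token here.
    token = root / "var/run/secrets/kubernetes.io/serviceaccount/token"
    if token.is_file():
        return "kubernetes"

    # Assumption: /proc/1/comm holds the name of the init process.
    try:
        if (root / "proc/1/comm").read_text().strip() == "systemd":
            return "systemd"
    except OSError:
        pass  # /proc unreadable or missing: fall through to bare metal

    return "bare-metal"
```

The Kubernetes probe runs first because a pod can itself run systemd as PID 1, in which case the pod metadata is the more useful enrichment.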

**AI (optional).** The AI layer runs **after** the deterministic rule engine - it correlates cross-signals and explains root causes, it never replaces rules. Three providers (**Anthropic**, **OpenAI**, **Ollama** for air-gapped), three privacy modes (`full` / `redacted` / `summary`), TTL cache + token-bucket rate limiting, graceful fallback to a deterministic template on failure. No LLM SDK dependencies - pure `net/http`.
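
The token-bucket rate limiting and template fallback mentioned above can be illustrated with a short sketch. It is written in Python for brevity and is not Kerno's code: `TokenBucket`, `explain`, and their parameters are hypothetical, and the rate and capacity values a real deployment would use are not stated in the README.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock            # injectable so tests can control time
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def explain(bucket, ask_model, fallback_text):
    """Call the model only when the bucket permits; otherwise use the template."""
    if not bucket.allow():
        return fallback_text
    try:
        return ask_model()
    except Exception:
        # Any provider failure degrades to the deterministic template.
        return fallback_text
```

With this shape, a burst of `kerno doctor` invocations spends at most `capacity` model calls at once, and both throttled and failed requests return the deterministic text instead of an error.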

Expand Down Expand Up @@ -654,6 +655,7 @@

---

If Kerno saved your on-call shift, consider leaving a **** it helps other engineers find the project.
If Kerno saved your on-call shift, consider leaving a **⭐** - it helps other engineers find the project.

</div>
