feat(ci): add 24-hour nightly soak test with leak detection #84

Closed
purvask2006-collab wants to merge 6 commits into optiqor:main from purvask2006-collab:feat/soak-test

Conversation

@purvask2006-collab

@purvask2006-collab purvask2006-collab commented May 16, 2026

What

Adds a nightly 24-hour soak test that continuously monitors kerno under
chaos load and asserts that RSS, goroutine count, file descriptors, and
event throughput stay within defined bounds across the full run.

Why

Fixes #

How

The workflow builds kerno, grants BPF capabilities via setcap, then runs
kerno start and kerno chaos --induce cascade --duration 86400s in
parallel. Every 5 minutes, soak-watch.sh scrapes RSS from /proc, goroutine
count and heap profiles from the pprof endpoint, open FDs, pinned BPF map
count via bpftool, and event throughput from the Prometheus metrics endpoint.
All data is appended to a CSV. Pprof snapshots are saved at hours 1, 6, 12,
18, and 24. At the end of the run a Python assertion script evaluates all
pass criteria and exits non-zero on any regression. The full CSV, logs, and
pprof dumps are uploaded as a GitHub Actions artifact on every run regardless
of pass or fail, so any failure can be reproduced locally.
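
The end-of-run assertion step described above could look roughly like the following. This is a minimal sketch, not the script in this PR: the CSV column names (rss_kb, goroutines, open_fds, events_per_sec), the function names, and every threshold are assumptions chosen for illustration.

```python
import csv
import sys

# Hypothetical limits; the PR description does not state the real pass criteria.
MAX_GROWTH = {"rss_kb": 1.20, "goroutines": 1.10, "open_fds": 1.10}
MIN_EVENTS_PER_SEC = 100.0


def evaluate(rows):
    """Return a list of failure messages for a list of CSV row dicts."""
    failures = []
    if len(rows) < 2:
        return ["need at least two samples to compare"]
    first, last = rows[0], rows[-1]
    # Leak check: compare the last sample against the first for each resource.
    for column, limit in MAX_GROWTH.items():
        start, end = float(first[column]), float(last[column])
        if start > 0 and end / start > limit:
            failures.append(f"{column} grew {end / start:.2f}x (limit {limit:.2f}x)")
    # Throughput check: the slowest sample must stay above the assumed floor.
    slowest = min(float(row["events_per_sec"]) for row in rows)
    if slowest < MIN_EVENTS_PER_SEC:
        failures.append(f"events_per_sec dropped to {slowest:.1f}")
    return failures


def main(path):
    """Evaluate a soak CSV and return a process exit code (0 = pass)."""
    with open(path, newline="") as handle:
        failures = evaluate(list(csv.DictReader(handle)))
    for message in failures:
        print(f"FAIL: {message}", file=sys.stderr)
    return 1 if failures else 0
```

A CI step would then call something like sys.exit(main("soak.csv")), so that any listed failure turns into a non-zero exit status.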

The local script (scripts/soak-watch.sh) accepts --duration and --interval
flags so engineers can run a short smoke soak (10 minutes) without waiting
24 hours.
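
For readers unfamiliar with the /proc sampling the watcher relies on, here is a minimal Python sketch of a single RSS sample. It is not soak-watch.sh itself, which is a bash script; the function names are hypothetical, and the sketch assumes RSS is taken from field 24 of /proc/<pid>/stat, which counts resident pages.

```python
import os


def parse_rss_pages(stat_line):
    """Extract the resident set size, in pages, from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the last ')' before counting fields.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so field 24 (rss) sits at index 21.
    return int(after_comm[21])


def read_rss_kib(pid):
    """Read the current RSS of a live process in KiB (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        pages = parse_rss_pages(handle.read())
    return pages * os.sysconf("SC_PAGE_SIZE") // 1024


def sample_count(duration_s, interval_s):
    """Number of whole sampling intervals that fit in a run."""
    return duration_s // interval_s
```

Under these assumptions, a 600-second smoke soak at a 60-second interval yields 10 samples, and the nightly 86400-second run at a 5-minute interval yields 288.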

Testing

  • go build ./... passes
  • go test ./... passes
  • go vet ./... passes
  • golangci-lint run ./... passes
  • N/A — pure CI infrastructure and docs, no changes to Go source
  • Tested locally with: bash scripts/soak-watch.sh --duration 600 --interval 60 --pid <kerno-pid>

Checklist

  • PR title follows Conventional Commits (feat(scope): subject)
  • All commits are DCO-signed (git commit -s)
  • No unrelated changes pulled in
  • Documentation updated — docs/soak.md added with local run guide and
    failure interpretation instructions
  • Soak badge added to README.md
  • N/A — no new doctor rules

…qor#47)

- Add internal/ai/http_client.go: shared *http.Client builder
  * Honours HTTPS_PROXY/HTTP_PROXY/NO_PROXY env vars (Go default)
  * config.ai.proxy overrides env for explicit per-provider proxy
  * config.ai.ca_cert_file appended to system root pool (never replaces)
  * Actionable TLS error: cert subject + issuer + enterprise docs link
  * config.ai.insecure_skip_verify for dev only (loud stderr warning)
  * config.ai.timeout with 30s default

- Update anthropic.go, openai.go, ollama.go to use shared client
  instead of inline &http.Client{}

- Add config fields: ai.proxy, ai.ca_cert_file,
  ai.insecure_skip_verify, ai.timeout

- Add internal/ai/http_client_test.go: 7 tests covering
  custom CA success, wrong CA error, default case, bad CA file,
  insecure skip verify, explicit proxy, empty proxy

- Add docs/enterprise.md: 4 deployment scenarios, mitmproxy
  verification steps, openssl CA extraction one-liner

Closes optiqor#47
…or#47)

- Update anthropic.go, openai.go, ollama.go to use shared NewHTTPClient
  instead of inline &http.Client{}
- Add ai.proxy, ai.ca_cert_file, ai.insecure_skip_verify, ai.timeout
  fields to AIConfig in internal/config/config.go

Signed-off-by: purvask2006-collab <purvask2006@gmail.com>
…nvironments with authenticating proxies and MITM CA certificates.

Signed-off-by: purvask2006-collab <purvask2006@gmail.com>
- Add internal/doctor/baselines.go: sliding ring-buffer Tracker with
  sigma mode (normal metrics) and ratio mode (skewed/log-distributed)
- Add internal/doctor/baselines_test.go: 12 tests covering warmup
  suppression, stable-then-spike detection, 3x WARNING / 10x CRITICAL
- Extend internal/config/config.go with BaselinesConfigYAML + helper
- Wire Tracker into engine.go via WithBaselines()
- Add adaptive overlays to all four threshold rules in rules.go
- Add BaselineAnnotation field to Finding; render highlighted in render.go
- Update AI system prompt in prompt.go to reference baseline context

Static absolute floors are preserved alongside adaptive limits.
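
The two detection modes named in this commit message can be illustrated with a short Python sketch. It is not the Go code in internal/doctor/baselines.go: the 3x WARNING and 10x CRITICAL ratios come from the message above, while the class layout, the window size, the warmup count, the median baseline, and the three-sigma cut-off are all assumptions. Static absolute floors are not modelled here.

```python
import statistics
from collections import deque


class Tracker:
    """Sliding-window baseline over the most recent `size` samples."""

    def __init__(self, size=60, warmup=10, mode="sigma"):
        self.window = deque(maxlen=size)  # ring buffer: old samples fall off
        self.warmup = warmup
        self.mode = mode

    def observe(self, value):
        """Classify `value` against the baseline, then add it to the window."""
        verdict = self._classify(value)
        self.window.append(value)
        return verdict

    def _classify(self, value):
        if len(self.window) < self.warmup:
            return "OK"  # warmup suppression: too little history to judge
        if self.mode == "ratio":
            # Ratio mode: compare against the window median (assumed baseline).
            baseline = statistics.median(self.window)
            ratio = value / baseline if baseline > 0 else 0.0
            if ratio >= 10:
                return "CRITICAL"
            return "WARNING" if ratio >= 3 else "OK"
        # Sigma mode: flag values far above the window mean.
        mean = statistics.fmean(self.window)
        spread = statistics.pstdev(self.window)
        # Assumed cut-off of three standard deviations; a perfectly flat
        # window has no spread to measure against, so it is skipped here.
        if spread > 0 and value > mean + 3 * spread:
            return "WARNING"
        return "OK"
```

Ratio mode suits skewed metrics because a multiple of the median is insensitive to a few large outliers already in the window, whereas a standard deviation would be inflated by them.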
@github-actions

🚀 First PR — welcome aboard!

A few things to expect:

  1. CI: every PR runs build + race tests + lint + (eventually) the kernel matrix. If something fails, the log will tell you exactly which gate.
  2. DCO: every commit needs a Signed-off-by: trailer; git commit -s adds it automatically.
  3. Conventional Commits: PR titles like feat(doctor): add new rule or fix(bpf): handle X. We squash-merge by default.
  4. Review: a maintainer will review within 72 hours. Suggestions are conversations, not orders — push back if something doesn't fit your context.

If you get stuck, reply here or jump to Discussions. We want this PR to land.

@github-actions github-actions Bot added the documentation, testing, area/doctor, area/integrations, and area/ops labels May 16, 2026
@purvask2006-collab purvask2006-collab force-pushed the feat/soak-test branch 2 times, most recently from e364986 to 63c9a44 May 16, 2026 15:59
Signed-off-by: purvask2006-collab <purvask2006@gmail.com>
@purvask2006-collab purvask2006-collab changed the title from "feat(ci): 24-hour nightly soak test with leak detection" to "feat(ci): add 24-hour nightly soak test with leak detection" May 16, 2026
@btwshivam
Member

Don't spam PRs. First get your past PR merged.

@btwshivam
Member

btwshivam commented May 16, 2026

Please focus on learning rather than blindly using AI for points.

@purvask2006-collab
Author

Understood, and I apologize for the friction. I got ahead of myself trying to fix things and open this soak test infrastructure without realising I was cluttering the workflow and pulling in unrelated commits.

I really want to learn the proper maintenance workflow here. I will stop opening new PRs, put this branch into a draft, and focus 100% of my attention on fixing the build, reverting the out-of-scope model changes, and getting the enterprise HTTP client PR cleanly merged first.

@purvask2006-collab
Author

@btwshivam
To give some context on why I spent time building this: I didn’t mean for it to look like a generic AI-generated contribution. I explicitly built this infrastructure to address a real engineering gap I encountered while testing my own eBPF and AI provider code locally.

Here is why a standard test suite isn't enough for Kerno, and why I want to get this landed:
1- Kerno manages raw kernel-space eBPF maps and continuous event streams. Standard unit tests won't catch slow memory accumulation, file descriptor leaks, or unpinned BPF map bloat over time.
2- I designed soak-watch.sh to target /proc/[pid]/stat for precise RSS footprints and systematically scrape the Go runtime pprof endpoints. This gives us concrete, empirical CSV data across hours of runtime rather than relying on assumptions about runtime performance.
3- Running kerno chaos --induce cascade alongside the automated metric collection ensures the agent is aggressively stressed, guaranteeing our goroutine and memory boundaries hold true under worst-case production scenarios.

One more thing: the script scrapes event throughput data directly from Kerno's Prometheus metrics endpoint. Given that kerno chaos --induce cascade creates highly volatile spikes, should we smooth out the throughput pass/fail assertion using a rolling average, or should we strictly fail the test if throughput drops below a hard floor at any point during the 24-hour cycle?
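
To make the two options in this question concrete, here is a minimal Python sketch of both checks. The function names and the example numbers are illustrative assumptions, not values from this PR.

```python
from collections import deque


def hard_floor_ok(samples, floor):
    """Pass only if every single sample stays at or above the floor."""
    return all(value >= floor for value in samples)


def rolling_average_ok(samples, floor, window):
    """Pass if the mean of every full window of samples stays at or above the floor."""
    recent = deque(maxlen=window)
    for value in samples:
        recent.append(value)
        # Only judge once the window is full, so early samples are not over-weighted.
        if len(recent) == window and sum(recent) / window < floor:
            return False
    return True
```

With samples [500, 480, 90, 510, 495, 505], a floor of 100, and a window of 3, the hard-floor check fails on the single dip to 90 while the rolling check passes. A sustained drop such as [500, 80, 90, 70, 500] fails both, which is the trade-off the question is about: the rolling average tolerates brief chaos-induced dips but still catches a prolonged stall.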

@btwshivam
Member

btwshivam commented May 16, 2026

@purvask2006-collab Hello! Slow down, I understand your efforts. I am saying: first focus on getting your open PRs merged, then we can jump on this, calmly and without any distraction.

@btwshivam
Member

Also, Purva, always rebase your branch on main before opening a PR so that changes from other branches don't get carried along.

@btwshivam
Member

btwshivam commented May 16, 2026

I'll get back to you on your findings when I review this PR. For now, rebase it and remove the extra file changes.

@purvask2006-collab
Author

Yes, thank you so much. I am working on it and will make the required changes.

@btwshivam
Member

Fix the conflicts and rebase with main, then I'll review it. Tag me, thanks!

@purvask2006-collab
Author

Hi @btwshivam,

Got it. I am currently working on rebasing the branch with main, resolving the merge conflicts, and cleaning up any extra file changes as requested.

I will tag you here as soon as the branch is clean and ready for your review.

@btwshivam btwshivam closed this Jun 6, 2026

Labels

  • area/doctor: Diagnostic engine and rules
  • area/integrations: External integrations (sinks, exports, CI)
  • area/ops: Operations, deployment, runtime ergonomics
  • documentation: Improvements or additions to documentation
  • testing: Tests and test coverage

2 participants