FALCO-BUNDLE: mount custom-rules + disable falcosidekick UI/Redis (F-1+F-2+F-3) #49
Merged
…+F-2)

F-1: Mount the in-repo falco-custom-rules ConfigMap (security ns) into /etc/falco/rules.d via the chart's mounts.volumes/volumeMounts hooks. falco.rules_files already enumerated /etc/falco/rules.d, so this wires the Git-managed rule library into the running engine without a chart fork.

F-2: Disable the falcosidekick web UI Deployment and its bundled Redis StatefulSet (1Gi iSCSI PVC) that the falcosecurity/falco 8.0.0 chart turns on by default via the falcosidekick 0.12.x subchart. Alerts are surfaced via Falco → falcosidekick HTTP (the falco.http_output block is preserved), so the web UI + Redis add storage and attack surface for no benefit on a single-operator homelab. Heads-up: post-merge cleanup will leave an orphan security-falco-falcosidekick-ui-redis-data PV on the Retain-policy iSCSI StorageClass; manual TrueNAS zvol cleanup is needed to reclaim the 1Gi.

F-3: No HR change needed. Static driver.kind=modern_ebpf was already correct; a runtime probe (k8sop dev kubectl -n security logs ds/security-falco-falcosecurity -c falco) confirmed the engine line 'Opening syscall source with modern BPF probe.' on all 6 pods. Logged in PR body, no fallback observed.

Also tweak accept.sh: drop accept.sh from yamllint TOUCH_PATHS (it's a shell script, not YAML) and add a bash -n syntax check instead.
alexrf45 added a commit that referenced this pull request May 28, 2026
## Closes
**O-7 remaining alerts** from
[_docs/reviews/home-0ps-review-2026-05-28.md](../blob/dev/_docs/reviews/home-0ps-review-2026-05-28.md).
The two CNPG alerts (`CNPGBackupStale`, `CNPGDumpCronJobStale`) already
exist; this PR adds the three remaining.
## What's in this PR
### `CertExpiringSoon` (2 variants)
- **Warning** when expiry < **14 days**, sustained 1h → routes to
`slack-warning`
- **Critical** when expiry < **3 days**, sustained 1h → routes to
`slack-critical`
- Metric: **`certmanager_certificate_expiration_timestamp_seconds`** ←
**brief said `cert_manager_*` (underscore split); live cluster exports
`certmanager_*` (no underscore).** Verified via Prometheus
`/api/v1/label/__name__/values`.
- Labels: `severity`, `app: cert-manager` for runbook routing
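
As an illustration only, the two variants described above could be written as a rule-group fragment along these lines. The PR's actual rule file is not shown here, so the group name, the exact form of the expression (expiry timestamp minus `time()`), and the omission of annotations are all assumptions; only the metric name, thresholds, `for` durations, and labels come from the description.

```yaml
# Sketch, not the PR's actual rules. Group name is hypothetical.
- name: cert-manager.custom
  rules:
    - alert: CertExpiringSoon
      # Seconds remaining until expiry, compared against 14 days.
      expr: |
        certmanager_certificate_expiration_timestamp_seconds - time()
          < 14 * 24 * 3600
      for: 1h
      labels:
        severity: warning      # routed to slack-warning
        app: cert-manager
    - alert: CertExpiringSoon
      # Same expression with a 3-day threshold.
      expr: |
        certmanager_certificate_expiration_timestamp_seconds - time()
          < 3 * 24 * 3600
      for: 1h
      labels:
        severity: critical     # routed to slack-critical
        app: cert-manager
```

With thresholds nested like this, a certificate inside the 3-day window satisfies both expressions, so both variants would fire unless Alertmanager inhibition is configured.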
### `PVCNearFull` (2 variants)
- **Warning** when `available/capacity < 0.10`, sustained 30m →
`slack-warning`
- **Critical** when `available/capacity < 0.05`, sustained 10m →
`slack-critical`
- Metrics: `kubelet_volume_stats_available_bytes /
kubelet_volume_stats_capacity_bytes`
- **Excludes `*-dumps-pvc`** via `persistentvolumeclaim!~".*-dumps-pvc"`
— S-6 dump PVCs fill the volume by design
- Labels: `severity`, `pvc: {{ $labels.persistentvolumeclaim }}`
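
A minimal sketch of how the ratio and the exclusion described above might look as rules. This is not the PR's actual rule text: the group name is hypothetical, and applying the `persistentvolumeclaim!~".*-dumps-pvc"` matcher to both sides of the division is one possible placement.

```yaml
# Sketch, not the PR's actual rules. Group name is hypothetical.
- name: storage.custom
  rules:
    - alert: PVCNearFull
      # Fraction of the volume still free, excluding dump PVCs.
      expr: |
        kubelet_volume_stats_available_bytes{persistentvolumeclaim!~".*-dumps-pvc"}
          / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim!~".*-dumps-pvc"}
          < 0.10
      for: 30m
      labels:
        severity: warning
        pvc: "{{ $labels.persistentvolumeclaim }}"
    - alert: PVCNearFull
      expr: |
        kubelet_volume_stats_available_bytes{persistentvolumeclaim!~".*-dumps-pvc"}
          / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim!~".*-dumps-pvc"}
          < 0.05
      for: 10m
      labels:
        severity: critical
        pvc: "{{ $labels.persistentvolumeclaim }}"
```

The templated `pvc` label is quoted so YAML treats the `{{ ... }}` expression as a string rather than a flow mapping.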
### `GatusEndpointDown` (1 variant)
- **Warning** when `gatus_results_endpoint_success == 0` for 5m →
`slack-warning`
- Metric verified live via in-cluster Prometheus API — gauge labelled by
`key` (e.g. `applications_authentik`), `group`, `name`, `type`
- Labels: `severity=warning`, `app=gatus`, `endpoint: {{ $labels.key }}`
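
For reference, the single Gatus rule described above could be sketched as follows. The group name is hypothetical and annotations are left out; the expression, duration, and labels follow the bullets.

```yaml
# Sketch, not the PR's actual rule. Group name is hypothetical.
- name: gatus.custom
  rules:
    - alert: GatusEndpointDown
      # The gauge is 0 when the most recent check of an endpoint failed.
      expr: gatus_results_endpoint_success == 0
      for: 5m
      labels:
        severity: warning
        app: gatus
        endpoint: "{{ $labels.key }}"   # e.g. applications_authentik
```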
## Existing rules preserved (guarded by accept.sh)
Survival assertions confirm these are all still present and unchanged:
`CNPGBackupStale`, `CNPGDumpCronJobStale`, `PodOOMKilled`,
`FluxKustomizationNotReady`, `FluxHelmReleaseNotReady`,
`FluxResourceSuspended`.
## Acceptance test
CI runs `.claude/sprints/o-7/accept.sh` via
`.github/workflows/sprint-accept.yml`. Local pass:
```text
[accept:O-7] yamllint 2 files
[accept:O-7] kubectl kustomize _lib/observability/kube-prometheus-stack
[accept:O-7] assert: exactly one custom PrometheusRule renders
[accept:O-7] assert: existing rules CNPGBackupStale + CNPGDumpCronJobStale preserved
[accept:O-7] assert: new alert 'CertExpiringSoon' exists with severity label
[accept:O-7] assert: new alert 'PVCNearFull' exists with severity label
[accept:O-7] assert: new alert 'GatusEndpointDown' exists with severity label
[accept:O-7] assert: CertExpiringSoon expr references certmanager_certificate_expiration_timestamp_seconds
[accept:O-7] assert: CertExpiringSoon has a warning-severity variant
[accept:O-7] assert: CertExpiringSoon has a critical-severity variant
[accept:O-7] assert: PVCNearFull expr references kubelet_volume_stats_available_bytes + capacity_bytes
[accept:O-7] assert: PVCNearFull excludes *-dumps-pvc via persistentvolumeclaim filter
[accept:O-7] assert: PVCNearFull has a warning-severity variant
[accept:O-7] assert: PVCNearFull has a critical-severity variant
[accept:O-7] assert: GatusEndpointDown expr references gatus_results_endpoint_success
[accept:O-7] PASS
```
## Post-merge manual checks
- `kube dev -n monitoring exec
deploy/monitoring-kube-prometheus-stack-prometheus -- wget -qO-
"localhost:9090/api/v1/rules"` — confirm 3 new rule groups loaded
- Force a synthetic test via Alertmanager UI / silence to verify Slack
routing on critical paths (optional — pattern matches existing CNPG
alerts)
## Worktree
`/tmp/sprints/o-7` (clean up after merge: `git worktree remove
/tmp/sprints/o-7 && git branch -D sprint/o-7`)
## Generated by
`/sprint-orchestrate FALCO-BUNDLE O-7` — wave 1 (parallel with
FALCO-BUNDLE, separate PR #49).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
alexrf45 added a commit that referenced this pull request May 29, 2026
…gate

Heavy observability+CI day. New review doc captures full state.

Closed since 2026-05-28:
- F-1+F-2+F-3 (PR #49 FALCO-BUNDLE)
- O-7 remaining alerts (PR #50 + 8a7a6f7 PVCNearFull fix)
- O-12 (CI lint workflow)
- O-13 (flux PodMonitor: pre-existing latent scrape bug, collected no data for 4 days)
- O-16 (kromgo configMapGenerator auto-rollout)

New:
- O-15 (kromgo flux_version "No Data": KSM label allowlist needed; top of next sprint)
- O-17 (TargetDown × 2 firing on authentik metrics scrape)
- O-18 (tailscale-operator PodMonitor 0 targets since bootstrap)

Hyg-2 (orphan Falco Redis PVC after F-2 descope), Hyg-3 (3 cluster-configs comment-warnings).

Recommended next sprint: O-15 + O-17 + O-18 as a 90-min observability cleanup bundle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Closes
**F-1, F-2, F-3** from `_docs/reviews/home-0ps-review-2026-05-28.md`. All three findings touch the same Falco HR file, so they ship as one PR.
## What's in this PR
### F-1: mount custom-rules ConfigMap
- `mounts.volumes` + `mounts.volumeMounts` reference the existing `falco-custom-rules` ConfigMap, mounted at `/etc/falco/rules.d`
- `falco.rules_files` already included that path (no change needed there; guarded by assertion in `accept.sh`)
### F-2: disable falcosidekick UI + Redis
- `falcosidekick.webui.enabled: false`
- `falcosidekick.webui.redis.enabled: false` ← path correction: the brief mentioned `falcosidekick.config.redis.*`, but in falcosidekick subchart 0.12.x Redis is actually nested under `webui` (confirmed against upstream `values.yaml` before committing)
- `falco.http_output.url → falco-falcosidekick:2801` (the chart routes via http_output → falcosidekick → Slack, not via a direct `falcosidekick.config.slack.*` block on the HR). The accept test guards `http_output.enabled` + the URL target, plus a future-state guard for any `falcosidekick.config.slack.webhookurl` if someone later wires that direct path.
### F-3: modern_ebpf driver verified live
- `security-falco-falcosecurity-4l8fk` shows the engine line `Opening syscall source with modern BPF probe.`
- No `kmod` / legacy / fallback strings in the last 400 lines across the 6 DS pods on Talos `6.18.32-talos`.
- `driver.kind: modern_ebpf` is already set in the HR; the assertion guards it.
## Post-merge manual cleanup (NOT in this PR)
The PV `security-falco-falcosidekick-ui-redis-data` (1Gi, Retain) will go unbound after the chart removes the StatefulSet. The data won't auto-delete because `reclaimPolicy: Retain`. Follow-ups:
- `kube dev -n security get sts,pvc | grep redis`
- `volumeAttributes.volume_id` or `volumeAttributes.iscsi_target_iqn`.
## Acceptance test
CI runs `.claude/sprints/falco-bundle/accept.sh` via `.github/workflows/sprint-accept.yml`. The local run passes.
## Out of scope for this sprint
- `falco-custom-rules` ConfigMap content: kept as the existing thin starter set. A future sprint expands the rule library.
## Worktree
`/tmp/sprints/falco-bundle` (clean up after merge: `git worktree remove /tmp/sprints/falco-bundle && git branch -D sprint/falco-bundle`)
## Generated by
`/sprint-orchestrate FALCO-BUNDLE O-7`: wave 1 (parallel with O-7, separate PR for the alert work).
🤖 Generated with Claude Code
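
For readers without the diff at hand, the F-1 to F-3 settings described in this PR could look roughly like the following values fragment of the Falco HelmRelease. This is a sketch under assumptions, not the committed file: the volume name, the `http://` scheme on the URL, and the omission of every other value (including any other `rules_files` entries) are illustrative.

```yaml
# Sketch of the HelmRelease values touched by F-1..F-3; not the committed file.
driver:
  kind: modern_ebpf                  # F-3: already set, unchanged

mounts:
  volumes:
    - name: custom-rules             # volume name is hypothetical
      configMap:
        name: falco-custom-rules     # existing in-repo ConfigMap (security ns)
  volumeMounts:
    - name: custom-rules
      mountPath: /etc/falco/rules.d  # F-1: path already listed in rules_files

falco:
  rules_files:
    - /etc/falco/rules.d             # other entries omitted here
  http_output:
    enabled: true
    url: http://falco-falcosidekick:2801   # scheme assumed

falcosidekick:
  webui:
    enabled: false                   # F-2: no web UI Deployment
    redis:
      enabled: false                 # F-2: no bundled Redis StatefulSet
```

Keeping Redis under `webui` mirrors the path correction noted for F-2; a `falcosidekick.config.redis.*` key would not disable the StatefulSet in the 0.12.x subchart according to the PR description.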