FALCO-BUNDLE: mount custom-rules + disable falcosidekick UI/Redis (F-1+F-2+F-3)#49

Merged
alexrf45 merged 2 commits into dev from sprint/falco-bundle on May 28, 2026

@alexrf45

Owner

Closes

F-1, F-2, F-3 from _docs/reviews/home-0ps-review-2026-05-28.md. All three findings touch the same Falco HR file, so they ship as one PR.

What's in this PR

F-1 — mount custom-rules ConfigMap

  • mounts.volumes + mounts.volumeMounts reference the existing falco-custom-rules ConfigMap, mounted at /etc/falco/rules.d
  • falco.rules_files already included that path (no change needed there; guarded by assertion in accept.sh)
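
A minimal sketch of how the F-1 wiring might look under the HelmRelease's spec.values. It is not copied from the repo: the volume name custom-rules is a placeholder, while the ConfigMap name, the mount path, and the rules_files entry come from the description above.

```yaml
# Sketch only: values fragment for the Falco HelmRelease (spec.values).
mounts:
  volumes:
    - name: custom-rules              # hypothetical volume name
      configMap:
        name: falco-custom-rules      # existing ConfigMap named in this PR
  volumeMounts:
    - name: custom-rules              # must match the volume name above
      mountPath: /etc/falco/rules.d   # directory Falco scans for rule files
falco:
  rules_files:
    - /etc/falco/rules.d              # already present per this PR; other entries omitted
```

Mounting through the chart's mounts hooks keeps the rule library in Git without forking the chart.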

F-2 — disable falcosidekick UI + Redis

  • falcosidekick.webui.enabled: false
  • falcosidekick.webui.redis.enabled: false. Path correction: the brief mentioned falcosidekick.config.redis.*, but in falcosidekick subchart 0.12.x Redis is actually nested under webui (confirmed against upstream values.yaml before committing)
  • Slack output preserved via falco.http_output.url → falco-falcosidekick:2801 (the chart routes via http_output → falcosidekick → Slack, not via a direct falcosidekick.config.slack.* block on the HR). Accept-test guards http_output.enabled + the URL target, plus a future-state guard for any falcosidekick.config.slack.webhookurl if someone later wires that direct path.
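
The F-2 settings could be expressed roughly as the fragment below. This is a sketch under assumptions: the key paths are the ones listed above, but the http:// scheme and the exact host form of the URL are guesses, since the description only gives the target as falco-falcosidekick:2801.

```yaml
# Sketch only: values fragment for the Falco HelmRelease (spec.values).
falcosidekick:
  webui:
    enabled: false        # drops the web UI Deployment
    redis:
      enabled: false      # drops the bundled Redis StatefulSet
falco:
  http_output:
    enabled: true         # alert plumbing preserved
    url: http://falco-falcosidekick:2801   # scheme and host form are assumptions
```

Only the UI and Redis are switched off here; the http_output block is what keeps alerts flowing to falcosidekick and on to Slack.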

F-3 — modern_ebpf driver verified live

  • No HR change needed. Live DaemonSet log on security-falco-falcosecurity-4l8fk shows:
    Thu May 28 20:27:02 2026: Opening 'syscall' source with modern BPF probe.
    
  • No kmod / legacy / fallback strings in the last 400 lines across the 6 DS pods on Talos 6.18.32-talos.
  • driver.kind: modern_ebpf is already set in the HR; the assertion guards it.

Post-merge manual cleanup (NOT in this PR)

The PV security-falco-falcosidekick-ui-redis-data (1Gi, Retain) will go unbound after the chart removes the StatefulSet. The data won't auto-delete because reclaimPolicy: Retain. Follow-ups:

  1. After Flux reconciles, confirm the StatefulSet + PVC are gone: kube dev -n security get sts,pvc | grep redis
  2. The underlying TrueNAS zvol (named per the dynamic-provisioner pattern) needs hand-deletion to reclaim the 1Gi. Locate via the PV's volumeAttributes.volume_id or volumeAttributes.iscsi_target_iqn.

Acceptance test

CI runs .claude/sprints/falco-bundle/accept.sh via .github/workflows/sprint-accept.yml. Local pass:

[accept:FALCO-BUNDLE] yamllint 1 files
[accept:FALCO-BUNDLE] kubectl kustomize _lib/controllers/falco
[accept:FALCO-BUNDLE] F-1 assert: mounts.volumes references configMap falco-custom-rules
[accept:FALCO-BUNDLE] F-1 assert: mounts.volumeMounts has /etc/falco/rules.d entry tied to the custom-rules volume
[accept:FALCO-BUNDLE] F-1 assert: falco.rules_files includes /etc/falco/rules.d
[accept:FALCO-BUNDLE] F-2 assert: falcosidekick.webui.enabled == false
[accept:FALCO-BUNDLE] F-2 assert: falcosidekick.webui.redis.enabled == false
[accept:FALCO-BUNDLE] F-2 assert: falco.http_output.enabled == true (alert plumbing preserved)
[accept:FALCO-BUNDLE] F-2 assert: falco.http_output.url targets falcosidekick
[accept:FALCO-BUNDLE] F-3 assert: driver.kind == modern_ebpf (static)
[accept:FALCO-BUNDLE] PASS

Out of scope for this sprint

  • falco-custom-rules ConfigMap content — kept as the existing thin starter set. Future sprint expands the rule library.

Worktree

/tmp/sprints/falco-bundle (clean up after merge: git worktree remove /tmp/sprints/falco-bundle && git branch -D sprint/falco-bundle)

Generated by

/sprint-orchestrate FALCO-BUNDLE O-7 — wave 1 (parallel with O-7, separate PR for the alert work).

🤖 Generated with Claude Code

alexrf45 added 2 commits May 28, 2026 16:56
…+F-2)

F-1: Mount the in-repo falco-custom-rules ConfigMap (security ns) into
/etc/falco/rules.d via the chart's mounts.volumes/volumeMounts hooks.
falco.rules_files already enumerated /etc/falco/rules.d, so this wires
the Git-managed rule library into the running engine without a chart
fork.

F-2: Disable the falcosidekick web UI Deployment and its bundled Redis
StatefulSet (1Gi iSCSI PVC) that the falcosecurity/falco 8.0.0 chart
turns on by default via the falcosidekick 0.12.x subchart. Alerts are
surfaced via Falco → falcosidekick HTTP (the falco.http_output block
is preserved), so the web UI + Redis add storage + attack surface for
no win on a single-operator homelab. Heads-up: post-merge cleanup will
leave an orphan security-falco-falcosidekick-ui-redis-data PV on the
Retain-policy iSCSI StorageClass — manual TrueNAS zvol cleanup needed
to reclaim the 1Gi.

F-3: No HR change needed. Static driver.kind=modern_ebpf was already
correct; runtime probe (k8sop dev kubectl -n security logs
ds/security-falco-falcosecurity -c falco) confirmed the engine line
'Opening syscall source with modern BPF probe.' on all 6 pods. Logged
in PR body, no fallback observed.

Also tweak accept.sh: drop accept.sh from yamllint TOUCH_PATHS (it's
a shell script, not YAML) and add a bash -n syntax check instead.
@alexrf45 alexrf45 merged commit c492d45 into dev May 28, 2026
1 check passed
@alexrf45 alexrf45 deleted the sprint/falco-bundle branch May 28, 2026 21:05
alexrf45 added a commit that referenced this pull request May 28, 2026
## Closes

**O-7 remaining alerts** from
[_docs/reviews/home-0ps-review-2026-05-28.md](../blob/dev/_docs/reviews/home-0ps-review-2026-05-28.md).
The two CNPG alerts (`CNPGBackupStale`, `CNPGDumpCronJobStale`) already
exist; this PR adds the three remaining.

## What's in this PR

### `CertExpiringSoon` (2 variants)

- **Warning** when expiry < **14 days**, sustained 1h → routes to
`slack-warning`
- **Critical** when expiry < **3 days**, sustained 1h → routes to
`slack-critical`
- Metric: **`certmanager_certificate_expiration_timestamp_seconds`** ←
**brief said `cert_manager_*` (underscore split); live cluster exports
`certmanager_*` (no underscore).** Verified via Prometheus
`/api/v1/label/__name__/values`.
- Labels: `severity`, `app: cert-manager` for runbook routing
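
A rough sketch of how the two variants might be written as PrometheusRule entries. The expressions are an assumption (remaining lifetime computed as expiry timestamp minus `time()`); the metric name, thresholds, `for` durations, and labels come from the list above.

```yaml
# Sketch only: entries for a PrometheusRule group's `rules` list.
- alert: CertExpiringSoon
  # Remaining lifetime in seconds, compared against 14 days.
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
  for: 1h
  labels:
    severity: warning
    app: cert-manager
- alert: CertExpiringSoon
  # Same comparison against 3 days for the critical variant.
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 3 * 24 * 3600
  for: 1h
  labels:
    severity: critical
    app: cert-manager
```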

### `PVCNearFull` (2 variants)

- **Warning** when `available/capacity < 0.10`, sustained 30m →
`slack-warning`
- **Critical** when `available/capacity < 0.05`, sustained 10m →
`slack-critical`
- Metrics: `kubelet_volume_stats_available_bytes /
kubelet_volume_stats_capacity_bytes`
- **Excludes `*-dumps-pvc`** via `persistentvolumeclaim!~".*-dumps-pvc"`
— S-6 dump PVCs fill the volume by design
- Labels: `severity`, `pvc: {{ $labels.persistentvolumeclaim }}`
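
The warning variant might look roughly like the entry below. The exact expression in the repo may differ; this sketch simply combines the ratio, the exclusion matcher, and the labels listed above.

```yaml
# Sketch only: one entry for a PrometheusRule group's `rules` list.
- alert: PVCNearFull
  # Fraction of free space per PVC, skipping the dump PVCs that fill by design.
  expr: |
    kubelet_volume_stats_available_bytes{persistentvolumeclaim!~".*-dumps-pvc"}
      / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim!~".*-dumps-pvc"}
      < 0.10
  for: 30m
  labels:
    severity: warning
    pvc: "{{ $labels.persistentvolumeclaim }}"
```

The critical variant would follow the same shape with `< 0.05`, `for: 10m`, and `severity: critical`.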

### `GatusEndpointDown` (1 variant)

- **Warning** when `gatus_results_endpoint_success == 0` for 5m →
`slack-warning`
- Metric verified live via in-cluster Prometheus API — gauge labelled by
`key` (e.g. `applications_authentik`), `group`, `name`, `type`
- Labels: `severity=warning`, `app=gatus`, `endpoint: {{ $labels.key }}`
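
As a sketch, the single variant could be expressed as follows, assuming the rule is written directly against the gauge with no extra label filters:

```yaml
# Sketch only: one entry for a PrometheusRule group's `rules` list.
- alert: GatusEndpointDown
  # The gauge is 0 while an endpoint's most recent check failed.
  expr: gatus_results_endpoint_success == 0
  for: 5m
  labels:
    severity: warning
    app: gatus
    endpoint: "{{ $labels.key }}"
```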

## Existing rules preserved (guarded by accept.sh)

Survival assertions confirm these are all still present and unchanged:
`CNPGBackupStale`, `CNPGDumpCronJobStale`, `PodOOMKilled`,
`FluxKustomizationNotReady`, `FluxHelmReleaseNotReady`,
`FluxResourceSuspended`.

## Acceptance test

CI runs `.claude/sprints/o-7/accept.sh` via
`.github/workflows/sprint-accept.yml`. Local pass:

```text
[accept:O-7] yamllint 2 files
[accept:O-7] kubectl kustomize _lib/observability/kube-prometheus-stack
[accept:O-7] assert: exactly one custom PrometheusRule renders
[accept:O-7] assert: existing rules CNPGBackupStale + CNPGDumpCronJobStale preserved
[accept:O-7] assert: new alert 'CertExpiringSoon' exists with severity label
[accept:O-7] assert: new alert 'PVCNearFull' exists with severity label
[accept:O-7] assert: new alert 'GatusEndpointDown' exists with severity label
[accept:O-7] assert: CertExpiringSoon expr references certmanager_certificate_expiration_timestamp_seconds
[accept:O-7] assert: CertExpiringSoon has a warning-severity variant
[accept:O-7] assert: CertExpiringSoon has a critical-severity variant
[accept:O-7] assert: PVCNearFull expr references kubelet_volume_stats_available_bytes + capacity_bytes
[accept:O-7] assert: PVCNearFull excludes *-dumps-pvc via persistentvolumeclaim filter
[accept:O-7] assert: PVCNearFull has a warning-severity variant
[accept:O-7] assert: PVCNearFull has a critical-severity variant
[accept:O-7] assert: GatusEndpointDown expr references gatus_results_endpoint_success
[accept:O-7] PASS
```

## Post-merge manual checks

- `kube dev -n monitoring exec
deploy/monitoring-kube-prometheus-stack-prometheus -- wget -qO-
"localhost:9090/api/v1/rules"` — confirm 3 new rule groups loaded
- Force a synthetic test via Alertmanager UI / silence to verify Slack
routing on critical paths (optional — pattern matches existing CNPG
alerts)

## Worktree

`/tmp/sprints/o-7` (clean up after merge: `git worktree remove
/tmp/sprints/o-7 && git branch -D sprint/o-7`)

## Generated by

`/sprint-orchestrate FALCO-BUNDLE O-7` — wave 1 (parallel with
FALCO-BUNDLE, separate PR #49).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
alexrf45 added a commit that referenced this pull request May 29, 2026
…gate

Heavy observability+CI day. New review doc captures full state.

Closed since 2026-05-28: F-1+F-2+F-3 (PR #49 FALCO-BUNDLE),
O-7 remaining alerts (PR #50 + 8a7a6f7 PVCNearFull fix),
O-12 (CI lint workflow), O-13 (flux PodMonitor — pre-existing
latent scrape bug, never caught data for 4 days), O-16 (kromgo
configMapGenerator auto-rollout).

New: O-15 (kromgo flux_version "No Data" — KSM label allowlist
needed; top of next sprint), O-17 (TargetDown × 2 firing on
authentik metrics scrape), O-18 (tailscale-operator PodMonitor
0 targets since bootstrap). Hyg-2 (orphan Falco Redis PVC after
F-2 descope), Hyg-3 (3 cluster-configs comment-warnings).

Recommended next sprint: O-15 + O-17 + O-18 as a 90-min
observability cleanup bundle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>