O-7: cert expiry + PVC near-full + Gatus endpoint-down alerts#50

Merged: alexrf45 merged 2 commits into dev from sprint/o-7 on May 28, 2026

@alexrf45 (Owner)

Closes

O-7 remaining alerts from _docs/reviews/home-0ps-review-2026-05-28.md. The two CNPG alerts (CNPGBackupStale, CNPGDumpCronJobStale) already exist; this PR adds the three remaining.

What's in this PR

CertExpiringSoon (2 variants)

  • Warning when expiry < 14 days, sustained 1h → routes to slack-warning
  • Critical when expiry < 3 days, sustained 1h → routes to slack-critical
  • Metric: certmanager_certificate_expiration_timestamp_seconds. The brief said cert_manager_* (with an underscore), but the live cluster exports certmanager_* (no underscore). Verified via Prometheus /api/v1/label/__name__/values.
  • Labels: severity, app: cert-manager for runbook routing
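
The two variants above could be written roughly as follows. This is a minimal sketch, not the rule shipped in this PR: the group name and the annotation text are assumptions, and the fragment is shown in the `groups:` form that sits under a PrometheusRule's `spec`. The metric name, thresholds, `for` durations, and labels are taken from the list above.

```yaml
# Illustrative sketch only. Group name and annotations are assumed,
# not copied from the PR.
groups:
  - name: cert-expiry-sketch          # hypothetical group name
    rules:
      - alert: CertExpiringSoon
        # Seconds remaining until expiry, compared against 14 days.
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
          app: cert-manager
        annotations:
          # $value is the remaining lifetime in seconds.
          summary: "Certificate expires in {{ $value | humanizeDuration }}"
      - alert: CertExpiringSoon
        # Same expression with the tighter 3-day threshold.
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 3 * 24 * 3600
        for: 1h
        labels:
          severity: critical
          app: cert-manager
        annotations:
          summary: "Certificate expires in {{ $value | humanizeDuration }}"
```

With this shape, a certificate inside the 3-day window satisfies both expressions, so both variants would fire unless Alertmanager inhibition suppresses the warning.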

PVCNearFull (2 variants)

  • Warning when available/capacity < 0.10, sustained 30m → slack-warning
  • Critical when available/capacity < 0.05, sustained 10m → slack-critical
  • Metrics: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes
  • Excludes *-dumps-pvc via persistentvolumeclaim!~".*-dumps-pvc" — S-6 dump PVCs fill the volume by design
  • Labels: severity, pvc: {{ $labels.persistentvolumeclaim }}
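
As a rough illustration of the warning variant above, the free-space ratio and the dump-PVC exclusion might look like the fragment below. The group name and annotation are assumptions; the metrics, the `!~".*-dumps-pvc"` matcher, the threshold, the duration, and the labels come from the list above. The critical variant would differ only in the threshold (0.05), the duration (10m), and the severity.

```yaml
# Illustrative sketch only. Group name and annotation are assumed.
groups:
  - name: pvc-capacity-sketch         # hypothetical group name
    rules:
      - alert: PVCNearFull
        # Ratio of free bytes to total bytes, skipping the dump PVCs
        # that are expected to fill up.
        expr: |
          kubelet_volume_stats_available_bytes{persistentvolumeclaim!~".*-dumps-pvc"}
            /
          kubelet_volume_stats_capacity_bytes{persistentvolumeclaim!~".*-dumps-pvc"}
            < 0.10
        for: 30m
        labels:
          severity: warning
          pvc: "{{ $labels.persistentvolumeclaim }}"
        annotations:
          # The expression yields a 0-1 ratio, which humanizePercentage
          # renders as a percentage.
          summary: "PVC {{ $labels.persistentvolumeclaim }} has {{ $value | humanizePercentage }} free"
```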

GatusEndpointDown (1 variant)

  • Warning when gatus_results_endpoint_success == 0 for 5m → slack-warning
  • Metric verified live via in-cluster Prometheus API — gauge labelled by key (e.g. applications_authentik), group, name, type
  • Labels: severity=warning, app=gatus, endpoint: {{ $labels.key }}
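
A sketch of how the single Gatus rule above might be expressed, assuming a hypothetical group name and annotation wording. The metric, the 5m duration, and the labels follow the list above.

```yaml
# Illustrative sketch only. Group name and annotation are assumed.
groups:
  - name: gatus-endpoints-sketch      # hypothetical group name
    rules:
      - alert: GatusEndpointDown
        # The gauge is 0 when the most recent check of an endpoint failed.
        expr: gatus_results_endpoint_success == 0
        for: 5m
        labels:
          severity: warning
          app: gatus
          endpoint: "{{ $labels.key }}"
        annotations:
          summary: "Gatus endpoint {{ $labels.group }}/{{ $labels.name }} has been failing for 5 minutes"
```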

Existing rules preserved (guarded by accept.sh)

Survival assertions confirm these are all still present and unchanged: CNPGBackupStale, CNPGDumpCronJobStale, PodOOMKilled, FluxKustomizationNotReady, FluxHelmReleaseNotReady, FluxResourceSuspended.

Acceptance test

CI runs .claude/sprints/o-7/accept.sh via .github/workflows/sprint-accept.yml. Local pass:

[accept:O-7] yamllint 2 files
[accept:O-7] kubectl kustomize _lib/observability/kube-prometheus-stack
[accept:O-7] assert: exactly one custom PrometheusRule renders
[accept:O-7] assert: existing rules CNPGBackupStale + CNPGDumpCronJobStale preserved
[accept:O-7] assert: new alert 'CertExpiringSoon' exists with severity label
[accept:O-7] assert: new alert 'PVCNearFull' exists with severity label
[accept:O-7] assert: new alert 'GatusEndpointDown' exists with severity label
[accept:O-7] assert: CertExpiringSoon expr references certmanager_certificate_expiration_timestamp_seconds
[accept:O-7] assert: CertExpiringSoon has a warning-severity variant
[accept:O-7] assert: CertExpiringSoon has a critical-severity variant
[accept:O-7] assert: PVCNearFull expr references kubelet_volume_stats_available_bytes + capacity_bytes
[accept:O-7] assert: PVCNearFull excludes *-dumps-pvc via persistentvolumeclaim filter
[accept:O-7] assert: PVCNearFull has a warning-severity variant
[accept:O-7] assert: PVCNearFull has a critical-severity variant
[accept:O-7] assert: GatusEndpointDown expr references gatus_results_endpoint_success
[accept:O-7] PASS

Post-merge manual checks

  • kube dev -n monitoring exec deploy/monitoring-kube-prometheus-stack-prometheus -- wget -qO- "localhost:9090/api/v1/rules" — confirm 3 new rule groups loaded
  • Force a synthetic test via Alertmanager UI / silence to verify Slack routing on critical paths (optional — pattern matches existing CNPG alerts)

Worktree

/tmp/sprints/o-7 (clean up after merge: git worktree remove /tmp/sprints/o-7 && git branch -D sprint/o-7)

Generated by

/sprint-orchestrate FALCO-BUNDLE O-7 — wave 1 (parallel with FALCO-BUNDLE, separate PR #49).

🤖 Generated with Claude Code

alexrf45 merged commit df2c03a into dev on May 28, 2026
1 check passed
alexrf45 deleted the sprint/o-7 branch on May 28, 2026, 21:06
alexrf45 added a commit that referenced this pull request May 29, 2026
…tage

(mul $value 100) is a Sprig function; Prometheus alert templates use Go
text/template + Prometheus extensions only. The expression parses through
YAML lint + CRD schema but fails at the operator's mutating webhook on
apply, blocking the whole observability Kustomization on reconcile.
humanizePercentage accepts the 0-1 ratio the expr already produces and
renders with the "%" suffix built in.
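
The change described in this commit message can be sketched as an annotation before and after the fix. The annotation key and surrounding wording are assumptions for illustration; only `(mul $value 100)` and `humanizePercentage` come from the message itself.

```yaml
# Illustrative sketch only. The annotation key and wording are assumed.
annotations:
  # Before (rejected on apply): `mul` comes from Sprig and is not
  # defined in Prometheus alert templates.
  # description: "{{ mul $value 100 }}% free"
  #
  # After: humanizePercentage takes the 0-1 ratio and adds the "%" itself.
  description: "{{ $value | humanizePercentage }} free"
```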

Surfaced when Flux re-applied this rule during the kromgo reconcile
(11f008e); the latent bug shipped with O-7 (PR #50) because PR CI doesn't
run server-side validation on PrometheusRule manifests. Tracked as O-12
in _docs/reviews/home-0ps-review-2026-05-28.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexrf45 added a commit that referenced this pull request May 29, 2026
…gate

Heavy observability+CI day. New review doc captures full state.

Closed since 2026-05-28: F-1+F-2+F-3 (PR #49 FALCO-BUNDLE),
O-7 remaining alerts (PR #50 + 8a7a6f7 PVCNearFull fix),
O-12 (CI lint workflow), O-13 (flux PodMonitor — pre-existing
latent scrape bug, never caught data for 4 days), O-16 (kromgo
configMapGenerator auto-rollout).

New: O-15 (kromgo flux_version "No Data" — KSM label allowlist
needed; top of next sprint), O-17 (TargetDown × 2 firing on
authentik metrics scrape), O-18 (tailscale-operator PodMonitor
0 targets since bootstrap). Hyg-2 (orphan Falco Redis PVC after
F-2 descope), Hyg-3 (3 cluster-configs comment-warnings).

Recommended next sprint: O-15 + O-17 + O-18 as a 90-min
observability cleanup bundle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>