O-7: cert expiry + PVC near-full + Gatus endpoint-down alerts #50
Merged
alexrf45 added a commit that referenced this pull request on May 29, 2026
…tage (mul $value 100) is a Sprig function; Prometheus alert templates use Go text/template + Prometheus extensions only. The expression parses through YAML lint + CRD schema but fails at the operator's mutating webhook on apply, blocking the whole observability Kustomization on reconcile. humanizePercentage accepts the 0-1 ratio the expr already produces and renders with the "%" suffix built in. Surfaced when Flux re-applied this rule during the kromgo reconcile (11f008e); the latent bug shipped with O-7 (PR #50) because PR CI doesn't run server-side validation on PrometheusRule manifests. Tracked as O-12 in _docs/reviews/home-0ps-review-2026-05-28.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
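The difference this commit message describes can be sketched as a before/after pair of annotation templates. This is an illustrative fragment, not the rule as committed: the `description` key and the sentence wording are assumptions; only `mul $value 100` and `humanizePercentage` come from the message above.

```yaml
# Hypothetical annotation fragment for one alert in a PrometheusRule.
#
# Rejected form: `mul` is a Sprig function, which Prometheus alert
# templates do not provide.
#   description: 'Current ratio: {{ mul $value 100 }}%'
#
# Accepted form: humanizePercentage takes the 0-1 ratio held in $value
# and renders it as a percentage, "%" suffix included.
annotations:
  description: 'Current ratio: {{ $value | humanizePercentage }}'
```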
alexrf45 added a commit that referenced this pull request on May 29, 2026
…gate Heavy observability+CI day. New review doc captures full state. Closed since 2026-05-28: F-1+F-2+F-3 (PR #49 FALCO-BUNDLE), O-7 remaining alerts (PR #50 + 8a7a6f7 PVCNearFull fix), O-12 (CI lint workflow), O-13 (flux PodMonitor — pre-existing latent scrape bug, never caught data for 4 days), O-16 (kromgo configMapGenerator auto-rollout). New: O-15 (kromgo flux_version "No Data" — KSM label allowlist needed; top of next sprint), O-17 (TargetDown × 2 firing on authentik metrics scrape), O-18 (tailscale-operator PodMonitor 0 targets since bootstrap). Hyg-2 (orphan Falco Redis PVC after F-2 descope), Hyg-3 (3 cluster-configs comment-warnings). Recommended next sprint: O-15 + O-17 + O-18 as a 90-min observability cleanup bundle. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes O-7 remaining alerts from `_docs/reviews/home-0ps-review-2026-05-28.md`. The two CNPG alerts (`CNPGBackupStale`, `CNPGDumpCronJobStale`) already exist; this PR adds the three remaining.

## What's in this PR
**`CertExpiringSoon`** (2 variants)
- Routes: `slack-warning`, `slack-critical`
- Metric: `certmanager_certificate_expiration_timestamp_seconds` ← brief said `cert_manager_*` (underscore split); live cluster exports `certmanager_*` (no underscore). Verified via Prometheus `/api/v1/label/__name__/values`.
- Labels: `severity`, `app: cert-manager` for runbook routing

**`PVCNearFull`** (2 variants)
- available/capacity < 0.10, sustained 30m → `slack-warning`
- available/capacity < 0.05, sustained 10m → `slack-critical`
- Metric: `kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes`
- Excludes `*-dumps-pvc` via `persistentvolumeclaim!~".*-dumps-pvc"`: S-6 dump PVCs fill the volume by design
- Labels: `severity`, `pvc: {{ $labels.persistentvolumeclaim }}`

**`GatusEndpointDown`** (1 variant)
- `gatus_results_endpoint_success == 0` for 5m → `slack-warning`
- Metric labels: `key` (e.g. `applications_authentik`), `group`, `name`, `type`
- Alert labels: `severity=warning`, `app=gatus`, `endpoint: {{ $labels.key }}`

## Existing rules preserved (guarded by accept.sh)
Survival assertions confirm these are all still present and unchanged: `CNPGBackupStale`, `CNPGDumpCronJobStale`, `PodOOMKilled`, `FluxKustomizationNotReady`, `FluxHelmReleaseNotReady`, `FluxResourceSuspended`.

## Acceptance test
CI runs `.claude/sprints/o-7/accept.sh` via `.github/workflows/sprint-accept.yml`. Local pass:

## Post-merge manual checks
- `kube dev -n monitoring exec deploy/monitoring-kube-prometheus-stack-prometheus -- wget -qO- "localhost:9090/api/v1/rules"`: confirm 3 new rule groups loaded

## Worktree
`/tmp/sprints/o-7` (clean up after merge: `git worktree remove /tmp/sprints/o-7 && git branch -D sprint/o-7`)

Generated by `/sprint-orchestrate FALCO-BUNDLE O-7`, wave 1 (parallel with FALCO-BUNDLE, separate PR #49).

🤖 Generated with Claude Code
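
For orientation, the `PVCNearFull` and `GatusEndpointDown` rules described in this PR could be expressed roughly as the `PrometheusRule` below. This is a sketch under stated assumptions, not the manifest that was merged: the resource name, group names, and the `warning`/`critical` values on the PVC `severity` label are hypothetical. The expressions, `for` durations, PVC exclusion matcher, and the `pvc`/`app`/`endpoint` labels follow the description above; `CertExpiringSoon` is omitted because its thresholds are not stated.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: o-7-alerts-sketch        # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: pvc-near-full        # hypothetical group name
      rules:
        - alert: PVCNearFull
          # Ratio of free to total bytes; dump PVCs are excluded because
          # they fill the volume by design.
          expr: |
            kubelet_volume_stats_available_bytes{persistentvolumeclaim!~".*-dumps-pvc"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim!~".*-dumps-pvc"}
              < 0.10
          for: 30m
          labels:
            severity: warning    # assumed value; routed to slack-warning
            pvc: '{{ $labels.persistentvolumeclaim }}'
        - alert: PVCNearFull
          expr: |
            kubelet_volume_stats_available_bytes{persistentvolumeclaim!~".*-dumps-pvc"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim!~".*-dumps-pvc"}
              < 0.05
          for: 10m
          labels:
            severity: critical   # assumed value; routed to slack-critical
            pvc: '{{ $labels.persistentvolumeclaim }}'
    - name: gatus-endpoint-down  # hypothetical group name
      rules:
        - alert: GatusEndpointDown
          expr: gatus_results_endpoint_success == 0
          for: 5m
          labels:
            severity: warning
            app: gatus
            endpoint: '{{ $labels.key }}'
```

Keeping the two `PVCNearFull` variants as separate rules with the same alert name lets each threshold carry its own `for` duration and severity.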