feat(alerting): page when the kube-state-metrics CR-state projection goes dark #60
Merged
…goes dark

The seven agent persona dashboards read kube_customresource_*, projected by the single customResourceState config in the kube-state-metrics addon. KSM parses that config as one unit, so a malformed block, a CRD-list typo, an RBAC Forbidden on agents.nanohype.dev, or a KSM rename/outage drops every kube_customresource_* series at once and silently no-data's all those boards.

Nothing alerted on that. The existing absent() rules (agent-operator, fleet-vend) watch controller_runtime_*, which arrives via the operator/provider pod-annotation scrape, a different path that stays green through a KSM-side break.

Adds the ksm-health GrafanaAlertRuleGroup: absent(kube_customresource_status_phase) for 10m → page (folderRef slo-alerts, datasourceUid managed-prometheus, same query+threshold model as the other absent() canaries). noDataState OK so a healthy KSM (absent() returns empty) doesn't fire; the 10m for-duration rides out a KSM pod rollout without flapping.
CI Results
All validations passed.
The 7 agent persona dashboards read `kube_customresource_*` from the single customResourceState config in the kube-state-metrics addon. KSM parses it as one unit: a malformed block, CRD-list typo, RBAC Forbidden, or KSM outage drops all those series at once and silently no-data's every persona board.

Nothing caught that. The existing `absent()` alerts watch `controller_runtime_*`, which arrives via a different (pod-annotation) scrape path that stays green through a KSM-side break.

Adds `ksm-health`: `absent(kube_customresource_status_phase)` for 10m → page, mirroring the existing absent()-canary query+threshold model (folderRef `slo-alerts`, pinned `managed-prometheus` UID). `noDataState: OK` so a healthy KSM doesn't fire; 10m rides out a KSM pod rollout.

yamllint + all overlays build; folderRef resolves.
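
For readers without the diff open, a rule group of the shape described above might look roughly like the sketch below. It is not the merged manifest. Only the group name, `folderRef`, `datasourceUid`, the `absent()` expression, the 10m `for`, and `noDataState: OK` come from this description. The `apiVersion` (assuming grafana-operator's v1beta1 CRDs), the instance selector, evaluation interval, rule `uid`, title, `execErrState`, and the `severity: page` label are all illustrative assumptions.

```yaml
# Sketch only: every value marked "assumed" is illustrative, not taken from the PR.
apiVersion: grafana.integreatly.org/v1beta1   # assumed: grafana-operator CRD group/version
kind: GrafanaAlertRuleGroup
metadata:
  name: ksm-health
spec:
  folderRef: slo-alerts
  instanceSelector:
    matchLabels:
      dashboards: grafana            # assumed selector label
  interval: 1m                       # assumed evaluation interval
  rules:
    - uid: ksm-cr-state-absent       # assumed uid
      title: KSM CR-state metrics absent   # assumed title
      condition: B
      for: 10m                       # rides out a KSM pod rollout
      noDataState: OK                # healthy KSM: absent() returns empty, so no data means OK
      execErrState: Error            # assumed
      labels:
        severity: page               # assumed routing label for paging
      data:
        # Query A: returns a single sample of value 1 only when the series is missing.
        - refId: A
          datasourceUid: managed-prometheus
          relativeTimeRange:
            from: 600
            to: 0
          model:
            refId: A
            expr: absent(kube_customresource_status_phase)
            instant: true
        # Expression B: threshold on A, firing when A > 0.
        - refId: B
          datasourceUid: __expr__
          relativeTimeRange:
            from: 600
            to: 0
          model:
            refId: B
            type: threshold
            expression: A
            conditions:
              - evaluator:
                  type: gt
                  params: [0]
```

The query-plus-threshold split is what makes `noDataState: OK` safe here: while KSM is healthy, query A yields nothing at all, so the "no data" branch is the normal case rather than a failure signal.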