feat(alerting): page when the kube-state-metrics CR-state projection goes dark #60

Merged: stxkxs merged 1 commit into main from ksm-health-alert on Jun 24, 2026

@stxkxs (Member) commented Jun 24, 2026

The 7 agent persona dashboards read kube_customresource_* from the single customResourceState config in the kube-state-metrics addon. KSM parses it as one unit — a malformed block, CRD-list typo, RBAC Forbidden, or KSM outage drops all those series at once and silently no-data's every persona board.
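
For context, a customResourceState entry that projects a phase series might look roughly like the following. This is a sketch, not the addon's actual config: the group is inferred from agents.nanohype.dev, and the kind, version, and phase list are assumptions.

```yaml
# Sketch only. Values marked "assumed" do not come from this PR.
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: nanohype.dev        # inferred from agents.nanohype.dev
        version: v1alpha1          # assumed
        kind: Agent                # assumed
      metrics:
        - name: status_phase       # exposed with KSM's default kube_customresource_ prefix
          help: Current phase of the custom resource
          each:
            type: StateSet
            stateSet:
              labelName: phase
              path: [status, phase]
              list: [Pending, Running, Failed]   # assumed phase values
```

With KSM's default prefix this would surface as kube_customresource_status_phase. All entries under resources live in the same config, so an error in any one of them affects every projected series.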

Nothing caught that. The existing absent() alerts watch controller_runtime_*, which arrives via a different (pod-annotation) scrape path that stays green through a KSM-side break.

Adds ksm-health: absent(kube_customresource_status_phase) for 10m → page, mirroring the existing absent()-canary query+threshold model (folderRef slo-alerts, pinned managed-prometheus UID). noDataState: OK so a healthy KSM (where absent() returns empty) doesn't fire; the 10m for-duration rides out a KSM pod rollout without flapping.
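
A rough sketch of the rule group's shape, assuming the repo uses grafana-operator's GrafanaAlertRuleGroup CRD (v1beta1). Only the expression, the 10m for-duration, noDataState, folderRef, and the datasource UID come from this PR; the instance selector, interval, rule UID, title, labels, and execErrState are placeholders.

```yaml
# Sketch only. Values marked "assumed" do not come from this PR.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaAlertRuleGroup
metadata:
  name: ksm-health
spec:
  folderRef: slo-alerts
  instanceSelector:
    matchLabels:
      dashboards: grafana              # assumed selector label
  interval: 1m                         # assumed evaluation interval
  rules:
    - uid: ksm-crstate-absent          # assumed rule UID
      title: KSM CR-state projection absent   # assumed title
      condition: B
      for: 10m                         # rides out a KSM pod rollout
      noDataState: OK                  # absent() returns empty while KSM is healthy
      execErrState: Error              # assumed
      labels:
        severity: page                 # assumed routing label
      data:
        # A: returns 1 only when no kube_customresource_status_phase series exists
        - refId: A
          datasourceUid: managed-prometheus
          relativeTimeRange:
            from: 600
            to: 0
          model:
            refId: A
            expr: absent(kube_customresource_status_phase)
            instant: true
        # B: threshold expression, fires when A > 0
        - refId: B
          datasourceUid: __expr__
          relativeTimeRange:
            from: 600
            to: 0
          model:
            refId: B
            type: threshold
            expression: A
            conditions:
              - type: query
                evaluator:
                  type: gt
                  params: [0]
                operator:
                  type: and
                query:
                  params: [B]
                reducer:
                  type: last
                  params: []
```

The two-step shape (query A, threshold B) is what "query+threshold model" refers to above. Because absent() yields no samples when the metric exists, the rule sits in the no-data state during normal operation, which is why noDataState has to be OK rather than the NoData or Alerting options.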

yamllint passes and all overlays build; folderRef resolves.

…goes dark

The seven agent persona dashboards read kube_customresource_* — projected by the
single customResourceState config in the kube-state-metrics addon. KSM parses that
config as one unit, so a malformed block, a CRD-list typo, an RBAC Forbidden on
agents.nanohype.dev, or a KSM rename/outage drops every kube_customresource_* series
at once and silently no-data's all those boards.

Nothing alerted on that. The existing absent() rules (agent-operator, fleet-vend)
watch controller_runtime_* — which arrives via the operator/provider pod-annotation
scrape, a different path that stays green through a KSM-side break.

Adds the ksm-health GrafanaAlertRuleGroup: absent(kube_customresource_status_phase)
for 10m → page (folderRef slo-alerts, datasourceUid managed-prometheus, same
query+threshold model as the other absent() canaries). noDataState OK so a healthy
KSM (absent() returns empty) doesn't fire; the 10m for-duration rides out a KSM pod
rollout without flapping.
@github-actions

CI Results

Check: YAML Lint
Kustomize Build environments: dev, staging, production, hub

All validations passed.

@stxkxs merged commit 2f17ba3 into main on Jun 24, 2026 (8 checks passed)
@stxkxs deleted the ksm-health-alert branch on June 24, 2026 02:02