feat(observability): SRE dashboard + Grafana-managed SLO alerts for portal #48
Open
stxkxs wants to merge 2 commits into
portal (the ops control-plane: API server + River worker) emitted a rich
portal_* metric set but nothing rendered it. Adds an authored GrafanaDashboard
CR that the grafana-operator reconciles onto the external Amazon Managed
Grafana, alongside the agent-* persona boards. Self-contained PromQL over the
portal_* series in AMP; no recording-rule ruler required.

dashboards/base/platform/portal.yaml has five rows:
- API SLO & error budget (99.9%/30d): inline 30d availability vs objective,
  error budget remaining, and fast/slow burn-rate stats (1h/6h, thresholds at
  14.4x and 6x) computed from the http_request_duration_seconds histogram's
  5xx/total count ratio.
- API golden signals: request rate by status, 5xx error ratio, latency
  p50/p95/p99, slowest routes by p99, in-flight + pgxpool saturation, and
  acquire-wait rate.
- tofu/terragrunt runs: completion rate by operation+status and run duration
  p50/p95 by operation. This is the core infra-execution health signal.
- worker River jobs: jobs by state (backlog/running/trouble) and job
  error/panic rate by kind.
- watcher loops: time-since-last-tick per loop (staleness), tick p95, and
  panic rate. This is the tenant/cluster watcher liveness signal.

Registered in dashboards/base/kustomization.yaml. The embedded-JSON ignore in
.yamllint.yaml is generalized to cover authored app dashboards (portal.yaml),
matching the existing agent-* exclusion; the JSON is validated by kustomize
build + the GrafanaDashboard schema. kustomize build green.
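The burn-rate thresholds in the SLO row follow from simple arithmetic on the 99.9%/30d objective. The sketch below is illustrative only: the function names are hypothetical and not part of the dashboard, and it assumes the error ratio is the 5xx/total ratio described above. It shows that a 14.4x burn held for 1h spends about 2% of the 30-day budget, and a 6x burn held for 6h about 5%.

```python
# Illustrative sketch of the SLO arithmetic; all names here are hypothetical.

SLO = 0.999              # availability objective (99.9%)
PERIOD_HOURS = 30 * 24   # 30-day SLO window, in hours


def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo)


def budget_consumed(burn: float, window_hours: float,
                    period_hours: float = PERIOD_HOURS) -> float:
    """Fraction of the period's error budget spent if `burn` holds for the window."""
    return burn * window_hours / period_hours


def budget_remaining(availability: float, slo: float = SLO) -> float:
    """Fraction of the error budget left, given availability measured over the period."""
    return 1.0 - (1.0 - availability) / (1.0 - slo)


if __name__ == "__main__":
    print(f"fast burn, 1h at 14.4x: {budget_consumed(14.4, 1):.2%} of budget")
    print(f"slow burn, 6h at 6x:    {budget_consumed(6, 6):.2%} of budget")
    print(f"budget left at 99.95%:  {budget_remaining(0.9995):.2%}")
```

A measured 30d availability below the objective makes `budget_remaining` negative, which is one way an "error budget remaining" stat can signal an already-breached SLO.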
CI Results
All validations passed.
Wires portal's burn-rate + liveness alerting onto the external Amazon Managed
Grafana via the grafana-operator, so the SLO has teeth in prod (the EKS clusters
have no in-cluster Prometheus ruler — alerts are Grafana-managed, evaluated by
AMG against the AMP datasource).
- alerting/folder.yaml — a GrafanaFolder ("SLO & burn-rate alerts") the rule
groups attach to.
- alerting/portal.yaml — GrafanaAlertRuleGroup with four rules:
- PortalErrorBudgetFastBurn (page) — dual-window 1h & 5m burn > 14.4x the
99.9% objective, encoded as a `> bool` product so one instant query yields 1/0.
- PortalErrorBudgetSlowBurn (page) — dual-window 6h & 30m burn > 6x.
- PortalWatcherStalled (page) — most-stale watcher loop hasn't ticked in >15m.
- PortalWorkerJobErrorsHigh (ticket) — River job error rate sustained >0.1/s.
- datasources/prometheus.yaml — pinned uid `managed-prometheus` so alert rules
reference the AMP datasource deterministically instead of relying on isDefault.
- kustomization + .yamllint (alert exprs carry long PromQL) updated.
Prereq: the AMG workspace must have Grafana alerting enabled; the operator
provisions rules via the Alerting Provisioning API. Contact-point routing is a
follow-up. kustomize build green.
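The dual-window `> bool` product can be sketched as follows. This is a rough illustration, not the expression shipped in alerting/portal.yaml: the helper names are hypothetical, and the `status=~"5.."` label matcher is an assumption about how the histogram labels 5xx responses. Each comparison yields 1 or 0, so the product is 1 only when both the long and the short window exceed the threshold.

```python
# Hypothetical helpers; the label matcher below is an assumption, not taken
# from the actual rule file.

METRIC = "http_request_duration_seconds_count"


def error_ratio(window: str) -> str:
    """PromQL for the 5xx/total request ratio over `window` (e.g. "1h")."""
    return (
        f'sum(rate({METRIC}{{status=~"5.."}}[{window}]))'
        f" / sum(rate({METRIC}[{window}]))"
    )


def dual_window_expr(long_w: str, short_w: str, factor: float,
                     budget: float = 0.001) -> str:
    """One instant query that evaluates to 1 when both windows burn too fast."""
    threshold = f"({factor} * {budget})"
    return (
        f"(({error_ratio(long_w)}) > bool {threshold})"
        f" * (({error_ratio(short_w)}) > bool {threshold})"
    )


def fires(long_burn: float, short_burn: float, factor: float) -> int:
    """Plain-Python equivalent of the product: 1 only if both windows exceed."""
    return int(long_burn > factor) * int(short_burn > factor)


if __name__ == "__main__":
    print(dual_window_expr("1h", "5m", 14.4))   # fast burn
    print(dual_window_expr("6h", "30m", 6))     # slow burn
```

The short window keeps the alert from staying active long after the errors stop: once the 5m (or 30m) ratio drops back under the threshold, the product returns to 0 even though the long window is still elevated.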
CI Results
All validations passed.
What

portal (the ops control-plane: API server + River worker) emits a rich
`portal_*` metric set, but until now nothing rendered it or alerted on it (the
audit graded it D). This makes portal fully SRE-observable: an authored
dashboard and Grafana-managed SLO/burn-rate alerts, both reconciled onto the
external Amazon Managed Grafana by the grafana-operator.

Dashboard: `dashboards/base/platform/portal.yaml`

A `GrafanaDashboard` CR (self-contained PromQL over the `portal_*` series in
AMP; no ruler needed). Five rows keyed on portal's own nouns: API SLO & error
budget, API golden signals, tofu/terragrunt runs, worker River jobs, and
watcher loops.

Alerting: `dashboards/base/alerting/`

Grafana-managed (the EKS clusters have no in-cluster Prometheus ruler),
evaluated by AMG against AMP:
- `folder.yaml`: a `GrafanaFolder` for SLO alerts.
- `portal.yaml`: a `GrafanaAlertRuleGroup` with dual-window fast burn
  (1h & 5m > 14.4×, page), slow burn (6h & 30m > 6×, page), watcher-stall
  (page), and worker-job-errors (ticket). Dual-window is encoded as a `> bool`
  product so one instant query yields 1/0.
- `datasources/prometheus.yaml`: pinned datasource uid `managed-prometheus`
  for deterministic alert refs.

Notes

- `observability-slo` standard (nanohype#123).
- `kustomize build dashboards/base` green; embedded JSON + alert conditions
  validated.
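
The watcher-stall condition (most-stale loop has not ticked in more than 15 minutes) reduces to a max over per-loop staleness. A minimal sketch, assuming each loop exposes a last-tick Unix timestamp; the loop names and function names here are hypothetical and do not come from the rule file.

```python
# Hypothetical sketch of the watcher-stall check; names are illustrative.

STALL_AFTER_SECONDS = 15 * 60  # the >15m threshold from the alert rule


def most_stale_seconds(last_tick_unix: dict[str, float], now: float) -> float:
    """Seconds since the last tick of whichever loop ticked longest ago."""
    if not last_tick_unix:
        return 0.0  # no loops reported; treat as not stale in this sketch
    return max(now - tick for tick in last_tick_unix.values())


def watcher_stalled(last_tick_unix: dict[str, float], now: float) -> bool:
    """True when the most-stale loop exceeds the stall threshold."""
    return most_stale_seconds(last_tick_unix, now) > STALL_AFTER_SECONDS


if __name__ == "__main__":
    now = 10_000.0
    ticks = {"tenant": 9_990.0, "cluster": 9_000.0}  # hypothetical loop names
    print(watcher_stalled(ticks, now))
```

Taking the max means a single stuck loop is enough to page, even when the other loops are ticking normally.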