feat(observability): SRE dashboard + Grafana-managed SLO alerts for portal by stxkxs · Pull Request #48 · nanohype/eks-gitops

stxkxs · 2026-06-23T01:47:48Z

What

portal (the ops control-plane: API server + River worker) emits a rich portal_* metric set, but nothing rendered it and nothing alerted on it (the audit graded it D). This PR makes portal fully SRE-observable with an authored dashboard and Grafana-managed SLO/burn-rate alerts, both reconciled onto the external Amazon Managed Grafana by the grafana-operator.

Dashboard — `dashboards/base/platform/portal.yaml`

A GrafanaDashboard CR (self-contained PromQL over the portal_* series in AMP — no ruler needed). Five rows keyed on portal's own nouns:

API SLO & error budget (99.9%/30d) — 30d availability, budget remaining, fast/slow burn-rate stats.
API golden signals — rate by status, 5xx ratio, latency p50/p95/p99, slowest routes, in-flight + pgxpool saturation, acquire-wait rate.
tofu/terragrunt runs — completion rate by operation+status, duration p50/p95.
worker River jobs — jobs by state, error/panic rate.
watcher loops — time-since-last-tick, tick p95, panic rate.
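The numbers in the SLO row come down to a few lines of arithmetic. The following is a minimal sketch, assuming availability is the non-5xx share of requests over the window; the function names and the sample counts are hypothetical and are not taken from the dashboard's PromQL.

```python
SLO = 0.999          # 99.9% objective shown on the dashboard row
WINDOW_DAYS = 30     # 30d SLO window


def availability(total: float, errors_5xx: float) -> float:
    """Fraction of requests that did not return a 5xx."""
    return 1.0 if total == 0 else 1.0 - errors_5xx / total


def budget_remaining(total: float, errors_5xx: float, slo: float = SLO) -> float:
    """Share of the error budget still unspent (1.0 = untouched, < 0 = overspent)."""
    allowed = (1.0 - slo) * total  # number of errors the objective tolerates
    return 1.0 if allowed == 0 else 1.0 - errors_5xx / allowed


def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo)


def budget_minutes(slo: float = SLO, days: int = WINDOW_DAYS) -> float:
    """Error budget expressed as minutes of full outage over the window."""
    return (1.0 - slo) * days * 24 * 60


# Hypothetical sample: 1,000,000 requests in the window, 400 of them 5xx.
print(availability(1_000_000, 400))
print(budget_remaining(1_000_000, 400))
print(burn_rate(0.0144))
print(budget_minutes())
```

With these sample counts, roughly 40% of the budget is spent. A 1.44% error ratio corresponds to a burn rate of about 14.4, which is the fast-burn threshold used by the alerts.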

Alerting — `dashboards/base/alerting/`

Grafana-managed (the EKS clusters have no in-cluster Prometheus ruler), evaluated by AMG against AMP:

folder.yaml — a GrafanaFolder for SLO alerts.
portal.yaml — a GrafanaAlertRuleGroup: dual-window fast burn (1h&5m > 14.4×, page), slow burn (6h&30m > 6×, page), watcher-stall (page), worker-job-errors (ticket). Dual-window encoded as a > bool product so one instant query yields 1/0.
datasources/prometheus.yaml — pinned datasource uid managed-prometheus for deterministic alert refs.
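To illustrate the dual-window encoding, here is a small sketch of how a product of two 1/0 comparisons behaves like a logical AND, and how much of the 30d budget each threshold represents if sustained for its long window. It only models the arithmetic; `gt_bool` mimics the behaviour of PromQL's `> bool` comparison, and all function names and sample burn values are hypothetical.

```python
WINDOW_HOURS = 30 * 24  # 30d SLO window = 720h


def gt_bool(value: float, threshold: float) -> int:
    """Mimic PromQL `value > bool threshold`: 1 when true, 0 otherwise."""
    return 1 if value > threshold else 0


def dual_window_fires(long_burn: float, short_burn: float, threshold: float) -> int:
    """Product of two 1/0 comparisons: 1 only when both windows exceed the threshold."""
    return gt_bool(long_burn, threshold) * gt_bool(short_burn, threshold)


def budget_spent(burn: float, hours: float) -> float:
    """Fraction of the 30d budget consumed by sustaining `burn` for `hours`."""
    return burn * hours / WINDOW_HOURS


# Fast burn: the 1h and 5m windows must both exceed 14.4x.
print(dual_window_fires(20.0, 18.0, 14.4))  # 1: both windows are burning
print(dual_window_fires(20.0, 3.0, 14.4))   # 0: the short window has recovered

# Budget consumed if each threshold is sustained for its long window.
print(budget_spent(14.4, 1))  # about 2% of the 30d budget
print(budget_spent(6.0, 6))   # about 5% of the 30d budget
```

The short window is what lets the alert clear quickly once errors stop, while the long window keeps brief spikes from paging.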

Notes

Follows the observability-slo standard (nanohype#123).
kustomize build dashboards/base green; embedded JSON + alert conditions validated.
Prereq: AMG workspace must have Grafana alerting enabled. Contact-point routing is a follow-up.

portal (the ops control-plane: API server + River worker) emitted a rich portal_* metric set but nothing rendered it. Adds an authored GrafanaDashboard CR the grafana-operator reconciles onto the external Amazon Managed Grafana, alongside the agent-* persona boards. Self-contained PromQL over the portal_* series in AMP; no recording-rule ruler required.

dashboards/base/platform/portal.yaml, five rows:

- API SLO & error budget (99.9%/30d): inline 30d availability vs objective, error budget remaining, and fast/slow burn-rate stats (1h/6h, thresholds at 14.4x and 6x) computed from the http_request_duration_seconds histogram's 5xx/total count ratio.
- API golden signals: request rate by status, 5xx error ratio, latency p50/p95/p99, slowest routes by p99, in-flight + pgxpool saturation, and acquire-wait rate.
- tofu/terragrunt runs: completion rate by operation+status and run duration p50/p95 by operation (the core infra-execution health signal).
- worker River jobs: jobs by state (backlog/running/trouble) and job error/panic rate by kind.
- watcher loops: time-since-last-tick per loop (staleness), tick p95, and panic rate (the tenant/cluster watcher liveness signal).

Registered in dashboards/base/kustomization.yaml. .yamllint.yaml's embedded-JSON ignore generalized to cover authored app dashboards (portal.yaml), matching the existing agent-* exclusion; the JSON is validated by kustomize build + the GrafanaDashboard schema. kustomize build green.

github-actions · 2026-06-23T01:48:18Z

CI Results

Check	Status
YAML Lint	✅

Environment	Kustomize Build
dev	✅
staging	✅
production	✅

All validations passed.

Wires portal's burn-rate + liveness alerting onto the external Amazon Managed Grafana via the grafana-operator, so the SLO has teeth in prod (the EKS clusters have no in-cluster Prometheus ruler; alerts are Grafana-managed, evaluated by AMG against the AMP datasource).

- alerting/folder.yaml: a GrafanaFolder ("SLO & burn-rate alerts") the rule groups attach to.
- alerting/portal.yaml: GrafanaAlertRuleGroup with four rules:
  - PortalErrorBudgetFastBurn (page): dual-window 1h & 5m burn > 14.4x the 99.9% objective, encoded as a `> bool` product so one instant query yields 1/0.
  - PortalErrorBudgetSlowBurn (page): dual-window 6h & 30m burn > 6x.
  - PortalWatcherStalled (page): most-stale watcher loop hasn't ticked in >15m.
  - PortalWorkerJobErrorsHigh (ticket): River job error rate sustained >0.1/s.
- datasources/prometheus.yaml: pinned uid `managed-prometheus` so alert rules reference the AMP datasource deterministically instead of relying on isDefault.
- kustomization + .yamllint (alert exprs carry long PromQL) updated.

Prereq: the AMG workspace must have Grafana alerting enabled; the operator provisions rules via the Alerting Provisioning API. Contact-point routing is a follow-up. kustomize build green.
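The two non-SLO rules reduce to simple threshold checks. The sketch below is illustrative only: it assumes each watcher loop exposes a last-tick Unix timestamp and that job errors are a monotonically increasing counter. The loop names, function names, and sample values are hypothetical, and the "sustained" hold duration of the real rule is not modelled.

```python
STALL_SECONDS = 15 * 60   # PortalWatcherStalled: no tick for more than 15m
ERROR_RATE_LIMIT = 0.1    # PortalWorkerJobErrorsHigh: errors per second


def most_stale_seconds(last_tick: dict[str, float], now: float) -> float:
    """Age in seconds of the loop that ticked longest ago (0.0 if no loops)."""
    return max((now - ts for ts in last_tick.values()), default=0.0)


def watcher_stalled(last_tick: dict[str, float], now: float) -> bool:
    """True when the most-stale loop has not ticked within the stall window."""
    return most_stale_seconds(last_tick, now) > STALL_SECONDS


def error_rate(count_start: float, count_end: float, seconds: float) -> float:
    """Average per-second increase of an error counter over an interval."""
    return (count_end - count_start) / seconds


def job_errors_high(count_start: float, count_end: float, seconds: float) -> bool:
    """True when the average error rate exceeds the ticket threshold."""
    return error_rate(count_start, count_end, seconds) > ERROR_RATE_LIMIT


# Hypothetical loops: "tenant" just ticked, "cluster" last ticked 950s ago.
now = 10_000.0
ticks = {"tenant": 10_000.0, "cluster": 9_050.0}
print(watcher_stalled(ticks, now))      # True: 950s exceeds the 900s window
print(job_errors_high(100, 160, 300))   # True: 0.2 errors/s over 5 minutes
```

Taking the maximum staleness across loops means a single stuck loop is enough to page, even while the others keep ticking.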

github-actions · 2026-06-23T01:55:32Z

CI Results

Check	Status
YAML Lint	✅

Environment	Kustomize Build
dev	✅
staging	✅
production	✅

All validations passed.

stxkxs changed the title ~~feat(dashboards): authored SRE dashboard for portal~~ feat(observability): SRE dashboard + Grafana-managed SLO alerts for portal Jun 23, 2026


stxkxs wants to merge 2 commits into `main` from `portal-sre-dashboard`
