feat(observability): SRE dashboard + Grafana-managed SLO alerts for portal#48

Open: stxkxs wants to merge 2 commits into main from portal-sre-dashboard


@stxkxs stxkxs commented Jun 23, 2026

What

portal (the ops control-plane: API server + River worker) emits a rich portal_* metric set, but nothing rendered it or alerted on it (the audit graded it D). This PR makes portal fully SRE-observable: an authored dashboard and Grafana-managed SLO/burn-rate alerts, both reconciled onto the external Amazon Managed Grafana by the grafana-operator.

Dashboard — dashboards/base/platform/portal.yaml

A GrafanaDashboard CR (self-contained PromQL over the portal_* series in AMP — no ruler needed). Five rows keyed on portal's own nouns:

  • API SLO & error budget (99.9%/30d) — 30d availability, budget remaining, fast/slow burn-rate stats.
  • API golden signals — rate by status, 5xx ratio, latency p50/p95/p99, slowest routes, in-flight + pgxpool saturation, acquire-wait rate.
  • tofu/terragrunt runs — completion rate by operation+status, duration p50/p95.
  • worker River jobs — jobs by state, error/panic rate.
  • watcher loops — time-since-last-tick, tick p95, panic rate.

Alerting — dashboards/base/alerting/

Grafana-managed (the EKS clusters have no in-cluster Prometheus ruler), evaluated by AMG against AMP:

  • folder.yaml — a GrafanaFolder for SLO alerts.
  • portal.yaml — a GrafanaAlertRuleGroup with four rules: dual-window fast burn (1h & 5m > 14.4×, page), slow burn (6h & 30m > 6×, page), watcher-stall (page), worker-job-errors (ticket). Each dual-window condition is encoded as a product of `> bool` comparisons, so one instant query yields 1 or 0.
  • datasources/prometheus.yaml — pinned datasource uid managed-prometheus for deterministic alert refs.
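
For reviewers unfamiliar with the 14.4× and 6× thresholds: a burn rate of 1× spends the error budget exactly over the 30d SLO window, so a burn rate held for a given number of hours spends `burn_rate * hours / 720` of the budget. A minimal sketch of that arithmetic follows; the function name is illustrative and does not appear in this repo.

```python
SLO_WINDOW_HOURS = 30 * 24  # 30d SLO window, as stated in the description


def budget_fraction_spent(burn_rate: float, window_hours: float) -> float:
    """Fraction of the 30d error budget consumed if burn_rate holds for window_hours."""
    return burn_rate * window_hours / SLO_WINDOW_HOURS


# Fast-burn rule: 14.4x sustained for 1h spends roughly 2% of the budget.
fast = budget_fraction_spent(14.4, 1)
# Slow-burn rule: 6x sustained for 6h spends roughly 5% of the budget.
slow = budget_fraction_spent(6.0, 6)
```

Read this way, the fast rule pages when about 2% of the monthly budget is being spent within an hour, and the slow rule when about 5% is being spent within six hours.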

Notes

  • Follows the observability-slo standard (nanohype#123).
  • kustomize build dashboards/base green; embedded JSON + alert conditions validated.
  • Prereq: AMG workspace must have Grafana alerting enabled. Contact-point routing is a follow-up.

portal (the ops control-plane: API server + River worker) emitted a rich
portal_* metric set but nothing rendered it. Adds an authored GrafanaDashboard
CR the grafana-operator reconciles onto the external Amazon Managed Grafana,
alongside the agent-* persona boards. Self-contained PromQL over the portal_*
series in AMP — no recording-rule ruler required.

dashboards/base/platform/portal.yaml — five rows:
- API SLO & error budget (99.9%/30d): inline 30d availability vs objective,
  error budget remaining, and fast/slow burn-rate stats (1h/6h, thresholds at
  14.4x and 6x) computed from the http_request_duration_seconds histogram's
  5xx/total count ratio.
- API golden signals: request rate by status, 5xx error ratio, latency
  p50/p95/p99, slowest routes by p99, in-flight + pgxpool saturation, and
  acquire-wait rate.
- tofu/terragrunt runs: completion rate by operation+status and run duration
  p50/p95 by operation — the core infra-execution health signal.
- worker River jobs: jobs by state (backlog/running/trouble) and job
  error/panic rate by kind.
- watcher loops: time-since-last-tick per loop (staleness), tick p95, and panic
  rate — the tenant/cluster watcher liveness signal.
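
The dashboard's PromQL is not shown in this conversation, so as a rough guide to how the SLO row's stats relate to each other, here is a sketch under stated assumptions. It takes 5xx and total request counts over a window as plain numbers. The function and key names are hypothetical, and treating a window with no traffic as fully available is an assumption, not something the PR states.

```python
SLO_OBJECTIVE = 0.999  # 99.9% objective from the commit message


def slo_stats(errors_5xx: float, total: float) -> dict:
    """Derive availability, burn rate, and budget remaining from request counts."""
    # Assumption: no traffic in the window counts as no errors.
    error_ratio = errors_5xx / total if total > 0 else 0.0
    allowed_error_ratio = 1.0 - SLO_OBJECTIVE
    burn_rate = error_ratio / allowed_error_ratio
    return {
        "availability": 1.0 - error_ratio,
        "burn_rate": burn_rate,
        # Negative once the budget for the window is overspent.
        "budget_remaining": 1.0 - burn_rate,
    }


# Example: 400 5xx responses out of 1,000,000 requests in the window.
stats = slo_stats(400, 1_000_000)
```

In this example the error ratio is 0.04%, which is a burn rate of about 0.4 against the 0.1% allowance and leaves about 60% of the budget.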

Registered in dashboards/base/kustomization.yaml. .yamllint.yaml's embedded-JSON
ignore generalized to cover authored app dashboards (portal.yaml), matching the
existing agent-* exclusion — the JSON is validated by kustomize build + the
GrafanaDashboard schema. kustomize build green.
@github-actions

CI Results

Checks: YAML Lint; Kustomize Build for the dev, staging, and production environments.

All validations passed.

Wires portal's burn-rate + liveness alerting onto the external Amazon Managed
Grafana via the grafana-operator, so the SLO has teeth in prod (the EKS clusters
have no in-cluster Prometheus ruler — alerts are Grafana-managed, evaluated by
AMG against the AMP datasource).

- alerting/folder.yaml — a GrafanaFolder ("SLO & burn-rate alerts") the rule
  groups attach to.
- alerting/portal.yaml — GrafanaAlertRuleGroup with four rules:
  - PortalErrorBudgetFastBurn (page) — dual-window 1h & 5m burn > 14.4x the
    99.9% objective, encoded as a `> bool` product so one instant query yields 1/0.
  - PortalErrorBudgetSlowBurn (page) — dual-window 6h & 30m burn > 6x.
  - PortalWatcherStalled (page) — most-stale watcher loop hasn't ticked in >15m.
  - PortalWorkerJobErrorsHigh (ticket) — River job error rate sustained >0.1/s.
- datasources/prometheus.yaml — pinned uid `managed-prometheus` so alert rules
  reference the AMP datasource deterministically instead of relying on isDefault.
- kustomization + .yamllint (alert exprs carry long PromQL) updated.
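
The `> bool` product used by the two burn-rate rules can be modelled outside PromQL. In PromQL the `bool` modifier makes a comparison return 1 or 0 instead of filtering series, so multiplying two such comparisons acts as a logical AND. The sketch below mirrors that logic in Python with hypothetical function names; the rule group's actual expressions are not reproduced here.

```python
def gt_bool(value: float, threshold: float) -> int:
    """Mimic PromQL's `value > bool threshold`: 1 if true, else 0."""
    return 1 if value > threshold else 0


def dual_window_fires(long_burn: float, short_burn: float, threshold: float) -> int:
    """1 only when both the long and the short window exceed the threshold."""
    return gt_bool(long_burn, threshold) * gt_bool(short_burn, threshold)


# Fast-burn shape: long window = 1h, short window = 5m, threshold = 14.4.
both_hot = dual_window_fires(20.0, 16.0, 14.4)       # both windows above threshold
recovered = dual_window_fires(20.0, 3.0, 14.4)       # short window has cooled off
brief_spike = dual_window_fires(3.0, 20.0, 14.4)     # spike too short to move the 1h window
```

The short window is what lets the alert clear quickly once errors stop, while the long window keeps a brief spike from paging.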

Prereq: the AMG workspace must have Grafana alerting enabled; the operator
provisions rules via the Alerting Provisioning API. Contact-point routing is a
follow-up. kustomize build green.
@stxkxs stxkxs changed the title feat(dashboards): authored SRE dashboard for portal feat(observability): SRE dashboard + Grafana-managed SLO alerts for portal Jun 23, 2026