
feat(observability): GrafanaDashboard CR + SLO row, export to grafana-agent #8

Merged
stxkxs merged 1 commit into main from o11y-prod-shape on Jun 24, 2026


@stxkxs stxkxs commented Jun 24, 2026

Member

Brings the tenant's observability to its correct prod shape (grafana-operator + grafana-agent → AMP/AMG, not a kube-prometheus sidecar or cluster otel-collector).

  • Dashboard → GrafanaDashboard CR. The vendored tenant-chart-base helper now emits the CR (instanceSelector dashboards: external) that the grafana-operator reconciles onto Amazon Managed Grafana. The vendored library is synced to the canonical nanohype skeleton (adds _servicemonitor.tpl; _helpers gains commonLabels merging for the tagging-governance labels).
  • SLO row. The board leads with a crawl-availability row: availability (30d), error-budget remaining, and burn rate, computed inline from the real competitive_intelligence_crawl_sources_total{outcome} counter, so it is self-contained (no recording-rule ruler needed).
  • Telemetry path. OTLP → grafana-agent.monitoring.svc:4318 (forwards traces→Tempo, metrics→AMP, logs→Loki); NetworkPolicy egress opened to the monitoring namespace.
  • Docs. CLAUDE / ARCHITECTURE / RUNBOOK / chart README / Dockerfile / metrics.ts now describe AMP/Tempo/Loki, not Grafana Cloud / Mimir / cluster otel-collector.
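
To make the three SLO-row quantities concrete, here is a minimal sketch of the arithmetic behind them, given good and bad event counts over a window. The helper name, the types, and the 99% target are illustrative assumptions; the PR does not state the SLO target or the `outcome` label values, and the actual panels presumably compute this in PromQL rather than in application code.

```typescript
// Hypothetical helper: names and the target value are illustrative, not from the PR.
interface CrawlCounts {
  good: number; // crawls counted as successful over the window
  bad: number;  // crawls counted as failed over the window
}

interface SloSummary {
  availability: number;         // good / total, in [0, 1]
  errorBudgetRemaining: number; // 1 = untouched, 0 = exhausted, negative = overspent
  burnRate: number;             // 1 = consuming budget exactly at the sustainable rate
}

function summarizeSlo(counts: CrawlCounts, target: number): SloSummary {
  if (!(target > 0 && target < 1)) {
    throw new RangeError("target must be strictly between 0 and 1");
  }
  const total = counts.good + counts.bad;
  if (total === 0) {
    // No traffic in the window: treat the SLO as met and the budget as untouched.
    return { availability: 1, errorBudgetRemaining: 1, burnRate: 0 };
  }
  const errorRatio = counts.bad / total;
  const budget = 1 - target; // allowed error ratio
  return {
    availability: 1 - errorRatio,
    errorBudgetRemaining: 1 - errorRatio / budget,
    burnRate: errorRatio / budget,
  };
}

// Example: 5 failed crawls out of 1000 against an assumed 99% target.
const summary = summarizeSlo({ good: 995, bad: 5 }, 0.99);
console.log(summary);
```

In practice the burn rate is usually evaluated over a shorter window than the 30-day availability figure, so the same function would be fed counts from different windows.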

Validated: helm template emits the CR (no stale ConfigMap) with OTLP→grafana-agent; dashboard JSON parses; helm lint clean. The Loki logs row is a follow-up (the Alloy stream-label selector needs live-cluster verification).
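
For readers unfamiliar with the resource, the shape of the emitted CR can be sketched as a typed object. This is an assumption-laden illustration, not the chart's template: the `grafana.integreatly.org/v1beta1` API group is the one used by grafana-operator v5 and is assumed here, and the function name, metadata values, and dashboard title are hypothetical. Only the `dashboards: external` selector comes from the PR.

```typescript
// Sketch of a GrafanaDashboard manifest; the apiVersion assumes grafana-operator v5.
interface GrafanaDashboardManifest {
  apiVersion: string;
  kind: "GrafanaDashboard";
  metadata: { name: string; namespace: string; labels: Record<string, string> };
  spec: {
    instanceSelector: { matchLabels: Record<string, string> };
    json: string; // dashboard model serialized as a JSON string
  };
}

function buildDashboardManifest(
  name: string,
  namespace: string,
  dashboard: Record<string, unknown>,
  labels: Record<string, string> = {},
): GrafanaDashboardManifest {
  return {
    apiVersion: "grafana.integreatly.org/v1beta1", // assumed operator API version
    kind: "GrafanaDashboard",
    metadata: { name, namespace, labels },
    spec: {
      // Selector from the PR: matches Grafana instances labeled dashboards=external.
      instanceSelector: { matchLabels: { dashboards: "external" } },
      json: JSON.stringify(dashboard),
    },
  };
}

// Hypothetical names, for illustration only.
const manifest = buildDashboardManifest("tenant-dashboard", "tenant", {
  title: "Crawl availability",
  panels: [],
});
console.log(JSON.stringify(manifest, null, 2));
```

A check equivalent to "dashboard JSON parses" is that `JSON.parse(manifest.spec.json)` round-trips without throwing.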

Closes #7.

…O row, export to grafana-agent

Brings the tenant's observability to its correct prod shape — the org runs
grafana-operator + grafana-agent (Alloy) → Amazon Managed Prometheus/Grafana,
not a kube-prometheus-stack sidecar or a cluster otel-collector.

- Dashboard delivery: the vendored tenant-chart-base helper now emits the
  GrafanaDashboard CR (instanceSelector dashboards: external) the grafana-operator
  reconciles onto Amazon Managed Grafana — the portable path that works on both EKS
  and the local kx cluster. The whole vendored library is synced to the canonical
  nanohype skeleton (adds _servicemonitor.tpl; _helpers gains commonLabels merging
  for the tagging-governance labels).
- SLO row: the board leads with a crawl-availability SLO row — availability (30d),
  error-budget remaining, and burn rate — inline over the real good/bad counter
  competitive_intelligence_crawl_sources_total{outcome}, self-contained so it renders
  against AMP with no recording-rule ruler.
- Telemetry path: OTLP exports to the grafana-agent OTLP receiver
  (grafana-agent.monitoring.svc:4318), which forwards traces → Tempo, metrics → AMP,
  logs → Loki; the NetworkPolicy allows egress to the monitoring namespace on 4318.
- Docs: CLAUDE.md / ARCHITECTURE / RUNBOOK / chart README / Dockerfile / metrics.ts
  describe the real backends (AMP/Tempo/Loki via grafana-agent), not Grafana Cloud /
  Mimir / a cluster otel-collector.

The Loki logs row is a follow-up — the exact Alloy stream-label selector needs
verifying on a live cluster. Closes #7.
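
As a quick illustration of the telemetry path, the sketch below derives the per-signal OTLP/HTTP URLs from the grafana-agent receiver address. The `/v1/traces`, `/v1/metrics`, and `/v1/logs` paths are the OTLP/HTTP defaults; the `http://` scheme and the helper name are assumptions, since the PR gives only the host and port.

```typescript
type Signal = "traces" | "metrics" | "logs";

// Hypothetical helper: builds the default OTLP/HTTP per-signal URLs for a base endpoint.
function otlpHttpEndpoints(base: string): Record<Signal, string> {
  const root = base.replace(/\/+$/, ""); // drop trailing slashes before appending paths
  return {
    traces: `${root}/v1/traces`,
    metrics: `${root}/v1/metrics`,
    logs: `${root}/v1/logs`,
  };
}

// Host and port from the PR; the plain-HTTP scheme is an assumption.
const endpoints = otlpHttpEndpoints("http://grafana-agent.monitoring.svc:4318");
console.log(endpoints);
```

OpenTelemetry SDKs typically take only the base URL (for example via `OTEL_EXPORTER_OTLP_ENDPOINT`) and append these paths themselves, so a helper like this is mainly useful for connectivity checks against the NetworkPolicy.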
@stxkxs stxkxs merged commit 15873a5 into main Jun 24, 2026
9 checks passed
@stxkxs stxkxs deleted the o11y-prod-shape branch June 24, 2026 04:36


Development

Successfully merging this pull request may close these issues.

o11y retrofit: dashboard reaches prod (GrafanaDashboard CR), OTLP repoint, SLO row, logs row
