
feat(hub): full observability for the eks-fleet hub (metrics + logs + traces + dashboards) #54

Merged
stxkxs merged 3 commits into main from hub-observability-env
Jun 24, 2026
@stxkxs stxkxs commented Jun 23, 2026

What

Wires the eks-fleet hub into the observability fabric so it runs the full three-pillar stack (metrics → AMP, logs → Loki, traces → Tempo) plus dashboards, closing #50. The hub registers as environment=hub, and the appsets had no env filter, so a hub cluster matched every workload appset and ~13 of them would error on a missing values-hub. The hub's addon set is curated per the agreed "exclude + curate" approach.

Pairs with landing-zone#57 (AMP/AMG + the hub-eks-* IRSA roles).

Hub env

  • environments/hub/cluster-config.yaml — the env ConfigMap (cluster_name: hub-eks, Karpenter off, observability retention).
  • dashboards/overlays/hub/kustomization.yaml — patches the Grafana CR + AMP datasource to the hub AMG/AMP endpoints.
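
For orientation, a minimal sketch of what such an overlay could look like, assuming the grafana-operator v5 `GrafanaDatasource` CRD. The base path, the datasource name `amp`, and the `<REGION>` placeholder are hypothetical and not taken from this PR; `ws-PLACEHOLDER` follows the placeholder convention mentioned in the follow-up commit.

```yaml
# dashboards/overlays/hub/kustomization.yaml (illustrative sketch, not the PR's file)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base              # assumed base location
patches:
  # Point the AMP datasource at the hub workspace endpoint.
  - target:
      group: grafana.integreatly.org
      version: v1beta1
      kind: GrafanaDatasource
      name: amp             # hypothetical resource name
    patch: |-
      - op: replace
        path: /spec/datasource/url
        value: https://aps-workspaces.<REGION>.amazonaws.com/workspaces/ws-PLACEHOLDER
```

A JSON6902 patch keeps the overlay to a single field override, so the base datasource definition stays shared across environments.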

Full observability (kept in addons-observability): values-hub.yaml for grafana-agent (the collector: metrics → AMP via the hub-eks-grafana-agent-amp IRSA role, logs → Loki, traces → Tempo), loki, tempo, kube-state-metrics, opencost, and grafana-operator. Loki and Tempo use gp3 PVCs (no buckets); each addon binds its own hub-eks-<addon> IRSA role.
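
As a rough illustration of the IRSA binding, the fragment below annotates the collector's service account with the role named above. It assumes the chart exposes the common `serviceAccount.annotations` values key; `<ACCOUNT_ID>` is a placeholder, and the actual key layout depends on the chart version in use.

```yaml
# values-hub.yaml for grafana-agent (sketch; key layout is an assumption)
serviceAccount:
  create: true
  annotations:
    # IRSA: pods using this service account assume the role minted by landing-zone.
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/hub-eks-grafana-agent-amp
```

The same pattern would apply to the other addons, each pointing at its own hub-eks-<addon> role.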

Bootstrap deps the hub keeps

values-hub.yaml for cert-manager, external-secrets (the AMG-token ExternalSecret chain), metrics-server, prometheus-operator-crds, reloader; overlays/hub/ for storage-classes (gp3 for the Loki/Tempo PVCs), priority-classes, portal-reader.
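
A gp3 storage class for the Loki/Tempo PVCs could look roughly like the following. This is a sketch assuming the AWS EBS CSI driver is installed on the hub; the class name and the encryption and expansion settings are illustrative choices, not values confirmed by this PR.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3                      # assumed class name
provisioner: ebs.csi.aws.com     # AWS EBS CSI driver
parameters:
  type: gp3
  encrypted: "true"              # illustrative default
volumeBindingMode: WaitForFirstConsumer   # provision in the AZ where the pod schedules
allowVolumeExpansion: true
reclaimPolicy: Delete
```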

Exclusions

environment NotIn [hub] on the 9 workload appsets the hub doesn't run (networking, security, operations ×2, ai-platform, argo-platform, apps-tenants, druid-tenants, kyverno-policies). No effect on dev/staging/prod (none are hub); accelerators + agent-operator are already gated by the eks-agent-platform/enabled label the hub lacks.
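
The exclusion can be sketched as a label selector on the cluster generator, assuming these are Argo CD ApplicationSets. The appset name, repo URL, paths, and namespaces below are hypothetical; only the `environment NotIn [hub]` expression comes from this PR.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: addons-networking          # hypothetical name
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
    - clusters:
        selector:
          matchExpressions:
            # Skip any cluster registered with environment=hub.
            - key: environment
              operator: NotIn
              values: [hub]
  template:
    metadata:
      name: 'networking-{{.name}}'
    spec:
      project: default
      source:
        repoURL: https://example.com/org/repo.git    # placeholder
        targetRevision: main
        path: addons/networking                      # hypothetical path
        helm:
          valueFiles:
            # Resolves to values-dev.yaml, values-staging.yaml, and so on.
            - 'values-{{index .metadata.labels "environment"}}.yaml'
      destination:
        server: '{{.server}}'
        namespace: networking
```

Note that `NotIn` follows Kubernetes label-selector semantics, so a cluster with no `environment` label at all would still match the generator.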

Deploy-time prerequisite

After the AMG workspace exists (landing-zone#57): run aws grafana create-workspace-service-account-token, then store the resulting token as Secrets-Manager key eks-grafana-token (JSON, key token). The grafana-operator's ExternalSecret reads it to push dashboards.
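
A minimal sketch of the consuming side, assuming the external-secrets `v1beta1` API and a cluster-wide store backed by Secrets Manager. The resource names, namespace, and store name are hypothetical; only the remote key `eks-grafana-token` and its `token` property come from the step above.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-amg-token          # hypothetical
  namespace: grafana-operator      # hypothetical
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager      # hypothetical store name
  target:
    name: grafana-amg-token        # Kubernetes Secret created by the operator
  data:
    - secretKey: token
      remoteRef:
        key: eks-grafana-token     # Secrets Manager secret name
        property: token            # JSON key inside the secret
```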

Validation & quality-check

yamllint and kustomize build are green for all 4 new overlays, and the CI matrix is extended to build the hub overlays. Independently quality-checked (Systems A / Architecture A / Security A− / Code-quality A; Patterns A− / Docs B+ / Consistency B+) and verified non-hollow: every hub-eks-* IRSA name matches what landing-zone mints, every kept appset's addons have a hub variant (no missingkey breaks), the exclusions don't touch existing envs, and all three pillars are present. The check's findings (CI matrix, placeholder alignment, the env doc) are fixed in this PR.

Once merged, the fleet-vend dashboard (#53) is unblocked.

stxkxs added 2 commits June 23, 2026 13:59
… traces + dashboards)

Wires the hub into the observability fabric so its Crossplane/provider metrics
reach AMP and the dashboards (portal, operator, fleet-vend) render — closing #50.
The hub registers as environment=hub; the appsets had no env filter, so a hub
cluster matched every workload appset and ~13 errored on a missing values-hub.

Curated per the "exclude + curate" decision — the hub runs the full three-pillar
observability stack + its bootstrap deps, and is excluded from the workload catalog.

Hub env:
- environments/hub/cluster-config.yaml — the env ConfigMap (cluster_name hub-eks,
  Karpenter off, observability retention).
- dashboards/overlays/hub/kustomization.yaml — patches the Grafana CR + AMP
  datasource to the hub AMG/AMP endpoints.

Full observability (kept in addons-observability) — values-hub.yaml for:
- grafana-agent (collector: metrics→AMP via hub-eks-grafana-agent-amp IRSA, logs→Loki,
  traces→Tempo), loki, tempo, kube-state-metrics, opencost, grafana-operator. Loki/
  Tempo use gp3 PVCs (no buckets); each binds its hub-eks-<addon> IRSA role.

Bootstrap deps the hub keeps — values-hub.yaml for cert-manager, external-secrets
(the AMG-token ExternalSecret chain), metrics-server, prometheus-operator-crds,
reloader; overlays/hub for storage-classes (gp3 for the Loki/Tempo PVCs),
priority-classes, portal-reader.

Exclusions — environment NotIn [hub] on the workload appsets the hub does not run:
networking, security, operations(+kustomize), ai-platform, argo-platform,
apps-tenants, druid-tenants, kyverno-policies. (No effect on dev/staging/prod;
accelerators + agent-operator are already gated by the eks-agent-platform/enabled
label the hub lacks.)

Depends on landing-zone#57 (AMP/AMG + the hub-eks-* IRSA roles) and a deploy-time
step: create the AMG service-account token as Secrets-Manager key eks-grafana-token.
yamllint + kustomize build of the new overlays green.
…ceholder + doc

Quality-check follow-ups on the hub env:
- .github/workflows/ci.yml + Taskfile.yaml: add 'hub' to the validate matrix /
  pr-summary / help text so the hub overlays are kustomize-built in CI (they were
  silently skipped — the build globs */overlays/<matrix.env>).
- grafana-agent/values-hub.yaml: AMP workspace placeholder ws-PLACEHOLDER (match the
  dev/staging/prod siblings) instead of ws-PLACEHOLDER-hub.
- docs/configuration/environments.md: add the hub row to the environment table.
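
The matrix change described above might look roughly like this workflow fragment. It is a sketch only: the job and step names are hypothetical, and it assumes `kustomize` is already on the runner's PATH.

```yaml
# .github/workflows/ci.yml (fragment; job and step names are hypothetical)
jobs:
  validate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        env: [dev, staging, production, hub]   # 'hub' is the new entry
    steps:
      - uses: actions/checkout@v4
      - name: Kustomize build overlays
        run: |
          set -euo pipefail
          # Build every overlay directory that exists for this environment.
          for dir in */overlays/${{ matrix.env }}; do
            [ -d "$dir" ] || continue
            kustomize build "$dir" > /dev/null
          done
```

Because the glob is driven by the matrix value, an environment missing from the list is never built, which is how the hub overlays could be skipped silently before this fix.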
@github-actions

CI Results

Check: YAML Lint
Kustomize Build per environment: dev, staging, production, hub

All validations passed.

veleroEnabled/goldilocksEnabled/hubbleUiEnabled/trivyAdmissionEnabled and the
karpenter* + loki/tempoRetentionDays keys are read by nothing — addon execution
is gated by appset membership (the hub is excluded via NotIn [hub]), and Loki/
Tempo retention is set authoritatively in their values-hub.yaml. They were
keep-in-step copies from the workload template, defensive belt-and-suspenders on
top of the real exclusion mechanism. Keep only the load-bearing identity keys.
@github-actions

CI Results

Check: YAML Lint
Kustomize Build per environment: dev, staging, production, hub

All validations passed.

@stxkxs stxkxs merged commit c6e89a8 into main Jun 24, 2026
8 checks passed
@stxkxs stxkxs deleted the hub-observability-env branch June 24, 2026 00:46