feat(hub): full observability for the eks-fleet hub (metrics + logs + traces + dashboards) #54
Merged
… traces + dashboards)

Wires the hub into the observability fabric so its Crossplane/provider metrics reach AMP and the dashboards (portal, operator, fleet-vend) render, closing #50. The hub registers as environment=hub; the appsets had no env filter, so a hub cluster matched every workload appset and ~13 errored on a missing values-hub. Curated per the "exclude + curate" decision: the hub runs the full three-pillar observability stack + its bootstrap deps, and is excluded from the workload catalog.

Hub env:
- environments/hub/cluster-config.yaml: the env ConfigMap (cluster_name hub-eks, Karpenter off, observability retention).
- dashboards/overlays/hub/kustomization.yaml: patches the Grafana CR + AMP datasource to the hub AMG/AMP endpoints.

Full observability (kept in addons-observability): values-hub.yaml for:
- grafana-agent (collector: metrics→AMP via hub-eks-grafana-agent-amp IRSA, logs→Loki, traces→Tempo), loki, tempo, kube-state-metrics, opencost, grafana-operator. Loki/Tempo use gp3 PVCs (no buckets); each binds its hub-eks-<addon> IRSA role.

Bootstrap deps the hub keeps: values-hub.yaml for cert-manager, external-secrets (the AMG-token ExternalSecret chain), metrics-server, prometheus-operator-crds, reloader; overlays/hub for storage-classes (gp3 for the Loki/Tempo PVCs), priority-classes, portal-reader.

Exclusions: environment NotIn [hub] on the workload appsets the hub does not run: networking, security, operations(+kustomize), ai-platform, argo-platform, apps-tenants, druid-tenants, kyverno-policies. (No effect on dev/staging/prod; accelerators + agent-operator are already gated by the eks-agent-platform/enabled label the hub lacks.)

Depends on landing-zone#57 (AMP/AMG + the hub-eks-* IRSA roles) and a deploy-time step: create the AMG service-account token as Secrets-Manager key eks-grafana-token.

yamllint + kustomize build of the new overlays green.
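The exclusion above can be sketched as a cluster-generator selector. This is a minimal fragment and not the repository's actual appset: it assumes the appsets are Argo CD ApplicationSets using the cluster generator, and the resource name and namespace are hypothetical. Only the `environment` label key and the `NotIn [hub]` expression come from the commit message.

```yaml
# Sketch only: assumes Argo CD ApplicationSets with the cluster generator.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: addons-networking        # hypothetical name
  namespace: argocd              # hypothetical namespace
spec:
  generators:
    - clusters:
        selector:
          matchExpressions:
            # Skip any cluster registered with environment=hub.
            - key: environment
              operator: NotIn
              values:
                - hub
  # template: omitted; unchanged by the exclusion
```

Under standard Kubernetes label-selector semantics, `NotIn` also matches clusters that carry no `environment` label at all, so the expression only removes clusters whose value is exactly `hub`. That is consistent with the note that dev/staging/prod are unaffected.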
…ceholder + doc

Quality-check follow-ups on the hub env:
- .github/workflows/ci.yml + Taskfile.yaml: add 'hub' to the validate matrix / pr-summary / help text so the hub overlays are kustomize-built in CI (they were silently skipped because the build globs */overlays/<matrix.env>).
- grafana-agent/values-hub.yaml: AMP workspace placeholder ws-PLACEHOLDER (match the dev/staging/prod siblings) instead of ws-PLACEHOLDER-hub.
- docs/configuration/environments.md: add the hub row to the environment table.
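To illustrate why the hub overlays were skipped, here is a rough sketch of a matrix-driven validate job. It is not the repository's ci.yml: the job layout and step names are assumptions, and it presumes `kustomize` is already on the runner's PATH. Only the environment names and the `*/overlays/<matrix.env>` glob come from the commit message.

```yaml
# Sketch only: job and step names are hypothetical.
name: ci
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        env: [dev, staging, prod, hub]   # without 'hub', no hub overlay is ever built
    steps:
      - uses: actions/checkout@v4
      - name: Build overlays for this environment
        run: |
          # The glob only expands for environments listed in the matrix.
          for dir in */overlays/${{ matrix.env }}; do
            [ -d "$dir" ] || continue
            kustomize build "$dir" > /dev/null
          done
```

Because the loop is keyed on the matrix value, an environment missing from the matrix produces no failure and no build, which matches the silent skip described above.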
CI Results
All validations passed.
veleroEnabled/goldilocksEnabled/hubbleUiEnabled/trivyAdmissionEnabled and the karpenter* + loki/tempoRetentionDays keys are read by nothing: addon execution is gated by appset membership (the hub is excluded via NotIn [hub]), and Loki/Tempo retention is set authoritatively in their values-hub.yaml. They were keep-in-step copies from the workload template, defensive belt-and-suspenders on top of the real exclusion mechanism. Keep only the load-bearing identity keys.
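A trimmed env ConfigMap might then look roughly like the fragment below. This is a guess at the shape, not the file in this PR: the ConfigMap name and the `environment` key are assumptions, and the commit does not list which identity keys were kept. Only `cluster_name` with the value `hub-eks` appears in the source.

```yaml
# Sketch only: metadata.name and the 'environment' key are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-config           # hypothetical name
data:
  cluster_name: hub-eks          # identity key named in the PR
  environment: hub               # assumed identity key
  # Feature toggles (veleroEnabled, karpenter*, ...) and retention keys removed:
  # appset membership and values-hub.yaml are the sources of truth.
```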
What
Wires the eks-fleet hub into the observability fabric so it runs the full three-pillar stack (metrics → AMP, logs → Loki, traces → Tempo) + dashboards, closing #50. The hub registers as `environment=hub`; the appsets had no env filter, so a hub cluster matched every workload appset and ~13 would error on a missing `values-hub`. Curated per the agreed exclude + curate approach.

Pairs with landing-zone#57 (AMP/AMG + the `hub-eks-*` IRSA roles).

Hub env
- `environments/hub/cluster-config.yaml`: the env ConfigMap (`cluster_name: hub-eks`, Karpenter off, observability retention).
- `dashboards/overlays/hub/kustomization.yaml`: patches the Grafana CR + AMP datasource to the hub AMG/AMP endpoints.

Full observability (kept in `addons-observability`)
`values-hub.yaml` for `grafana-agent` (the collector: metrics→AMP via `hub-eks-grafana-agent-amp` IRSA, logs→Loki, traces→Tempo), `loki`, `tempo`, `kube-state-metrics`, `opencost`, `grafana-operator`. Loki/Tempo use gp3 PVCs (no buckets); each binds its `hub-eks-<addon>` role.

Bootstrap deps the hub keeps
`values-hub.yaml` for cert-manager, external-secrets (the AMG-token ExternalSecret chain), metrics-server, prometheus-operator-crds, reloader; `overlays/hub/` for storage-classes (gp3 for the Loki/Tempo PVCs), priority-classes, portal-reader.

Exclusions
`environment NotIn [hub]` on the 9 workload appsets the hub doesn't run (networking, security, operations ×2, ai-platform, argo-platform, apps-tenants, druid-tenants, kyverno-policies). No effect on dev/staging/prod (none are `hub`); accelerators + agent-operator are already gated by the `eks-agent-platform/enabled` label the hub lacks.

Deploy-time prerequisite
After the AMG workspace exists (landing-zone#57):
`aws grafana create-workspace-service-account-token` → store as Secrets-Manager key `eks-grafana-token` (JSON, key `token`). The grafana-operator's ExternalSecret reads it to push dashboards.

Validation & quality-check
`yamllint` + `kustomize build` of all 4 new overlays green; CI matrix extended to build the hub overlays. Independently quality-checked (Systems A / Architecture A / Security A− / Code-quality A; Patterns A− / Docs B+ / Consistency B+) and verified non-hollow: every `hub-eks-*` IRSA name matches what landing-zone mints, every kept appset's addons have a hub variant (no `missingkey` breaks), the exclusions don't touch existing envs, and all three pillars are present. The check's findings (CI matrix, placeholder alignment, the env-doc) are fixed in this PR.

Once merged, the fleet-vend dashboard (#53) is unblocked.
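For reviewers unfamiliar with the token chain in the deploy-time prerequisite, the ExternalSecret side could look something like this. It is a sketch under assumptions, not the manifest in this PR: the resource names, namespace, refresh interval, and the ClusterSecretStore are hypothetical. Only the Secrets-Manager key `eks-grafana-token` and its JSON field `token` come from the description above.

```yaml
# Sketch only: names, namespace, and the secret store are hypothetical.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-api-token          # hypothetical name
  namespace: grafana-operator      # hypothetical namespace
spec:
  refreshInterval: 1h              # assumed; pick to suit token rotation
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager      # hypothetical store backed by AWS Secrets Manager
  target:
    name: grafana-api-token        # Kubernetes Secret the operator would consume
  data:
    - secretKey: token             # key written into the Kubernetes Secret
      remoteRef:
        key: eks-grafana-token     # Secrets-Manager key from the prerequisite
        property: token            # JSON field that holds the token value
```

Until the Secrets-Manager entry exists, a resource like this cannot sync, which is why the token has to be created after landing-zone#57 provisions the AMG workspace and before dashboards can be pushed.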