Problem
The eks-fleet hub (the cluster-vending management cluster) is not wired into the observability fabric, so nothing on the hub reaches Amazon Managed Prometheus / Grafana. This blocks the fleet-vend dashboard + alerts (authored, see the draft PR) from ever showing data — and means the hub's Crossplane/provider metrics are dark.
Evidence
- The hub registers its ArgoCD cluster Secret with
environment = "hub" (landing-zone/live/aws/fleet/us-west-2/hub/env.hcl:2).
- The observability + dashboards ApplicationSets resolve per-env files under
goTemplateOptions: ["missingkey=error"]:
applicationsets/dashboards.yaml:34 → path: overlays/{{ .metadata.labels.environment }}
applicationsets/addons-observability.yaml → values-{{ environment }}.yaml
- But eks-gitops ships only
dev / staging / production:
dashboards/overlays/ → dev, production, staging (no hub)
addons/observability/grafana-agent/ → values-{dev,staging,production}.yaml (no values-hub.yaml); same for grafana-operator/loki/tempo/kube-state-metrics
environments/ → dev, staging, production (no hub)
- The fleet landing-zone tree (
live/aws/fleet/.../hub/) provisions network/cluster/cluster-bootstrap/fleet-hub only — no AMP workspace, no AMG, no grafana-agent IRSA for the fleet account.
Net: with environment=hub and no matching env, the observability Applications error on the missing path → no grafana-agent (nothing scrapes), no grafana-operator (no GrafanaDashboard/GrafanaAlertRuleGroup reconciler), no AMP/AMG to write to or read from. This is the same class of gap flagged in applicationsets/clusters-appset.yaml:38-40.
Fix (spans eks-gitops + landing-zone)
eks-gitops:
- Add a
hub environment: environments/hub/cluster-config.yaml, dashboards/overlays/hub/, and values-hub.yaml for the observability addons (grafana-agent, grafana-operator, loki, tempo, kube-state-metrics) — minimally enough to run grafana-agent (→ the fleet AMP) + grafana-operator (→ the fleet AMG) on the hub. Decide which non-observability addons the hub should/shouldn't get (the hub runs Crossplane + ArgoCD + portal, not the full workload catalog).
landing-zone (fleet tree):
- Provision an AMP workspace + AMG workspace in the fleet account, and the grafana-agent IRSA role (
<env>-eks-grafana-agent-amp equivalent) + the AMG service role, wired into values-hub.yaml (AMP_REMOTE_WRITE_URL, the AMG endpoint patched into dashboards/base/grafana.yaml's overlay).
Then
Un-draft the fleet-vend dashboard + alerts — they query controller_runtime_*{namespace="crossplane-system"} from provider-opentofu (scrape annotation added in eks-fleet#), which only reaches AMP once the above is in place.
Follow-on (richer signal)
Once the hub is observability-wired, extend addons/observability/kube-state-metrics/values.yaml customResourceState with the eks-fleet Cluster XR (fleet.nanohype.dev) + provider-opentofu Workspace (opentofu.m.upbound.io) conditions, so the dashboard gains true per-cluster vend inventory/readiness + the Synced=False apply-error chase — not just provider reconcile RED. Validate the KSM config on the live hub (a malformed customResourceState entry breaks all kube_customresource_*). This is the difference between "provider reconcile-error rate as a vend-success proxy" and a real per-vend SLO.
Problem
The eks-fleet hub (the cluster-vending management cluster) is not wired into the observability fabric, so nothing on the hub reaches Amazon Managed Prometheus / Grafana. This blocks the
fleet-venddashboard + alerts (authored, see the draft PR) from ever showing data — and means the hub's Crossplane/provider metrics are dark.Evidence
environment = "hub"(landing-zone/live/aws/fleet/us-west-2/hub/env.hcl:2).goTemplateOptions: ["missingkey=error"]:applicationsets/dashboards.yaml:34→path: overlays/{{ .metadata.labels.environment }}applicationsets/addons-observability.yaml→values-{{ environment }}.yamldev/staging/production:dashboards/overlays/→ dev, production, staging (nohub)addons/observability/grafana-agent/→values-{dev,staging,production}.yaml(novalues-hub.yaml); same for grafana-operator/loki/tempo/kube-state-metricsenvironments/→ dev, staging, production (nohub)live/aws/fleet/.../hub/) provisions network/cluster/cluster-bootstrap/fleet-hub only — no AMP workspace, no AMG, no grafana-agent IRSA for the fleet account.Net: with
environment=huband no matching env, the observability Applications error on the missing path → no grafana-agent (nothing scrapes), no grafana-operator (no GrafanaDashboard/GrafanaAlertRuleGroup reconciler), no AMP/AMG to write to or read from. This is the same class of gap flagged inapplicationsets/clusters-appset.yaml:38-40.Fix (spans eks-gitops + landing-zone)
eks-gitops:
hubenvironment:environments/hub/cluster-config.yaml,dashboards/overlays/hub/, andvalues-hub.yamlfor the observability addons (grafana-agent, grafana-operator, loki, tempo, kube-state-metrics) — minimally enough to run grafana-agent (→ the fleet AMP) + grafana-operator (→ the fleet AMG) on the hub. Decide which non-observability addons the hub should/shouldn't get (the hub runs Crossplane + ArgoCD + portal, not the full workload catalog).landing-zone (fleet tree):
<env>-eks-grafana-agent-ampequivalent) + the AMG service role, wired intovalues-hub.yaml(AMP_REMOTE_WRITE_URL, the AMG endpoint patched intodashboards/base/grafana.yaml's overlay).Then
Un-draft the
fleet-venddashboard + alerts — they querycontroller_runtime_*{namespace="crossplane-system"}from provider-opentofu (scrape annotation added in eks-fleet#), which only reaches AMP once the above is in place.Follow-on (richer signal)
Once the hub is observability-wired, extend
addons/observability/kube-state-metrics/values.yamlcustomResourceState with the eks-fleetClusterXR (fleet.nanohype.dev) + provider-opentofuWorkspace(opentofu.m.upbound.io) conditions, so the dashboard gains true per-cluster vend inventory/readiness + the Synced=False apply-error chase — not just provider reconcile RED. Validate the KSM config on the live hub (a malformed customResourceState entry breaks allkube_customresource_*). This is the difference between "provider reconcile-error rate as a vend-success proxy" and a real per-vend SLO.