
Wire the eks-fleet hub into the observability fabric (blocks the fleet-vend dashboard) #50

Description

@stxkxs

Problem

The eks-fleet hub (the cluster-vending management cluster) is not wired into the observability fabric, so nothing on the hub reaches Amazon Managed Service for Prometheus (AMP) or Amazon Managed Grafana (AMG). This blocks the fleet-vend dashboard + alerts (already authored, see the draft PR) from ever showing data, and it means the hub's Crossplane/provider metrics are not collected anywhere.

Evidence

  • The hub registers its ArgoCD cluster Secret with environment = "hub" (landing-zone/live/aws/fleet/us-west-2/hub/env.hcl:2).
  • The observability + dashboards ApplicationSets resolve per-env files under goTemplateOptions: ["missingkey=error"]:
    • applicationsets/dashboards.yaml:34 → path: overlays/{{ .metadata.labels.environment }}
    • applicationsets/addons-observability.yaml → values-{{ environment }}.yaml
  • But eks-gitops ships only dev / staging / production:
    • dashboards/overlays/ → dev, production, staging (no hub)
    • addons/observability/grafana-agent/values-{dev,staging,production}.yaml (no values-hub.yaml); same for grafana-operator/loki/tempo/kube-state-metrics
    • environments/ → dev, staging, production (no hub)
  • The fleet landing-zone tree (live/aws/fleet/.../hub/) provisions network/cluster/cluster-bootstrap/fleet-hub only — no AMP workspace, no AMG, no grafana-agent IRSA for the fleet account.

Net: with environment=hub and no matching env, the observability Applications error on the missing path → no grafana-agent (nothing scrapes), no grafana-operator (no GrafanaDashboard/GrafanaAlertRuleGroup reconciler), no AMP/AMG to write to or read from. This is the same class of gap flagged in applicationsets/clusters-appset.yaml:38-40.
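
A quick way to reproduce the gap from a checkout of eks-gitops is a small path check. This is a minimal sketch, assuming the layout is exactly as listed under Evidence; the function names (`expected_env_paths`, `missing_env_paths`) are made up for illustration and nothing here is part of the repo.

```python
from pathlib import Path

# Addon names as listed in this issue; adjust if the repo layout differs.
OBSERVABILITY_ADDONS = [
    "grafana-agent",
    "grafana-operator",
    "loki",
    "tempo",
    "kube-state-metrics",
]


def expected_env_paths(env):
    """Per-environment paths the ApplicationSets are assumed to resolve."""
    paths = [f"environments/{env}", f"dashboards/overlays/{env}"]
    paths += [
        f"addons/observability/{addon}/values-{env}.yaml"
        for addon in OBSERVABILITY_ADDONS
    ]
    return paths


def missing_env_paths(repo_root, env):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in expected_env_paths(env) if not (root / p).exists()]


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for env in ("dev", "staging", "production", "hub"):
        missing = missing_env_paths(root, env)
        status = "ok" if not missing else "MISSING: " + ", ".join(missing)
        print(f"{env}: {status}")
```

If the evidence above holds, running this against the repo root should report every path for hub as missing while dev, staging and production come back clean.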

Fix (spans eks-gitops + landing-zone)

eks-gitops:

  • Add a hub environment: environments/hub/cluster-config.yaml, dashboards/overlays/hub/, and values-hub.yaml for the observability addons (grafana-agent, grafana-operator, loki, tempo, kube-state-metrics) — minimally enough to run grafana-agent (→ the fleet AMP) + grafana-operator (→ the fleet AMG) on the hub. Decide which non-observability addons the hub should/shouldn't get (the hub runs Crossplane + ArgoCD + portal, not the full workload catalog).

landing-zone (fleet tree):

  • Provision an AMP workspace + AMG workspace in the fleet account, and the grafana-agent IRSA role (<env>-eks-grafana-agent-amp equivalent) + the AMG service role, wired into values-hub.yaml (AMP_REMOTE_WRITE_URL, the AMG endpoint patched into dashboards/base/grafana.yaml's overlay).
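
For reference, the value that ends up in AMP_REMOTE_WRITE_URL follows the standard AMP endpoint shape. A minimal sketch, assuming the workspace lives in us-west-2 like the rest of the fleet tree; the helper name and the workspace ID below are placeholders, not real values.

```python
def amp_remote_write_url(region, workspace_id):
    """Build the AMP remote_write endpoint for a workspace.

    Both arguments are supplied by the caller; nothing here looks up
    a real workspace.
    """
    if not region or not workspace_id:
        raise ValueError("region and workspace_id are required")
    return (
        f"https://aps-workspaces.{region}.amazonaws.com"
        f"/workspaces/{workspace_id}/api/v1/remote_write"
    )


if __name__ == "__main__":
    # "ws-EXAMPLE" is a placeholder; use the ID of the provisioned workspace.
    print(amp_remote_write_url("us-west-2", "ws-EXAMPLE"))
```

In practice the landing-zone module would more likely output this URL directly; the sketch only shows what values-hub.yaml is expected to receive.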

Then

Un-draft the fleet-vend dashboard + alerts — they query controller_runtime_*{namespace="crossplane-system"} from provider-opentofu (scrape annotation added in eks-fleet#), which only reaches AMP once the above is in place.

Follow-on (richer signal)

Once the hub is observability-wired, extend addons/observability/kube-state-metrics/values.yaml customResourceState with the eks-fleet Cluster XR (fleet.nanohype.dev) + provider-opentofu Workspace (opentofu.m.upbound.io) conditions, so the dashboard gains true per-cluster vend inventory/readiness + the Synced=False apply-error chase — not just provider reconcile RED. Validate the KSM config on the live hub (a malformed customResourceState entry breaks all kube_customresource_*). This is the difference between "provider reconcile-error rate as a vend-success proxy" and a real per-vend SLO.
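
One possible way to do the live validation is to compare the kube_customresource_* families that KSM exposes before and after the config change. A minimal sketch, assuming the KSM /metrics output has been saved to a string (for example via a port-forward); the function name and the sample metric names are hypothetical.

```python
import re

# A sample line starts with the metric name, optionally followed by labels.
_METRIC_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def custom_resource_metric_names(exposition_text, prefix="kube_customresource_"):
    """Collect metric names with the given prefix from Prometheus text output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _METRIC_NAME.match(line)
        if match and match.group(1).startswith(prefix):
            names.add(match.group(1))
    return names


if __name__ == "__main__":
    # Hypothetical sample; real metric names depend on the customResourceState config.
    sample = "\n".join([
        "# HELP kube_pod_info Information about pod.",
        'kube_pod_info{namespace="crossplane-system",pod="example"} 1',
        'kube_customresource_example_ready{name="demo"} 1',
    ])
    print(sorted(custom_resource_metric_names(sample)))
```

If the set was non-empty before the change and is empty afterwards, that points at a malformed customResourceState entry rather than a single missing metric.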
