Extend KSM CR-state with eks-fleet Cluster XR + provider-opentofu Workspace (per-cluster vend inventory) #56

Description

@stxkxs

Follow-on from #50 (hub now observability-wired) and the CR-state deepening in #55.

What

The fleet-vend dashboard currently observes only the provider's aggregate reconcile
RED metrics (controller_runtime_*{namespace="crossplane-system"}), which are a proxy
for vend success. The richer signal is per-cluster vend inventory and readiness, which
requires extending the kube-state-metrics customResourceState config
(addons/observability/kube-state-metrics/values.yaml) with the fleet CRDs:

  • Cluster XR (fleet.nanohype.dev / the eks-fleet XRD): phase + conditions, so
    the dashboard shows how many clusters exist and which are Ready vs failing.
  • provider-opentofu Workspace (opentofu.m.upbound.io/v1beta1): the Synced
    condition is where apply errors surface (a failing tofu plan/apply shows up
    here as Synced=False).

This is the difference between "provider reconcile-error rate as a proxy" and a real
per-vend SLO with a Synced=False apply-error drill-down.
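
A minimal sketch of what the Cluster XR half of such a block might look like, assuming the upstream kube-state-metrics Helm chart's customResourceState.config key. The API version v1alpha1, the status.phase path, and the metric names fleet_cluster_info / fleet_cluster_condition are illustrative assumptions, not taken from the eks-fleet XRD:

```yaml
# Sketch only: version, status.phase path, and metric names are assumptions.
customResourceState:
  enabled: true
  config:
    kind: CustomResourceStateMetrics
    spec:
      resources:
        - groupVersionKind:
            group: fleet.nanohype.dev
            version: v1alpha1          # assumption: use the XRD's served version
            kind: Cluster
          labelsFromPath:
            name: [metadata, name]
          metrics:
            # One series per Cluster XR; counting these gives the inventory.
            - name: fleet_cluster_info
              help: "Cluster XR inventory with current phase"
              each:
                type: Info
                info:
                  labelsFromPath:
                    phase: [status, phase]   # assumption: phase lives at status.phase
            # One series per condition; value is 1 when status is "True", else 0.
            - name: fleet_cluster_condition
              help: "Cluster XR conditions"
              each:
                type: Gauge
                gauge:
                  path: [status, conditions]
                  labelsFromPath:
                    condition_type: [type]
                    condition_status: [status]
                  valueFrom: [status]
```

If the chart is consumed as a subchart, these keys would sit one level deeper under the subchart's name. With KSM's default metric name prefix, the series would be expected to appear as kube_customresource_fleet_cluster_info and kube_customresource_fleet_cluster_condition.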

Why it was deferred from #55

These are hub-only CRDs (the vend pipeline runs only on the hub), so they belong
in a values-hub.yaml customResourceState delta, not the base — and the base/delta
list-merge for customResourceState needs validating on a live hub. KSM parses the
whole config as one unit: a malformed entry breaks all kube_customresource_*.
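
One reason the merge needs checking: if values-hub.yaml is layered over the base as an ordinary Helm values file (an assumption about how this repo composes values), Helm deep-merges maps but replaces lists wholesale, so a hub resources list would override the base list rather than append to it. A sketch with placeholder entries (example.io / BaseThing are hypothetical):

```yaml
# base values.yaml (placeholder entry, not a real CRD in this repo)
customResourceState:
  config:
    spec:
      resources:
        - groupVersionKind: {group: example.io, version: v1, kind: BaseThing}
---
# values-hub.yaml layered on top of the base
customResourceState:
  config:
    spec:
      resources:
        - groupVersionKind: {group: opentofu.m.upbound.io, version: v1beta1, kind: Workspace}
---
# Effective result under plain Helm layering: only the hub list survives,
# so the BaseThing entry (and its kube_customresource_* series) would be lost.
customResourceState:
  config:
    spec:
      resources:
        - groupVersionKind: {group: opentofu.m.upbound.io, version: v1beta1, kind: Workspace}
```

If that is how the values are composed, the hub delta would need to restate the base entries, or the lists would need to be merged by tooling outside Helm.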

Cardinality caveat (load-bearing)

Do not project Workspace.status.conditions[].reason / .message as raw metric
labels — they're unbounded (per-apply tofu error text) and will blow up series
cardinality. Project only condition_type + condition_status (bounded), same as
the existing condition blocks. The error text belongs in the dashboard's drill-down
(a logs/Workspace panel), not in a metric label.
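
As a sketch of the bounded projection described above, a Workspace entry under spec.resources might look like the following. The metric name opentofu_workspace_condition is an assumption, and the namespace label assumes the Workspace is a namespaced resource:

```yaml
# Sketch only: one entry for the customResourceState spec.resources list.
- groupVersionKind:
    group: opentofu.m.upbound.io
    version: v1beta1
    kind: Workspace
  labelsFromPath:
    name: [metadata, name]
    namespace: [metadata, namespace]   # assumption: Workspace is namespaced
  metrics:
    - name: opentofu_workspace_condition   # assumption: illustrative name
      help: "provider-opentofu Workspace conditions"
      each:
        type: Gauge
        gauge:
          path: [status, conditions]
          labelsFromPath:
            condition_type: [type]        # bounded
            condition_status: [status]    # bounded: True / False / Unknown
            # Deliberately NOT projected (unbounded per-apply tofu error text):
            # reason: [reason]
            # message: [message]
          valueFrom: [status]             # "True" maps to 1, "False"/"Unknown" to 0
```

A Synced=False Workspace would then show up as a series with condition_type="Synced" and value 0, and the dashboard can link from that series to the Workspace itself for the error text.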

Acceptance

  • values-hub.yaml customResourceState delta adds Cluster XR + Workspace blocks
    with bounded labels only.
  • RBAC grants KSM list/watch on fleet.nanohype.dev clusters + opentofu.m.upbound.io workspaces.
  • Validated on a live hub: kube_customresource_* series still present for all
    existing CRDs (no blast-radius regression), new per-cluster series appear.
  • fleet-vend dashboard gains a per-cluster inventory/readiness row.
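
For the RBAC item, a minimal sketch assuming the upstream kube-state-metrics chart's rbac.extraRules key (the repo may wire RBAC differently):

```yaml
# Sketch only: grants KSM read access to the two hub-only CRDs.
rbac:
  extraRules:
    - apiGroups: ["fleet.nanohype.dev"]
      resources: ["clusters"]
      verbs: ["list", "watch"]
    - apiGroups: ["opentofu.m.upbound.io"]
      resources: ["workspaces"]
      verbs: ["list", "watch"]
```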
