Follow-on from #50 (hub now observability-wired) and the CR-state deepening in #55.
What
The fleet-vend dashboard currently observes the provider's aggregate reconcile
RED (controller_runtime_*{namespace="crossplane-system"}) — a vend-success proxy.
The richer signal is per-cluster vend inventory + readiness, which needs the
kube-state-metrics customResourceState config (addons/observability/kube-state-metrics/values.yaml)
extended with the fleet CRDs:
Cluster XR (fleet.nanohype.dev / the eks-fleet XRD) — phase + conditions, so
the dashboard shows how many clusters exist and which are Ready vs failing.
- provider-opentofu
Workspace (opentofu.m.upbound.io/v1beta1) — the Synced
condition is the apply-error chase (a failing tofu plan/apply surfaces here).
This is the difference between "provider reconcile-error rate as a proxy" and a real
per-vend SLO with a Synced=False apply-error drill-down.
Why it was deferred from #55
These are hub-only CRDs (the vend pipeline runs only on the hub), so they belong
in a values-hub.yaml customResourceState delta, not the base — and the base/delta
list-merge for customResourceState needs validating on a live hub. KSM parses the
whole config as one unit: a malformed entry breaks all kube_customresource_*.
Cardinality caveat (load-bearing)
Do not project Workspace.status.conditions[].reason / .message as raw metric
labels — they're unbounded (per-apply tofu error text) and will blow up series
cardinality. Project only condition_type + condition_status (bounded), same as
the existing condition blocks. The error text belongs in the dashboard's drill-down
(a logs/Workspace panel), not in a metric label.
Acceptance
values-hub.yaml customResourceState delta adds Cluster XR + Workspace blocks
with bounded labels only.
- RBAC grants KSM list/watch on
fleet.nanohype.dev clusters + opentofu.m.upbound.io workspaces.
- Validated on a live hub:
kube_customresource_* series still present for all
existing CRDs (no blast-radius regression), new per-cluster series appear.
- fleet-vend dashboard gains a per-cluster inventory/readiness row.
Follow-on from #50 (hub now observability-wired) and the CR-state deepening in #55.
What
The fleet-vend dashboard currently observes the provider's aggregate reconcile
RED (
controller_runtime_*{namespace="crossplane-system"}) — a vend-success proxy.The richer signal is per-cluster vend inventory + readiness, which needs the
kube-state-metrics
customResourceStateconfig (addons/observability/kube-state-metrics/values.yaml)extended with the fleet CRDs:
ClusterXR (fleet.nanohype.dev/ the eks-fleet XRD) — phase + conditions, sothe dashboard shows how many clusters exist and which are Ready vs failing.
Workspace(opentofu.m.upbound.io/v1beta1) — the Syncedcondition is the apply-error chase (a failing tofu plan/apply surfaces here).
This is the difference between "provider reconcile-error rate as a proxy" and a real
per-vend SLO with a Synced=False apply-error drill-down.
Why it was deferred from #55
These are hub-only CRDs (the vend pipeline runs only on the hub), so they belong
in a
values-hub.yamlcustomResourceState delta, not the base — and the base/deltalist-merge for customResourceState needs validating on a live hub. KSM parses the
whole config as one unit: a malformed entry breaks all
kube_customresource_*.Cardinality caveat (load-bearing)
Do not project
Workspace.status.conditions[].reason/.messageas raw metriclabels — they're unbounded (per-apply tofu error text) and will blow up series
cardinality. Project only
condition_type+condition_status(bounded), same asthe existing condition blocks. The error text belongs in the dashboard's drill-down
(a logs/Workspace panel), not in a metric label.
Acceptance
values-hub.yamlcustomResourceState delta adds Cluster XR + Workspace blockswith bounded labels only.
fleet.nanohype.devclusters +opentofu.m.upbound.ioworkspaces.kube_customresource_*series still present for allexisting CRDs (no blast-radius regression), new per-cluster series appear.