fix(dashboards): de-hollow agent-* panels via KSM + deepen CR-state coverage to the full operator surface by stxkxs · Pull Request #55 · nanohype/eks-gitops

stxkxs · 2026-06-23T21:48:59Z

What

Two related changes, both about making the agent-* persona dashboards read real series instead of inventing metrics nothing emits.

1. De-hollow the agent-* persona panels (the original scope)

The agent-founder / agent-finance / agent-ops / agent-agentgateway dashboards had panels querying metrics that no exporter produces. Rewrote them to read kube_customresource_* (the operator's CRD status, projected by kube-state-metrics) and corrected the agentgateway metric names to the documented agentgateway_llm_* family.

2. Deepen the kube-state-metrics CR-state config to the full CRD surface

The de-hollowed panels read kube_customresource_*, so those series have to exist. The customResourceState config projected only 5 of the 9 operator CRDs, with conditions on just one. This closes the gap:

Conditions sweep — condition_type + condition_status on every CRD that carries status.conditions (Platform, BudgetPolicy, AgentFleet, EvalSuite + the four new ones). phase=Ready can mask a degraded reconcile; conditions are the real health truth.
Four dark CRDs now observed — ModelGateway, AgentSandbox, SandboxPool, BatchJob — each projecting phase (StateSet), conditions, and its load-bearing gauges (readyWorkers, failedCount/succeededCount/recordCount, completedAt, podPhase, observedGeneration).
Gauge deepening — Tenant fleet-size counts + lastReconciled; BudgetPolicy percentOfBudget; AgentFleet/Platform/ModelGateway observedGeneration; EvalSuite passThreshold/lastRunAt.
RBAC — KSM granted list/watch on the five agents.nanohype.dev resources (else the new blocks emit nothing).

Verification

Every projected status/spec path verified against the operator CRD schemas — no hollow metrics.
KSM parses the whole customResourceState as one unit (one malformed block breaks all kube_customresource_*), so the config was structurally validated end-to-end + yamllint clean.
readyWorkers/count gauges use nilIsZero so under-provision alerts fire on an unpopulated status rather than silently vanishing.

Delivery note

Dashboards ship as GrafanaDashboard CRs (grafana-operator, both prod and kx). The KSM customResourceState here is the single source — the duplicate copy in the operator chart was removed in eks-agent-platform#48.

…ent-platform#47) The agent-* persona dashboards queried agents_* metrics the operator never registers, so ~a third of their panels rendered no-data. Most of that data IS available as CR status — this rewrites those panels to the kube_customresource_* metrics kube-state-metrics emits, and extends the customResourceState config for the few fields it didn't yet project. KSM customResourceState (addons/observability/kube-state-metrics/values.yaml): - New status_field gauges: BudgetPolicy.status.currentSpendUsd, .killSwitchFiredAt, BudgetPolicy.spec.monthlyUsd (the threshold), Tenant.status.aggregateSpendUsd. Structurally identical to the proven entries (string gauges, nilIsZero). - Add resource-level labelsFromPath name/namespace to every resource block, so kube_customresource_* carries the `name` label the dashboards group by — this also hardens the operator's existing PrometheusRule alerts, which already depend on {{ $labels.name }} but had no config emitting it. Dashboard rewrites (agent-{founder,ops,finance}): - agents_platform_status_phase{Ready} -> kube_customresource_status_phase Platform Ready - agents_eval_run_score -> EvalSuite lastScore (already emitted) - agents_agent_runtime_replicas -> AgentFleet readyAgents (already emitted) - agents_spend_report_current_usd -> BudgetPolicy currentSpendUsd - agents_budget_policy_threshold_usd -> BudgetPolicy monthlyUsd (by name) Panel titles updated to match the real semantics (e.g. "Spend month-to-date", "Latest EvalSuite scores", "Ready agents"). Deferred (genuinely runtime/data-plane, not CR-projectable; tracked in #47): agentgateway_* (agent-agentgateway/agent-ops, incl. the invocation_total vs invocations_total name split) and agents_agent_invocations_total (agent-founder). Quality-checked (Systems A-/Code A/Consistency A-): CRD fields verified to exist, KSM string-gauge parsing confirmed, blast-radius config kept structurally identical. yamllint + kustomize build green.

github-actions · 2026-06-23T21:49:29Z

CI Results

Check	Status
YAML Lint	✅

Environment	Kustomize Build
dev	✅
staging	✅
production	✅

All validations passed.

…ted ones The agentgateway panels referenced agentgateway_invocation_total / agentgateway_invocations_total / agentgateway_invocation_duration_seconds — none of which agentgateway emits (verified against agentgateway.dev/docs). The real metrics are agentgateway_llm_requests_total and agentgateway_llm_request_duration_seconds (port 15020). Fixes the names (and eliminates the singular/plural split between agent-agentgateway and agent-ops). The per-label drill-downs (platform / model_id / status / route filters) still assume a label model that doesn't match agentgateway's OTel gen_ai_* conventions — those, plus the scrape annotation (port 15020), are tuned at first scrape against a live gateway (recipe in eks-agent-platform#47). Names being correct now reduces that work to a label pass. JSON valid; yamllint clean.

github-actions · 2026-06-24T00:34:50Z

CI Results

Check	Status
YAML Lint	✅

Environment	Kustomize Build
dev	✅
staging	✅
production	✅

All validations passed.

… operator surface The customResourceState config projected 5 of the 9 operator CRDs, and conditions on only one (Tenant). Every agent-* dashboard de-hollowed in this branch reads kube_customresource_* — so the metrics had to actually exist for those panels to render. This closes the gap to the full CRD surface. ─────────────────────── Conditions sweep ─────────────────────── Added the conditions block (condition_type + condition_status labels, value = status) to every CRD that carries status.conditions: Platform, BudgetPolicy, AgentFleet, EvalSuite, plus the four newly-added CRDs. phase=Ready can mask a degraded reconcile; conditions are the controller's real health truth, so "<Kind> not Ready" alerts now have a series to fire on for every resource. ─────────────────────── Four dark CRDs ───────────────────────── ModelGateway, AgentSandbox, SandboxPool, BatchJob were entirely unobserved. Each now projects phase (StateSet), conditions, and its load-bearing gauges: - ModelGateway observedGeneration - AgentSandbox podPhase (StateSet), completedAt - SandboxPool readyWorkers (nilIsZero — under-provision alerts fire on an unpopulated status instead of silently vanishing) - BatchJob failedCount, succeededCount, recordCount ─────────────────────── Gauge deepening ──────────────────────── - Tenant platformCount, readyPlatformCount, suspendedPlatformCount, lastReconciled (fleet-size denominator + reconcile staleness) - BudgetPolicy percentOfBudget, conditions (cap-unenforced-if-stale) - AgentFleet observedGeneration (unapplied spec change) - EvalSuite passThreshold, lastRunAt - Platform observedGeneration RBAC: granted KSM list/watch on agentfleets, agentsandboxes, sandboxpools, modelgateways, batchjobs (agents.nanohype.dev) — required or the new resource blocks would silently emit nothing. Every projected path verified against the operator CRD schemas; phase fields are free strings so the StateSet enum lists are best-effort. KSM parses the whole customResourceState as one unit, so the config was validated for structural correctness (one malformed block breaks all kube_customresource_*).

github-actions · 2026-06-24T00:44:50Z

CI Results

Check	Status
YAML Lint	✅

Environment	Kustomize Build
dev	✅
staging	✅
production	✅

All validations passed.

stxkxs mentioned this pull request Jun 23, 2026

refactor(charts): remove the dead customResourceState duplicate (single source = eks-gitops) nanohype/eks-agent-platform#48

Merged

stxkxs changed the title ~~fix(dashboards): de-hollow the agent-* persona panels via KSM~~ fix(dashboards): de-hollow agent-* panels via KSM + deepen CR-state coverage to the full operator surface Jun 24, 2026

stxkxs merged commit eeaf7dd into main Jun 24, 2026
7 checks passed

stxkxs deleted the dehollow-agent-dashboards branch June 24, 2026 00:47

This was referenced Jun 24, 2026

Extend KSM CR-state with eks-fleet Cluster XR + provider-opentofu Workspace (per-cluster vend inventory) #56

Open

Enable operator SLO alerting in production (provision the PagerDuty + Slack Secrets) #33

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(dashboards): de-hollow agent-* panels via KSM + deepen CR-state coverage to the full operator surface#55

fix(dashboards): de-hollow agent-* panels via KSM + deepen CR-state coverage to the full operator surface#55
stxkxs merged 3 commits into
mainfrom
dehollow-agent-dashboards

stxkxs commented Jun 23, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

stxkxs commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

1. De-hollow the agent-* persona panels (the original scope)

2. Deepen the kube-state-metrics CR-state config to the full CRD surface

Verification

Delivery note

Uh oh!

github-actions Bot commented Jun 23, 2026

CI Results

Uh oh!

github-actions Bot commented Jun 24, 2026

CI Results

Uh oh!

github-actions Bot commented Jun 24, 2026

CI Results

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

stxkxs commented Jun 23, 2026 •

edited

Loading