Skip to content

fix(dashboards): de-hollow agent-* panels via KSM + deepen CR-state coverage to the full operator surface#55

Merged
stxkxs merged 3 commits into
mainfrom
dehollow-agent-dashboards
Jun 24, 2026
Merged

fix(dashboards): de-hollow agent-* panels via KSM + deepen CR-state coverage to the full operator surface#55
stxkxs merged 3 commits into
mainfrom
dehollow-agent-dashboards

Conversation

@stxkxs

@stxkxs stxkxs commented Jun 23, 2026

Copy link
Copy Markdown
Member

What

Two related changes, both about making the agent-* persona dashboards read real series instead of inventing metrics nothing emits.

1. De-hollow the agent-* persona panels (the original scope)

The agent-founder / agent-finance / agent-ops / agent-agentgateway dashboards had panels querying metrics that no exporter produces. Rewrote them to read kube_customresource_* (the operator's CRD status, projected by kube-state-metrics) and corrected the agentgateway metric names to the documented agentgateway_llm_* family.

2. Deepen the kube-state-metrics CR-state config to the full CRD surface

The de-hollowed panels read kube_customresource_*, so those series have to exist. The customResourceState config projected only 5 of the 9 operator CRDs, with conditions on just one. This closes the gap:

  • Conditions sweepcondition_type + condition_status on every CRD that carries status.conditions (Platform, BudgetPolicy, AgentFleet, EvalSuite + the four new ones). phase=Ready can mask a degraded reconcile; conditions are the real health truth.
  • Four dark CRDs now observed — ModelGateway, AgentSandbox, SandboxPool, BatchJob — each projecting phase (StateSet), conditions, and its load-bearing gauges (readyWorkers, failedCount/succeededCount/recordCount, completedAt, podPhase, observedGeneration).
  • Gauge deepening — Tenant fleet-size counts + lastReconciled; BudgetPolicy percentOfBudget; AgentFleet/Platform/ModelGateway observedGeneration; EvalSuite passThreshold/lastRunAt.
  • RBAC — KSM granted list/watch on the five agents.nanohype.dev resources (else the new blocks emit nothing).

Verification

  • Every projected status/spec path verified against the operator CRD schemas — no hollow metrics.
  • KSM parses the whole customResourceState as one unit (one malformed block breaks all kube_customresource_*), so the config was structurally validated end-to-end + yamllint clean.
  • readyWorkers/count gauges use nilIsZero so under-provision alerts fire on an unpopulated status rather than silently vanishing.

Delivery note

Dashboards ship as GrafanaDashboard CRs (grafana-operator, both prod and kx). The KSM customResourceState here is the single source — the duplicate copy in the operator chart was removed in eks-agent-platform#48.

…ent-platform#47)

The agent-* persona dashboards queried agents_* metrics the operator never
registers, so ~a third of their panels rendered no-data. Most of that data IS
available as CR status — this rewrites those panels to the kube_customresource_*
metrics kube-state-metrics emits, and extends the customResourceState config for
the few fields it didn't yet project.

KSM customResourceState (addons/observability/kube-state-metrics/values.yaml):
- New status_field gauges: BudgetPolicy.status.currentSpendUsd, .killSwitchFiredAt,
  BudgetPolicy.spec.monthlyUsd (the threshold), Tenant.status.aggregateSpendUsd.
  Structurally identical to the proven entries (string gauges, nilIsZero).
- Add resource-level labelsFromPath name/namespace to every resource block, so
  kube_customresource_* carries the `name` label the dashboards group by — this
  also hardens the operator's existing PrometheusRule alerts, which already
  depend on {{ $labels.name }} but had no config emitting it.

Dashboard rewrites (agent-{founder,ops,finance}):
- agents_platform_status_phase{Ready} -> kube_customresource_status_phase Platform Ready
- agents_eval_run_score -> EvalSuite lastScore (already emitted)
- agents_agent_runtime_replicas -> AgentFleet readyAgents (already emitted)
- agents_spend_report_current_usd -> BudgetPolicy currentSpendUsd
- agents_budget_policy_threshold_usd -> BudgetPolicy monthlyUsd (by name)
Panel titles updated to match the real semantics (e.g. "Spend month-to-date",
"Latest EvalSuite scores", "Ready agents").

Deferred (genuinely runtime/data-plane, not CR-projectable; tracked in #47):
agentgateway_* (agent-agentgateway/agent-ops, incl. the invocation_total vs
invocations_total name split) and agents_agent_invocations_total (agent-founder).

Quality-checked (Systems A-/Code A/Consistency A-): CRD fields verified to exist,
KSM string-gauge parsing confirmed, blast-radius config kept structurally identical.
yamllint + kustomize build green.
@github-actions

Copy link
Copy Markdown

CI Results

Check Status
YAML Lint
Environment Kustomize Build
dev
staging
production

All validations passed.

…ted ones

The agentgateway panels referenced agentgateway_invocation_total /
agentgateway_invocations_total / agentgateway_invocation_duration_seconds — none
of which agentgateway emits (verified against agentgateway.dev/docs). The real
metrics are agentgateway_llm_requests_total and agentgateway_llm_request_duration_seconds
(port 15020). Fixes the names (and eliminates the singular/plural split between
agent-agentgateway and agent-ops).

The per-label drill-downs (platform / model_id / status / route filters) still
assume a label model that doesn't match agentgateway's OTel gen_ai_* conventions —
those, plus the scrape annotation (port 15020), are tuned at first scrape against
a live gateway (recipe in eks-agent-platform#47). Names being correct now reduces
that work to a label pass. JSON valid; yamllint clean.
@github-actions

Copy link
Copy Markdown

CI Results

Check Status
YAML Lint
Environment Kustomize Build
dev
staging
production

All validations passed.

… operator surface

The customResourceState config projected 5 of the 9 operator CRDs, and
conditions on only one (Tenant). Every agent-* dashboard de-hollowed in this
branch reads kube_customresource_* — so the metrics had to actually exist for
those panels to render. This closes the gap to the full CRD surface.

─────────────────────── Conditions sweep ───────────────────────
Added the conditions block (condition_type + condition_status labels, value =
status) to every CRD that carries status.conditions: Platform, BudgetPolicy,
AgentFleet, EvalSuite, plus the four newly-added CRDs. phase=Ready can mask a
degraded reconcile; conditions are the controller's real health truth, so
"<Kind> not Ready" alerts now have a series to fire on for every resource.

─────────────────────── Four dark CRDs ─────────────────────────
ModelGateway, AgentSandbox, SandboxPool, BatchJob were entirely unobserved.
Each now projects phase (StateSet), conditions, and its load-bearing gauges:
  - ModelGateway   observedGeneration
  - AgentSandbox   podPhase (StateSet), completedAt
  - SandboxPool    readyWorkers (nilIsZero — under-provision alerts fire on an
                   unpopulated status instead of silently vanishing)
  - BatchJob       failedCount, succeededCount, recordCount

─────────────────────── Gauge deepening ────────────────────────
  - Tenant         platformCount, readyPlatformCount, suspendedPlatformCount,
                   lastReconciled (fleet-size denominator + reconcile staleness)
  - BudgetPolicy   percentOfBudget, conditions (cap-unenforced-if-stale)
  - AgentFleet     observedGeneration (unapplied spec change)
  - EvalSuite      passThreshold, lastRunAt
  - Platform       observedGeneration

RBAC: granted KSM list/watch on agentfleets, agentsandboxes, sandboxpools,
modelgateways, batchjobs (agents.nanohype.dev) — required or the new resource
blocks would silently emit nothing.

Every projected path verified against the operator CRD schemas; phase fields
are free strings so the StateSet enum lists are best-effort. KSM parses the
whole customResourceState as one unit, so the config was validated for
structural correctness (one malformed block breaks all kube_customresource_*).
@stxkxs stxkxs changed the title fix(dashboards): de-hollow the agent-* persona panels via KSM fix(dashboards): de-hollow agent-* panels via KSM + deepen CR-state coverage to the full operator surface Jun 24, 2026
@github-actions

Copy link
Copy Markdown

CI Results

Check Status
YAML Lint
Environment Kustomize Build
dev
staging
production

All validations passed.

@stxkxs stxkxs merged commit eeaf7dd into main Jun 24, 2026
7 checks passed
@stxkxs stxkxs deleted the dehollow-agent-dashboards branch June 24, 2026 00:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant