fix(dashboards): de-hollow agent-* panels via KSM + deepen CR-state coverage to the full operator surface #55
Merged
…ent-platform#47)
The agent-* persona dashboards queried agents_* metrics the operator never
registers, so ~a third of their panels rendered no-data. Most of that data IS
available as CR status — this rewrites those panels to the kube_customresource_*
metrics kube-state-metrics emits, and extends the customResourceState config for
the few fields it didn't yet project.
KSM customResourceState (addons/observability/kube-state-metrics/values.yaml):
- New status_field gauges: BudgetPolicy.status.currentSpendUsd, .killSwitchFiredAt,
BudgetPolicy.spec.monthlyUsd (the threshold), Tenant.status.aggregateSpendUsd.
Structurally identical to the proven entries (string gauges, nilIsZero).
- Add resource-level labelsFromPath name/namespace to every resource block, so
kube_customresource_* carries the `name` label the dashboards group by — this
also hardens the operator's existing PrometheusRule alerts, which already
depend on {{ $labels.name }} but had no config emitting it.
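A minimal sketch of what one such resource block might look like in the customResourceState config. The group and the field paths come from this change; the API version `v1alpha1` and the metric names are assumptions for illustration, not copied from values.yaml.

```yaml
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: agents.nanohype.dev
        version: v1alpha1            # assumed; use the version the CRD actually serves
        kind: BudgetPolicy
      # Resource-level labels: attached to every metric emitted from this block,
      # which is what gives the series their `name` and `namespace` labels.
      labelsFromPath:
        name: [metadata, name]
        namespace: [metadata, namespace]
      metrics:
        - name: budgetpolicy_current_spend_usd   # hypothetical metric name
          help: Month-to-date spend from BudgetPolicy.status.currentSpendUsd
          each:
            type: Gauge
            gauge:
              path: [status, currentSpendUsd]    # string value, parsed to a float by KSM
              nilIsZero: true                    # emit 0 while the field is unpopulated
        - name: budgetpolicy_monthly_usd         # hypothetical metric name
          help: Monthly threshold from BudgetPolicy.spec.monthlyUsd
          each:
            type: Gauge
            gauge:
              path: [spec, monthlyUsd]
              nilIsZero: true
```

With the default metric name prefix, these entries would surface as `kube_customresource_budgetpolicy_current_spend_usd` and `kube_customresource_budgetpolicy_monthly_usd`, each carrying `name`, `namespace`, and the automatic `customresource_*` labels.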
Dashboard rewrites (agent-{founder,ops,finance}):
- agents_platform_status_phase{Ready} -> kube_customresource_status_phase Platform Ready
- agents_eval_run_score -> EvalSuite lastScore (already emitted)
- agents_agent_runtime_replicas -> AgentFleet readyAgents (already emitted)
- agents_spend_report_current_usd -> BudgetPolicy currentSpendUsd
- agents_budget_policy_threshold_usd -> BudgetPolicy monthlyUsd (by name)
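As a rough illustration of the "by name" threshold lookup, the spend and threshold series can be joined on their `name` and `namespace` labels. It is shown here as a PrometheusRule only to keep the expression in YAML; the same `expr` could back a dashboard panel. Both metric names are hypothetical stand-ins for whatever the config actually emits.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: agent-budget-example            # hypothetical
spec:
  groups:
    - name: agent-budget.example
      rules:
        - alert: BudgetPolicyOverThreshold
          # Spend divided by threshold, matched per policy on namespace/name.
          # Metric names are assumptions; substitute the real series names.
          expr: |
            kube_customresource_budgetpolicy_current_spend_usd
              / on (namespace, name)
            kube_customresource_budgetpolicy_monthly_usd
              > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: 'BudgetPolicy {{ $labels.name }} is over its monthly threshold'
```

The join and the `{{ $labels.name }}` annotation only resolve if both series carry a `name` label. One caveat of this sketch: a threshold of 0 with non-zero spend divides to +Inf and would fire.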
Panel titles updated to match the real semantics (e.g. "Spend month-to-date",
"Latest EvalSuite scores", "Ready agents").
Deferred (genuinely runtime/data-plane, not CR-projectable; tracked in #47):
agentgateway_* (agent-agentgateway/agent-ops, incl. the invocation_total vs
invocations_total name split) and agents_agent_invocations_total (agent-founder).
Quality-checked (Systems A-/Code A/Consistency A-): CRD fields verified to exist,
KSM string-gauge parsing confirmed, blast-radius config kept structurally identical.
yamllint + kustomize build green.
CI Results
All validations passed.
…ted ones
The agentgateway panels referenced agentgateway_invocation_total / agentgateway_invocations_total / agentgateway_invocation_duration_seconds, none of which agentgateway emits (verified against agentgateway.dev/docs). The real metrics are agentgateway_llm_requests_total and agentgateway_llm_request_duration_seconds (port 15020). This commit fixes the names and eliminates the singular/plural split between agent-agentgateway and agent-ops.
The per-label drill-downs (platform / model_id / status / route filters) still assume a label model that doesn't match agentgateway's OTel gen_ai_* conventions. Those, plus the scrape annotation (port 15020), will be tuned at first scrape against a live gateway (recipe in eks-agent-platform#47). With the names now correct, that work reduces to a label pass.
JSON valid; yamllint clean.
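For orientation, a scrape annotation for the gateway pods might look like the fragment below. This is a sketch under assumptions: the `prometheus.io/*` keys are a common convention that only takes effect if the Prometheus scrape config relabels on them, and the metrics path is a guess to be confirmed against a live gateway. Only the port comes from this change.

```yaml
# Pod template metadata for the agentgateway workload (fragment).
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "15020"      # metrics port named in this change
    prometheus.io/path: /metrics     # assumed path; confirm at first scrape
```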
… operator surface
The customResourceState config projected 5 of the 9 operator CRDs, and
conditions on only one (Tenant). Every agent-* dashboard de-hollowed in this
branch reads kube_customresource_* — so the metrics had to actually exist for
those panels to render. This closes the gap to the full CRD surface.
─────────────────────── Conditions sweep ───────────────────────
Added the conditions block (condition_type + condition_status labels, value =
status) to every CRD that carries status.conditions: Platform, BudgetPolicy,
AgentFleet, EvalSuite, plus the four newly-added CRDs. phase=Ready can mask a
degraded reconcile; conditions are the controller's real health truth, so
"<Kind> not Ready" alerts now have a series to fire on for every resource.
─────────────────────── Four dark CRDs ─────────────────────────
ModelGateway, AgentSandbox, SandboxPool, BatchJob were entirely unobserved.
Each now projects phase (StateSet), conditions, and its load-bearing gauges:
- ModelGateway observedGeneration
- AgentSandbox podPhase (StateSet), completedAt
- SandboxPool readyWorkers (nilIsZero — under-provision alerts fire on an
unpopulated status instead of silently vanishing)
- BatchJob failedCount, succeededCount, recordCount
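A sketch of how one of these resource blocks could combine the three projections (phase as a StateSet, conditions, and a nilIsZero gauge), using SandboxPool as the example. The API version, the conditions and readyWorkers metric names, and the phase values in `list` are assumptions; `status_phase` matches the `kube_customresource_status_phase` series the dashboards query.

```yaml
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: agents.nanohype.dev
        version: v1alpha1              # assumed
        kind: SandboxPool
      labelsFromPath:
        name: [metadata, name]
        namespace: [metadata, namespace]
      metrics:
        - name: status_phase
          help: SandboxPool phase, one series per listed state
          each:
            type: StateSet
            stateSet:
              labelName: phase
              path: [status, phase]
              # Illustrative values only: phase is a free string in the CRD,
              # so any value missing from this list never reports 1.
              list: [Pending, Ready, Degraded, Failed]
        - name: status_conditions        # hypothetical metric name
          help: SandboxPool status.conditions, one series per condition
          each:
            type: Gauge
            gauge:
              path: [status, conditions]
              labelsFromPath:
                condition_type: [type]
                condition_status: [status]
              valueFrom: [status]        # "True" -> 1, "False" -> 0
        - name: sandboxpool_ready_workers  # hypothetical metric name
          help: Ready workers from SandboxPool.status.readyWorkers
          each:
            type: Gauge
            gauge:
              path: [status, readyWorkers]
              nilIsZero: true            # unpopulated status reads as 0, not absent
```

Sharing the `status_phase` name across kinds keeps one series family for all phases; panels and alerts can then select a kind with the automatic `customresource_kind` label.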
─────────────────────── Gauge deepening ────────────────────────
- Tenant platformCount, readyPlatformCount, suspendedPlatformCount,
lastReconciled (fleet-size denominator + reconcile staleness)
- BudgetPolicy percentOfBudget, conditions (cap-unenforced-if-stale)
- AgentFleet observedGeneration (unapplied spec change)
- EvalSuite passThreshold, lastRunAt
- Platform observedGeneration
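To illustrate the reconcile-staleness use of lastReconciled: KSM converts an RFC 3339 string to a Unix timestamp, so staleness is a subtraction from `time()`. The metric name, the 30-minute threshold, and the rule name below are all hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: agent-reconcile-staleness-example   # hypothetical
spec:
  groups:
    - name: agent-reconcile.example
      rules:
        - alert: TenantReconcileStale
          # Seconds since the controller last reconciled this Tenant.
          # Metric name is an assumption; 1800s is an arbitrary example threshold.
          expr: time() - kube_customresource_tenant_last_reconciled > 1800
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'Tenant {{ $labels.name }} has not reconciled in over 30 minutes'
```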
RBAC: granted KSM list/watch on agentfleets, agentsandboxes, sandboxpools,
modelgateways, batchjobs (agents.nanohype.dev) — required or the new resource
blocks would silently emit nothing.
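The grant amounts to a rule like the following sketch. The ClusterRole name is hypothetical, and how the rule is wired into the kube-state-metrics service account (a chart value or a separate binding) depends on the deployment.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics-agents-crs    # hypothetical name
rules:
  - apiGroups: [agents.nanohype.dev]
    resources:
      - agentfleets
      - agentsandboxes
      - sandboxpools
      - modelgateways
      - batchjobs
    # KSM only needs to list and watch custom resources to project their state.
    verbs: [list, watch]
```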
Every projected path verified against the operator CRD schemas; phase fields
are free strings so the StateSet enum lists are best-effort. KSM parses the
whole customResourceState as one unit, so the config was validated for
structural correctness (one malformed block breaks all kube_customresource_*).
What
Two related changes, both about making the agent-* persona dashboards read real series instead of inventing metrics nothing emits.

1. De-hollow the agent-* persona panels (the original scope)
The agent-founder / agent-finance / agent-ops / agent-agentgateway dashboards had panels querying metrics that no exporter produces. Rewrote them to read kube_customresource_* (the operator's CRD status, projected by kube-state-metrics) and corrected the agentgateway metric names to the documented agentgateway_llm_* family.

2. Deepen the kube-state-metrics CR-state config to the full CRD surface
The de-hollowed panels read kube_customresource_*, so those series have to exist. The customResourceState config projected only 5 of the 9 operator CRDs, with conditions on just one. This closes the gap:
- Conditions sweep: condition_type + condition_status on every CRD that carries status.conditions (Platform, BudgetPolicy, AgentFleet, EvalSuite + the four new ones). phase=Ready can mask a degraded reconcile; conditions are the real health truth.
- Four dark CRDs (ModelGateway, AgentSandbox, SandboxPool, BatchJob): each now projects phase (StateSet), conditions, and its load-bearing gauges (readyWorkers, failedCount/succeededCount/recordCount, completedAt, podPhase, observedGeneration).
- Gauge deepening: Tenant platform counts + lastReconciled; BudgetPolicy percentOfBudget; AgentFleet/Platform/ModelGateway observedGeneration; EvalSuite passThreshold/lastRunAt.
- RBAC: KSM list/watch on the new agents.nanohype.dev resources (else the new blocks emit nothing).

Verification
- Every projected path verified against the operator CRD schemas.
- KSM parses the whole customResourceState as one unit (one malformed block breaks all kube_customresource_*), so the config was structurally validated end-to-end + yamllint clean.
- readyWorkers/count gauges use nilIsZero so under-provision alerts fire on an unpopulated status rather than silently vanishing.

Delivery note
Dashboards ship as GrafanaDashboard CRs (grafana-operator, both prod and kx). The KSM customResourceState here is the single source; the duplicate copy in the operator chart was removed in eks-agent-platform#48.