
Export factory-run metrics for an SRE dashboard (RoleMetrics → /metrics or OTLP) #34

Description

@stxkxs

Context

fab tracks rich per-role factory-run signals but they never leave a local file. RoleMetrics (src/perf.ts:10-19) carries sessions, selfEvalPass/selfEvalFail, advisorCalls, revisions, totalInputTokens/totalOutputTokens, lastActive; collectSessionMetrics (src/perf.ts:50) populates it from session usage; formatPerfReport (src/perf.ts:90) renders an ASCII table; savePerf (src/perf.ts:44) writes .fab-perf.json in the cwd. There is no /metrics endpoint, no OTLP exporter, no scrape surface — so factory-run health is invisible to the observability stack (the dashboards audit graded fab F on this basis: signals trapped in a local file).

Proposed

Add a metrics exporter over RoleMetrics so factory runs are observable, then an SRE dashboard. fab already ships k8s Job manifests (deploy/) and runs in-cluster, so a Prometheus /metrics endpoint scraped by the Grafana Agent (→ AMP) is the natural fit; an OTLP push is the alternative for ephemeral CLI runs.

Metric mapping (labels: role, phase, group):

  • fab_agent_sessions_total ← sessions
  • fab_self_eval_total{result="pass|fail"} ← selfEvalPass/selfEvalFail; the factory's quality "error rate" (self-eval fail ratio is the SLI)
  • fab_advisor_calls_total ← advisorCalls (Opus escalation rate)
  • fab_revisions_total ← revisions (merge-gate revision-loop pressure)
  • fab_tokens_total{direction="input|output"} ← totalInputTokens/totalOutputTokens (spend)
  • a run-duration histogram (fab_role_session_duration_seconds) — not currently captured; add it in collectSessionMetrics.
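
As a rough illustration of the mapping above, here is a minimal sketch that renders per-role counters in the Prometheus text exposition format. The RoleMetrics shape is inferred from the field names listed in this issue (field types are assumed, and the real interface in src/perf.ts may differ); renderPrometheus, family, and escapeLabel are hypothetical helpers, only the role label is emitted because the source of phase and group is not specified here, and the duration histogram is left out because that data is not captured yet. It also assumes the fields are cumulative totals, which is what a counter requires.

```typescript
// Hypothetical shape inferred from the field names in this issue;
// the real interface lives in src/perf.ts and may differ.
interface RoleMetrics {
  sessions: number;
  selfEvalPass: number;
  selfEvalFail: number;
  advisorCalls: number;
  revisions: number;
  totalInputTokens: number;
  totalOutputTokens: number;
  lastActive: string; // assumed to be a timestamp string; not exported below
}

type Sample = { labels: Record<string, string>; value: number };

// Escape backslash, double quote, and newline in a label value,
// as the Prometheus text exposition format requires.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one counter family: HELP line, TYPE line, then one line per sample.
function family(name: string, help: string, samples: Sample[]): string[] {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const s of samples) {
    const labels = Object.keys(s.labels)
      .map((k) => `${k}="${escapeLabel(s.labels[k])}"`)
      .join(",");
    lines.push(`${name}{${labels}} ${s.value}`);
  }
  return lines;
}

// Hypothetical exporter: maps each role's RoleMetrics onto the metric names
// proposed in this issue, labelled by role only.
function renderPrometheus(byRole: Record<string, RoleMetrics>): string {
  const sessions: Sample[] = [];
  const selfEval: Sample[] = [];
  const advisor: Sample[] = [];
  const revisions: Sample[] = [];
  const tokens: Sample[] = [];

  for (const role of Object.keys(byRole)) {
    const m = byRole[role];
    sessions.push({ labels: { role }, value: m.sessions });
    selfEval.push({ labels: { role, result: "pass" }, value: m.selfEvalPass });
    selfEval.push({ labels: { role, result: "fail" }, value: m.selfEvalFail });
    advisor.push({ labels: { role }, value: m.advisorCalls });
    revisions.push({ labels: { role }, value: m.revisions });
    tokens.push({ labels: { role, direction: "input" }, value: m.totalInputTokens });
    tokens.push({ labels: { role, direction: "output" }, value: m.totalOutputTokens });
  }

  const lines = [
    ...family("fab_agent_sessions_total", "Agent sessions per role.", sessions),
    ...family("fab_self_eval_total", "Self-eval outcomes per role.", selfEval),
    ...family("fab_advisor_calls_total", "Advisor calls per role.", advisor),
    ...family("fab_revisions_total", "Revisions per role.", revisions),
    ...family("fab_tokens_total", "Tokens consumed per role.", tokens),
  ];
  return lines.join("\n") + "\n";
}
```

The same string could be served from a /metrics handler or translated into OTLP data points; which surface to use depends on whether fab runs long enough to be scraped.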

Then a fab GrafanaDashboard CR in eks-gitops following the established pattern (self-contained PromQL, SLO row on self-eval pass rate, RED by phase): agent-run rate, self-eval pass-rate SLO + burn, token spend by role/phase, revision + advisor rates, run duration p50/p95/p99. Pair with a Grafana-managed alert group (self-eval pass-rate burn, merge-gate-stall).
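
The self-eval pass-rate SLO and its burn rate come down to simple arithmetic, sketched below in TypeScript for clarity. On the dashboard this would be PromQL over windowed counter rates rather than application code. The function names are hypothetical, and the 0.95 target in the usage note is a placeholder, not a value from this issue.

```typescript
// Fraction of self-evals that failed in a window.
// Returns undefined when there were no self-evals, so "no traffic"
// is not mistaken for a perfect (or failing) window.
function selfEvalFailRatio(pass: number, fail: number): number | undefined {
  const total = pass + fail;
  return total === 0 ? undefined : fail / total;
}

// Burn rate: observed fail ratio divided by the error budget (1 - SLO target).
// A value of 1 means the budget is being spent exactly at the sustainable pace;
// values above 1 mean it will run out before the SLO window ends.
function burnRate(failRatio: number, sloTarget: number): number {
  const budget = 1 - sloTarget;
  if (budget <= 0) {
    throw new RangeError("sloTarget must be below 1");
  }
  return failRatio / budget;
}
```

For example, 90 passes and 10 fails give a fail ratio of 0.1; against a placeholder 0.95 pass-rate target (error budget 0.05), that is a burn rate of roughly 2.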

Scope note

Deferred from the SRE-dashboards push (decision: fab/cloudgov tracked as issues). This is the metrics-surface prerequisite; the dashboard depends on it. If fab runs only as short-lived Jobs, prefer OTLP push (or a pushgateway) over scrape.
