Context
fab tracks rich per-role factory-run signals but they never leave a local file. RoleMetrics (src/perf.ts:10-19) carries sessions, selfEvalPass/selfEvalFail, advisorCalls, revisions, totalInputTokens/totalOutputTokens, lastActive; collectSessionMetrics (src/perf.ts:50) populates it from session usage; formatPerfReport (src/perf.ts:90) renders an ASCII table; savePerf (src/perf.ts:44) writes .fab-perf.json in the cwd. There is no /metrics endpoint, no OTLP exporter, no scrape surface — so factory-run health is invisible to the observability stack (the dashboards audit graded fab F on this basis: signals trapped in a local file).
Proposed
Add a metrics exporter over RoleMetrics so factory runs are observable, then an SRE dashboard. fab already ships k8s Job manifests (deploy/) and runs in-cluster, so a Prometheus /metrics endpoint scraped by the Grafana Agent (→ AMP) is the natural fit; an OTLP push is the alternative for ephemeral CLI runs.
Metric mapping (labels: role, phase, group):
fab_agent_sessions_total ← sessions
fab_self_eval_total{result="pass|fail"} ← selfEvalPass/selfEvalFail — the factory's quality "error rate" (self-eval fail ratio is the SLI)
fab_advisor_calls_total ← advisorCalls (Opus escalation rate)
fab_revisions_total ← revisions (merge-gate revision-loop pressure)
fab_tokens_total{direction="input|output"} ← totalInputTokens/totalOutputTokens (spend)
fab_role_session_duration_seconds (run-duration histogram): not currently captured; add it in collectSessionMetrics.
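A minimal sketch of the counter half of this mapping, rendered as Prometheus text exposition. Assumptions: `RoleMetricsSketch` is a hypothetical stand-in built from the field names listed above (the real `RoleMetrics` in src/perf.ts may differ, and `lastActive` is left out), `renderPrometheus` is an illustrative name, and only the `role` label is emitted because the issue does not say where `phase` and `group` come from. The duration histogram is omitted since that signal is not captured yet.

```typescript
// Hypothetical stand-in for RoleMetrics; field names are taken from the issue text.
interface RoleMetricsSketch {
  sessions: number;
  selfEvalPass: number;
  selfEvalFail: number;
  advisorCalls: number;
  revisions: number;
  totalInputTokens: number;
  totalOutputTokens: number;
}

type Labels = Record<string, string>;

// Escape backslash, double quote, and newline, as the text exposition format requires.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Format one sample line: name{k="v",...} value
function sample(name: string, labels: Labels, value: number): string {
  const body = Object.keys(labels)
    .map((key) => `${key}="${escapeLabelValue(labels[key])}"`)
    .join(",");
  return `${name}{${body}} ${value}`;
}

// Render every role's counters. Samples are grouped by metric name so that
// each metric family gets exactly one "# TYPE" line.
function renderPrometheus(byRole: Record<string, RoleMetricsSketch>): string {
  const families = new Map<string, string[]>();
  const add = (name: string, labels: Labels, value: number): void => {
    const lines = families.get(name) ?? [];
    lines.push(sample(name, labels, value));
    families.set(name, lines);
  };

  for (const role of Object.keys(byRole)) {
    const m = byRole[role];
    add("fab_agent_sessions_total", { role }, m.sessions);
    add("fab_self_eval_total", { role, result: "pass" }, m.selfEvalPass);
    add("fab_self_eval_total", { role, result: "fail" }, m.selfEvalFail);
    add("fab_advisor_calls_total", { role }, m.advisorCalls);
    add("fab_revisions_total", { role }, m.revisions);
    add("fab_tokens_total", { role, direction: "input" }, m.totalInputTokens);
    add("fab_tokens_total", { role, direction: "output" }, m.totalOutputTokens);
  }

  const out: string[] = [];
  families.forEach((lines, name) => {
    out.push(`# TYPE ${name} counter`, ...lines);
  });
  return out.join("\n") + "\n";
}
```

Counter semantics only hold if the persisted totals never decrease between scrapes. If .fab-perf.json can be reset, a gauge or a process-lifetime counter may be the safer choice.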
Then a fab GrafanaDashboard CR in eks-gitops following the established pattern (self-contained PromQL, SLO row on self-eval pass rate, RED by phase): agent-run rate, self-eval pass-rate SLO + burn, token spend by role/phase, revision + advisor rates, run duration p50/p95/p99. Pair with a Grafana-managed alert group (self-eval pass-rate burn, merge-gate-stall).
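The SLO row reduces to two small calculations, sketched here in TypeScript to pin down the arithmetic. On the dashboard itself this would be PromQL over `fab_self_eval_total`. Assumptions: the function names are illustrative, and the 0.95 target used below is a placeholder, since the issue does not set an SLO target.

```typescript
// Fraction of self-evals that failed; defined as 0 when no evals ran.
function selfEvalFailRatio(pass: number, fail: number): number {
  const total = pass + fail;
  return total === 0 ? 0 : fail / total;
}

// Burn rate = observed fail ratio / error budget, where budget = 1 - SLO target.
// A burn rate of 1 spends the budget exactly over the SLO window.
function burnRate(failRatio: number, sloTarget: number): number {
  if (!(sloTarget > 0 && sloTarget < 1)) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  return failRatio / (1 - sloTarget);
}
```

For example, 10 failures out of 100 self-evals against a hypothetical 0.95 pass-rate target gives a fail ratio of 0.1 and a burn rate of roughly 2, meaning the error budget is being spent about twice as fast as the target allows.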
Scope note
Deferred from the SRE-dashboards push (decision: fab/cloudgov tracked as issues). This is the metrics-surface prerequisite; the dashboard depends on it. If fab runs only as short-lived Jobs, prefer OTLP push (or a pushgateway) over scrape.
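For the short-lived Job case, a pushgateway push could look roughly like the sketch below (the OTLP route is not sketched here). It relies on the Pushgateway convention of sending text exposition to `/metrics/job/<job>/<label>/<value>`. Assumptions: `PushTarget`, `pushUrl`, `pushMetrics`, and the `Send` type are illustrative names, and the HTTP call is injected so the sketch does not depend on a particular runtime. Path segments containing `/` are rejected for simplicity, not encoded.

```typescript
interface PushTarget {
  baseUrl: string;                  // gateway address; any value used in examples is hypothetical
  job: string;                      // value for the "job" grouping label
  grouping: Record<string, string>; // extra grouping labels, e.g. { role: "builder" }
}

// Minimal shape of the HTTP call; a fetch-style function fits this signature.
type Send = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number }>;

// Build the grouping-key URL: <base>/metrics/job/<job>/<label>/<value>...
function pushUrl(target: PushTarget): string {
  const segments = ["metrics", "job", target.job];
  for (const key of Object.keys(target.grouping)) {
    segments.push(key, target.grouping[key]);
  }
  for (const segment of segments) {
    if (segment === "" || segment.indexOf("/") !== -1) {
      throw new Error(`unsupported path segment: "${segment}"`);
    }
  }
  const base = target.baseUrl.replace(/\/+$/, "");
  return base + "/" + segments.map((s) => encodeURIComponent(s)).join("/");
}

// PUT replaces every metric previously pushed under the same grouping key.
async function pushMetrics(target: PushTarget, body: string, send: Send): Promise<void> {
  const res = await send(pushUrl(target), {
    method: "PUT",
    headers: { "Content-Type": "text/plain; version=0.0.4" },
    body,
  });
  if (!res.ok) {
    throw new Error(`push failed: HTTP ${res.status}`);
  }
}
```

A Job would call `pushMetrics` once just before exit, passing the rendered exposition text as `body`. Pushed series stay in the gateway until they are deleted, so stale roles need explicit cleanup.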