feat(quality): persist the gate's grade signal as a cross-engagement trend#33
Open
stxkxs wants to merge 1 commit into
The merge gate grades every PR across the 9 QUALITY_RUBRIC dimensions and the
external-reviewer re-grades cold, but those grades only ever drove a single
ship/block decision and were then discarded. The factory could not answer the
one question that tells it whether it is improving: are my grades trending up
or down across engagements? This wires the existing signal into an append-only
record and surfaces the trend.
─────────────────────────── What changed ───────────────────────────
src/quality.ts (new) — the quality log.
- QualityRun record: timestamp, workflow, gate profile, final decision,
revision attempts, the aggregate internal grades, and (for calibrated
code runs) the external-reviewer grades + drift.
- appendQualityRun / loadQualityRuns over an append-only JSONL at
~/.fab/quality.jsonl — next to state.json, so the signal spans every repo
the factory ships rather than one working tree. Overridable with
FAB_QUALITY_FILE, mirroring FAB_STATE_FILE.
- gradeToGpa maps letters (with +/-) to a 0–4.3 scale; N/A is excluded.
- formatQualityTrend renders a per-dimension table (overall vs recent-window
GPA with a direction arrow — declining dimensions show in red) plus a
footer with approval rate, calibration coverage, and drift rate.
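The quality log described above can be sketched as follows. This is a minimal illustration, not the code in src/quality.ts: the `QualityRun` field names, the `qualityFile` helper, and the exact letter-to-GPA mapping (A+ = 4.3, A = 4.0, A- = 3.7, and so on down to F = 0) are assumptions; the source only states the record's contents, the file location, the `FAB_QUALITY_FILE` override, and the 0–4.3 range.

```typescript
import { appendFileSync, existsSync, mkdirSync, readFileSync } from "node:fs";
import { homedir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical record shape: field names are assumptions based on the description.
interface QualityRun {
  timestamp: string;
  workflow: string;
  profile: string;
  decision: string;
  attempts: number;
  grades: Record<string, string>; // dimension -> letter grade
  externalGrades?: Record<string, string>; // only for calibrated code runs
  drift?: boolean; // assumed to be a per-run flag
}

// The FAB_QUALITY_FILE override wins; otherwise fall back to ~/.fab/quality.jsonl.
function qualityFile(): string {
  return process.env.FAB_QUALITY_FILE ?? join(homedir(), ".fab", "quality.jsonl");
}

// Append one record as a single JSON line; existing lines are never rewritten.
function appendQualityRun(run: QualityRun, file: string = qualityFile()): void {
  mkdirSync(dirname(file), { recursive: true });
  appendFileSync(file, JSON.stringify(run) + "\n", "utf8");
}

// Read every record back; a missing file simply means no runs yet.
function loadQualityRuns(file: string = qualityFile()): QualityRun[] {
  if (!existsSync(file)) return [];
  return readFileSync(file, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as QualityRun);
}

// Assumed mapping onto a conventional 4.3 scale; returns null for "N/A"
// (and anything else that is not a letter grade) so callers can exclude it.
function gradeToGpa(grade: string): number | null {
  const match = /^([A-DF])([+-])?$/.exec(grade.trim().toUpperCase());
  if (!match) return null;
  if (match[1] === "F") return 0;
  const base = { A: 4, B: 3, C: 2, D: 1 }[match[1] as "A" | "B" | "C" | "D"];
  const adjust = match[2] === "+" ? 0.3 : match[2] === "-" ? -0.3 : 0;
  return Math.round((base + adjust) * 10) / 10; // avoid float noise such as 3.6999…
}
```

One JSON object per line keeps each write a plain append, so a partially written final line affects only that record rather than the whole history.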
src/gate.ts — extracted aggregateGrades(verdicts), the internal-grade
aggregation the external calibration already did inline (advisory verdicts
skipped, later verdict wins on collision). Now shared by the gate and the
quality record so they cannot diverge.
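The two aggregation rules named above (advisory verdicts skipped, later verdict wins on collision) can be illustrated with a short sketch. The `Verdict` shape here is hypothetical; only the function name and the two rules come from the description.

```typescript
// Hypothetical verdict shape: the real type in src/gate.ts may differ.
interface Verdict {
  reviewer: string;
  advisory?: boolean;
  grades: Record<string, string>; // dimension -> letter grade
}

// Merge per-reviewer grades into one grade per dimension.
function aggregateGrades(verdicts: Verdict[]): Record<string, string> {
  const merged: Record<string, string> = {};
  for (const verdict of verdicts) {
    if (verdict.advisory) continue; // advisory verdicts never contribute
    for (const [dimension, grade] of Object.entries(verdict.grades)) {
      merged[dimension] = grade; // a later verdict overwrites an earlier one
    }
  }
  return merged;
}

const example = aggregateGrades([
  { reviewer: "first", grades: { correctness: "B", tests: "A" } },
  { reviewer: "advisor", advisory: true, grades: { correctness: "F" } },
  { reviewer: "second", grades: { correctness: "A-" } },
]);
// example is { correctness: "A-", tests: "A" }
```

Because both the gate and the quality record would call the same function, a change to either rule shows up in both places at once.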
src/workflows.ts — capture at the gate, not a new pass.
- runExternalCalibration now returns the internal + external grades and the
drift alongside its blocking result, instead of returning null and
throwing the grades away on the aligned path.
- runMergeGate records exactly one QualityRun on every terminal path
(approve, drift-block reject, gate reject, exhausted revisions) via a
best-effort recordQuality helper — a metrics write is wrapped so it can
never break the gate.
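A best-effort write of this kind might look like the sketch below. The signature is an assumption made for illustration (the real `recordQuality` helper is not shown in this description); the point is only that any exception from the metrics write is caught and reported rather than propagated to the gate.

```typescript
// Assumed signature: the writer and the warning sink are injected so the
// wrapper can be exercised without touching the filesystem.
function recordQuality<T>(
  run: T,
  write: (run: T) => void,
  warn: (message: string) => void = console.warn,
): boolean {
  try {
    write(run);
    return true;
  } catch (err) {
    // Swallow the failure: losing one metrics record must not change the
    // gate's ship/block decision.
    const reason = err instanceof Error ? err.message : String(err);
    warn(`quality log write failed (ignored): ${reason}`);
    return false;
  }
}
```

Calling it once at each terminal path, after the decision is final, is one way to satisfy the "exactly one QualityRun" property without letting the write influence the outcome.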
src/bin/fab.ts — `fab perf` now prints the quality trend below the agent
performance table. One command, the whole picture.
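The per-dimension comparison behind the printed trend (overall GPA versus recent-window GPA, with a direction arrow and red for declining dimensions) could be computed roughly as follows. The window size of 5, the 0.05 dead band, the arrow glyphs, and the row layout are all assumptions; the description does not state them.

```typescript
type Direction = "↑" | "↓" | "→";

interface Trend {
  overall: number;
  recent: number;
  direction: Direction;
}

function mean(values: number[]): number | null {
  if (values.length === 0) return null;
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// gpas: one dimension's GPA values in chronological order (N/A already excluded).
function trendFor(gpas: number[], window = 5, epsilon = 0.05): Trend | null {
  const overall = mean(gpas);
  const recent = mean(gpas.slice(-window));
  if (overall === null || recent === null) return null;
  const delta = recent - overall;
  const direction: Direction = delta > epsilon ? "↑" : delta < -epsilon ? "↓" : "→";
  return { overall, recent, direction };
}

// Render one table row; declining dimensions are wrapped in ANSI red.
function formatRow(dimension: string, trend: Trend): string {
  const row = `${dimension.padEnd(16)} ${trend.overall.toFixed(2)}  ${trend.recent.toFixed(2)}  ${trend.direction}`;
  return trend.direction === "↓" ? `\x1b[31m${row}\x1b[0m` : row;
}
```

With fewer runs than the window, the recent and overall means coincide and the arrow stays flat, which matches the note in Scope that the signal only becomes meaningful once the log has real run density.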
src/index.ts — exports the quality surface.
─────────────────────────── Scope ───────────────────────────
This is the loop-closing wire-up, deliberately not a new repo or datastore.
A frozen golden-brief benchmark and a re-grading harness become worthwhile
only once the JSONL has real run density — they are explicitly deferred.
collectSessionMetrics (the per-role token table behind `fab perf`) remains
unwired; it is managed-agents-API-specific and orthogonal to the grade trend.
Tracked separately.
Tests: quality.test.ts (roundtrip, gradeToGpa, trend formatting/footer) and
aggregateGrades cases in gate.test.ts. Full suite, typecheck, lint, and
prettier all green.
Why
The merge gate grades every PR across the 9 `QUALITY_RUBRIC` dimensions, and the `external-reviewer` re-grades cold for calibration. Until now those grades drove a single ship/block decision and were then discarded: `runExternalCalibration` returned `null` on the aligned path, throwing the grades away. The factory could not answer the question that tells it whether it is getting better: are my grades trending up or down across engagements?

This closes the loop by persisting the signal the gate already computes, and surfacing the trend in `fab perf`. It is a wire-up, not a new repo or datastore.

What
- `src/quality.ts` (new): `QualityRun` records (timestamp, workflow, profile, decision, attempts, internal grades, external grades + drift for calibrated runs) over an append-only JSONL at `~/.fab/quality.jsonl`. Lives next to `state.json` so the trend is cross-engagement, not per-working-tree. `FAB_QUALITY_FILE` override mirrors `FAB_STATE_FILE`. Includes `gradeToGpa` (letters → 0–4.3) and `formatQualityTrend` (per-dimension overall-vs-recent table with a direction arrow; declining dimensions in red; footer with approval / calibration / drift rates).
- `src/gate.ts`: extracted `aggregateGrades(verdicts)` (the aggregation the calibration did inline) so the gate and the quality record share one source of truth.
- `src/workflows.ts`: `runExternalCalibration` now returns the grades + drift alongside its blocking result instead of discarding them; `runMergeGate` records exactly one `QualityRun` per terminal path (approve / drift-block / reject / exhausted) via a best-effort helper that can never break the gate.
- `src/bin/fab.ts`: `fab perf` prints the quality trend under the agent table.

Scope / deferred
`collectSessionMetrics` (the per-role token table) stays unwired; it's managed-agents-API-specific and orthogonal. Tracked in #32 ("fab perf: collectSessionMetrics is never called — per-role table stays empty").

Verification
`npm run build`, `npm run typecheck`, `npm test` (all suites), `npm run lint`, `npm run format:check`: all green. New tests: `quality.test.ts` + `aggregateGrades` cases in `gate.test.ts`.