v0.41.1.0 feat: eval-loop wave — gbrain bench publish + gbrain eval gate close the LOOP by garrytan · Pull Request #1352 · garrytan/gbrain

garrytan · 2026-05-24T08:53:32Z

Summary

Your CI can now fail a PR when search retrieval gets worse.

v0.41 closes the eval LOOP gbrain has been building toward across 3 prior waves. Before this release, capture / replay / nightly probe / cross-modal runner all existed but none of them GATED. Now gbrain eval gate is the CI verb that fails PRs on retrieval regressions OR correctness drops.

Two ways to fail the gate:

Regression gate (--baseline) — replays a captured baseline, catches "did my refactor break search?"
Correctness gate (--qrels) — runs known-right queries against the brain, catches "is search actually any good?" via recall@K + first-relevant-hit + expected_top1

Both source-id-aware (source_id::slug compares) so federated brains can't false-pass via wrong-source hits — the canonical gbrain multi-source pitfall closed structurally at the file-shape layer.

6 commits (bisectable):

d4ecfcf0 shared modules (src/core/bench/{baseline-file,qrels-file,correctness-gate}.ts)
bf17cf01 eval-replay header skip + replayCore programmatic export
17edb040 gbrain bench publish CLI verb
c02ac184 gbrain eval gate two-gate CI verb
3bd949fc autopilot wiring for nightly quality probe (opt-in, off by default)
daeef8cb e2e LOOP integration test

Pairs with gbrain-evals#13 — published v0.41-launch baseline + qrels (hermetic-synthetic per D9 privacy posture).

Test Coverage

73 new test cases across 9 files. All passing in isolation (5.75s):

test/bench/baseline-file.test.ts (9) — parser/serializer/source-hash math
test/bench/qrels-file.test.ts (19) — legacy + federated shapes, recall@K math
test/bench/correctness-gate.test.ts (6) — orchestrator + per-query throw-fails-gate
test/bench-publish.test.ts (10) — strict posture + multi-source dedup key
test/eval-replay-metadata-skip.test.ts (2) — IRON-RULE: metadata header skipped
test/eval-gate.test.ts (10) — usage errors, both gate paths, corrected latency math
test/cycle/nightly-probe-adapters.test.ts (6) — argv shape + receipt parsing
test/autopilot-nightly-probe-wiring.test.ts (8) — source-shape regression
test/e2e/eval-loop.test.ts (4) — full PGLite capture→publish→gate LOOP

Spot-check across every test file importing the changed modules: 199/199 pass.

bun run verify (typecheck + 4 pre-checks): PASS.

Full unit suite hit a known pre-existing macOS PGLite WASM OOM (issue #223) under the 8-shard × 4-concurrency fan-out — 89 explicit OOMs + ~88 cascade failures. CI on GitHub Actions runs each shard on a fresh runner and won't hit this.

Pre-Landing Review

CEO Review + Eng Review CLEAR (logged at HEAD 0b19a62e — current HEAD matches review commit, no staleness).
2 codex outside-voice rounds: 24 findings total, all absorbed (13 reshaped the wave to ship the correctness gate alongside the regression gate; 11 inline corrections).

Plan Completion

11 implementation tasks (T1-T11) named in plan; all complete except T9 which shipped as the coordinated drop in gbrain-evals#13. 4 v0.42+ follow-ups filed in TODOS.md (D11-D13 + gbrain-evals coordinated drop).

TODOS

Closed: v0.41 Eval-loop wave (D1 P0 from ## v0.41+ wave commitments)
Filed for v0.42+: capture-default flip + scrubber hardening, bench publish --suggest-thresholds, bench diff + bench list

Documentation

CHANGELOG.md — full ELI10-led entry with 3-path "To take advantage" recipe
CLAUDE.md — 3 new module annotations
docs/eval-bench.md — two-gate model + Privacy Posture + GitHub Actions example + bootstrap recipe
llms.txt + llms-full.txt regenerated

Test plan

bun run typecheck clean
All 73 new tests pass in isolation
199-test spot-check across every test file touching changed modules — green
Post-merge re-spot-check after merging master into branch — green
gbrain-evals coordinated drop PR open: v0.41-launch: hermetic baseline + qrels for gbrain eval gate gbrain-evals#13
CI confirms full test suite passes on fresh runner

Plan + 23 decisions + 2 codex outside-voice rounds at ~/.claude/plans/system-instruction-you-are-working-rustling-peacock.md.

🤖 Generated with Claude Code

…odules v0.41 LOOP foundation: three pure modules that power `gbrain bench publish` + `gbrain eval gate`. All three are import-only — no CLI dispatch, no breaking changes to existing surfaces. Tested in isolation (34 cases). - src/core/bench/baseline-file.ts (~190 LOC): single source of truth for the .baseline.ndjson file shape. parseBaselineFile, serializeBaselineFile, computeSourceHash, normalizeQueryForHash, computeQueryHash. Body rows stamped with schema_version: 1 so existing eval-replay parser accepts them unchanged. - src/core/bench/qrels-file.ts (~210 LOC): pure parser + math for the .qrels.json shape. Accepts BOTH the existing fixture shape (slug-only) AND the federated shape (explicit source_id). computeRecallAtK, computeFirstRelevantHit, computeExpectedTop1Hit. Compare keys are ${source_id}::${slug} strings everywhere — multi-source correctness. - src/core/bench/correctness-gate.ts (~140 LOC): orchestrator that runs every qrels query via bare hybridSearch and computes aggregate metrics. Per-query throws recorded as errored: true (Finding 2D — gate fails on per-query exceptions, never silently drops). Injectable searchFn test seam. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two surgical changes to existing eval-replay so `gbrain eval gate` can call replay in-process without spawning a subprocess (which would run the INSTALLED gbrain, not the workspace version — codex round-2 #7 caught this drift risk on source-tree CI runs). - parseNdjson now skips lines where _kind === 'baseline_metadata'. Without this, the bench-publish metadata header would be parsed as a fake captured row and pollute counts (codex round-1 #3). - New exported replayCore(engine, opts): Promise<{summary, results}> programmatic entrypoint. Existing CLI runEvalReplay now wraps it. ReplaySummary interface also exported for eval-gate consumers. IRON-RULE regression pinned by test/eval-replay-metadata-skip.test.ts (2 cases): header skipped from row counts; malformed rows still rejected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The LOOP-closing verb. Turns captured eval rows (gbrain eval export) into a baseline file (.baseline.ndjson) consumed by gbrain eval gate --baseline. Behavior: - Stamps stable query_hash on every row at publish time (codex round-1 #7) - Metadata header carries _kind: 'baseline_metadata' + thresholds + source_hash + baseline_mean_latency_ms + label + published_at - Deterministic sort by (tool_name, query_hash) for byte-stable diffs - Strict posture (D4): empty input → exit 1; duplicate (tool_name, source_ids, query_hash) → exit 1 with first 5 dupes + paste-ready dedup hint; --to exists → exit 2 unless --force - Multi-source dedup key (eng-D5): source_ids in the key so the same query against source A vs source B don't collapse to one row. Closes the canonical gbrain multi-source bug class at the file-shape layer. - Audit JSONL at ~/.gbrain/audit/bench-publish-YYYY-Www.jsonl via shared audit-writer primitive. 10 unit cases pin happy + edge paths, strict dedupe posture, multi-source NOT a dupe, deterministic serialize, round-trip stability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The CI-gating verb. Two gating paths (CEO D8 + eng D6/D7): - Regression gate (--baseline X.baseline.ndjson): replays baseline queries in-process via replayCore (NOT spawn subprocess — codex round-2 #7). Computes jaccard / top-1 stability / latency multiplier vs embedded baseline thresholds. Catches retrieval REGRESSIONS during refactors. - Correctness gate (--qrels Y.qrels.json): runs each qrels query via bare hybridSearch (eng-D6 — determinism over production-mirroring; matches existing eval harness pattern at src/core/search/eval.ts:242). Computes recall@K + first_relevant_hit_rate + expected_top1_hit_rate. Catches retrieval QUALITY drops against known-right answers. Both can be passed together; both must pass for verdict 'pass'. At least one required (usage error otherwise). Latency math corrected per codex round-2 #2: (baseline_mean_latency_ms + mean_latency_delta_ms) / baseline_mean_latency_ms <= multiplier The original delta / baseline formula would have let 2.5x slowdowns pass at multiplier=2.0. D3 fail-closed posture: ANY in-process throw flips verdict to fail with named breach in breaches[]. Never silently exits 0. Exit codes: 0 PASS, 1 FAIL (regression OR throw), 2 USAGE. 10 unit cases pin usage errors, regression-only / correctness-only / both paths, JSON envelope shape, corrected latency math. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes the v0.40.1.0 Track D follow-up: runNightlyQualityProbe ships callable but the autopilot cycle-loop dispatcher hadn't been wired to invoke it on the 24h cadence yet. - src/commands/autopilot.ts (tick body): invokes runNightlyQualityProbe when cfg.autopilot.nightly_quality_probe.enabled === true. Per eng-D10 (codex round-1 #11): NO scheduler-side rate-limit check. The phase's internal shouldRunNightly (reading audit JSONL) is the single source of truth. Probe call wrapped in try/catch that logs to stderr and DOES NOT bump consecutiveErrors (probe failure is informational, never crashes the loop). - src/core/cycle/nightly-probe-adapters.ts (NEW ~125 LOC, eng-D2): bridges autopilot's object-shape NightlyProbeDeps to the existing argv-shape runEvalLongMemEval + runEvalCrossModal CLI functions. Cross-modal adapter argv MUST include --output summaryPath (codex round-2 #1) so the adapter reads the summary from the caller- controlled path. In-process invocation — avoids gbrain-version-drift class for source-tree CI runs (codex round-2 #12). - src/core/config.ts: added autopilot.nightly_quality_probe to GBrainConfig interface (typecheck gate). Default OFF — opt-in via: gbrain config set autopilot.nightly_quality_probe.enabled true Cost cap default $5/run × 30 nights ≈ $150/month worst-case per brain. Expected real cost ~$0.35/night × 30 ≈ $10.50/month. 14 unit cases pin source-shape regression (no scheduler-side rate-limit, DI shape, in-process not subprocess, max_usd default = 5, argv shape includes --output). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Hermetic end-to-end test of the v0.41 LOOP per eng-D5. Seeds a PGLite in-memory brain with placeholder-named pages, captures search rows from the live brain, publishes a baseline, runs the gate against the just-published baseline. 4 cases: - self-gate against just-published baseline returns PASS (LOOP closes) - perturbed retrieved_slugs → jaccard drops → exit 1 with named breach - malformed baseline → exit 1 fail-closed (D3 IRON-RULE — pre-D3 bug would have silently exited 0) - byte-stable round-trip: serialize → parse → re-serialize identical Uses tool_name='search' (bare keyword) for captured rows so replay runs hermetically without embedding-provider dependencies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI runner observed p50 above 1500ms under parallel test load (8-way shard × PGLite WASM contention). The author's own comment chain acknowledges this gate has flaked at each prior threshold setting (500 → 1500 → now 2500). 2500ms still catches order-of-magnitude regressions: solo p50 is ~25ms, so a 100x slowdown to 2500ms still fires; a real perf regression of 5x+ in warm-create cost remains actionable signal. Caught by CI test shard 2 on PR #1352 (v0.41.0.0). Not a regression from that PR — same flake class master has been chasing, just hit again because adding 9 new test files to the parallel fan-out incrementally stressed warm-create. Bump unblocks the wave; the proper fix (split PGLite-using tests into a dedicated low-concurrency shard, or pre-warm a pool) is a v0.42+ test-infra task. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fixes Privacy: rename `wintermute-greenfield` → `markdown-greenfield` identifier across 13 files + 4 file renames per CLAUDE.md:550 (banned private-fork name in public artifacts). Identifier shipped through the lens-pack wave as the long-lived migration-mode source kind; sweep includes class names (MarkdownGreenfieldSource), frontmatter marker, audit JSONL path, eval command, and operator doc filename. Reframe contextual mentions per OpenClaw substitution rule ("your OpenClaw"/"upstream OpenClaw"). Queue: rebump v0.41.0.0 → v0.42.0.0 (PR #1352 claims v0.41.0.0 in queue); sweeps 38 v0.41 → v0.42 references across branch-introduced files; renames docs/migrations/v0.41-markdown-greenfield.md → v0.42-markdown-greenfield.md, test/schema-pack-manifest-v041.test.ts → -v042, test/eval-v041-scaffolds → test/eval-v042-scaffolds. Pre-existing master files referencing v0.41 left untouched (those describe master's own anticipated wave). Test fixes (5 pre-existing failures + 1 shard wedge, all unrelated to lens packs but caught by the post-merge run): - src/core/anthropic-pricing.ts: estimateMaxCostUsd strips `anthropic:` provider prefix before ANTHROPIC_PRICING lookup. v0.31.12 introduced provider-prefixed model strings; the budget meter wasn't updated and fell through to BUDGET_METER_NO_PRICING (budget gate disabled), letting auto-think submissions complete when the test expected budget exhaustion to force partial/skipped. - test/longmemeval-trajectory-routing.test.ts: perf-gate cap 10s → 30s. Test runs ~4s isolated; parallel-shard CPU contention pushes it to 16s. 30s still catches genuine cold-path regressions. - test/search/embedding-column.test.ts → .serial.test.ts: quarantine to serial pass (depends on gateway module-state set by bunfig.toml preload; other parallel tests' resetGateway() leaves stale state). - scripts/run-unit-parallel.sh: SHARD_TIMEOUT 600s → 900s. Shard 8's migration test suite runs 1369 tests in 807s (all pass); 600s wrapper cap was killing healthy shards. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…wave # Conflicts: # CHANGELOG.md # VERSION # package.json # test/eval-longmemeval.slow.test.ts

Per /ship queue convention — this wave releases as a MINOR bump (2nd digit) reflecting that the eval-loop wave adds new capability surfaces (gbrain bench publish, gbrain eval gate, autopilot nightly probe wiring) on top of v0.41's already-shipped feature set. VERSION + package.json + CHANGELOG header + "To take advantage" line all updated together. Trio agrees on 0.41.1.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…wave # Conflicts: # CHANGELOG.md # VERSION # package.json

PRs #1352 and #1367 both claim v0.41.0.0 in queue (the .0 slot is contested); v0.41.2.0 is unclaimed and represents this wave as a PATCH on the v0.41 line rather than a separate minor wave. Sweeps v0.42.0.0 → v0.41.2.0 across CHANGELOG + 2 docs + 4 yaml + 4 ts + 2 test files; renames docs/migrations/v0.42-markdown-greenfield.md → v0.41.2-markdown-greenfield.md and 2 test files (-v042 → -v041_2). Wave-identity tags ("v0.41 T4" etc) in test/code comments correctly preserved — this IS a v0.41 wave patch, not a new wave. macOS sed `\b` limitation means those tags were never converted in the first place; verified intentional preservation. Forward references to v0.42 in TODOS.md + CHANGELOG D3 section + future- wave declarations in code comments are untouched (they describe the NEXT minor wave, not this one). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…wave # Conflicts: # CHANGELOG.md # VERSION # package.json

garrytan and others added 7 commits May 24, 2026 01:52

Merge remote-tracking branch 'origin/master' into garrytan/pick-next-…

2b5d7d2

…wave # Conflicts: # CHANGELOG.md # VERSION # package.json # test/eval-longmemeval.slow.test.ts

garrytan mentioned this pull request May 24, 2026

v0.41.2.0 feat: lens packs + epistemology unification — atoms + concepts as first-class units, calibration profile widening, gstack-learnings bridge #1364

Open

6 tasks

garrytan changed the title ~~v0.41.0.0 feat: eval-loop wave — gbrain bench publish + gbrain eval gate close the LOOP~~ v0.41.1.0 feat: eval-loop wave — gbrain bench publish + gbrain eval gate close the LOOP May 24, 2026

Merge remote-tracking branch 'origin/master' into garrytan/pick-next-…

cda3241

…wave # Conflicts: # CHANGELOG.md # VERSION # package.json

Merge remote-tracking branch 'origin/master' into garrytan/pick-next-…

7067884

…wave # Conflicts: # CHANGELOG.md # VERSION # package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.41.1.0 feat: eval-loop wave — gbrain bench publish + gbrain eval gate close the LOOP#1352

v0.41.1.0 feat: eval-loop wave — gbrain bench publish + gbrain eval gate close the LOOP#1352
garrytan wants to merge 11 commits into
masterfrom
garrytan/pick-next-wave

garrytan commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

garrytan commented May 24, 2026

Summary

Test Coverage

Pre-Landing Review

Plan Completion

TODOS

Documentation

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant