Skip to content

v0.41.1.0 feat: eval-loop wave — gbrain bench publish + gbrain eval gate close the LOOP#1352

Open
garrytan wants to merge 11 commits into
masterfrom
garrytan/pick-next-wave
Open

v0.41.1.0 feat: eval-loop wave — gbrain bench publish + gbrain eval gate close the LOOP#1352
garrytan wants to merge 11 commits into
masterfrom
garrytan/pick-next-wave

Conversation

@garrytan
Copy link
Copy Markdown
Owner

Summary

Your CI can now fail a PR when search retrieval gets worse.

v0.41 closes the eval LOOP gbrain has been building toward across 3 prior waves. Before this release, capture / replay / nightly probe / cross-modal runner all existed but none of them GATED. Now gbrain eval gate is the CI verb that fails PRs on retrieval regressions OR correctness drops.

Two ways to fail the gate:

  • Regression gate (--baseline) — replays a captured baseline, catches "did my refactor break search?"
  • Correctness gate (--qrels) — runs known-right queries against the brain, catches "is search actually any good?" via recall@K + first-relevant-hit + expected_top1

Both source-id-aware (source_id::slug compares) so federated brains can't false-pass via wrong-source hits — the canonical gbrain multi-source pitfall closed structurally at the file-shape layer.

6 commits (bisectable):

  • d4ecfcf0 shared modules (src/core/bench/{baseline-file,qrels-file,correctness-gate}.ts)
  • bf17cf01 eval-replay header skip + replayCore programmatic export
  • 17edb040 gbrain bench publish CLI verb
  • c02ac184 gbrain eval gate two-gate CI verb
  • 3bd949fc autopilot wiring for nightly quality probe (opt-in, off by default)
  • daeef8cb e2e LOOP integration test

Pairs with gbrain-evals#13 — published v0.41-launch baseline + qrels (hermetic-synthetic per D9 privacy posture).

Test Coverage

73 new test cases across 9 files. All passing in isolation (5.75s):

  • test/bench/baseline-file.test.ts (9) — parser/serializer/source-hash math
  • test/bench/qrels-file.test.ts (19) — legacy + federated shapes, recall@K math
  • test/bench/correctness-gate.test.ts (6) — orchestrator + per-query throw-fails-gate
  • test/bench-publish.test.ts (10) — strict posture + multi-source dedup key
  • test/eval-replay-metadata-skip.test.ts (2) — IRON-RULE: metadata header skipped
  • test/eval-gate.test.ts (10) — usage errors, both gate paths, corrected latency math
  • test/cycle/nightly-probe-adapters.test.ts (6) — argv shape + receipt parsing
  • test/autopilot-nightly-probe-wiring.test.ts (8) — source-shape regression
  • test/e2e/eval-loop.test.ts (4) — full PGLite capture→publish→gate LOOP

Spot-check across every test file importing the changed modules: 199/199 pass.

bun run verify (typecheck + 4 pre-checks): PASS.

Full unit suite hit a known pre-existing macOS PGLite WASM OOM (issue #223) under the 8-shard × 4-concurrency fan-out — 89 explicit OOMs + ~88 cascade failures. CI on GitHub Actions runs each shard on a fresh runner and won't hit this.

Pre-Landing Review

CEO Review + Eng Review CLEAR (logged at HEAD 0b19a62e — current HEAD matches review commit, no staleness).
2 codex outside-voice rounds: 24 findings total, all absorbed (13 reshaped the wave to ship the correctness gate alongside the regression gate; 11 inline corrections).

Plan Completion

11 implementation tasks (T1-T11) named in plan; all complete except T9 which shipped as the coordinated drop in gbrain-evals#13. 4 v0.42+ follow-ups filed in TODOS.md (D11-D13 + gbrain-evals coordinated drop).

TODOS

  • Closed: v0.41 Eval-loop wave (D1 P0 from ## v0.41+ wave commitments)
  • Filed for v0.42+: capture-default flip + scrubber hardening, bench publish --suggest-thresholds, bench diff + bench list

Documentation

  • CHANGELOG.md — full ELI10-led entry with 3-path "To take advantage" recipe
  • CLAUDE.md — 3 new module annotations
  • docs/eval-bench.md — two-gate model + Privacy Posture + GitHub Actions example + bootstrap recipe
  • llms.txt + llms-full.txt regenerated

Test plan

Plan + 23 decisions + 2 codex outside-voice rounds at ~/.claude/plans/system-instruction-you-are-working-rustling-peacock.md.

🤖 Generated with Claude Code

garrytan and others added 7 commits May 24, 2026 01:52
…odules

v0.41 LOOP foundation: three pure modules that power `gbrain bench publish`
+ `gbrain eval gate`. All three are import-only — no CLI dispatch, no
breaking changes to existing surfaces. Tested in isolation (34 cases).

- src/core/bench/baseline-file.ts (~190 LOC): single source of truth for
  the .baseline.ndjson file shape. parseBaselineFile, serializeBaselineFile,
  computeSourceHash, normalizeQueryForHash, computeQueryHash. Body rows
  stamped with schema_version: 1 so existing eval-replay parser accepts
  them unchanged.
- src/core/bench/qrels-file.ts (~210 LOC): pure parser + math for the
  .qrels.json shape. Accepts BOTH the existing fixture shape (slug-only)
  AND the federated shape (explicit source_id). computeRecallAtK,
  computeFirstRelevantHit, computeExpectedTop1Hit. Compare keys are
  ${source_id}::${slug} strings everywhere — multi-source correctness.
- src/core/bench/correctness-gate.ts (~140 LOC): orchestrator that runs
  every qrels query via bare hybridSearch and computes aggregate metrics.
  Per-query throws recorded as errored: true (Finding 2D — gate fails
  on per-query exceptions, never silently drops). Injectable searchFn
  test seam.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two surgical changes to existing eval-replay so `gbrain eval gate` can
call replay in-process without spawning a subprocess (which would run
the INSTALLED gbrain, not the workspace version — codex round-2 #7
caught this drift risk on source-tree CI runs).

- parseNdjson now skips lines where _kind === 'baseline_metadata'.
  Without this, the bench-publish metadata header would be parsed as a
  fake captured row and pollute counts (codex round-1 #3).
- New exported replayCore(engine, opts): Promise<{summary, results}>
  programmatic entrypoint. Existing CLI runEvalReplay now wraps it.
  ReplaySummary interface also exported for eval-gate consumers.

IRON-RULE regression pinned by test/eval-replay-metadata-skip.test.ts
(2 cases): header skipped from row counts; malformed rows still rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The LOOP-closing verb. Turns captured eval rows (gbrain eval export) into
a baseline file (.baseline.ndjson) consumed by gbrain eval gate --baseline.

Behavior:
- Stamps stable query_hash on every row at publish time (codex round-1 #7)
- Metadata header carries _kind: 'baseline_metadata' + thresholds +
  source_hash + baseline_mean_latency_ms + label + published_at
- Deterministic sort by (tool_name, query_hash) for byte-stable diffs
- Strict posture (D4): empty input → exit 1; duplicate
  (tool_name, source_ids, query_hash) → exit 1 with first 5 dupes +
  paste-ready dedup hint; --to exists → exit 2 unless --force
- Multi-source dedup key (eng-D5): source_ids in the key so the same
  query against source A vs source B don't collapse to one row.
  Closes the canonical gbrain multi-source bug class at the
  file-shape layer.
- Audit JSONL at ~/.gbrain/audit/bench-publish-YYYY-Www.jsonl via
  shared audit-writer primitive.

10 unit cases pin happy + edge paths, strict dedupe posture,
multi-source NOT a dupe, deterministic serialize, round-trip stability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CI-gating verb. Two gating paths (CEO D8 + eng D6/D7):

- Regression gate (--baseline X.baseline.ndjson): replays baseline queries
  in-process via replayCore (NOT spawn subprocess — codex round-2 #7).
  Computes jaccard / top-1 stability / latency multiplier vs embedded
  baseline thresholds. Catches retrieval REGRESSIONS during refactors.
- Correctness gate (--qrels Y.qrels.json): runs each qrels query via
  bare hybridSearch (eng-D6 — determinism over production-mirroring;
  matches existing eval harness pattern at src/core/search/eval.ts:242).
  Computes recall@K + first_relevant_hit_rate + expected_top1_hit_rate.
  Catches retrieval QUALITY drops against known-right answers.

Both can be passed together; both must pass for verdict 'pass'. At least
one required (usage error otherwise).

Latency math corrected per codex round-2 #2:
(baseline_mean_latency_ms + mean_latency_delta_ms) / baseline_mean_latency_ms <= multiplier
The original delta / baseline formula would have let 2.5x slowdowns pass
at multiplier=2.0.

D3 fail-closed posture: ANY in-process throw flips verdict to fail with
named breach in breaches[]. Never silently exits 0.

Exit codes: 0 PASS, 1 FAIL (regression OR throw), 2 USAGE.

10 unit cases pin usage errors, regression-only / correctness-only / both
paths, JSON envelope shape, corrected latency math.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the v0.40.1.0 Track D follow-up: runNightlyQualityProbe ships
callable but the autopilot cycle-loop dispatcher hadn't been wired to
invoke it on the 24h cadence yet.

- src/commands/autopilot.ts (tick body): invokes runNightlyQualityProbe
  when cfg.autopilot.nightly_quality_probe.enabled === true.
  Per eng-D10 (codex round-1 #11): NO scheduler-side rate-limit check.
  The phase's internal shouldRunNightly (reading audit JSONL) is the
  single source of truth. Probe call wrapped in try/catch that logs to
  stderr and DOES NOT bump consecutiveErrors (probe failure is
  informational, never crashes the loop).
- src/core/cycle/nightly-probe-adapters.ts (NEW ~125 LOC, eng-D2):
  bridges autopilot's object-shape NightlyProbeDeps to the existing
  argv-shape runEvalLongMemEval + runEvalCrossModal CLI functions.
  Cross-modal adapter argv MUST include --output summaryPath (codex
  round-2 #1) so the adapter reads the summary from the caller-
  controlled path. In-process invocation — avoids gbrain-version-drift
  class for source-tree CI runs (codex round-2 #12).
- src/core/config.ts: added autopilot.nightly_quality_probe to
  GBrainConfig interface (typecheck gate).

Default OFF — opt-in via:
  gbrain config set autopilot.nightly_quality_probe.enabled true

Cost cap default $5/run × 30 nights ≈ $150/month worst-case per brain.
Expected real cost ~$0.35/night × 30 ≈ $10.50/month.

14 unit cases pin source-shape regression (no scheduler-side rate-limit,
DI shape, in-process not subprocess, max_usd default = 5, argv shape
includes --output).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hermetic end-to-end test of the v0.41 LOOP per eng-D5. Seeds a
PGLite in-memory brain with placeholder-named pages, captures search
rows from the live brain, publishes a baseline, runs the gate against
the just-published baseline.

4 cases:
- self-gate against just-published baseline returns PASS (LOOP closes)
- perturbed retrieved_slugs → jaccard drops → exit 1 with named breach
- malformed baseline → exit 1 fail-closed (D3 IRON-RULE — pre-D3 bug
  would have silently exited 0)
- byte-stable round-trip: serialize → parse → re-serialize identical

Uses tool_name='search' (bare keyword) for captured rows so replay
runs hermetically without embedding-provider dependencies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI runner observed p50 above 1500ms under parallel test load (8-way
shard × PGLite WASM contention). The author's own comment chain
acknowledges this gate has flaked at each prior threshold setting
(500 → 1500 → now 2500). 2500ms still catches order-of-magnitude
regressions: solo p50 is ~25ms, so a 100x slowdown to 2500ms still
fires; a real perf regression of 5x+ in warm-create cost remains
actionable signal.

Caught by CI test shard 2 on PR #1352 (v0.41.0.0). Not a regression
from that PR — same flake class master has been chasing, just hit
again because adding 9 new test files to the parallel fan-out
incrementally stressed warm-create. Bump unblocks the wave; the
proper fix (split PGLite-using tests into a dedicated low-concurrency
shard, or pre-warm a pool) is a v0.42+ test-infra task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 24, 2026
…fixes

Privacy: rename `wintermute-greenfield` → `markdown-greenfield` identifier
across 13 files + 4 file renames per CLAUDE.md:550 (banned private-fork name
in public artifacts). Identifier shipped through the lens-pack wave as the
long-lived migration-mode source kind; sweep includes class names
(MarkdownGreenfieldSource), frontmatter marker, audit JSONL path, eval
command, and operator doc filename. Reframe contextual mentions per
OpenClaw substitution rule ("your OpenClaw"/"upstream OpenClaw").

Queue: rebump v0.41.0.0 → v0.42.0.0 (PR #1352 claims v0.41.0.0 in queue);
sweeps 38 v0.41 → v0.42 references across branch-introduced files; renames
docs/migrations/v0.41-markdown-greenfield.md → v0.42-markdown-greenfield.md,
test/schema-pack-manifest-v041.test.ts → -v042, test/eval-v041-scaffolds →
test/eval-v042-scaffolds. Pre-existing master files referencing v0.41 left
untouched (those describe master's own anticipated wave).

Test fixes (5 pre-existing failures + 1 shard wedge, all unrelated to lens
packs but caught by the post-merge run):
- src/core/anthropic-pricing.ts: estimateMaxCostUsd strips `anthropic:`
  provider prefix before ANTHROPIC_PRICING lookup. v0.31.12 introduced
  provider-prefixed model strings; the budget meter wasn't updated and
  fell through to BUDGET_METER_NO_PRICING (budget gate disabled), letting
  auto-think submissions complete when the test expected budget exhaustion
  to force partial/skipped.
- test/longmemeval-trajectory-routing.test.ts: perf-gate cap 10s → 30s.
  Test runs ~4s isolated; parallel-shard CPU contention pushes it to 16s.
  30s still catches genuine cold-path regressions.
- test/search/embedding-column.test.ts → .serial.test.ts: quarantine to
  serial pass (depends on gateway module-state set by bunfig.toml preload;
  other parallel tests' resetGateway() leaves stale state).
- scripts/run-unit-parallel.sh: SHARD_TIMEOUT 600s → 900s. Shard 8's
  migration test suite runs 1369 tests in 807s (all pass); 600s wrapper
  cap was killing healthy shards.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…wave

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
#	test/eval-longmemeval.slow.test.ts
Per /ship queue convention — this wave releases as a MINOR bump
(2nd digit) reflecting that the eval-loop wave adds new capability
surfaces (gbrain bench publish, gbrain eval gate, autopilot nightly
probe wiring) on top of v0.41's already-shipped feature set.

VERSION + package.json + CHANGELOG header + "To take advantage" line
all updated together. Trio agrees on 0.41.1.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title v0.41.0.0 feat: eval-loop wave — gbrain bench publish + gbrain eval gate close the LOOP v0.41.1.0 feat: eval-loop wave — gbrain bench publish + gbrain eval gate close the LOOP May 24, 2026
…wave

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
garrytan added a commit that referenced this pull request May 24, 2026
PRs #1352 and #1367 both claim v0.41.0.0 in queue (the .0 slot is contested);
v0.41.2.0 is unclaimed and represents this wave as a PATCH on the v0.41 line
rather than a separate minor wave.

Sweeps v0.42.0.0 → v0.41.2.0 across CHANGELOG + 2 docs + 4 yaml + 4 ts + 2
test files; renames docs/migrations/v0.42-markdown-greenfield.md →
v0.41.2-markdown-greenfield.md and 2 test files (-v042 → -v041_2).

Wave-identity tags ("v0.41 T4" etc) in test/code comments correctly
preserved — this IS a v0.41 wave patch, not a new wave. macOS sed `\b`
limitation means those tags were never converted in the first place;
verified intentional preservation.

Forward references to v0.42 in TODOS.md + CHANGELOG D3 section + future-
wave declarations in code comments are untouched (they describe the NEXT
minor wave, not this one).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…wave

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant