v0.40.10.0 feat: content sanity defense — junk-pattern throw + oversize-skip-embed by garrytan · Pull Request #1351 · garrytan/gbrain

garrytan · 2026-05-24T08:46:43Z

Summary

Your brain stops accepting junk pages, and oversize content stops crashing the embedder. Six commits implement a layered content-sanity defense at the narrow waist of ingestion:

feat: add content-sanity assessor + embed-skip helper + audit JSONL primitives — 4 new pure modules (content-sanity.ts, content-sanity-literals.ts, embed-skip.ts, audit/content-sanity-audit.ts).
feat(embed): apply embed-skip filter at all 5 stale-chunk sites — Postgres + PGLite engines, embed CLI --stale AND --all paths, Minion helper. Single source of truth via embed-skip.ts.
feat(ingest): wire content-sanity gate into importFromContent narrow waist — throws ContentSanityBlockError on hard-block (junk match), sets frontmatter.embed_skip + deletes existing chunks on soft-block (oversize alone). Exit-code wire-up at cli.ts so gbrain import actually fails on bad content (Codex r2 #3).
feat: lint rules + doctor checks + 'gbrain sources audit' CLI — 2 new lint rules + 3 new doctor checks + new gbrain sources audit <id> dry-run subcommand.
chore: bump version and changelog (v0.40.9.0) — VERSION 0.40.8.1 → 0.40.9.0 + CHANGELOG entry + 9 TODOS entries for deferred v0.41+ follow-ups.
docs: update CLAUDE.md for v0.40.9.0 content-sanity wave — Key Files entries for all 4 new modules + extension blocks on touched files; llms-full.txt regenerated.
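
The hard-block versus soft-block split described in the commits above can be sketched roughly as follows. This is a minimal illustration, not the code in src/core/content-sanity.ts: only `ContentSanityBlockError` and the `PAGE_JUNK_PATTERN` code come from this PR, while `gatePage`, `assess`, the two example patterns, and the threshold values are hypothetical stand-ins.

```typescript
// Sketch only. ContentSanityBlockError and PAGE_JUNK_PATTERN are named in the PR;
// every other name, pattern, and threshold below is an assumption for illustration.
class ContentSanityBlockError extends Error {
  readonly pattern: string;
  constructor(pattern: string) {
    super('PAGE_JUNK_PATTERN: content matched junk pattern "' + pattern + '"');
    this.name = "ContentSanityBlockError";
    this.pattern = pattern;
  }
}

type Verdict =
  | { kind: "ok"; bytes: number }
  | { kind: "warn"; bytes: number }
  | { kind: "soft-block"; bytes: number }
  | { kind: "hard-block"; pattern: string };

interface Limits { bytesWarn: number; bytesBlock: number }
interface ParsedPage { title: string; body: string }

// Two illustrative patterns; the real assessor ships six hand-vetted ones.
const EXAMPLE_PATTERNS: Array<{ name: string; re: RegExp }> = [
  { name: "just-a-moment", re: /just a moment\.\.\./i },
  { name: "access-denied", re: /^access denied$/im },
];

function assess(page: ParsedPage, limits: Limits): Verdict {
  const haystack = page.title + "\n" + page.body;
  for (const p of EXAMPLE_PATTERNS) {
    if (p.re.test(haystack)) return { kind: "hard-block", pattern: p.name };
  }
  // Measure the parsed body in UTF-8 bytes, not characters.
  const bytes = new TextEncoder().encode(page.body).length;
  if (bytes > limits.bytesBlock) return { kind: "soft-block", bytes };
  if (bytes > limits.bytesWarn) return { kind: "warn", bytes };
  return { kind: "ok", bytes };
}

// Junk throws (the page never lands); oversize alone lands with embedding skipped.
function gatePage(page: ParsedPage, limits: Limits): { embedSkip: boolean } {
  const verdict = assess(page, limits);
  if (verdict.kind === "hard-block") {
    throw new ContentSanityBlockError(verdict.pattern);
  }
  return { embedSkip: verdict.kind === "soft-block" };
}
```

Because the junk case is an exception rather than a status value, any caller that already has a catch block sees it without new result fields, which is the property the ingest commit relies on.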

Test Coverage

99 new unit tests across 6 files (207 assertions):

test/content-sanity.test.ts             32 cases   ★★★ all built-in patterns, bytes parity, ContentSanityBlockError
test/content-sanity-literals.test.ts    13 cases   ★★★ ENOENT, comments, directives, regex meta-chars stay literal
test/embed-skip.test.ts                 20 cases   ★★★ JS predicate + SQL fragment semantics
test/audit/content-sanity-audit.test.ts 12 cases   ★★★ JSONL writer + reader + classify + summarize
test/lint-content-sanity.test.ts        13 cases   ★★★ huge-page + scraper-junk rules + bytes parity
test/import-file-content-sanity.test.ts  9 cases   ★★★ PGLite-backed hard-block throw + soft-block transition

COVERAGE: 99/99 pass (100%)  |  ★★★:6 files  |  GAPS: 0 (deferred E2E pins filed in TODOS for v0.41+)

136 regression tests on edited surfaces (lint-frontmatter, import-file, embed-stale, sync-failures, doctor) pass in isolation. bun run verify exit 0. bun run typecheck clean.

Tests: 0 new test files in v0.40.8.1 → 6 new test files in v0.40.9.0 (+99 tests).

Pre-Landing Review

The plan went through full /plan-ceo-review + /plan-eng-review, each with Codex round 1 + 2 outside-voice review. Both reviews CLEARED.

CEO review: 5 cherry-picks surfaced, 3 accepted, 2 deferred to v0.41+ post-Codex round 1. 16 decisions (D1-D16).
Eng review: 4 architectural decisions (D1-D4) + 4 strategic Codex round 2 tensions resolved (D6-D9). One outside-voice round 2 vote (D5).
Codex caught one load-bearing bug class during planning (importFromContent status-vocabulary mismatch) and one during implementation (kill-switch missed on bare installs with no ~/.gbrain/config.json). Both fixed before commit.

Plan + decision provenance: ~/.claude/plans/system-instruction-you-are-working-temporal-brook.md.
CEO plan: ~/.gstack/projects/oslo-v2/ceo-plans/2026-05-23-content-sanity.md.

Plan Completion

All 13 implementation tasks complete (12 P1 + 1 P2 batch). All 9 architectural decisions from eng review applied to code. All 9 deferred items filed as v0.41+ TODOs.

TODOS

9 new entries under "v0.41 content-sanity follow-ups" — chunk-level embed-quarantine (the deferred page-level granularity rethink), source-repo remediation CLI (gbrain sources prune-junk), threshold validation post-deploy on real corpora, brain-score no_junk_pages_score component, pages soft-delete CLI, post-v0.45 operator-regex + HTML-density features (each needs a real ReDoS / code-fence-handling story before it's worth shipping), bytes-parity E2E, 5-path narrow-waist E2E pin tests + doctor integration tests.

Documentation

CLAUDE.md — extended with v0.40.9.0 Key Files entries for the content-sanity defense wave: new modules (src/core/content-sanity.ts, src/core/content-sanity-literals.ts, src/core/embed-skip.ts, src/core/audit/content-sanity-audit.ts) and extension blocks on src/commands/sources.ts (new audit <id> subcommand), src/commands/doctor.ts (3 new checks: oversized_pages, scraper_junk_pages, content_sanity_audit_recent), src/commands/lint.ts (2 new rules: huge-page, scraper-junk), src/commands/embed.ts (5-site embed-skip filter), and src/core/import-file.ts (narrow-waist gate with ContentSanityBlockError).
llms-full.txt — regenerated to absorb the CLAUDE.md edits (per the project's mandatory regen rule after any CLAUDE.md change).
CHANGELOG.md / TODOS.md — covered by the version-bump commit (d33a2b6).
README.md / AGENTS.md — not touched. These are marketing / install-protocol surfaces that don't enumerate per-release Key Files. v0.40.9.0 is a defensive hardening release with no new install path or top-line capability that belongs on the README.

Test plan

All 99 new unit tests pass (bun test test/content-sanity*.test.ts test/embed-skip.test.ts test/audit/content-sanity-audit.test.ts test/lint-content-sanity.test.ts test/import-file-content-sanity.test.ts)
136 regression tests on edited surfaces pass in isolation
`bun run verify` clean (7 pre-checks + typecheck)
`bun run typecheck` clean
Full `bun run test` parallel suite returns exit 0
Optional post-merge: run `gbrain doctor --content-audit` on a real brain with junk pages to validate the threshold defaults against real-world distribution (filed as v0.41+ TODO)

🤖 Generated with Claude Code

…rimitives

Four new core modules (pure, no engine I/O):

- src/core/content-sanity.ts — assessor with 6 hand-vetted junk patterns (Cloudflare attention-required, just-a-moment, ray-id; access-denied; captcha-required; bare error-page titles). Bytes measured against compiled_truth + timeline (parseMarkdown body split, not file bytes). ContentSanityBlockError tagged with PAGE_JUNK_PATTERN code so classifyErrorCode hits via regex without a new ImportResult field.
- src/core/content-sanity-literals.ts — operator literal-substring loader for ~/.gbrain/junk-substrings.txt. Comment directives for name + applies_to. ENOENT returns empty list (fail-soft); no regex parsing so no ReDoS surface.
- src/core/embed-skip.ts — single source of truth for the embed-skip predicate. JS isEmbedSkipped() + filterOutEmbedSkipped() for in-memory callers; EMBED_SKIP_FILTER_FRAGMENT raw SQL string for engine-layer filters. buildEmbedSkipMarker() emits the canonical frontmatter shape. Both Postgres and PGLite use the same JSONB '?' existence operator.
- src/core/audit/content-sanity-audit.ts — ISO-week JSONL at ~/.gbrain/audit/content-sanity-YYYY-Www.jsonl. Built on v0.40.4.0 audit-writer primitive. One stream for hard-block + soft-block + warn events with event_type discriminator. summarizeContentSanityEvents rolls up by type + source + pattern hits for doctor consumption.

99 unit tests across 4 new test files (207 assertions) covering boundaries, every built-in pattern, bytes-parity assertion, operator literals (regex meta-chars stay literal), audit JSONL round-trip + reader.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
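
The ISO-week file naming for the audit stream (content-sanity-YYYY-Www.jsonl) can be illustrated with a small sketch. The helper names `isoWeek`, `auditFileName`, and `auditLine` are hypothetical, the choice of UTC is an assumption (the PR does not say which timezone the writer uses), and the `ts` and `slug` fields are invented for the example; only the file-name pattern and the `event_type` discriminator come from the commit message.

```typescript
// Sketch only: helper names, the UTC choice, and the ts/slug fields are assumptions.

// ISO 8601 week: weeks start on Monday, and a week belongs to the year
// that contains its Thursday.
function isoWeek(date: Date): { year: number; week: number } {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const dayNum = d.getUTCDay() || 7; // Monday = 1 ... Sunday = 7
  d.setUTCDate(d.getUTCDate() + 4 - dayNum); // move to the Thursday of this week
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const dayOfYear = (d.getTime() - yearStart) / 86400000 + 1;
  return { year: d.getUTCFullYear(), week: Math.ceil(dayOfYear / 7) };
}

function auditFileName(date: Date): string {
  const { year, week } = isoWeek(date);
  const ww = week < 10 ? "0" + week : String(week);
  return "content-sanity-" + year + "-W" + ww + ".jsonl";
}

type AuditEventType = "hard-block" | "soft-block" | "warn";

// One JSON object per line; event_type discriminates the three kinds of event.
function auditLine(eventType: AuditEventType, slug: string, now: Date): string {
  return JSON.stringify({ ts: now.toISOString(), event_type: eventType, slug }) + "\n";
}
```

Under these assumptions, an event recorded on the day this PR merged (Sunday 2026-05-24) would land in content-sanity-2026-W21.jsonl, and an event on 2021-01-03 would land in the 2020-W53 file because that week's Thursday falls in 2020.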

Embed sweep must skip pages with frontmatter.embed_skip set so soft-blocked pages don't get re-embedded. Five wiring sites all use the shared helper:

1. src/commands/embed.ts — --stale CLI path (delegates to embedAllStale)
2. src/commands/embed.ts — --all CLI path (JS-side filterOutEmbedSkipped on the listPages result; Codex r2 #11 caught this previously-missed surface that re-embedded soft-blocked pages on every model swap)
3. src/core/embed-stale.ts:90 — Minion helper (inherits via engine)
4. src/core/postgres-engine.ts — listStaleChunks + countStaleChunks gain 'NOT (COALESCE(p.frontmatter, ''{}''::jsonb) ? ''embed_skip'')' filter at the SQL layer. Always JOINs pages now (pre-fix bare path skipped the JOIN; D4 + D8 require it for the filter).
5. src/core/pglite-engine.ts — mirror of postgres-engine; PGLite is Postgres 17.5 in WASM so the same JSONB '?' operator works.

Cross-site invariant pinned by test/embed-skip.test.ts (20 cases on the JS predicate + SQL fragment semantics). When v0.41+ promotes embed_skip to a schema column, all 5 sites get updated in one helper file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
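
A rough sketch of what a shared embed-skip helper along these lines could look like. The exported names (`isEmbedSkipped`, `filterOutEmbedSkipped`, `EMBED_SKIP_FILTER_FRAGMENT`, `buildEmbedSkipMarker`) are the ones listed in this PR, but the bodies, the `PageLike` type, and the marker's `reason` field are assumptions; the real frontmatter shape is not shown in the PR.

```typescript
// Sketch only: function bodies, PageLike, and the marker's fields are assumptions.
type Frontmatter = Record<string, unknown>;
interface PageLike { slug: string; frontmatter?: Frontmatter | null }

// SQL side: the JSONB `?` operator tests for a top-level key, whatever its value.
// The fragment assumes the pages table is aliased as `p` in the surrounding query.
const EMBED_SKIP_FILTER_FRAGMENT =
  "NOT (COALESCE(p.frontmatter, '{}'::jsonb) ? 'embed_skip')";

// JS side: mirror the SQL semantics by checking key existence, not truthiness,
// so `embed_skip: false` and `embed_skip: null` still count as "skipped".
function isEmbedSkipped(frontmatter: Frontmatter | null | undefined): boolean {
  return frontmatter != null &&
    Object.prototype.hasOwnProperty.call(frontmatter, "embed_skip");
}

function filterOutEmbedSkipped<T extends PageLike>(pages: T[]): T[] {
  return pages.filter((p) => !isEmbedSkipped(p.frontmatter));
}

// Hypothetical marker shape; the PR only says a canonical shape exists.
function buildEmbedSkipMarker(reason: string): Frontmatter {
  return { embed_skip: { reason } };
}

// Example of splicing the fragment into a stale-chunk query (table and column
// names other than p.frontmatter are illustrative).
const exampleQuery =
  "SELECT c.id FROM content_chunks c JOIN pages p ON p.id = c.page_id " +
  "WHERE c.embedding IS NULL AND " + EMBED_SKIP_FILTER_FRAGMENT;
```

Keeping the predicate and the SQL fragment in one file means both can be changed together if embed_skip later becomes a schema column, which is the motivation the commit gives for the single source of truth.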

…waist

Hard-block via thrown ContentSanityBlockError; soft-block via frontmatter marker + chunk deletion on transition (D9 invariant). Single throw point means every wrapper site (CLI, MCP put_page, sync) inherits correct exit/error semantics through existing exception flow — no per-wrapper status-vocabulary changes (Codex r2 #2).

import-file.ts:
- Gate runs AFTER parseMarkdown so assessor sees compiled_truth + timeline + title + frontmatter (Codex r2 #5+#7).
- Kill-switch (GBRAIN_NO_SANITY=1) checked via direct process.env AS WELL AS effective config — loadConfig() returns null on bare installs (no ~/.gbrain/config.json, no DATABASE_URL) so the config-only path missed the kill-switch. Caught by test/import-file-content-sanity.test.ts.
- Hard-block: throws ContentSanityBlockError. Existing import.ts catch increments errors; sync.ts:929 catch records failure with classified code.
- Soft-block: sets parsed.frontmatter.embed_skip via buildEmbedSkipMarker before hash compute (so hash differs from prior version → real write). Chunking block guards on isEmbedSkipped → chunks stays empty → existing tx.deleteChunks fires (D9 transition invariant).
- Audit JSONL records every assessment (hard / soft / warn + bypass-mode).

sync.ts:
- classifyErrorCode gains /PAGE_JUNK_PATTERN/ → 'PAGE_JUNK_PATTERN' regex. No PAGE_OVERSIZED code because oversize is now a soft state — page lands.

config.ts:
- New content_sanity.* field on GBrainConfig (4 keys: bytes_warn, bytes_block, junk_patterns_enabled, disabled).
- loadConfig() reads GBRAIN_PAGE_WARN_BYTES, GBRAIN_PAGE_BLOCK_BYTES, GBRAIN_NO_JUNK_PATTERNS, GBRAIN_NO_SANITY env vars sparse-merged.
- loadConfigWithEngine merges DB-plane content_sanity.* keys per-key sparse-merge so 'gbrain config set content_sanity.bytes_block N' takes effect uniformly (Codex r2 #6 D1 acceptance).
- KNOWN_CONFIG_KEYS + KNOWN_CONFIG_KEY_PREFIXES include the new keys.

cli.ts:
- runImport now honors result.errors > 0 for non-zero exit. Pre-fix the CLI awaited runImport but discarded the result, so hard-blocked imports exited 0 silently (Codex r2 #3).

9 PGLite-backed unit tests pin: hard-block throws, error message contains PAGE_JUNK_PATTERN, blocked page does NOT land in DB, soft-block writes page with embed_skip set, soft-block deletes pre-existing chunks (D9 transition), kill-switch bypass works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
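
Three of the behaviours in this commit are easy to get subtly wrong, so here is a minimal sketch of each: the kill-switch check that still works when the loaded config is null, the regex-based error classification, and the exit-code wire-up. The function names `isSanityDisabled` and `exitCodeFor`, the `"UNKNOWN"` fallback, and the type shapes are hypothetical; `GBRAIN_NO_SANITY`, `content_sanity.disabled`, `classifyErrorCode`, `PAGE_JUNK_PATTERN`, and `result.errors` are taken from the commit message.

```typescript
// Sketch only: isSanityDisabled, exitCodeFor, the "UNKNOWN" fallback, and these
// interfaces are assumptions made for illustration.
interface ContentSanityConfig {
  bytes_warn?: number;
  bytes_block?: number;
  junk_patterns_enabled?: boolean;
  disabled?: boolean;
}
interface ConfigLike { content_sanity?: ContentSanityConfig }
type Env = Record<string, string | undefined>;

// Check the environment directly first. On a bare install the config loader can
// return null, so a config-only check would silently ignore the kill-switch.
function isSanityDisabled(env: Env, config: ConfigLike | null): boolean {
  if (env["GBRAIN_NO_SANITY"] === "1") return true;
  return config !== null &&
    config.content_sanity !== undefined &&
    config.content_sanity.disabled === true;
}

// The error carries its code in the message, so classification is a regex match
// and needs no new field on the import result.
function classifyErrorCode(message: string): string {
  if (/PAGE_JUNK_PATTERN/.test(message)) return "PAGE_JUNK_PATTERN";
  return "UNKNOWN";
}

// The CLI must look at the result instead of discarding it.
function exitCodeFor(result: { errors: number }): number {
  return result.errors > 0 ? 1 : 0;
}
```

In a real CLI the first argument would be `process.env`; passing the environment in as a parameter keeps the sketch testable without touching global state.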

Three operator surfaces backed by the shared content-sanity assessor:

lint.ts (2 new rules):
- huge-page: bytes (compiled_truth + timeline post-parse) exceeds warn or block threshold. Message names the actual byte count.
- scraper-junk: built-in junk pattern OR operator literal matched.
- Lint runs parseMarkdown to extract body for bytes-parity with doctor (D2 — both surfaces measure body-only, not file-with-frontmatter).
- runLintCore resolves effective config once per run: file/env (sync via loadConfig) + DB-lift when ~/.gbrain/ is reachable (D1). CI without ~/.gbrain/ falls through immediately. Engine probe wrapped in try/catch so lint never blocks on engine state.
- Operator literals loaded once per lint run; passed through to every page's lintContent call.

doctor.ts (3 new checks + 1 flag):
- oversized_pages: indexed-free table scan via octet_length(compiled_truth) + octet_length(COALESCE(timeline, '')) (Codex r2 #13: octet_length is bytes, length is chars). Status warn on 1+ rows; oversize is now a soft state so no 'fail'.
- scraper_junk_pages: capped 1000 most-recent default + --content-audit opt-in for full scan (D10 mirrors --index-audit precedent from v0.14.3). Applies assessor per-page on title + 2KB body slice + frontmatter.
- content_sanity_audit_recent: reads ~/.gbrain/audit/content-sanity-*.jsonl for last 7 days, aggregates by event_type + source. Warn at 10+ events, fail at 100+. Doctor message names the multi-host limitation explicitly (Codex r1 #14): 'audit reflects events on this host only; multi-host operators should share GBRAIN_AUDIT_DIR'.

sources.ts (new audit subcommand):
- gbrain sources audit <id> [--json] [--include-warns]
- Reads sources.local_path, walks disk (via pruneDir for node_modules / .git / dotfiles), runs assessContentSanity per .md file.
- Reports size distribution (p50, p99, max) + would-hard-block count + would-soft-block count + junk-pattern hit map.
- Read-only: NO DB writes, NO file mutations. Operator runs this BEFORE a sync to catch junk early, or AFTER landing v0.40.9.0 to audit historical inventory.

13 unit tests on lint rules; D1 config-lift behavior pinned by lift in runLintCore + manual override via opts.contentSanity for tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
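
The measurements these surfaces report can be sketched in a few lines. The sketch shows the bytes-versus-characters distinction behind the octet_length choice, a size-distribution summary, and the warn/fail thresholds for the recent-audit check. All function names are hypothetical, and the nearest-rank percentile method is an assumption, since the PR does not say how p50 and p99 are computed.

```typescript
// Sketch only: names and the nearest-rank percentile method are assumptions.

// UTF-8 byte size of the parsed body, matching
// octet_length(compiled_truth) + octet_length(COALESCE(timeline, '')) in SQL.
function bodyBytes(compiledTruth: string, timeline: string | null): number {
  const enc = new TextEncoder();
  return enc.encode(compiledTruth).length + enc.encode(timeline === null ? "" : timeline).length;
}

// Nearest-rank percentile over a non-empty list of sizes.
function percentile(sizes: number[], p: number): number {
  const sorted = sizes.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function sizeDistribution(sizes: number[]): { p50: number; p99: number; max: number } {
  return { p50: percentile(sizes, 50), p99: percentile(sizes, 99), max: percentile(sizes, 100) };
}

// Thresholds from the commit message: warn at 10+ events, fail at 100+.
function auditStatus(eventCount: number): "ok" | "warn" | "fail" {
  if (eventCount >= 100) return "fail";
  if (eventCount >= 10) return "warn";
  return "ok";
}
```

The byte count matters for non-ASCII content: a string such as "日本語" has a JavaScript length of 3 but occupies 9 bytes in UTF-8, so a character-based check would under-report page size relative to the SQL scan.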

v0.40.9.0 — content sanity defense: junk-pattern throw + oversize-skip-embed.

Plus TODOS.md entries for the 9 deferred v0.41+ follow-ups:
- chunk-level embed-quarantine (Codex r1 #3 — page-level granularity wrong)
- source-repo remediation CLI (gbrain sources prune-junk)
- threshold validation post-deploy on real corpora
- brain-score no_junk_pages_score component
- pages soft-delete --where CLI (paired with prune-junk)
- post-v0.45 operator-regex extensibility (needs real ReDoS story)
- post-v0.45 HTML-density rule (needs fenced-code handling)
- bytes-parity E2E across lint + doctor
- 5-path narrow-waist E2E pin tests + doctor integration tests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add v0.40.9.0 Key Files entries for the content-sanity defense modules: content-sanity.ts (assessor), content-sanity-literals.ts (operator loader), embed-skip.ts (5-site shared predicate), audit/content-sanity-audit.ts (JSONL writer). Extend doctor.ts, lint.ts, embed.ts, import-file.ts, and sources.ts entries with the v0.40.9.0 surfaces (3 new doctor checks, 2 new lint rules, embed-skip filter at 5 sites, importFromContent gate, sources audit subcommand). Regenerate llms-full.txt per the CLAUDE.md edit rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #1350 also claimed v0.40.9.0. Advancing this PR to v0.40.10.0 so CI's version-gate doesn't reject on overlap. No functional change — same shipped content, just a different version slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…size-gate # Conflicts: # CHANGELOG.md # VERSION # package.json

…undary flake

PR #1351 ship CI hit a single test failure (one in 2552):

(fail) scanBrainSources partial-scan state > hanging COUNT does not exceed deadline — Promise.race timeout fires [579.01ms]

Run: https://github.com/garrytan/gbrain/actions/runs/77611667786

Cause: heavily-loaded CI runners (8 parallel shards × 4 concurrent test files = ~32 concurrent bun processes) occasionally let the setTimeout race callback resolve a microsecond BEFORE the wall-clock boundary, leaving Date.now() one tick below deadline. The post-await deadline check at brain-writer.ts:512 uses Date.now() >= deadline; on that tick the check evaluated false and scanOneSource ran src-a anyway. Test then asserted firstSource.status === 'skipped' and got 'scanned'.

Fix: add 1ms overshoot to the race-timer schedule:

setTimeout(..., remainingMs + 1)

Guarantees the timer fires past the deadline by at least one millisecond regardless of runner timer drift. Cost: 1ms additional wall-clock latency on hung COUNT queries — operationally negligible.

Verified: stress-tested 5/5 passing locally. The bug class is identical to the one the existing test comment block (lines 180-187) documents (`>=` not `>` at line 512); this +1ms is the belt to that suspenders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
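
The race-timer pattern this fix touches can be sketched as below. `raceWithDeadline` is a hypothetical helper, not the code in brain-writer.ts; only the `remainingMs + 1` schedule and the `Date.now() >= deadline` check are taken from the commit message.

```typescript
// Sketch only: raceWithDeadline and its return shape are assumptions.
// `deadline` is an absolute wall-clock time in milliseconds (as from Date.now()).
async function raceWithDeadline<T>(
  work: Promise<T>,
  deadline: number,
): Promise<{ timedOut: true } | { timedOut: false; value: T }> {
  const remainingMs = Math.max(0, deadline - Date.now());
  let cancel: () => void = () => {};
  const timeout = new Promise<{ timedOut: true }>((resolve) => {
    // The +1 overshoot is meant to ensure that, once the timer fires, a later
    // `Date.now() >= deadline` check sees the deadline as passed even if the
    // timer callback ran slightly early relative to the wall clock.
    const handle = setTimeout(() => resolve({ timedOut: true }), remainingMs + 1);
    cancel = () => clearTimeout(handle);
  });
  try {
    return await Promise.race([
      work.then((value) => ({ timedOut: false as const, value })),
      timeout,
    ]);
  } finally {
    cancel(); // do not leave the timer pending when the work wins the race
  }
}
```

A caller that wants to skip remaining work after a hang would combine the returned `timedOut` flag with its own `Date.now() >= deadline` check, so that either signal alone is enough to stop.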

garrytan and others added 7 commits May 24, 2026 01:43

garrytan changed the title ~~v0.40.9.0 feat: content sanity defense — junk-pattern throw + oversize-skip-embed~~ v0.40.10.0 feat: content sanity defense — junk-pattern throw + oversize-skip-embed May 24, 2026

garrytan and others added 2 commits May 24, 2026 09:32

Merge remote-tracking branch 'origin/master' into garrytan/lint-page-…

4fa3d71

…size-gate # Conflicts: # CHANGELOG.md # VERSION # package.json

garrytan merged commit fa2c7a6 into master May 24, 2026
8 checks passed

garrytan merged 9 commits into master from garrytan/lint-page-size-gate
