v0.40.10.0 feat: content sanity defense - junk-pattern throw + oversize-skip-embed #1351
Merged
…rimitives

Four new core modules (pure, no engine I/O):
- src/core/content-sanity.ts: assessor with 6 hand-vetted junk patterns (Cloudflare attention-required, just-a-moment, ray-id; access-denied; captcha-required; bare error-page titles). Bytes measured against compiled_truth + timeline (parseMarkdown body split, not file bytes). ContentSanityBlockError tagged with PAGE_JUNK_PATTERN code so classifyErrorCode hits via regex without a new ImportResult field.
- src/core/content-sanity-literals.ts: operator literal-substring loader for ~/.gbrain/junk-substrings.txt. Comment directives for name + applies_to. ENOENT returns empty list (fail-soft); no regex parsing so no ReDoS surface.
- src/core/embed-skip.ts: single source of truth for the embed-skip predicate. JS isEmbedSkipped() + filterOutEmbedSkipped() for in-memory callers; EMBED_SKIP_FILTER_FRAGMENT raw SQL string for engine-layer filters. buildEmbedSkipMarker() emits the canonical frontmatter shape. Both Postgres and PGLite use the same JSONB '?' existence operator.
- src/core/audit/content-sanity-audit.ts: ISO-week JSONL at ~/.gbrain/audit/content-sanity-YYYY-Www.jsonl. Built on v0.40.4.0 audit-writer primitive. One stream for hard-block + soft-block + warn events with event_type discriminator. summarizeContentSanityEvents rolls up by type + source + pattern hits for doctor consumption.

99 unit tests across 4 new test files (207 assertions) covering boundaries, every built-in pattern, bytes-parity assertion, operator literals (regex meta-chars stay literal), audit JSONL round-trip + reader.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
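The ISO-week file naming used by the audit stream can be sketched as follows. This is a minimal illustration, not the code in src/core/audit/content-sanity-audit.ts: the helper names `isoWeekLabel` and `auditFileName` are hypothetical, and computing the week in UTC is an assumption the commit message does not confirm.

```typescript
// Hypothetical helpers; names and the UTC choice are assumptions.

// Returns the ISO 8601 week label, e.g. "2026-W21".
function isoWeekLabel(date: Date): string {
  // Copy the calendar day in UTC so the caller's Date is not mutated.
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  // ISO weekday: Monday = 1 ... Sunday = 7.
  const dayNum = d.getUTCDay() || 7;
  // Move to the Thursday of this ISO week; its year is the ISO week-year.
  d.setUTCDate(d.getUTCDate() + 4 - dayNum);
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / 86_400_000 + 1) / 7);
  return `${d.getUTCFullYear()}-W${String(week).padStart(2, "0")}`;
}

// Builds the file name in the content-sanity-YYYY-Www.jsonl pattern.
function auditFileName(date: Date): string {
  return `content-sanity-${isoWeekLabel(date)}.jsonl`;
}
```

Because the label follows the ISO week-year rather than the calendar year, events logged on the first days of January can land in the previous year's last file, which keeps each file aligned to a full Monday-to-Sunday week.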
Embed sweep must skip pages with frontmatter.embed_skip set so soft-blocked
pages don't get re-embedded. Five wiring sites all use the shared helper:
1. src/commands/embed.ts — --stale CLI path (delegates to embedAllStale)
2. src/commands/embed.ts — --all CLI path (JS-side filterOutEmbedSkipped
on the listPages result; Codex r2 #11 caught this previously-missed
surface that re-embedded soft-blocked pages on every model swap)
3. src/core/embed-stale.ts:90 — Minion helper (inherits via engine)
4. src/core/postgres-engine.ts — listStaleChunks + countStaleChunks
gain 'NOT (COALESCE(p.frontmatter, ''{}''::jsonb) ? ''embed_skip'')'
filter at the SQL layer. Always JOINs pages now (pre-fix bare path
skipped the JOIN; D4 + D8 require it for the filter).
5. src/core/pglite-engine.ts — mirror of postgres-engine; PGLite is
Postgres 17.5 in WASM so the same JSONB '?' operator works.
Cross-site invariant pinned by test/embed-skip.test.ts (20 cases on the
JS predicate + SQL fragment semantics). When v0.41+ promotes embed_skip
to a schema column, all 5 sites get updated in one helper file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
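The shared predicate described in this commit can be sketched as below. The function and constant names come from the commit message, but the signatures and the `PageLike` shape are assumptions made for illustration; the real src/core/embed-skip.ts may differ. The JS check tests for key existence only, mirroring the JSONB '?' operator in the SQL fragment.

```typescript
// Assumed page shape; the real type in the codebase is not shown in this PR text.
interface PageLike {
  slug: string;
  frontmatter?: Record<string, unknown> | null;
}

// SQL-side filter as quoted in the commit message (p is the pages alias).
const EMBED_SKIP_FILTER_FRAGMENT =
  "NOT (COALESCE(p.frontmatter, '{}'::jsonb) ? 'embed_skip')";

// JS-side predicate: true when the embed_skip key exists, whatever its value.
function isEmbedSkipped(page: PageLike): boolean {
  const fm = page.frontmatter;
  return fm != null && Object.prototype.hasOwnProperty.call(fm, "embed_skip");
}

// Drops soft-blocked pages before an in-memory embed pass.
function filterOutEmbedSkipped<T extends PageLike>(pages: T[]): T[] {
  return pages.filter((page) => !isEmbedSkipped(page));
}
```

Keeping both forms in one file is what lets a later schema change (for example promoting embed_skip to a column) touch a single helper instead of five call sites.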
…waist

Hard-block via thrown ContentSanityBlockError; soft-block via frontmatter marker + chunk deletion on transition (D9 invariant). Single throw point means every wrapper site (CLI, MCP put_page, sync) inherits correct exit/error semantics through existing exception flow, with no per-wrapper status-vocabulary changes (Codex r2 #2).

import-file.ts:
- Gate runs AFTER parseMarkdown so assessor sees compiled_truth + timeline + title + frontmatter (Codex r2 #5+#7).
- Kill-switch (GBRAIN_NO_SANITY=1) checked via direct process.env AS WELL AS effective config: loadConfig() returns null on bare installs (no ~/.gbrain/config.json, no DATABASE_URL) so the config-only path missed the kill-switch. Caught by test/import-file-content-sanity.test.ts.
- Hard-block: throws ContentSanityBlockError. Existing import.ts catch increments errors; sync.ts:929 catch records failure with classified code.
- Soft-block: sets parsed.frontmatter.embed_skip via buildEmbedSkipMarker before hash compute (so hash differs from prior version → real write). Chunking block guards on isEmbedSkipped → chunks stays empty → existing tx.deleteChunks fires (D9 transition invariant).
- Audit JSONL records every assessment (hard / soft / warn + bypass-mode).

sync.ts:
- classifyErrorCode gains /PAGE_JUNK_PATTERN/ → 'PAGE_JUNK_PATTERN' regex. No PAGE_OVERSIZED code because oversize is now a soft state: the page lands.

config.ts:
- New content_sanity.* field on GBrainConfig (4 keys: bytes_warn, bytes_block, junk_patterns_enabled, disabled).
- loadConfig() reads GBRAIN_PAGE_WARN_BYTES, GBRAIN_PAGE_BLOCK_BYTES, GBRAIN_NO_JUNK_PATTERNS, GBRAIN_NO_SANITY env vars sparse-merged.
- loadConfigWithEngine merges DB-plane content_sanity.* keys per-key sparse-merge so 'gbrain config set content_sanity.bytes_block N' takes effect uniformly (Codex r2 #6 D1 acceptance).
- KNOWN_CONFIG_KEYS + KNOWN_CONFIG_KEY_PREFIXES include the new keys.

cli.ts:
- runImport now honors result.errors > 0 for non-zero exit. Pre-fix the CLI awaited runImport but discarded the result, so hard-blocked imports exited 0 silently (Codex r2 #3).

9 PGLite-backed unit tests pin: hard-block throws, error message contains PAGE_JUNK_PATTERN, blocked page does NOT land in DB, soft-block writes page with embed_skip set, soft-block deletes pre-existing chunks (D9 transition), kill-switch bypass works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
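The kill-switch fix described for import-file.ts can be sketched like this. It is an illustration under stated assumptions: `isSanityDisabled` and the `ConfigLike` shape are hypothetical, and treating only the exact value "1" as enabled is a guess based on the GBRAIN_NO_SANITY=1 spelling in the commit message. The point it shows is that the environment is consulted directly, so a null config on a bare install cannot hide the switch.

```typescript
// Hypothetical shapes; only the content_sanity.disabled key is taken from the commit text.
interface ContentSanityConfig {
  disabled?: boolean;
}
interface ConfigLike {
  content_sanity?: ContentSanityConfig;
}

// Returns true when the content-sanity gate should be bypassed.
// `config` may be null, as loadConfig() returns null on bare installs.
function isSanityDisabled(
  config: ConfigLike | null,
  env: Record<string, string | undefined>,
): boolean {
  // Check the env var first so the switch works even without any config.
  if (env.GBRAIN_NO_SANITY === "1") return true;
  // Fall back to the effective config when one was loaded.
  return config?.content_sanity?.disabled === true;
}
```

A caller would pass `process.env` as the second argument; taking it as a parameter keeps the function easy to test without mutating the real environment.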
Three operator surfaces backed by the shared content-sanity assessor:

lint.ts (2 new rules):
- huge-page: bytes (compiled_truth + timeline post-parse) exceeds warn or block threshold. Message names the actual byte count.
- scraper-junk: built-in junk pattern OR operator literal matched.
- Lint runs parseMarkdown to extract body for bytes-parity with doctor (D2: both surfaces measure body-only, not file-with-frontmatter).
- runLintCore resolves effective config once per run: file/env (sync via loadConfig) + DB-lift when ~/.gbrain/ is reachable (D1). CI without ~/.gbrain/ falls through immediately. Engine probe wrapped in try/catch so lint never blocks on engine state.
- Operator literals loaded once per lint run; passed through to every page's lintContent call.

doctor.ts (3 new checks + 1 flag):
- oversized_pages: indexed-free table scan via octet_length(compiled_truth) + octet_length(COALESCE(timeline, '')) (Codex r2 #13: octet_length is bytes, length is chars). Status warn on 1+ rows; oversize is now a soft state so no 'fail'.
- scraper_junk_pages: capped 1000 most-recent default + --content-audit opt-in for full scan (D10 mirrors --index-audit precedent from v0.14.3). Applies assessor per-page on title + 2KB body slice + frontmatter.
- content_sanity_audit_recent: reads ~/.gbrain/audit/content-sanity-*.jsonl for last 7 days, aggregates by event_type + source. Warn at 10+ events, fail at 100+. Doctor message names the multi-host limitation explicitly (Codex r1 #14): 'audit reflects events on this host only; multi-host operators should share GBRAIN_AUDIT_DIR'.

sources.ts (new audit subcommand):
- gbrain sources audit <id> [--json] [--include-warns]
- Reads sources.local_path, walks disk (via pruneDir for node_modules / .git / dotfiles), runs assessContentSanity per .md file.
- Reports size distribution (p50, p99, max) + would-hard-block count + would-soft-block count + junk-pattern hit map.
- Read-only: NO DB writes, NO file mutations. Operator runs this BEFORE a sync to catch junk early, or AFTER landing v0.40.9.0 to audit historical inventory.

13 unit tests on lint rules; D1 config-lift behavior pinned by lift in runLintCore + manual override via opts.contentSanity for tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
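The bytes-versus-characters distinction behind the octet_length note (Codex r2 #13) can be shown on the JS side as well. This is a sketch only: `utf8Bytes` and `bodyBytes` are hypothetical names, and it assumes the lint-side measurement is UTF-8 byte length of compiled_truth plus timeline, which is what bytes-parity with the doctor query would require.

```typescript
const encoder = new TextEncoder();

// UTF-8 byte length of a string (what octet_length reports for UTF-8 text).
function utf8Bytes(text: string): number {
  return encoder.encode(text).length;
}

// Body size measured as compiled_truth + timeline, with a missing timeline
// counted as empty, mirroring COALESCE(timeline, '') in the doctor query.
function bodyBytes(compiledTruth: string, timeline: string | null): number {
  return utf8Bytes(compiledTruth) + utf8Bytes(timeline ?? "");
}

// "na\u00efve" is 5 UTF-16 code units but 6 UTF-8 bytes.
const sample = "na\u00efve";
console.log(sample.length, bodyBytes(sample, null)); // prints: 5 6
```

Using `.length` instead would undercount any page with non-ASCII text, so a threshold expressed in bytes could be crossed in the database check while lint still reports the page as under the limit.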
v0.40.9.0 - content sanity defense: junk-pattern throw + oversize-skip-embed.

Plus TODOS.md entries for the 9 deferred v0.41+ follow-ups:
- chunk-level embed-quarantine (Codex r1 #3: page-level granularity wrong)
- source-repo remediation CLI (gbrain sources prune-junk)
- threshold validation post-deploy on real corpora
- brain-score no_junk_pages_score component
- pages soft-delete --where CLI (paired with prune-junk)
- post-v0.45 operator-regex extensibility (needs real ReDoS story)
- post-v0.45 HTML-density rule (needs fenced-code handling)
- bytes-parity E2E across lint + doctor
- 5-path narrow-waist E2E pin tests + doctor integration tests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add v0.40.9.0 Key Files entries for the content-sanity defense modules: content-sanity.ts (assessor), content-sanity-literals.ts (operator loader), embed-skip.ts (5-site shared predicate), audit/content-sanity-audit.ts (JSONL writer). Extend doctor.ts, lint.ts, embed.ts, import-file.ts, and sources.ts entries with the v0.40.9.0 surfaces (3 new doctor checks, 2 new lint rules, embed-skip filter at 5 sites, importFromContent gate, sources audit subcommand). Regenerate llms-full.txt per the CLAUDE.md edit rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1350 also claimed v0.40.9.0. Advancing this PR to v0.40.10.0 so CI's version-gate doesn't reject on overlap. No functional change: same shipped content, just a different version slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…size-gate

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
…undary flake

PR #1351 ship CI hit a single test failure (one in 2552):

(fail) scanBrainSources partial-scan state > hanging COUNT does not exceed deadline - Promise.race timeout fires [579.01ms]

Run: https://github.com/garrytan/gbrain/actions/runs/77611667786

Cause: heavily-loaded CI runners (8 parallel shards × 4 concurrent test files = ~32 concurrent bun processes) occasionally let the setTimeout race callback resolve a microsecond BEFORE the wall-clock boundary, leaving Date.now() one tick below deadline. The post-await deadline check at brain-writer.ts:512 uses Date.now() >= deadline; on that tick the check evaluated false and scanOneSource ran src-a anyway. Test then asserted firstSource.status === 'skipped' and got 'scanned'.

Fix: add 1ms overshoot to the race-timer schedule:

setTimeout(..., remainingMs + 1)

Guarantees the timer fires past the deadline by at least one millisecond regardless of runner timer drift. Cost: 1ms additional wall-clock latency on hung COUNT queries, operationally negligible.

Verified: stress-tested 5/5 passing locally. The bug class is identical to the one the existing test comment block (lines 180-187) documents (`>=` not `>` at line 512); this +1ms is the belt to that suspenders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
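The overshoot fix can be sketched as a small race helper. This is not the code at brain-writer.ts:512; `raceTimerDelayMs` and `raceWithDeadline` are hypothetical names, and the only detail taken from the commit is scheduling the timer at `remainingMs + 1` so that a later `Date.now() >= deadline` check is intended to see the deadline as passed.

```typescript
// Delay to schedule for the race timer: time left until the deadline,
// clamped at zero, plus the 1ms overshoot described in the commit.
function raceTimerDelayMs(deadline: number, now: number): number {
  const remainingMs = Math.max(0, deadline - now);
  return remainingMs + 1;
}

// Races `work` against the deadline. Resolves to "timeout" if the timer wins.
async function raceWithDeadline<T>(
  work: Promise<T>,
  deadline: number,
): Promise<T | "timeout"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), raceTimerDelayMs(deadline, Date.now()));
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so a fast `work` does not leave it pending.
    clearTimeout(timer);
  }
}
```

After a "timeout" result, a caller that re-checks `Date.now() >= deadline` before starting the next unit of work is the situation the commit describes; the extra millisecond is there so that check does not evaluate false on a timer that fired marginally early.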
Summary
Your brain stops accepting junk pages, and oversize content stops crashing the embedder. Six commits implement a layered content-sanity defense at the narrow waist of ingestion:
1. `feat: add content-sanity assessor + embed-skip helper + audit JSONL primitives`: 4 new pure modules (content-sanity.ts, content-sanity-literals.ts, embed-skip.ts, audit/content-sanity-audit.ts).
2. `feat(embed): apply embed-skip filter at all 5 stale-chunk sites`: Postgres + PGLite engines, embed CLI --stale AND --all paths, Minion helper. Single source of truth via embed-skip.ts.
3. `feat(ingest): wire content-sanity gate into importFromContent narrow waist`: throws ContentSanityBlockError on hard-block (junk match), sets frontmatter.embed_skip + deletes existing chunks on soft-block (oversize alone). Exit-code wire-up at cli.ts so gbrain import actually fails on bad content (Codex r2 #3).
4. `feat: lint rules + doctor checks + 'gbrain sources audit' CLI`: 2 new lint rules + 3 new doctor checks + new gbrain sources audit <id> dry-run subcommand.
5. `chore: bump version and changelog (v0.40.9.0)`: VERSION 0.40.8.1 → 0.40.9.0 + CHANGELOG entry + 9 TODOS entries for deferred v0.41+ follow-ups.
6. `docs: update CLAUDE.md for v0.40.9.0 content-sanity wave`: Key Files entries for all 4 new modules + extension blocks on touched files; llms-full.txt regenerated.
Test Coverage
99 new unit tests across 6 files (207 assertions):
136 regression tests on edited surfaces (lint-frontmatter, import-file, embed-stale, sync-failures, doctor) pass in isolation.
bun run verify exit 0.
bun run typecheck clean.
Tests: 0 new test files in v0.40.8.1 → 6 new test files in v0.40.9.0 (+99 tests).
Pre-Landing Review
The plan went through full /plan-ceo-review + /plan-eng-review, each with Codex round 1 + 2 outside-voice review. Both reviews CLEARED.
… importFromContent status-vocabulary mismatch) and one during implementation (kill-switch missed on bare installs with no ~/.gbrain/config.json). Both fixed before commit.
Plan + decision provenance: ~/.claude/plans/system-instruction-you-are-working-temporal-brook.md.
CEO plan: ~/.gstack/projects/oslo-v2/ceo-plans/2026-05-23-content-sanity.md.
Plan Completion
All 13 implementation tasks complete (12 P1 + 1 P2 batch). All 9 architectural decisions from eng review applied to code. All 9 deferred items filed as v0.41+ TODOs.
TODOS
9 new entries under "v0.41 content-sanity follow-ups": chunk-level embed-quarantine (the deferred page-level granularity rethink), source-repo remediation CLI (gbrain sources prune-junk), threshold validation post-deploy on real corpora, brain-score no_junk_pages_score component, pages soft-delete CLI, post-v0.45 operator-regex + HTML-density features (each needs a real ReDoS / code-fence-handling story before it's worth shipping), bytes-parity E2E, 5-path narrow-waist E2E pin tests + doctor integration tests.
Documentation
… src/core/content-sanity.ts, src/core/content-sanity-literals.ts, src/core/embed-skip.ts, src/core/audit/content-sanity-audit.ts) and extension blocks on src/commands/sources.ts (new audit <id> subcommand), src/commands/doctor.ts (3 new checks: oversized_pages, scraper_junk_pages, content_sanity_audit_recent), src/commands/lint.ts (2 new rules: huge-page, scraper-junk), src/commands/embed.ts (5-site embed-skip filter), and src/core/import-file.ts (narrow-waist gate with ContentSanityBlockError).
Test plan
… bun test test/content-sanity*.test.ts test/embed-skip.test.ts test/audit/content-sanity-audit.test.ts test/lint-content-sanity.test.ts test/import-file-content-sanity.test.ts)
🤖 Generated with Claude Code