
v0.40.10.0 feat: content sanity defense — junk-pattern throw + oversize-skip-embed #1351

Merged
garrytan merged 9 commits into master from garrytan/lint-page-size-gate
May 24, 2026


@garrytan
Owner

Summary

Your brain stops accepting junk pages, and oversize content stops crashing the embedder. Six commits implement a layered content-sanity defense at the narrow waist of ingestion:

  • feat: add content-sanity assessor + embed-skip helper + audit JSONL primitives — 4 new pure modules (content-sanity.ts, content-sanity-literals.ts, embed-skip.ts, audit/content-sanity-audit.ts).
  • feat(embed): apply embed-skip filter at all 5 stale-chunk sites — Postgres + PGLite engines, embed CLI --stale AND --all paths, Minion helper. Single source of truth via embed-skip.ts.
  • feat(ingest): wire content-sanity gate into importFromContent narrow waist — throws ContentSanityBlockError on hard-block (junk match), sets frontmatter.embed_skip + deletes existing chunks on soft-block (oversize alone). Exit-code wire-up at cli.ts so gbrain import actually fails on bad content (Codex r2 #3).
  • feat: lint rules + doctor checks + 'gbrain sources audit' CLI — 2 new lint rules + 3 new doctor checks + new gbrain sources audit <id> dry-run subcommand.
  • chore: bump version and changelog (v0.40.9.0) — VERSION 0.40.8.1 → 0.40.9.0 + CHANGELOG entry + 9 TODOS entries for deferred v0.41+ follow-ups.
  • docs: update CLAUDE.md for v0.40.9.0 content-sanity wave — Key Files entries for all 4 new modules + extension blocks on touched files; llms-full.txt regenerated.
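
The hard-block / soft-block split described in the bullets above can be sketched as follows. This is a minimal illustration, not the contents of content-sanity.ts: the three patterns, the `PageInput` shape, and the `thresholds` parameter are assumptions made for the example. Only the names `assessContentSanity`, `ContentSanityBlockError`, and the `PAGE_JUNK_PATTERN` code come from this PR.

```typescript
// Hypothetical sketch; patterns and shapes are placeholders, not the shipped values.
type Verdict = "ok" | "warn" | "soft-block" | "hard-block";

interface PageInput {
  title: string;
  compiledTruth: string;
  timeline: string;
}

interface Assessment {
  verdict: Verdict;
  bytes: number;
  patternHits: string[];
}

class ContentSanityBlockError extends Error {
  readonly code: string;
  constructor(patternHits: string[]) {
    // The code is embedded in the message so a regex-based classifier can find it.
    super(`PAGE_JUNK_PATTERN: matched ${patternHits.join(", ")}`);
    this.name = "ContentSanityBlockError";
    this.code = "PAGE_JUNK_PATTERN";
  }
}

// Illustrative stand-ins for the built-in junk patterns.
const JUNK_PATTERNS: Array<{ name: string; re: RegExp }> = [
  { name: "attention-required", re: /attention required/i },
  { name: "just-a-moment", re: /just a moment/i },
  { name: "access-denied", re: /^access denied$/im },
];

const utf8Bytes = (s: string): number => new TextEncoder().encode(s).length;

function assessContentSanity(
  page: PageInput,
  thresholds: { bytesWarn: number; bytesBlock: number },
): Assessment {
  // Bytes are measured on the parsed body (compiled_truth + timeline), not the file.
  const bytes = utf8Bytes(page.compiledTruth) + utf8Bytes(page.timeline);
  const haystack = `${page.title}\n${page.compiledTruth}`;
  const patternHits = JUNK_PATTERNS.filter((p) => p.re.test(haystack)).map((p) => p.name);

  let verdict: Verdict = "ok";
  if (patternHits.length > 0) verdict = "hard-block"; // junk match: caller throws
  else if (bytes >= thresholds.bytesBlock) verdict = "soft-block"; // oversize alone: page lands, embed skipped
  else if (bytes >= thresholds.bytesWarn) verdict = "warn";
  return { verdict, bytes, patternHits };
}
```

A caller at the ingestion gate would throw `new ContentSanityBlockError(result.patternHits)` on `hard-block` and set the embed-skip marker on `soft-block`.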

Test Coverage

99 new unit tests across 6 files (207 assertions):

test/content-sanity.test.ts             32 cases   ★★★ all built-in patterns, bytes parity, ContentSanityBlockError
test/content-sanity-literals.test.ts    13 cases   ★★★ ENOENT, comments, directives, regex meta-chars stay literal
test/embed-skip.test.ts                 20 cases   ★★★ JS predicate + SQL fragment semantics
test/audit/content-sanity-audit.test.ts 12 cases   ★★★ JSONL writer + reader + classify + summarize
test/lint-content-sanity.test.ts        13 cases   ★★★ huge-page + scraper-junk rules + bytes parity
test/import-file-content-sanity.test.ts  9 cases   ★★★ PGLite-backed hard-block throw + soft-block transition

COVERAGE: 99/99 pass (100%)  |  ★★★:6 files  |  GAPS: 0 (deferred E2E pins filed in TODOS for v0.41+)

136 regression tests on edited surfaces (lint-frontmatter, import-file, embed-stale, sync-failures, doctor) pass in isolation. bun run verify exit 0. bun run typecheck clean.

Tests: 0 new test files in v0.40.8.1 → 6 new test files in v0.40.9.0 (+99 tests).

Pre-Landing Review

The plan went through full /plan-ceo-review + /plan-eng-review, each with Codex round 1 + 2 outside-voice review. Both reviews CLEARED.

  • CEO review: 5 cherry-picks surfaced, 3 accepted, 2 deferred to v0.41+ post-Codex round 1. 16 decisions (D1-D16).
  • Eng review: 4 architectural decisions (D1-D4) + 4 strategic Codex round 2 tensions resolved (D6-D9). One outside-voice round 2 vote (D5).
  • Codex caught one load-bearing bug class during planning (importFromContent status-vocabulary mismatch) and one during implementation (kill-switch missed on bare installs with no ~/.gbrain/config.json). Both fixed before commit.

Plan + decision provenance: ~/.claude/plans/system-instruction-you-are-working-temporal-brook.md.
CEO plan: ~/.gstack/projects/oslo-v2/ceo-plans/2026-05-23-content-sanity.md.

Plan Completion

All 13 implementation tasks complete (12 P1 + 1 P2 batch). All 9 architectural decisions from eng review applied to code. All 9 deferred items filed as v0.41+ TODOs.

TODOS

9 new entries under "v0.41 content-sanity follow-ups" — chunk-level embed-quarantine (the deferred page-level granularity rethink), source-repo remediation CLI (gbrain sources prune-junk), threshold validation post-deploy on real corpora, brain-score no_junk_pages_score component, pages soft-delete CLI, post-v0.45 operator-regex + HTML-density features (each needs a real ReDoS / code-fence-handling story before it's worth shipping), bytes-parity E2E, 5-path narrow-waist E2E pin tests + doctor integration tests.

Documentation

  • CLAUDE.md — extended with v0.40.9.0 Key Files entries for the content-sanity defense wave: new modules (src/core/content-sanity.ts, src/core/content-sanity-literals.ts, src/core/embed-skip.ts, src/core/audit/content-sanity-audit.ts) and extension blocks on src/commands/sources.ts (new audit <id> subcommand), src/commands/doctor.ts (3 new checks: oversized_pages, scraper_junk_pages, content_sanity_audit_recent), src/commands/lint.ts (2 new rules: huge-page, scraper-junk), src/commands/embed.ts (5-site embed-skip filter), and src/core/import-file.ts (narrow-waist gate with ContentSanityBlockError).
  • llms-full.txt — regenerated to absorb the CLAUDE.md edits (per the project's mandatory regen rule after any CLAUDE.md change).
  • CHANGELOG.md / TODOS.md — covered by the version-bump commit (d33a2b6).
  • README.md / AGENTS.md — not touched. These are marketing / install-protocol surfaces that don't enumerate per-release Key Files. v0.40.9.0 is a defensive hardening release with no new install path or top-line capability that belongs on the README.

Test plan

  • All 99 new unit tests pass (bun test test/content-sanity*.test.ts test/embed-skip.test.ts test/audit/content-sanity-audit.test.ts test/lint-content-sanity.test.ts test/import-file-content-sanity.test.ts)
  • 136 regression tests on edited surfaces pass in isolation
  • `bun run verify` clean (7 pre-checks + typecheck)
  • `bun run typecheck` clean
  • Full `bun run test` parallel suite returns exit 0
  • Optional post-merge: run `gbrain doctor --content-audit` on a real brain with junk pages to validate the threshold defaults against real-world distribution (filed as v0.41+ TODO)

🤖 Generated with Claude Code

garrytan and others added 7 commits May 24, 2026 01:43
…rimitives

Four new core modules (pure, no engine I/O):

- src/core/content-sanity.ts — assessor with 6 hand-vetted junk patterns
  (Cloudflare attention-required, just-a-moment, ray-id; access-denied;
  captcha-required; bare error-page titles). Bytes measured against
  compiled_truth + timeline (parseMarkdown body split, not file bytes).
  ContentSanityBlockError tagged with PAGE_JUNK_PATTERN code so
  classifyErrorCode hits via regex without a new ImportResult field.

- src/core/content-sanity-literals.ts — operator literal-substring loader
  for ~/.gbrain/junk-substrings.txt. Comment directives for name +
  applies_to. ENOENT returns empty list (fail-soft); no regex parsing so
  no ReDoS surface.

- src/core/embed-skip.ts — single source of truth for the embed-skip
  predicate. JS isEmbedSkipped() + filterOutEmbedSkipped() for in-memory
  callers; EMBED_SKIP_FILTER_FRAGMENT raw SQL string for engine-layer
  filters. buildEmbedSkipMarker() emits the canonical frontmatter shape.
  Both Postgres and PGLite use the same JSONB '?' existence operator.

- src/core/audit/content-sanity-audit.ts — ISO-week JSONL at
  ~/.gbrain/audit/content-sanity-YYYY-Www.jsonl. Built on v0.40.4.0
  audit-writer primitive. One stream for hard-block + soft-block + warn
  events with event_type discriminator. summarizeContentSanityEvents
  rolls up by type + source + pattern hits for doctor consumption.
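
For readers unfamiliar with ISO-week file naming, here is a sketch of how a `content-sanity-YYYY-Www.jsonl` name could be derived from a date. The helper names and the choice to compute the week in UTC are assumptions; the audit-writer primitive may do this differently.

```typescript
// Hypothetical helpers; the real audit writer may use a different clock or API.
function isoWeekLabel(d: Date): string {
  // ISO 8601: weeks start on Monday, and week 1 is the week containing the first Thursday.
  const date = new Date(Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate()));
  const dayNum = date.getUTCDay() || 7; // map Sunday (0) to 7
  date.setUTCDate(date.getUTCDate() + 4 - dayNum); // move to the Thursday of this ISO week
  const isoYear = date.getUTCFullYear();
  const yearStart = Date.UTC(isoYear, 0, 1);
  const week = Math.ceil(((date.getTime() - yearStart) / 86400000 + 1) / 7);
  return `${isoYear}-W${String(week).padStart(2, "0")}`;
}

function auditFileName(d: Date): string {
  return `content-sanity-${isoWeekLabel(d)}.jsonl`;
}

console.log(auditFileName(new Date("2026-05-24T12:00:00Z"))); // content-sanity-2026-W21.jsonl
```

Note that the ISO year can differ from the calendar year near January 1, which is why the year is read after shifting to Thursday.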

99 unit tests across 4 new test files (207 assertions) covering
boundaries, every built-in pattern, bytes-parity assertion, operator
literals (regex meta-chars stay literal), audit JSONL round-trip + reader.
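
The "regex meta-chars stay literal" property follows from plain substring matching. A minimal sketch, assuming one literal per line with `#` comment lines; the function names are hypothetical and the name / applies_to directives are omitted because their syntax is not shown in this PR:

```typescript
// Hypothetical sketch of literal-substring matching; not the actual loader.
function parseLiterals(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"));
}

function matchLiterals(body: string, literals: string[]): string[] {
  // String.prototype.includes is a plain substring search, so characters such as
  // '.', '*', '(' and '+' carry no special meaning and no backtracking can occur.
  return literals.filter((lit) => body.includes(lit));
}

const literals = parseLiterals("# operator literals\nPlease enable cookies\na.b\n");
console.log(matchLiterals("value: axb", literals)); // []
console.log(matchLiterals("value: a.b", literals)); // [ 'a.b' ]
```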

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Embed sweep must skip pages with frontmatter.embed_skip set so soft-blocked
pages don't get re-embedded. Five wiring sites all use the shared helper:

  1. src/commands/embed.ts — --stale CLI path (delegates to embedAllStale)
  2. src/commands/embed.ts — --all CLI path (JS-side filterOutEmbedSkipped
     on the listPages result; Codex r2 #11 caught this previously-missed
     surface that re-embedded soft-blocked pages on every model swap)
  3. src/core/embed-stale.ts:90 — Minion helper (inherits via engine)
  4. src/core/postgres-engine.ts — listStaleChunks + countStaleChunks
     gain 'NOT (COALESCE(p.frontmatter, ''{}''::jsonb) ? ''embed_skip'')'
     filter at the SQL layer. Always JOINs pages now (pre-fix bare path
     skipped the JOIN; D4 + D8 require it for the filter).
  5. src/core/pglite-engine.ts — mirror of postgres-engine; PGLite is
     Postgres 17.5 in WASM so the same JSONB '?' operator works.

Cross-site invariant pinned by test/embed-skip.test.ts (20 cases on the
JS predicate + SQL fragment semantics). When v0.41+ promotes embed_skip
to a schema column, all 5 sites get updated in one helper file.
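
A sketch of the shared-helper shape described above, under stated assumptions: the `PageLike` type and the fields inside the marker are invented for illustration, and the JS predicate is assumed to test key existence so that it mirrors the JSONB `?` operator. The SQL fragment is the one quoted in this commit message, with the doubled quotes un-escaped.

```typescript
// Hypothetical shapes; only the exported names come from the commit message.
type Frontmatter = Record<string, unknown>;

interface PageLike {
  slug: string;
  frontmatter: Frontmatter | null;
}

// Engine-layer filter; 'p' is assumed to be the alias for the pages table.
const EMBED_SKIP_FILTER_FRAGMENT =
  "NOT (COALESCE(p.frontmatter, '{}'::jsonb) ? 'embed_skip')";

function isEmbedSkipped(page: PageLike): boolean {
  // Key existence rather than truthiness, matching what '?' checks in SQL.
  return (
    page.frontmatter !== null &&
    Object.prototype.hasOwnProperty.call(page.frontmatter, "embed_skip")
  );
}

function filterOutEmbedSkipped<T extends PageLike>(pages: T[]): T[] {
  return pages.filter((p) => !isEmbedSkipped(p));
}

function buildEmbedSkipMarker(reason: string): Frontmatter {
  // The inner field name is an assumption for this example.
  return { embed_skip: { reason } };
}
```

Keeping the SQL string and the JS predicate in one file is what lets a single test suite pin both halves to the same semantics.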

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…waist

Hard-block via thrown ContentSanityBlockError; soft-block via frontmatter
marker + chunk deletion on transition (D9 invariant). Single throw point
means every wrapper site (CLI, MCP put_page, sync) inherits correct
exit/error semantics through existing exception flow — no per-wrapper
status-vocabulary changes (Codex r2 #2).

import-file.ts:
- Gate runs AFTER parseMarkdown so assessor sees compiled_truth + timeline
  + title + frontmatter (Codex r2 #5+#7).
- Kill-switch (GBRAIN_NO_SANITY=1) checked via direct process.env AS WELL
  AS effective config — loadConfig() returns null on bare installs (no
  ~/.gbrain/config.json, no DATABASE_URL) so the config-only path missed
  the kill-switch. Caught by test/import-file-content-sanity.test.ts.
- Hard-block: throws ContentSanityBlockError. Existing import.ts catch
  increments errors; sync.ts:929 catch records failure with classified code.
- Soft-block: sets parsed.frontmatter.embed_skip via buildEmbedSkipMarker
  before hash compute (so hash differs from prior version → real write).
  Chunking block guards on isEmbedSkipped → chunks stays empty → existing
  tx.deleteChunks fires (D9 transition invariant).
- Audit JSONL records every assessment (hard / soft / warn + bypass-mode).
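
The kill-switch bug above comes down to a nullable config. A minimal sketch of a check that survives a bare install; the function name, the config shape, and the injected `env` parameter are assumptions made so the example is self-contained:

```typescript
// Hypothetical sketch; the real check lives inside import-file.ts.
interface GBrainConfigLike {
  content_sanity?: { disabled?: boolean };
}

function isSanityDisabled(
  env: Record<string, string | undefined>,
  config: GBrainConfigLike | null,
): boolean {
  // Read the env var directly: on a bare install the config loader returns null,
  // so a config-only check would never see the kill-switch.
  if (env.GBRAIN_NO_SANITY === "1") return true;
  return config?.content_sanity?.disabled === true;
}

console.log(isSanityDisabled({ GBRAIN_NO_SANITY: "1" }, null)); // true
console.log(isSanityDisabled({}, null)); // false
```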

sync.ts:
- classifyErrorCode gains /PAGE_JUNK_PATTERN/ → 'PAGE_JUNK_PATTERN' regex.
  No PAGE_OVERSIZED code because oversize is now a soft state — page lands.
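
Because the error message carries the code, classification needs no new result field. A sketch of the idea; the `UNKNOWN` fallback is an assumption, and the real classifyErrorCode handles other codes not shown here:

```typescript
// Hypothetical reduction of classifyErrorCode to the one rule added in this commit.
function classifyErrorCode(message: string): string {
  if (/PAGE_JUNK_PATTERN/.test(message)) return "PAGE_JUNK_PATTERN";
  return "UNKNOWN"; // placeholder fallback for this sketch
}

console.log(classifyErrorCode("PAGE_JUNK_PATTERN: matched just-a-moment")); // PAGE_JUNK_PATTERN
```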

config.ts:
- New content_sanity.* field on GBrainConfig (4 keys: bytes_warn,
  bytes_block, junk_patterns_enabled, disabled).
- loadConfig() reads GBRAIN_PAGE_WARN_BYTES, GBRAIN_PAGE_BLOCK_BYTES,
  GBRAIN_NO_JUNK_PATTERNS, GBRAIN_NO_SANITY env vars sparse-merged.
- loadConfigWithEngine merges DB-plane content_sanity.* keys per-key
  sparse-merge so 'gbrain config set content_sanity.bytes_block N' takes
  effect uniformly (Codex r2 #6 D1 acceptance).
- KNOWN_CONFIG_KEYS + KNOWN_CONFIG_KEY_PREFIXES include the new keys.
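
A sketch of the per-key sparse merge, assuming that a later layer overrides an earlier one only for the keys it actually sets. The function names, the "1" convention for the boolean env vars other than GBRAIN_NO_SANITY, and the layer precedence are assumptions; the four keys and the env var names come from the list above.

```typescript
// Hypothetical sketch; the real loadConfig / loadConfigWithEngine may differ.
interface ContentSanityConfig {
  bytes_warn?: number;
  bytes_block?: number;
  junk_patterns_enabled?: boolean;
  disabled?: boolean;
}

type Env = Record<string, string | undefined>;

function parseBytes(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const n = Number(raw);
  return Number.isFinite(n) && n > 0 ? n : undefined;
}

function contentSanityFromEnv(env: Env): ContentSanityConfig {
  const out: ContentSanityConfig = {};
  const warn = parseBytes(env.GBRAIN_PAGE_WARN_BYTES);
  const block = parseBytes(env.GBRAIN_PAGE_BLOCK_BYTES);
  if (warn !== undefined) out.bytes_warn = warn;
  if (block !== undefined) out.bytes_block = block;
  if (env.GBRAIN_NO_JUNK_PATTERNS === "1") out.junk_patterns_enabled = false;
  if (env.GBRAIN_NO_SANITY === "1") out.disabled = true;
  return out;
}

function sparseMerge(
  base: ContentSanityConfig,
  ...layers: ContentSanityConfig[]
): ContentSanityConfig {
  const merged: ContentSanityConfig = { ...base };
  for (const layer of layers) {
    // Only keys that a layer explicitly sets replace the value below it.
    if (layer.bytes_warn !== undefined) merged.bytes_warn = layer.bytes_warn;
    if (layer.bytes_block !== undefined) merged.bytes_block = layer.bytes_block;
    if (layer.junk_patterns_enabled !== undefined)
      merged.junk_patterns_enabled = layer.junk_patterns_enabled;
    if (layer.disabled !== undefined) merged.disabled = layer.disabled;
  }
  return merged;
}
```

With this shape, setting only `bytes_block` in one layer leaves `bytes_warn` from the layer below untouched.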

cli.ts:
- runImport now honors result.errors > 0 for non-zero exit. Pre-fix the
  CLI awaited runImport but discarded the result, so hard-blocked imports
  exited 0 silently (Codex r2 #3).
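
The exit-code fix amounts to using the result instead of discarding it. A sketch, assuming an `ImportResult` with an `errors` count; the `imported` field, the function name, and the specific exit code 1 are assumptions (the commit only states "non-zero"):

```typescript
// Hypothetical sketch of the exit-code decision in the CLI wrapper.
interface ImportResult {
  imported: number;
  errors: number;
}

function exitCodeForImport(result: ImportResult): number {
  return result.errors > 0 ? 1 : 0;
}

// In a CLI entry point this value would be assigned to process.exitCode.
console.log(exitCodeForImport({ imported: 3, errors: 1 })); // 1
```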

9 PGLite-backed unit tests pin: hard-block throws, error message contains
PAGE_JUNK_PATTERN, blocked page does NOT land in DB, soft-block writes
page with embed_skip set, soft-block deletes pre-existing chunks (D9
transition), kill-switch bypass works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three operator surfaces backed by the shared content-sanity assessor:

lint.ts (2 new rules):
- huge-page: bytes (compiled_truth + timeline post-parse) exceeds warn or
  block threshold. Message names the actual byte count.
- scraper-junk: built-in junk pattern OR operator literal matched.
- Lint runs parseMarkdown to extract body for bytes-parity with doctor
  (D2 — both surfaces measure body-only, not file-with-frontmatter).
- runLintCore resolves effective config once per run: file/env (sync via
  loadConfig) + DB-lift when ~/.gbrain/ is reachable (D1). CI without
  ~/.gbrain/ falls through immediately. Engine probe wrapped in try/catch
  so lint never blocks on engine state.
- Operator literals loaded once per lint run; passed through to every
  page's lintContent call.
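
Bytes parity matters because character counts and byte counts diverge on non-ASCII text. A small sketch of a body-only byte measurement; the helper name is hypothetical:

```typescript
// Hypothetical helper: UTF-8 byte length of the parsed body, excluding frontmatter.
function bodyBytes(compiledTruth: string, timeline: string | null): number {
  const enc = new TextEncoder();
  return enc.encode(compiledTruth).length + enc.encode(timeline ?? "").length;
}

console.log("héllo".length); // 5 (UTF-16 code units)
console.log(bodyBytes("héllo", null)); // 6 (é is two bytes in UTF-8)
console.log(bodyBytes("🙂", "ok")); // 6 (4 bytes for the emoji + 2)
```

A measurement of this kind on the JS side should agree with a byte-based SQL measurement on the same columns, which is the property the parity tests are meant to pin.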

doctor.ts (3 new checks + 1 flag):
- oversized_pages: index-free table scan via
  octet_length(compiled_truth) + octet_length(COALESCE(timeline, ''))
  (Codex r2 #13: octet_length is bytes, length is chars). Status warn
  on 1+ rows; oversize is now a soft state so no 'fail'.
- scraper_junk_pages: capped at the 1000 most-recent pages by default, with
  --content-audit as opt-in for a full scan (D10 mirrors the --index-audit
  precedent from v0.14.3).
  Applies assessor per-page on title + 2KB body slice + frontmatter.
- content_sanity_audit_recent: reads ~/.gbrain/audit/content-sanity-*.jsonl
  for last 7 days, aggregates by event_type + source. Warn at 10+ events,
  fail at 100+. Doctor message names the multi-host limitation explicitly
  (Codex r1 #14): 'audit reflects events on this host only; multi-host
  operators should share GBRAIN_AUDIT_DIR'.
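
The content_sanity_audit_recent thresholds can be sketched as a pure function over parsed audit events. The `ts` and `source` field names and the function name are assumptions; the 7-day window and the 10 / 100 thresholds come from the description above.

```typescript
// Hypothetical sketch; the real check also aggregates by event_type and source.
type Status = "ok" | "warn" | "fail";

interface AuditEvent {
  ts: string; // ISO-8601 timestamp (assumed field name)
  event_type: string;
  source: string;
}

function recentAuditStatus(
  events: AuditEvent[],
  now: Date,
): { status: Status; count: number } {
  const cutoff = now.getTime() - 7 * 24 * 60 * 60 * 1000;
  const count = events.filter((e) => Date.parse(e.ts) >= cutoff).length;
  const status: Status = count >= 100 ? "fail" : count >= 10 ? "warn" : "ok";
  return { status, count };
}
```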

sources.ts (new audit subcommand):
- gbrain sources audit <id> [--json] [--include-warns]
- Reads sources.local_path, walks disk (via pruneDir for node_modules /
  .git / dotfiles), runs assessContentSanity per .md file.
- Reports size distribution (p50, p99, max) + would-hard-block count +
  would-soft-block count + junk-pattern hit map.
- Read-only: NO DB writes, NO file mutations. Operator runs this BEFORE
  a sync to catch junk early, or AFTER landing v0.40.9.0 to audit
  historical inventory.
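
One way the size distribution in the audit report could be computed, shown as a sketch. The nearest-rank percentile method and the function names are assumptions; the commit does not say which percentile definition is used.

```typescript
// Hypothetical sketch using the nearest-rank percentile definition.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  const index = Math.min(sortedAsc.length, Math.max(1, rank)) - 1;
  return sortedAsc[index];
}

function sizeDistribution(sizes: number[]): { p50: number; p99: number; max: number } {
  const sorted = [...sizes].sort((a, b) => a - b); // numeric sort, input left untouched
  return {
    p50: percentile(sorted, 50),
    p99: percentile(sorted, 99),
    max: sorted.length > 0 ? sorted[sorted.length - 1] : 0,
  };
}

console.log(sizeDistribution([40, 10, 30, 20])); // { p50: 20, p99: 40, max: 40 }
```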

13 unit tests cover the lint rules. The D1 config-lift behavior is pinned
through the lift in runLintCore, with a manual override via
opts.contentSanity for tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v0.40.9.0 — content sanity defense: junk-pattern throw + oversize-skip-embed.

Plus TODOS.md entries for the 9 deferred v0.41+ follow-ups:
- chunk-level embed-quarantine (Codex r1 #3 — page-level granularity wrong)
- source-repo remediation CLI (gbrain sources prune-junk)
- threshold validation post-deploy on real corpora
- brain-score no_junk_pages_score component
- pages soft-delete --where CLI (paired with prune-junk)
- post-v0.45 operator-regex extensibility (needs real ReDoS story)
- post-v0.45 HTML-density rule (needs fenced-code handling)
- bytes-parity E2E across lint + doctor
- 5-path narrow-waist E2E pin tests + doctor integration tests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add v0.40.9.0 Key Files entries for the content-sanity defense modules:
content-sanity.ts (assessor), content-sanity-literals.ts (operator loader),
embed-skip.ts (5-site shared predicate), audit/content-sanity-audit.ts
(JSONL writer). Extend doctor.ts, lint.ts, embed.ts, import-file.ts, and
sources.ts entries with the v0.40.9.0 surfaces (3 new doctor checks,
2 new lint rules, embed-skip filter at 5 sites, importFromContent gate,
sources audit subcommand).

Regenerate llms-full.txt per the CLAUDE.md edit rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1350 also claimed v0.40.9.0. Advancing this PR to v0.40.10.0 so CI's
version-gate doesn't reject on overlap. No functional change — same shipped
content, just a different version slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan changed the title from "v0.40.9.0 feat: content sanity defense — junk-pattern throw + oversize-skip-embed" to "v0.40.10.0 feat: content sanity defense — junk-pattern throw + oversize-skip-embed" on May 24, 2026
garrytan and others added 2 commits May 24, 2026 09:32
…size-gate

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
…undary flake

PR #1351 ship CI hit a single test failure (one in 2552):
  (fail) scanBrainSources partial-scan state > hanging COUNT does not
  exceed deadline — Promise.race timeout fires [579.01ms]

Run: https://github.com/garrytan/gbrain/actions/runs/77611667786

Cause: heavily-loaded CI runners (8 parallel shards × 4 concurrent test
files = ~32 concurrent bun processes) occasionally let the setTimeout
race callback resolve a microsecond BEFORE the wall-clock boundary,
leaving Date.now() one tick below deadline. The post-await deadline
check at brain-writer.ts:512 uses Date.now() >= deadline; on that tick
the check evaluated false and scanOneSource ran src-a anyway. Test then
asserted firstSource.status === 'skipped' and got 'scanned'.

Fix: add 1ms overshoot to the race-timer schedule:
  setTimeout(..., remainingMs + 1)

Guarantees the timer fires past the deadline by at least one millisecond
regardless of runner timer drift. Cost: 1ms additional wall-clock
latency on hung COUNT queries — operationally negligible.
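
A sketch of the overshoot pattern, not the code in brain-writer.ts. The helper names are hypothetical, and clamping the remaining time at zero is an assumption added for the example.

```typescript
// Hypothetical sketch of a deadline race with a 1ms overshoot on the timer.
function raceTimerDelayMs(deadline: number, now: number): number {
  const remainingMs = Math.max(0, deadline - now);
  // +1 so that when the timer fires, a later `Date.now() >= deadline` check
  // holds even if the timer resolves marginally early.
  return remainingMs + 1;
}

function timeoutAfter(ms: number): { promise: Promise<"timeout">; cancel: () => void } {
  let handle: ReturnType<typeof setTimeout> | undefined;
  const promise = new Promise<"timeout">((resolve) => {
    handle = setTimeout(() => resolve("timeout"), ms);
  });
  return {
    promise,
    cancel: () => {
      if (handle !== undefined) clearTimeout(handle);
    },
  };
}

async function withDeadline<T>(work: Promise<T>, deadline: number): Promise<T | "timeout"> {
  const timer = timeoutAfter(raceTimerDelayMs(deadline, Date.now()));
  try {
    return await Promise.race([work, timer.promise]);
  } finally {
    timer.cancel(); // avoid leaving a live timer behind when work wins the race
  }
}
```

The overshoot complements the `>=` comparison rather than replacing it: the comparison handles the exact-boundary tick, and the extra millisecond handles a timer that fires slightly early.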

Verified: stress-tested 5/5 passing locally. The bug class is identical
to the one the existing test comment block (lines 180-187) documents
(`>=` not `>` at line 512); this +1ms is the belt to that suspenders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan merged commit fa2c7a6 into master on May 24, 2026
8 checks passed