
feat(auto-enrich): Phase 1 sensor + recipe scaffold #1236

Open
LexClaw wants to merge 3 commits into garrytan:master from LexClaw:feat/auto-enrich-phase1

LexClaw commented May 20, 2026

Phase 1 of the auto-enrichment recipe (sensor + scaffold).

What this PR ships

  • recipes/auto-enrich.md - recipe manifest (parses cleanly via gbrain integrations show)
  • recipes/auto-enrich/scripts/detect_sparse.py - sensor via CLI composition (gbrain list -> get -> backlinks)
  • recipes/auto-enrich/scripts/auto_enrich_lib.py - Heartbeat helper + subprocess wrapper
  • recipes/auto-enrich/scripts/run_sensor.sh - thin shell wrapper, emits heartbeat
  • recipes/auto-enrich/config.yaml - tunables
  • recipes/auto-enrich/tests/test_detect_sparse.py - 9 tests, all passing
  • recipes/auto-enrich/README.md - human-readable docs

What this PR does NOT ship

  • Phase 2: research strategy + Cal dispatch (sub-card filed)
  • Phase 3: quality gate + synthesize + cron + smoke (sub-card filed)

Live verification

  • python3 -m pytest recipes/auto-enrich/tests/ -q -> 9 passed
  • gbrain integrations list shows: 'auto-enrich Nightly Cal dispatch... ACTIVE'
  • gbrain integrations status auto-enrich -> ACTIVE, Last event: sensor_run (0h ago)
  • gbrain integrations show auto-enrich parses cleanly

Known follow-up (Phase 1.5, not blocking)

The sensor's default candidate_pool_per_type=50 yields ~200 candidates across the four default types. At 2 subprocesses each, that is ~400 gbrain calls per run, which takes 60s+ of wall time on a large brain (63k pages). Phase 1.5 should add a per-fetch timeout plus concurrent execution, or reduce the default pool. Not blocking Phase 2.
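A minimal sketch of what the Phase 1.5 change could look like, assuming the fetches are independent subprocess calls. The helper names `fetch_one` and `fetch_all`, the worker count, and the timeout value are all hypothetical and not part of this PR:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def fetch_one(argv, timeout_s=5.0):
    """Run one command; return stdout, or None on timeout or non-zero exit."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing leaks.
        return None
    return proc.stdout if proc.returncode == 0 else None


def fetch_all(argvs, max_workers=8, timeout_s=5.0):
    """Run the commands concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda argv: fetch_one(argv, timeout_s), argvs))
```

Threads are enough here because each worker spends its time blocked on a child process. Each `argv` would be one gbrain call from the existing list -> get -> backlinks chain; the exact argument shapes are left to the caller, since this sketch does not assume them.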

Hard rule compliance

  • No em dashes in human-facing docs (HR-8)
  • No fabricated CLI verbs (verified against gbrain --help)
  • Subprocess pattern follows web_lib.py (no Python client import attempted)

Card

Phase 1 card kn7dkpzjznxhq978fkx7r7c7kh8738tz on Mission Control board.

rayers commented May 21, 2026

Reviewed locally against my brain (586K pages, 317 concept + 778 person, M365/iMessage-heavy). Phase 1 is solid — clean phase boundary, no DB writes, 9/9 pytest pass in 0.04s, architecture matches the recipes/web-to-brain pattern. Heartbeat schema lines up with what gbrain integrations show / status reads. Live sensor run with default --types concept,entity,person,company returned exactly the kind of true-sparse candidates I'd want surfaced — 30-50 char person stubs ranking at 0.98+.

A few smaller things worth surfacing in case any are useful:

1. never_enriched floors every untouched page at ~0.6, independent of body length.
On a --types concept run, top result was a true sparse 1496-char page (good). Candidates 2 and 3 were 8821-char and 9938-char concept pages — well-developed, just disconnected and never enriched. They scored 0.6 from links=0 (0.3) + never_enriched (0.3), with body contributing 0.0.

Until Phase 3 starts writing last_enriched somewhere, every page floors at 0.6 and well-developed pages without backlinks rank alongside true stubs. Options:

  • Gate the candidate pool with a body_length floor before scoring (skip pages > target_body_length)
  • Suppress the age penalty during a bootstrap window before any last_enriched exists in the corpus
  • Document the bootstrap behavior so the first cron runs aren't surprising
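To make the floor concrete, here is a rough sketch of a scorer that reproduces the numbers above, plus the first option (the body_length gate). Only the two 0.3 contributions come from the observed output. The 0.4 body weight, the linear shortfall, the `target_body_length=2000` default, and both function names are assumptions for illustration, not the recipe's actual code:

```python
def sparseness_score(body_len, backlinks, last_enriched,
                     target_body_length=2000):
    """Hypothetical score in [0, 1]; higher means sparser."""
    # Assumed: body term is 0.4 * shortfall against the target length.
    body = 0.4 * max(0.0, 1.0 - body_len / target_body_length)
    # Observed above: links=0 contributes 0.3.
    links = 0.3 if backlinks == 0 else 0.0
    # Observed above: never_enriched contributes 0.3.
    age = 0.3 if last_enriched is None else 0.0
    return round(body + links + age, 3)


def gate_candidates(pages, target_body_length=2000):
    """Option 1: skip pages whose body already exceeds the target length."""
    return [p for p in pages if p["body_len"] <= target_body_length]
```

Under these assumptions an 8821-char page with no backlinks and no last_enriched scores 0.6 with nothing from the body term, which is the behavior described above. Gating on body length before scoring keeps such pages out of the pool entirely.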

2. recipes/auto-enrich/requirements.txt would help.
pyyaml + pytest aren't declared anywhere. Cloning fresh and running python3 -m pytest recipes/auto-enrich/tests/ -q hits ModuleNotFoundError: No module named 'yaml'. Trivial to add a requirements file scoped to the recipe.

3. GBRAIN_BIN env var would help dev iteration.
auto_enrich_lib.run_gbrain hard-codes ["gbrain", *args]. For testing against a dev build I had to PATH-swap. Two-line change: os.environ.get("GBRAIN_BIN", "gbrain") at the top of the argv list.
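Something along these lines would do it. `gbrain_argv` is a hypothetical helper name; in practice the lookup would sit inside the existing run_gbrain, whose full signature I am not assuming here:

```python
import os


def gbrain_argv(*args):
    """Build the argv for a gbrain call, honoring an optional override."""
    # GBRAIN_BIN is the proposed env var; fall back to "gbrain" on PATH.
    return [os.environ.get("GBRAIN_BIN", "gbrain"), *args]
```

Reading the variable at call time rather than at import time means tests can set or clear it per case.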

4. Discovery path is asserted but not exercised end-to-end.
The README correctly explains the flat-manifest contract (recipes/auto-enrich.md is what loadAllRecipes walks). I couldn't verify discovery from this branch because gbrain integrations show auto-enrich runs against the installed CLI's bundle, not the PR clone. Worth a smoke-test step in the deliverable doc, or a CI check that adds the recipe and asserts gbrain integrations list shows it.
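A possible shape for that check, as a sketch only. It assumes each line of gbrain integrations list output begins with the recipe id, which is inferred from the single line quoted in the PR description and not from any documented format. `recipe_listed` and `smoke_check` are hypothetical names:

```python
import subprocess


def recipe_listed(list_output, recipe_id):
    """True if any output line has recipe_id as its first token (assumed format)."""
    return any(
        line.split()[:1] == [recipe_id] for line in list_output.splitlines()
    )


def smoke_check(recipe_id="auto-enrich", gbrain_bin="gbrain"):
    """Fail loudly if the recipe is missing from `gbrain integrations list`."""
    out = subprocess.run(
        [gbrain_bin, "integrations", "list"],
        capture_output=True, text=True, check=True,
    ).stdout
    if not recipe_listed(out, recipe_id):
        raise SystemExit(f"{recipe_id} not found in integrations list")
```

For this to exercise discovery, `gbrain_bin` has to point at a build that bundles the PR branch's recipes, not the installed CLI.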

None of this blocks merge. #2 (requirements) feels like it'd be cheap to fold into this PR if you're game; #1, #3, and #4 can land separately or with Phase 2.

Nice work landing this with no writes and tight test coverage.
