garrytan · LexClaw · May 20, 2026
diff --git a/recipes/auto-enrich.md b/recipes/auto-enrich.md
@@ -0,0 +1,74 @@
+---
+id: auto-enrich
+name: Auto-Enrichment Recipe
+version: 0.1.0
+description: Nightly Cal dispatch that enriches sparse/orphan brain pages with research, citations, and cross-links via claude-haiku-4-5.
+category: sense
+requires: []
+secrets: []
+health_checks:
+  - type: env_exists
+    name: HOME
+    label: Heartbeat directory reachable
+setup_time: 30 min
+cost_estimate: "$3-5/mo (claude-haiku-4-5, nightly cadence)"
+---
+
+# Auto-Enrichment Recipe
+
+Nightly sensor that detects sparse, orphan, or stale entity pages in your brain and (in later phases) dispatches a Cal subagent to research and enrich them. Phase 1 ships the sensor and the recipe scaffold only. Research, quality gate, merge, and cron registration arrive in Phases 2 and 3.
+
+## What this is for
+
+Your brain accumulates pages that start as stubs: a person mentioned once, a company with a name and nothing else, a concept that never got fleshed out. Auto-enrich finds those pages on a regular cadence, ranks them by how sparse they are, and (in Phase 3) feeds a research artifact back through a quality gate before merging.
+
+Phase 1 (this PR) delivers:
+
+- Recipe manifest discoverable via `gbrain integrations list`
+- Sensor (`scripts/detect_sparse.py`) that ranks candidate pages via CLI composition (`gbrain list`, `gbrain get`, `gbrain backlinks`)
+- Heartbeat logging to `~/.gbrain/integrations/auto-enrich/heartbeat.jsonl`
+- TDD test suite for the sensor
+
+## Usage (Phase 1, sensor only)
+
+Run the sensor against your live brain:
+
+```bash
+bash recipes/auto-enrich/scripts/run_sensor.sh
+```
+
+Or directly:
+
+```bash
+python3 recipes/auto-enrich/scripts/detect_sparse.py --limit 5
+```
+
+Output is JSON to stdout: a ranked list of `{slug, score, reason, page_type}` records. The script also appends one heartbeat line per run to `~/.gbrain/integrations/auto-enrich/heartbeat.jsonl`.
+
+## Ranking signal
+
+For each candidate page the sensor computes:
+
+- `body_length_penalty` from the body text length returned by `gbrain get <slug>` (target: 1500 chars)
+- `link_starvation_penalty` from the inbound edge count returned by `gbrain backlinks <slug>` (target: 3 inbound links)
+- `enrichment_age_penalty` from the `last_enriched` frontmatter field; a page without the field is treated as never enriched and receives the full penalty (the penalty saturates at 90 days)
+
+The score is a weighted sum of the three penalties (default weights: body 0.4, links 0.3, age 0.3), clamped to [0, 1]. Higher scores rank first.
+
+## Schedule
+
+Phase 1 registers no cron job. Phase 3 adds `scripts/register-cron.sh`, which installs a nightly run at 03:00 local time via Hermes cron and sends no Telegram message when the run succeeds.
+
+## Config
+
+See `config.yaml` for ranking weights, target thresholds, and runtime paths. Defaults are intentionally conservative; tune them once a first dry run shows the real candidate distribution.
+
+## Files
+
+- `auto-enrich.md` (this manifest, flat at `recipes/` per integrations discovery contract)
+- `auto-enrich/README.md` (human-readable extended docs)
+- `auto-enrich/config.yaml` (tunables)
+- `auto-enrich/scripts/detect_sparse.py` (sensor)
+- `auto-enrich/scripts/auto_enrich_lib.py` (Heartbeat + subprocess wrapper)
+- `auto-enrich/scripts/run_sensor.sh` (one-shot invoker)
+- `auto-enrich/tests/` (pytest suite)
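The ranking signal described in the manifest can be sketched in Python as below. This is an illustration under stated assumptions rather than a copy of the PR's `detect_sparse.py`: the function name `score_page` and its parameters are hypothetical, and the default targets and weights are copied from `config.yaml`.

```python
from __future__ import annotations


def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def score_page(
    body_length: int,
    inbound_count: int,
    days_since_enriched: float | None,
    *,
    target_body_length: int = 1500,
    target_inbound_links: int = 3,
    max_enrichment_age_days: int = 90,
    weights: tuple[float, float, float] = (0.4, 0.3, 0.3),  # body, links, age
) -> float:
    """Hypothetical scorer: higher means sparser, more isolated, or staler."""
    w_body, w_links, w_age = weights
    body_penalty = clamp(1 - body_length / target_body_length)
    link_penalty = clamp(1 - inbound_count / target_inbound_links)
    # A page with no last_enriched value is treated as never enriched.
    if days_since_enriched is None:
        age_penalty = 1.0
    else:
        age_penalty = clamp(days_since_enriched / max_enrichment_age_days)
    return clamp(w_body * body_penalty + w_links * link_penalty + w_age * age_penalty)


stub = score_page(body_length=0, inbound_count=0, days_since_enriched=None)
healthy = score_page(body_length=3000, inbound_count=5, days_since_enriched=0)
half_body = score_page(body_length=750, inbound_count=3, days_since_enriched=0)
```

An empty, unlinked, never-enriched page (`stub`) lands at the top of the range, while a page that meets or exceeds every target (`healthy`) scores 0. A page that is only short on body text (`half_body`) is penalised by the body weight alone.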
diff --git a/recipes/auto-enrich/README.md b/recipes/auto-enrich/README.md
@@ -0,0 +1,89 @@
+# Auto-Enrichment Recipe
+
+Phase 1 (sensor + scaffold).
+
+## What this directory contains
+
+```
+recipes/auto-enrich.md            # discoverable manifest (flat per integrations contract)
+recipes/auto-enrich/
+  README.md                        # this file
+  config.yaml                      # tunables: weights, thresholds, paths
+  scripts/
+    auto_enrich_lib.py             # Heartbeat class + gbrain subprocess wrapper
+    detect_sparse.py               # sensor: ranks sparse/orphan/stale pages
+    run_sensor.sh                  # one-shot bash entry point
+  tests/
+    test_detect_sparse.py          # TDD coverage for the sensor
+```
+
+Runtime state lives at `~/.gbrain/integrations/auto-enrich/` (not committed):
+
+- `heartbeat.jsonl` (append-only health log, pruned to 30 days by `gbrain integrations`)
+- `metrics.jsonl` (Phase 3)
+- `escalations.jsonl` (Phase 3)
+
+## Running the sensor
+
+```bash
+python3 recipes/auto-enrich/scripts/detect_sparse.py --limit 5
+```
+
+CLI flags:
+
+- `--limit N`: maximum candidates to return after ranking (default: `sensor.max_candidates_per_run` from config, which ships as 5)
+- `--config PATH`: alternate config.yaml location
+- `--output PATH`: write JSON to file instead of stdout
+- `--types T1,T2`: comma-separated page types to scan (default: concept,entity,person,company)
+- `--candidate-pool N`: how many oldest-updated pages to inspect per type before scoring (default: 50)
+
+Exit codes:
+
+- 0: success, ranked JSON on stdout (or written to `--output`)
+- 1: gbrain subprocess error
+- 2: config parse error
+
+## How the sensor works
+
+1. Enumerate candidates per page type using `gbrain list --type <T> --sort updated_asc --limit <pool>`. The TSV columns are `slug, type, date, title`.
+2. For each candidate, `gbrain get <slug>` returns the markdown. Parse the YAML frontmatter (between the first two `---` fences) with `yaml.safe_load`. The body length is `len(body_string)`, computed client-side.
+3. For each candidate, `gbrain backlinks <slug>` returns a JSON edge array. The inbound link count is the array length.
+4. Score each candidate:
+
+   ```
+   score =   w_body  * clamp(1 - body_length / target_body_length, 0, 1)
+           + w_links * clamp(1 - inbound_count / target_inbound_links, 0, 1)
+           + w_age   * clamp(days_since(last_enriched) / max_enrichment_age_days, 0, 1)
+   ```
+
+   When `last_enriched` is absent from the frontmatter (the common case until Phase 3 starts writing it), the age penalty maxes out at 1.0.
+5. Sort descending, truncate to `--limit`, emit JSON.
+
+## Discovery contract
+
+Integrations discovery in `src/commands/integrations.ts::loadAllRecipes` walks `recipes/*.md` (flat .md files only, no subdirectory recursion). The manifest is therefore at `recipes/auto-enrich.md`, not `recipes/auto-enrich/recipe.md`. The supporting tree under `recipes/auto-enrich/` is for code, not discovery.
+
+## Heartbeat contract
+
+`auto_enrich_lib.Heartbeat.emit(event, status, details, error)` appends one JSON line per call to `~/.gbrain/integrations/auto-enrich/heartbeat.jsonl`. The `details` and `error` keys are written only when a value is passed. The shape matches the format `gbrain integrations` consumes:
+
+```json
+{"ts": "2026-05-20T03:00:00Z", "event": "sensor_run", "source_version": "0.1.0", "status": "ok", "details": {"candidates_scanned": 50, "candidates_returned": 5}}
+```
+
+`gbrain integrations show auto-enrich` and `gbrain integrations status auto-enrich` read this file and surface the most recent entry.
+
+## Tests
+
+```bash
+cd recipes/auto-enrich
+python3 -m pytest tests/ -v
+```
+
+The test suite mocks the `gbrain` subprocess boundary so it does not touch the live brain. Live verification is documented in the deliverable report.
+
+## Phase boundaries
+
+- Phase 1 (this PR): sensor + recipe scaffold + heartbeat. No writes.
+- Phase 2: research strategy + Cal dispatch + research artifact schema.
+- Phase 3: quality gate + synthesize merge + cron registration + live smoke test.
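The heartbeat contract in the README can be exercised from Python with a small reader. This is a minimal sketch that assumes only the one-JSON-object-per-line layout shown there: `latest_heartbeat` is a hypothetical helper, not part of `auto_enrich_lib.py` or of `gbrain integrations`, and skipping unparseable lines instead of raising is a choice made here.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def latest_heartbeat(path: Path) -> dict[str, Any] | None:
    """Return the last well-formed JSON object in a heartbeat JSONL file.

    Hypothetical helper: returns None when the file is missing or holds no
    valid entry. Blank or unparseable lines are skipped, not raised.
    """
    if not path.exists():
        return None
    latest: dict[str, Any] | None = None
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            # Only JSON objects count as heartbeat entries.
            if isinstance(entry, dict):
                latest = entry
    return latest
```

A caller could pass `Path.home() / ".gbrain" / "integrations" / "auto-enrich" / "heartbeat.jsonl"` and check whether the returned entry's `status` is `"ok"`.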
diff --git a/recipes/auto-enrich/config.yaml b/recipes/auto-enrich/config.yaml
@@ -0,0 +1,32 @@
+# Auto-Enrichment config (Phase 1 sensor only)
+sensor:
+  max_candidates_per_run: 5
+  candidate_pool_per_type: 50
+  page_types:
+    - concept
+    - entity
+    - person
+    - company
+  target_body_length: 1500
+  target_inbound_links: 3
+  max_enrichment_age_days: 90
+  ranking_weights:
+    body: 0.4
+    links: 0.3
+    age: 0.3
+
+# Phase 2 + 3 knobs (kept here so the file is the single tunables surface)
+research:
+  cal_subagent_timeout_seconds: 300
+  max_parallel_research: 2
+  cal_model: claude-haiku-4-5
+
+quality_gate:
+  require_citations: true
+  reject_on_destructive_merge: true
+  lint_synthesized_draft: true
+
+runtime:
+  heartbeat_path: ~/.gbrain/integrations/auto-enrich/heartbeat.jsonl
+  metrics_path: ~/.gbrain/integrations/auto-enrich/metrics.jsonl
+  escalations_path: ~/.gbrain/integrations/auto-enrich/escalations.jsonl
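Two details of this config matter to any consumer. The `runtime` paths begin with `~`, which a YAML loader returns as literal text, and the three `ranking_weights` sum to 1.0, which keeps the weighted score within [0, 1] even before clamping. Below is a minimal sketch of checks a loader might apply to the already-parsed mapping. `resolve_runtime_paths` and `check_weights` are hypothetical names, and treating a sum other than 1.0 as an error is an assumption, not something the PR states.

```python
from __future__ import annotations

import math
from pathlib import Path
from typing import Any


def resolve_runtime_paths(runtime: dict[str, str]) -> dict[str, Path]:
    """Expand a leading ~ in each configured path to the user's home directory."""
    return {key: Path(value).expanduser() for key, value in runtime.items()}


def check_weights(weights: dict[str, float]) -> None:
    """Hypothetical validation: weights must be non-negative and sum to 1.0."""
    if any(w < 0 for w in weights.values()):
        raise ValueError(f"negative ranking weight in {weights}")
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"ranking weights sum to {total}, expected 1.0")


# Values copied from config.yaml, written as the dict a YAML loader would return.
config: dict[str, Any] = {
    "sensor": {"ranking_weights": {"body": 0.4, "links": 0.3, "age": 0.3}},
    "runtime": {"heartbeat_path": "~/.gbrain/integrations/auto-enrich/heartbeat.jsonl"},
}

check_weights(config["sensor"]["ranking_weights"])
paths = resolve_runtime_paths(config["runtime"])
```

Running the checks once at load time means a mistyped weight surfaces as a config error before any `gbrain` subprocess is started.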
diff --git a/recipes/auto-enrich/scripts/__init__.py b/recipes/auto-enrich/scripts/__init__.py
diff --git a/recipes/auto-enrich/scripts/auto_enrich_lib.py b/recipes/auto-enrich/scripts/auto_enrich_lib.py
@@ -0,0 +1,119 @@
+"""Shared helpers for the auto-enrich recipe.
+
+Subprocess wrapper around the gbrain CLI and a Heartbeat class that writes
+JSONL lines to ~/.gbrain/integrations/auto-enrich/heartbeat.jsonl in the
+shape that `gbrain integrations show / status` consumes.
+
+No Python client for gbrain exists; everything is subprocess-only. The
+wrapper raises GBrainCLIError on non-zero exit so callers can map to the
+sensor's exit codes (0 ok, 1 CLI error, 2 config error).
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import subprocess
+from dataclasses import dataclass
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+RECIPE_ID = "auto-enrich"
+RECIPE_VERSION = "0.1.0"
+
+DEFAULT_HEARTBEAT_PATH = Path.home() / ".gbrain" / "integrations" / RECIPE_ID / "heartbeat.jsonl"
+
+
+class GBrainCLIError(RuntimeError):
+    """Raised when a gbrain subprocess returns non-zero."""
+
+    def __init__(self, argv: list[str], returncode: int, stdout: str, stderr: str):
+        self.argv = argv
+        self.returncode = returncode
+        self.stdout = stdout
+        self.stderr = stderr
+        super().__init__(
+            f"gbrain {' '.join(argv[1:])} exited {returncode}: {stderr.strip() or stdout.strip()}"
+        )
+
+
+def run_gbrain(args: list[str], *, timeout: int = 60) -> str:
+    """Invoke `gbrain <args...>` and return stdout. Raises GBrainCLIError on non-zero exit or a missing binary (reported as 127);
+    a timeout propagates as subprocess.TimeoutExpired. Pattern matches recipes/web-to-brain/scripts/web_lib.py."""
+    argv = ["gbrain", *args]
+    try:
+        result = subprocess.run(
+            argv,
+            capture_output=True,
+            text=True,
+            timeout=timeout,
+            check=False,
+        )
+    except FileNotFoundError as exc:
+        raise GBrainCLIError(argv, 127, "", str(exc)) from exc
+    if result.returncode != 0:
+        raise GBrainCLIError(argv, result.returncode, result.stdout, result.stderr)
+    return result.stdout
+
+
+@dataclass
+class Heartbeat:
+    """Append-only JSONL heartbeat log. One line per recipe-run event.
+
+    Matches the shape `gbrain integrations` reads in
+    src/commands/integrations.ts::readHeartbeat (HeartbeatEntry).
+    """
+
+    path: Path = DEFAULT_HEARTBEAT_PATH
+    recipe_id: str = RECIPE_ID
+    source_version: str = RECIPE_VERSION
+
+    def emit(
+        self,
+        event: str,
+        status: str = "ok",
+        details: dict[str, Any] | None = None,
+        error: str | None = None,
+    ) -> None:
+        """Append one JSON line. Creates parent dirs on first write."""
+        entry: dict[str, Any] = {
+            "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+            "event": event,
+            "source_version": self.source_version,
+            "status": status,
+        }
+        if details is not None:
+            entry["details"] = details
+        if error is not None:
+            entry["error"] = error
+        self.path.parent.mkdir(parents=True, exist_ok=True)
+        with self.path.open("a", encoding="utf-8") as fh:
+            fh.write(json.dumps(entry) + "\n")
+
+
+def parse_frontmatter(markdown: str) -> tuple[dict[str, Any], str]:
+    """Split a page returned by `gbrain get` into (frontmatter_dict, body_str).
+
+    The page format is:
+        ---\n<yaml>\n---\n<body>
+
+    Pages without leading frontmatter return ({}, full_text). YAML errors and
+    non-mapping frontmatter also return ({}, full_text), so a single malformed
+    page cannot crash the sensor.
+    """
+    import yaml  # local import keeps top-level import side-effect free
+
+    if not markdown.startswith("---"):
+        return {}, markdown
+    parts = markdown.split("---", 2)
+    if len(parts) < 3:
+        return {}, markdown
+    _, fm_text, body = parts
+    try:
+        data = yaml.safe_load(fm_text) or {}
+    except yaml.YAMLError:
+        return {}, markdown
+    if not isinstance(data, dict):
+        return {}, markdown
+    return data, body.lstrip("\n")
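`parse_frontmatter` splits with `markdown.split("---", 2)`, which cuts at the first two occurrences of `---` anywhere in the text, so a frontmatter value that itself contains `---` would end the frontmatter early. The sketch below shows one stricter alternative that only accepts fences sitting on their own lines. It is an illustration using the standard library, not the PR's code: `split_frontmatter` is a hypothetical name, and it returns the raw frontmatter text so that YAML parsing (for example with `yaml.safe_load`, as in the function above) stays a separate step.

```python
from __future__ import annotations

import re

# Opening fence on the first line, closing fence on its own later line.
_FRONTMATTER_RE = re.compile(
    r"\A---[ \t]*\r?\n"         # opening fence must be the very first line
    r"(?P<fm>.*?)"              # frontmatter text (non-greedy, may be empty)
    r"^---[ \t]*(?:\r?\n|\Z)"   # closing fence at the start of a line
    r"(?P<body>.*)\Z",          # everything after the closing fence
    re.DOTALL | re.MULTILINE,
)


def split_frontmatter(markdown: str) -> tuple[str, str]:
    """Split a page into (frontmatter_text, body).

    Returns ("", markdown) when the page has no well-formed fenced block.
    """
    match = _FRONTMATTER_RE.match(markdown)
    if match is None:
        return "", markdown
    return match.group("fm"), match.group("body").lstrip("\n")


page = "---\ntitle: a---b\n---\nBody text\n"
fm_text, body = split_frontmatter(page)
```

With this input the plain `split("---", 2)` approach would stop the frontmatter at `title: a`, whereas the line-anchored pattern keeps the whole `title: a---b` line in the frontmatter.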