74 changes: 74 additions & 0 deletions recipes/auto-enrich.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
---
id: auto-enrich
name: Auto-Enrichment Recipe
version: 0.1.0
description: Nightly Cal dispatch that enriches sparse/orphan brain pages with research, citations, and cross-links via claude-haiku-4-5.
category: sense
requires: []
secrets: []
health_checks:
  - type: env_exists
    name: HOME
    label: Heartbeat directory reachable
setup_time: 30 min
cost_estimate: "$3-5/mo (claude-haiku-4-5, nightly cadence)"
---

# Auto-Enrichment Recipe

Nightly sensor that detects sparse, orphan, or stale entity pages in your brain and (in later phases) dispatches a Cal subagent to research and enrich them. Phase 1 ships the sensor and the recipe scaffold only. Research, quality gate, merge, and cron registration arrive in Phases 2 and 3.

## What this is for

Your brain accumulates pages that start as stubs: a person mentioned once, a company with a name and nothing else, a concept that never got fleshed out. Auto-enrich finds those pages on a regular cadence, ranks them by how sparse they are, and (in Phase 3) feeds a research artifact back through a quality gate before merging.

Phase 1 (this PR) delivers:

- Recipe manifest discoverable via `gbrain integrations list`
- Sensor (`scripts/detect_sparse.py`) that ranks candidate pages via CLI composition (`gbrain list`, `gbrain get`, `gbrain backlinks`)
- Heartbeat logging to `~/.gbrain/integrations/auto-enrich/heartbeat.jsonl`
- TDD test suite for the sensor

## Usage (Phase 1, sensor only)

Run the sensor against your live brain:

```bash
bash recipes/auto-enrich/scripts/run_sensor.sh
```

Or directly:

```bash
python3 recipes/auto-enrich/scripts/detect_sparse.py --limit 5
```

Output is JSON to stdout: a ranked list of `{slug, score, reason, page_type}` records. The script also appends one heartbeat line per run to `~/.gbrain/integrations/auto-enrich/heartbeat.jsonl`.
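
As a rough illustration of consuming that output, the sketch below parses a hand-written sample payload and checks the documented record keys. It assumes the payload is a top-level JSON array; the slugs, scores, and reason strings are invented for the example and are not real sensor output, and `load_candidates` is a hypothetical helper name.

```python
import json

# Hypothetical sample of sensor output; all values are illustrative only.
SAMPLE_OUTPUT = """
[
  {"slug": "people/example-person", "score": 0.82,
   "reason": "short body, few backlinks", "page_type": "person"},
  {"slug": "companies/example-co", "score": 0.64,
   "reason": "never enriched", "page_type": "company"}
]
"""

EXPECTED_KEYS = {"slug", "score", "reason", "page_type"}


def load_candidates(raw: str) -> list:
    """Parse sensor JSON and verify each record carries the documented keys."""
    records = json.loads(raw)
    for record in records:
        missing = EXPECTED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {record!r} is missing keys: {sorted(missing)}")
    return records


candidates = load_candidates(SAMPLE_OUTPUT)
top = candidates[0]  # the list is ranked, so the first record scores highest
print(top["slug"], top["score"])
```

A downstream phase could pipe the sensor's stdout into a function like this before deciding which pages to research.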

## Ranking signal

For each candidate page the sensor computes:

- `body_length_penalty` from the body text length returned by `gbrain get <slug>` (target: 1500 chars)
- `link_starvation_penalty` from the inbound edge count returned by `gbrain backlinks <slug>` (target: 3 inbound links)
- `enrichment_age_penalty` from the `last_enriched` frontmatter field if present, otherwise treated as never enriched (target: 90 days)

The score is a weighted sum of the three penalties (default weights 0.4 / 0.3 / 0.3, in the order listed above), clamped to [0, 1]. Higher scores rank first.
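
A minimal sketch of how those three penalties could combine, using the targets and default weights listed above. The function and parameter names (`sparse_score`, `days_since_enriched`) are assumptions made for illustration; the actual `detect_sparse.py` may structure this differently.

```python
from typing import Optional


def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def sparse_score(
    body_length: int,
    inbound_count: int,
    days_since_enriched: Optional[float],
    *,
    target_body_length: int = 1500,
    target_inbound_links: int = 3,
    max_age_days: int = 90,
    weights: tuple = (0.4, 0.3, 0.3),
) -> float:
    """Weighted sum of body, link, and age penalties (hypothetical helper)."""
    w_body, w_links, w_age = weights
    body_penalty = clamp(1 - body_length / target_body_length)
    link_penalty = clamp(1 - inbound_count / target_inbound_links)
    # A page with no last_enriched value is treated as never enriched.
    if days_since_enriched is None:
        age_penalty = 1.0
    else:
        age_penalty = clamp(days_since_enriched / max_age_days)
    return w_body * body_penalty + w_links * link_penalty + w_age * age_penalty


# A 300-char stub with one backlink that was never enriched:
# 0.4 * 0.8 + 0.3 * (2/3) + 0.3 * 1.0, roughly 0.82
stub = sparse_score(300, 1, None)

# A long, well-linked page enriched today scores zero on every term.
healthy = sparse_score(2000, 5, 0)

print(round(stub, 2), healthy)
```

Because each penalty is clamped before weighting and the default weights sum to 1.0, the result stays within [0, 1] without a further clamp on the sum.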

## Schedule

Phase 1 has no cron registration. The cron is installed by `scripts/register-cron.sh` in Phase 3 (nightly at 03:00 local, delivered via Hermes cron, no Telegram noise on success).

## Config

See `config.yaml` for ranking weights, target thresholds, and runtime paths. Defaults are intentionally conservative; tune them once the first dry run shows the real candidate distribution.

## Files

- `auto-enrich.md` (this manifest, flat at `recipes/` per integrations discovery contract)
- `auto-enrich/README.md` (human-readable extended docs)
- `auto-enrich/config.yaml` (tunables)
- `auto-enrich/scripts/detect_sparse.py` (sensor)
- `auto-enrich/scripts/auto_enrich_lib.py` (Heartbeat + subprocess wrapper)
- `auto-enrich/scripts/run_sensor.sh` (one-shot invoker)
- `auto-enrich/tests/` (pytest suite)
89 changes: 89 additions & 0 deletions recipes/auto-enrich/README.md
@@ -0,0 +1,89 @@
# Auto-Enrichment Recipe

Phase 1 (sensor + scaffold).

## What this directory contains

```
recipes/auto-enrich.md          # discoverable manifest (flat per integrations contract)
recipes/auto-enrich/
  README.md                     # this file
  config.yaml                   # tunables: weights, thresholds, paths
  scripts/
    auto_enrich_lib.py          # Heartbeat class + gbrain subprocess wrapper
    detect_sparse.py            # sensor: ranks sparse/orphan/stale pages
    run_sensor.sh               # one-shot bash entry point
  tests/
    test_detect_sparse.py       # TDD coverage for the sensor
```

Runtime state lives at `~/.gbrain/integrations/auto-enrich/` (not committed):

- `heartbeat.jsonl` (append-only health log, pruned to 30 days by `gbrain integrations`)
- `metrics.jsonl` (Phase 3)
- `escalations.jsonl` (Phase 3)

## Running the sensor

```bash
python3 recipes/auto-enrich/scripts/detect_sparse.py --limit 5
```

CLI flags:

- `--limit N`: maximum candidates to return after ranking (default: from config, 5)
- `--config PATH`: alternate config.yaml location
- `--output PATH`: write JSON to file instead of stdout
- `--types T1,T2`: comma-separated page types to scan (default: concept,entity,person,company)
- `--candidate-pool N`: number of least-recently-updated pages to inspect per type before scoring (default: 50)

Exit codes:

- 0: success, ranked JSON on stdout (or written to `--output`)
- 1: gbrain subprocess error
- 2: config parse error
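
One way an entry point might map failures onto these exit codes is sketched below. `ConfigError` and `exit_code_for` are hypothetical names introduced for the illustration, and `GBrainCLIError` here is a trimmed stand-in for the class in `auto_enrich_lib.py`; the real `detect_sparse.py` may wire this differently.

```python
import sys

EXIT_OK = 0
EXIT_CLI_ERROR = 1
EXIT_CONFIG_ERROR = 2


class GBrainCLIError(RuntimeError):
    """Stand-in for auto_enrich_lib.GBrainCLIError."""


class ConfigError(ValueError):
    """Hypothetical error type for an unreadable or malformed config.yaml."""


def exit_code_for(run) -> int:
    """Call run() and translate known failures into the documented exit codes."""
    try:
        run()
    except ConfigError as exc:
        print(f"config error: {exc}", file=sys.stderr)
        return EXIT_CONFIG_ERROR
    except GBrainCLIError as exc:
        print(f"gbrain error: {exc}", file=sys.stderr)
        return EXIT_CLI_ERROR
    return EXIT_OK


def failing_cli():
    raise GBrainCLIError("gbrain list exited 1")


def failing_config():
    raise ConfigError("ranking_weights is not a mapping")


codes = [exit_code_for(lambda: None), exit_code_for(failing_cli), exit_code_for(failing_config)]
print(codes)
```

A script would typically end with `sys.exit(exit_code_for(main))` so that a calling cron job or shell wrapper can branch on the code.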

## How the sensor works

1. Enumerate candidates per page type using `gbrain list --type <T> --sort updated_asc --limit <pool>`. The TSV columns are `slug, type, date, title`.
2. For each candidate, `gbrain get <slug>` returns the markdown. Parse the YAML frontmatter (between the first two `---` fences) with `yaml.safe_load`. The body length is `len(body_string)`, computed client-side.
3. For each candidate, `gbrain backlinks <slug>` returns a JSON edge array. The inbound link count is the array length.
4. Score each candidate:

```
score = w_body  * clamp(1 - body_length / target_body_length, 0, 1)
      + w_links * clamp(1 - inbound_count / target_inbound_links, 0, 1)
      + w_age   * clamp(days_since(last_enriched) / max_age_days, 0, 1)
```

When `last_enriched` is absent from the frontmatter (the common case until Phase 3 starts writing it), the age penalty maxes out at 1.0.
5. Sort descending, truncate to `--limit`, emit JSON.
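
The enumeration, backlink-count, and ranking steps above can be sketched roughly as follows. This assumes the TSV has no header row and that each backlink edge is a JSON object; the sample rows, the `from` key, and the helper names are invented for the example and say nothing about the real `gbrain` output beyond the column order documented above.

```python
import json

LIST_COLUMNS = ("slug", "type", "date", "title")


def parse_list_tsv(tsv_text: str) -> list:
    """Turn `gbrain list` TSV output into dicts keyed by the documented columns."""
    rows = []
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) == len(LIST_COLUMNS):  # skip blank or malformed lines
            rows.append(dict(zip(LIST_COLUMNS, fields)))
    return rows


def inbound_count(backlinks_json: str) -> int:
    """The inbound link count is the length of the JSON edge array."""
    return len(json.loads(backlinks_json))


def rank(scored: list, limit: int) -> list:
    """Sort by score, highest first, and keep at most `limit` records."""
    return sorted(scored, key=lambda rec: rec["score"], reverse=True)[:limit]


# Hypothetical captured outputs standing in for real subprocess calls.
SAMPLE_TSV = (
    "people/a\tperson\t2025-01-02\tExample A\n"
    "companies/b\tcompany\t2025-02-03\tExample B\n"
)
SAMPLE_BACKLINKS = '[{"from": "notes/x"}, {"from": "notes/y"}]'

rows = parse_list_tsv(SAMPLE_TSV)
links = inbound_count(SAMPLE_BACKLINKS)
ranked = rank(
    [{"slug": "people/a", "score": 0.4}, {"slug": "companies/b", "score": 0.9}],
    limit=1,
)
print(rows[0]["slug"], links, ranked[0]["slug"])
```

Keeping parsing separate from scoring means each piece can be tested against captured CLI output without invoking `gbrain`.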

## Discovery contract

Integrations discovery in `src/commands/integrations.ts::loadAllRecipes` walks `recipes/*.md` (flat .md files only, no subdirectory recursion). The manifest is therefore at `recipes/auto-enrich.md`, not `recipes/auto-enrich/recipe.md`. The supporting tree under `recipes/auto-enrich/` is for code, not discovery.
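
To illustrate the flat-discovery behaviour described here (not the TypeScript in `loadAllRecipes` itself), the Python sketch below builds a throwaway directory tree and shows that a non-recursive `*.md` glob finds the manifest but not the README in the subdirectory. `discover_manifests` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def discover_manifests(recipes_dir: Path) -> list:
    """Return names of .md files directly inside recipes_dir (no recursion)."""
    return sorted(p.name for p in recipes_dir.glob("*.md") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    recipes = Path(tmp) / "recipes"
    (recipes / "auto-enrich").mkdir(parents=True)
    (recipes / "auto-enrich.md").write_text("manifest", encoding="utf-8")
    (recipes / "auto-enrich" / "README.md").write_text("docs", encoding="utf-8")
    found = discover_manifests(recipes)

print(found)  # ['auto-enrich.md']
```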

## Heartbeat contract

`auto_enrich_lib.Heartbeat.emit(event, status, details, error)` appends one JSON line per call to `~/.gbrain/integrations/auto-enrich/heartbeat.jsonl`; `details` and `error` are optional and are omitted from the entry when not supplied. The shape matches the format `gbrain integrations` consumes:

```json
{"ts": "2026-05-20T03:00:00Z", "event": "sensor_run", "source_version": "0.1.0", "status": "ok", "details": {"candidates_scanned": 50, "candidates_returned": 5}}
```

`gbrain integrations show auto-enrich` and `gbrain integrations status auto-enrich` read this file and surface the most recent entry.
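
For ad hoc inspection outside `gbrain integrations`, a reader of the same file might look like the sketch below. `latest_heartbeat` is a hypothetical helper, and skipping unparseable lines is an assumption about how a tolerant reader could behave, not a description of `readHeartbeat`.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def latest_heartbeat(path: Path) -> Optional[dict]:
    """Return the last parseable JSON line in an append-only JSONL log."""
    if not path.exists():
        return None
    latest = None
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                latest = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written line
    return latest


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "heartbeat.jsonl"
    log.write_text(
        '{"ts": "2026-05-19T03:00:00Z", "event": "sensor_run", "status": "ok"}\n'
        '{"ts": "2026-05-20T03:00:00Z", "event": "sensor_run", "status": "ok"}\n',
        encoding="utf-8",
    )
    newest = latest_heartbeat(log)

print(newest["ts"])
```

Since the log is append-only, the last line is the most recent entry, so no timestamp comparison is needed.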

## Tests

```bash
cd recipes/auto-enrich
python3 -m pytest tests/ -v
```

The test suite mocks the `gbrain` subprocess boundary so it does not touch the live brain. Live verification is documented in the deliverable report.
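
A sketch of what mocking that boundary can look like with `unittest.mock`. The `run_gbrain` below is a trimmed stand-in for the wrapper in `auto_enrich_lib.py`, and the test name and sample TSV row are invented; the actual suite in `tests/` may patch at a different level.

```python
import subprocess
from unittest import mock


def run_gbrain(args, *, timeout=60):
    """Trimmed stand-in for auto_enrich_lib.run_gbrain."""
    result = subprocess.run(
        ["gbrain", *args], capture_output=True, text=True, timeout=timeout, check=False
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip() or result.stdout.strip())
    return result.stdout


def test_run_gbrain_returns_stdout_without_spawning():
    # A canned CompletedProcess replaces the real CLI call.
    fake = subprocess.CompletedProcess(
        args=["gbrain", "list"],
        returncode=0,
        stdout="people/a\tperson\t2025-01-02\tExample A\n",
        stderr="",
    )
    with mock.patch("subprocess.run", return_value=fake) as patched:
        out = run_gbrain(["list", "--type", "person"])
    assert out.startswith("people/a")
    patched.assert_called_once()


test_run_gbrain_returns_stdout_without_spawning()
```

Patching `subprocess.run` means no `gbrain` binary needs to be installed where the tests run.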

## Phase boundaries

- Phase 1 (this PR): sensor + recipe scaffold + heartbeat. No writes.
- Phase 2: research strategy + Cal dispatch + research artifact schema.
- Phase 3: quality gate + synthesize merge + cron registration + live smoke test.
32 changes: 32 additions & 0 deletions recipes/auto-enrich/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Auto-Enrichment config (Phase 1 sensor only)
sensor:
  max_candidates_per_run: 5
  candidate_pool_per_type: 50
  page_types:
    - concept
    - entity
    - person
    - company
  target_body_length: 1500
  target_inbound_links: 3
  max_enrichment_age_days: 90
  ranking_weights:
    body: 0.4
    links: 0.3
    age: 0.3

# Phase 2 + 3 knobs (kept here so the file is the single tunables surface)
research:
  cal_subagent_timeout_seconds: 300
  max_parallel_research: 2
  cal_model: claude-haiku-4-5

quality_gate:
  require_citations: true
  reject_on_destructive_merge: true
  lint_synthesized_draft: true

runtime:
  heartbeat_path: ~/.gbrain/integrations/auto-enrich/heartbeat.jsonl
  metrics_path: ~/.gbrain/integrations/auto-enrich/metrics.jsonl
  escalations_path: ~/.gbrain/integrations/auto-enrich/escalations.jsonl
Empty file.
119 changes: 119 additions & 0 deletions recipes/auto-enrich/scripts/auto_enrich_lib.py
@@ -0,0 +1,119 @@
"""Shared helpers for the auto-enrich recipe.

Subprocess wrapper around the gbrain CLI and a Heartbeat class that writes
JSONL lines to ~/.gbrain/integrations/auto-enrich/heartbeat.jsonl in the
shape that `gbrain integrations show / status` consumes.

No Python client for gbrain exists; everything is subprocess-only. The
wrapper raises GBrainCLIError on non-zero exit so callers can map to the
sensor's exit codes (0 ok, 1 CLI error, 2 config error).
"""

from __future__ import annotations

import json
import os
import subprocess
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

RECIPE_ID = "auto-enrich"
RECIPE_VERSION = "0.1.0"

DEFAULT_HEARTBEAT_PATH = Path.home() / ".gbrain" / "integrations" / RECIPE_ID / "heartbeat.jsonl"


class GBrainCLIError(RuntimeError):
    """Raised when a gbrain subprocess returns non-zero."""

    def __init__(self, argv: list[str], returncode: int, stdout: str, stderr: str):
        self.argv = argv
        self.returncode = returncode
        self.stdout = stdout
        self.stderr = stderr
        super().__init__(
            f"gbrain {' '.join(argv[1:])} exited {returncode}: {stderr.strip() or stdout.strip()}"
        )


def run_gbrain(args: list[str], *, timeout: int = 60) -> str:
    """Invoke `gbrain <args...>` and return stdout. Raises GBrainCLIError on
    non-zero. Pattern matches recipes/web-to-brain/scripts/web_lib.py."""
    argv = ["gbrain", *args]
    try:
        result = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except FileNotFoundError as exc:
        raise GBrainCLIError(argv, 127, "", str(exc)) from exc
    if result.returncode != 0:
        raise GBrainCLIError(argv, result.returncode, result.stdout, result.stderr)
    return result.stdout


@dataclass
class Heartbeat:
    """Append-only JSONL heartbeat log. One line per recipe-run event.

    Matches the shape `gbrain integrations` reads in
    src/commands/integrations.ts::readHeartbeat (HeartbeatEntry).
    """

    path: Path = DEFAULT_HEARTBEAT_PATH
    recipe_id: str = RECIPE_ID
    source_version: str = RECIPE_VERSION

    def emit(
        self,
        event: str,
        status: str = "ok",
        details: dict[str, Any] | None = None,
        error: str | None = None,
    ) -> None:
        """Append one JSON line. Creates parent dirs on first write."""
        entry: dict[str, Any] = {
            "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "event": event,
            "source_version": self.source_version,
            "status": status,
        }
        if details is not None:
            entry["details"] = details
        if error is not None:
            entry["error"] = error
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")


def parse_frontmatter(markdown: str) -> tuple[dict[str, Any], str]:
    """Split a page returned by `gbrain get` into (frontmatter_dict, body_str).

    The page format is:
        ---\n<yaml>\n---\n<body>

    Pages without leading frontmatter return ({}, full_text). YAML errors
    and frontmatter that does not parse to a mapping also return
    ({}, full_text) rather than crashing the sensor on a single malformed
    page.
    """
    import yaml  # local import keeps top-level import side-effect free

    if not markdown.startswith("---"):
        return {}, markdown
    # Split on the first two fences only, so a later "---" stays in the body.
    parts = markdown.split("---", 2)
    if len(parts) < 3:
        return {}, markdown
    _, fm_text, body = parts
    try:
        data = yaml.safe_load(fm_text) or {}
    except yaml.YAMLError:
        return {}, markdown
    if not isinstance(data, dict):
        return {}, markdown
    return data, body.lstrip("\n")