26 changes: 26 additions & 0 deletions README.md
@@ -585,6 +585,32 @@ Query

Keyword alone misses conceptual matches. Vector alone misses exact phrases. RRF gets both. Search quality is benchmarked and reproducible: `gbrain eval --qrels queries.json` measures P@k, Recall@k, MRR, and nDCG@k. A/B test config changes before deploying them.
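
As a rough illustration of how rank fusion rewards documents that appear in both result lists, here is a minimal sketch of Reciprocal Rank Fusion. The function name `rrfFuse`, the smoothing constant `k = 60`, and the sample slugs are assumptions for illustration; this is not gbrain's actual implementation.

```typescript
// Hypothetical sketch of Reciprocal Rank Fusion (RRF); not gbrain's actual code.
// Each input list is ordered best-first. k = 60 is a commonly used smoothing
// constant and is an assumption here.
function rrfFuse(rankings: string[][], k: number = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // index is 0-based, so the rank is index + 1.
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}

const keywordHits = ["page-a", "page-b", "page-c"]; // illustrative keyword ranking
const vectorHits = ["page-c", "page-a", "page-d"];  // illustrative vector ranking
const fused = rrfFuse([keywordHits, vectorHits]);
console.log(fused.map(([id]) => id)); // ["page-a", "page-c", "page-b", "page-d"]
```

Pages found by both retrievers (`page-a`, `page-c`) outrank pages found by only one, which is the behavior the paragraph above describes.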

### Non-English brains (FTS language config)

The Postgres full-text search configuration (tokenizer and stemmer) is set via `GBRAIN_FTS_LANGUAGE` and defaults to `english`. Set it to any text-search configuration that exists in your Postgres instance:

```bash
# Pick one:
export GBRAIN_FTS_LANGUAGE=portuguese  # built-in Portuguese stemmer
export GBRAIN_FTS_LANGUAGE=spanish     # built-in Spanish stemmer
export GBRAIN_FTS_LANGUAGE=pt_br       # custom config (e.g. unaccent + portuguese)
```

List available configs: `psql -c "SELECT cfgname FROM pg_ts_config"`. To create a custom accent-insensitive Portuguese config, see [docs/guides/multi-language-fts.md](docs/guides/multi-language-fts.md).

Both the **query side** (`websearch_to_tsquery`) and the **write side** (the trigger functions that populate `pages.search_vector` and `content_chunks.search_vector`) honor `GBRAIN_FTS_LANGUAGE`. On first install, schema migration v33 reads the env var and creates trigger functions in the configured language; subsequent inserts/updates tokenize using that setting.
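
To make the query-side behavior concrete, the sketch below shows one way a caller might resolve the setting and pass it to `websearch_to_tsquery`. The helper names (`resolveFtsLanguage`, `buildSearchQuery`), the identifier pattern, and the `id` column are assumptions for illustration; only the env var name, the `english` default, and `pages.search_vector` come from the text above.

```typescript
// Hypothetical helpers; names are illustrative, not gbrain's actual API.
type Env = Record<string, string | undefined>;

// Text-search config names are SQL identifiers. Restricting them to a
// conservative lowercase pattern is an assumption made for this sketch.
const CONFIG_NAME = /^[a-z_][a-z0-9_]*$/;

function resolveFtsLanguage(env: Env): string {
  const raw = (env["GBRAIN_FTS_LANGUAGE"] ?? "").trim();
  if (raw === "") return "english"; // documented default
  if (!CONFIG_NAME.test(raw)) {
    throw new Error(`Invalid GBRAIN_FTS_LANGUAGE: ${JSON.stringify(raw)}`);
  }
  return raw;
}

// Builds a parameterized query. The language is sent as a bound parameter and
// cast to regconfig, so it is never spliced into the SQL text.
function buildSearchQuery(language: string, userQuery: string): { text: string; values: string[] } {
  return {
    text:
      "SELECT id FROM pages " + // `id` is an assumed column name
      "WHERE search_vector @@ websearch_to_tsquery($1::regconfig, $2)",
    values: [language, userQuery],
  };
}

const query = buildSearchQuery(resolveFtsLanguage({ GBRAIN_FTS_LANGUAGE: "portuguese" }), "relatório mensal");
console.log(query.values[0]); // "portuguese"
```

If the write-side triggers were created with a different configuration than the one used at query time, stemmed tokens may not match, which is why both sides need to read the same setting.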

To change language on a brain that has already run v33, use the dedicated CLI command:

```bash
export GBRAIN_FTS_LANGUAGE=portuguese
gbrain reindex-search-vector --dry-run # preview row counts
gbrain reindex-search-vector --yes # recreate triggers + backfill
```

The command is idempotent (re-running with the same language is a no-op for vector content) and uses the same recreate-and-backfill primitives as v33.

For accent-insensitive Portuguese (`pt_br`), see [docs/guides/multi-language-fts.md](docs/guides/multi-language-fts.md) for the `unaccent` + portuguese stemmer recipe.

## Why it works: many strategies in concert

The brain isn't one trick. Every retrieval question goes through ~20 deterministic
6 changes: 0 additions & 6 deletions TODOS.md
@@ -1,6 +1,5 @@
# TODOS


## v0.35.6.0 floor-ratio gate follow-ups (v0.36.x+)

- [ ] **v0.36.x: Run gbrain-side floor-ratio ablation before flipping any mode-bundle default.** v0.35.6.0 ships the gate default-off (`MODE_BUNDLES[*].floor_ratio = undefined`) because the SkyTwin labeled-retrieval ablation that surfaced the regression isn't reproducible on gbrain's own eval surfaces from outside. Before any mode-bundle default flip, run the gate at `floor_ratio: undefined`, 0.85, 0.90, 0.95 across `gbrain eval longmemeval`, `gbrain eval whoknows`, `gbrain eval suspected-contradictions`, and the BrainBench-Real replay (sibling gbrain-evals repo). Quantify per-mode P@k / R@k / nDCG@k / top-1 stability deltas. Look for: regression on queries that genuinely need the long-tail boost (specific entity lookups, low-frequency topics) vs improvement on queries where weak-overlap pages were leapfrogging. The corpus-level finding determines whether tokenmax (most exposure to the failure mode) should flip first, or whether the gate stays a per-call opt-in indefinitely. Filed during v0.35.6.0 codex outside-voice review.
@@ -11,14 +10,12 @@

- [ ] **v0.36.x: Reranker top-N expansion when floor-ratio narrows the candidate pool.** Floor-ratio can suppress a legitimate candidate that would have made it to the reranker's top-N. Sanity check after the v0.36 ablation: if tokenmax with `floor_ratio: 0.85` and `reranker_top_n_in: 30` shows the reranker seeing a meaningfully different set than without the gate, consider expanding `reranker_top_n_in` when floor is set (e.g. 30 → 40) so the reranker still has 30 floor-eligible candidates to reorder. Cheap mitigation if the data supports it. Not a blocker.


## dreamy-thompson wave follow-ups (v0.36.x)

- [ ] **v0.36.x: runThink full rewrite — drop ThinkLLMClient indirection.** v0.36's fix(think) wave landed a gateway-backed adapter at `src/core/think/index.ts:225-251` so `gbrain config set anthropic_api_key` works over MCP stdio (closed #952). The adapter routes through `gateway.chat()` but `runThink` still carries the `ThinkLLMClient` interface as the test seam — it's the last LLM-using path that doesn't use the canonical `__setChatTransportForTests` seam v0.31.12 established for chat/embed. Cleanup: drop `ThinkLLMClient`, drop the `opts.client` injection point, migrate the 12+ existing tests (`test/think-pipeline.serial.test.ts:144,181,222`, `test/think-gateway-adapter.test.ts`, plus 9+ others that stub the interface) to `__setChatTransportForTests`. Pros: codebase consistency, one fewer test-stub pattern, easier to add provider switching for think once it routes through gateway natively. Cons: 12+ test files need migration. Blocked by: v0.36 wave landing on master (so the adapter exists to lean on while migrating tests). Plan reference: D5 + D7 in `~/.claude/plans/ok-i-spun-up-dreamy-thompson.md`.

- [ ] **v0.36.x: Supabase parity test fixture for `applyForwardReferenceBootstrap`.** v0.36 fixed the underlying bug (bootstrap now uses the DDL connection from `initSchema` so probes run inside the advisory-lock scope) per codex P1 from /ship adversarial review. What remains is the TEST FIXTURE that proves it: the new pre-v18/pre-v34/pre-v60 E2E tests run against local Docker Postgres but not against Supabase-shape pooler topology (transaction pooler + statement_timeout). Real Supabase upgrades have failed multiple times on this exact connection-topology divergence (#699, #820 lineage). Fix: a test fixture that exercises the probe path against deriveDirectUrl + transaction pooler + statement_timeout. Cons: requires Supabase fixture infra OR careful mocking of the connection-selection logic in `db.ts`'s `getDDLConnection` path.


## kinshasa-v3 follow-ups (v0.35.4.0)

- [ ] **v0.36.x: Fix `supervisor-audit.ts:77` `readSupervisorEvents` to use the dual-week-aware pattern from `stub-guard-audit.ts:readRecentStubGuardEvents`.** The supervisor reader only reads the current ISO-week file, so a 24h sliding window across Monday 00:00 UTC silently loses Sunday's events (they're in last week's file). The new stub-guard reader in v0.35.4.0 fixes this for its own audit log by reading BOTH current and previous week files before timestamp-filtering — the supervisor reader should adopt the same shape. Pin with a unit test that uses a fake-clock fixture set to "Monday 00:01 UTC" with a Sunday 23:55 event in the prior file. Filed during v0.35.4.0 kinshasa-v3 codex outside-voice review.
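
The dual-week pattern described in this item can be sketched as follows. The week-key format, the `AuditEvent` shape, and the injected `loadWeek` callback are assumptions for illustration; the real readers in `supervisor-audit.ts` and `stub-guard-audit.ts` may be shaped differently.

```typescript
// Hypothetical sketch; file naming and event shape are assumptions.
const DAY_MS = 24 * 60 * 60 * 1000;

interface AuditEvent {
  ts: string; // ISO-8601 timestamp
  message: string;
}

// ISO-8601 week key such as "2024-W02", computed in UTC.
function isoWeekKey(date: Date): string {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const dayOfWeek = d.getUTCDay() === 0 ? 7 : d.getUTCDay(); // Mon=1 ... Sun=7
  // Move to the Thursday of this week; its calendar year is the ISO week-year.
  d.setUTCDate(d.getUTCDate() + 4 - dayOfWeek);
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / DAY_MS + 1) / 7);
  return `${d.getUTCFullYear()}-W${week < 10 ? "0" : ""}${week}`;
}

// Reads the previous AND current week before filtering by timestamp, so a
// window that straddles Monday 00:00 UTC still sees Sunday's events.
function readRecentEvents(
  now: Date,
  windowMs: number,
  loadWeek: (weekKey: string) => AuditEvent[]
): AuditEvent[] {
  const weekKeys = [isoWeekKey(new Date(now.getTime() - 7 * DAY_MS)), isoWeekKey(now)];
  const cutoff = now.getTime() - windowMs;
  const recent: AuditEvent[] = [];
  for (const key of weekKeys) {
    for (const event of loadWeek(key)) {
      const t = Date.parse(event.ts);
      if (t >= cutoff && t <= now.getTime()) recent.push(event);
    }
  }
  return recent;
}
```

Injecting `loadWeek` keeps the sketch free of filesystem access and makes the suggested fake-clock unit test straightforward: pass a fixed `now` and an in-memory map of week files.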
@@ -29,7 +26,6 @@

- [ ] **v0.36.x: Sweep the banned private-agent-name references out of `CHANGELOG.md`.** Three pre-existing lines in `CHANGELOG.md` (around lines 2537, 2606, 3304) reference the name that `scripts/check-privacy.sh` enforces against. Pre-existing on master, not introduced by v0.35.4.0; `CHANGELOG.md` is on the script's allow-list so master CI is green, but they still violate the spirit of CLAUDE.md's privacy rule (the allow-list is a meta-documentation exception, not a license to add new references). Replace with `your OpenClaw` or `Garry's OpenClaw` per the script's own suggestion text. Trivial cleanup PR. Filed during v0.35.4.0 privacy audit.


## embed --stale follow-ups (v0.34.4.0)

- [ ] **v0.35.x: Concurrent NULL→non-NULL upsert race in `embed.ts:429-443` + `postgres-engine.ts:1231`'s `COALESCE(EXCLUDED.embedding, content_chunks.embedding)`.** Two `embed --stale` workers (or `embed --stale` racing with a sync that re-embeds the same chunk) can have the slower writer overwrite the faster one's fresher embedding. Window is small (20 workers, all from the same `listStaleChunks` snapshot) but exists. Tractable fix: a `WHERE content_chunks.embedded_at < EXCLUDED.embedded_at OR content_chunks.embedding IS NULL` predicate on the upsert. Out of scope for v0.34.4.0 because the upsert is not in the diff; pre-existing bug. Filed during v0.34.4.0 codex outside-voice review.
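
The effect of the proposed predicate can be modeled in memory. The sketch below is only a model of the SQL condition quoted in this item, under the assumption that `embedded_at` reflects when each writer produced its embedding; the type and function names are hypothetical.

```typescript
// Hypothetical in-memory model of the guarded upsert; not the actual engine code.
interface StoredChunk {
  embedding: number[] | null;
  embeddedAt: number | null; // epoch milliseconds
}

interface IncomingEmbedding {
  embedding: number[];
  embeddedAt: number;
}

// Mirrors the proposed predicate:
//   content_chunks.embedded_at < EXCLUDED.embedded_at
//   OR content_chunks.embedding IS NULL
function shouldOverwrite(existing: StoredChunk, incoming: IncomingEmbedding): boolean {
  if (existing.embedding === null) return true;
  // A SQL comparison against NULL is never true, so a NULL embedded_at does not match.
  if (existing.embeddedAt === null) return false;
  return existing.embeddedAt < incoming.embeddedAt;
}

function guardedUpsert(existing: StoredChunk, incoming: IncomingEmbedding): StoredChunk {
  return shouldOverwrite(existing, incoming)
    ? { embedding: incoming.embedding, embeddedAt: incoming.embeddedAt }
    : existing;
}

// The faster writer lands a fresher embedding first; the slower writer's
// older embedding arrives afterwards and is ignored.
let row: StoredChunk = { embedding: null, embeddedAt: null };
row = guardedUpsert(row, { embedding: [0.2], embeddedAt: 200 });
row = guardedUpsert(row, { embedding: [0.1], embeddedAt: 100 });
console.log(row.embeddedAt); // 200
```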
@@ -52,7 +48,6 @@

- [ ] **v0.34.x: `hybrid.ts:223` explicit-pick refactor.** The SearchOpts rebuild manually picks fields from HybridSearchOpts. This is the bug shape that caused the original v0.34.1 P0 leak — a new SearchOpts field is silently dropped if not manually added here. The wave added `sourceId` + `sourceIds` to the pick; future fields will keep hitting this footgun. Fix: refactor to spread + TypeScript `Pick<>` helper that narrows HybridSearchOpts → SearchOpts type-safely.
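
One possible shape for a type-safe narrowing helper is sketched below. Note that it uses an exhaustive key map rather than the spread + `Pick<>` approach named in this item, because the key map turns a forgotten field into a compile error. The field lists are trimmed, hypothetical versions of `SearchOpts` and `HybridSearchOpts`, and `rrfK` / `vectorWeight` are invented hybrid-only fields.

```typescript
// Hypothetical shapes; the real SearchOpts / HybridSearchOpts have other fields.
interface SearchOpts {
  limit?: number;
  sourceId?: string;
  sourceIds?: string[];
}

interface HybridSearchOpts extends SearchOpts {
  rrfK?: number; // invented hybrid-only field that must not leak into SearchOpts
  vectorWeight?: number; // invented hybrid-only field
}

// Adding a field to SearchOpts without listing it here is a compile error,
// because Record<keyof SearchOpts, true> requires every key to be present.
const SEARCH_OPT_KEYS: Record<keyof SearchOpts, true> = {
  limit: true,
  sourceId: true,
  sourceIds: true,
};

function toSearchOpts(opts: HybridSearchOpts): SearchOpts {
  const out: Partial<Record<keyof SearchOpts, unknown>> = {};
  for (const key of Object.keys(SEARCH_OPT_KEYS) as Array<keyof SearchOpts>) {
    if (opts[key] !== undefined) out[key] = opts[key];
  }
  return out as SearchOpts;
}

const narrowed = toSearchOpts({ limit: 5, sourceId: "s1", rrfK: 60 });
console.log(JSON.stringify(narrowed)); // {"limit":5,"sourceId":"s1"}
```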


## functional-area-resolver follow-ups (v0.32.3.0)

- [ ] **v0.33.x: Dogfood `functional-area-resolver` on gbrain's own `skills/RESOLVER.md`** when it crosses ~12KB (currently 8KB). Apply the pattern to the Operational section first (largest). Filed during v0.32.3.0 CEO review.
@@ -1925,7 +1920,6 @@ flow + recovery messaging).
**Depends on:** decision on whether to deprecate the bare name or dual-publish
during a transition window.


## v0.32.6 follow-ups from PR #880 (gbrain-context post-Codex recalibration)

These items were demoted from the PR #880 scope because they depend on
1 change: 0 additions & 1 deletion docs/UPGRADING_DOWNSTREAM_AGENTS.md
@@ -537,4 +537,3 @@ To check what your fork is missing:
```bash
diff <(grep -A3 "Based on gbrain" ~/<your-fork>/skills/brain-ops/SKILL.md) \
<(grep "v[0-9]" ~/gbrain/skills/migrations/ | tail -3)
```
50 changes: 50 additions & 0 deletions skills/_brain-filing-rules.md
@@ -131,6 +131,56 @@ gbrain files restore <dir> # Download back to local
This ensures any derived brain page can be traced back to its original source,
and large files don't bloat the git repo.

## `docs/` — Workspace Document Index (v0.5, gdoc-ingest skill)

The `docs/` directory is the canonical INDEX of Google Workspace documents
(Docs / Sheets / Slides / PDFs in Drive) that Rafael cares about. Drive
remains source of truth; the brain is the searchable index.

### Path conventions

| State | Slug pattern | Owner |
|-------|--------------|-------|
| Inbox (untriaged) | `docs/inbox/<slug>` | gdoc-ingest skill auto-creates |
| Triaged (canonical) | `docs/<disciplina>/<tema>/<slug>` | Triage promotes after Rafael confirms |
| Aggregated view | `docs/inbox` | Materialized view of pending items |
| Templates | `docs/<disciplina>/_templates/<slug>` | Underscore prefix |
| Concepts | `docs/concepts/<slug>` | Reusable principles (e.g. title-first-classification) |

### Frontmatter contract

All `docs/` pages MUST carry these frontmatter keys (see prds/gdoc-ingest):

- `type: document`
- `status: draft-index | oficial | draft | arquivado | obsoleto | stale-untriaged`
- `kind: doc | sheet | slide | pdf | drive-file`
- `disciplina` + `tema` (canonical taxonomy in TAXONOMY constant)
- `secondary_tags: []` (other taxonomy matches in body)
- `owner` (email)
- `url_drive` (Drive link — source of truth)
- `file_id` (Drive file ID for de-dup)
- `mimetype` (MIME of the source file)
- `last_modified_drive` + `indexed_at` (ISO timestamps)
- `indexed_via: slack-paste | drive-crawler | manual-cli | e2e-test`
- `raw_char_count` (extracted text length)
- `is_meeting_doc: true` (if Google Meet transcript or Gemini Anotações)
- `slide_stats` (for slides) OR `sheet_stats` (for sheets)
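
The contract above could be checked mechanically along these lines. This is a sketch under assumptions: the function name `validateDocFrontmatter` is hypothetical, `is_meeting_doc` is treated as optional, and `slide_stats` / `sheet_stats` are required only for the matching `kind`, which is one reading of the list rather than a confirmed rule.

```typescript
// Hypothetical validator for the docs/ frontmatter contract; not the skill's actual code.
type Frontmatter = Record<string, unknown>;

const STATUSES = ["draft-index", "oficial", "draft", "arquivado", "obsoleto", "stale-untriaged"];
const KINDS = ["doc", "sheet", "slide", "pdf", "drive-file"];
const INDEXED_VIA = ["slack-paste", "drive-crawler", "manual-cli", "e2e-test"];
const REQUIRED_KEYS = [
  "type", "status", "kind", "disciplina", "tema", "secondary_tags", "owner",
  "url_drive", "file_id", "mimetype", "last_modified_drive", "indexed_at",
  "indexed_via", "raw_char_count",
];

function checkEnum(fm: Frontmatter, key: string, allowed: string[], errors: string[]): void {
  const value = fm[key];
  if (value === undefined || value === null) return; // already reported as missing
  if (typeof value !== "string" || allowed.indexOf(value) === -1) {
    errors.push(`invalid ${key}: ${String(value)}`);
  }
}

// Returns a list of human-readable problems; an empty list means the page passes.
function validateDocFrontmatter(fm: Frontmatter): string[] {
  const errors: string[] = [];
  for (const key of REQUIRED_KEYS) {
    if (fm[key] === undefined || fm[key] === null) errors.push(`missing ${key}`);
  }
  if (typeof fm["type"] === "string" && fm["type"] !== "document") {
    errors.push("type must be document");
  }
  checkEnum(fm, "status", STATUSES, errors);
  checkEnum(fm, "kind", KINDS, errors);
  checkEnum(fm, "indexed_via", INDEXED_VIA, errors);
  // Assumed reading: stats blocks are required only for the matching kind.
  if (fm["kind"] === "slide" && fm["slide_stats"] === undefined) errors.push("missing slide_stats");
  if (fm["kind"] === "sheet" && fm["sheet_stats"] === undefined) errors.push("missing sheet_stats");
  return errors;
}
```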

### Filing rule for documents

1. **Title-first classification** — the title decides disciplina/tema, not body keywords. (See `concepts/title-first-classification`.)
2. **Iron Law** — every entity (person, project) mentioned with an existing brain page MUST get a back-link FROM that entity TO the doc page. Skill applies this automatically when `--commit` is used; only entities that ALREADY have pages get linked (notability gate).
3. **Successor detection** — if title suggests a newer version (e.g. "Relatório Mar 2026" with "Relatório Feb 2026" already in brain), skill flags `successorOf` in payload. Triage decides if predecessor goes to `status: arquivado`.
4. **No PII redaction at ingest** — fallback heuristic surfaces raw text. PII redaction is the LLM's responsibility at TRIAGE (sonnet-4-6).
5. **Drive is source of truth** — NEVER edit content in brain page; brain is a read-only index. To change content, edit in Drive and re-ingest.

### Triage workflow

1. Cron `gdoc-inbox-triagem-ping` (Fridays, 15:00 BRT) lists pending items in Slack.
2. Rafael responds: ✅ confirm the slug, ✏️ correct the `tema`, 🗑️ discard, or 🔗 mark as successor.
3. On confirm: page moves from `docs/inbox/<slug>` to `docs/<disciplina>/<tema>/<slug>`, status changes to `oficial`.
4. On stale (>60d in inbox): cron auto-tags `stale-untriaged`.
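
The confirm and stale steps above can be sketched as two small pure functions. The `DocPage` shape and the function names are hypothetical, and measuring inbox age from `indexed_at` is an assumption; the workflow text does not say which timestamp the cron uses.

```typescript
// Hypothetical sketch of the triage transitions; not the skill's actual code.
interface DocPage {
  slug: string;
  status: string;
  indexedAt: string; // ISO timestamp; assumed to mark when the page entered the inbox
}

const INBOX_PREFIX = "docs/inbox/";
const STALE_AFTER_DAYS = 60;
const DAY_MS = 24 * 60 * 60 * 1000;

// On confirm: move docs/inbox/<slug> to docs/<disciplina>/<tema>/<slug>
// and set status to "oficial".
function promote(page: DocPage, disciplina: string, tema: string): DocPage {
  if (page.slug.indexOf(INBOX_PREFIX) !== 0) {
    throw new Error(`not an inbox page: ${page.slug}`);
  }
  const leaf = page.slug.slice(INBOX_PREFIX.length);
  return { ...page, slug: `docs/${disciplina}/${tema}/${leaf}`, status: "oficial" };
}

// On stale: pages still in the inbox after more than 60 days get tagged.
function markStaleIfNeeded(page: DocPage, now: Date): DocPage {
  const inInbox = page.slug.indexOf(INBOX_PREFIX) === 0;
  const ageDays = (now.getTime() - Date.parse(page.indexedAt)) / DAY_MS;
  return inInbox && ageDays > STALE_AFTER_DAYS ? { ...page, status: "stale-untriaged" } : page;
}
```

Both functions return new objects instead of mutating their input, which keeps the transitions easy to test against fixed dates.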

## Dream-cycle synthesize / patterns directories (v0.23)

The `synthesize` and `patterns` phases of `gbrain dream` write to a
1 change: 0 additions & 1 deletion skills/academic-verify/SKILL.md
@@ -208,7 +208,6 @@ doesn't, the trace speaks for itself.
skill checks whether the cited claim is true
- `skills/conventions/quality.md` — citation + back-link rules


## Contract

This skill guarantees:
1 change: 0 additions & 1 deletion skills/archive-crawler/SKILL.md
@@ -303,7 +303,6 @@ scan_paths: ["paths from gbrain.yml"]
the same primary-subject filing rule
- `skills/conventions/quality.md` — citations, back-links, voice


## Contract

This skill guarantees:
1 change: 0 additions & 1 deletion skills/concept-synthesis/SKILL.md
@@ -238,7 +238,6 @@ This is heavy work. Run on a cadence, not on every signal:
- `skills/voice-note-ingest/SKILL.md` — same for audio channels
- `skills/idea-ingest/SKILL.md` — same for links / articles


## Contract

This skill guarantees: