104 changes: 104 additions & 0 deletions docs/proposals/smart-embed-scheduling.md
@@ -0,0 +1,104 @@
# Proposal: Smart Embed Scheduling (Stale Page Catch-Up)

## Problem

`gbrain embed --stale` processes pages one at a time with no prioritization. In a production brain with 165K+ pages, the stale embedding backlog can grow to 24K+ pages. The daily cron runs `gbrain embed --stale` but can only process a fraction of the backlog before timing out or hitting rate limits, meaning the brain falls further behind every day.

Stale embeddings mean semantic search returns outdated results: a page that was updated last week is still indexed by embeddings of its original content. For brains that rely on semantic search for entity resolution, content discovery, and agent workflows, this is a silent quality degradation.

### Scale of Impact

| Metric | Value |
|--------|-------|
| Total pages | ~165,000 |
| Stale pages (outdated embeddings) | ~24,700 (15%) |
| Daily cron throughput | ~500-1000 pages |
| Days to clear backlog at current rate (assuming no new stale pages) | 25-50 days |
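
The range in the last row follows from dividing the backlog by the daily throughput. Below is a minimal sketch of that arithmetic. `daysToClear` and its `dailyNewStale` parameter are hypothetical names introduced for illustration and are not part of gbrain.

```typescript
// Estimate how many daily cron runs are needed to clear a stale backlog.
// `dailyNewStale` (an assumption, not measured in this proposal) models
// pages that become stale again each day while the backlog is drained.
function daysToClear(
  backlog: number,
  dailyThroughput: number,
  dailyNewStale: number = 0,
): number {
  if (backlog <= 0) return 0;
  const netPerDay = dailyThroughput - dailyNewStale;
  // If pages go stale at least as fast as they are re-embedded,
  // the backlog never clears.
  if (netPerDay <= 0) return Infinity;
  return Math.ceil(backlog / netPerDay);
}

const bestCase = daysToClear(24_700, 1_000); // 25
const worstCase = daysToClear(24_700, 500);  // 50
```

With any nonzero inflow of newly stale pages the real figure is higher, and if the inflow matches the cron throughput the function returns `Infinity`, which corresponds to the "falls further behind every day" case described in the Problem section.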

## Proposed Solution

### 1. Prioritized Batch Processing

Add `--priority` and `--priority-type` flags to `gbrain embed --stale`:

```bash
# Process stale pages, most recently modified first
gbrain embed --stale --batch-size 500 --priority recent

# Process stale pages, oldest embeddings first
gbrain embed --stale --batch-size 500 --priority oldest

# Process stale pages of specific types first
gbrain embed --stale --batch-size 500 --priority-type company,person
```
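
One way the ordering behind these flags could work is sketched below. The `StalePage` shape, its field names, and `selectBatch` are hypothetical. How `--priority` and `--priority-type` combine is also an assumption: here the preferred types sort first and `--priority` breaks ties.

```typescript
// Hypothetical record for a page with an outdated embedding.
interface StalePage {
  slug: string;
  type: string;
  updatedAt: number;  // epoch ms of the last content change
  embeddedAt: number; // epoch ms of the last embedding run
}

type Priority = "recent" | "oldest";

// Pick the next batch of stale pages. Pages whose type is listed in
// `priorityTypes` come first; within each group, `priority` decides order.
function selectBatch(
  pages: StalePage[],
  batchSize: number,
  priority: Priority,
  priorityTypes: string[] = [],
): StalePage[] {
  const preferred = new Set(priorityTypes);
  const rank = (p: StalePage): number => (preferred.has(p.type) ? 0 : 1);
  const sorted = [...pages].sort((a, b) => {
    const byType = rank(a) - rank(b);
    if (byType !== 0) return byType;
    return priority === "recent"
      ? b.updatedAt - a.updatedAt    // most recently modified first
      : a.embeddedAt - b.embeddedAt; // oldest embeddings first
  });
  return sorted.slice(0, batchSize);
}
```

Sorting a copy keeps the caller's array untouched. In a real implementation the ordering would more likely be pushed into the database query so that the full stale set never has to be loaded into memory.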

### 2. Catch-Up Mode

Add a `--catch-up` flag to `gbrain embed` that keeps running until all stale pages are processed or an optional `--max` cap is reached:

```bash
# Process ALL stale pages with rate limiting
gbrain embed --catch-up --rate-limit 100/min

# Catch up with a specific concurrency
gbrain embed --catch-up --concurrency 5

# Catch up but stop after N pages (resume later)
gbrain embed --catch-up --max 5000
```
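
A value such as `100/min` has to be turned into a delay between embedding calls. The sketch below shows one possible parser. `parseRateLimit` is a hypothetical helper, and the `sec` and `hour` units are assumptions; only the `min` form appears in this proposal.

```typescript
// Parsed form of a "--rate-limit" value such as "100/min".
interface RateLimit {
  count: number;      // requests allowed per window
  windowMs: number;   // window length in milliseconds
  intervalMs: number; // minimum spacing between consecutive requests
}

function parseRateLimit(value: string): RateLimit {
  const match = /^(\d+)\/(sec|min|hour)$/.exec(value.trim());
  if (!match) {
    throw new Error(`invalid rate limit: "${value}" (expected e.g. "100/min")`);
  }
  const count = Number(match[1]);
  if (count <= 0) {
    throw new Error("rate limit count must be positive");
  }
  const unit = match[2];
  const windowMs =
    unit === "sec" ? 1_000 : unit === "min" ? 60_000 : 3_600_000;
  return { count, windowMs, intervalMs: windowMs / count };
}

const limit = parseRateLimit("100/min");
// limit.intervalMs is 600: at most one request every 600 ms.
```

Spacing requests evenly is the simplest policy. A token bucket would allow short bursts while keeping the same average rate, which may suit providers that meter per minute.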

### 3. Dream Cycle Auto-Escalation

The dream cycle should monitor the stale count and automatically escalate:

```
Stale count < 100:          normal batch (default --batch-size)
Stale count 100-999:        2x batch size
Stale count 1,000-10,000:   5x batch size + warning in doctor
Stale count > 10,000:       10x batch size + alert
```
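
The tiers above map directly onto a small lookup function. This is a sketch only: `escalationFor` and the `Escalation` type are hypothetical names, and the treatment of the exact boundary values (1,000 falls in the 5x tier, 10,000 stays in the 5x tier) is an assumption.

```typescript
type EscalationAction = "none" | "doctor-warning" | "alert";

interface Escalation {
  multiplier: number;       // factor applied to the default --batch-size
  action: EscalationAction; // extra signal raised for the operator
}

// Map the current stale count to a batch-size multiplier and an action.
function escalationFor(staleCount: number): Escalation {
  if (staleCount < 100) return { multiplier: 1, action: "none" };
  if (staleCount < 1_000) return { multiplier: 2, action: "none" };
  if (staleCount <= 10_000) return { multiplier: 5, action: "doctor-warning" };
  return { multiplier: 10, action: "alert" };
}

const defaultBatchSize = 500; // illustrative value, not a gbrain default
const tier = escalationFor(24_712);
const batchSize = defaultBatchSize * tier.multiplier; // 5000
```

Keeping the thresholds in one pure function makes them easy to unit test and to move into configuration later.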

### Implementation Notes

- Batch processing should use concurrent embedding calls where the provider supports it
- Rate limiting should respect the embedding provider's limits (configurable)
- Progress reporting: show a progress bar or periodic status during catch-up mode
- Interruption handling: catch-up mode should be safely interruptible (Ctrl+C) and resumable
- Track last-processed page so catch-up can resume where it left off
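
The notes above can be combined into one loop, sketched here under stated assumptions. `CatchUpDeps`, `CatchUpOptions`, and `catchUp` are hypothetical names, and the storage and embedding calls are injected because this proposal does not specify gbrain's internal API. Rate limiting and progress output are left out for brevity.

```typescript
// Injected dependencies (all hypothetical).
interface CatchUpDeps {
  // Return up to `limit` stale page ids that sort after `cursor`.
  nextStale(cursor: string | null, limit: number): Promise<string[]>;
  embed(pageId: string): Promise<void>;
  saveCursor(cursor: string): Promise<void>;
}

interface CatchUpOptions {
  concurrency: number;       // pages embedded in parallel per group
  max?: number;              // stop after this many pages (like --max)
  cursor?: string | null;    // resume point from a previous run
  shouldStop?: () => boolean; // e.g. set to true by a Ctrl+C handler
}

async function catchUp(
  deps: CatchUpDeps,
  opts: CatchUpOptions,
): Promise<{ processed: number; cursor: string | null }> {
  let cursor: string | null = opts.cursor ?? null;
  let processed = 0;
  const max = opts.max ?? Infinity;

  while (processed < max && !(opts.shouldStop?.() ?? false)) {
    const limit = Math.min(opts.concurrency, max - processed);
    const ids = await deps.nextStale(cursor, limit);
    if (ids.length === 0) break; // backlog cleared

    // Embed one group concurrently; the group size bounds in-flight calls.
    await Promise.all(ids.map((id) => deps.embed(id)));
    processed += ids.length;

    // Persist the cursor only after the whole group succeeded, so an
    // interrupted or failed group is retried on the next run.
    const last = ids[ids.length - 1];
    if (last !== undefined) {
      cursor = last;
      await deps.saveCursor(last);
    }
  }
  return { processed, cursor };
}
```

Because the stop flag is checked between groups, an interruption never leaves a group half recorded. The cost is that a group in flight finishes before the process exits.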

## Agent Onboarding

### Doctor Detection

`gbrain doctor` should flag stale counts:

```
⚠ 24,712 pages have stale embeddings (15% of total)
Semantic search quality is degraded for these pages.
Run `gbrain embed --catch-up` to refresh all stale embeddings.
Estimated time: ~2 hours at default rate.
```

### Migration Prompt (on upgrade)

```
You have 24,712 stale embeddings.
Run `gbrain embed --catch-up` to refresh them?
This may take a while depending on your embedding provider. [y/N]
```

## Evidence

In the production brain, the stale count grew from 0 to 24K over several months because:
1. Pages are modified frequently (meeting notes, company updates, etc.)
2. The daily cron only processes ~500-1000 pages per run
3. No alerting existed to warn the operator about the growing backlog
4. The operator discovered the issue only after noticing degraded search results

## Risks & Mitigations

| Risk | Mitigation |
|------|------------|
| High API costs during catch-up | `--rate-limit` flag, `--max` flag to cap per-run |
| Provider rate limiting | Built-in backoff, configurable concurrency |
| Long-running catch-up blocking other operations | Run in background, interruptible, progress saved |
23 changes: 20 additions & 3 deletions src/core/cycle/synthesize.ts
@@ -110,6 +110,22 @@ function warnUnknownModelOnce(model: string): void {
);
}

+// ── Surrogate-safe string slicing ─────────────────────────────────────
+
+/**
+ * Return a cut position at or just before `index` that does not split a
+ * UTF-16 surrogate pair. If `index` lands between a high and low surrogate,
+ * back up by one so the whole pair falls in the right half of the cut.
+ */
+function safeSliceEnd(index: number, str: string): number {
+  if (index <= 0 || index >= str.length) return index;
+  const code = str.charCodeAt(index - 1);
+  // If the char just before the cut is a high surrogate (D800–DBFF),
+  // the cut would orphan it. Back up one.
+  if (code >= 0xD800 && code <= 0xDBFF) return index - 1;
+  return index;
+}
+
// ── Hash-deterministic transcript chunker (D9) ────────────────────────

/**
@@ -178,8 +194,9 @@ function findBoundary(text: string, maxChars: number, searchStart: number): numb
// Tier 3: any newline.
const nlIdx = window.lastIndexOf('\n');
if (nlIdx >= 0) return searchStart + nlIdx;
-  // No boundary fits; hard-split at maxChars (deterministic).
-  return maxChars;
+  // No boundary fits; hard-split at maxChars (deterministic),
+  // but avoid splitting a UTF-16 surrogate pair.
+  return safeSliceEnd(maxChars, text);
}

/**
@@ -659,7 +676,7 @@ export async function judgeSignificance(
// doesn't need the full body; the opening + closing sections are usually
// representative of significance.
const trimmed = t.content.length > 8000
-    ? t.content.slice(0, 4000) + '\n[...truncated...]\n' + t.content.slice(-4000)
+    ? t.content.slice(0, safeSliceEnd(4000, t.content)) + '\n[...truncated...]\n' + t.content.slice(safeSliceEnd(t.content.length - 4000, t.content))
: t.content;

const sys = `You judge whether a conversation transcript is worth synthesizing into a personal knowledge brain.
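
To illustrate what the change guards against, the following sketch copies `safeSliceEnd` from the diff above and applies it to a short string containing one emoji. The sample string and variable names are illustrative and do not come from the codebase.

```typescript
// Copied from the diff above so the example is self-contained.
function safeSliceEnd(index: number, str: string): number {
  if (index <= 0 || index >= str.length) return index;
  const code = str.charCodeAt(index - 1);
  if (code >= 0xd800 && code <= 0xdbff) return index - 1;
  return index;
}

// "ab" + U+1F600 + "cd": the emoji occupies two UTF-16 code units
// (indices 2 and 3), so the string length is 6.
const text = "ab\uD83D\uDE00cd";

// A naive cut at index 3 lands between the two halves of the pair and
// leaves a lone high surrogate at the end of the head.
const naiveHead = text.slice(0, 3);

// The safe cut backs up to index 2, so the head is "ab" and the tail
// starts with the intact emoji.
const cut = safeSliceEnd(3, text);
const head = text.slice(0, cut);
const tail = text.slice(cut);
```

The same position works for both halves, which is why `judgeSignificance` can use the helper for the start of its trailing slice as well as the end of its leading slice.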