garrytan · garrytan-agents · May 24, 2026
diff --git a/docs/proposals/smart-embed-scheduling.md b/docs/proposals/smart-embed-scheduling.md
@@ -0,0 +1,104 @@
+# Proposal: Smart Embed Scheduling (Stale Page Catch-Up)
+
+## Problem
+
+`gbrain embed --stale` processes pages one at a time with no prioritization. In a production brain with 165K+ pages, the stale embedding backlog can grow to 24K+ pages. The daily cron runs `gbrain embed --stale` but can only process a fraction of the backlog before timing out or hitting rate limits, meaning the brain falls further behind every day.
+
+Stale embeddings mean semantic search returns outdated results — a page that was updated last week still has embeddings from its original content. For brains that rely on semantic search for entity resolution, content discovery, and agent workflows, this is a silent quality degradation.
+
+### Scale of Impact
+
+| Metric | Value |
+|--------|-------|
+| Total pages | ~165,000 |
+| Stale pages (outdated embeddings) | ~24,700 (15%) |
+| Daily cron throughput | ~500-1000 pages |
+| Days to clear backlog at current rate (ignoring newly stale pages) | 25-50 days |
+
+## Proposed Solution
+
+### 1. Prioritized Batch Processing
+
+Add `--priority` and `--priority-type` flags to `gbrain embed --stale`:
+
+```bash
+# Process stale pages, most recently modified first
+gbrain embed --stale --batch-size 500 --priority recent
+
+# Process stale pages, oldest embeddings first
+gbrain embed --stale --batch-size 500 --priority oldest
+
+# Process stale pages of specific types first
+gbrain embed --stale --batch-size 500 --priority-type company,person
+```
+
+### 2. Catch-Up Mode
+
+Add `gbrain embed --catch-up` that runs until all stale pages are processed:
+
+```bash
+# Process ALL stale pages with rate limiting
+gbrain embed --catch-up --rate-limit 100/min
+
+# Catch up with a specific concurrency
+gbrain embed --catch-up --concurrency 5
+
+# Catch up but stop after N pages (resume later)
+gbrain embed --catch-up --max 5000
+```
+
+### 3. Dream Cycle Auto-Escalation
+
+The dream cycle should monitor the stale count and automatically escalate the embed batch size as the backlog grows:
+
+```
+Stale count < 100:   normal batch (default --batch-size)
+Stale count 100-1K:  2x batch size
+Stale count 1K-10K:  5x batch size + warning in doctor
+Stale count > 10K:   10x batch size + alert
+```
+
+### Implementation Notes
+
+- Batch processing should use concurrent embedding calls where the provider supports it
+- Rate limiting should respect the embedding provider's limits (configurable)
+- Progress reporting: show a progress bar or periodic status during catch-up mode
+- Interruption handling: catch-up mode should be safely interruptible (Ctrl+C) and resumable
+- Track last-processed page so catch-up can resume where it left off
+
+## Agent Onboarding
+
+### Doctor Detection
+
+`gbrain doctor` should flag stale counts:
+
+```
+⚠ 24,712 pages have stale embeddings (15% of total)
+  Semantic search quality is degraded for these pages.
+  Run `gbrain embed --catch-up` to refresh all stale embeddings.
+  Estimated time: ~2 hours at default rate.
+```
+
+### Migration Prompt (on upgrade)
+
+```
+You have 24,712 stale embeddings.
+Run `gbrain embed --catch-up` to refresh them? 
+This may take a while depending on your embedding provider. [y/N]
+```
+
+## Evidence
+
+In the production brain, the stale count grew from 0 to 24K over several months because:
+1. Pages are modified frequently (meeting notes, company updates, etc.)
+2. The daily cron only processes ~500-1000 pages per run
+3. No alerting existed to warn the operator about the growing backlog
+4. The operator discovered the issue only after noticing degraded search results
+
+## Risks & Mitigations
+
+| Risk | Mitigation |
+|------|------------|
+| High API costs during catch-up | `--rate-limit` flag, `--max` flag to cap per-run |
+| Provider rate limiting | Built-in backoff, configurable concurrency |
+| Long-running catch-up blocking other operations | Run in background, interruptible, progress saved |
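The auto-escalation tiers in section 3 of the proposal could be expressed as a small pure function. This is only a sketch, not gbrain code: `escalationFor`, `effectiveBatchSize`, and the level names are hypothetical. The tier table leaves the 1K boundary ambiguous, so the sketch assumes exactly 1,000 stale pages falls into the 5x tier, while exactly 10,000 stays in the 5x tier because the top tier is written as `> 10K`.

```typescript
// Hypothetical names throughout; nothing here is part of gbrain's actual API.
type EscalationLevel = 'normal' | 'elevated' | 'warning' | 'alert';

interface Escalation {
  level: EscalationLevel;
  batchMultiplier: number;
}

// Map a stale-page count to a tier from the proposal's table.
// Assumption: 1,000 belongs to the 5x tier; 10,000 stays in the 5x tier.
function escalationFor(staleCount: number): Escalation {
  if (staleCount < 100) return { level: 'normal', batchMultiplier: 1 };
  if (staleCount < 1000) return { level: 'elevated', batchMultiplier: 2 };
  if (staleCount <= 10000) return { level: 'warning', batchMultiplier: 5 };
  return { level: 'alert', batchMultiplier: 10 };
}

// Scale the configured default batch size by the tier's multiplier.
function effectiveBatchSize(defaultBatchSize: number, staleCount: number): number {
  return defaultBatchSize * escalationFor(staleCount).batchMultiplier;
}
```

Keeping the mapping free of I/O would let the dream cycle and `gbrain doctor` share one definition of the thresholds, so the warning and the batch size cannot drift apart.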
diff --git a/src/core/cycle/synthesize.ts b/src/core/cycle/synthesize.ts
@@ -110,6 +110,22 @@ function warnUnknownModelOnce(model: string): void {
   );
 }
 
+// ── Surrogate-safe string slicing ─────────────────────────────────────
+
+/**
+ * Return a cut index for `str` that does not split a UTF-16 surrogate pair.
+ * If `index` lands between a high and low surrogate, back up by one so
+ * the pair stays intact on the right side of the cut.
+ */
+function safeSliceEnd(index: number, str: string): number {
+  if (index <= 0 || index >= str.length) return index;
+  const code = str.charCodeAt(index - 1);
+  // If the char just before the cut is a high surrogate (D800–DBFF),
+  // the cut would orphan it. Back up one.
+  if (code >= 0xD800 && code <= 0xDBFF) return index - 1;
+  return index;
+}
+
 // ── Hash-deterministic transcript chunker (D9) ────────────────────────
 
 /**
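To illustrate what the new helper does at a pair boundary, the sketch below copies `safeSliceEnd` from the hunk above so it runs on its own. It compares a naive cut with the adjusted one on a string containing one emoji. `hasLoneSurrogate` and `sample` are illustrative names that do not appear in the diff.

```typescript
// Copied from the hunk above so the example is self-contained.
function safeSliceEnd(index: number, str: string): number {
  if (index <= 0 || index >= str.length) return index;
  const code = str.charCodeAt(index - 1);
  if (code >= 0xd800 && code <= 0xdbff) return index - 1;
  return index;
}

// Illustrative helper (not in the diff): true if `s` has an unpaired surrogate.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {
      const next = s.charCodeAt(i + 1); // NaN past the end, which fails the range test
      if (!(next >= 0xdc00 && next <= 0xdfff)) return true;
      i++; // skip the low half of a valid pair
    } else if (c >= 0xdc00 && c <= 0xdfff) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

// "ab" + U+1F600 + "cd": the emoji occupies code units 2 and 3.
const sample = 'ab\uD83D\uDE00cd';

// Cutting at 3 lands between the two halves of the emoji.
const naiveHead = sample.slice(0, 3);
const cut = safeSliceEnd(3, sample); // backs up to 2
const safeHead = sample.slice(0, cut);
const safeTail = sample.slice(cut);
```

Because the index only ever moves left, a head slice can come out one code unit shorter than requested. A tail slice that starts at the adjusted index, as in the `judgeSignificance` hunk further down, can come out one code unit longer.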
@@ -178,8 +194,9 @@ function findBoundary(text: string, maxChars: number, searchStart: number): numb
   // Tier 3: any newline.
   const nlIdx = window.lastIndexOf('\n');
   if (nlIdx >= 0) return searchStart + nlIdx;
-  // No boundary fits; hard-split at maxChars (deterministic).
-  return maxChars;
+  // No boundary fits; hard-split at maxChars (deterministic),
+  // but avoid splitting a UTF-16 surrogate pair.
+  return safeSliceEnd(maxChars, text);
 }
 
 /**
@@ -659,7 +676,7 @@ export async function judgeSignificance(
   // doesn't need the full body; the opening + closing sections are usually
   // representative of significance.
   const trimmed = t.content.length > 8000
-    ? t.content.slice(0, 4000) + '\n[...truncated...]\n' + t.content.slice(-4000)
+    ? t.content.slice(0, safeSliceEnd(4000, t.content)) + '\n[...truncated...]\n' + t.content.slice(safeSliceEnd(t.content.length - 4000, t.content))
     : t.content;
 
   const sys = `You judge whether a conversation transcript is worth synthesizing into a personal knowledge brain.