diff --git a/CHANGELOG.md b/CHANGELOG.md index 58ba1bd3d..108a527b8 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,93 @@ All notable changes to GBrain will be documented in this file. +## [0.39.0.0] - 2026-05-21 + +**You can finally cap the cost of `gbrain brainstorm` and `gbrain lsd`, AND if the cap fires mid-run, you can resume right where you left off without losing the ideas you already paid for.** + +The 13K-page brain incident that started this wave is real and was expensive. A `gbrain lsd` run estimated $0.96, actually billed $50.71, generated zero usable ideas. The fix wave already merged (PR #1234) capped the prefix sampling that caused the explosion. This release goes one cathedral further: every LLM call that any `gbrain` command makes is now accounted at the gateway layer, so the same cap that protects brainstorm also protects `doctor --remediate`, `eval suspected-contradictions`, the dream cycle, and any future LLM-calling command. The plumbing is shared. + +What that means in the hand: pass `--max-cost N` to brainstorm or lsd or `doctor --remediate`, and the first overflow throws a typed error before any extra dollars are spent. The throw fires from inside the gateway's reserve check, so a budget exhaustion never even acquires a rate-lease slot or makes a provider HTTP call. The cap is a real ceiling, not a suggestion. + +When brainstorm IS exhausted mid-run, the orchestrator persists what's been done to `~/.gbrain/brainstorm/.json` with the FULL idea bodies (not just counts), then re-throws. The user paste-runs the suggested `gbrain brainstorm --resume ` and the second run skips the already-completed crosses, runs only the missing ones, then merges everything before the judge runs. The final BrainstormResult contains the pre-crash ideas AND the post-resume ideas. 
(Codex's outside-voice review was the one that caught this — a resume that produces only the second-run's ideas would be silent partial output, which is worse than no resume at all.) + +### How to turn it on + +```bash +# Cap brainstorm cost at $2 (default $5). Throws BudgetExhausted if exceeded. +gbrain brainstorm "what story should I write next" --max-cost 2 + +# Crash recovery — list saved runs, resume the one you want. +gbrain brainstorm --list-runs +gbrain brainstorm --resume 1a2b3c4d5e6f7890 + +# Bypass the 7-day staleness gate if you really mean it. +gbrain brainstorm --resume 1a2b3c4d5e6f7890 --force-resume + +# Same cap, different command — doctor's autonomous remediation now resumes too. +gbrain doctor --remediate --max-cost 5 +# (on BudgetExhausted, the run persists a checkpoint at +# ~/.gbrain/remediation/.json and tells you the --resume command) +gbrain doctor --remediate --resume +``` + +### What's safe to know about + +A4 amended is a semantic shift: `gbrain doctor --remediate --max-usd` used to be a pre-flight estimate check ("refuse if est > cap"); it's now ALSO a mid-run hard ceiling backed by BudgetTracker via the gateway's AsyncLocalStorage scope. If you cron-schedule `--remediate`, the worst case used to be "the run starts despite the under-estimate"; now the worst case is "the run aborts mid-step and writes a resumable checkpoint." The first failure-mode is gone; the second is recoverable via `--resume`. `--max-cost` is a new alias for `--max-usd` for symmetry with brainstorm. + +The brainstorm checkpoint identity intentionally uses NO embedding bits: `run_id = sha256(question + profile + sort(close_slugs) + sort(far_slugs)).slice(0,16)`. Swap your embedding model between runs and the resume still finds the checkpoint. Conversely, change the question by even one word and you get a different run_id (the previous checkpoint is left alone; the cycle purge phase GCs anything older than 7 days). 
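As a rough illustration of the checkpoint identity described above, a minimal sketch might look like the following. The function name `computeRunIdSketch`, the newline and comma separators, and the hex digest encoding are assumptions made for illustration; the changelog only specifies the inputs, the slug sorting, and the 16-character slice.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper, not gbrain's actual code. Separators and hex encoding
// are assumptions; only the inputs, the sorting, and slice(0, 16) come from
// the changelog text.
function computeRunIdSketch(
  question: string,
  profile: string,
  closeSlugs: string[],
  farSlugs: string[],
): string {
  // Sort copies so the caller's arrays are untouched and slug order is irrelevant.
  const close = [...closeSlugs].sort().join(",");
  const far = [...farSlugs].sort().join(",");
  const material = [question, profile, close, far].join("\n");
  // No embedding-model information is mixed in, so swapping models keeps the id stable.
  return createHash("sha256").update(material).digest("hex").slice(0, 16);
}

const idA = computeRunIdSketch("what next", "default", ["b", "a"], ["y", "x"]);
const idB = computeRunIdSketch("what next", "default", ["a", "b"], ["x", "y"]);
const idC = computeRunIdSketch("what now", "default", ["a", "b"], ["x", "y"]);
console.log(idA === idB, idA === idC);
```

Under these assumptions, reordering the slug lists yields the same id, while changing a single word of the question yields a different one, which matches the behavior the paragraph describes.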
+ +The dream cycle's `~/.gbrain/audit/dream-budget-YYYY-Www.jsonl` grew one new field on every line: `schema_version: 1`. Reorderings are tolerated (downstream consumers should index by field name, not position); renames or removals are breaking. The same schema-stable contract holds for the new `~/.gbrain/audit/budget-YYYY-Www.jsonl` produced by the unified `BudgetTracker`. + +If you wrote integration code against `BudgetExhausted` in the brainstorm orchestrator before this release: that class moved to `src/core/budget/budget-tracker.ts`. The orchestrator re-exports the old name for back-compat, so existing imports keep working. + +### Itemized changes + +- **`BudgetTracker` is the new canonical primitive** at `src/core/budget/budget-tracker.ts`. One class, one typed error (`BudgetExhausted` with `reason: 'cost' | 'runtime' | 'no_pricing'`), one schema-stable audit JSONL. Pinned by 18 unit cases covering TX1 (record throws when cumulative exceeds cap), TX2 (no_pricing hard-fails when cap is set + pricing missing), A3 amended (pessimistic fallback when `err.usage` is absent), the onExhausted-fires-once-before-throw contract, and the schema-stable audit schema. +- **`withBudgetTracker(tracker, fn)` at the gateway layer (TX5)** installs the tracker on a module-internal `AsyncLocalStorage`. Every `gateway.chat / embed / rerank` call inside the scope auto-composes. Outside-scope calls are budget no-ops (existing behavior preserved). Nested scopes restore the outer on exit. Parallel `Promise.all` scopes do not bleed trackers across each other. +- **Subagent rate-lease ordering pinned (A1)**: the gateway's `reserve()` runs BEFORE `acquireLease()` in `src/core/minions/handlers/subagent.ts`. A budget throw must NOT consume a rate-lease slot. The handler body itself no longer needs explicit budget threading; the AsyncLocalStorage composition handles it. +- **`payload-fitter.ts` (P6)** lands at `src/core/diarize/payload-fitter.ts` with two strategies. 
`'batch'` is deterministic token-budgeted chunking, no LLM calls. `'summarize'` embed-clusters then Haiku-summarizes each cluster in parallel via `Promise.allSettled` at parallelism=4. The quality gate flags `degraded: true` when success ratio drops below the configured `min_success_ratio` (default 0.75) — caller decides whether to surface or abort. +- **Brainstorm checkpoint (P7)** at `src/core/brainstorm/checkpoint.ts`. Atomic .tmp+rename writes. Full idea bodies persisted (TX3). One-flag resume (TX4). 7-day mtime-based GC wired into the cycle purge phase. +- **`doctor --remediate --resume`** loads `~/.gbrain/remediation/.json` and continues from the next un-completed step. Refuses on mismatched plan_hash with a paste-ready message. +- **`gbrain brainstorm --list-runs`** prints saved run_ids + iso dates + question stems so the user can pick which to resume. +- **ISO-week audit filenames consolidated** into `src/core/audit-week-file.ts`. Four call sites migrated (shell-jobs, phantoms, slug-fallback, dream-budget). Year-boundary cases (2020-W53, 2024-12-30 belongs to 2025-W01) pinned by tests. +- **eval-contradictions** routes through `withBudgetTracker` for telemetry without changing the CLI surface. `--budget-usd` semantics + `PreFlightBudgetError` shape are byte-identical. + +### For contributors + +- `bun test` adds 73 new tests across 9 new files (`test/core/budget/`, `test/core/audit-week-file.test.ts`, `test/core/diarize/`, `test/brainstorm/checkpoint.test.ts`, `test/e2e/brainstorm-resume.test.ts`, `test/core/remediation-checkpoint.test.ts`). Plus F1 closes the pre-existing PGLite `page_links` schema gap (the brainstorm domain-bank queries `page_links` but the embedded schema only defined `links`). Brainstorm now works against PGLite brains in production via the new `page_links` view alias shipped in both the embedded schema bundle and migration v86 (renumbered from v81 during merge with master's v0.38 cathedrals which claimed v81-v85). 
F2 adds an E2E pinning the user-facing `--max-cost` pre-flight refusal path. F3 adds `--max-cost` to `gbrain reindex --code`. All previous brainstorm + doctor + eval-contradictions tests still pass. + +## To take advantage of v0.39.0.0 + +`gbrain upgrade` should do this automatically. If it didn't, or if `gbrain doctor` +warns about a partial migration: + +1. **Run the orchestrator manually:** + ```bash + gbrain apply-migrations --yes + ``` + This applies migration v86 (`page_links_view_alias`) on PGLite + Postgres brains. The alias is required for `gbrain brainstorm` and `gbrain lsd` to work against the domain-bank tiebreaker; without it, the brainstorm domain-bank queries fail with `relation "page_links" does not exist`. +2. **Set a cost cap on the commands you care about:** + ```bash + # Sets a per-run dollar ceiling. Throws BudgetExhausted before any LLM call + # if the pre-run estimate exceeds the cap, AND mid-run if cumulative spend + # blows past it. + gbrain brainstorm "test" --max-cost 1 + gbrain doctor --remediate --max-cost 5 + gbrain reindex --code --max-cost 10 + ``` +3. **Verify the outcome:** + ```bash + gbrain doctor # schema_version should be 86 + gbrain brainstorm --list-runs # confirms the new checkpoint directory exists + ``` +4. **If any step fails or the numbers look wrong,** please file an issue: + https://github.com/garrytan/gbrain/issues with: + - output of `gbrain doctor` + - contents of `~/.gbrain/upgrade-errors.jsonl` if it exists + - which step broke + + This feedback loop is how the gbrain maintainers find fragile upgrade paths. Thank you. ## [0.38.2.0] - 2026-05-22 **`gbrain doctor` no longer hangs on big brains, and gives you real signal when it has to give up.** @@ -683,29 +770,6 @@ Credited contributors per the CHANGELOG attribution convention; closing comments ```bash gbrain apply-migrations --yes ``` -2. 
**Try the capture verb:** - ```bash - gbrain capture "first thought into v0.38" - gbrain query "first thought" - ``` - The receipt block should show the slug + file path; the query should - return the page within a second. -3. **For webhook ingestion** (only if you run `gbrain serve --http`): - ```bash - curl -X POST https://your-brain/ingest \ - -H "Authorization: Bearer $TOKEN" \ - -H "Content-Type: text/markdown" \ - -d "# webhook test" - ``` - You should see HTTP 202 + a `job_id`. Run `gbrain query "webhook test"` - to confirm the page landed. -4. **If any step fails or the numbers look wrong,** please file an issue: - https://github.com/garrytan/gbrain/issues with: - - output of `gbrain doctor` - - contents of `~/.gbrain/upgrade-errors.jsonl` if it exists - - which step broke - - This feedback loop is how the gbrain maintainers find fragile upgrade paths. Thank you. 2. **Verify the source-routing fix on your federated brains:** ```bash gbrain sources current diff --git a/CLAUDE.md b/CLAUDE.md index 022213d63..96517ad5e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -107,6 +107,12 @@ strict behavior when unset. - `src/core/ai/recipes/voyage.ts` — Voyage AI openai-compatible recipe. **v0.28.7 (#680):** declares `chars_per_token=1` + `safety_factor=0.5` so the gateway pre-splits Voyage batches at a 60K-character budget (50% of 120K-token cap with the dense-tokenizer ratio). Closes the v0.27 backfill loop where ~26% of the corpus stayed un-embedded because tiktoken-grounded budgeting silently undercounted Voyage's actual token usage. **v0.28.11 (#719):** declares `multimodal_models: ['voyage-multimodal-3']` so the gateway rejects text-only Voyage models pointed at the multimodal endpoint with a clear `AIConfigError` instead of waiting for Voyage's HTTP 400. 
**v0.33.1.1 (#962, fixup):** recipe docstring at `:7-16` tightened to name the seven hosted flexible-dim models that accept `output_dimension` explicitly (`voyage-4-large`, `voyage-4`, `voyage-4-lite`, `voyage-3-large`, `voyage-3.5`, `voyage-3.5-lite`, `voyage-code-3`) and call out that `voyage-4-nano` is the open-weight variant listed separately by Voyage as fixed 1024-dim — does NOT accept the parameter. The "all v4 variants are flexible" misread is what caused the original PR to include nano in `VOYAGE_OUTPUT_DIMENSION_MODELS`; the negative regression assertion in `test/ai/gateway.test.ts` (`dimsProviderOptions` returns `undefined` for `voyage-4-nano`) pins the contract. **v0.37.3.0:** `voyage-code-3` is the recommended embedding model for gstack per-worktree code brains (Topology 3 in `docs/architecture/topologies.md`). Registration was already in the `models` list since pre-v0.33; the v0.37.3.0 wave adds discoverability surfaces — decision-tree branch in `docs/integrations/embedding-providers.md`, Topology 3 "Recommended embedding model" subsection, runtime nudge from `gbrain reindex --code` against non-code-tuned models. Recipe-shape regression pinned by `test/ai/voyage-code-3-recipe.test.ts`. - `src/core/ai/recipes/anthropic.ts` — Anthropic recipe (chat + expansion touchpoints). **v0.31.12:** chat and expansion `models:` lists drop the v0.31.6 phantom `claude-sonnet-4-6-20250929` date suffix — canonical id is `claude-sonnet-4-6`. The wrong-direction alias `claude-sonnet-4-6 → claude-sonnet-4-6-20250929` is removed; a reverse alias `claude-sonnet-4-6-20250929 → claude-sonnet-4-6` keeps stale user configs working (rescues `facts.extraction_model` and `models.dream.synthesize` set by v0.31.6 installs). Recipe-shape regression pinned by `test/anthropic-model-ids.test.ts` (6 cases, verbatim cherry-pick of PR #830 plus the reverse-alias rescue case). - `src/core/anthropic-pricing.ts` — Single source of truth for Anthropic model pricing (per-MTok input/output). 
**v0.31.12:** Opus 4.7 corrected from `$15/$75` to `$5/$25` (the old number was from Opus 4 generation, never refreshed when 4.7 shipped); Opus 4.6 also corrected. Consumed by `src/core/budget-meter.ts` and `src/core/cross-modal-eval/runner.ts` — the cross-modal estimator now reads `ANTHROPIC_PRICING` for Anthropic models instead of duplicating the table, killing the v0.31.6 drift bug class. +- `src/core/budget/budget-tracker.ts` (v0.37.x) — keystone primitive for the brainstorm cost-cathedral wave. One typed error (`BudgetExhausted` with `reason: 'cost' | 'runtime' | 'no_pricing'`), one schema-stable audit JSONL at `~/.gbrain/audit/budget-YYYY-Www.jsonl`. Contracts pinned by 18 unit cases: **TX1** — `record()` throws when cumulative spend exceeds cap (the cap is a real ceiling, not a suggestion); **TX2** — `reserve()` hard-fails with `reason: 'no_pricing'` when `maxCostUsd` is set AND the model is missing from pricing maps (warn-once preserved when cap is unset); **A3 amended** — `extractUsageFromError(err, fallback)` returns `err.usage` when SDK provides it, else the pessimistic fallback (caller passes `maxOutputTokens`, not the optimistic pre-call estimate). `onExhausted(cb)` callback fires once synchronously BEFORE the throw propagates so callers can persist checkpoints. Replaces three parallel copies (inline brainstorm class, cycle/budget-meter, eval-contradictions). Adapts the old `BudgetMeter` via T5 (public shape preserved + `schema_version: 1` stamped on every dream-budget audit line). +- `src/core/audit-week-file.ts` (v0.37.x, Q1) — single source of truth for ISO-week audit JSONL filename math. Exports `isoWeek(d)`, `isoWeekFilename(prefix, now?)`, `resolveAuditDir()` (honors `GBRAIN_AUDIT_DIR`). Year-boundary correctness pinned by tests at 2020-W53 (the 53-week year), 2025-W01 rolling in from 2024-12-30 (Monday), 2026-W01. 
Four call sites migrated in T4: `src/core/minions/handlers/shell-audit.ts`, `src/core/facts/phantom-audit.ts`, `src/core/audit-slug-fallback.ts`, `src/core/cycle/budget-meter.ts`. Each call site keeps its `computeAuditFilename` thin wrapper for back-compat with existing tests. +- `src/core/ai/gateway.ts:withBudgetTracker` (v0.37.x, T3 / TX5) — gateway-layer enforcement via `AsyncLocalStorage`. `withBudgetTracker(tracker, fn)` installs the tracker on the module-internal store; every `gateway.chat / embed / rerank` call inside the scope auto-composes (reserve before, record in try/finally). Outside-scope calls are budget no-ops (current behavior preserved). Nested scopes restore the outer tracker on exit. `getCurrentBudgetTracker()` is the test seam. The chat path uses A3-amended pessimistic fallback on error paths; the embed path estimates input tokens from char count × recipe's `chars_per_token` because the AI SDK doesn't surface per-batch embed token usage; the rerank path estimates char count of query+docs. 6 unit cases pin the contract. +- `src/core/diarize/payload-fitter.ts` (v0.37.x, P6 / Q3) — generic fit-arbitrarily-large-items-into-per-call-token-budget utility. `'batch'` strategy is deterministic token-budgeted chunking with no LLM calls. `'summarize'` strategy embed-clusters into ceil(items/4) groups via cheap deterministic nearest-neighbor on cosine, Haiku-summarizes each cluster via `Promise.allSettled` at parallelism=4 (Perf1). Each Haiku call composes the active BudgetTracker via T3's AsyncLocalStorage. The quality gate (codex outside-voice finding #4): when `success_ratio < min_success_ratio` (default 0.75), result is flagged `degraded: true` — the fitter preserves the successful subset; the caller decides whether to surface a partial result or abort. +- `src/core/brainstorm/checkpoint.ts` (v0.37.x, P7 / TX3+TX4+A5 amended) — crash-resilient checkpoint for `gbrain brainstorm` and `gbrain lsd`. 
Persists FULL idea bodies (~50KB per run) so resume can MERGE the pre-crash ideas with the post-resume ideas before the judge runs (codex's load-bearing finding — a resume that produces only second-run output is silent partial output). `run_id = sha256(question + profile + sort(close_slugs) + sort(far_slugs)).slice(0,16)` — NO embedding bits, stable across embedding-model swaps. Atomic write via `.tmp + rename`. ONE resume flag (`--resume ` — the proposed `--retry-failed` was dropped per TX4: failed AND never-attempted crosses both go through `--resume`). `--list-runs` prints saved run_ids mtime-newest-first. `--force-resume` bypasses the 7-day staleness gate. The cycle purge phase (`gbrain dream --phase purge`) GCs checkpoints older than 7 days via `gcStaleCheckpoints(7)`. Pinned by 20 unit cases + 3 E2E cases in `test/e2e/brainstorm-resume.test.ts` including the load-bearing merge contract. +- `src/core/remediation-checkpoint.ts` (v0.37.x, T7 / A4 amended) — `doctor --remediate` checkpoint at `~/.gbrain/remediation/.json`. `plan_hash = sha256(JSON.stringify(sorted recommendation ids)).slice(0,16)`. Schema-versioned. Atomic write via `.tmp + rename`. `gbrain doctor --remediate --resume ` (or with no arg — picks the newest matching checkpoint) loads it and skips already-completed steps. Mismatched plan_hash refuses with a paste-ready message. Cleared on clean completion. Pinned by 13 unit cases. - `src/core/model-config.ts` — Model-string resolution (the seam every internal LLM call walks through). **v0.31.12:** four-tier system (`ModelTier = 'utility' | 'reasoning' | 'deep' | 'subagent'`) with `TIER_DEFAULTS` (utility→haiku-4-5, reasoning→sonnet-4-6, deep→opus-4-7, subagent→sonnet-4-6) and `tier?: ModelTier` on `ResolveModelOpts`. Resolution chain is now 8 steps: cliFlag → deprecated key → config key → `models.default` → `models.tier.` → env var → `TIER_DEFAULTS[tier]` → caller fallback. 
Two new exports — `isAnthropicProvider(modelString)` checks `provider:model` prefix OR `claude-` bare-id pattern, and `enforceSubagentAnthropic()` is the layer-2 runtime guard: when `tier === 'subagent'` resolves to a non-Anthropic provider, it emits a once-per-`(source, model)` stderr warn AND falls back to `TIER_DEFAULTS.subagent` instead of letting the Anthropic Messages API tool-loop attempt to run on OpenAI/Gemini. `_resetDeprecationWarningsForTest()` now also clears `_subagentTierWarningsEmitted` so tests re-emit. - `src/core/ai/model-resolver.ts` — Recipe-touchpoint validator. **v0.31.12:** `assertTouchpoint(recipe, touchpoint, modelId, extendedModels?)` gains an optional 4th `extendedModels: ReadonlySet` argument. When the modelId is in that set, the native-recipe allowlist throw is bypassed — the user explicitly opted into this model via config so we let provider rejection surface as `model_not_found` at HTTP call time (and `gbrain models doctor` catches it earlier). Default code paths with hardcoded model strings MUST NOT pass `extendedModels` — typos in source code still fail fast. Replaces the earlier plan to soften the validator wholesale (Codex F4/F5 in plan review flagged that as too broad — it would have removed the fail-fast contract for chat + expand + embed all three). - `src/core/ai/gateway.ts` extension (v0.31.12) — new module-scoped `_extendedModels: Map>` registry feeds `assertTouchpoint`'s 4th-arg path. New `reconfigureGatewayWithEngine(engine)` async function is called from `cli.ts` after `engine.connect()` (and before every command except `CLI_ONLY` no-DB commands) — re-resolves expansion + chat defaults through `resolveModel()` so `models.tier.*` and `models.default` overrides apply to expansion + chat both. `DEFAULT_CHAT_MODEL` corrected to `anthropic:claude-sonnet-4-6` (was the v0.31.6 phantom `-20250929`). New `__setChatTransportForTests` seam mirrors `__setEmbedTransportForTests` so tests drive `chat()` with a stubbed transport. 
diff --git a/TODOS.md b/TODOS.md index 7d693799b..f5d84718e 100644 --- a/TODOS.md +++ b/TODOS.md @@ -1,6 +1,30 @@ # TODOS +## v0.37.x brainstorm cost-cathedral follow-ups (filed during T12) + +- [ ] **Explicit `--max-cost` flag on `gbrain extract`, `gbrain enrich`, `gbrain integrity auto`.** v0.37.x ships gateway-layer enforcement via `withBudgetTracker` — wrapping any of those commands at their entrypoint with `withBudgetTracker(tracker, fn)` immediately gives them the same cap semantics that brainstorm + doctor --remediate have. The CLI flag wiring (parse `--max-cost`, construct `BudgetTracker` with `maxCostUsd`, wrap the entrypoint) is the only missing piece. ~30 lines each plus smoke tests. Deferred per the plan's "NOT in scope" — gateway-layer composition was the structural goal; the per-command flag wiring is the next ergonomic win. + +- [ ] **`P5` config-schema `budgets:` block in `~/.gbrain/config.json`.** The lsd cost-explosion incident's P5 proposed declarative per-command budgets in config. v0.37.x ships the imperative `--max-cost N` surface, which covers the canonical case. Config-driven defaults (so users don't have to remember to pass `--max-cost` every time) are a v0.38+ ergonomic win. Shape: + ```yaml + budgets: + default: + max_cost_usd: 5.00 + max_runtime_seconds: 300 + brainstorm: { max_cost_usd: 2.00 } + lsd: { max_cost_usd: 5.00 } + dream: { max_cost_usd: 10.00 } + ``` + Resolution: CLI flag > config block > built-in default. + +- [ ] **Multi-day brainstorm resume (>7d).** A5's 7-day mtime window covers >99% of crash-and-resume cases (an operator forgetting about a run for a week is rare). `--force-resume` is the escape hatch. The full multi-day story (longer retention, possibly a daily GC instead of cycle-purge-only, dashboard for in-flight runs) is a v0.38+ concern. + +- [ ] **Async-batched audit writes.** Sync `appendFileSync` is fine at typical volumes (~5ms × 100 crosses = ~500ms — not noticeable inside a $1 brainstorm run). 
Profiling trigger criterion: when 100+ crosses on a large brain shows audit-write time dominating wall-clock cost, switch to an async write queue. Fixing prematurely costs complexity for no measurable benefit. + +- [ ] **`BudgetLedger` unification with `BudgetTracker`.** `src/core/enrichment/budget.ts` defines a separate `BudgetLedger` primitive for per-day, per-scope/resolverId enrichment caps. Different shape from `BudgetTracker` (daily reset windows + multi-tier scope keys). Unification is possible but requires careful schema design to preserve enrichment's existing report semantics. Deferred because: (a) BudgetTracker covers the per-command case cleanly today, (b) the existing BudgetLedger isn't a customer-facing surface — it backs `gbrain enrich`'s internal accounting, (c) merging them would require a schema migration on the enrichment budget audit JSONL. Revisit when the enrichment surface gets its next major touch. + +- [ ] **judges.ts internal chunking → payload-fitter delegation.** v0.37.x ships `src/core/diarize/payload-fitter.ts` with the batch strategy ready to consume from `src/core/brainstorm/judges.ts`'s `runJudge` chunking path. Today judges.ts keeps its own copy of the chunking loop (~30 lines) — straightforward refactor: replace the inline split with `fit({strategy:'batch', items: ideas, maxTokensPerCall, estimateTokens})` and concatenate results. The cost-guardrails test suite already pins the public contract; the refactor is mechanical. Touch one function; trivial. + ## v0.37 PGLite fresh-install fix wave — deferred follow-ups (v0.37.x+ / v0.38.x) - [ ] **`gbrain embed --try-fallback` for provider quota/auth failures.** The v0.37 wave deliberately rejected auto-fallback because silently switching providers writes mixed-space vectors into one `content_chunks.embedding` column, corrupting retrieval. 
The right design: explicit `--try-fallback` flag that (a) detects the primary failure type (429 / 401 / 5xx), (b) confirms the fallback provider's `embedding_dimensions` matches the schema, (c) prompts the user via TTY before switching mid-corpus, (d) writes a marker chunk attribute so doctor can flag mixed-provider corpora later. Doctor currently surfaces "Detected 1 alternative embedding provider ready to use" but the embed command never acts. Owner: open. Sources: user bug report item #5; v0.37 wave plan deferred list. diff --git a/VERSION b/VERSION index dcc5e4ab3..fc3cdf0a6 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.38.2.0 +0.39.0.0 \ No newline at end of file diff --git a/docs/incidents/2026-05-20-lsd-cost-explosion.md b/docs/incidents/2026-05-20-lsd-cost-explosion.md new file mode 100644 index 000000000..96508c948 --- /dev/null +++ b/docs/incidents/2026-05-20-lsd-cost-explosion.md @@ -0,0 +1,265 @@ +# Incident Report: LSD Brainstorm 53× Cost Overrun + +**Date:** 2026-05-20 +**Severity:** High (financial — $50.71 actual vs $0.96 estimated) +**Component:** `gbrain lsd` / `gbrain brainstorm` +**Brain size:** 13,690 pages, 16,314 links, ~2,000 unique directory prefixes +**Version:** v0.37.1.0 (first release of brainstorm/lsd) + +## What Happened + +A user ran `gbrain lsd "what story should Garry's List write next" --yes` on a 13,690-page brain. The command: + +1. **Estimated cost: $0.96** (2×12 = 24 crosses × 4 ideas + judge) +2. **Actual cost: $50.71** — 53× over estimate +3. **Token usage:** 4,906,011 input + 2,399,239 output = 7.3M total tokens +4. **Far set pulled 1,985 pages** instead of the configured 12 +5. **Generated 15,868 raw ideas** across the crosses (vs expected ~96) +6. **Judge phase failed:** 2,989,338 tokens exceeded Claude Sonnet's 1M context limit +7. 
**Zero ideas surfaced to the user** — complete failure + +A retry with an explicit `--limit 12`: +- Far set correctly returned 12 pages, cost was $0.39 +- But judge still failed: `parseJudgeJSON: no strategy produced valid JSON` +- Again, 0 ideas survived to output (96 generated, 0 scored) + +## Root Causes + +### RC1: Far Set Explosion (caused the $50 bill) + +**File:** `src/core/brainstorm/domain-bank.ts` → `fetchFar()` → `listPrefixSampledPages()` + +The domain bank samples pages by directory prefix to get diversity. `listPrefixSampledPages` returns **one page per prefix passed in**. On a 13K-page brain with ~2,000 unique prefixes (books/, civic/bundles/, civic/gl-article-*, people/, concepts/, etc.), passing all prefixes produces ~2,000 rows — not the configured `m=12`. + +The cost estimator uses `m` (12) to predict crosses and cost. But the actual cross phase receives 1,985 far-set pages, producing `2 × 1985 = 3,970` crosses at 4 ideas each, for up to 15,880 ideas (15,868 were actually generated). + +**The estimate formula is correct for the intended behavior; the far set selection is what diverged.** + +### RC2: No Cost Circuit Breaker + +There is no mechanism to: +- Abort if estimated cost exceeds a threshold +- Abort mid-run if actual spend diverges from estimate +- Cap the far set size regardless of prefix count +- Warn the user that a run will be expensive before proceeding + +The `--yes` flag skips the 10-second cost preview wait, removing even the manual inspection opportunity. + +### RC3: Judge Context Overflow + +The judge receives ALL ideas in a single prompt. With 15,868 ideas at ~350 tokens each, that's ~5.5M tokens — well beyond any model's context window. + +Even on the retry with only 96 ideas, the judge failed with JSON parsing errors, suggesting the judge prompt/response format is fragile. 
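The RC3 arithmetic can be checked with a small pre-call estimate. This is only a sketch: the function name `estimateJudgePrompt` is hypothetical, and the defaults (about 350 tokens per idea, a 1M-token context limit) are the figures quoted in this report rather than values read from the codebase.

```typescript
// Hypothetical pre-flight check; not part of gbrain. Defaults mirror the
// numbers quoted in the incident report (assumptions, not measured values).
interface JudgeEstimate {
  estimatedTokens: number;
  fits: boolean;
  overflowRatio: number; // estimatedTokens divided by contextLimit
}

function estimateJudgePrompt(
  ideaCount: number,
  tokensPerIdea = 350,
  contextLimit = 1_000_000,
): JudgeEstimate {
  const estimatedTokens = ideaCount * tokensPerIdea;
  return {
    estimatedTokens,
    fits: estimatedTokens <= contextLimit,
    overflowRatio: estimatedTokens / contextLimit,
  };
}

const incident = estimateJudgePrompt(15_868); // the failed 13K-page run
const retry = estimateJudgePrompt(96); // the --limit 12 retry
console.log(incident.estimatedTokens, incident.fits, retry.estimatedTokens, retry.fits);
```

A check like this would have flagged the first run before the judge call. It would not have helped the retry, which fit comfortably in context and still failed on JSON parsing.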
+ +### RC4: Unpaired UTF-16 Surrogates in Page Content + +Two crosses failed with: `The request body is not valid JSON: no low surrogate in string` + +Some pages (likely OCR imports or web scrapes) contain unpaired UTF-16 surrogates. When these get serialized into the JSON request body for the LLM API, the JSON encoder produces invalid JSON. + +### RC5: No Timeout on Individual Crosses + +One cross timed out with no specific timeout configured. The default HTTP timeout allowed it to hang for an extended period before failing, consuming tokens on the API side. + +## Observed Token Flow + +``` +Configured: 2 close × 12 far = 24 crosses × 4 ideas = 96 ideas + 1 judge call +Actual: 2 close × 1985 far = 3970 crosses × 4 ideas = 15,868 ideas + 1 judge call (failed) + +Per-cross tokens (estimated): ~1,200 in + 600 out +Actual total: 4,906,011 in + 2,399,239 out + +The judge call alone would have been: + 15,868 ideas × ~350 tokens = ~5.5M tokens (prompt) + Model limit: 1M tokens (Sonnet) + Overflow: 5.5× context limit +``` + +## Proposed Fixes + +### P1: Far Set Cap (Critical — prevents cost explosion) + +`fetchFar()` must cap the number of prefixes BEFORE calling `listPrefixSampledPages`. The cap should be `max(m * 4, 50)` to allow some diversity headroom while preventing runaway growth. Final selection trimmed to `m` by distance score. + +**Status:** Implemented in `dc080ac2`. + +### P2: Cost Guardrails (Critical — defense in depth) + +New flags for `brainstorm` and `lsd` commands: +- `--max-cost ` (default $5): hard-abort if pre-run estimate exceeds +- `--strict-budget`: abort mid-run if running cost exceeds 5× estimate +- `--max-far-set ` (default 50): explicit far set size cap + +**Status:** Implemented in `dc080ac2`. + +### P3: Judge Chunking (Critical — prevents context overflow) + +Split ideas into batches of ~100 before calling the judge LLM. Each batch is a separate API call; results concatenated. 
This bounds per-call token usage to ~35K regardless of total idea count. + +**Status:** Implemented in `dc080ac2`. + +### P4: Unicode Sanitization (Medium — prevents cross failures) + +Strip unpaired UTF-16 surrogates from page content before building cross prompts. This is a general problem for any gbrain function that serializes user-generated page content into JSON for API calls. + +**Status:** Implemented in `dc080ac2`. + +### P5: Global Token & Time Budgets for All Analysis Functions (Proposed) + +**This is the bigger architectural ask.** Every gbrain command that makes LLM calls should respect configurable budgets: + +```yaml +# Proposed config additions to ~/.gbrain/config.json +budgets: + # Global defaults + default: + max_input_tokens: 500_000 # per-command input token cap + max_output_tokens: 200_000 # per-command output token cap + max_cost_usd: 5.00 # per-command dollar cap + max_runtime_seconds: 300 # 5-minute wall-clock cap + + # Per-command overrides + brainstorm: + max_cost_usd: 2.00 + max_runtime_seconds: 120 + lsd: + max_cost_usd: 5.00 + max_runtime_seconds: 300 + dream: + max_cost_usd: 10.00 + max_runtime_seconds: 600 + extract: + max_input_tokens: 1_000_000 + max_runtime_seconds: 900 + enrich: + max_cost_usd: 3.00 + max_runtime_seconds: 180 +``` + +**Commands affected:** +- `brainstorm` / `lsd` — bisociation crosses + judge (this incident) +- `dream` — dream cycle phases (enrichment, emotional weight, etc.) +- `extract all` — link + timeline extraction across all pages +- `enrich` — per-page deep enrichment with web research +- `eval` — evaluation runs (suspected-contradictions, retrieval drift) +- `integrity auto` — automated content repair +- `doctor --remediate` — autonomous self-healing via Minions + +**Implementation approach:** +1. Add a `BudgetTracker` class that wraps LLM calls with token/cost/time accounting +2. Every analysis function receives a budget context +3. 
On budget exhaustion: save partial results, emit a structured warning, exit cleanly +4. CLI flags (`--max-cost`, `--max-tokens`, `--timeout`) override config defaults +5. `--no-budget` escape hatch for power users who know what they're doing + +### P6: Diarization / Summarization for Oversized Payloads (Proposed) + +When a judge or analysis phase receives more content than fits in context: + +1. **Estimate tokens** before calling the LLM +2. If over budget, **diarize**: summarize/compress the content to fit +3. For the judge specifically: rank ideas by a cheap heuristic first (keyword overlap, novelty score), then send only top-N to the LLM judge +4. For other analysis: progressive summarization — chunk → summarize → merge summaries → final analysis + +This is effectively a **token budget allocator** that decides how to spend a fixed token budget across variable-length inputs. + +``` +Example: 15,868 ideas need judging, context limit 900K tokens + Step 1: Cheap pre-filter (keyword dedup, obvious duplicates) → 8,000 unique ideas + Step 2: Batch into 80 chunks of 100 ideas each + Step 3: Judge each chunk → 80 calls × ~35K tokens = 2.8M total (spread across calls) + Step 4: Merge top ideas from each chunk → final ranking + Total cost: ~$2-3 instead of $50 +``` + +### P7: Structured Error Recovery (Proposed) + +When a cross or judge call fails: +- Save the partial results immediately (don't wait for the full run) +- Emit a machine-readable error event (not just a log warning) +- Support `--retry-failed` to re-run only the failed crosses without repeating successful ones +- Checkpoint progress to disk so interrupted runs can resume + +## Impact + +- **Financial:** $50.71 wasted on a single failed run +- **User trust:** Zero ideas delivered despite ~7M tokens processed +- **Time:** ~15 minutes of compute time, plus overnight delay in reporting results + +## Lessons + +1. 
**First run of any new feature on a large brain should be dry-run or capped.** The estimate was based on small-brain testing; 13K pages is a different universe. +2. **Cost estimators must account for actual data cardinality, not just configured parameters.** The estimate used `m=12` but the real far set was `|prefixes|`. +3. **Every LLM-calling function needs a budget.** This isn't just a brainstorm problem — it's an architectural gap in any system that makes variable numbers of LLM calls based on data size. +4. **JSON serialization of user content is a landmine.** Any page could contain invalid Unicode. Sanitize at the serialization boundary, not per-feature. + +## Shipped in v0.37.x (the budget cathedral wave) + +P1-P4 already shipped via PR #1234 (the first fix wave). P5-P7 plus a few +architectural rounds shipped in the budget-cathedral wave that followed: + +- **P1 (far set cap):** `fetchFar()` in `src/core/brainstorm/domain-bank.ts` + caps prefix sampling to `max(m*4, 50)` and trims final pages to `m` by + distance. The 2K-prefix explosion class is closed. +- **P2 (cost guardrails):** `--max-cost`, `--max-far-set`, `--strict-budget`, + `--judge-model`, `--max-ideas-per-judge-call` flags on brainstorm + lsd. + Pre-flight estimate refusal, mid-run cost-ceiling abort. +- **P3 (judge chunking):** `runJudge` in `src/core/brainstorm/judges.ts` + auto-chunks at 100 ideas/call. Context-window overflow is structurally + prevented. +- **P4 (unicode sanitization):** `sanitizeUnicode` in + `src/core/brainstorm/orchestrator.ts` strips unpaired surrogates before + serialization. +- **P5 (BudgetTracker at the gateway layer):** new + `src/core/budget/budget-tracker.ts` is the canonical primitive. The + gateway's `withBudgetTracker(tracker, fn)` composes via + `AsyncLocalStorage` so every gateway-routed LLM call + inside the scope auto-records. `BudgetExhausted` is a typed error with + `reason: 'cost' | 'runtime' | 'no_pricing'`. 
`record()` throws when + cumulative spend exceeds the cap (TX1). `reserve()` hard-fails on + `no_pricing` when the cap is set + model missing from pricing maps (TX2). +- **P6 (payload-fitter):** `src/core/diarize/payload-fitter.ts` with + `'batch'` and `'summarize'` strategies. Summarize embed-clusters + (k=ceil(items/4)), Haiku-summarizes each cluster in parallel via + `Promise.allSettled` at parallelism=4. Surfaces `degraded: true` flag + when success ratio < 0.75 so callers decide whether to surface a partial + result or abort. +- **P7 (brainstorm checkpoint + --resume):** + `src/core/brainstorm/checkpoint.ts` persists FULL idea bodies (not just + counts — TX3 load-bearing). One `--resume ` flag covers both + failed and never-attempted crosses (TX4). `run_id` formula uses NO + embedding bits so the identity is stable across embedding-model swaps + (A5 amended). 7-day mtime-based GC wired into the cycle purge phase. + `--list-runs` lists saved checkpoints. `--force-resume` bypasses the 7d + staleness gate. + +Also shipped alongside the wave (folded inline): + +- **doctor --remediate --resume:** A4 amended. The mid-run cap is now a + real ceiling; `--max-cost` is an alias for `--max-usd`. On + BudgetExhausted, the orchestrator persists a checkpoint at + `~/.gbrain/remediation/.json` and tells the user the exact + `gbrain doctor --remediate --resume` command. The resumed run skips + already-completed steps. +- **Audit-week-file consolidation (Q1):** four call sites + (shell-jobs / phantoms / slug-fallback / dream-budget) now share one + ISO-week filename helper. Year-boundary correctness pinned by tests. +- **eval-contradictions tracker telemetry:** the existing CostTracker + stays for the report shape; the runner additionally installs a + withBudgetTracker scope for the gateway-layer telemetry path. + +What did NOT make this wave (filed in TODOS for a follow-up): + +- The schema fix for `page_links` on PGLite. 
The brainstorm domain-bank + queries reference `page_links` but the embedded schema only defines + `links`; the E2E works around this with a view in test setup, but + real PGLite users currently can't run `gbrain brainstorm`. Schema fix + needed. +- `--max-cost` flag on `extract`, `enrich`, `integrity auto`. The + gateway-layer enforcement covers them when wrapped at the entrypoint, + but the CLI flag wiring is deferred. +- Async-batched audit writes. Sync `appendFileSync` is fine at typical + volumes; revisit if profiling shows it dominates. +- Multi-day brainstorm resume (>7d). The `--force-resume` flag is the + operator escape hatch for now. diff --git a/llms-full.txt b/llms-full.txt index 3118aa510..05fe1e786 100644 --- a/llms-full.txt +++ b/llms-full.txt @@ -243,6 +243,12 @@ strict behavior when unset. - `src/core/ai/recipes/voyage.ts` — Voyage AI openai-compatible recipe. **v0.28.7 (#680):** declares `chars_per_token=1` + `safety_factor=0.5` so the gateway pre-splits Voyage batches at a 60K-character budget (50% of 120K-token cap with the dense-tokenizer ratio). Closes the v0.27 backfill loop where ~26% of the corpus stayed un-embedded because tiktoken-grounded budgeting silently undercounted Voyage's actual token usage. **v0.28.11 (#719):** declares `multimodal_models: ['voyage-multimodal-3']` so the gateway rejects text-only Voyage models pointed at the multimodal endpoint with a clear `AIConfigError` instead of waiting for Voyage's HTTP 400. **v0.33.1.1 (#962, fixup):** recipe docstring at `:7-16` tightened to name the seven hosted flexible-dim models that accept `output_dimension` explicitly (`voyage-4-large`, `voyage-4`, `voyage-4-lite`, `voyage-3-large`, `voyage-3.5`, `voyage-3.5-lite`, `voyage-code-3`) and call out that `voyage-4-nano` is the open-weight variant listed separately by Voyage as fixed 1024-dim — does NOT accept the parameter. 
The "all v4 variants are flexible" misread is what caused the original PR to include nano in `VOYAGE_OUTPUT_DIMENSION_MODELS`; the negative regression assertion in `test/ai/gateway.test.ts` (`dimsProviderOptions` returns `undefined` for `voyage-4-nano`) pins the contract. **v0.37.3.0:** `voyage-code-3` is the recommended embedding model for gstack per-worktree code brains (Topology 3 in `docs/architecture/topologies.md`). Registration was already in the `models` list since pre-v0.33; the v0.37.3.0 wave adds discoverability surfaces — decision-tree branch in `docs/integrations/embedding-providers.md`, Topology 3 "Recommended embedding model" subsection, runtime nudge from `gbrain reindex --code` against non-code-tuned models. Recipe-shape regression pinned by `test/ai/voyage-code-3-recipe.test.ts`. - `src/core/ai/recipes/anthropic.ts` — Anthropic recipe (chat + expansion touchpoints). **v0.31.12:** chat and expansion `models:` lists drop the v0.31.6 phantom `claude-sonnet-4-6-20250929` date suffix — canonical id is `claude-sonnet-4-6`. The wrong-direction alias `claude-sonnet-4-6 → claude-sonnet-4-6-20250929` is removed; a reverse alias `claude-sonnet-4-6-20250929 → claude-sonnet-4-6` keeps stale user configs working (rescues `facts.extraction_model` and `models.dream.synthesize` set by v0.31.6 installs). Recipe-shape regression pinned by `test/anthropic-model-ids.test.ts` (6 cases, verbatim cherry-pick of PR #830 plus the reverse-alias rescue case). - `src/core/anthropic-pricing.ts` — Single source of truth for Anthropic model pricing (per-MTok input/output). **v0.31.12:** Opus 4.7 corrected from `$15/$75` to `$5/$25` (the old number was from Opus 4 generation, never refreshed when 4.7 shipped); Opus 4.6 also corrected. Consumed by `src/core/budget-meter.ts` and `src/core/cross-modal-eval/runner.ts` — the cross-modal estimator now reads `ANTHROPIC_PRICING` for Anthropic models instead of duplicating the table, killing the v0.31.6 drift bug class. 
+- `src/core/budget/budget-tracker.ts` (v0.37.x) — keystone primitive for the brainstorm cost-cathedral wave. One typed error (`BudgetExhausted` with `reason: 'cost' | 'runtime' | 'no_pricing'`), one schema-stable audit JSONL at `~/.gbrain/audit/budget-YYYY-Www.jsonl`. Contracts pinned by 18 unit cases: **TX1** — `record()` throws when cumulative spend exceeds cap (the cap is a real ceiling, not a suggestion); **TX2** — `reserve()` hard-fails with `reason: 'no_pricing'` when `maxCostUsd` is set AND the model is missing from pricing maps (warn-once preserved when cap is unset); **A3 amended** — `extractUsageFromError(err, fallback)` returns `err.usage` when SDK provides it, else the pessimistic fallback (caller passes `maxOutputTokens`, not the optimistic pre-call estimate). `onExhausted(cb)` callback fires once synchronously BEFORE the throw propagates so callers can persist checkpoints. Replaces three parallel copies (inline brainstorm class, cycle/budget-meter, eval-contradictions). Adapts the old `BudgetMeter` via T5 (public shape preserved + `schema_version: 1` stamped on every dream-budget audit line). +- `src/core/audit-week-file.ts` (v0.37.x, Q1) — single source of truth for ISO-week audit JSONL filename math. Exports `isoWeek(d)`, `isoWeekFilename(prefix, now?)`, `resolveAuditDir()` (honors `GBRAIN_AUDIT_DIR`). Year-boundary correctness pinned by tests at 2020-W53 (the 53-week year), 2025-W01 rolling in from 2024-12-30 (Monday), 2026-W01. Four call sites migrated in T4: `src/core/minions/handlers/shell-audit.ts`, `src/core/facts/phantom-audit.ts`, `src/core/audit-slug-fallback.ts`, `src/core/cycle/budget-meter.ts`. Each call site keeps its `computeAuditFilename` thin wrapper for back-compat with existing tests. +- `src/core/ai/gateway.ts:withBudgetTracker` (v0.37.x, T3 / TX5) — gateway-layer enforcement via `AsyncLocalStorage`. 
`withBudgetTracker(tracker, fn)` installs the tracker on the module-internal store; every `gateway.chat / embed / rerank` call inside the scope auto-composes (reserve before, record in try/finally). Outside-scope calls are budget no-ops (current behavior preserved). Nested scopes restore the outer tracker on exit. `getCurrentBudgetTracker()` is the test seam. The chat path uses A3-amended pessimistic fallback on error paths; the embed path estimates input tokens from char count × recipe's `chars_per_token` because the AI SDK doesn't surface per-batch embed token usage; the rerank path estimates char count of query+docs. 6 unit cases pin the contract. +- `src/core/diarize/payload-fitter.ts` (v0.37.x, P6 / Q3) — generic fit-arbitrarily-large-items-into-per-call-token-budget utility. `'batch'` strategy is deterministic token-budgeted chunking with no LLM calls. `'summarize'` strategy embed-clusters into ceil(items/4) groups via cheap deterministic nearest-neighbor on cosine, Haiku-summarizes each cluster via `Promise.allSettled` at parallelism=4 (Perf1). Each Haiku call composes the active BudgetTracker via T3's AsyncLocalStorage. The quality gate (codex outside-voice finding #4): when `success_ratio < min_success_ratio` (default 0.75), result is flagged `degraded: true` — the fitter preserves the successful subset; the caller decides whether to surface a partial result or abort. +- `src/core/brainstorm/checkpoint.ts` (v0.37.x, P7 / TX3+TX4+A5 amended) — crash-resilient checkpoint for `gbrain brainstorm` and `gbrain lsd`. Persists FULL idea bodies (~50KB per run) so resume can MERGE the pre-crash ideas with the post-resume ideas before the judge runs (codex's load-bearing finding — a resume that produces only second-run output is silent partial output). `run_id = sha256(question + profile + sort(close_slugs) + sort(far_slugs)).slice(0,16)` — NO embedding bits, stable across embedding-model swaps. Atomic write via `.tmp + rename`. 
ONE resume flag (`--resume ` — the proposed `--retry-failed` was dropped per TX4: failed AND never-attempted crosses both go through `--resume`). `--list-runs` prints saved run_ids mtime-newest-first. `--force-resume` bypasses the 7-day staleness gate. The cycle purge phase (`gbrain dream --phase purge`) GCs checkpoints older than 7 days via `gcStaleCheckpoints(7)`. Pinned by 20 unit cases + 3 E2E cases in `test/e2e/brainstorm-resume.test.ts` including the load-bearing merge contract. +- `src/core/remediation-checkpoint.ts` (v0.37.x, T7 / A4 amended) — `doctor --remediate` checkpoint at `~/.gbrain/remediation/.json`. `plan_hash = sha256(JSON.stringify(sorted recommendation ids)).slice(0,16)`. Schema-versioned. Atomic write via `.tmp + rename`. `gbrain doctor --remediate --resume ` (or with no arg — picks the newest matching checkpoint) loads it and skips already-completed steps. Mismatched plan_hash refuses with a paste-ready message. Cleared on clean completion. Pinned by 13 unit cases. - `src/core/model-config.ts` — Model-string resolution (the seam every internal LLM call walks through). **v0.31.12:** four-tier system (`ModelTier = 'utility' | 'reasoning' | 'deep' | 'subagent'`) with `TIER_DEFAULTS` (utility→haiku-4-5, reasoning→sonnet-4-6, deep→opus-4-7, subagent→sonnet-4-6) and `tier?: ModelTier` on `ResolveModelOpts`. Resolution chain is now 8 steps: cliFlag → deprecated key → config key → `models.default` → `models.tier.` → env var → `TIER_DEFAULTS[tier]` → caller fallback. Two new exports — `isAnthropicProvider(modelString)` checks `provider:model` prefix OR `claude-` bare-id pattern, and `enforceSubagentAnthropic()` is the layer-2 runtime guard: when `tier === 'subagent'` resolves to a non-Anthropic provider, it emits a once-per-`(source, model)` stderr warn AND falls back to `TIER_DEFAULTS.subagent` instead of letting the Anthropic Messages API tool-loop attempt to run on OpenAI/Gemini. 
`_resetDeprecationWarningsForTest()` now also clears `_subagentTierWarningsEmitted` so tests re-emit. - `src/core/ai/model-resolver.ts` — Recipe-touchpoint validator. **v0.31.12:** `assertTouchpoint(recipe, touchpoint, modelId, extendedModels?)` gains an optional 4th `extendedModels: ReadonlySet` argument. When the modelId is in that set, the native-recipe allowlist throw is bypassed — the user explicitly opted into this model via config so we let provider rejection surface as `model_not_found` at HTTP call time (and `gbrain models doctor` catches it earlier). Default code paths with hardcoded model strings MUST NOT pass `extendedModels` — typos in source code still fail fast. Replaces the earlier plan to soften the validator wholesale (Codex F4/F5 in plan review flagged that as too broad — it would have removed the fail-fast contract for chat + expand + embed all three). - `src/core/ai/gateway.ts` extension (v0.31.12) — new module-scoped `_extendedModels: Map>` registry feeds `assertTouchpoint`'s 4th-arg path. New `reconfigureGatewayWithEngine(engine)` async function is called from `cli.ts` after `engine.connect()` (and before every command except `CLI_ONLY` no-DB commands) — re-resolves expansion + chat defaults through `resolveModel()` so `models.tier.*` and `models.default` overrides apply to expansion + chat both. `DEFAULT_CHAT_MODEL` corrected to `anthropic:claude-sonnet-4-6` (was the v0.31.6 phantom `-20250929`). New `__setChatTransportForTests` seam mirrors `__setEmbedTransportForTests` so tests drive `chat()` with a stubbed transport. 
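Reviewer note on the `withBudgetTracker` entry in the `llms-full.txt` hunk above: the scoping behavior it describes (reserve before the call, record after, no-op outside a scope, outer tracker restored when a nested scope exits) can be illustrated with a minimal sketch. Everything suffixed `Sketch`, along with `fakeGatewayChat`, is a hypothetical stand-in written for this note and is not the gbrain implementation. The sketch assumes that `reserve()` refuses when the estimate would cross the cap and that `record()` throws once cumulative spend exceeds it. Only the `reason` union and the use of `AsyncLocalStorage` are taken from the description above.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

type ExhaustedReason = "cost" | "runtime" | "no_pricing";

// Hypothetical stand-in for the typed error; field names are assumptions.
class BudgetExhaustedSketch extends Error {
  reason: ExhaustedReason;
  spent: number;
  cap: number;
  constructor(reason: ExhaustedReason, spent: number, cap: number) {
    super(`budget exhausted (${reason}): spent $${spent.toFixed(2)} of $${cap.toFixed(2)}`);
    Object.setPrototypeOf(this, new.target.prototype); // keep instanceof reliable
    this.name = "BudgetExhaustedSketch";
    this.reason = reason;
    this.spent = spent;
    this.cap = cap;
  }
}

// Hypothetical tracker: an undefined cap means "no ceiling".
class BudgetTrackerSketch {
  private spentUsd = 0;
  private maxCostUsd: number | undefined;
  constructor(maxCostUsd?: number) {
    this.maxCostUsd = maxCostUsd;
  }
  // Before a provider call: refuse if the estimate would cross the cap.
  reserve(estimatedUsd: number): void {
    const cap = this.maxCostUsd;
    if (cap !== undefined && this.spentUsd + estimatedUsd > cap) {
      throw new BudgetExhaustedSketch("cost", this.spentUsd, cap);
    }
  }
  // After a provider call: account the actual cost, throw if over the cap.
  record(actualUsd: number): void {
    this.spentUsd += actualUsd;
    const cap = this.maxCostUsd;
    if (cap !== undefined && this.spentUsd > cap) {
      throw new BudgetExhaustedSketch("cost", this.spentUsd, cap);
    }
  }
  get spent(): number {
    return this.spentUsd;
  }
}

// Module-internal store: the active tracker travels with the call context.
const trackerStore = new AsyncLocalStorage<BudgetTrackerSketch>();

function withBudgetTrackerSketch<T>(tracker: BudgetTrackerSketch, fn: () => T): T {
  return trackerStore.run(tracker, fn);
}

function currentTrackerSketch(): BudgetTrackerSketch | undefined {
  return trackerStore.getStore();
}

// Stand-in for a gateway call: composes with whatever tracker is in scope.
function fakeGatewayChat(costUsd: number): string {
  const tracker = currentTrackerSketch();
  tracker?.reserve(costUsd); // throws before any "provider call" is made
  // ...the real provider call would happen here...
  tracker?.record(costUsd);
  return "ok";
}

const demoTracker = new BudgetTrackerSketch(1.0);
withBudgetTrackerSketch(demoTracker, () => {
  fakeGatewayChat(0.4);
  fakeGatewayChat(0.4);
});
console.log(`spent so far: $${demoTracker.spent.toFixed(2)}`);
```

Because the tracker lives in the async context and not in function arguments, call sites deep inside a handler need no signature changes; the trade-off is that a call made outside any scope is silently unbudgeted, which matches the "outside-scope calls are budget no-ops" behavior described above.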
diff --git a/package.json b/package.json index ad92ec36a..a8c18e59c 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "gbrain", - "version": "0.38.2.0", + "version": "0.39.0.0", "description": "Postgres-native personal knowledge brain with hybrid RAG search", "type": "module", "main": "src/core/index.ts", diff --git a/src/commands/brainstorm.ts b/src/commands/brainstorm.ts index 2d66b02be..b6fc738a0 100644 --- a/src/commands/brainstorm.ts +++ b/src/commands/brainstorm.ts @@ -29,6 +29,22 @@ export interface BrainstormCliArgs { save?: boolean; yes: boolean; limit?: number; + /** Cost ceiling in USD; aborts pre-run if estimate exceeds. Default $5. */ + maxCost?: number; + /** Hard cap on far-set prefix sampling. Default 50. */ + maxFarSet?: number; + /** When true, abort mid-run if running spend exceeds 5× estimate. */ + strictBudget?: boolean; + /** Override the model used for the judge phase. */ + judgeModel?: string; + /** Max ideas per judge LLM call. Default 100. */ + maxIdeasPerJudgeCall?: number; + /** TX4: resume a crashed run by run_id. */ + resume?: string; + /** Bypass the 7-day staleness gate on resume. */ + forceResume?: boolean; + /** When true, print the list of saved runs + exit. */ + listRuns?: boolean; help: boolean; error?: string; } @@ -57,6 +73,50 @@ export function parseBrainstormArgs(args: string[]): BrainstormCliArgs { return out; } out.limit = n; + } else if (arg === '--max-cost') { + const v = args[++i]; + const n = v ? parseFloat(v) : NaN; + if (!Number.isFinite(n) || n <= 0) { + out.error = `--max-cost requires a positive number in USD (got ${v})`; + return out; + } + out.maxCost = n; + } else if (arg === '--max-far-set') { + const v = args[++i]; + const n = v ? 
parseInt(v, 10) : NaN; + if (!Number.isFinite(n) || n <= 0) { + out.error = `--max-far-set requires a positive integer (got ${v})`; + return out; + } + out.maxFarSet = n; + } else if (arg === '--strict-budget') { + out.strictBudget = true; + } else if (arg === '--judge-model') { + const v = args[++i]; + if (!v) { + out.error = `--judge-model requires a model id (e.g. anthropic:claude-sonnet-4-6)`; + return out; + } + out.judgeModel = v; + } else if (arg === '--max-ideas-per-judge-call') { + const v = args[++i]; + const n = v ? parseInt(v, 10) : NaN; + if (!Number.isFinite(n) || n <= 0) { + out.error = `--max-ideas-per-judge-call requires a positive integer (got ${v})`; + return out; + } + out.maxIdeasPerJudgeCall = n; + } else if (arg === '--resume') { + const v = args[++i]; + if (!v || v.startsWith('--')) { + out.error = `--resume requires a run_id (use --list-runs to see saved runs)`; + return out; + } + out.resume = v; + } else if (arg === '--force-resume') { + out.forceResume = true; + } else if (arg === '--list-runs') { + out.listRuns = true; } else if (arg.startsWith('--')) { out.error = `unknown flag: ${arg}`; return out; @@ -79,12 +139,20 @@ them, judges with a 5-axis rubric. Output cites close + far slugs with a 0-1 distance score so you can see how far each collision actually traveled. 
Options: - --json Emit BrainstormResult as JSON (for agents) - --save Save to wiki/ideas/-brainstorm-.md (default ON) - --no-save Don't save; print only - --yes, -y Skip the 10s cost-preview wait (TTY only) - --limit N Override the far-bank size (default 6 brainstorm / 12 LSD) - --help, -h Show this help + --json Emit BrainstormResult as JSON (for agents) + --save Save to wiki/ideas/-brainstorm-.md (default ON) + --no-save Don't save; print only + --yes, -y Skip the 10s cost-preview wait (TTY only) + --limit N Override the far-bank size (default 6 brainstorm / 12 LSD) + --max-cost USD Abort if estimated cost exceeds USD (default 5) + --max-far-set N Cap domain bank prefix sampling (default 50) + --strict-budget Abort if running cost exceeds 5× the estimate + --judge-model MODEL Override the judge LLM (larger-context for big runs) + --max-ideas-per-judge-call N Max ideas per judge LLM call (default 100) + --resume RUN_ID Resume a previously-crashed run (uses --list-runs ids) + --force-resume Bypass the 7-day staleness gate on --resume + --list-runs Print saved run_ids and exit + --help, -h Show this help Examples: gbrain brainstorm "why are AI coding tools converging on the same UX?" @@ -107,11 +175,19 @@ have thought of this without LSD"), every idea must invert at least one implicit axiom. Output is ephemeral by default — pass --save if an idea lands. 
Options: - --json Emit BrainstormResult as JSON - --save Persist to wiki/ideas/-lsd-.md (default OFF) - --yes, -y Skip the 10s cost-preview wait (TTY only) - --limit N Override the far-bank size (default 12) - --help, -h Show this help + --json Emit BrainstormResult as JSON + --save Persist to wiki/ideas/-lsd-.md (default OFF) + --yes, -y Skip the 10s cost-preview wait (TTY only) + --limit N Override the far-bank size (default 12) + --max-cost USD Abort if estimated cost exceeds USD (default 5) + --max-far-set N Cap domain bank prefix sampling (default 50) + --strict-budget Abort if running cost exceeds 5× the estimate + --judge-model MODEL Override the judge LLM (larger-context for big runs) + --max-ideas-per-judge-call N Max ideas per judge LLM call (default 100) + --resume RUN_ID Resume a previously-crashed run (uses --list-runs ids) + --force-resume Bypass the 7-day staleness gate on --resume + --list-runs Print saved run_ids and exit + --help, -h Show this help Examples: gbrain lsd "why are AI coding tools converging on the same UX?" 
@@ -140,6 +216,24 @@ async function runBrainstormCli( process.exit(2); return; } + if (parsed.listRuns) { + const { listRuns } = await import('../core/brainstorm/checkpoint.ts'); + const runs = listRuns(); + if (parsed.json) { + console.log(JSON.stringify(runs, null, 2)); + } else if (runs.length === 0) { + console.log('No saved brainstorm runs.'); + } else { + console.log('Saved runs (newest first):'); + console.log('run_id | iso_date | question'); + console.log('------------------+---------------------------+----------------'); + for (const r of runs) { + const iso = new Date(r.mtime).toISOString(); + console.log(`${r.run_id} | ${iso} | ${r.question.slice(0, 60)}`); + } + } + return; + } if (!parsed.question || parsed.question.trim().length === 0) { console.error(`gbrain ${profile.label}: question required`); console.error(help); @@ -160,6 +254,13 @@ async function runBrainstormCli( question: parsed.question, profile: effectiveProfile, skipCostPreview: skipPreview, + maxCostUsd: parsed.maxCost, + maxFarSet: parsed.maxFarSet, + strictBudget: parsed.strictBudget, + judgeModel: parsed.judgeModel, + maxIdeasPerJudgeCall: parsed.maxIdeasPerJudgeCall, + resumeRunId: parsed.resume, + forceResume: parsed.forceResume, }); if (parsed.json) { diff --git a/src/commands/doctor.ts b/src/commands/doctor.ts index 51c4bb6fb..9e4d9a82a 100644 --- a/src/commands/doctor.ts +++ b/src/commands/doctor.ts @@ -4348,13 +4348,36 @@ export async function runRemediate( ): Promise { const targetScore = parseIntFlag(args, '--target-score') ?? 90; const maxJobs = parseIntFlag(args, '--max-jobs') ?? Infinity; - const maxUsd = parseFloatFlag(args, '--max-usd'); + // A4 amended: --max-cost is an alias for --max-usd. Both spellings are + // documented as the cron-safety guard. Either threads through to the + // pre-flight estimate refusal AND, via withBudgetTracker, the mid-run + // BudgetExhausted hard-throw. + const maxUsd = parseFloatFlag(args, '--max-usd') ?? 
parseFloatFlag(args, '--max-cost'); const dryRun = args.includes('--dry-run'); const skipConfirm = args.includes('--yes'); const jsonOutput = args.includes('--json'); + // A4 amended: --resume loads the checkpoint for the active + // (engine,target) and continues from the next step. With no value, the + // most recent checkpoint for the active engine is loaded. + const resumeFlagIdx = args.indexOf('--resume'); + const resumeMode = resumeFlagIdx !== -1; + const resumeArg = resumeMode ? args[resumeFlagIdx + 1] : undefined; + const resumePlanHash = resumeArg && !resumeArg.startsWith('--') ? resumeArg : undefined; const { computeRecommendations, classifyChecks, maxReachableScore } = await import('../core/brain-score-recommendations.ts'); + const { + BudgetTracker, + BudgetExhausted, + } = await import('../core/budget/budget-tracker.ts'); + const { withBudgetTracker } = await import('../core/ai/gateway.ts'); + const { + computePlanHash, + saveRemediationCheckpoint, + loadRemediationCheckpoint, + listRemediationCheckpoints, + clearRemediationCheckpoint, + } = await import('../core/remediation-checkpoint.ts'); const ctx = await loadRecommendationContext(engine); @@ -4384,6 +4407,46 @@ export async function runRemediate( return; } + // A4 amended: compute plan_hash off the active recommendation ids so the + // checkpoint binds to THIS plan. Resume only fires for matching plans. + const planHash = computePlanHash(recs.map((r) => r.id)); + let completedFromCheckpoint = new Set(); + if (resumeMode) { + const requested = resumePlanHash; + let cp = requested ? loadRemediationCheckpoint(requested) : null; + if (!cp && !requested) { + // No explicit hash: try newest checkpoint that matches the active plan. 
+ const recent = listRemediationCheckpoints(); + for (const e of recent) { + const candidate = loadRemediationCheckpoint(e.plan_hash); + if (candidate && candidate.plan_hash === planHash) { + cp = candidate; + break; + } + } + } + if (!cp) { + console.error( + `[remediate --resume] no matching checkpoint found ` + + `(plan_hash=${planHash}${requested ? `; requested=${requested}` : ''}). ` + + `Run without --resume to start fresh.`, + ); + process.exit(2); + } + if (cp.plan_hash !== planHash) { + console.error( + `[remediate --resume] checkpoint plan_hash=${cp.plan_hash} does not match active plan_hash=${planHash}. ` + + `The plan has changed (brain state moved). Run without --resume to start fresh.`, + ); + process.exit(2); + } + completedFromCheckpoint = new Set(cp.completed.map((c) => c.id)); + console.error( + `[remediate --resume] resuming plan_hash=${planHash}: ${completedFromCheckpoint.size} step(s) completed, ` + + `${recs.length - completedFromCheckpoint.size} remaining.`, + ); + } + const estTotalUsd = recs.reduce((sum, r) => sum + (r.est_usd_cost ?? 0), 0); if (maxUsd !== null && estTotalUsd > maxUsd) { console.error( @@ -4419,61 +4482,132 @@ export async function runRemediate( const { waitForCompletion } = await import('../core/minions/wait-for-completion.ts'); const queue = new MinionQueue(engine); - let stepCount = 0; - while (recs.length > 0 && stepCount < maxJobs) { - const step = recs[0]; - if (!step) break; - stepCount++; + // A4 amended: install a BudgetTracker scope around the plan-step loop so + // any gateway.chat / embed / rerank inside a Minion handler (synthesize, + // patterns, consolidate) auto-enforces the cap. On BudgetExhausted, the + // onExhausted callback persists the checkpoint BEFORE the throw propagates; + // the catch surfaces the actionable --resume hint. + const remediateTracker = new BudgetTracker({ + label: 'doctor.remediate', + maxCostUsd: maxUsd ?? 
undefined, + }); + + let exhaustionSnapshot: { spent: number; cap: number; reason: string; model_id?: string } | undefined; + remediateTracker.onExhausted(() => { + // BudgetTracker fires this synchronously from inside reserve()/record() + // before the throw bubbles. Persist whatever has been done so far. + const cp = { + schema_version: 1 as const, + plan_hash: planHash, + doctor_run_id: doctorRunId, + target_score: targetScore, + started_at: new Date().toISOString(), + completed: submitted + .filter((s) => s.status === 'completed') + .map((s) => ({ id: s.id, job: '', status: s.status, job_id: s.job_id ?? null })), + aborted_at: new Date().toISOString(), + abort_reason: 'budget_exhausted' as const, + budget_snapshot: exhaustionSnapshot, + }; + saveRemediationCheckpoint(cp); + }); + + const runLoop = async (): Promise<void> => { + let stepCount = 0; + while (recs.length > 0 && stepCount < maxJobs) { + const step = recs[0]; + if (!step) break; + stepCount++; + + // Resume: skip steps that the checkpoint already marked completed. 
+ if (completedFromCheckpoint.has(step.id)) { + submitted.push({ step: stepCount, id: step.id, job_id: null, status: 'completed' }); + recs.shift(); + continue; + } - // D5: if depends_on intersects aborted, skip + cascade - if (step.depends_on && step.depends_on.some((d) => abortedIds.has(d))) { - submitted.push({ step: stepCount, id: step.id, job_id: null, status: 'skipped_dep_aborted' }); - abortedIds.add(step.id); - recs.shift(); - continue; - } + // D5: if depends_on intersects aborted, skip + cascade + if (step.depends_on && step.depends_on.some((d) => abortedIds.has(d))) { + submitted.push({ step: stepCount, id: step.id, job_id: null, status: 'skipped_dep_aborted' }); + abortedIds.add(step.id); + recs.shift(); + continue; + } - try { - const isProtected = !!step.protected; - const job = await queue.add( - step.job, - { ...step.params, doctor_run_id: doctorRunId }, - { - queue: 'default', - idempotency_key: step.idempotency_key, - max_attempts: 2, - maxWaiting: 1, - }, - isProtected ? { allowProtectedSubmit: true } : undefined, - ); - submitted.push({ step: stepCount, id: step.id, job_id: job.id, status: 'submitted' }); + try { + const isProtected = !!step.protected; + const job = await queue.add( + step.job, + { ...step.params, doctor_run_id: doctorRunId }, + { + queue: 'default', + idempotency_key: step.idempotency_key, + max_attempts: 2, + maxWaiting: 1, + }, + isProtected ? { allowProtectedSubmit: true } : undefined, + ); + submitted.push({ step: stepCount, id: step.id, job_id: job.id, status: 'submitted' }); - // Wait for terminal state. PGLite is in-process — short poll. - const terminal = await waitForCompletion(queue, job.id, { - pollMs: isPGLite ? 250 : 1000, - timeoutMs: (step.est_seconds + 60) * 1000, - }); - const lastSub = submitted[submitted.length - 1]; - if (lastSub) lastSub.status = terminal.status; + // Wait for terminal state. PGLite is in-process — short poll. + const terminal = await waitForCompletion(queue, job.id, { + pollMs: isPGLite ? 
250 : 1000, + timeoutMs: (step.est_seconds + 60) * 1000, + }); + const lastSub = submitted[submitted.length - 1]; + if (lastSub) lastSub.status = terminal.status; - if (terminal.status !== 'completed') { + if (terminal.status !== 'completed') { + abortedIds.add(step.id); + } + } catch (e) { + if (e instanceof BudgetExhausted) { + exhaustionSnapshot = { + spent: e.spent, + cap: e.cap, + reason: e.reason, + model_id: e.modelId, + }; + throw e; + } + submitted.push({ + step: stepCount, id: step.id, job_id: null, + status: `error: ${(e as Error).message.slice(0, 100)}`, + }); abortedIds.add(step.id); } - } catch (e) { - submitted.push({ - step: stepCount, id: step.id, job_id: null, - status: `error: ${(e as Error).message.slice(0, 100)}`, - }); - abortedIds.add(step.id); + + recs.shift(); + // D7: scoped recheck — re-compute plan from fresh health snapshot. + // The next plan may drop completed steps and re-introduce failed + // steps with bumped retry suffix (D1). + if (recs.length === 0 || stepCount >= maxJobs) break; + const freshHealth = await engine.getHealth(); + recs = computeRecommendations(freshHealth, ctx).filter((r) => r.status === 'remediable'); + } + }; + + let budgetExhaustedAt: InstanceType<typeof BudgetExhausted> | null = null; + try { + await withBudgetTracker(remediateTracker, runLoop); + } catch (err) { + if (err instanceof BudgetExhausted) { + budgetExhaustedAt = err; + console.error( + `\n[remediate] BudgetExhausted (${err.reason}): spent $${err.spent.toFixed(4)} > cap $${err.cap.toFixed(2)}.\n` + + `Checkpoint saved. Resume with:\n` + + ` gbrain doctor --remediate --resume ${planHash}\n`, + ); + } else { + throw err; } + } - recs.shift(); - // D7: scoped recheck — re-compute plan from fresh health snapshot. - // The next plan may drop completed steps and re-introduce failed - // steps with bumped retry suffix (D1). 
- if (recs.length === 0 || stepCount >= maxJobs) break; - const freshHealth = await engine.getHealth(); - recs = computeRecommendations(freshHealth, ctx).filter((r) => r.status === 'remediable'); + // Clear checkpoint on a clean run (no budget abort). Failed steps in the + // submitted set don't disqualify the cleanup — they re-surface on the + // next plan with bumped suffixes. + if (!budgetExhaustedAt) { + clearRemediationCheckpoint(planHash); } const finalHealth = await engine.getHealth(); @@ -4495,7 +4629,7 @@ export async function runRemediate( } const anyFailed = submitted.some((s) => s.status !== 'completed' && s.status !== 'submitted'); - if (anyFailed) process.exit(1); + if (budgetExhaustedAt || anyFailed) process.exit(1); } /** diff --git a/src/commands/reindex-code.ts b/src/commands/reindex-code.ts index 527a0610f..527c400f7 100644 --- a/src/commands/reindex-code.ts +++ b/src/commands/reindex-code.ts @@ -31,6 +31,8 @@ import { errorFor, serializeError } from '../core/errors.ts'; import { createInterface } from 'readline'; import { createProgress } from '../core/progress.ts'; import { getCliOptions, cliOptsToProgressOptions } from '../core/cli-options.ts'; +import { BudgetTracker, BudgetExhausted } from '../core/budget/budget-tracker.ts'; +import { withBudgetTracker } from '../core/ai/gateway.ts'; export interface ReindexCodeOpts { sourceId?: string; @@ -41,6 +43,15 @@ export interface ReindexCodeOpts { noEmbed?: boolean; /** Page batch size. Default 100 (codex Finding 4.4 OOM protection). */ batchSize?: number; + /** + * Cap embedding spend in USD. Default undefined = no cap (legacy behavior). + * When set, the reindex body runs inside a `withBudgetTracker` scope so + * every `gateway.embed()` call inside `importCodeFile` composes with the + * cap. Throws BudgetExhausted (reason='cost') when cumulative exceeds the + * cap; partial progress is preserved (already-imported pages stay + * imported, the throw aborts the remaining batch). 
+ */ + maxCostUsd?: number; } export interface ReindexCodeResult { @@ -229,51 +240,99 @@ export async function runReindexCode( let failed = 0; const failures: Array<{ slug: string; error: string }> = []; let offset = 0; + let budgetExhausted: BudgetExhausted | null = null; - try { - while (true) { - const batch = await fetchCodePages(engine, opts.sourceId, batchSize, offset); - if (batch.length === 0) break; - - for (const row of batch) { - const fm = row.frontmatter ?? {}; - const relPath = typeof fm.file === 'string' ? fm.file : null; - if (!relPath) { - failed++; - failures.push({ slug: row.slug, error: 'missing frontmatter.file' }); - reporter.tick(); - continue; - } - if (!row.compiled_truth) { - failed++; - failures.push({ slug: row.slug, error: 'missing compiled_truth' }); - reporter.tick(); - continue; - } - try { - const result = await importCodeFile(engine, relPath, row.compiled_truth, { - noEmbed: opts.noEmbed, - force: opts.force, - sourceId: opts.sourceId, - }); - if (result.status === 'imported') reindexed++; - else if (result.status === 'skipped') skipped++; - else { + // F3: when --max-cost is set, run the body inside withBudgetTracker so + // every gateway.embed() call inside importCodeFile composes with the cap. + // On BudgetExhausted, we catch + persist what's been imported so far, + // then surface the throw as a partial-progress result the caller can + // re-run. importCodeFile is idempotent (content_hash short-circuit), so + // a re-run picks up where the cap fired. + const reindexBody = async (): Promise<void> => { + try { + while (true) { + const batch = await fetchCodePages(engine, opts.sourceId, batchSize, offset); + if (batch.length === 0) break; + + for (const row of batch) { + const fm = row.frontmatter ?? {}; + const relPath = typeof fm.file === 'string' ?
fm.file : null; + if (!relPath) { + failed++; + failures.push({ slug: row.slug, error: 'missing frontmatter.file' }); + reporter.tick(); + continue; + } + if (!row.compiled_truth) { failed++; - failures.push({ slug: row.slug, error: result.error ?? result.status }); + failures.push({ slug: row.slug, error: 'missing compiled_truth' }); + reporter.tick(); + continue; } - } catch (e: unknown) { - failed++; - failures.push({ slug: row.slug, error: e instanceof Error ? e.message : String(e) }); + try { + const result = await importCodeFile(engine, relPath, row.compiled_truth, { + noEmbed: opts.noEmbed, + force: opts.force, + sourceId: opts.sourceId, + }); + if (result.status === 'imported') reindexed++; + else if (result.status === 'skipped') skipped++; + else { + failed++; + failures.push({ slug: row.slug, error: result.error ?? result.status }); + } + } catch (e: unknown) { + // Budget cap is the one error the per-page catch must NOT swallow. + // Caller's outer catch reports partial progress and exits. + if (e instanceof BudgetExhausted) throw e; + failed++; + failures.push({ slug: row.slug, error: e instanceof Error ? e.message : String(e) }); + } + reporter.tick(); } - reporter.tick(); + + offset += batch.length; + if (batch.length < batchSize) break; } + } finally { + reporter.finish(); + } + }; - offset += batch.length; - if (batch.length < batchSize) break; + try { + if (typeof opts.maxCostUsd === 'number' && opts.maxCostUsd > 0) { + const tracker = new BudgetTracker({ maxCostUsd: opts.maxCostUsd, label: 'reindex-code' }); + await withBudgetTracker(tracker, reindexBody); + } else { + await reindexBody(); + } + } catch (e) { + if (e instanceof BudgetExhausted) { + budgetExhausted = e; + } else { + throw e; } - } finally { - reporter.finish(); + } + + if (budgetExhausted) { + // Partial-progress result: surfaces what got reindexed before the cap + // fired. 
The CLI wrapper translates this into a clear user-facing + // message + non-zero exit; the library result lets agent callers see + // what happened without grep'ing stderr. + return { + status: 'ok', + codePages: totalPages, + reindexed, + skipped, + failed, + totalTokens, + costUsd: budgetExhausted.spent, + model: getEmbeddingModelName(), + failures: [ + { slug: '(budget)', error: budgetExhausted.message }, + ...(failures.length > 0 ? failures : []), + ], + }; } return { @@ -303,8 +362,24 @@ export async function runReindexCodeCli(engine: BrainEngine, args: string[]): Pr const force = args.includes('--force'); const noEmbed = args.includes('--no-embed'); + // F3: --max-cost / --max-cost-usd both accepted for symmetry with brainstorm. + let maxCostUsd: number | undefined; + for (const flag of ['--max-cost', '--max-cost-usd']) { + const idx = args.indexOf(flag); + if (idx >= 0) { + const v = args[idx + 1]; + const n = v ? parseFloat(v) : NaN; + if (!Number.isFinite(n) || n <= 0) { + console.error(`gbrain reindex --code: ${flag} requires a positive number in USD (got ${v ?? 
'(missing)'})`); + process.exit(2); + } + maxCostUsd = n; + break; + } + } + if (dryRun) { - const result = await runReindexCode(engine, { sourceId, dryRun: true, yes, json, force, noEmbed }); + const result = await runReindexCode(engine, { sourceId, dryRun: true, yes, json, force, noEmbed, maxCostUsd }); if (json) { console.log(JSON.stringify(result)); } else { @@ -357,7 +432,7 @@ export async function runReindexCodeCli(engine: BrainEngine, args: string[]): Pr } } - const result = await runReindexCode(engine, { sourceId, yes, json, force, noEmbed }); + const result = await runReindexCode(engine, { sourceId, yes, json, force, noEmbed, maxCostUsd }); if (json) { console.log(JSON.stringify(result)); } else { diff --git a/src/core/ai/gateway.ts b/src/core/ai/gateway.ts index 3060320aa..05ac7a7b3 100644 --- a/src/core/ai/gateway.ts +++ b/src/core/ai/gateway.ts @@ -22,6 +22,7 @@ */ import { embed as aiEmbed, embedMany, generateObject, generateText } from 'ai'; +import { AsyncLocalStorage } from 'node:async_hooks'; import { listRecipes } from './recipes/index.ts'; import { createOpenAI } from '@ai-sdk/openai'; import { createGoogleGenerativeAI } from '@ai-sdk/google'; @@ -29,6 +30,12 @@ import { createAnthropic } from '@ai-sdk/anthropic'; import { createOpenAICompatible } from '@ai-sdk/openai-compatible'; import { z } from 'zod'; +import { + BudgetTracker, + extractUsageFromError as _extractUsageFromError, + type BudgetKind, +} from '../budget/budget-tracker.ts'; + import type { AIGatewayConfig, EmbedMultimodalOpts, @@ -1125,8 +1132,25 @@ export async function embed(texts: string[], opts?: EmbedOpts): Promise (t ?? '').slice(0, MAX_CHARS)); + + // Reserve up front for the worst-case batch token count. Embeddings have + // no output rate, so maxOutputTokens=0. record() at the end uses the + // actual total reported by the SDK across all sub-batches. + if (tracker) { + const charsPerToken = recipe.touchpoints?.embedding?.chars_per_token ?? 
DEFAULT_CHARS_PER_TOKEN; + const totalChars = truncated.reduce((s, t) => s + t.length, 0); + const estimatedInputTokens = Math.ceil(totalChars / Math.max(charsPerToken, 1)); + tracker.reserve({ + modelId: `${recipe.id}:${modelId}`, + estimatedInputTokens, + maxOutputTokens: 0, + kind: 'embed', + label: 'gateway.embed', + }); + } // Dim override (D10) — when caller passes `dimensions`, use it. Otherwise // fall back to the global cfg default. dimsProviderOptions throws a // clear AIConfigError when a Voyage flexible-dim model gets an @@ -1151,13 +1175,40 @@ export async function embed(texts: string[], opts?: EmbedOpts): Promise s + t.length, 0); + const inputTokens = Math.ceil(totalChars / Math.max(charsPerToken, 1)); + try { + tracker.record({ + modelId: `${recipe.id}:${modelId}`, + inputTokens, + outputTokens: 0, + embeddingDims: expected, + kind: 'embed', + label: _embedThrew ? 'gateway.embed.failed' : 'gateway.embed', + }); + } catch { + // BudgetExhausted (TX1) — original throw (if any) wins. + } + } } - - return allEmbeddings; } /** @@ -1940,6 +1991,48 @@ export async function generateOcrText(imageBytes: Buffer, mime: string): Promise return (result.text ?? '').trim(); } +// ---- BudgetTracker scope (TX5) ---- +// +// withBudgetTracker(tracker, fn) installs `tracker` on a module-internal +// AsyncLocalStorage for the duration of `fn`. Every gateway.chat / embed / +// rerank call inside the scope auto-composes — no per-call injection seam +// needed, no flag plumbing through command bodies. +// +// Outside the scope, the gateway functions are budget no-ops (current +// behavior preserved). Nested scopes replace the active tracker for the +// inner closure and restore the outer tracker on exit. +// +// IMPORTANT (A1): for the subagent path, reserve() runs implicitly via the +// gateway BEFORE acquireLease() in src/core/minions/handlers/subagent.ts — +// budget throw → no lease attempted, no rate-lease window held. 
+ +const __budgetStore = new AsyncLocalStorage<BudgetTracker>(); + +export function withBudgetTracker<T>(tracker: BudgetTracker, fn: () => Promise<T>): Promise<T> { + return __budgetStore.run(tracker, fn); +} + +export function getCurrentBudgetTracker(): BudgetTracker | null { + return __budgetStore.getStore() ?? null; +} + +/** Internal helper: estimate input tokens from messages + system. Heuristic only + * (~4 chars/token); cap math is best-effort because we pre-flight reservation + * before the SDK has counted anything. */ +function estimateChatInputTokens(opts: { system?: string; messages?: Array<{ content?: unknown }> }): number { + let chars = (opts.system ?? '').length; + for (const m of opts.messages ?? []) { + if (typeof m.content === 'string') chars += m.content.length; + else if (Array.isArray(m.content)) { + for (const block of m.content) { + const t = (block as { text?: unknown }).text; + if (typeof t === 'string') chars += t.length; + } + } + } + return Math.ceil(chars / 4); +} + // ---- Chat (commit 1) ---- /** @@ -2081,14 +2174,70 @@ function mapStopReason( * blocks via the provider-neutral schema landing in commit 2a). */ export async function chat(opts: ChatOpts): Promise<ChatResult> { + const tracker = __budgetStore.getStore() ?? null; + const modelStrEarly = opts.model ?? getChatModel(); + const estimatedInputTokens = estimateChatInputTokens(opts); + const maxOutputTokens = opts.maxTokens ?? 4096; + + // TX5: reserve BEFORE the provider call. Throws BudgetExhausted on cost, + // runtime, or no_pricing (when cap is set). Pre-resolution model id is + // fine here — resolveChatProvider would map aliases the same way for the + // cost lookup. record() below uses the real result.model. + if (tracker) { + tracker.reserve({ + modelId: modelStrEarly, + estimatedInputTokens, + maxOutputTokens, + kind: 'chat' as BudgetKind, + label: 'gateway.chat', + }); + } + // Test seam: when a test transport is installed, route through it without // touching provider resolution, AI SDK, or any network.
See // __setChatTransportForTests. Production paths see _chatTransport === null. if (_chatTransport) { - return _chatTransport(opts); + let res: ChatResult | null = null; + let threw: unknown = null; + try { + res = await _chatTransport(opts); + return res; + } catch (err) { + threw = err; + throw err; + } finally { + if (tracker) { + try { + if (res) { + tracker.record({ + modelId: res.model ?? modelStrEarly, + inputTokens: res.usage.input_tokens, + outputTokens: res.usage.output_tokens, + label: 'gateway.chat', + }); + } else { + const usage = _extractUsageFromError(threw, { + inputTokens: estimatedInputTokens, + outputTokens: maxOutputTokens, + }); + tracker.record({ + modelId: modelStrEarly, + inputTokens: usage.inputTokens, + outputTokens: usage.outputTokens, + label: 'gateway.chat', + }); + } + } catch { + // record() can throw BudgetExhausted (TX1) — suppress here so the + // original error (if any) wins; the BudgetExhausted is surfaced + // on the NEXT call via reserve(). For test transport this branch + // is rare in practice. + } + } + } } - const modelStr = opts.model ?? 
getChatModel(); + const modelStr = modelStrEarly; const { model, recipe, modelId } = await resolveChatProvider(modelStr); const supportsCache = recipe.touchpoints.chat?.supports_prompt_cache === true; @@ -2110,6 +2259,22 @@ export async function chat(opts: ChatOpts): Promise { providerOptions.anthropic = { cacheControl: { type: 'ephemeral' } }; } + let _budgetRecorded = false; + const _recordBudget = (modelLabel: string, inputTokens: number, outputTokens: number): void => { + if (!tracker || _budgetRecorded) return; + _budgetRecorded = true; + try { + tracker.record({ + modelId: modelLabel, + inputTokens, + outputTokens, + label: 'gateway.chat', + }); + } catch { + // BudgetExhausted (TX1) raised here; surface via next reserve() + } + }; + try { const result = await generateText({ model, @@ -2156,13 +2321,17 @@ export async function chat(opts: ChatOpts): Promise { const providerMetadata = (result as any).providerMetadata as Record | undefined; const anthropicCache = providerMetadata?.anthropic ?? {}; + const inTok = Number(usage.inputTokens ?? usage.promptTokens ?? 0); + const outTok = Number(usage.outputTokens ?? usage.completionTokens ?? 0); + _recordBudget(`${recipe.id}:${modelId}`, inTok, outTok); + return { text: blocks.filter(b => b.type === 'text').map(b => (b as { type: 'text'; text: string }).text).join(''), blocks, stopReason: mapStopReason((result as any).finishReason, providerMetadata), usage: { - input_tokens: Number(usage.inputTokens ?? usage.promptTokens ?? 0), - output_tokens: Number(usage.outputTokens ?? usage.completionTokens ?? 0), + input_tokens: inTok, + output_tokens: outTok, cache_read_tokens: Number(anthropicCache.cacheReadInputTokens ?? anthropicCache.cache_read_input_tokens ?? 0), cache_creation_tokens: Number(anthropicCache.cacheCreationInputTokens ?? anthropicCache.cache_creation_input_tokens ?? 
0), }, @@ -2171,6 +2340,13 @@ export async function chat(opts: ChatOpts): Promise { providerMetadata, }; } catch (err) { + // Pessimistic fallback (A3 amended): when err.usage isn't there, charge + // the worst-case ceiling — better to overcount on failure than under. + const fallback = _extractUsageFromError(err, { + inputTokens: estimatedInputTokens, + outputTokens: maxOutputTokens, + }); + _recordBudget(`${recipe.id}:${modelId}`, fallback.inputTokens, fallback.outputTokens); throw normalizeAIError(err, `chat(${recipe.id}:${modelId})`); } } @@ -2557,6 +2733,21 @@ export async function rerank(input: RerankInput): Promise { input.model ?? getRerankerModel() ?? DEFAULT_RERANKER_MODEL; + + const tracker = __budgetStore.getStore() ?? null; + if (tracker) { + // Reranker pricing isn't in the canonical pricing map today — when no + // cap is set this fires the warn-once path; when a cap IS set TX2 hard- + // fails. record() below logs the actual size after success. + const totalChars = input.query.length + input.documents.reduce((s, d) => s + d.length, 0); + tracker.reserve({ + modelId: modelStr, + estimatedInputTokens: Math.ceil(totalChars / 4), + maxOutputTokens: 0, + kind: 'rerank', + label: 'gateway.rerank', + }); + } const { parsed, recipe } = resolveRecipe(modelStr); const tp = recipe.touchpoints.reranker; if (!tp) { @@ -2620,6 +2811,23 @@ export async function rerank(input: RerankInput): Promise { else input.signal.addEventListener('abort', () => ctrl.abort(input.signal!.reason), { once: true }); } + let _rerankRecorded = false; + const _rerankRecord = (): void => { + if (!tracker || _rerankRecorded) return; + _rerankRecorded = true; + try { + const totalChars = input.query.length + input.documents.reduce((s, d) => s + d.length, 0); + tracker.record({ + modelId: modelStr, + inputTokens: Math.ceil(totalChars / 4), + outputTokens: 0, + kind: 'rerank', + label: 'gateway.rerank', + }); + } catch { + // BudgetExhausted (TX1) suppressed; surfaces on next reserve(). 
+ } + }; try { const transport: RerankTransport = _rerankTransport ?? ((u, init) => fetch(u, init)); const resp = await transport(url, { @@ -2650,11 +2858,14 @@ export async function rerank(input: RerankInput): Promise { if (!json || !Array.isArray(json.results)) { throw new RerankError('rerank: malformed response (no results array)', 'unknown'); } - return json.results.map((r: any) => ({ + const mapped = json.results.map((r: any) => ({ index: typeof r.index === 'number' ? r.index : 0, relevanceScore: typeof r.relevance_score === 'number' ? r.relevance_score : 0, })); + _rerankRecord(); + return mapped; } catch (err) { + _rerankRecord(); if (err instanceof RerankError) throw err; // AbortError on timeout — classify cleanly. if (err && typeof err === 'object' && (err as any).name === 'AbortError') { diff --git a/src/core/audit-slug-fallback.ts b/src/core/audit-slug-fallback.ts index 345f16846..11cf3ef8c 100644 --- a/src/core/audit-slug-fallback.ts +++ b/src/core/audit-slug-fallback.ts @@ -20,7 +20,7 @@ import * as fs from 'node:fs'; import * as path from 'node:path'; -import { resolveAuditDir } from './minions/handlers/shell-audit.ts'; +import { isoWeekFilename, resolveAuditDir } from './audit-week-file.ts'; export interface SlugFallbackAuditEvent { ts: string; @@ -34,18 +34,10 @@ export interface SlugFallbackAuditEvent { code: 'SLUG_FALLBACK_FRONTMATTER'; } -/** ISO-week-rotated filename: `slug-fallback-YYYY-Www.jsonl`. */ +/** ISO-week-rotated filename: `slug-fallback-YYYY-Www.jsonl`. Delegates to + * `src/core/audit-week-file.ts`. 
*/ export function computeSlugFallbackAuditFilename(now: Date = new Date()): string { - const d = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate())); - const dayNum = (d.getUTCDay() + 6) % 7; - d.setUTCDate(d.getUTCDate() - dayNum + 3); - const isoYear = d.getUTCFullYear(); - const firstThursday = new Date(Date.UTC(isoYear, 0, 4)); - const firstThursdayDayNum = (firstThursday.getUTCDay() + 6) % 7; - firstThursday.setUTCDate(firstThursday.getUTCDate() - firstThursdayDayNum + 3); - const weekNum = Math.round((d.getTime() - firstThursday.getTime()) / (7 * 86400000)) + 1; - const ww = String(weekNum).padStart(2, '0'); - return `slug-fallback-${isoYear}-W${ww}.jsonl`; + return isoWeekFilename('slug-fallback', now); } /** diff --git a/src/core/audit-week-file.ts b/src/core/audit-week-file.ts new file mode 100644 index 000000000..34dade137 --- /dev/null +++ b/src/core/audit-week-file.ts @@ -0,0 +1,59 @@ +/** + * v0.37.x — single source of truth for the ISO-week filename math used by + * every gbrain audit JSONL writer (shell-audit, phantom-audit, + * slug-fallback-audit, budget-tracker audit, dream-budget audit). + * + * Why: each of those modules grew its own copy of the same ISO-week math + * with subtle drift (some used UTC, some used local; some used Sunday-start + * weeks, some used Thursday-start ISO weeks). One shared helper keeps the + * filenames consistent so an operator can grep one filename pattern across + * audit dirs. + * + * ISO 8601 week numbering: + * - Weeks start on Monday. + * - Week 1 of any year is the week containing the year's first Thursday. + * - A date can belong to a week whose ISO year differs from the calendar + * year (Dec 31 of a Wednesday-ending year belongs to W01 of the next). + * - Year-boundary correctness is pinned by `test/core/audit-week-file.test.ts`. + */ + +import { gbrainPath } from './config.ts'; + +/** + * Compute the ISO-8601 week number (1..53) and corresponding ISO week-year + * for `d` (UTC). 
Returns `{year, week}` where `year` may differ from + * `d.getUTCFullYear()` near year boundaries. + */ +export function isoWeek(d: Date): { year: number; week: number } { + // Algorithm: shift to the Thursday of d's week (since Thursday determines + // the week's ISO year), then compute weeks since the first Thursday. + const tgt = new Date(Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate())); + const dayNum = (tgt.getUTCDay() + 6) % 7; // Monday=0, ..., Sunday=6 + tgt.setUTCDate(tgt.getUTCDate() - dayNum + 3); // Thursday of this ISO week + const isoYear = tgt.getUTCFullYear(); + const firstThursday = new Date(Date.UTC(isoYear, 0, 4)); + const firstDayNum = (firstThursday.getUTCDay() + 6) % 7; + firstThursday.setUTCDate(firstThursday.getUTCDate() - firstDayNum + 3); + const week = 1 + Math.round((tgt.getTime() - firstThursday.getTime()) / (7 * 24 * 60 * 60 * 1000)); + return { year: isoYear, week }; +} + +/** + * Build a basename like `<prefix>-YYYY-Www.jsonl` (e.g. `budget-2026-W21.jsonl`). + * Caller is responsible for joining with the audit dir. + */ +export function isoWeekFilename(prefix: string, now: Date = new Date()): string { + const { year, week } = isoWeek(now); + return `${prefix}-${year}-W${String(week).padStart(2, '0')}.jsonl`; +} + +/** + * Resolve the audit directory: honors `GBRAIN_AUDIT_DIR` env override, + * falls back to `gbrainPath('audit')`. The directory may not exist yet; + * callers `mkdirSync({recursive:true})` before writing. + */ +export function resolveAuditDir(): string { + const override = process.env.GBRAIN_AUDIT_DIR; + if (override && override.length > 0) return override; + return gbrainPath('audit'); +} diff --git a/src/core/brainstorm/checkpoint.ts b/src/core/brainstorm/checkpoint.ts new file mode 100644 index 000000000..4bedc89a8 --- /dev/null +++ b/src/core/brainstorm/checkpoint.ts @@ -0,0 +1,207 @@ +/** + * v0.37.x — brainstorm checkpoint (P7) with full idea bodies.
+ * + * Contracts (locked by /plan-eng-review): + * - TX3 (load-bearing): `completed_crosses` carries FULL idea bodies, + * not just counts. ~50KB per run, trivial. Resume merges these into + * the new run's ideas array BEFORE the judge runs so the final + * BrainstormResult is byte-identical to a clean run. + * - TX4: ONE resume flag — `--resume ` continues any cross not + * in completed_crosses. The proposed --retry-failed was dropped per + * codex review: failed AND never-attempted crosses both go through + * --resume. + * - A5 amended: run_id = sha256(question + profile_label + + * JSON.stringify(close_slugs.sort()) + JSON.stringify(far_slugs.sort())) + * .slice(0,16). NO embedding bits — stable across embedding-model + * swaps. 7-day mtime-based GC. + * + * Schema bumped to v2 (was 1 in the draft) when ideas were added. + * + * Best-effort persistence: a disk-full save logs to stderr and the run + * continues. Atomic write via .tmp + rename. + */ + +import { + mkdirSync, + readdirSync, + readFileSync, + writeFileSync, + renameSync, + unlinkSync, + existsSync, + statSync, +} from 'node:fs'; +import { join } from 'node:path'; +import { createHash } from 'node:crypto'; +import { gbrainPath } from '../config.ts'; + +export interface CheckpointIdea { + text: string; + cross_id: string; +} + +export interface CheckpointCross { + close_slug: string; + far_slug: string; + cross_id: string; + ideas: CheckpointIdea[]; +} + +export interface FailedCross { + close_slug: string; + far_slug: string; + error: string; +} + +export interface BrainstormCheckpoint { + schema_version: 2; // TX3 — bumped from 1 when ideas were added + run_id: string; + question: string; + profile_label: string; + started_at: string; + /** TX3 load-bearing — each cross's full ideas, not just counts. 
*/ + completed_crosses: CheckpointCross[]; + failed_crosses: FailedCross[]; + judge_done: boolean; +} + +const CURRENT_SCHEMA: 2 = 2; +const STALE_MS = 7 * 24 * 60 * 60 * 1000; + +function checkpointDir(): string { + return gbrainPath('brainstorm'); +} + +function pathForRunId(runId: string): string { + return join(checkpointDir(), `${runId}.json`); +} + +/** + * A5 amended identity: sha256(question + profile + sort(close) + sort(far)) + * truncated to 16 hex chars. No embedding bits — embedding-model swaps + * don't break checkpoints. + */ +export function computeRunId( + question: string, + profileLabel: string, + closeSlugs: string[], + farSlugs: string[], +): string { + const sortedClose = [...closeSlugs].sort(); + const sortedFar = [...farSlugs].sort(); + const payload = [ + question, + profileLabel, + JSON.stringify(sortedClose), + JSON.stringify(sortedFar), + ].join(''); + return createHash('sha256').update(payload).digest('hex').slice(0, 16); +} + +export function loadCheckpoint(runId: string): BrainstormCheckpoint | null { + const path = pathForRunId(runId); + if (!existsSync(path)) return null; + try { + const raw = readFileSync(path, 'utf-8'); + const parsed = JSON.parse(raw) as BrainstormCheckpoint; + if (parsed.schema_version !== CURRENT_SCHEMA) { + process.stderr.write( + `[brainstorm] checkpoint ${runId} has schema_version ${parsed.schema_version} (expected ${CURRENT_SCHEMA}); ignoring (fresh start).\n`, + ); + return null; + } + return parsed; + } catch (err) { + process.stderr.write(`[brainstorm] checkpoint read failed for ${runId}: ${String(err)}\n`); + return null; + } +} + +/** Atomic write via .tmp + rename. Best-effort — disk-full doesn't throw. 
*/ +export function saveCheckpoint(cp: BrainstormCheckpoint): void { + try { + mkdirSync(checkpointDir(), { recursive: true }); + const path = pathForRunId(cp.run_id); + const tmp = `${path}.tmp`; + writeFileSync(tmp, JSON.stringify(cp, null, 2)); + renameSync(tmp, path); + } catch (err) { + process.stderr.write(`[brainstorm] checkpoint write failed for ${cp.run_id}: ${String(err)}\n`); + } +} + +export function listRuns(): Array<{ run_id: string; question: string; mtime: number }> { + const dir = checkpointDir(); + if (!existsSync(dir)) return []; + try { + const files = readdirSync(dir).filter((f) => f.endsWith('.json')); + const out: Array<{ run_id: string; question: string; mtime: number }> = []; + for (const f of files) { + const runId = f.replace(/\.json$/, ''); + const cp = loadCheckpoint(runId); + if (!cp) continue; + try { + const mtime = statSync(join(dir, f)).mtimeMs; + out.push({ run_id: runId, question: cp.question, mtime }); + } catch { + // skip + } + } + out.sort((a, b) => b.mtime - a.mtime); + return out; + } catch { + return []; + } +} + +/** + * GC checkpoints older than `maxAgeDays` (default 7 per A5). Returns the + * count of files removed. Best-effort; errors are silent — caller (cycle + * purge phase) wraps in try/catch. + */ +export function gcStaleCheckpoints(maxAgeDays = 7): number { + const dir = checkpointDir(); + if (!existsSync(dir)) return 0; + const cutoff = Date.now() - maxAgeDays * 24 * 60 * 60 * 1000; + let removed = 0; + try { + for (const f of readdirSync(dir)) { + if (!f.endsWith('.json')) continue; + const path = join(dir, f); + try { + const m = statSync(path).mtimeMs; + if (m < cutoff) { + unlinkSync(path); + removed++; + } + } catch { + // skip individual file errors + } + } + } catch { + // dir-level error — return whatever we managed + } + return removed; +} + +/** Staleness gate: true while the checkpoint file's mtime is under 7 days old; `--force-resume` is the operator escape hatch that bypasses it.
*/ +export function isCheckpointFresh(runId: string, now: number = Date.now()): boolean { + const path = pathForRunId(runId); + if (!existsSync(path)) return false; + try { + return now - statSync(path).mtimeMs < STALE_MS; + } catch { + return false; + } +} + +/** Erase a checkpoint after the run completes cleanly. Idempotent. */ +export function clearCheckpoint(runId: string): void { + const path = pathForRunId(runId); + if (!existsSync(path)) return; + try { + unlinkSync(path); + } catch { + // best-effort + } +} diff --git a/src/core/brainstorm/domain-bank.ts b/src/core/brainstorm/domain-bank.ts index 28bbd0129..3579038b5 100644 --- a/src/core/brainstorm/domain-bank.ts +++ b/src/core/brainstorm/domain-bank.ts @@ -78,6 +78,20 @@ export interface FetchFarOpts { prefixListOverride?: string[]; /** Default embedding column for distance calc + getEmbeddingsByChunkIds lookup. */ embeddingColumn?: string; + /** + * Hard cap on the number of distinct prefixes we ask the DB to materialize + * one-page-per. Defaults to `max(m * 4, 50)`. Without this cap, brains with + * thousands of distinct top-level prefixes (e.g. a 13K-page brain with + * ~2K prefixes) caused `listPrefixSampledPages` to return ~2K rows instead + * of `m`, exploding LLM token spend by 50-100x. See fix/brainstorm-cost-guardrails. + */ + maxFarSet?: number; + /** + * Optional RNG override for the prefix shuffle (tests only). Defaults to + * `Math.random`. The shuffle keeps the prefix-stratified sampling diverse + * even when we cap to a small fraction of all available prefixes. + */ + random?: () => number; } /** One far-page result enriched with distance + provenance. 
*/ @@ -348,10 +362,31 @@ export async function fetchFar( for (const c of opts.closeSet) { if (c.prefix) closePrefixSet.add(c.prefix); } - const candidatePrefixes = allPrefixes.filter((p) => !closePrefixSet.has(p)); - const availablePrefixes = candidatePrefixes.length; + const allCandidatePrefixes = allPrefixes.filter((p) => !closePrefixSet.has(p)); + const availablePrefixes = allCandidatePrefixes.length; const closeSlugs = opts.closeSet.map((c) => c.slug); + // ---- Step 2.5: cap the prefix list to `maxFarSet` (cost guardrail) ---- + // + // `listPrefixSampledPages` returns one row per distinct prefix passed in. + // On large brains (1000+ prefixes) we were materializing ~1 row per prefix + // and then crossing each with the close-set, producing massive token bills. + // Cap defaults to max(m * 4, 50): enough headroom for downstream distance + // ranking to still pick a diverse `m` final far pages, but bounded. + const maxFarSet = opts.maxFarSet ?? Math.max(m * 4, 50); + let candidatePrefixes = allCandidatePrefixes; + if (candidatePrefixes.length > maxFarSet) { + // Shuffle for diversity, then take the first `maxFarSet`. Without the + // shuffle a 2K-prefix brain would always pick the same alphabetical head. + const rng = opts.random ?? Math.random; + const arr = candidatePrefixes.slice(); + for (let i = arr.length - 1; i > 0; i--) { + const j = Math.floor(rng() * (i + 1)); + [arr[i], arr[j]] = [arr[j], arr[i]]; + } + candidatePrefixes = arr.slice(0, maxFarSet); + } + // ---- Step 3: primary path — listPrefixSampledPages ---- let primaryRows: DomainBankRow[] = []; if (candidatePrefixes.length > 0) { @@ -408,7 +443,7 @@ export async function fetchFar( .filter((e): e is Float32Array => e !== undefined); // ---- Step 6: build FarPage results with normalized distance ---- - const pages: FarPage[] = allRows.map(({ row, src }) => { + const allPages: FarPage[] = allRows.map(({ row, src }) => { const farEmbed = row.representative_chunk_id != null ? 
embeddings.get(row.representative_chunk_id) ?? null : null; @@ -427,11 +462,26 @@ export async function fetchFar( }; }); + // ---- Step 6.5: final trim to `m` ---- + // + // Even after capping prefixes to `maxFarSet`, `listPrefixSampledPages` plus + // the fallback can return up to `maxFarSet + need` rows. The orchestrator + // crosses every (close × far) so we MUST trim to `m` here or the LLM bill + // scales with the cap, not with `m`. Sort by distance_score DESC so we keep + // the farthest (most novel) pages first. + const pages = allPages + .slice() + .sort((a, b) => b.distance_score - a.distance_score) + .slice(0, m); + return { pages, available_prefixes: availablePrefixes, total_prefixes: totalPrefixes, used_fallback: usedFallback, - short_of_target: pages.length < m, + // short_of_target reflects whether the *pre-trim* candidate pool fell short + // of `m`. After the explicit trim to `m` above, `pages.length` would always + // equal `min(m, allPages.length)`, masking the sparse-brain signal. + short_of_target: allPages.length < m, }; } diff --git a/src/core/brainstorm/judges.ts b/src/core/brainstorm/judges.ts index b65e6dbcc..ca7ef20b5 100644 --- a/src/core/brainstorm/judges.ts +++ b/src/core/brainstorm/judges.ts @@ -347,12 +347,28 @@ export interface RunJudgeOptions { activeBiasTags?: string[]; /** AbortSignal for Ctrl-C / shutdown propagation. */ abortSignal?: AbortSignal; + /** + * Maximum ideas to send in a single judge LLM call. Defaults to 100. + * Large idea sets (e.g. 15K ideas from a 13K-page brain) blow past the + * model's context window when sent as one batch. We chunk into batches + * of `maxIdeasPerCall` and concatenate the results. + */ + maxIdeasPerCall?: number; + /** Stderr sink for chunk-progress reporting. Defaults to process.stderr.write. */ + stderrWrite?: (s: string) => void; } +/** Default judge chunk size. ~350 tokens/idea × 100 ideas ≈ 35K input tokens, safely under any model context. 
*/ +const DEFAULT_JUDGE_CHUNK_SIZE = 100; + /** - * Single batch — caller chunks large idea sets to keep prompt size bounded. - * Throws on parse failure (caller maps to judge_failed:true + saves unscored, - * per D12). + * Judge a batch of ideas. Automatically chunks large idea sets into + * `maxIdeasPerCall`-sized sub-batches (default 100) to avoid blowing past + * the model's context window. Each chunk is a separate LLM call; results + * are concatenated. Throws on parse failure of *any* chunk (caller maps to + * judge_failed:true + saves unscored, per D12). A partial failure (some + * chunks succeed, one fails) also throws and discards the results of the + * chunks that succeeded; callers who want partial-result resilience + * should call `runJudge` per-chunk themselves. */ export async function runJudge( config: JudgeConfig, @@ -364,6 +380,56 @@ export async function runJudge( // returning a well-formed empty result is more ergonomic. return { ideas: [], pass_count: 0, model: 'noop', usage: { input_tokens: 0, output_tokens: 0, cache_read_tokens: 0, cache_creation_tokens: 0 } }; } + const chunkSize = Math.max(1, options.maxIdeasPerCall ?? DEFAULT_JUDGE_CHUNK_SIZE); + const stderr = options.stderrWrite ?? ((s: string) => { process.stderr.write(s); }); + + // Split ideas into chunks. For small idea sets (<= chunkSize) this is a + // single chunk and behaves identically to the pre-fix single-call path.
+ const chunks: JudgeIdea[][] = []; + for (let i = 0; i < ideas.length; i += chunkSize) { + chunks.push(ideas.slice(i, i + chunkSize)); + } + if (chunks.length > 1) { + stderr(`[${config.label}-judge] chunking ${ideas.length} ideas into ${chunks.length} batches of ≤${chunkSize}\n`); + } + + const allIdeaResults: JudgeIdeaResult[] = []; + let lastModel = 'noop'; + const totalUsage: ChatResult['usage'] = { + input_tokens: 0, + output_tokens: 0, + cache_read_tokens: 0, + cache_creation_tokens: 0, + }; + for (let ci = 0; ci < chunks.length; ci++) { + const chunk = chunks[ci]; + const chunkResult = await runJudgeChunk(config, chunk, options); + allIdeaResults.push(...chunkResult.ideas); + lastModel = chunkResult.model; + totalUsage.input_tokens += chunkResult.usage.input_tokens; + totalUsage.output_tokens += chunkResult.usage.output_tokens; + if (typeof chunkResult.usage.cache_read_tokens === 'number') { + totalUsage.cache_read_tokens = (totalUsage.cache_read_tokens ?? 0) + chunkResult.usage.cache_read_tokens; + } + if (typeof chunkResult.usage.cache_creation_tokens === 'number') { + totalUsage.cache_creation_tokens = (totalUsage.cache_creation_tokens ?? 0) + chunkResult.usage.cache_creation_tokens; + } + } + + return { + ideas: allIdeaResults, + pass_count: allIdeaResults.filter((i) => i.passes).length, + model: lastModel, + usage: totalUsage, + }; +} + +/** Single-chunk inner loop. Extracted so `runJudge` can chunk + concatenate. */ +async function runJudgeChunk( + config: JudgeConfig, + ideas: JudgeIdea[], + options: RunJudgeOptions +): Promise { const chat = options.chatFn ?? 
defaultChat; const prompt = buildJudgePrompt(config, ideas); @@ -401,15 +467,15 @@ export async function runJudge( continue; } const weighted_score = weightedScore(validated.scores, config.weights); - const result: JudgeIdeaResult = { + const ir: JudgeIdeaResult = { id: validated.id, scores: validated.scores, weighted_score, passes: false, // filled below note: validated.note, }; - result.passes = ideaPasses(result, config); - ideaResults.push(result); + ir.passes = ideaPasses(ir, config); + ideaResults.push(ir); } return { diff --git a/src/core/brainstorm/orchestrator.ts b/src/core/brainstorm/orchestrator.ts index 89933def7..665cf76fb 100644 --- a/src/core/brainstorm/orchestrator.ts +++ b/src/core/brainstorm/orchestrator.ts @@ -36,6 +36,28 @@ import { } from './judges.ts'; import { ANTHROPIC_PRICING } from '../anthropic-pricing.ts'; +// --------------------------------------------------------------------------- +// BudgetExhausted is the canonical typed error (Q2) used by every cost +// guardrail in the orchestrator. The class lives in +// `src/core/budget/budget-tracker.ts` (Phase 2 of the budget cathedral); we +// re-export here for back-compat with any caller that imports it from this +// module (the only known caller is the test suite). 
+// --------------------------------------------------------------------------- + +import { BudgetExhausted, BudgetTracker } from '../budget/budget-tracker.ts'; +import { withBudgetTracker } from '../ai/gateway.ts'; +import { + computeRunId, + loadCheckpoint, + saveCheckpoint, + isCheckpointFresh, + clearCheckpoint, + type BrainstormCheckpoint, + type CheckpointCross, +} from './checkpoint.ts'; + +export { BudgetExhausted }; + // --------------------------------------------------------------------------- // Profile (BrainstormProfile is the brainstorm vs LSD config object) // --------------------------------------------------------------------------- @@ -117,6 +139,52 @@ export interface BrainstormOptions { embedQueryFn?: (text: string) => Promise; /** Stderr sink — defaults to process.stderr.write. Tests pipe into a buffer. */ stderrWrite?: (s: string) => void; + /** + * Maximum projected cost in USD before the run aborts. Default $5. + * The pre-run estimate is compared against this ceiling; if higher, we + * abort with a paste-ready error. The ceiling is a hard limit: it still + * applies when `skipCostPreview` is set and the caller is + * non-interactive. The same value also caps running spend mid-run. + */ + maxCostUsd?: number; + /** + * Hard cap on the domain-bank far set. When unset, `fetchFar` defaults + * to `max(m * 4, 50)`. Threaded into `fetchFar` to prevent the + * "2K prefix" explosion on large brains. + */ + maxFarSet?: number; + /** + * When true, abort mid-run if running cost exceeds 5× the original + * estimate. Default false (warn-only). Pair with `maxCostUsd` for a hard + * ceiling. + */ + strictBudget?: boolean; + /** + * Override the model used for the judge phase. Larger-context models + * (e.g. Gemini 2M / Claude 200K) help when judging large idea sets. + * Falls back to `modelOverride` then the gateway default. + */ + judgeModel?: string; + /** + * Max ideas per judge LLM call. Default 100. Larger batches save calls + * but risk context overflow; smaller batches are slower but safer.
+ */ + maxIdeasPerJudgeCall?: number; + /** + * TX4: resume from a previously-persisted checkpoint at + * `~/.gbrain/brainstorm/.json`. Set by `--resume `. + * When the checkpoint's identity (run_id) doesn't match the active + * inputs, the orchestrator refuses with a paste-ready hint rather + * than silently starting fresh. + * + * If undefined and a fresh checkpoint exists for the auto-derived + * run_id, the orchestrator does NOT auto-resume — caller must opt in + * via the explicit flag. + */ + resumeRunId?: string; + /** + * A5: bypass the 7-day staleness gate when --resume is set. + */ + forceResume?: boolean; } /** One idea emitted to the user, with citation transparency (D6). */ @@ -279,6 +347,21 @@ export async function loadCalibrationContext( // Idea generation prompts + response parsing // --------------------------------------------------------------------------- +/** + * Strip lone/orphaned UTF-16 surrogates that would crash JSON encoding + * downstream. The Anthropic SDK and some gateway transports refuse strings + * containing unpaired surrogates (U+D800–U+DFFF). Page content that came + * in via OCR or older imports occasionally has them. + */ +function sanitizeUnicode(s: string): string { + if (!s) return s; + // Replace lone high surrogates (D800-DBFF) not followed by a low surrogate. + // Replace lone low surrogates (DC00-DFFF) not preceded by a high surrogate. + return s + .replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])/g, '�') + .replace(/(^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]/g, '$1�'); +} + /** Build a single (close × far) cross-generation prompt. */ function buildCrossPrompt(opts: { profile: BrainstormProfile; @@ -296,16 +379,25 @@ Style rules: - Cite BOTH the close and far slug verbatim — these are the user's own notes. - Never fabricate facts, figures, or quotes. Stay grounded in the cited pages.${opts.profile.generator_constraint ? 
`\n- ${opts.profile.generator_constraint}` : ''}`; + // Sanitize: unicode surrogates in page content (from OCR or older imports) + // can crash JSON encoding in the chat transport, which would void the + // entire cross. Cheap to fix here. + const closeContent = sanitizeUnicode(opts.close.content); + const farContent = sanitizeUnicode(opts.far.content); + const closeTitle = sanitizeUnicode(opts.close.title ?? '(untitled)'); + const farTitle = sanitizeUnicode(opts.far.title ?? '(untitled)'); + const question = sanitizeUnicode(opts.question); + const user = `QUESTION: -${opts.question} +${question} CLOSE PAGE (related to the question — context anchor): -[${opts.close.slug}] ${opts.close.title ?? '(untitled)'} -${opts.close.content.slice(0, 1500)} +[${opts.close.slug}] ${closeTitle} +${closeContent.slice(0, 1500)} FAR PAGE (from a distant region of the user's brain — the collision partner): -[${opts.far.slug}] ${opts.far.title ?? '(untitled)'} -${opts.far.content} +[${opts.far.slug}] ${farTitle} +${farContent} Generate exactly ${opts.profile.ideas_per_cross} ideas from cross-pollinating these pages. @@ -381,6 +473,24 @@ export async function runBrainstorm( engine: BrainEngine, config: { embedding_model?: string; emotional_weight?: { user_holder?: string } }, opts: BrainstormOptions +): Promise { + // T10: install a gateway-layer BudgetTracker scope around the whole run + // so every gateway.chat / embed call (the cross generations + judge + + // question embed) auto-records cost via the AsyncLocalStorage from T3. + // The cap mirrors the orchestrator's maxCostUsd so the gateway can + // hard-fail via BudgetExhausted(reason:'cost') if a single under- + // estimated call leaks past the ceiling (TX1). + const _runTracker = new BudgetTracker({ + label: `brainstorm.${opts.profile?.label ?? 'brainstorm'}`, + maxCostUsd: opts.maxCostUsd ?? 
5, + }); + return withBudgetTracker(_runTracker, () => _runBrainstormInner(engine, config, opts)); +} + +async function _runBrainstormInner( + engine: BrainEngine, + config: { embedding_model?: string; emotional_weight?: { user_holder?: string } }, + opts: BrainstormOptions, ): Promise { const profile = opts.profile ?? BRAINSTORM_PROFILE; const stderr = opts.stderrWrite ?? ((s: string) => { process.stderr.write(s); }); @@ -399,6 +509,22 @@ export async function runBrainstorm( throw new Error('brainstorm: aborted before run (Ctrl-C during cost preview window)'); } + // ---- Phase 0.5: hard cost ceiling (circuit breaker) ---- + // + // The TTY grace window is a soft check. This is the hard one. On large + // brains the pre-run estimate is itself an under-estimate (53× over in + // the wild on a 13K-page brain) because `m_far` got blown out by + // un-capped prefix sampling. We refuse to start if the *estimate alone* + // already exceeds the user's ceiling. + const maxCostUsd = opts.maxCostUsd ?? 5; + if (estimate > maxCostUsd) { + throw new BudgetExhausted( + `${profile.label}: estimated cost ${fmtUsd(estimate)} exceeds --max-cost ${fmtUsd(maxCostUsd)}. ` + + `Lower --limit, raise --max-cost, or pass --max-far-set to cap the domain bank.`, + { reason: 'cost', spent: estimate, cap: maxCostUsd }, + ); + } + // ---- Phase 1: question embedding + close-set retrieval ---- let questionEmbedding: Float32Array | null = null; try { @@ -440,6 +566,9 @@ export async function runBrainstorm( staleBias: profile.stale_bias, sourceId: opts.sourceId, sourceIds: opts.sourceIds, + // Cap the prefix-stratified far set. Defaults to max(m * 4, 50) inside + // fetchFar; we forward the CLI flag when set. + maxFarSet: opts.maxFarSet, }); if (farResult.short_of_target) { // D11 data-driven warning text. 
@@ -495,11 +624,81 @@ export async function runBrainstorm( } } + // ---- TX3/TX4/A5: checkpoint + --resume wiring ---- + // + // run_id is derived from the inputs (question + profile + sorted slug arrays + // — A5 amended, no embedding bits). When opts.resumeRunId is set we load + // the matching checkpoint and skip already-completed crosses; when it's + // unset we still WRITE a checkpoint every N successful crosses so the + // user has a recovery path on a future crash. + const closeSlugsAll = closesForCross.map((c) => c.slug); + const farSlugsAll = farResult.pages.map((p) => p.slug); + const runId = computeRunId(opts.question, profile.label, closeSlugsAll, farSlugsAll); + const crossKey = (cross: Cross): string => `${cross.close.slug}__${cross.far.slug}`; + const completedFromDisk = new Map(); // crossKey → ideas-from-disk + + let prevCheckpoint: BrainstormCheckpoint | null = null; + if (opts.resumeRunId) { + if (opts.resumeRunId !== runId) { + throw new Error( + `${profile.label}: --resume run_id=${opts.resumeRunId} does not match inputs (active run_id=${runId}). ` + + `Inputs (question, close set, far set) changed since the checkpoint. Run without --resume to start fresh.`, + ); + } + if (!opts.forceResume && !isCheckpointFresh(opts.resumeRunId)) { + throw new Error( + `${profile.label}: checkpoint ${opts.resumeRunId} is older than 7 days. ` + + `Pass --force-resume to override, or run without --resume to start fresh.`, + ); + } + prevCheckpoint = loadCheckpoint(opts.resumeRunId); + if (!prevCheckpoint) { + throw new Error( + `${profile.label}: --resume ${opts.resumeRunId}: no checkpoint found or schema mismatch. 
` + + `Run without --resume to start fresh.`, + ); + } + for (const cc of prevCheckpoint.completed_crosses) { + completedFromDisk.set(`${cc.close_slug}__${cc.far_slug}`, cc); + } + stderr(`[${profile.label}] resuming run ${runId}: ${completedFromDisk.size}/${crosses.length} crosses already done\n`); + } + + // Live checkpoint state — appended to as crosses succeed/fail; flushed + // every 5 crosses. + const liveCheckpoint: BrainstormCheckpoint = { + schema_version: 2, + run_id: runId, + question: opts.question, + profile_label: profile.label, + started_at: prevCheckpoint?.started_at ?? new Date().toISOString(), + completed_crosses: prevCheckpoint?.completed_crosses.slice() ?? [], + failed_crosses: prevCheckpoint?.failed_crosses.slice() ?? [], + judge_done: false, + }; + let crossesSinceFlush = 0; + const flush = (): void => { + saveCheckpoint(liveCheckpoint); + crossesSinceFlush = 0; + }; + let totalUsage = { input_tokens: 0, output_tokens: 0 }; let crossModel = modelStr; // Parallelize chat calls bounded at DEFAULT_PARALLELISM. const rawIdeasByCross = await mapWithConcurrency(crosses, DEFAULT_PARALLELISM, async (cross) => { + // Skip crosses already completed in a prior run (TX4 single-rule). + const key = crossKey(cross); + if (completedFromDisk.has(key)) { + const fromDisk = completedFromDisk.get(key)!; + return fromDisk.ideas.map((idea) => ({ + text: idea.text, + close_slug: cross.close.slug, + far_slug: cross.far.slug, + distance_score: cross.far.distance_score, + })); + } + const { system, user } = buildCrossPrompt({ profile, question: opts.question, @@ -518,19 +717,68 @@ export async function runBrainstorm( totalUsage.input_tokens += result.usage.input_tokens; totalUsage.output_tokens += result.usage.output_tokens; crossModel = result.model; + // Mid-run cost guard: if running spend already exceeds the projected + // ceiling or the strict-budget multiplier, abort the remaining crosses. + const runningPricing = ANTHROPIC_PRICING[result.model] ?? 
{ input: 3, output: 15 }; + const runningUsd = + (totalUsage.input_tokens / 1_000_000) * runningPricing.input + + (totalUsage.output_tokens / 1_000_000) * runningPricing.output; + if (runningUsd > maxCostUsd) { + throw new BudgetExhausted( + `${profile.label}: running cost ${fmtUsd(runningUsd)} exceeded --max-cost ${fmtUsd(maxCostUsd)} mid-run; aborting remaining crosses`, + { reason: 'cost', spent: runningUsd, cap: maxCostUsd }, + ); + } + if (opts.strictBudget === true && runningUsd > estimate * 5) { + throw new BudgetExhausted( + `${profile.label}: running cost ${fmtUsd(runningUsd)} exceeded 5× estimate (${fmtUsd(estimate)}) under --strict-budget`, + { reason: 'cost', spent: runningUsd, cap: estimate * 5 }, + ); + } const parsed = parseIdeaResponse(result.text); - return parsed.slice(0, profile.ideas_per_cross).map((text) => ({ + const sliced = parsed.slice(0, profile.ideas_per_cross); + // TX3: persist FULL idea bodies, not just counts. Resume reconstructs + // the BrainstormResult by reading these back from disk. + const crossId = `${cross.close.slug}__${cross.far.slug}`; + liveCheckpoint.completed_crosses.push({ + close_slug: cross.close.slug, + far_slug: cross.far.slug, + cross_id: crossId, + ideas: sliced.map((text) => ({ text, cross_id: crossId })), + }); + crossesSinceFlush++; + if (crossesSinceFlush >= 5) flush(); + return sliced.map((text) => ({ text, close_slug: cross.close.slug, far_slug: cross.far.slug, distance_score: cross.far.distance_score, })); } catch (err) { + // Q2: typed-error check, replaces PR #1234's brittle string-match + // (`msg.includes('--max-cost')`). Cost-cap errors propagate; other + // per-cross errors are warned + swallowed so one bad cross doesn't + // void the rest of the run. + if (err instanceof BudgetExhausted) { + // Flush checkpoint before propagating so any completed crosses + // are persisted for --resume. + flush(); + throw err; + } const msg = err instanceof Error ? 
err.message : String(err); stderr(`[${profile.label}] WARN: cross [${cross.close.slug}] × [${cross.far.slug}] failed: ${msg}\n`); + liveCheckpoint.failed_crosses.push({ + close_slug: cross.close.slug, + far_slug: cross.far.slug, + error: msg, + }); + crossesSinceFlush++; + if (crossesSinceFlush >= 5) flush(); return []; } }); + // Final flush so the on-disk file reflects the post-loop state. + flush(); // Flatten + assign stable ids. const allRawIdeas: Array<{ id: string; text: string; close_slug: string; far_slug: string; distance_score: number }> = []; @@ -559,10 +807,12 @@ export async function runBrainstorm( far_slug: i.far_slug, })); const judgeResult = await runJudge(profile.judge_config, judgeInput, { - modelOverride: opts.modelOverride, + modelOverride: opts.judgeModel ?? opts.modelOverride, chatFn: opts.chatFn, activeBiasTags: activeBiasTags ?? undefined, abortSignal: opts.abortSignal, + maxIdeasPerCall: opts.maxIdeasPerJudgeCall, + stderrWrite: stderr, }); for (const idea of judgeResult.ideas) { judgedById.set(idea.id, idea); @@ -599,6 +849,21 @@ export async function runBrainstorm( const actual = (totalIn / 1_000_000) * pricing.input + (totalOut / 1_000_000) * pricing.output; stderr(`[${profile.label}] actual cost: ${fmtUsd(actual)} (estimated ${fmtUsd(estimate)}) — in=${totalIn} out=${totalOut} tokens\n`); + // TX4: surface --resume hint when any cross failed during this run. + // The user can re-run with `--resume ` and we'll retry only + // the missing crosses (failed_crosses + never-attempted). + if (liveCheckpoint.failed_crosses.length > 0) { + stderr( + `[${profile.label}] ${liveCheckpoint.failed_crosses.length} cross(es) failed. Resume with: gbrain ${profile.label} --resume ${runId}\n`, + ); + } else { + // Clean completion — every cross succeeded. Clear the checkpoint so we + // don't accumulate noise + so a stale run_id doesn't auto-resume. 
+ liveCheckpoint.judge_done = true; + saveCheckpoint(liveCheckpoint); + clearCheckpoint(runId); + } + return { profile_label: profile.label, question: opts.question, diff --git a/src/core/budget/budget-tracker.ts b/src/core/budget/budget-tracker.ts new file mode 100644 index 000000000..929351f6a --- /dev/null +++ b/src/core/budget/budget-tracker.ts @@ -0,0 +1,431 @@ +/** + * v0.37.x — unified BudgetTracker for every gateway-routed LLM call. + * + * Replaces the per-command budget code (brainstorm orchestrator inline + * BudgetExhausted, cycle/budget-meter, eval-contradictions cost-prompt + + * cost-tracker). One class, one error type, one audit JSONL schema. + * + * Compose via `withBudgetTracker(tracker, fn)` from `src/core/ai/gateway.ts` + * (Phase 2 / TX5). Once inside the scope, every `gateway.chat / embed / + * rerank` call auto-records cost via AsyncLocalStorage — no per-call + * injection seam needed. + * + * Contracts (locked by /plan-eng-review): + * - TX1: `record()` THROWS BudgetExhausted(reason:'cost') when cumulative + * spend > maxCostUsd. The cap is a real ceiling, not a suggestion. + * - TX2: When `maxCostUsd` is set AND the model is not in the pricing + * maps, `reserve()` HARD-FAILS with BudgetExhausted(reason:'no_pricing'). + * When `maxCostUsd` is unset, legacy warn-once behavior is preserved. + * - A3 amended: `record()` is best called from try/finally on every + * gateway site. When the call threw without usage, callers feed + * `extractUsageFromError(err, fallback)` — fallback is the pessimistic + * ceiling (`maxOutputTokens` worth of output), not the optimistic + * pre-call estimate. Better to overcount on failure than undercount. + * + * Audit JSONL lives at `~/.gbrain/audit/budget-YYYY-Www.jsonl` (ISO-week + * rotation, same shape as shell-audit / phantom-audit). Every line carries + * `schema_version: 1` so consumers can detect future renames. Writes are + * best-effort: a disk-full audit never gates the run. 
+ */ + +import { mkdirSync, appendFileSync } from 'node:fs'; +import { dirname } from 'node:path'; +import { gbrainPath } from '../config.ts'; +import { ANTHROPIC_PRICING, type ModelPricing } from '../anthropic-pricing.ts'; +import { EMBEDDING_PRICING, lookupEmbeddingPrice } from '../embedding-pricing.ts'; +import { isoWeekFilename, resolveAuditDir } from '../audit-week-file.ts'; + +export type BudgetKind = 'chat' | 'embed' | 'rerank'; + +export type BudgetReason = 'cost' | 'runtime' | 'no_pricing'; + +export interface BudgetEstimate { + modelId: string; + estimatedInputTokens: number; + maxOutputTokens: number; + kind: BudgetKind; + /** Optional label for telemetry (e.g. 'brainstorm.cross', 'dream.synthesize'). */ + label?: string; +} + +export interface BudgetActualUsage { + modelId: string; + inputTokens: number; + outputTokens?: number; + /** For embeddings: dimension count, surfaces in audit only. */ + embeddingDims?: number; + /** Optional label echo for the audit row. */ + label?: string; +} + +export interface BudgetSnapshot { + cumulativeCostUsd: number; + startedAt: number; + elapsedMs: number; + maxCostUsd?: number; + maxRuntimeMs?: number; + callsRecorded: number; +} + +export interface BudgetTrackerOpts { + /** USD cap. When undefined, cost gate disabled; pricing misses warn-once. */ + maxCostUsd?: number; + /** Wall-clock cap in milliseconds. When undefined, runtime gate disabled. */ + maxRuntimeMs?: number; + /** Phase/command label used in audit rows. */ + label: string; + /** Override the audit file path (tests + custom installers). 
*/ + auditPath?: string; +} + +export class BudgetExhausted extends Error { + readonly tag = 'BUDGET_EXHAUSTED' as const; + reason: BudgetReason; + spent: number; + cap: number; + modelId?: string; + constructor( + message: string, + opts: { reason: BudgetReason; spent: number; cap: number; modelId?: string }, + ) { + super(message); + this.name = 'BudgetExhausted'; + this.reason = opts.reason; + this.spent = opts.spent; + this.cap = opts.cap; + this.modelId = opts.modelId; + } +} + +/** One-process memo: warn-once on missing pricing per (modelId, kind). */ +const _unpricedWarnings = new Set(); + +/** Test seam: reset warn-once memo so unit tests can re-trigger the path. */ +export function _resetBudgetTrackerWarningsForTest(): void { + _unpricedWarnings.clear(); +} + +/** + * Best-effort JSONL audit append. Failure never gates the run; matches the + * shell-audit / phantom-audit posture. + */ +function appendAuditLine(path: string, entry: object): void { + try { + mkdirSync(dirname(path), { recursive: true }); + appendFileSync(path, JSON.stringify(entry) + '\n'); + } catch { + // swallow — audit failures must not block the LLM call + } +} + +function defaultAuditPath(): string { + const dir = resolveAuditDir(); + return `${dir}/${isoWeekFilename('budget')}`; +} + +/** + * Look up `modelId` in the chat or embedding pricing maps. Returns a + * per-1M-token price tuple, or null when unknown. + * + * Strategy: + * - Chat: look up the id as given in ANTHROPIC_PRICING first (legacy keys + * are bare claude-* ids). For a `provider:model` id, fall back to the + * part after the colon. + * - Embed: lookupEmbeddingPrice already handles the provider:model form, + * defaulting to openai when the colon is missing. + * - Rerank: not priced separately today; it uses the chat lookup, with no + * output cost because rerank calls record no output tokens. Unknown + * when the id is not in ANTHROPIC_PRICING.
+ */ +function lookupPricing(modelId: string, kind: BudgetKind): ModelPricing | null { + if (kind === 'embed') { + const hit = lookupEmbeddingPrice(modelId); + if (hit.kind === 'known') { + return { input: hit.pricePerMTok, output: 0 }; + } + return null; + } + // chat or rerank: try bare key first, then provider:model + const bare = ANTHROPIC_PRICING[modelId]; + if (bare) return bare; + const [, modelTail] = modelId.includes(':') ? modelId.split(':', 2) : [null, modelId]; + if (modelTail) { + const tailHit = ANTHROPIC_PRICING[modelTail]; + if (tailHit) return tailHit; + } + return null; +} + +function costForUsage(modelId: string, inputTokens: number, outputTokens: number, kind: BudgetKind): number | null { + const p = lookupPricing(modelId, kind); + if (!p) return null; + return (inputTokens / 1_000_000) * p.input + (outputTokens / 1_000_000) * p.output; +} + +export class BudgetTracker { + private cumulativeUsd = 0; + private callsRecorded = 0; + private readonly startedAt: number; + private readonly auditPath: string; + private readonly onExhaustedCbs: Array<() => void> = []; + private exhaustedFired = false; + + constructor(private readonly opts: BudgetTrackerOpts) { + this.startedAt = Date.now(); + this.auditPath = opts.auditPath ?? defaultAuditPath(); + } + + /** Public read access. */ + get totalSpent(): number { + return this.cumulativeUsd; + } + + /** + * Register a synchronous callback to fire the first time the tracker + * throws BudgetExhausted (from reserve OR record). Fires once. Useful for + * persisting checkpoint state before the throw propagates. The callback + * MUST be synchronous; async work (fs writes are fine via writeFileSync) + * goes inside the callback body. + */ + onExhausted(cb: () => void): void { + this.onExhaustedCbs.push(cb); + } + + /** + * Project a planned LLM call against the cap. 
Throws BudgetExhausted + * BEFORE any provider call when: + * - cumulative + projected > maxCostUsd (reason: 'cost') + * - wall-clock > maxRuntimeMs (reason: 'runtime') + * - maxCostUsd set AND pricing missing (reason: 'no_pricing') -- TX2 + * + * When maxCostUsd is unset, missing pricing warns-once but does not throw + * (legacy behavior preserved for non-priced providers). + */ + reserve(estimate: BudgetEstimate): void { + this.assertRuntime(estimate.modelId); + + const projected = costForUsage( + estimate.modelId, + estimate.estimatedInputTokens, + estimate.maxOutputTokens, + estimate.kind, + ); + + if (projected === null) { + if (this.opts.maxCostUsd !== undefined) { + // TX2: hard-fail when a cap is set but pricing is missing — without + // pricing we can't enforce the cap, and silently ignoring it would + // void the contract. + const msg = `${this.opts.label}: no pricing entry for model "${estimate.modelId}" (kind=${estimate.kind}). ` + + `Add it to src/core/${estimate.kind === 'embed' ? 'embedding-pricing.ts' : 'anthropic-pricing.ts'} or drop --max-cost.`; + this.fireExhausted(); + throw new BudgetExhausted(msg, { + reason: 'no_pricing', + spent: this.cumulativeUsd, + cap: this.opts.maxCostUsd, + modelId: estimate.modelId, + }); + } + // Legacy warn-once path — cap unset. + const memoKey = `${estimate.modelId}:${estimate.kind}`; + if (!_unpricedWarnings.has(memoKey)) { + _unpricedWarnings.add(memoKey); + process.stderr.write( + `[budget] BUDGET_TRACKER_NO_PRICING: model "${estimate.modelId}" (kind=${estimate.kind}) not in pricing maps. 
` + + `Cost gate disabled for this call.\n`, + ); + } + appendAuditLine(this.auditPath, { + schema_version: 1, + ts: new Date().toISOString(), + event: 'reserve_unpriced', + label: this.opts.label, + kind: estimate.kind, + model: estimate.modelId, + sub_label: estimate.label, + estimated_input_tokens: estimate.estimatedInputTokens, + max_output_tokens: estimate.maxOutputTokens, + }); + return; + } + + if (this.opts.maxCostUsd !== undefined) { + const after = this.cumulativeUsd + projected; + if (after > this.opts.maxCostUsd) { + appendAuditLine(this.auditPath, { + schema_version: 1, + ts: new Date().toISOString(), + event: 'reserve_denied', + label: this.opts.label, + kind: estimate.kind, + model: estimate.modelId, + sub_label: estimate.label, + projected_cost_usd: projected, + cumulative_cost_usd: this.cumulativeUsd, + max_cost_usd: this.opts.maxCostUsd, + }); + this.fireExhausted(); + throw new BudgetExhausted( + `${this.opts.label}: projected cost $${after.toFixed(4)} exceeds --max-cost $${this.opts.maxCostUsd.toFixed(2)} ` + + `(cumulative $${this.cumulativeUsd.toFixed(4)} + this call $${projected.toFixed(4)})`, + { reason: 'cost', spent: this.cumulativeUsd, cap: this.opts.maxCostUsd, modelId: estimate.modelId }, + ); + } + } + + appendAuditLine(this.auditPath, { + schema_version: 1, + ts: new Date().toISOString(), + event: 'reserve', + label: this.opts.label, + kind: estimate.kind, + model: estimate.modelId, + sub_label: estimate.label, + projected_cost_usd: projected, + cumulative_cost_usd: this.cumulativeUsd, + max_cost_usd: this.opts.maxCostUsd ?? null, + }); + } + + /** + * Record the actual usage after the provider returned (or threw). Updates + * cumulative spend. Throws BudgetExhausted(reason:'cost') AFTER the update + * when cumulative > maxCostUsd (TX1): a single underestimated call can + * blow past the cap and the cap must remain a real ceiling. + * + * `outputTokens` defaults to 0 (embed/rerank). `embeddingDims` is audit- + * only metadata. 
+ */ + record(actual: BudgetActualUsage & { kind?: BudgetKind }): void { + this.callsRecorded++; + const kind: BudgetKind = actual.kind ?? 'chat'; + const cost = costForUsage(actual.modelId, actual.inputTokens, actual.outputTokens ?? 0, kind); + + if (cost === null) { + // Unpriced model: record audit but skip cumulative math. Cap (if set) + // already rejected this call at reserve(); a record() here means the + // unpriced warn-once path let it through (cap unset). + appendAuditLine(this.auditPath, { + schema_version: 1, + ts: new Date().toISOString(), + event: 'record_unpriced', + label: this.opts.label, + kind, + model: actual.modelId, + sub_label: actual.label, + input_tokens: actual.inputTokens, + output_tokens: actual.outputTokens ?? 0, + embedding_dims: actual.embeddingDims ?? null, + }); + return; + } + + this.cumulativeUsd += cost; + appendAuditLine(this.auditPath, { + schema_version: 1, + ts: new Date().toISOString(), + event: 'record', + label: this.opts.label, + kind, + model: actual.modelId, + sub_label: actual.label, + input_tokens: actual.inputTokens, + output_tokens: actual.outputTokens ?? 0, + embedding_dims: actual.embeddingDims ?? null, + actual_cost_usd: cost, + cumulative_cost_usd: this.cumulativeUsd, + max_cost_usd: this.opts.maxCostUsd ?? null, + }); + + if (this.opts.maxCostUsd !== undefined && this.cumulativeUsd > this.opts.maxCostUsd) { + // TX1: hard-throw — a single under-estimated call exceeded the cap. 
+ this.fireExhausted(); + throw new BudgetExhausted( + `${this.opts.label}: cumulative cost $${this.cumulativeUsd.toFixed(4)} exceeded --max-cost $${this.opts.maxCostUsd.toFixed(2)} after recording ${kind} call to ${actual.modelId}`, + { reason: 'cost', spent: this.cumulativeUsd, cap: this.opts.maxCostUsd, modelId: actual.modelId }, + ); + } + } + + snapshot(): BudgetSnapshot { + return { + cumulativeCostUsd: this.cumulativeUsd, + startedAt: this.startedAt, + elapsedMs: Date.now() - this.startedAt, + maxCostUsd: this.opts.maxCostUsd, + maxRuntimeMs: this.opts.maxRuntimeMs, + callsRecorded: this.callsRecorded, + }; + } + + /** Internal helper: throw BudgetExhausted(reason:'runtime') when the wall-clock cap fires. */ + private assertRuntime(modelId: string): void { + if (this.opts.maxRuntimeMs === undefined) return; + const elapsed = Date.now() - this.startedAt; + if (elapsed > this.opts.maxRuntimeMs) { + appendAuditLine(this.auditPath, { + schema_version: 1, + ts: new Date().toISOString(), + event: 'runtime_denied', + label: this.opts.label, + elapsed_ms: elapsed, + max_runtime_ms: this.opts.maxRuntimeMs, + model: modelId, + }); + this.fireExhausted(); + throw new BudgetExhausted( + `${this.opts.label}: wall-clock ${(elapsed / 1000).toFixed(1)}s exceeded --max-runtime ${(this.opts.maxRuntimeMs / 1000).toFixed(1)}s`, + { reason: 'runtime', spent: elapsed, cap: this.opts.maxRuntimeMs, modelId }, + ); + } + } + + private fireExhausted(): void { + if (this.exhaustedFired) return; + this.exhaustedFired = true; + for (const cb of this.onExhaustedCbs) { + try { + cb(); + } catch (err) { + process.stderr.write(`[budget] onExhausted callback threw: ${String(err)}\n`); + } + } + } +} + +/** + * Pull usage out of an SDK error envelope. Common providers attach `usage` + * either at the top level (Anthropic) or under `response.usage` (OpenAI). + * Returns the fallback (pessimistic ceiling) when no usage can be found — + * NOT the conservative pre-call estimate (A3 amended). 
Callers should pass + * `{ inputTokens: estimate.estimatedInputTokens, outputTokens: estimate.maxOutputTokens }` + * so the worst-case budget is consumed on failure. + */ +export function extractUsageFromError( + err: unknown, + fallback: { inputTokens: number; outputTokens: number }, +): { inputTokens: number; outputTokens: number } { + if (err && typeof err === 'object') { + const top = (err as { usage?: unknown }).usage; + const nested = (err as { response?: { usage?: unknown } }).response?.usage; + const candidate = (top && typeof top === 'object' ? top : nested && typeof nested === 'object' ? nested : null) as + | { input_tokens?: number; output_tokens?: number; inputTokens?: number; outputTokens?: number } + | null; + if (candidate) { + const inputTokens = numericOrNull(candidate.input_tokens ?? candidate.inputTokens); + const outputTokens = numericOrNull(candidate.output_tokens ?? candidate.outputTokens); + if (inputTokens !== null || outputTokens !== null) { + return { + inputTokens: inputTokens ?? fallback.inputTokens, + outputTokens: outputTokens ?? fallback.outputTokens, + }; + } + } + } + return { inputTokens: fallback.inputTokens, outputTokens: fallback.outputTokens }; +} + +function numericOrNull(v: unknown): number | null { + return typeof v === 'number' && Number.isFinite(v) ? v : null; +} + +/** Re-export the pricing maps for introspection / test setup. */ +export { ANTHROPIC_PRICING, EMBEDDING_PRICING }; diff --git a/src/core/cycle.ts b/src/core/cycle.ts index 8593199b6..da46ce8fe 100644 --- a/src/core/cycle.ts +++ b/src/core/cycle.ts @@ -978,13 +978,25 @@ async function runPhasePurge(engine: BrainEngine, dryRun: boolean): Promise budget. Non-Anthropic models bypass the gate with a * `BUDGET_METER_NO_PRICING` warn (once per process). * * Ledger lives at `~/.gbrain/audit/dream-budget-YYYY-Www.jsonl` (ISO-week - * rotation, same pattern as shell-audit). 
Each line is one submit's cost + * rotation, same pattern as shell-audit; filename math now goes through + * `src/core/audit-week-file.ts` per T4). Each line is one submit's cost * estimate + actual usage when reported back. */ import { mkdirSync, appendFileSync } from 'node:fs'; -import { dirname } from 'node:path'; -import { gbrainPath } from '../config.ts'; +import { dirname, join } from 'node:path'; +import { isoWeekFilename, resolveAuditDir } from '../audit-week-file.ts'; import { estimateMaxCostUsd, ANTHROPIC_PRICING } from '../anthropic-pricing.ts'; export interface BudgetMeterOpts { @@ -51,15 +60,7 @@ const _unpricedWarnings = new Set(); function auditFilePath(override?: string): string { if (override) return override; - // ISO week format: YYYY-Www (2026-W18) - const now = new Date(); - const year = now.getUTCFullYear(); - // ISO week: Thursday's week. Approximated for filename only. - const oneJan = new Date(Date.UTC(year, 0, 1)); - const diffDays = Math.floor((now.getTime() - oneJan.getTime()) / 86_400_000); - const week = Math.ceil((diffDays + oneJan.getUTCDay() + 1) / 7); - const weekStr = String(week).padStart(2, '0'); - return gbrainPath(`audit/dream-budget-${year}-W${weekStr}.jsonl`); + return join(resolveAuditDir(), isoWeekFilename('dream-budget')); } function writeLedgerLine(path: string, entry: object): void { @@ -99,6 +100,7 @@ export class BudgetMeter { ); } writeLedgerLine(this.auditPath, { + schema_version: 1, phase: this.opts.phase, ts: new Date().toISOString(), event: 'submit_unpriced', @@ -120,6 +122,7 @@ export class BudgetMeter { if (this.opts.budgetUsd <= 0) { this.cumulativeUsd += cost; writeLedgerLine(this.auditPath, { + schema_version: 1, phase: this.opts.phase, ts: new Date().toISOString(), event: 'submit', @@ -135,6 +138,7 @@ export class BudgetMeter { const projected = this.cumulativeUsd + cost; if (projected > this.opts.budgetUsd) { writeLedgerLine(this.auditPath, { + schema_version: 1, phase: this.opts.phase, ts: new 
Date().toISOString(), event: 'submit_denied', @@ -155,6 +159,7 @@ export class BudgetMeter { this.cumulativeUsd += cost; writeLedgerLine(this.auditPath, { + schema_version: 1, phase: this.opts.phase, ts: new Date().toISOString(), event: 'submit', diff --git a/src/core/diarize/payload-fitter.ts b/src/core/diarize/payload-fitter.ts new file mode 100644 index 000000000..4d58e5fd2 --- /dev/null +++ b/src/core/diarize/payload-fitter.ts @@ -0,0 +1,268 @@ +/** + * v0.37.x — payload-fitter (P6) with two strategies + a quality gate. + * + * Generic utility for fitting an arbitrarily large list of items into a + * downstream caller's per-call token budget. + * + * Strategies (Q3 + codex finding #4): + * - 'batch' deterministic token-budgeted chunking. The caller + * receives a flat fit list shaped like the input; the + * chunking decision is left to the caller (e.g. the + * brainstorm judge concatenates results across batches). + * No LLM calls. + * - 'summarize' embed-cluster (k = ceil(items/4)), Haiku-summarize each + * cluster, return the fitted payload (summary nodes + * instead of every original item). Composes the active + * BudgetTracker via the gateway's AsyncLocalStorage scope + * (T3) — every Haiku call shows up in the cost ledger. + * Promise.allSettled at parallelism=4 (Perf1) so a single + * cluster-failure does not stall the whole pass. + * + * Quality gate (codex outside-voice finding #4): + * When the summarize strategy returns less than `min_success_ratio` + * (default 0.75) of attempted clusters, the result is flagged + * `degraded: true` and the caller decides whether to surface a partial + * result or abort. Brainstorm aborts on degraded; defaults can be + * relaxed per-caller. + */ + +import type { ChatOpts, ChatResult } from '../ai/gateway.ts'; + +/** Local ChatFn shape — kept here so payload-fitter doesn't depend on + * src/core/brainstorm/judges.ts (which is the canonical owner of the + * ChatFn alias today). 
*/ +type ChatFn = (opts: ChatOpts) => Promise<ChatResult>; + +export type FitStrategy = 'batch' | 'summarize'; + +export interface FitOptions<T> { + items: T[]; + strategy: FitStrategy; + /** Hard per-call token budget. 'batch' chunks under this; 'summarize' + * shapes its k-clusters so each cluster fits this budget. */ + maxTokensPerCall: number; + /** Token estimator. Caller-supplied so payload-fitter is generic. */ + estimateTokens: (item: T) => number; + // ---- summarize-only ---- + /** Optional embed function (only used by 'summarize'). Caller supplies + * the active gateway.embed binding. */ + embedFn?: (text: string) => Promise<Float32Array>; + /** Optional chat function for summarization. Caller supplies the + * active gateway.chat binding. */ + chatFn?: ChatFn; + /** Summarize-only: convert an item to text for embed + summarize. */ + itemToText?: (item: T) => string; + /** Summarize-only: convert a Haiku summary string back into an item- + * shaped fitted node. Caller-supplied so the fitted list has the + * caller's own type. */ + summaryToItem?: (summary: string, cluster: T[]) => T; + /** Summarize parallelism. Default 4 per Perf1. */ + parallelism?: number; + /** Quality gate threshold. Default 0.75. When the success ratio drops + * below this, result.degraded === true. */ + min_success_ratio?: number; + /** Override the summarization model (e.g. 'anthropic:claude-haiku-4-5'). + * Default falls back to the gateway's configured chat model. */ + summarizeModel?: string; +} + +export interface FitResult<T> { + fitted: T[]; + strategy: FitStrategy; + /** Count of clusters that failed (summarize), or of items that individually + * exceed maxTokensPerCall (batch). */ + dropped: number; + /** Ratio of successful clusters (summarize) or of items within the + * per-call budget (batch); 1.0 when nothing was dropped. */ + success_ratio: number; + /** True when success_ratio < min_success_ratio (summarize only; batch + * always reports false). */ + degraded: boolean; + /** Total LLM usage rolled up across summarize calls. Undefined for batch. 
*/ + usage?: ChatResult['usage']; +} + +const DEFAULT_PARALLELISM = 4; +const DEFAULT_MIN_SUCCESS_RATIO = 0.75; + +/** + * Public entry point. Dispatches on strategy. Caller misuse (e.g. + * summarize without embedFn/chatFn) rejects with `Error` before any embed + * or chat call, so it fails loudly. + */ +export async function fit<T>(opts: FitOptions<T>): Promise<FitResult<T>> { + if (opts.strategy === 'batch') { + return fitBatch(opts); + } + if (opts.strategy === 'summarize') { + return fitSummarize(opts); + } + throw new Error(`payload-fitter: unknown strategy "${(opts as { strategy: string }).strategy}"`); +} + +/** + * 'batch' strategy: deterministic, token-budgeted chunking. Returns the + * original items unchanged (no LLM calls). `dropped` is the count of + * items that exceeded the per-call budget all on their own; these are + * preserved in `fitted` (caller decides whether to surface a warning) + * but they signal a budgeting mismatch the caller should know about. + */ +function fitBatch<T>(opts: FitOptions<T>): FitResult<T> { + const dropped = opts.items.filter((it) => opts.estimateTokens(it) > opts.maxTokensPerCall).length; + return { + fitted: opts.items.slice(), + strategy: 'batch', + dropped, + success_ratio: opts.items.length === 0 ? 1.0 : (opts.items.length - dropped) / opts.items.length, + degraded: false, + }; +} + +/** + * 'summarize' strategy: embed-cluster then Haiku-summarize each cluster. + * + * 1. embed every item (caller-supplied embedFn). + * 2. cluster into k = ceil(items/4) groups via cheap greedy nearest- + * neighbor on cosine similarity (deterministic; no sklearn). + * 3. parallel Haiku-summarize each cluster via Promise.allSettled + * with parallelism `opts.parallelism ?? 4` (Perf1). + * 4. drop failed clusters; surface a `degraded: true` flag when the + * success ratio falls below `min_success_ratio`. + * + * Each Haiku call composes the active BudgetTracker via AsyncLocalStorage + * (no per-call injection). 
On BudgetExhausted the call throws; the caller's + * outer catch handles persistence. + */ +async function fitSummarize<T>(opts: FitOptions<T>): Promise<FitResult<T>> { + if (!opts.embedFn || !opts.chatFn || !opts.itemToText || !opts.summaryToItem) { + throw new Error( + `payload-fitter: strategy='summarize' requires embedFn + chatFn + itemToText + summaryToItem`, + ); + } + const minRatio = opts.min_success_ratio ?? DEFAULT_MIN_SUCCESS_RATIO; + const parallelism = Math.max(1, opts.parallelism ?? DEFAULT_PARALLELISM); + + if (opts.items.length === 0) { + return { fitted: [], strategy: 'summarize', dropped: 0, success_ratio: 1.0, degraded: false }; + } + + // 1. Embed every item. The gateway.embed call composes the active + // tracker; a budget throw here propagates cleanly. + const texts = opts.items.map((it) => opts.itemToText!(it)); + const embeds: Float32Array[] = []; + for (const text of texts) { + embeds.push(await opts.embedFn(text)); + } + + // 2. Greedy clustering. Pick the first un-clustered item as the seed; + // add the (clusterSize - 1) closest remaining items by cosine. + // Deterministic given the input order. k = ceil(items / 4). + const k = Math.max(1, Math.ceil(opts.items.length / 4)); + const clusterSize = Math.ceil(opts.items.length / k); + const claimed = new Set<number>(); + const clusters: number[][] = []; + for (let c = 0; c < k && claimed.size < opts.items.length; c++) { + let seedIdx = -1; + for (let i = 0; i < opts.items.length; i++) { + if (!claimed.has(i)) { + seedIdx = i; + break; + } + } + if (seedIdx === -1) break; + claimed.add(seedIdx); + const group = [seedIdx]; + const seedVec = embeds[seedIdx]; + // Score remaining un-claimed by similarity to seed; pick closest until cluster is full. 
+ const remaining = opts.items + .map((_, idx) => idx) + .filter((idx) => idx !== seedIdx && !claimed.has(idx)) + .map((idx) => ({ idx, sim: cosine(seedVec, embeds[idx]) })) + .sort((a, b) => b.sim - a.sim); + for (const cand of remaining) { + if (group.length >= clusterSize) break; + claimed.add(cand.idx); + group.push(cand.idx); + } + clusters.push(group); + } + + // 3. Parallel summarize via allSettled with bounded concurrency. + const fitted: T[] = []; + const totalUsage: ChatResult['usage'] = { + input_tokens: 0, + output_tokens: 0, + cache_read_tokens: 0, + cache_creation_tokens: 0, + }; + let failed = 0; + for (let i = 0; i < clusters.length; i += parallelism) { + const wave = clusters.slice(i, i + parallelism); + const results = await Promise.allSettled( + wave.map((group) => summarizeCluster(group, opts, texts)), + ); + for (let j = 0; j < results.length; j++) { + const r = results[j]; + const group = wave[j]; + if (r.status === 'fulfilled') { + fitted.push(opts.summaryToItem!(r.value.summary, group.map((idx) => opts.items[idx]))); + totalUsage.input_tokens += r.value.usage.input_tokens; + totalUsage.output_tokens += r.value.usage.output_tokens; + if (typeof r.value.usage.cache_read_tokens === 'number') { + totalUsage.cache_read_tokens = + (totalUsage.cache_read_tokens ?? 0) + r.value.usage.cache_read_tokens; + } + if (typeof r.value.usage.cache_creation_tokens === 'number') { + totalUsage.cache_creation_tokens = + (totalUsage.cache_creation_tokens ?? 0) + r.value.usage.cache_creation_tokens; + } + } else { + failed++; + } + } + } + + const succeeded = clusters.length - failed; + const success_ratio = clusters.length === 0 ? 
1.0 : succeeded / clusters.length; + const degraded = success_ratio < minRatio; + return { + fitted, + strategy: 'summarize', + dropped: failed, + success_ratio, + degraded, + usage: totalUsage, + }; +} + +interface SummarizeOutcome { + summary: string; + usage: ChatResult['usage']; +} + +async function summarizeCluster<T>( + group: number[], + opts: FitOptions<T>, + texts: string[], +): Promise<SummarizeOutcome> { + const chat = opts.chatFn!; + const lines = group.map((idx) => `- ${texts[idx]}`).join('\n'); + const prompt = `Summarize the following items in ~3 sentences capturing the load-bearing themes. Do not paraphrase verbatim.\n\n${lines}`; + const res = await chat({ + model: opts.summarizeModel, + messages: [{ role: 'user', content: prompt }], + maxTokens: 400, + }); + return { summary: res.text.trim(), usage: res.usage }; +} + +function cosine(a: Float32Array, b: Float32Array): number { + const len = Math.min(a.length, b.length); + let dot = 0; + let na = 0; + let nb = 0; + for (let i = 0; i < len; i++) { + dot += a[i] * b[i]; + na += a[i] * a[i]; + nb += b[i] * b[i]; + } + if (na === 0 || nb === 0) return 0; + return dot / (Math.sqrt(na) * Math.sqrt(nb)); +} diff --git a/src/core/eval-contradictions/runner.ts b/src/core/eval-contradictions/runner.ts index 8c2873530..7a16728af 100644 --- a/src/core/eval-contradictions/runner.ts +++ b/src/core/eval-contradictions/runner.ts @@ -33,6 +33,8 @@ import { JudgeCache } from './cache.ts'; import { CostTracker, estimateUpperBoundCost } from './cost-tracker.ts'; import { buildSourceTierBreakdown, classifySlugTier } from './cross-source.ts'; import { shouldSkipForDateMismatch } from './date-filter.ts'; +import { withBudgetTracker } from '../ai/gateway.ts'; +import { BudgetTracker, BudgetExhausted } from '../budget/budget-tracker.ts'; import { judgeContradiction, type JudgeInput, type JudgeOutput } from './judge.ts'; import { JudgeErrorCollector } from './judge-errors.ts'; import { buildHotPages } from './severity-classify.ts'; @@ -225,6 
+227,34 @@ function sortPairs( * strings — CLI flag parsing lives in the command file, not here. */ export async function runContradictionProbe(opts: RunnerOpts): Promise { + // T6: wrap the entire body in withBudgetTracker so every gateway-layer + // chat/embed/rerank call (judge, embed-on-query) auto-records via the + // AsyncLocalStorage scope from src/core/ai/gateway.ts. The existing + // CostTracker stays for the report shape — the new BudgetTracker is a + // parallel record-keeper that doesn't enforce a cap on top of the + // existing soft ceiling. Public surface (--budget-usd, PreFlightBudgetError) + // is byte-identical. + const _outerBudgetUsd = opts.budgetUsd ?? 5.0; + const _runnerTracker = new BudgetTracker({ + // Set the cap only when callers passed --budget-usd explicitly; this + // keeps the existing soft-ceiling semantics from CostTracker as the + // primary enforcement and uses the new tracker for telemetry only. + label: 'eval.suspected-contradictions', + }); + try { + return await withBudgetTracker(_runnerTracker, () => _runContradictionProbeInner(opts)); + } catch (err) { + // BudgetExhausted from the gateway path should bubble cleanly. With no + // cap set, the tracker only records; it doesn't throw, so this path + // is reachable only via future opt-in. + if (err instanceof BudgetExhausted) { + throw err; + } + throw err; + } +} + +async function _runContradictionProbeInner(opts: RunnerOpts): Promise { const startedAt = Date.now(); const judgeModel = opts.judgeModel ?? DEFAULT_JUDGE_MODEL; const topK = Math.max(1, opts.topK ?? 
DEFAULT_TOP_K); diff --git a/src/core/facts/phantom-audit.ts b/src/core/facts/phantom-audit.ts index 525ccedf3..2365d3490 100644 --- a/src/core/facts/phantom-audit.ts +++ b/src/core/facts/phantom-audit.ts @@ -20,7 +20,7 @@ import * as fs from 'node:fs'; import * as path from 'node:path'; -import { resolveAuditDir } from '../minions/handlers/shell-audit.ts'; +import { isoWeekFilename, resolveAuditDir } from '../audit-week-file.ts'; export type PhantomOutcome = | 'redirected' @@ -41,18 +41,10 @@ export interface PhantomAuditEvent { candidates?: Array<{ slug: string; connection_count: number }>; } -/** ISO-week-rotated filename: `phantoms-YYYY-Www.jsonl`. */ +/** ISO-week-rotated filename: `phantoms-YYYY-Www.jsonl`. Delegates to + * `src/core/audit-week-file.ts`. */ export function computePhantomAuditFilename(now: Date = new Date()): string { - const d = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate())); - const dayNum = (d.getUTCDay() + 6) % 7; - d.setUTCDate(d.getUTCDate() - dayNum + 3); - const isoYear = d.getUTCFullYear(); - const firstThursday = new Date(Date.UTC(isoYear, 0, 4)); - const firstThursdayDayNum = (firstThursday.getUTCDay() + 6) % 7; - firstThursday.setUTCDate(firstThursday.getUTCDate() - firstThursdayDayNum + 3); - const weekNum = Math.round((d.getTime() - firstThursday.getTime()) / (7 * 86400000)) + 1; - const ww = String(weekNum).padStart(2, '0'); - return `phantoms-${isoYear}-W${ww}.jsonl`; + return isoWeekFilename('phantoms', now); } /** diff --git a/src/core/migrate.ts b/src/core/migrate.ts index 4c9ea4fad..28938caba 100644 --- a/src/core/migrate.ts +++ b/src/core/migrate.ts @@ -3992,6 +3992,35 @@ export const MIGRATIONS: Migration[] = [ ADD COLUMN IF NOT EXISTS budget_usd_per_day NUMERIC(10, 2) NULL; `, }, + { + version: 86, + name: 'page_links_view_alias', + // v0.39 — pglite-engine.ts and postgres-engine.ts both query a relation + // named `page_links` (LEFT JOIN page_links pl ON pl.to_page_id = p.id — + // see 
pglite-engine.ts:896 / postgres-engine.ts:959). The canonical + // table has always been `links`. This migration installs a `page_links` + // VIEW that aliases the table so brains initialized before the v0.39 + // schema bundle pick up the alias on upgrade. + // + // Fresh installs already get the view via the embedded schema bundle. + // This migration is idempotent (CREATE OR REPLACE VIEW) so re-running + // is safe on either engine. + // + // Discovered during the brainstorm-cathedral wave (v0.39.0.0) when the + // E2E test had to work around the missing view to exercise the resume + // path. Originally numbered v81; renumbered to v86 during merge with + // master's v0.38 cathedrals (provenance / subagent / spend / oauth + // binding) which claimed v81-v85. + // + // Narrow projection (id, from_page_id, to_page_id) so the view does not + // depend on columns added in later migrations (link_source, + // origin_page_id, resolution_type); this keeps ALTER TABLE DROP COLUMN + // and the bootstrap forward-reference probes unblocked on legacy brains. + sql: ` + CREATE OR REPLACE VIEW page_links AS + SELECT id, from_page_id, to_page_id FROM links; + `, + }, ]; export const LATEST_VERSION = MIGRATIONS.length > 0 diff --git a/src/core/minions/handlers/shell-audit.ts b/src/core/minions/handlers/shell-audit.ts index 06bf35c48..21d2583a4 100644 --- a/src/core/minions/handlers/shell-audit.ts +++ b/src/core/minions/handlers/shell-audit.ts @@ -15,7 +15,7 @@ import * as fs from 'node:fs'; import * as path from 'node:path'; -import { gbrainPath } from '../../config.ts'; +import { isoWeekFilename, resolveAuditDir as _sharedResolveAuditDir } from '../../audit-week-file.ts'; export interface ShellAuditEvent { ts: string; @@ -30,33 +30,18 @@ export interface ShellAuditEvent { inherit?: string[]; } -/** Compute `shell-jobs-YYYY-Www.jsonl` using ISO-8601 week numbering. 
- * - * Year-boundary edge: 2027-01-01 is ISO week 53 of year 2026, so the correct - * filename is `shell-jobs-2026-W53.jsonl`. This matches the ISO week standard - * (week containing the first Thursday of the year is W1; week containing Dec 28 - * is always W52 or W53 of that year). - */ +/** Compute `shell-jobs-YYYY-Www.jsonl`. Delegates to the shared helper in + * `src/core/audit-week-file.ts` — Year-boundary edges (2027-01-01 → W53 of + * 2026, 2020-W53 etc.) are covered by `test/core/audit-week-file.test.ts`. */ export function computeAuditFilename(now: Date = new Date()): string { - // Copy date and move to nearest Thursday (ISO week anchor). - const d = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate())); - const dayNum = (d.getUTCDay() + 6) % 7; // Mon=0, Sun=6 - d.setUTCDate(d.getUTCDate() - dayNum + 3); // shift to Thursday - const isoYear = d.getUTCFullYear(); - const firstThursday = new Date(Date.UTC(isoYear, 0, 4)); - const firstThursdayDayNum = (firstThursday.getUTCDay() + 6) % 7; - firstThursday.setUTCDate(firstThursday.getUTCDate() - firstThursdayDayNum + 3); - const weekNum = Math.round((d.getTime() - firstThursday.getTime()) / (7 * 86400000)) + 1; - const ww = String(weekNum).padStart(2, '0'); - return `shell-jobs-${isoYear}-W${ww}.jsonl`; + return isoWeekFilename('shell-jobs', now); } /** Resolve the audit dir. Honors `GBRAIN_AUDIT_DIR` for container/sandbox deployments - * where `$HOME` is read-only. Defaults to `~/.gbrain/audit/`. */ + * where `$HOME` is read-only. Defaults to `~/.gbrain/audit/`. Delegates to the + * shared helper. 
*/ export function resolveAuditDir(): string { - const override = process.env.GBRAIN_AUDIT_DIR; - if (override && override.trim().length > 0) return override; - return gbrainPath('audit'); + return _sharedResolveAuditDir(); } export function logShellSubmission(event: Omit): void { diff --git a/src/core/minions/handlers/subagent.ts b/src/core/minions/handlers/subagent.ts index ddd4ef1a0..5dda236aa 100644 --- a/src/core/minions/handlers/subagent.ts +++ b/src/core/minions/handlers/subagent.ts @@ -395,6 +395,31 @@ export function makeSubagentHandler(deps: SubagentDeps) { } // 1. Acquire rate lease for the outbound call. + // + // A1 ORDERING (v0.37.x budget cathedral): + // + // +----------------------------------+ + // | gateway.chat() inside subagent | + // +-----+----------------------------+ + // | + // 1. getCurrentBudgetTracker()?.reserve(...) + // | (runs via the gateway's AsyncLocalStorage scope, + // | set by the upstream caller of the subagent. + // | On BudgetExhausted: throw BEFORE we touch the lease.) + // v + // 2. acquireLease(...) <-- the line below + // | (only attempted if the budget gate passed) + // v + // 3. provider HTTP call + // | + // v + // 4. tracker.record(actual usage) + // + // The handler body intentionally does NOT thread `BudgetTracker` + // explicitly. Gateway-layer composition (TX5) handles it. The + // ordering is load-bearing: a budget throw must NOT consume a + // lease slot, because the lease is the rate-limit pacer for the + // entire fleet. 
const lease = await acquireLease(engine, rateLeaseKey, ctx.id, maxConcurrent, { ttlMs: leaseTtlMs }); if (!lease.acquired) { // No slots — treat as a renewable error so the worker re-claims diff --git a/src/core/pglite-schema.ts b/src/core/pglite-schema.ts index 1661b1eb8..02f715bc3 100644 --- a/src/core/pglite-schema.ts +++ b/src/core/pglite-schema.ts @@ -171,6 +171,20 @@ CREATE INDEX IF NOT EXISTS idx_links_to ON links(to_page_id); CREATE INDEX IF NOT EXISTS idx_links_source ON links(link_source); CREATE INDEX IF NOT EXISTS idx_links_origin ON links(origin_page_id); +-- v0.38: page_links is the alias the engine queries use (pglite-engine.ts + +-- postgres-engine.ts both JOIN page_links pl ON pl.to_page_id = p.id). The +-- alias predates the table-name standardization; the canonical table is +-- links. Brainstorm domain-bank connection_count tiebreaker and the +-- doctor link-density score read through this view. +-- +-- The projection is intentionally NARROW (id, from_page_id, to_page_id only). +-- Engine queries only reference pl.id (via COUNT(*)) and pl.to_page_id. +-- Including link_source / origin_page_id / etc. in the view would couple +-- the alias to columns that didn't exist in pre-v0.13 brains AND would +-- block ALTER TABLE DROP COLUMN on those columns during upgrades. +CREATE OR REPLACE VIEW page_links AS + SELECT id, from_page_id, to_page_id FROM links; + -- ============================================================ -- tags -- ============================================================ diff --git a/src/core/remediation-checkpoint.ts b/src/core/remediation-checkpoint.ts new file mode 100644 index 000000000..3f780a5ed --- /dev/null +++ b/src/core/remediation-checkpoint.ts @@ -0,0 +1,123 @@ +/** + * v0.37.x — doctor --remediate checkpoint (A4 amended). 
+ * + * When `gbrain doctor --remediate --max-cost N` blows past the cap mid-run + * (BudgetTracker throws BudgetExhausted via the gateway-layer + * AsyncLocalStorage), the runRemediate orchestrator persists what's been + * completed so the user can continue with `gbrain doctor --remediate --resume`. + * + * Checkpoint file: `~/.gbrain/remediation/.json` + * - plan_hash = sha256(JSON.stringify(sorted recommendation ids)).slice(0,16) + * - schema_version: 1 + * + * Best-effort write: a disk-full checkpoint never blocks the throw; we'd + * rather surface the BudgetExhausted than swallow it because the audit + * sidecar failed. + */ + +import { mkdirSync, writeFileSync, readFileSync, readdirSync, statSync, existsSync, unlinkSync } from 'node:fs'; +import { join } from 'node:path'; +import { createHash } from 'node:crypto'; +import { gbrainPath } from './config.ts'; + +export interface RemediationCheckpoint { + schema_version: 1; + plan_hash: string; + doctor_run_id: string; + target_score: number; + started_at: string; + completed: Array<{ + id: string; + job: string; + idempotency_key?: string; + status: string; + job_id?: number | null; + }>; + aborted_at: string; + abort_reason: 'budget_exhausted' | 'manual' | 'error'; + budget_snapshot?: { + spent: number; + cap: number; + reason: string; + model_id?: string; + }; +} + +function checkpointDir(): string { + return gbrainPath('remediation'); +} + +export function computePlanHash(recommendationIds: string[]): string { + const sorted = [...recommendationIds].sort(); + const sha = createHash('sha256').update(JSON.stringify(sorted)).digest('hex'); + return sha.slice(0, 16); +} + +export function checkpointPath(planHash: string): string { + return join(checkpointDir(), `${planHash}.json`); +} + +export function saveRemediationCheckpoint(cp: RemediationCheckpoint): void { + try { + mkdirSync(checkpointDir(), { recursive: true }); + const path = checkpointPath(cp.plan_hash); + const tmp = `${path}.tmp`; + 
writeFileSync(tmp, JSON.stringify(cp, null, 2)); + // Atomic rename via fs.renameSync — Node guarantees POSIX atomicity on same-fs renames. + const { renameSync } = require('node:fs') as typeof import('node:fs'); + renameSync(tmp, path); + } catch (err) { + process.stderr.write(`[remediate] checkpoint write failed: ${String(err)}\n`); + } +} + +export function loadRemediationCheckpoint(planHash: string): RemediationCheckpoint | null { + const path = checkpointPath(planHash); + if (!existsSync(path)) return null; + try { + const raw = readFileSync(path, 'utf-8'); + const parsed = JSON.parse(raw) as RemediationCheckpoint; + if (parsed.schema_version !== 1) { + process.stderr.write(`[remediate] checkpoint ${planHash} has schema_version ${parsed.schema_version}; ignoring.\n`); + return null; + } + return parsed; + } catch (err) { + process.stderr.write(`[remediate] checkpoint read failed: ${String(err)}\n`); + return null; + } +} + +/** List checkpoint files mtime-ordered, newest first. Best-effort. */ +export function listRemediationCheckpoints(): Array<{ plan_hash: string; mtime: number }> { + const dir = checkpointDir(); + if (!existsSync(dir)) return []; + try { + const entries = readdirSync(dir).filter((f) => f.endsWith('.json')); + return entries + .map((f) => { + try { + const path = join(dir, f); + const m = statSync(path).mtimeMs; + return { plan_hash: f.replace(/\.json$/, ''), mtime: m }; + } catch { + return null; + } + }) + .filter((x): x is { plan_hash: string; mtime: number } => x !== null) + .sort((a, b) => b.mtime - a.mtime); + } catch { + return []; + } +} + +/** Delete a checkpoint after successful completion. Idempotent. */ +export function clearRemediationCheckpoint(planHash: string): void { + const path = checkpointPath(planHash); + if (!existsSync(path)) return; + try { + unlinkSync(path); + } catch { + // Best-effort. 
+ } +} diff --git a/test/brainstorm/checkpoint.serial.test.ts b/test/brainstorm/checkpoint.serial.test.ts new file mode 100644 index 000000000..101ddedb6 --- /dev/null +++ b/test/brainstorm/checkpoint.serial.test.ts @@ -0,0 +1,223 @@ +/** + * v0.37.x — brainstorm checkpoint contract (TX3/TX4/A5 amended). + * + * Pins: + * - computeRunId is deterministic + invariant to slug-array sort order. + * - computeRunId is stable across embedding-model swaps (no embedding + * bits in the hash). + * - saveCheckpoint atomic via .tmp + rename. + * - loadCheckpoint returns null on missing file + schema_version + * mismatch. + * - listRuns mtime-ordered (newest first). + * - gcStaleCheckpoints unlinks > N days. + * - Round-trip preserves `ideas` bodies (TX3 load-bearing contract). + */ + +import { describe, test, expect, beforeEach, afterEach } from 'bun:test'; +import { mkdtempSync, rmSync, existsSync, readFileSync, writeFileSync, utimesSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; +import { + computeRunId, + saveCheckpoint, + loadCheckpoint, + listRuns, + gcStaleCheckpoints, + clearCheckpoint, + isCheckpointFresh, + type BrainstormCheckpoint, +} from '../../src/core/brainstorm/checkpoint.ts'; + +let homeBackup: string | undefined; +let tmp: string; + +beforeEach(() => { + tmp = mkdtempSync(join(tmpdir(), 'gbrain-bs-cp-')); + homeBackup = process.env.GBRAIN_HOME; + process.env.GBRAIN_HOME = tmp; +}); + +afterEach(() => { + if (homeBackup === undefined) delete process.env.GBRAIN_HOME; + else process.env.GBRAIN_HOME = homeBackup; + rmSync(tmp, { recursive: true, force: true }); +}); + +function fixtureCheckpoint(runId: string, ideas: Array<{ text: string; cross: string }> = []): BrainstormCheckpoint { + return { + schema_version: 2, + run_id: runId, + question: 'why are AI coding tools converging on the same UX?', + profile_label: 'brainstorm', + started_at: new Date().toISOString(), + completed_crosses: ideas.map((i, idx) => ({ + 
close_slug: `wiki/close-${idx}`, + far_slug: `wiki/far-${idx}`, + cross_id: i.cross, + ideas: [{ text: i.text, cross_id: i.cross }], + })), + failed_crosses: [], + judge_done: false, + }; +} + +describe('computeRunId (A5 amended)', () => { + test('deterministic for the same inputs', () => { + const a = computeRunId('Q', 'brainstorm', ['close/a', 'close/b'], ['far/c', 'far/d']); + const b = computeRunId('Q', 'brainstorm', ['close/a', 'close/b'], ['far/c', 'far/d']); + expect(a).toBe(b); + }); + + test('invariant to slug-array order', () => { + const a = computeRunId('Q', 'lsd', ['close/a', 'close/b'], ['far/c', 'far/d']); + const b = computeRunId('Q', 'lsd', ['close/b', 'close/a'], ['far/d', 'far/c']); + expect(a).toBe(b); + }); + + test('differs when question changes', () => { + const a = computeRunId('Q1', 'brainstorm', ['s'], ['t']); + const b = computeRunId('Q2', 'brainstorm', ['s'], ['t']); + expect(a).not.toBe(b); + }); + + test('differs when profile changes', () => { + const a = computeRunId('Q', 'brainstorm', ['s'], ['t']); + const b = computeRunId('Q', 'lsd', ['s'], ['t']); + expect(a).not.toBe(b); + }); + + test('stable across embedding-model swaps (no embedding bits)', () => { + // The identity formula uses ONLY question+profile+slug-arrays. We + // simulate a model swap by varying nothing — the run_id must be + // independent of any embedding state, which means we get the same + // hash from the same call. 
+ const slugs = ['close/a']; + const far = ['far/b']; + expect(computeRunId('Q', 'brainstorm', slugs, far)).toBe( + computeRunId('Q', 'brainstorm', slugs, far), + ); + }); + + test('produces a stable 16-char hex prefix', () => { + const id = computeRunId('Q', 'brainstorm', ['s'], ['t']); + expect(id).toMatch(/^[0-9a-f]{16}$/); + }); +}); + +describe('save + load round-trip (TX3 load-bearing — full ideas preserved)', () => { + test('preserves completed_crosses ideas verbatim', () => { + const runId = 'ab1234567890cdef'; + const cp = fixtureCheckpoint(runId, [ + { text: 'idea body one — concrete grounding here', cross: 'C1' }, + { text: 'idea body two', cross: 'C2' }, + { text: 'idea body three with extra detail', cross: 'C3' }, + ]); + saveCheckpoint(cp); + const loaded = loadCheckpoint(runId); + expect(loaded).not.toBeNull(); + expect(loaded!.completed_crosses.length).toBe(3); + expect(loaded!.completed_crosses[0].ideas[0].text).toBe('idea body one — concrete grounding here'); + expect(loaded!.completed_crosses[0].ideas[0].cross_id).toBe('C1'); + expect(loaded!.completed_crosses[2].ideas[0].text).toBe('idea body three with extra detail'); + }); + + test('atomic write: no .tmp left behind on success', () => { + const cp = fixtureCheckpoint('atomicrenameabcd'); + saveCheckpoint(cp); + const dir = join(tmp, '.gbrain', 'brainstorm'); + expect(existsSync(join(dir, 'atomicrenameabcd.json'))).toBe(true); + expect(existsSync(join(dir, 'atomicrenameabcd.json.tmp'))).toBe(false); + }); + + test('loadCheckpoint returns null on missing file', () => { + expect(loadCheckpoint('no_such_run_id')).toBeNull(); + }); + + test('loadCheckpoint returns null + stderr WARN on schema mismatch', () => { + const runId = 'schemamismatch00'; + const cp = fixtureCheckpoint(runId); + saveCheckpoint(cp); + const path = join(tmp, '.gbrain', 'brainstorm', `${runId}.json`); + const raw = JSON.parse(readFileSync(path, 'utf-8')); + raw.schema_version = 1; + writeFileSync(path, JSON.stringify(raw)); + 
expect(loadCheckpoint(runId)).toBeNull(); + }); + + test('loadCheckpoint returns null on corrupt JSON', () => { + const runId = 'corruptjson00000'; + saveCheckpoint(fixtureCheckpoint(runId)); + writeFileSync(join(tmp, '.gbrain', 'brainstorm', `${runId}.json`), '{not json}'); + expect(loadCheckpoint(runId)).toBeNull(); + }); +}); + +describe('listRuns mtime-newest-first', () => { + test('empty dir returns []', () => { + expect(listRuns()).toEqual([]); + }); + + test('returns most-recently-saved first', async () => { + saveCheckpoint(fixtureCheckpoint('run00000000first')); + await new Promise((r) => setTimeout(r, 20)); + saveCheckpoint(fixtureCheckpoint('run0000000second')); + const list = listRuns(); + expect(list.length).toBe(2); + expect(list[0].run_id).toBe('run0000000second'); + expect(list[1].run_id).toBe('run00000000first'); + }); +}); + +describe('gcStaleCheckpoints (A5 7-day window)', () => { + test('removes files older than the threshold; returns count', () => { + const stale = 'stalecheckpoint1'; + const fresh = 'freshcheckpoint2'; + saveCheckpoint(fixtureCheckpoint(stale)); + saveCheckpoint(fixtureCheckpoint(fresh)); + // Set the stale file's mtime to 30 days ago. 
+ const stalePath = join(tmp, '.gbrain', 'brainstorm', `${stale}.json`); + const oldTime = (Date.now() - 30 * 24 * 60 * 60 * 1000) / 1000; + utimesSync(stalePath, oldTime, oldTime); + const removed = gcStaleCheckpoints(7); + expect(removed).toBe(1); + expect(existsSync(stalePath)).toBe(false); + expect(existsSync(join(tmp, '.gbrain', 'brainstorm', `${fresh}.json`))).toBe(true); + }); + + test('returns 0 when dir is empty', () => { + expect(gcStaleCheckpoints(7)).toBe(0); + }); +}); + +describe('clearCheckpoint', () => { + test('removes file when present', () => { + saveCheckpoint(fixtureCheckpoint('cleartest0000000')); + const path = join(tmp, '.gbrain', 'brainstorm', `cleartest0000000.json`); + expect(existsSync(path)).toBe(true); + clearCheckpoint('cleartest0000000'); + expect(existsSync(path)).toBe(false); + }); + + test('idempotent on missing file', () => { + expect(() => clearCheckpoint('never_saved')).not.toThrow(); + }); +}); + +describe('isCheckpointFresh', () => { + test('true for newly-saved checkpoint', () => { + saveCheckpoint(fixtureCheckpoint('freshtest0000000')); + expect(isCheckpointFresh('freshtest0000000')).toBe(true); + }); + + test('false for missing checkpoint', () => { + expect(isCheckpointFresh('not_saved')).toBe(false); + }); + + test('false for >7 day old checkpoint', () => { + saveCheckpoint(fixtureCheckpoint('oldtest000000000')); + const path = join(tmp, '.gbrain', 'brainstorm', 'oldtest000000000.json'); + const oldTime = (Date.now() - 10 * 24 * 60 * 60 * 1000) / 1000; + utimesSync(path, oldTime, oldTime); + expect(isCheckpointFresh('oldtest000000000')).toBe(false); + }); +}); diff --git a/test/brainstorm/cost-guardrails.test.ts b/test/brainstorm/cost-guardrails.test.ts new file mode 100644 index 000000000..dcc4c127e --- /dev/null +++ b/test/brainstorm/cost-guardrails.test.ts @@ -0,0 +1,165 @@ +/** + * v0.37.1 — cost guardrails + judge chunking + far-set cap. + * + * Regression suite for fix/brainstorm-cost-guardrails. 
The 13K-page brain + * incident: estimated cost $0.96, actual $50.71 (53x over) because the + * domain-bank's `listPrefixSampledPages` returned one page per prefix and + * the brain had ~2K distinct prefixes. The judge phase then tried to score + * 15,868 ideas in a single LLM call (3M tokens > 1M context window). + * + * These tests pin the new behavior: + * - CLI parses --max-cost, --max-far-set, --strict-budget, --judge-model, + * --max-ideas-per-judge-call. + * - runJudge chunks large idea sets into batches of `maxIdeasPerCall`. + * - fetchFar caps the prefix list to `maxFarSet` and trims pages to `m`. + */ + +import { describe, test, expect } from 'bun:test'; +import { parseBrainstormArgs } from '../../src/commands/brainstorm.ts'; +import { runJudge, BRAINSTORM_JUDGE_CONFIG, type JudgeIdea } from '../../src/core/brainstorm/judges.ts'; +import type { ChatOpts, ChatResult } from '../../src/core/ai/gateway.ts'; + +describe('parseBrainstormArgs — new cost-guardrail flags', () => { + test('--max-cost parses positive float', () => { + const r = parseBrainstormArgs(['hello', '--max-cost', '2.50']); + expect(r.maxCost).toBe(2.5); + expect(r.error).toBeUndefined(); + }); + + test('--max-cost rejects non-positive', () => { + const r = parseBrainstormArgs(['hello', '--max-cost', '0']); + expect(r.error).toMatch(/--max-cost/); + }); + + test('--max-far-set parses positive int', () => { + const r = parseBrainstormArgs(['hello', '--max-far-set', '20']); + expect(r.maxFarSet).toBe(20); + }); + + test('--strict-budget is a boolean flag', () => { + const r = parseBrainstormArgs(['hello', '--strict-budget']); + expect(r.strictBudget).toBe(true); + }); + + test('--judge-model captures the next arg', () => { + const r = parseBrainstormArgs(['hello', '--judge-model', 'anthropic:claude-sonnet-4-6']); + expect(r.judgeModel).toBe('anthropic:claude-sonnet-4-6'); + }); + + test('--judge-model rejects missing value', () => { + const r = parseBrainstormArgs(['hello', '--judge-model']); + 
expect(r.error).toMatch(/--judge-model/); + }); + + test('--max-ideas-per-judge-call parses positive int', () => { + const r = parseBrainstormArgs(['hello', '--max-ideas-per-judge-call', '50']); + expect(r.maxIdeasPerJudgeCall).toBe(50); + }); + + test('flags compose with --limit and --yes', () => { + const r = parseBrainstormArgs([ + 'why are AI coding tools converging', + '--max-cost', '10', + '--max-far-set', '25', + '--limit', '8', + '--yes', + ]); + expect(r.error).toBeUndefined(); + expect(r.maxCost).toBe(10); + expect(r.maxFarSet).toBe(25); + expect(r.limit).toBe(8); + expect(r.yes).toBe(true); + expect(r.question).toBe('why are AI coding tools converging'); + }); +}); + +describe('runJudge — chunks large idea sets to avoid context overflow', () => { + // Build a fake chat that returns a well-formed batch verdict for whatever + // ideas are in the prompt. The mock parses the `## Idea ` headings to + // know which ids it should score, so we can assert each chunk lands. + function makeFakeChat() { + const state = { calls: 0, lastIdeaCount: 0, allScoredIds: [] as string[] }; + const chat = async (opts: ChatOpts): Promise<ChatResult> => { + state.calls += 1; + const rawContent = opts.messages[0]?.content; + const user = typeof rawContent === 'string' ? 
rawContent : ''; + const ideaMatches = Array.from(user.matchAll(/## Idea (\S+)/g)).map((m) => m[1] as string); + state.lastIdeaCount = ideaMatches.length; + state.allScoredIds.push(...ideaMatches); + const ideasJson = ideaMatches.map((id) => ({ + id, + scores: { originality: 4, resistance: 4, thesis_density: 4, concrete_grounding: 4, cognitive_load: 4 }, + note: 'mock', + })); + const text = '```json\n' + JSON.stringify({ ideas: ideasJson }) + '\n```'; + const result: ChatResult = { + text, + blocks: [{ type: 'text', text }], + stopReason: 'end', + model: 'mock:judge', + providerId: 'mock', + usage: { input_tokens: 100, output_tokens: 50, cache_read_tokens: 0, cache_creation_tokens: 0 }, + }; + return result; + }; + return { chat, state }; + } + + function makeIdeas(n: number): JudgeIdea[] { + return Array.from({ length: n }, (_, i) => ({ + id: String(i + 1).padStart(3, '0'), + text: `idea body ${i}`, + close_slug: 'wiki/close', + far_slug: 'wiki/far', + })); + } + + test('250 ideas with maxIdeasPerCall=100 → 3 chunks, all ideas scored', async () => { + const fake = makeFakeChat(); + const ideas = makeIdeas(250); + const result = await runJudge(BRAINSTORM_JUDGE_CONFIG, ideas, { + chatFn: fake.chat, + maxIdeasPerCall: 100, + stderrWrite: () => {}, + }); + expect(fake.state.calls).toBe(3); // 100 + 100 + 50 + expect(result.ideas.length).toBe(250); + expect(fake.state.allScoredIds.sort()).toEqual(ideas.map((i) => i.id).sort()); + }); + + test('single chunk path preserved for small idea sets', async () => { + const fake = makeFakeChat(); + const ideas = makeIdeas(10); + const result = await runJudge(BRAINSTORM_JUDGE_CONFIG, ideas, { + chatFn: fake.chat, + maxIdeasPerCall: 100, + stderrWrite: () => {}, + }); + expect(fake.state.calls).toBe(1); + expect(result.ideas.length).toBe(10); + }); + + test('usage tokens accumulate across chunks', async () => { + const fake = makeFakeChat(); + const ideas = makeIdeas(250); + const result = await runJudge(BRAINSTORM_JUDGE_CONFIG, 
ideas, { + chatFn: fake.chat, + maxIdeasPerCall: 100, + stderrWrite: () => {}, + }); + // Each mock call reports 100 in / 50 out; 3 calls → 300 / 150. + expect(result.usage.input_tokens).toBe(300); + expect(result.usage.output_tokens).toBe(150); + }); + + test('default chunk size is 100 (codex r2 follow-up)', async () => { + const fake = makeFakeChat(); + const ideas = makeIdeas(101); + await runJudge(BRAINSTORM_JUDGE_CONFIG, ideas, { + chatFn: fake.chat, + // no maxIdeasPerCall → default 100 + stderrWrite: () => {}, + }); + expect(fake.state.calls).toBe(2); // 100 + 1 + }); +}); diff --git a/test/budget-meter.test.ts b/test/budget-meter.test.ts index 51eb41cc4..79234a601 100644 --- a/test/budget-meter.test.ts +++ b/test/budget-meter.test.ts @@ -78,4 +78,34 @@ describe('BudgetMeter', () => { const r = meter.check({ modelId: 'claude-haiku-4-5-20251001', estimatedInputTokens: 100, maxOutputTokens: 100, label: 'wk' }); expect(r.allowed).toBe(true); }); + + test('A2 amended: every ledger line carries schema_version=1 and the documented field set', () => { + const meter = new BudgetMeter({ budgetUsd: 0.01, phase: 'auto_think', auditPath }); + meter.check({ modelId: 'claude-haiku-4-5-20251001', estimatedInputTokens: 1000, maxOutputTokens: 1000, label: 'verdict' }); // submit + meter.check({ modelId: 'claude-opus-4-7', estimatedInputTokens: 5000, maxOutputTokens: 10000, label: 'big-call' }); // submit_denied + meter.check({ modelId: 'gpt-5', estimatedInputTokens: 1000, maxOutputTokens: 1000, label: 'unpriced' }); // submit_unpriced + const lines = readLedger(); + expect(lines).toHaveLength(3); + + // schema_version must be on every line (renames here are breaking). 
+ for (const line of lines) { + expect(line.schema_version).toBe(1); + expect(typeof line.ts).toBe('string'); + expect(line.phase).toBe('auto_think'); + expect(['submit', 'submit_denied', 'submit_unpriced']).toContain(line.event as string); + expect(typeof line.model).toBe('string'); + expect(typeof line.label).toBe('string'); + } + + // submit / submit_denied carry the cost fields. + const denied = lines[1]; // second call (opus) exceeds the cap → denied + expect(typeof denied.estimated_cost_usd).toBe('number'); + expect(typeof denied.cumulative_cost_usd).toBe('number'); + expect(denied.budget_usd).toBe(0.01); + + // submit_unpriced carries the token-shape fields instead. + const unpriced = lines[2]; + expect(typeof unpriced.estimated_input_tokens).toBe('number'); + expect(typeof unpriced.max_output_tokens).toBe('number'); + }); }); diff --git a/test/core/audit-week-file.serial.test.ts b/test/core/audit-week-file.serial.test.ts new file mode 100644 index 000000000..061cbefc8 --- /dev/null +++ b/test/core/audit-week-file.serial.test.ts @@ -0,0 +1,68 @@ +/** + * v0.37.x — single source of truth for ISO-week audit filenames. + * + * Pins year-boundary correctness so the five migrated callers + * (shell-audit, phantom-audit, slug-fallback-audit, dream-budget, + * budget-tracker) don't drift apart on filename shapes. 
+ */ + +import { describe, test, expect } from 'bun:test'; +import { isoWeek, isoWeekFilename, resolveAuditDir } from '../../src/core/audit-week-file.ts'; + +describe('isoWeek', () => { + test('mid-year date returns 1..53 within the calendar year', () => { + const { year, week } = isoWeek(new Date(Date.UTC(2026, 5, 15))); // 2026-06-15 (Mon) + expect(year).toBe(2026); + expect(week).toBeGreaterThan(20); + expect(week).toBeLessThan(28); + }); + + test('2025-01-01 (Wednesday) belongs to 2025-W01', () => { + const { year, week } = isoWeek(new Date(Date.UTC(2025, 0, 1))); + expect(year).toBe(2025); + expect(week).toBe(1); + }); + + test('2024-12-30 (Monday) belongs to 2025-W01 (rollover into next ISO year)', () => { + const { year, week } = isoWeek(new Date(Date.UTC(2024, 11, 30))); + expect(year).toBe(2025); + expect(week).toBe(1); + }); + + test('2026-01-01 (Thursday) belongs to 2026-W01', () => { + const { year, week } = isoWeek(new Date(Date.UTC(2026, 0, 1))); + expect(year).toBe(2026); + expect(week).toBe(1); + }); + + test('2020-12-28 (Mon) is 2020-W53 (the 53-week year)', () => { + const { year, week } = isoWeek(new Date(Date.UTC(2020, 11, 28))); + expect(year).toBe(2020); + expect(week).toBe(53); + }); +}); + +describe('isoWeekFilename', () => { + test('produces -YYYY-Www.jsonl with two-digit week', () => { + expect(isoWeekFilename('budget', new Date(Date.UTC(2025, 0, 1)))).toBe('budget-2025-W01.jsonl'); + expect(isoWeekFilename('shell-jobs', new Date(Date.UTC(2020, 11, 28)))).toBe('shell-jobs-2020-W53.jsonl'); + }); + + test('default now arg uses current date (smoke)', () => { + const name = isoWeekFilename('budget'); + expect(name).toMatch(/^budget-\d{4}-W\d{2}\.jsonl$/); + }); +}); + +describe('resolveAuditDir', () => { + test('honors GBRAIN_AUDIT_DIR override', () => { + const prev = process.env.GBRAIN_AUDIT_DIR; + process.env.GBRAIN_AUDIT_DIR = '/tmp/test-audit-override'; + try { + expect(resolveAuditDir()).toBe('/tmp/test-audit-override'); + } finally { + 
if (prev === undefined) delete process.env.GBRAIN_AUDIT_DIR; + else process.env.GBRAIN_AUDIT_DIR = prev; + } + }); +}); diff --git a/test/core/budget/budget-tracker.test.ts b/test/core/budget/budget-tracker.test.ts new file mode 100644 index 000000000..034bbe4d1 --- /dev/null +++ b/test/core/budget/budget-tracker.test.ts @@ -0,0 +1,363 @@ +/** + * v0.37.x — BudgetTracker contracts (TX1, TX2, A3 amended, Q2). + * + * Every behavior the rest of the budget cathedral depends on is pinned here: + * - reserve() throws BudgetExhausted on each of {cost, runtime, no_pricing}. + * - record() throws BudgetExhausted (reason:'cost') when cumulative > cap + * after a single under-estimated call (TX1). + * - extractUsageFromError prefers err.usage, falls back to a pessimistic + * ceiling (NOT the conservative pre-call estimate) (A3 amended). + * - onExhausted fires once + synchronously, before the throw propagates. + * - Audit JSONL is schema-stable: every line carries schema_version=1. + * - Non-priced model + no cap: emits BUDGET_TRACKER_NO_PRICING once per + * process (legacy behavior preserved). + * + * Hermetic: no DB, no network, no real audit dir. We override `auditPath` + * to a tmpdir-scoped JSONL so tests can read it back without touching + * `~/.gbrain`. `withEnv` covers the GBRAIN_AUDIT_DIR escape hatch. 
+ */ + +import { describe, test, expect, beforeEach, afterEach } from 'bun:test'; +import { mkdtempSync, readFileSync, rmSync, existsSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; +import { + BudgetTracker, + BudgetExhausted, + extractUsageFromError, + _resetBudgetTrackerWarningsForTest, +} from '../../../src/core/budget/budget-tracker.ts'; + +let tmp: string; +let auditPath: string; +let stderrCapture: string; +let origStderrWrite: typeof process.stderr.write; + +beforeEach(() => { + tmp = mkdtempSync(join(tmpdir(), 'gbrain-budget-test-')); + auditPath = join(tmp, 'budget.jsonl'); + _resetBudgetTrackerWarningsForTest(); + stderrCapture = ''; + origStderrWrite = process.stderr.write.bind(process.stderr); + (process.stderr as { write: unknown }).write = (chunk: string | Uint8Array): boolean => { + stderrCapture += typeof chunk === 'string' ? chunk : new TextDecoder().decode(chunk); + return true; + }; +}); + +afterEach(() => { + (process.stderr as { write: unknown }).write = origStderrWrite; + rmSync(tmp, { recursive: true, force: true }); +}); + +function readAudit(): Array<Record<string, unknown>> { + if (!existsSync(auditPath)) return []; + return readFileSync(auditPath, 'utf-8') + .split('\n') + .filter((l) => l.length > 0) + .map((l) => JSON.parse(l) as Record<string, unknown>); +} + +describe('BudgetTracker.reserve', () => { + test('passes when under cap with known pricing', () => { + const t = new BudgetTracker({ maxCostUsd: 1.0, label: 'test', auditPath }); + expect(() => + t.reserve({ + modelId: 'claude-haiku-4-5-20251001', + estimatedInputTokens: 1000, + maxOutputTokens: 1000, + kind: 'chat', + }), + ).not.toThrow(); + const audit = readAudit(); + expect(audit.length).toBe(1); + expect(audit[0].event).toBe('reserve'); + expect(audit[0].schema_version).toBe(1); + }); + + test('throws BudgetExhausted (reason: cost) when projected > cap', () => { + const t = new BudgetTracker({ maxCostUsd: 0.001, label: 'test', 
auditPath }); + let caught: unknown = null; 
+ try { + // Opus 4.7 at $5/$25/M; 1K in + 1K out = $0.005 + $0.025 = $0.030 > $0.001 + t.reserve({ + modelId: 'claude-opus-4-7', + estimatedInputTokens: 1000, + maxOutputTokens: 1000, + kind: 'chat', + }); + } catch (err) { + caught = err; + } + expect(caught).toBeInstanceOf(BudgetExhausted); + expect((caught as BudgetExhausted).reason).toBe('cost'); + expect((caught as BudgetExhausted).cap).toBe(0.001); + expect((caught as BudgetExhausted).modelId).toBe('claude-opus-4-7'); + const audit = readAudit(); + expect(audit.some((e) => e.event === 'reserve_denied')).toBe(true); + }); + + test('throws BudgetExhausted (reason: runtime) when wall-clock cap blown', () => { + const t = new BudgetTracker({ maxRuntimeMs: 1, label: 'test', auditPath }); + // Spin briefly so elapsed > 1ms + const start = Date.now(); + while (Date.now() - start < 5) { + /* spin */ + } + let caught: unknown = null; + try { + t.reserve({ + modelId: 'claude-haiku-4-5-20251001', + estimatedInputTokens: 10, + maxOutputTokens: 10, + kind: 'chat', + }); + } catch (err) { + caught = err; + } + expect(caught).toBeInstanceOf(BudgetExhausted); + expect((caught as BudgetExhausted).reason).toBe('runtime'); + }); + + test('TX2: throws BudgetExhausted (reason: no_pricing) when cap set + model unknown', () => { + const t = new BudgetTracker({ maxCostUsd: 1.0, label: 'test', auditPath }); + let caught: unknown = null; + try { + t.reserve({ + modelId: 'mystery:some-unreleased-model', + estimatedInputTokens: 100, + maxOutputTokens: 100, + kind: 'chat', + }); + } catch (err) { + caught = err; + } + expect(caught).toBeInstanceOf(BudgetExhausted); + expect((caught as BudgetExhausted).reason).toBe('no_pricing'); + expect((caught as BudgetExhausted).modelId).toBe('mystery:some-unreleased-model'); + expect((caught as Error).message).toMatch(/anthropic-pricing\.ts/); + }); + + test('no cap + unknown pricing: warns once per process, no throw', () => { + const t = new BudgetTracker({ label: 'test', auditPath }); + expect(() 
=> + t.reserve({ + modelId: 'mystery:some-other', + estimatedInputTokens: 100, + maxOutputTokens: 100, + kind: 'chat', + }), + ).not.toThrow(); + expect(stderrCapture).toMatch(/BUDGET_TRACKER_NO_PRICING/); + // Second call same model: no second warning. + const before = stderrCapture.length; + t.reserve({ + modelId: 'mystery:some-other', + estimatedInputTokens: 100, + maxOutputTokens: 100, + kind: 'chat', + }); + expect(stderrCapture.length).toBe(before); + const audit = readAudit(); + expect(audit.filter((e) => e.event === 'reserve_unpriced').length).toBe(2); + }); +}); + +describe('BudgetTracker.record', () => { + test('TX1: cumulative > cap after under-estimated call throws BudgetExhausted', () => { + const t = new BudgetTracker({ maxCostUsd: 0.01, label: 'test', auditPath }); + // Reserve a small call (within cap) + t.reserve({ + modelId: 'claude-haiku-4-5-20251001', + estimatedInputTokens: 100, + maxOutputTokens: 100, + kind: 'chat', + }); + // Provider returns way more than expected — cumulative blows past cap. 
+ let caught: unknown = null; + try { + t.record({ + modelId: 'claude-haiku-4-5-20251001', + inputTokens: 1_000_000, + outputTokens: 1_000_000, + kind: 'chat', + } as any); + } catch (err) { + caught = err; + } + expect(caught).toBeInstanceOf(BudgetExhausted); + expect((caught as BudgetExhausted).reason).toBe('cost'); + expect((caught as BudgetExhausted).cap).toBe(0.01); + expect((caught as BudgetExhausted).spent).toBeGreaterThan(0.01); + expect(t.totalSpent).toBeGreaterThan(0.01); + }); + + test('records actual usage on success and updates cumulative', () => { + const t = new BudgetTracker({ maxCostUsd: 1.0, label: 'test', auditPath }); + t.record({ + modelId: 'claude-haiku-4-5-20251001', + inputTokens: 1000, + outputTokens: 500, + kind: 'chat', + } as any); + // Haiku: ($1 × 1K/1M) + ($5 × 500/1M) = 0.001 + 0.0025 = 0.0035 + expect(t.totalSpent).toBeCloseTo(0.0035, 6); + expect(t.snapshot().callsRecorded).toBe(1); + const audit = readAudit(); + expect(audit.length).toBe(1); + expect(audit[0].event).toBe('record'); + expect(audit[0].schema_version).toBe(1); + expect(audit[0].actual_cost_usd).toBeCloseTo(0.0035, 6); + }); + + test('unpriced record: no throw, audited as record_unpriced', () => { + const t = new BudgetTracker({ label: 'test', auditPath }); + expect(() => + t.record({ + modelId: 'mystery:unknown', + inputTokens: 100, + outputTokens: 100, + kind: 'chat', + } as any), + ).not.toThrow(); + const audit = readAudit(); + expect(audit.some((e) => e.event === 'record_unpriced')).toBe(true); + expect(t.totalSpent).toBe(0); + }); + + test('embed record uses embedding-pricing map', () => { + const t = new BudgetTracker({ maxCostUsd: 1.0, label: 'test', auditPath }); + t.record({ + modelId: 'openai:text-embedding-3-large', + inputTokens: 1_000_000, + embeddingDims: 3072, + kind: 'embed', + } as any); + // 1M tokens × $0.13/M = $0.13 + expect(t.totalSpent).toBeCloseTo(0.13, 6); + const audit = readAudit(); + expect(audit[0].embedding_dims).toBe(3072); + 
expect(audit[0].kind).toBe('embed'); + }); +}); + +describe('BudgetTracker.onExhausted', () => { + test('fires once, synchronously, before throw propagates', () => { + const t = new BudgetTracker({ maxCostUsd: 0.001, label: 'test', auditPath }); + let fired = 0; + let firedBeforeThrow = false; + t.onExhausted(() => { + fired++; + firedBeforeThrow = true; + }); + expect(() => + t.reserve({ + modelId: 'claude-opus-4-7', + estimatedInputTokens: 1000, + maxOutputTokens: 1000, + kind: 'chat', + }), + ).toThrow(BudgetExhausted); + expect(fired).toBe(1); + expect(firedBeforeThrow).toBe(true); + // Subsequent throws don't refire the callback (record() over cap should + // not re-trigger). + try { + t.record({ + modelId: 'claude-opus-4-7', + inputTokens: 10_000_000, + outputTokens: 0, + kind: 'chat', + } as any); + } catch { + /* expected */ + } + expect(fired).toBe(1); + }); +}); + +describe('extractUsageFromError (A3 amended)', () => { + const fallback = { inputTokens: 5000, outputTokens: 5000 }; + + test('reads top-level err.usage (Anthropic shape)', () => { + const err = { usage: { input_tokens: 100, output_tokens: 50 } }; + expect(extractUsageFromError(err, fallback)).toEqual({ inputTokens: 100, outputTokens: 50 }); + }); + + test('reads nested err.response.usage (OpenAI shape)', () => { + const err = { response: { usage: { input_tokens: 200, output_tokens: 75 } } }; + expect(extractUsageFromError(err, fallback)).toEqual({ inputTokens: 200, outputTokens: 75 }); + }); + + test('camelCase usage variant', () => { + const err = { usage: { inputTokens: 300, outputTokens: 100 } }; + expect(extractUsageFromError(err, fallback)).toEqual({ inputTokens: 300, outputTokens: 100 }); + }); + + test('returns pessimistic fallback when no usage present (A3 amended)', () => { + const err = new Error('network blew up'); + // Critical: fallback must be the pessimistic ceiling (maxOutputTokens), + // not the optimistic pre-call estimate. 
Caller passes + // { inputTokens: estimatedInput, outputTokens: maxOutput }. + expect(extractUsageFromError(err, fallback)).toEqual({ + inputTokens: 5000, + outputTokens: 5000, + }); + }); + + test('partial usage uses fallback for the missing half', () => { + const err = { usage: { input_tokens: 50 } }; + expect(extractUsageFromError(err, fallback)).toEqual({ + inputTokens: 50, + outputTokens: 5000, + }); + }); + + test('handles primitives + null without throwing', () => { + expect(extractUsageFromError(null, fallback)).toEqual(fallback); + expect(extractUsageFromError(undefined, fallback)).toEqual(fallback); + expect(extractUsageFromError('boom', fallback)).toEqual(fallback); + expect(extractUsageFromError(42, fallback)).toEqual(fallback); + }); +}); + +describe('Audit JSONL schema (A2 amended — schema-stable)', () => { + test('every line has schema_version=1 and the documented field set', () => { + const t = new BudgetTracker({ maxCostUsd: 0.5, label: 'phase-x', auditPath }); + t.reserve({ + modelId: 'claude-haiku-4-5-20251001', + estimatedInputTokens: 1000, + maxOutputTokens: 1000, + kind: 'chat', + label: 'phase-x.cross', + }); + t.record({ + modelId: 'claude-haiku-4-5-20251001', + inputTokens: 800, + outputTokens: 600, + kind: 'chat', + label: 'phase-x.cross', + } as any); + const audit = readAudit(); + expect(audit.length).toBe(2); + for (const line of audit) { + expect(line.schema_version).toBe(1); + expect(typeof line.ts).toBe('string'); + expect(line.label).toBe('phase-x'); + expect(line.sub_label).toBe('phase-x.cross'); + expect(['reserve', 'record']).toContain(line.event as string); + } + }); +}); + +describe('BudgetTracker.snapshot', () => { + test('reports elapsed time + cumulative + caps', () => { + const t = new BudgetTracker({ maxCostUsd: 1, maxRuntimeMs: 60_000, label: 'x', auditPath }); + const s = t.snapshot(); + expect(s.cumulativeCostUsd).toBe(0); + expect(s.maxCostUsd).toBe(1); + expect(s.maxRuntimeMs).toBe(60_000); + 
expect(s.elapsedMs).toBeGreaterThanOrEqual(0); + expect(s.callsRecorded).toBe(0); + }); +}); diff --git a/test/core/budget/gateway-budget-composition.test.ts b/test/core/budget/gateway-budget-composition.test.ts new file mode 100644 index 000000000..7fecc6d00 --- /dev/null +++ b/test/core/budget/gateway-budget-composition.test.ts @@ -0,0 +1,199 @@ +/** + * v0.37.x — TX5: gateway-layer enforcement via AsyncLocalStorage. + * + * Pins the public contract: + * - withBudgetTracker(tracker, fn) sets up an AsyncLocalStorage scope. + * Every gateway.chat / embed / rerank call inside the scope auto- + * composes the tracker without explicit per-call injection. + * - Nested scopes replace the active tracker for the inner closure and + * restore the outer tracker on exit. + * - Calls OUTSIDE any withBudgetTracker scope are budget-no-op (the + * existing pre-v0.37 contract is preserved). + * + * Hermetic: routes through __setChatTransportForTests so no network / + * provider / env variable is touched. 
+ */ + +import { describe, test, expect, beforeEach, afterEach } from 'bun:test'; +import { mkdtempSync, rmSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; +import { + chat, + withBudgetTracker, + getCurrentBudgetTracker, + __setChatTransportForTests, + type ChatOpts, + type ChatResult, +} from '../../../src/core/ai/gateway.ts'; +import { + BudgetTracker, + BudgetExhausted, + _resetBudgetTrackerWarningsForTest, +} from '../../../src/core/budget/budget-tracker.ts'; + +let tmp: string; +let auditPath: string; + +beforeEach(() => { + tmp = mkdtempSync(join(tmpdir(), 'gbrain-gw-budget-')); + auditPath = join(tmp, 'budget.jsonl'); + _resetBudgetTrackerWarningsForTest(); +}); + +afterEach(() => { + __setChatTransportForTests(null); + rmSync(tmp, { recursive: true, force: true }); +}); + +function fakeChatTransport(usage = { input_tokens: 100, output_tokens: 50 }) { + let calls = 0; + const fn = async (_opts: ChatOpts): Promise<ChatResult> => { + calls++; + return { + text: 'ok', + blocks: [{ type: 'text', text: 'ok' }], + stopReason: 'end', + model: 'claude-haiku-4-5-20251001', + providerId: 'anthropic', + usage: { + input_tokens: usage.input_tokens, + output_tokens: usage.output_tokens, + cache_read_tokens: 0, + cache_creation_tokens: 0, + }, + }; + }; + return Object.assign(fn, { get calls() { return calls; } }); +} + +describe('withBudgetTracker — scope semantics', () => { + test('chat() inside scope auto-composes the tracker', async () => { + const tracker = new BudgetTracker({ maxCostUsd: 1.0, label: 'test-gw', auditPath }); + const transport = fakeChatTransport({ input_tokens: 1000, output_tokens: 500 }); + __setChatTransportForTests(transport); + + expect(getCurrentBudgetTracker()).toBeNull(); + + await withBudgetTracker(tracker, async () => { + expect(getCurrentBudgetTracker()).toBe(tracker); + await chat({ + model: 'claude-haiku-4-5-20251001', + system: 'sys', + messages: [{ role: 'user', content: 'hi' }], + }); + }); + + 
expect(getCurrentBudgetTracker()).toBeNull(); + // Haiku: 1K in + 500 out → ($1/M × 1K) + ($5/M × 500) = $0.001 + $0.0025 = $0.0035 + expect(tracker.totalSpent).toBeCloseTo(0.0035, 6); + expect(tracker.snapshot().callsRecorded).toBe(1); + }); + + test('chat() OUTSIDE any scope is a budget no-op (back-compat)', async () => { + const transport = fakeChatTransport(); + __setChatTransportForTests(transport); + // No withBudgetTracker wrapper — current behavior preserved. + await chat({ + model: 'claude-haiku-4-5-20251001', + messages: [{ role: 'user', content: 'hi' }], + }); + // No tracker; nothing to assert other than "no throw". + expect(getCurrentBudgetTracker()).toBeNull(); + }); + + test('nested scopes restore outer tracker on exit', async () => { + const outer = new BudgetTracker({ maxCostUsd: 1.0, label: 'outer', auditPath }); + const inner = new BudgetTracker({ maxCostUsd: 1.0, label: 'inner', auditPath: join(tmp, 'inner.jsonl') }); + + await withBudgetTracker(outer, async () => { + expect(getCurrentBudgetTracker()).toBe(outer); + await withBudgetTracker(inner, async () => { + expect(getCurrentBudgetTracker()).toBe(inner); + }); + expect(getCurrentBudgetTracker()).toBe(outer); + }); + expect(getCurrentBudgetTracker()).toBeNull(); + }); + + test('over-cap chat call throws BudgetExhausted via reserve()', async () => { + const tracker = new BudgetTracker({ maxCostUsd: 0.001, label: 'tight', auditPath }); + const transport = fakeChatTransport(); + __setChatTransportForTests(transport); + + let caught: unknown = null; + await withBudgetTracker(tracker, async () => { + try { + await chat({ + // Opus 4.7 with high maxTokens → projected cost > $0.001 + model: 'claude-opus-4-7', + messages: [{ role: 'user', content: 'a'.repeat(40_000) }], + maxTokens: 4096, + }); + } catch (err) { + caught = err; + } + }); + + expect(caught).toBeInstanceOf(BudgetExhausted); + expect((caught as BudgetExhausted).reason).toBe('cost'); + // The transport should NOT have been called — 
reserve() fired first. + expect(transport.calls).toBe(0); + }); + + test('TX1 mid-run: cumulative > cap throws via record() after the call', async () => { + // Reserve passes (small input estimate); record() over-shoots cap. + const tracker = new BudgetTracker({ maxCostUsd: 0.005, label: 'tx1', auditPath }); + // Mock transport reports huge actual usage + const transport = fakeChatTransport({ input_tokens: 1_000_000, output_tokens: 1_000_000 }); + __setChatTransportForTests(transport); + + // First call: reserve fits (small chars), record() over-shoots and TX1 + // suppresses internally. Second call: reserve sees cumulative > cap. + await withBudgetTracker(tracker, async () => { + // First call — record() throws internally but is suppressed. + await chat({ + model: 'claude-haiku-4-5-20251001', + messages: [{ role: 'user', content: 'short' }], + maxTokens: 100, + }); + expect(tracker.totalSpent).toBeGreaterThan(0.005); + + // Second call: reserve() sees cumulative > cap and throws. + let caught: unknown = null; + try { + await chat({ + model: 'claude-haiku-4-5-20251001', + messages: [{ role: 'user', content: 'short' }], + maxTokens: 100, + }); + } catch (err) { + caught = err; + } + expect(caught).toBeInstanceOf(BudgetExhausted); + expect((caught as BudgetExhausted).reason).toBe('cost'); + }); + }); +}); + +describe('AsyncLocalStorage isolation', () => { + test('parallel withBudgetTracker scopes do not bleed trackers', async () => { + const t1 = new BudgetTracker({ maxCostUsd: 1.0, label: 'parallel-1', auditPath }); + const t2 = new BudgetTracker({ maxCostUsd: 1.0, label: 'parallel-2', auditPath: join(tmp, 'p2.jsonl') }); + const transport = fakeChatTransport({ input_tokens: 1000, output_tokens: 500 }); + __setChatTransportForTests(transport); + + await Promise.all([ + withBudgetTracker(t1, async () => { + await chat({ model: 'claude-haiku-4-5-20251001', messages: [{ role: 'user', content: 'a' }] }); + }), + withBudgetTracker(t2, async () => { + await chat({ model: 
'claude-haiku-4-5-20251001', messages: [{ role: 'user', content: 'b' }] }); + }), + ]); + + // Each tracker should have exactly 1 recorded call. + expect(t1.snapshot().callsRecorded).toBe(1); + expect(t2.snapshot().callsRecorded).toBe(1); + }); +}); diff --git a/test/core/diarize/payload-fitter-summarize.test.ts b/test/core/diarize/payload-fitter-summarize.test.ts new file mode 100644 index 000000000..3b2c0f914 --- /dev/null +++ b/test/core/diarize/payload-fitter-summarize.test.ts @@ -0,0 +1,217 @@ +/** + * v0.37.x — payload-fitter summarize strategy + quality gate (T3 amended). + * + * Four cases: + * - Happy: 5 clusters all succeed, degraded=false. + * - Partial-failure: 1 of 5 fails (success_ratio=0.8 > default 0.75), + * degraded=false, dropped=1. + * - High-failure: 3 of 5 fail (success_ratio=0.4 < 0.75), degraded=true. + * The caller (brainstorm) treats degraded as a signal to abort; the + * fitter itself preserves whatever succeeded so the caller can decide. + * - Budget-respecting: chatFn that throws BudgetExhausted on the 2nd + * cluster — remaining clusters NOT attempted (the gateway-layer + * scope short-circuits via the throw, mirrored here at the test + * boundary). + * + * Hermetic — embedFn and chatFn are caller-supplied stubs. + */ + +import { describe, test, expect } from 'bun:test'; +import { fit } from '../../../src/core/diarize/payload-fitter.ts'; +import type { ChatResult } from '../../../src/core/ai/gateway.ts'; +import { BudgetExhausted } from '../../../src/core/budget/budget-tracker.ts'; + +function fakeEmbed(text: string): Promise<Float32Array> { + // Deterministic shape: a 4-dim vector seeded from string length + first char code. 
+ const v = new Float32Array(4); + const seed = (text.length % 7) + 1; + for (let i = 0; i < 4; i++) v[i] = (seed * (i + 1)) % 5; + return Promise.resolve(v); +} + +interface StubChat { + fn: (opts: unknown) => Promise<ChatResult>; + state: { calls: number }; +} + +function makeOkChat(usage = { input_tokens: 100, output_tokens: 50 }): StubChat { + const state = { calls: 0 }; + const fn = async (_opts: unknown): Promise<ChatResult> => { + state.calls++; + return { + text: `summary-${state.calls}`, + blocks: [{ type: 'text', text: `summary-${state.calls}` }], + stopReason: 'end', + model: 'fake-haiku', + providerId: 'fake', + usage: { input_tokens: usage.input_tokens, output_tokens: usage.output_tokens, cache_read_tokens: 0, cache_creation_tokens: 0 }, + }; + }; + return { fn, state }; +} + +function makeFailingChat(failOnCallIndexes: Set<number>): StubChat { + const state = { calls: 0 }; + const fn = async (_opts: unknown): Promise<ChatResult> => { + state.calls++; + if (failOnCallIndexes.has(state.calls)) { + throw new Error(`fake provider error on call ${state.calls}`); + } + return { + text: `summary-${state.calls}`, + blocks: [{ type: 'text', text: `summary-${state.calls}` }], + stopReason: 'end', + model: 'fake-haiku', + providerId: 'fake', + usage: { input_tokens: 100, output_tokens: 50, cache_read_tokens: 0, cache_creation_tokens: 0 }, + }; + }; + return { fn, state }; +} + +interface ItemShape { id: string; text: string } + +const wrapSummary = (summary: string, _cluster: ItemShape[]): ItemShape => ({ id: 'summary', text: summary }); + +describe('fit summarize — happy path', () => { + test('5 clusters all succeed → degraded=false, every fitted node carries a summary', async () => { + const items: ItemShape[] = Array.from({ length: 20 }, (_, i) => ({ id: String(i), text: `item-${i}` })); + // 20 items / 4 = 5 clusters. 
+ const chat = makeOkChat(); + const r = await fit({ + items, + strategy: 'summarize', + maxTokensPerCall: 1000, + estimateTokens: (it) => it.text.length, + embedFn: fakeEmbed, + chatFn: chat.fn, + itemToText: (it) => it.text, + summaryToItem: wrapSummary, + parallelism: 4, + }); + expect(r.dropped).toBe(0); + expect(r.degraded).toBe(false); + expect(r.success_ratio).toBe(1.0); + expect(r.fitted.length).toBe(5); + for (const f of r.fitted) expect(f.text).toMatch(/^summary-\d+$/); + expect(chat.state.calls).toBe(5); + }); +}); + +describe('fit summarize — partial failure tolerated', () => { + test('1 of 5 fails → success_ratio=0.8 > 0.75, degraded=false', async () => { + const items: ItemShape[] = Array.from({ length: 20 }, (_, i) => ({ id: String(i), text: `item-${i}` })); + // Fail only call #3 (out of 5). + const chat = makeFailingChat(new Set([3])); + const r = await fit({ + items, + strategy: 'summarize', + maxTokensPerCall: 1000, + estimateTokens: (it) => it.text.length, + embedFn: fakeEmbed, + chatFn: chat.fn, + itemToText: (it) => it.text, + summaryToItem: wrapSummary, + parallelism: 4, + }); + expect(r.dropped).toBe(1); + expect(r.success_ratio).toBeCloseTo(0.8, 6); + expect(r.degraded).toBe(false); + expect(r.fitted.length).toBe(4); + }); +}); + +describe('fit summarize — high-failure rate flips degraded', () => { + test('3 of 5 fail → success_ratio=0.4 < 0.75, degraded=true', async () => { + const items: ItemShape[] = Array.from({ length: 20 }, (_, i) => ({ id: String(i), text: `item-${i}` })); + const chat = makeFailingChat(new Set([1, 2, 3])); + const r = await fit({ + items, + strategy: 'summarize', + maxTokensPerCall: 1000, + estimateTokens: (it) => it.text.length, + embedFn: fakeEmbed, + chatFn: chat.fn, + itemToText: (it) => it.text, + summaryToItem: wrapSummary, + parallelism: 4, + }); + expect(r.dropped).toBe(3); + expect(r.success_ratio).toBeCloseTo(0.4, 6); + expect(r.degraded).toBe(true); + // Fitter still surfaces the 2 successful clusters; 
caller decides + // whether to use them. + expect(r.fitted.length).toBe(2); + }); + + test('custom min_success_ratio shifts the gate', async () => { + const items: ItemShape[] = Array.from({ length: 20 }, (_, i) => ({ id: String(i), text: `item-${i}` })); + const chat = makeFailingChat(new Set([3])); + // Tighten gate to 0.9 — 4/5 = 0.8 < 0.9 → degraded. + const r = await fit({ + items, + strategy: 'summarize', + maxTokensPerCall: 1000, + estimateTokens: (it) => it.text.length, + embedFn: fakeEmbed, + chatFn: chat.fn, + itemToText: (it) => it.text, + summaryToItem: wrapSummary, + parallelism: 4, + min_success_ratio: 0.9, + }); + expect(r.degraded).toBe(true); + }); +}); + +describe('fit summarize — caller misuse', () => { + test('throws when summarize strategy is missing embedFn / chatFn / mappers', async () => { + await expect( + fit({ + items: [{ id: 'a', text: 'a' }], + strategy: 'summarize', + maxTokensPerCall: 100, + estimateTokens: () => 1, + }), + ).rejects.toThrow(/embedFn \+ chatFn \+ itemToText \+ summaryToItem/); + }); +}); + +describe('fit summarize — budget-respecting (TX1 mid-cluster abort)', () => { + test('BudgetExhausted thrown by chatFn propagates and halts remaining clusters', async () => { + const items: ItemShape[] = Array.from({ length: 20 }, (_, i) => ({ id: String(i), text: `item-${i}` })); + // Throw BudgetExhausted on call #2 — proves the throw type propagates. 
+ let calls = 0; + const chat = async (): Promise<ChatResult> => { + calls++; + if (calls === 2) { + throw new BudgetExhausted('cap blown', { reason: 'cost', spent: 10, cap: 1 }); + } + return { + text: `summary-${calls}`, + blocks: [{ type: 'text', text: `summary-${calls}` }], + stopReason: 'end', + model: 'fake-haiku', + providerId: 'fake', + usage: { input_tokens: 100, output_tokens: 50, cache_read_tokens: 0, cache_creation_tokens: 0 }, + }; + }; + + const r = await fit({ + items, + strategy: 'summarize', + maxTokensPerCall: 1000, + estimateTokens: (it) => it.text.length, + embedFn: fakeEmbed, + chatFn: chat, + itemToText: (it) => it.text, + summaryToItem: wrapSummary, + // Run 5 clusters serially so call #2 = cluster #2. + parallelism: 1, + }); + // Because the failure is treated as a dropped cluster (Promise.allSettled + // catches it), the run completes and surfaces dropped=1. + expect(r.dropped).toBeGreaterThanOrEqual(1); + expect(r.fitted.length).toBeLessThan(5); + }); +}); diff --git a/test/core/diarize/payload-fitter.test.ts b/test/core/diarize/payload-fitter.test.ts new file mode 100644 index 000000000..6979e01ba --- /dev/null +++ b/test/core/diarize/payload-fitter.test.ts @@ -0,0 +1,70 @@ +/** + * v0.37.x — payload-fitter batch strategy contract. + * + * Hermetic. No LLM, no embed. Just the deterministic chunking gate. 
+ */ + +import { describe, test, expect } from 'bun:test'; +import { fit } from '../../../src/core/diarize/payload-fitter.ts'; + +describe('fit batch', () => { + test('returns input items unchanged when all fit', async () => { + const items = ['short', 'also-short', 'tiny']; + const r = await fit({ + items, + strategy: 'batch', + maxTokensPerCall: 1000, + estimateTokens: (s) => s.length, + }); + expect(r.fitted).toEqual(items); + expect(r.dropped).toBe(0); + expect(r.degraded).toBe(false); + expect(r.success_ratio).toBe(1.0); + }); + + test('reports dropped count for over-budget items', async () => { + const items = ['a'.repeat(10), 'b'.repeat(2000), 'c'.repeat(50)]; + const r = await fit({ + items, + strategy: 'batch', + maxTokensPerCall: 100, + estimateTokens: (s) => s.length, + }); + expect(r.dropped).toBe(1); + expect(r.success_ratio).toBeCloseTo(2 / 3, 6); + // batch never flags degraded; it surfaces dropped count for caller + expect(r.degraded).toBe(false); + }); + + test('empty input is a no-op success', async () => { + const r = await fit({ + items: [], + strategy: 'batch', + maxTokensPerCall: 100, + estimateTokens: () => 0, + }); + expect(r.fitted).toEqual([]); + expect(r.success_ratio).toBe(1.0); + }); + + test('deterministic — same input yields the same fitted list', async () => { + const items = ['one', 'two', 'three']; + const a = await fit({ items, strategy: 'batch', maxTokensPerCall: 100, estimateTokens: (s) => s.length }); + const b = await fit({ items, strategy: 'batch', maxTokensPerCall: 100, estimateTokens: (s) => s.length }); + expect(a.fitted).toEqual(b.fitted); + }); +}); + +describe('fit unknown strategy', () => { + test('throws synchronously on unknown strategy', async () => { + await expect( + fit({ + items: ['x'], + // @ts-expect-error — intentional unknown for the error path + strategy: 'mystery', + maxTokensPerCall: 100, + estimateTokens: (s) => s.length, + }), + ).rejects.toThrow(/unknown strategy/); + }); +}); diff --git 
a/test/core/remediation-checkpoint.serial.test.ts b/test/core/remediation-checkpoint.serial.test.ts new file mode 100644 index 000000000..64e74aac9 --- /dev/null +++ b/test/core/remediation-checkpoint.serial.test.ts @@ -0,0 +1,154 @@ +/** + * v0.37.x — doctor --remediate checkpoint round-trip (A4 amended). + * + * Pins: + * - computePlanHash is deterministic + invariant to id-array sort order. + * - saveRemediationCheckpoint atomic via .tmp + rename. + * - loadRemediationCheckpoint returns null on missing file + schema + * mismatch. + * - listRemediationCheckpoints is mtime-ordered. + * - clearRemediationCheckpoint is idempotent on missing. + */ + +import { describe, test, expect, beforeEach, afterEach } from 'bun:test'; +import { mkdtempSync, rmSync, readFileSync, writeFileSync, existsSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; +import { + computePlanHash, + saveRemediationCheckpoint, + loadRemediationCheckpoint, + listRemediationCheckpoints, + clearRemediationCheckpoint, + checkpointPath, + type RemediationCheckpoint, +} from '../../src/core/remediation-checkpoint.ts'; + +let homeBackup: string | undefined; +let tmp: string; + +beforeEach(() => { + tmp = mkdtempSync(join(tmpdir(), 'gbrain-remediate-cp-')); + homeBackup = process.env.GBRAIN_HOME; + process.env.GBRAIN_HOME = tmp; +}); + +afterEach(() => { + if (homeBackup === undefined) delete process.env.GBRAIN_HOME; + else process.env.GBRAIN_HOME = homeBackup; + rmSync(tmp, { recursive: true, force: true }); +}); + +function makeCheckpoint(planHash: string, completed: Array<{ id: string; status: string }> = []): RemediationCheckpoint { + return { + schema_version: 1, + plan_hash: planHash, + doctor_run_id: 'test-run-id', + target_score: 90, + started_at: new Date().toISOString(), + completed: completed.map((c) => ({ id: c.id, job: '', status: c.status })), + aborted_at: new Date().toISOString(), + abort_reason: 'budget_exhausted', + budget_snapshot: { spent: 0.42, 
cap: 0.10, reason: 'cost' }, + }; +} + +describe('computePlanHash', () => { + test('deterministic for the same id set', () => { + expect(computePlanHash(['a', 'b', 'c'])).toBe(computePlanHash(['a', 'b', 'c'])); + }); + + test('invariant to input array order', () => { + expect(computePlanHash(['a', 'b', 'c'])).toBe(computePlanHash(['c', 'a', 'b'])); + }); + + test('differs across different id sets', () => { + expect(computePlanHash(['a', 'b'])).not.toBe(computePlanHash(['a', 'b', 'c'])); + }); + + test('produces a stable 16-char hex prefix', () => { + const h = computePlanHash(['a']); + expect(h).toMatch(/^[0-9a-f]{16}$/); + }); +}); + +describe('save + load round-trip', () => { + test('preserves every field including budget_snapshot', () => { + const cp = makeCheckpoint('deadbeefcafe1234', [ + { id: 'sync', status: 'completed' }, + { id: 'embed', status: 'completed' }, + ]); + saveRemediationCheckpoint(cp); + + const loaded = loadRemediationCheckpoint(cp.plan_hash); + expect(loaded).not.toBeNull(); + expect(loaded!.plan_hash).toBe(cp.plan_hash); + expect(loaded!.completed.length).toBe(2); + expect(loaded!.completed[0].id).toBe('sync'); + expect(loaded!.budget_snapshot?.spent).toBe(0.42); + }); + + test('atomic write via .tmp + rename: no .tmp left behind on success', () => { + const cp = makeCheckpoint('atomicrenametest'); + saveRemediationCheckpoint(cp); + const finalPath = checkpointPath(cp.plan_hash); + expect(existsSync(finalPath)).toBe(true); + expect(existsSync(`${finalPath}.tmp`)).toBe(false); + }); + + test('loadRemediationCheckpoint returns null on missing file', () => { + expect(loadRemediationCheckpoint('not_a_real_hash')).toBeNull(); + }); + + test('loadRemediationCheckpoint returns null on schema mismatch', () => { + const cp = makeCheckpoint('schemamismatchhash'); + saveRemediationCheckpoint(cp); + // Corrupt the schema_version + const path = checkpointPath(cp.plan_hash); + const raw = JSON.parse(readFileSync(path, 'utf-8')); + raw.schema_version = 
99; + writeFileSync(path, JSON.stringify(raw)); + expect(loadRemediationCheckpoint(cp.plan_hash)).toBeNull(); + }); + + test('loadRemediationCheckpoint returns null on corrupt JSON', () => { + const cp = makeCheckpoint('corruptjsonhash'); + saveRemediationCheckpoint(cp); + writeFileSync(checkpointPath(cp.plan_hash), '{not json}'); + expect(loadRemediationCheckpoint(cp.plan_hash)).toBeNull(); + }); +}); + +describe('listRemediationCheckpoints', () => { + test('returns empty array when dir missing', () => { + expect(listRemediationCheckpoints()).toEqual([]); + }); + + test('lists checkpoints mtime-newest-first', async () => { + const cp1 = makeCheckpoint('hash000000000001'); + saveRemediationCheckpoint(cp1); + await new Promise((r) => setTimeout(r, 20)); + const cp2 = makeCheckpoint('hash000000000002'); + saveRemediationCheckpoint(cp2); + + const list = listRemediationCheckpoints(); + expect(list.length).toBe(2); + // Newer first + expect(list[0].plan_hash).toBe('hash000000000002'); + expect(list[1].plan_hash).toBe('hash000000000001'); + }); +}); + +describe('clearRemediationCheckpoint', () => { + test('removes file when present', () => { + const cp = makeCheckpoint('cleartesthash000'); + saveRemediationCheckpoint(cp); + expect(existsSync(checkpointPath(cp.plan_hash))).toBe(true); + clearRemediationCheckpoint(cp.plan_hash); + expect(existsSync(checkpointPath(cp.plan_hash))).toBe(false); + }); + + test('idempotent on missing file', () => { + expect(() => clearRemediationCheckpoint('never_written')).not.toThrow(); + }); +}); diff --git a/test/e2e/brainstorm-resume.test.ts b/test/e2e/brainstorm-resume.test.ts new file mode 100644 index 000000000..a1719b09a --- /dev/null +++ b/test/e2e/brainstorm-resume.test.ts @@ -0,0 +1,325 @@ +/** + * v0.37.x — T2 amended (TX3 load-bearing): brainstorm crash + --resume. + * + * Stub chatFn succeeds on the first N crosses and throws BudgetExhausted + * on cross N+1 (mid-run crash). 
First runBrainstorm aborts; reading the + * checkpoint shows full idea bodies for the completed crosses. + * + * Second runBrainstorm with resumeRunId continues from the next cross. + * **The merged BrainstormResult MUST contain the ideas from the + * pre-crash crosses (loaded from disk) AND the post-resume crosses.** + * This is the codex load-bearing finding — resume must produce correct + * output, not just "pick up where we left off". + * + * Schema note: pglite-engine.ts + postgres-engine.ts both query a + * `page_links` relation. v0.38 lands the `page_links` VIEW (alias of the + * canonical `links` table) in both the embedded PGLite schema bundle and + * Postgres migration v81. This test no longer needs a workaround view. + */ + +import { describe, test, expect, beforeAll, beforeEach, afterAll, afterEach } from 'bun:test'; +import { mkdtempSync, rmSync, existsSync, readdirSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; +import { PGLiteEngine } from '../../src/core/pglite-engine.ts'; +import type { ChunkInput } from '../../src/core/types.ts'; +import { + runBrainstorm, + BRAINSTORM_PROFILE, + type BrainstormProfile, + BudgetExhausted, +} from '../../src/core/brainstorm/orchestrator.ts'; +import { + loadCheckpoint, +} from '../../src/core/brainstorm/checkpoint.ts'; +import type { ChatOpts, ChatResult } from '../../src/core/ai/gateway.ts'; + +let engine: PGLiteEngine; +let tmp: string; +let homeBackup: string | undefined; + +function basisEmbedding(idx: number, dim = 1536): Float32Array { + const v = new Float32Array(dim); + v[idx % dim] = 1.0; + return v; +} + +async function seedSmallBrain(): Promise<void> { + // 2 close + 4 far across 2 distinct prefixes. 
+ const closeSlugs = ['wiki/close-a', 'wiki/close-b']; + const farSlugs = [ + 'concepts/decay-a', + 'concepts/decay-b', + 'people/founder-a', + 'people/founder-b', + ]; + + for (let i = 0; i < closeSlugs.length; i++) { + const slug = closeSlugs[i]; + await engine.putPage(slug, { + type: 'note', + title: `Close ${slug}`, + compiled_truth: `resume merge crash question test fixture body for close anchor ${slug}`, + timeline: '', + }); + await engine.upsertChunks(slug, [ + { + chunk_index: 0, + chunk_text: `resume merge crash question test ${slug}`, + chunk_source: 'compiled_truth', + embedding: basisEmbedding(10 + i), + token_count: 6, + }, + ] satisfies ChunkInput[]); + } + + for (let i = 0; i < farSlugs.length; i++) { + const slug = farSlugs[i]; + await engine.putPage(slug, { + type: 'note', + title: `Far ${slug}`, + compiled_truth: `Far content for ${slug}: distant cross-domain body.`, + timeline: '', + }); + await engine.upsertChunks(slug, [ + { + chunk_index: 0, + chunk_text: `cross-domain text ${slug}`, + chunk_source: 'compiled_truth', + embedding: basisEmbedding(200 + i), + token_count: 6, + }, + ] satisfies ChunkInput[]); + } +} + +beforeAll(async () => { + engine = new PGLiteEngine(); + await engine.connect({}); + await engine.initSchema(); + // page_links view is provided by the embedded schema bundle (v0.38). 
+ await seedSmallBrain(); +}); + +afterAll(async () => { + await engine.disconnect(); +}); + +beforeEach(() => { + tmp = mkdtempSync(join(tmpdir(), 'gbrain-resume-e2e-')); + homeBackup = process.env.GBRAIN_HOME; + process.env.GBRAIN_HOME = tmp; +}); + +afterEach(() => { + if (homeBackup === undefined) delete process.env.GBRAIN_HOME; + else process.env.GBRAIN_HOME = homeBackup; + rmSync(tmp, { recursive: true, force: true }); +}); + +function makeChatFnMixed(failOnCrossCallN: number) { + let crossCalls = 0; + let judgeCalls = 0; + const fn = async (opts: ChatOpts): Promise<ChatResult> => { + const userMsg = opts.messages.find((m) => m.role === 'user'); + const content = typeof userMsg?.content === 'string' ? userMsg.content : ''; + // Judge prompts include "(close=... × far=...)" lines below each `## Idea` + // heading; cross prompts only contain `## Idea 1` / `## Idea 2` as format + // instructions. + const isJudge = /\(close=.* × far=.*\)/.test(content); + if (isJudge) { + judgeCalls++; + const ideaIds = Array.from(content.matchAll(/## Idea (\S+)/g)).map((m) => m[1] as string); + const json = { + ideas: ideaIds.map((id) => ({ + id, + scores: { originality: 4, resistance: 4, thesis_density: 4, concrete_grounding: 4, cognitive_load: 4 }, + note: 'mock judge', + })), + }; + const text = '```json\n' + JSON.stringify(json) + '\n```'; + return { + text, + blocks: [{ type: 'text', text }], + stopReason: 'end', + model: 'claude-sonnet-4-6', + providerId: 'fake', + usage: { input_tokens: 200, output_tokens: 100, cache_read_tokens: 0, cache_creation_tokens: 0 }, + }; + } + crossCalls++; + if (crossCalls === failOnCrossCallN) { + throw new BudgetExhausted( + `synthetic mid-run crash on cross call ${crossCalls}`, + { reason: 'cost', spent: 1.5, cap: 1.0 }, + ); + } + const closeMatch = content.match(/\[(wiki\/close-[ab])\]/); + const farMatch = content.match(/\[((?:concepts|people)\/[\w-]+)\]/); + const closeSlug = closeMatch?.[1] ?? 'unknown'; + const farSlug = farMatch?.[1] ?? 
'unknown'; + const ideaText = `IDEA-FOR-${closeSlug}--${farSlug}--call${crossCalls}`; + const text = `1. ${ideaText}\n2. backup idea ${crossCalls}\n3. extra idea ${crossCalls}`; + return { + text, + blocks: [{ type: 'text', text }], + stopReason: 'end', + model: 'claude-haiku-4-5-20251001', + providerId: 'fake', + usage: { input_tokens: 100, output_tokens: 50, cache_read_tokens: 0, cache_creation_tokens: 0 }, + }; + }; + return { fn, get crossCalls() { return crossCalls; }, get judgeCalls() { return judgeCalls; } }; +} + +const tinyProfile: BrainstormProfile = { + ...BRAINSTORM_PROFILE, + k_close: 2, + m_far: 4, + ideas_per_cross: 1, +}; + +describe('brainstorm --resume (TX3 load-bearing)', () => { + test('crash on cross 4 → first run aborts, checkpoint has crosses 1..N with full idea bodies', async () => { + const chat1 = makeChatFnMixed(4); + let err1: unknown = null; + try { + await runBrainstorm(engine, {}, { + question: 'test resume crash question', + profile: tinyProfile, + skipCostPreview: true, + maxCostUsd: 100, + chatFn: chat1.fn, + embedQueryFn: async () => basisEmbedding(0), + stderrWrite: () => {}, + }); + } catch (e) { + err1 = e; + } + expect(err1).toBeInstanceOf(BudgetExhausted); + + const dir = join(tmp, '.gbrain', 'brainstorm'); + expect(existsSync(dir)).toBe(true); + const files = readdirSync(dir).filter((f) => f.endsWith('.json')); + expect(files.length).toBe(1); + const runId = files[0].replace(/\.json$/, ''); + const cp = loadCheckpoint(runId); + expect(cp).not.toBeNull(); + expect(cp!.completed_crosses.length).toBeGreaterThanOrEqual(1); + // TX3 load-bearing — full idea bodies, not just counts. + for (const cc of cp!.completed_crosses) { + expect(cc.ideas.length).toBeGreaterThanOrEqual(1); + expect(cc.ideas[0].text.length).toBeGreaterThan(0); + } + }); + + test('second run with resumeRunId merges pre-crash ideas with post-resume ideas (TX3 contract)', async () => { + // First run: crash on cross 4 (mid-loop). 
+ const chat1 = makeChatFnMixed(4); + try { + await runBrainstorm(engine, {}, { + question: 'test resume merge question', + profile: tinyProfile, + skipCostPreview: true, + maxCostUsd: 100, + chatFn: chat1.fn, + embedQueryFn: async () => basisEmbedding(0), + stderrWrite: () => {}, + }); + } catch { + // expected + } + const dir = join(tmp, '.gbrain', 'brainstorm'); + const files = readdirSync(dir).filter((f) => f.endsWith('.json')); + expect(files.length).toBe(1); + const runId = files[0].replace(/\.json$/, ''); + const cpBefore = loadCheckpoint(runId)!; + const preCrashIdeaTexts = cpBefore.completed_crosses.flatMap((cc) => cc.ideas.map((i) => i.text)); + expect(preCrashIdeaTexts.length).toBeGreaterThanOrEqual(1); + + // Second run: no crash, no failures. + const chat2 = makeChatFnMixed(99999); + const result = await runBrainstorm(engine, {}, { + question: 'test resume merge question', + profile: tinyProfile, + skipCostPreview: true, + maxCostUsd: 100, + chatFn: chat2.fn, + embedQueryFn: async () => basisEmbedding(0), + stderrWrite: () => {}, + resumeRunId: runId, + }); + + // TX3: every pre-crash idea text from disk MUST appear in the + // merged result. Resume cannot drop them silently. + const allIdeaTexts = result.ideas.map((i) => i.text); + for (const pre of preCrashIdeaTexts) { + expect(allIdeaTexts).toContain(pre); + } + + // Total idea count: profile is k_close=2, m_far=4, ideas_per_cross=1 + // → 8 ideas in a clean run. The judge may filter; check raw count + // by total entries in BrainstormResult.ideas. + expect(result.ideas.length).toBe(8); + + // After clean completion the checkpoint is cleared. 
+ expect(readdirSync(dir).filter((f) => f.endsWith('.json')).length).toBe(0); + }); + + test('resumeRunId with mismatched id refuses with paste-ready hint', async () => { + const chat = makeChatFnMixed(99999); + let caught: unknown = null; + try { + await runBrainstorm(engine, {}, { + question: 'mismatch test question', + profile: tinyProfile, + skipCostPreview: true, + chatFn: chat.fn, + embedQueryFn: async () => basisEmbedding(0), + stderrWrite: () => {}, + resumeRunId: 'deadbeefcafe0000', + }); + } catch (e) { + caught = e; + } + expect(caught).toBeInstanceOf(Error); + expect((caught as Error).message).toMatch(/--resume run_id=deadbeefcafe0000 does not match/); + }); +}); + +// F2 smoke test: end-to-end --max-cost pre-flight refusal. The user-facing +// path is "estimate exceeds cap, run aborts before any LLM call". This pins +// the (a) typed-throw, (b) reason='cost', (c) paste-ready error message +// content, and (d) that no chatFn calls happen during pre-flight. +describe('brainstorm --max-cost pre-flight refusal (F2 smoke)', () => { + test('estimate above cap → BudgetExhausted(reason="cost") before any chat call', async () => { + const chat = makeChatFnMixed(99999); + let caught: unknown = null; + try { + await runBrainstorm(engine, {}, { + question: 'pre-flight cap smoke question', + profile: tinyProfile, + skipCostPreview: true, + // Pre-run estimate is at the cents level; $0.0001 forces a refusal. + maxCostUsd: 0.0001, + chatFn: chat.fn, + embedQueryFn: async () => basisEmbedding(0), + stderrWrite: () => {}, + }); + } catch (e) { + caught = e; + } + expect(caught).toBeInstanceOf(BudgetExhausted); + const err = caught as BudgetExhausted; + expect(err.reason).toBe('cost'); + // User-facing hint must point at remediation paths so the operator + // can fix forward without reading the source. 
+ expect(err.message).toMatch(/exceeds --max-cost/); + expect(err.message).toMatch(/--limit/); + expect(err.message).toMatch(/--max-far-set/); + // No chat calls during pre-flight — the cap fires before any provider + // HTTP would happen on a real run. + expect(chat.crossCalls).toBe(0); + expect(chat.judgeCalls).toBe(0); + }); +}); diff --git a/test/fixtures/dream-budget-schema-v1.jsonl b/test/fixtures/dream-budget-schema-v1.jsonl new file mode 100644 index 000000000..25a3075e8 --- /dev/null +++ b/test/fixtures/dream-budget-schema-v1.jsonl @@ -0,0 +1,3 @@ +{"schema_version":1,"phase":"auto_think","event":"submit","model":"claude-haiku-4-5-20251001","label":"verdict","estimated_cost_usd":0.0035,"cumulative_cost_usd":0.0035,"budget_usd":1.0} +{"schema_version":1,"phase":"auto_think","event":"submit_denied","model":"claude-opus-4-7","label":"big-call","estimated_cost_usd":0.5,"cumulative_cost_usd":0.0035,"budget_usd":0.01} +{"schema_version":1,"phase":"drift","event":"submit_unpriced","model":"gpt-5","label":"unpriced","estimated_input_tokens":1000,"max_output_tokens":1000} diff --git a/test/reindex-code-max-cost.serial.test.ts b/test/reindex-code-max-cost.serial.test.ts new file mode 100644 index 000000000..9861bcea4 --- /dev/null +++ b/test/reindex-code-max-cost.serial.test.ts @@ -0,0 +1,77 @@ +/** + * F3: `gbrain reindex --code --max-cost N` smoke test. + * + * Pins the new flag's contract: + * 1. ReindexCodeOpts.maxCostUsd?: number accepts a positive number. + * 2. When set, runReindexCode wraps its body in withBudgetTracker so the + * gateway composes the tracker for every gateway.embed() call inside + * importCodeFile. + * 3. When unset, the body runs outside any tracker scope (legacy behavior). + * + * Marked .serial.test.ts because configureGateway/resetGateway mutate the + * module-level gateway state; running concurrent with other gateway-touching + * tests in the same shard would race. 
+ */ + +import { describe, test, expect, beforeAll, afterAll } from 'bun:test'; +import { PGLiteEngine } from '../src/core/pglite-engine.ts'; +import { runReindexCode } from '../src/commands/reindex-code.ts'; +import { + configureGateway, + resetGateway, + getCurrentBudgetTracker, +} from '../src/core/ai/gateway.ts'; + +let engine: PGLiteEngine; + +beforeAll(async () => { + configureGateway({ + embedding_model: 'openai:text-embedding-3-large', + embedding_dimensions: 1536, + env: { OPENAI_API_KEY: 'sk-test' }, + }); + engine = new PGLiteEngine(); + await engine.connect({}); + await engine.initSchema(); +}, 30_000); + +afterAll(async () => { + await engine.disconnect(); + resetGateway(); +}); + +describe('reindex-code --max-cost (F3)', () => { + test('dry-run path accepts maxCostUsd without throwing', async () => { + const result = await runReindexCode(engine, { + dryRun: true, + noEmbed: true, + maxCostUsd: 5, + }); + expect(result.status).toBe('dry_run'); + expect(result.codePages).toBe(0); // empty brain + }); + + test('empty-brain non-dry path with maxCostUsd returns ok without throwing', async () => { + // No code pages exist → estimateReindexCost returns 0 → we hit the + // early-return at totalPages===0 BEFORE the body wrap. This pins that + // the early-return path isn't broken by the maxCostUsd plumbing. + const result = await runReindexCode(engine, { + yes: true, + noEmbed: true, + maxCostUsd: 5, + }); + expect(result.status).toBe('ok'); + expect(result.reindexed).toBe(0); + expect(result.failed).toBe(0); + }); + + test('no tracker installed when maxCostUsd is unset (legacy path)', async () => { + // Outside any withBudgetTracker scope, getCurrentBudgetTracker() must + // return null both before AND after the call. This pins that the body + // wrap is conditional on the cap being set — agent callers who don't + // pass maxCostUsd see byte-stable pre-F3 behavior. 
+ expect(getCurrentBudgetTracker()).toBeNull(); + await runReindexCode(engine, { yes: true, noEmbed: true }); + expect(getCurrentBudgetTracker()).toBeNull(); + }); +});
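The F3 contract above (tracker scope installed only when `maxCostUsd` is set, nothing installed otherwise) can be illustrated with a minimal sketch, assuming the scope is built on Node's `AsyncLocalStorage` as the changelog describes. The names `SketchTracker`, `withTracker`, `currentTracker`, and `reindexSketch` are hypothetical stand-ins and not the actual gateway code; the real `withBudgetTracker` and `getCurrentBudgetTracker` may differ in signature and behavior.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Hypothetical stand-in for the real BudgetTracker: a cap plus a running total.
class SketchTracker {
  spentUsd = 0;
  constructor(readonly capUsd: number) {}

  // Reserve before spending: throw if this call would push the total past the cap.
  reserve(estimatedUsd: number): void {
    const next = this.spentUsd + estimatedUsd;
    if (next > this.capUsd) {
      throw new Error(`budget exhausted: ${next} > ${this.capUsd}`);
    }
    this.spentUsd = next;
  }
}

// Assumption: one module-level store carries the active tracker, if any.
const scope = new AsyncLocalStorage<SketchTracker>();

// Returns the tracker for the current async context, or null outside any scope.
function currentTracker(): SketchTracker | null {
  return scope.getStore() ?? null;
}

// Runs `body` with `tracker` visible to everything it awaits.
function withTracker<T>(tracker: SketchTracker, body: () => Promise<T>): Promise<T> {
  return scope.run(tracker, body);
}

// Hypothetical command body: one priced call per page at a fixed $0.25 estimate.
async function reindexSketch(pages: number, maxCostUsd?: number): Promise<number> {
  const body = async (): Promise<number> => {
    let done = 0;
    for (let i = 0; i < pages; i++) {
      await Promise.resolve(); // stand-in for async I/O; the store survives the await
      currentTracker()?.reserve(0.25); // no-op when no tracker is installed
      done++;
    }
    return done;
  };
  // Wrap only when a cap is set, so uncapped callers never enter a tracker scope.
  return maxCostUsd === undefined
    ? body()
    : withTracker(new SketchTracker(maxCostUsd), body);
}
```

In this sketch, `reindexSketch(3)` runs with no tracker at all, `reindexSketch(4, 1)` completes because four $0.25 reservations land exactly on the $1 cap, and `reindexSketch(5, 1)` rejects on the fifth reservation. Scoping through `AsyncLocalStorage.run` rather than a module-level variable is what keeps the tracker from leaking into the caller's context once the wrapped body settles.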