Agents stop overpaying for the strong model on work a cheaper one handles just as
well — without ever silently shipping worse work. For each (provider, agent, shape, tier) cell the advisor learns a success posterior from bead lifecycle + telemetry and
recommends the cheapest tier whose quality stays within tolerance, with 95%
credibility. An implementation of Conservative Constrained Thompson
Sampling.
Its sibling pack, ast-lens, cut the read tax — agents reading whole
files when an outline would do. model-advisor cuts the next line item: the cost
tax — agents running the strongest, most expensive model on every dispatch because
nobody measured whether a cheaper one would have been fine.
The naive fixes both fail. Pin everything to the frontier tier and you overpay on the lookup, the doc rewrite, the single-file edit — work a mid tier nails every time. Pin a cheaper tier to save money and you gamble: a wrong answer on an ADR or a release-gating review cascades through every bead that depended on it. The right tier is a per-task decision, and it should be made on evidence, not vibes.
That is the whole job of the advisor. It watches outcomes, learns which tiers actually
preserve quality for which (agent, shape), and recommends the cheapest one it can
prove is good enough — falling back to the known-good baseline whenever it can't.
$ advisor advise polecat implement
recommend: sonnet (claude-sonnet-…) ← cheaper than the opus baseline
why: q_lo=0.82 ≥ baseline mean 0.80 − 0.05 tol; cheapest admitted tier
saving: −$0.011 / dispatch vs opus (at the representative token budget)
$ advisor advise refinery review
recommend: opus (claude-opus-…) ← baseline held
why: no cheaper tier clears tolerance — sonnet q_lo=0.58 < 0.75 threshold
(thin evidence, n<5). Recommending the known-good tier.
Quality in gc is already encoded in the work: a clean bead close means the dispatch
landed; a reopen / escalate means it didn't. The advisor harvests that signal (plus
higher-fidelity reviewer and eval verdicts), maintains a Beta(a, b) posterior per cell,
and runs the CC-TS decision rule on every recommendation:
- Learn. A Stop/SubagentStop hook records each invocation; bead lifecycle supplies the outcome. Each (provider, agent, shape, tier) cell accrues a calibrated success posterior. Telemetry lands in .beads/telemetry/invocations.jsonl; the cell-store cache is rebuildable from it.
- Gate (conservative). A cheaper tier is admitted only if a one-sided lower confidence bound on its success clears baseline mean − q_tol at 95% credibility. Thin or zero evidence ⇒ the gate rejects ⇒ the baseline tier wins. Critical cells are never downgraded, full stop.
- Choose (asymmetric loss). Among admitted tiers, pick the one minimising an asymmetric loss that penalises a wrong downgrade far more than it rewards the saved spend, scaled by the bead's downstream blast radius.
- Recommend / apply. advise returns the tier + a structured, auditable rationale; inspect shows the evidence; apply sets the tier on the agent's config. The advisor recommends; you apply.
The conservative guarantee. Every recommended downgrade is credible at 95%; the baseline is always feasible; the cold-start recommends the baseline until a cheaper tier earns its way in. The worst case is overpaying for the safe tier — never silently shipping worse work. This is the property that makes it safe to let the advisor inform routing at all.
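The gate step can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the engine's code: `TierEvidence`, `wilson_lower_bound`, and `recommend_sketch` are hypothetical names, the one-sided Wilson bound is assumed from the mention of the "asymptotic Wilson LCB" in the advanced modes table, the baseline mean assumes a uniform Beta(1, 1) prior, and the asymmetric-loss step is replaced by simply taking the cheapest admitted tier.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass(frozen=True)
class TierEvidence:
    """Hypothetical per-tier summary for one (provider, agent, shape) slice."""
    tier_id: str
    cost: float        # relative cost per dispatch; lower is cheaper
    successes: float   # count of good outcomes
    failures: float    # count of bad outcomes

    @property
    def n(self) -> float:
        return self.successes + self.failures


def wilson_lower_bound(successes: float, failures: float,
                       credibility: float = 0.95) -> float:
    """One-sided Wilson lower bound on the success rate; 0.0 with no evidence."""
    n = successes + failures
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(credibility)  # about 1.645 for a one-sided 95% bound
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


def recommend_sketch(tiers, baseline_id: str, q_tol: float,
                     critical: bool = False) -> str:
    """Return the cheapest tier that clears the gate, else the baseline."""
    baseline = next(t for t in tiers if t.tier_id == baseline_id)
    if critical:
        return baseline_id  # Critical cells are never downgraded
    # Assumption: posterior mean under a uniform Beta(1, 1) prior.
    baseline_mean = (baseline.successes + 1) / (baseline.n + 2)
    threshold = baseline_mean - q_tol
    admitted = [
        t for t in tiers
        if t.tier_id != baseline_id
        and t.cost < baseline.cost
        and wilson_lower_bound(t.successes, t.failures) >= threshold
    ]
    if not admitted:
        return baseline_id  # thin or zero evidence: the baseline wins
    return min(admitted, key=lambda t: t.cost).tier_id
```

With a well-evidenced mid tier and a cheap tier that has only three observations, the sketch admits the former and rejects the latter, which is the cold-start behaviour described above: a short winning streak does not earn a downgrade.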
| Layer | What it is | File |
|---|---|---|
| Engine | the pure, stdlib-only CC-TS decision rule + cell store | modeladvisor/ |
| CLI | advise / inspect / apply / auto-apply + advanced (eval-schedule / federate / drift) | bin/advisor |
| Skill | teaches agents to advise-before-dispatch | skills/use-model-advisor/SKILL.md |
| Prompt fragment | the cost-aware discipline, in every agent's context | template-fragments/model-advisor.template.md |
| Telemetry hook | records each invocation (Stop / SubagentStop) | overlay/…/settings.json + hooks/capture-invocation.sh |
| Config | the model roster + shape taxonomy + tolerance classes | advisor.toml |
| Scheduler | a gc order to run auto-apply on a cadence (safe unless armed) | orders/ + docs/AUTO-APPLY.md |
| Advanced modes | 8 opt-in research extensions (conformal · empirical-Bayes · Thompson · continuous · cascade · federation · drift · eval-sched) | modeladvisor/ · §7.3 |
./setup.sh # one-time: build the engine venv (stdlib only)
./bin/advisor advise polecat implement # cost-minimal tier for this agent + shape
./bin/advisor inspect polecat implement # the evidence behind it (per-tier posterior)
./bin/advisor apply polecat # set the recommended tier on the agent

Every read surface takes --json to emit the structured reasons object for
programmatic routing.
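A caller that wants the structured output could wrap the CLI roughly as follows. This is a sketch under assumptions: `advise_json` is a hypothetical helper, the JSON object is assumed to arrive on stdout, and no field names are assumed because the schema is not described here. The `run` parameter exists only so the wrapper can be exercised without the CLI installed.

```python
import json
import subprocess


def advise_json(agent: str, shape: str,
                advisor_bin: str = "./bin/advisor",
                run=subprocess.run) -> dict:
    """Run `advisor advise <agent> <shape> --json` and return the parsed object.

    Assumption: the reasons object is printed to stdout as a single JSON document.
    """
    result = run(
        [advisor_bin, "advise", agent, shape, "--json"],
        capture_output=True,
        text=True,
        check=True,  # a non-zero exit raises CalledProcessError
    )
    return json.loads(result.stdout)
```

A router would typically call `advise_json("polecat", "implement")` and fall back to its known-good tier if the call raises.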
The decision is the paper's, made gc-native. A shape is the kind of task (not the
agent): lookup, implement, judge, review, patrol — config-extensible. A tier
is one of your configured (provider, model) run targets, cost-ordered — an arbitrary
roster, not a fixed haiku/sonnet/opus three. A cell is
provider::agent::shape::tier_id, and recommend(state, cell, tol, baseline) is a
pure, deterministic function: same posteriors + config in, same tier + rationale out.
That purity is why the rule is exhaustively testable and why the reasons object — the
candidate tiers, their confidence bounds, the credible quality-drop intervals, the cost
diffs — is a first-class audit surface, not an afterthought.
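The cell key format is simple enough to sketch. The `Cell` type below is hypothetical (the engine's own representation may differ) and the provider value in the usage note is a placeholder; only the `provider::agent::shape::tier_id` layout comes from the text above.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Hypothetical value type for a provider::agent::shape::tier_id key."""
    provider: str
    agent: str
    shape: str
    tier_id: str

    def key(self) -> str:
        # Join the four fields with the "::" separator used in the README.
        return "::".join(self)

    @classmethod
    def parse(cls, key: str) -> "Cell":
        parts = key.split("::")
        if len(parts) != 4 or not all(parts):
            raise ValueError(f"expected provider::agent::shape::tier_id, got {key!r}")
        return cls(*parts)
```

For example, `Cell("anthropic", "polecat", "implement", "sonnet").key()` yields `"anthropic::polecat::implement::sonnet"`, and `Cell.parse` inverts it. An immutable, hashable key like this is convenient as a dictionary index for per-cell posteriors.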
It degrades, never blocks. No roster ⇒ a clear error (it can't order tiers by cost
without one). No telemetry ⇒ conservative cold-start. No token counts ⇒ a representative
budget. Thin evidence ⇒ the baseline tier. The advisor can never stall a dispatch — worst
case it recommends the safe tier. Full math: docs/DESIGN.md.
Two pure reads and one deliberate write — plus the opt-in
Advanced modes surfaces (eval-schedule, federate, drift).
| Command | Does | Mutates? |
|---|---|---|
| advisor advise <agent> <shape> | recommend the cost-minimal tier + rationale + cost diff | no |
| advisor inspect <agent> <shape> | per-tier posterior, credible quality-drop CI, the highest-value eval probe | no |
| advisor apply <agent> | set the recommended tier on the agent's model config + reload | yes |
| advisor auto-apply [--town\|--rig N] | sweep every agent and apply the evidence-strong tier (safest-across-shapes; dry-run by default) | yes |
advise/inspect never dispatch, touch a bead, or change config. apply is the per-agent
application path, on demand; auto-apply automates that sweep across all agents —
conservatively (never auto-downgrading thin-evidence or Critical cells) — and can be
scheduled with a gc order. See docs/AUTO-APPLY.md.
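One plausible reading of "safest-across-shapes" is sketched below: since apply sets a single tier per agent, take the strongest tier that any of the agent's shapes needs, and treat thin-evidence or Critical shapes as needing the baseline. This is an interpretation, not the documented algorithm; `ShapeAdvice` and `safest_across_shapes` are hypothetical names, and rank is assumed to increase with tier strength (consistent with the baseline defaulting to the highest rank).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShapeAdvice:
    """Hypothetical per-shape recommendation for one agent."""
    shape: str
    tier_id: str
    rank: int                # assumed: higher rank = stronger, costlier tier
    evidence_strong: bool    # False for thin-evidence cells
    critical: bool = False


def safest_across_shapes(advice, baseline_id: str, baseline_rank: int) -> str:
    """Pick one tier per agent: the strongest tier any shape requires."""
    chosen_id, chosen_rank = None, -1
    for item in advice:
        if item.critical or not item.evidence_strong:
            # Never auto-downgrade on thin evidence or Critical cells.
            tier_id, rank = baseline_id, baseline_rank
        else:
            tier_id, rank = item.tier_id, item.rank
        if rank > chosen_rank:
            chosen_id, chosen_rank = tier_id, rank
    # No advice at all: stay on the baseline.
    return chosen_id if chosen_id is not None else baseline_id
```

Under this reading, an agent advised sonnet for implement and haiku for lookup would be set to sonnet, and a single thin-evidence shape keeps the whole agent on the baseline.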
The eight extensions the paper designs but v1 deliberately left off (docs/DESIGN.md §7.3)
now ship — each a self-contained, stdlib-only module behind a default-off flag, so the
conservative v1 rule stays the default and turning one on is a one-line config change. v1
behavior is byte-identical when they're off.
| Mode | Flag / surface | What it adds |
|---|---|---|
| Conformal calibration | lcb_backend = "conformal" | distribution-free split-conformal gate bound (rolling (pred, obs) buffer) in place of the asymptotic Wilson LCB |
| Full hierarchical Bayes | pooling = "empirical-bayes" | a genuine empirical-Bayes Beta-Binomial pooler (method-of-moments hyperprior + evidence-weighted shrinkage); optional [bayes] PyMC backend |
| Thompson sampling | mode = "thompson" · advise --thompson --seed N | seeded per-tier Beta sampling instead of the deterministic rule; still gate- and Critical-safe, reproducible per seed |
| Continuous quality | continuous_quality = true | graded q ∈ [0,1] outcomes (reviewer score / test-pass fraction) feeding the Beta; optional Gaussian (NIG) posterior for unbounded scores |
| Critical-path cascade | advise --cascade-bead ID | DAG-propagated effective N_dep from the bead dependency graph: an ADR that blocks N builders carries a larger blast radius than a leaf |
| Multi-tenant federation | [federation] peers · advisor federate | share aggregates only (never raw telemetry) across repos to warm thin cells, trust-weighted so local evidence always dominates |
| Change-point drift | changepoint = true · advisor drift | Page-Hinkley detection of upstream model drift/deprecation + recency re-weighting so the gate re-learns the new regime |
| Auto-scheduled eval | advisor eval-schedule [--apply] | rank gating cells by CI-width × unlock-value and dispatch eval probes proportional to posterior width (dry-run default; schedulable via a gc order) |
All eight are covered by the suite (228 tests) and never weaken the Critical
never-downgrade guarantee. Math + rationale: docs/DESIGN.md §7.3.
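To make the Thompson row concrete, here is a minimal sketch of seeded per-tier Beta sampling that stays behind the gate. It is an illustration under assumptions, not the module itself: `thompson_pick` and its parameters are hypothetical, the set of gate-admitted tiers is taken as an input, and the sampled quality is compared against the same threshold the gate uses.

```python
import random


def thompson_pick(posteriors, costs, admitted, baseline_id,
                  threshold, seed, critical=False):
    """Sample each admitted tier's Beta posterior and pick the cheapest survivor.

    posteriors: tier_id -> (a, b) Beta parameters
    costs:      tier_id -> cost per dispatch
    admitted:   tier ids that already passed the conservative gate
    """
    if critical:
        return baseline_id  # Critical cells are never downgraded
    rng = random.Random(seed)  # a private generator keeps runs reproducible per seed
    candidates = []
    for tier_id in sorted(admitted):  # fixed iteration order, so draws are stable
        a, b = posteriors[tier_id]
        sampled_quality = rng.betavariate(a, b)
        if sampled_quality >= threshold and costs[tier_id] < costs[baseline_id]:
            candidates.append(tier_id)
    if not candidates:
        return baseline_id
    return min(candidates, key=lambda t: costs[t])
```

Because sampling only ever considers tiers the gate admitted, randomisation can change which admitted tier is chosen but cannot introduce a tier the deterministic rule would have rejected.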
Turn it on for one rig or the whole town, reversibly:
./setup.sh # build the engine venv (one-time)
./install.sh --rig myrig # one rig: skill + telemetry hook for its agents
./install.sh --town # city-wide: + opts the discipline fragment into every agent
./uninstall.sh --rig myrig # clean reversal (strips the merged Stop hook too)
./uninstall.sh --town --purge # …and drop the venv

install.sh (--town | --rig <name>) [--dry-run] [--city <path>] [--no-reload]
uninstall.sh (--town | --rig <name>) [--dry-run] [--city <path>] [--purge] [--no-reload]
Both are idempotent, back up every file they edit (<file>.model-advisor.bak.<ts>),
support --dry-run (prints the plan, changes nothing), and fail loudly on
unexpected state. A local in-tree pack imports via a direct config entry — the gas-town
convention — not gc import add:
- --town adds [imports.model-advisor] (source = "packs/model-advisor") to the city-root pack.toml, and opts "use-model-advisor" into city.toml global_fragments.
- --rig <name> adds [rigs.imports.model-advisor] under the matching [[rigs]] in city.toml (the fragment, a city-wide list, is left alone).
gc reload then materializes the Claude Stop/SubagentStop overlay; gc deep-merges
the telemetry hook into projected settings without clobbering the core hooks. Because that
merge never deletes, uninstall.sh explicitly strips our hook entry (matched by the
unique model-advisor/hooks/capture-invocation.sh command substring) from every projected
settings.json, preserving any other hook, then re-projects clean. Town and rig installs
both round-trip city.toml / pack.toml byte-identical.
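The strip step can be illustrated in Python, although uninstall.sh itself is a shell script and may do this differently. The sketch assumes the projected settings.json follows the Claude Code hook layout (event name → list of matcher groups → list of hook entries with a `command` string); `strip_hook` is a hypothetical name.

```python
import copy

# The unique substring the uninstaller matches on, as described above.
NEEDLE = "model-advisor/hooks/capture-invocation.sh"


def strip_hook(settings: dict, needle: str = NEEDLE) -> dict:
    """Return a copy of settings without hook entries whose command contains needle."""
    cleaned = copy.deepcopy(settings)  # never mutate the caller's dict
    hooks = cleaned.get("hooks", {})
    for event in list(hooks):
        new_groups = []
        for group in hooks[event]:
            original = group.get("hooks", [])
            kept = [h for h in original if needle not in h.get("command", "")]
            if len(kept) == len(original):
                new_groups.append(group)                    # untouched group
            elif kept:
                new_groups.append({**group, "hooks": kept})  # other hooks survive
            # a group left with no hooks is dropped entirely
        if new_groups:
            hooks[event] = new_groups
        else:
            del hooks[event]  # no groups remain for this event
    return cleaned
```

Matching on the command substring rather than on position is what lets other hooks in the same event survive, regardless of the order in which the merge placed them.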
Everything is config-driven; nothing is hard-coded. Edit advisor.toml:
| Knob | What it is | Default |
|---|---|---|
| [[tier]] roster | your (provider, model) run targets, cost-ordered by rank | (required) |
| baseline tier tier* | the known-good reference all downgrades are judged against | highest rank |
| shape taxonomy | task kinds + per-agent canonical shape sets | the 5 seed shapes |
| tolerance class | per-(agent, shape) quality budget: Critical/Strict/Moderate/Lenient | Moderate |
| q_tol / M | the per-class quality-loss tolerance + asymmetric multiplier (∞/20/5/1) | paper defaults |
| channel weights | close:1, review:3, eval:5 | as shown |
| force_baseline | safety hatch: pin a cell to its baseline, short-circuit all learning | off |
| advanced modes | 8 opt-in flags: lcb_backend, pooling, mode, continuous_quality, changepoint, [federation] (Advanced modes) | all off |
The pack ships an editable starter (SAMPLE_TOML); a consumer with a characterised
landscape can also supply a priors.json for immediate convergence. Full reference:
docs/DESIGN.md §2–§3.
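The channel weights suggest a weighted Beta update, which might look like the sketch below. The weights (close:1, review:3, eval:5) come from the table; everything else is an assumption for illustration, including the `BetaCell` name, the uniform Beta(1, 1) starting prior, and the rule that a verdict adds its channel weight to `a` on success or to `b` on failure.

```python
from dataclasses import dataclass

# Default channel weights from the config table.
CHANNEL_WEIGHTS = {"close": 1.0, "review": 3.0, "eval": 5.0}


@dataclass
class BetaCell:
    """Hypothetical Beta(a, b) success posterior for one cell."""
    a: float = 1.0  # assumed uniform prior
    b: float = 1.0

    def observe(self, channel: str, success: bool) -> None:
        # Higher-fidelity channels move the posterior further per verdict.
        weight = CHANNEL_WEIGHTS[channel]
        if success:
            self.a += weight
        else:
            self.b += weight

    @property
    def mean(self) -> float:
        return self.a / (self.a + self.b)
```

Under these assumptions, one clean bead close followed by one failed eval leaves the cell at Beta(2, 6) with mean 0.25: a single eval failure outweighs several clean closes.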
Honest about the scope line.
- Apply granularity. v1 ships per-agent apply: manual (apply) and automated (auto-apply, an evidence-gated sweep you can schedule). It's coarse: one tier per agent, changeable by a config edit + reload. Per-dispatch application (the advised tier per task/shape) is the next step: the gc-core seam is now implemented and PR'd to gastownhall/gascity (the reconciler binds a work bead's gc.model at spawn), activating once it lands. Background: docs/INTEGRATION-FEASIBILITY.md.
- The reward signal is the hard part. Bead closure is the primary channel and is fully supported today; reviewer/eval verdicts are higher-fidelity secondaries when present. The advisor only ever learns from outcomes it can attribute to a recorded dispatch.
- Conservative by design ⇒ it converges deliberately. A thin cell recommends the baseline until evidence accrues (~a couple dozen observations dominate the cold-start prior). That is the point, not a defect: it never trades quality for speed of savings.
- The research extensions ship opt-in. Every feature the paper designs but v1 left off (conformal calibration, full (empirical-Bayes) hierarchical pooling, Thompson sampling, continuous quality, critical-path cascade, federation, change-point drift, and auto-scheduled eval) is now implemented behind a default-off flag (Advanced modes). The conservative deterministic rule remains the default, and enabling any mode never weakens the Critical never-downgrade guarantee.
- It advises; you decide. During an incident, pin the baseline (force_baseline) and don't explore. The advisor never drives routing on its own in v1.
model-advisor is a build of Conservative Constrained Thompson Sampling for
Cost-Aware Model-Tier Selection, implemented from the paper's §3 problem
formulation and §5 algorithm. The load-bearing parts are kept faithfully: the
constrained decision rule, the conservative LCB gate, the asymmetric loss with class
multipliers, the M[Critical] = ∞ hard rule, and the pure-function recommend with its
structured reasons. The rest is adapted where gc differs: an arbitrary config-driven
roster (not a fixed three tiers), a gc-native shape taxonomy, provider-keyed cells, a
conservative cost-ordered cold-start in place of an offline landscape, and bead
lifecycle as the primary quality channel.
MIT © Jay German. See LICENSE.