A new cost model for code-understanding agents. TokenMaster re-engineers token economics at the harness layer — the layer that decides what the model re-reads on every single turn — for Claude Code and GitHub Copilot CLI.
The thesis in one line: the model should pay once to understand a codebase's structure, then never again. Today's harnesses violate this on every turn — they hand the model a growing transcript and let it grep its way to understanding, re-paying for the entire accumulated context turn after turn. TokenMaster makes that re-derivation economically illegal: structural questions get routed to a prebuilt code graph answered in one bounded query, so the cumulative token bill collapses instead of compounding.
It attacks the bill on two complementary fronts: a routing layer (below) that collapses the cost of re-deriving structure, and Brainspace, a compression layer that shrinks the raw tool outputs sitting in the transcript. Different token populations, so the wins compound. Routing is the headline; Brainspace is the multiplier.
Same CLI. Same model. Same task. The only variable is whether TokenMaster's routing agent is on.
-73% input tokens. 3.71x more efficient overall. Up to 7.8x on blast-radius analysis. 12 / 12 tasks answered from the graph. Zero correctness regressions.
Pooled across scikit-learn + sympy. 36 live GitHub Copilot runs. Full breakdown below.
/token-master
That one command builds the index for the current repository and turns on routing.
Start with the one fact everything else follows from: the model has no memory between turns. To continue a task, the harness re-sends the entire transcript every turn — every file read, every tool result, every message so far — and the model re-reads and re-reasons over all of it before doing anything new.
So a token is not paid for once. It is paid for every turn it stays in the context window. The cost that matters is not tokens sent but the total tokens processed, summed across every turn until the task is done.
The task: "who calls load_config?" The context carried forward grows each turn, so every turn re-pays for the ones before it.
GREP its way there GRAPH lookup
turn 1 grep 2,000 turn 1 query graph 2,500
turn 2 read auth.py 6,000 turn 2 answer 3,000
turn 3 read server 11,000 ──────────────────────────
turn 4 read handlers 17,000 TOTAL PROCESSED 5,500
turn 5 grep narrower 19,000
turn 6 read config 24,000
turn 7 answer 26,000
──────────────────────────
TOTAL PROCESSED 105,000
Same answer. 105,000 vs 5,500 tokens processed. The grep total is not 26,000 — it is the sum of all seven turns, because turn 7 re-read what turn 1 saw, six times over.
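The totals above are just the per-turn context sizes added up. A minimal Python sketch of that arithmetic, using the figures from the table (the function name `total_processed` is illustrative and not part of TokenMaster):

```python
def total_processed(context_per_turn):
    """Cumulative tokens processed: every turn re-reads its whole context."""
    return sum(context_per_turn)


# Context size (tokens) re-read at each turn, copied from the table above.
grep_path = [2_000, 6_000, 11_000, 17_000, 19_000, 24_000, 26_000]
graph_path = [2_500, 3_000]

grep_total = total_processed(grep_path)
graph_total = total_processed(graph_path)

print(f"grep:  {grep_total:,} tokens processed over {len(grep_path)} turns")
print(f"graph: {graph_total:,} tokens processed over {len(graph_path)} turns")
print(f"ratio: {grep_total / graph_total:.1f}x")
```

The final turn's context (26,000) is only the height of the curve at its end; the sum is the area under it.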
Plot context size per turn and measure the area. A turn that looks cheap on its own is a trap; what costs you is the whole shaded region.
context
│ ___________ grep: climbs AND runs long
│ __/ -> big area, expensive
│ __/
│ __/
│ __/
│ /__________ graph: stays low AND ends fast -> tiny area
└──────────────────────────► turns
Left alone, the model defaults to grep out of habit: in TokenMaster's own runs it reached for the graph 0 / 15 times until it was actively routed there. So TokenMaster does not merely offer the cheap path — it makes the prebuilt graph the default for structural questions ("who calls X," "what breaks if I change Y"). Offering saves nothing; enforcing is what collapses the area under the curve — and that collapse is exactly the -73% headline above.
First proven on Django (992 .py files, hard multi-hop tasks), then replicated on independent SWE-bench Lite repos to rule out single-repo luck. One measurement throughout: cumulative input tokens to finish a structural task, the baseline agent vs the same setup with TokenMaster on.
Baseline = the stock agent, no routing layer. It answers structural questions by reading and re-reading files, turn after turn. TokenMaster routes those same questions to a prebuilt graph. Identical model (claude-sonnet-4.5), identical prompts, identical correctness oracle. The routing agent is the only thing that changes.
36 live GitHub Copilot runs. 2 repos x 3 tasks x 2 reps x 3 arms.
| Metric | Baseline | TokenMaster | Delta |
|---|---|---|---|
| Cumulative input tokens | baseline | -73.1% | 3.71x fewer |
| Tasks answered from the graph | n/a | 12 / 12 | never fell back |
| Correctness vs AST oracle | pass | pass | no regression |
The harder the traversal, the bigger the collapse. Caller lookups save 3-5x; blast-radius analysis ("what breaks if I change this?") is where the baseline agent detonates, re-reading files across the whole repo to trace impact, and where one bounded graph query wins biggest:
Cumulative input tokens to finish the task (lower is better)
"Who calls X?" - reverse dependency lookup
scikit-learn Baseline █████████ 69,609
scikit-learn TokenMaster ███ 21,215 3.3x fewer
sympy Baseline ██████████████ 107,954
sympy TokenMaster ███ 21,189 5.1x fewer
"What breaks if I change this?" - blast radius
scikit-learn Baseline ███████████████████████████ 203,481
scikit-learn TokenMaster ████ 26,908 7.6x fewer
sympy Baseline ████████████████████████████ 210,214
sympy TokenMaster ████ 26,830 7.8x fewer
On blast radius, the baseline balloons past 200,000 tokens tracing impact by hand; TokenMaster answers from the graph in ~27,000 - a 7.6x to 7.8x saving, close to an order of magnitude, repeated across two unrelated codebases.
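The "x fewer" labels in the chart can be rechecked from the raw token counts. A small sketch, assuming the fold reduction is simply baseline tokens divided by TokenMaster tokens (the dictionary layout and function name are illustrative):

```python
# (baseline tokens, TokenMaster tokens) as reported in the chart above.
runs = {
    ("callers", "scikit-learn"): (69_609, 21_215),
    ("callers", "sympy"): (107_954, 21_189),
    ("blast radius", "scikit-learn"): (203_481, 26_908),
    ("blast radius", "sympy"): (210_214, 26_830),
}


def fold_reduction(baseline, routed):
    """How many times fewer tokens the routed run used than the baseline."""
    return baseline / routed


for (task, repo), (baseline, routed) in runs.items():
    factor = fold_reduction(baseline, routed)
    saved_pct = 100 * (1 - routed / baseline)
    print(f"{task:<13} {repo:<13} {factor:.1f}x fewer ({saved_pct:.1f}% saved)")
```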
A harness that wins everywhere is measuring an artifact. TokenMaster doesn't:
- Inheritor lookup on sympy came out at -44%, a net loss: the graph query cost more than the baseline on that one task. Reported, not hidden.
- Negative control (a question the baseline already nails in ~3 turns) correctly came out a wash - no traversal needed, no win claimed.
Provenance. Every figure above comes from the project's live A/B/C benchmark harness (run_nav.py -> score_nav.py), reproduced verbatim from its generated report. The benchmark sandbox and raw reports are kept out of this repo by design (research and scratch stay on disk). Django origin figures: -72% overall (3.5x), up to -80% (5.0x).
TokenMaster installs a routing agent that prefers graph queries over grep, backed by two interchangeable graph suppliers and the host CLI's own session memory:
| Layer | Supplier | Role |
|---|---|---|
| Semantic-spatial (default) | graphify | Fast, no-LLM structural index. Answers callers / callees / impact / inheritors from inferred edges. The cheap default. |
| Precise-spatial (last-mile) | @colbymchenry/codegraph | AST-resolved call edges. The precision escalation for when an inferred edge isn't trustworthy enough: precision-critical impact analysis, or sparse call graphs (common in JS/TS) where name-inference under-connects. Costs more tokens to buy exact edges. |
| Temporal | host CLI session memory | Native cross-session recall, no extra server. |
The routing layer is the product; the indexes are interchangeable suppliers. Routing is the load-bearing primitive — in early tests the model queried the graph 0/15 times without an explicit nudge and 8/8 with it. A graph the model never queries saves nothing, so TokenMaster's job is not to offer the efficient tool but to make it the path of least resistance.
Routing collapses the cost of re-deriving structure. But there is a second token population it never touches: the raw tool outputs — directory listings, test logs, file dumps — that land in the transcript and then get re-sent on every subsequent turn. Brainspace is the layer that attacks those.
It is deliberately a complement, not a competitor: routing shrinks what the model re-reads to understand the codebase; Brainspace shrinks what the model re-reads from its own tool results. Disjoint populations, so the savings compound rather than overlap. The same "area under the curve" logic from above applies — a tool output that is 6x smaller is 6x smaller every turn it lingers in context, not just once.
The risk with compressing tool output is obvious: throw away the one line that mattered. Brainspace's escape hatch is the CCR (Content-addressed, Cached, Reversible store). A compressor may drop detail from what the model sees, but only after handing the original to CCR, which stores it verbatim under sha256(content) and returns a stable placeholder:
[[BR:9f3a2b1c4d5e|log; pytest run; 412L]]
If the model ever needs the dropped detail, it calls brainspace_retrieve with that placeholder and gets the exact original bytes back. Content-addressing buys two things at once: identical content stashes once (free dedup), and the placeholder is byte-identical across turns — so it never perturbs the provider's prompt-cache prefix. (That last point is the whole reason compression and caching don't collide: a placeholder carrying a timestamp or hit-counter would silently evict the cache every turn. Brainspace's are derived only from content.)
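To make the content-addressing idea concrete, here is a toy in-memory store. It is a sketch under assumptions, not Brainspace's ccr.py: the class name `ContentStore`, the 12-hex-character digest prefix, and the in-memory dict are all illustrative choices modeled on the placeholder format shown above.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store (hypothetical; not the real ccr.py)."""

    def __init__(self):
        self._blobs = {}  # full sha256 hex digest -> original bytes

    def stash(self, content: bytes, kind: str, summary: str) -> str:
        digest = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._blobs.setdefault(digest, content)
        line_count = len(content.splitlines())
        # The placeholder depends only on content and caller-supplied labels:
        # no timestamps or counters, so it is byte-identical on every turn.
        return f"[[BR:{digest[:12]}|{kind}; {summary}; {line_count}L]]"

    def retrieve(self, placeholder: str) -> bytes:
        prefix = placeholder.removeprefix("[[BR:").split("|", 1)[0]
        matches = [b for d, b in self._blobs.items() if d.startswith(prefix)]
        if len(matches) != 1:
            raise KeyError(placeholder)
        return matches[0]


store = ContentStore()
log = b"test_a PASSED\ntest_b FAILED\n"
first = store.stash(log, "log", "pytest run")
second = store.stash(log, "log", "pytest run")
print(first == second)               # same content, same placeholder
print(store.retrieve(first) == log)  # exact original bytes come back
```

Because the key is derived from the bytes alone, deduplication and cache-stable placeholders fall out of the same design choice.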
Every figure below is the real compressor run over a real artifact — actual repo source, a live pytest -v capture, a real serialized directory listing, the real README — measured in tiktoken cl100k_base tokens (the same proxy the engine ships). Reproduce with python sandbox_brainspace/measure_impact.py.
Tokens in -> out, per content type (lower out is better)
JSON dir listing, 200 files 769 -> 92 88.0% 8.36x
log live pytest -v run 2,838 -> 475 83.3% 5.97x
code real ccr/router/mcp 5,172 -> 3,345 35.3% 1.55x
prose README + research doc 20,169 -> 20,067 0.5% 1.01x <- honest negative
The shape is the point. Structured, repetitive output (JSON, logs) compresses 6-8x because its redundancy is mechanical and safe to elide. Code compresses ~1.5x — signatures and docstrings kept, bodies stashed. Prose barely moves (0.5%): natural language has little safe-to-drop redundancy, so Brainspace defaults to near-lossless whitespace normalization and leaves the lossy ML path off. A layer that claimed to squeeze prose like it squeezes a directory listing would be lying; this one reports the 0.5% and moves on.
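One way to picture "signatures and docstrings kept, bodies stashed" is with Python's standard `ast` module. This is a hedged sketch, not Brainspace's actual code compressor: the function name `skeletonize` is hypothetical, it handles Python source only, it needs Python 3.9+ for `ast.unparse`, and it skips the step of stashing the dropped bodies.

```python
import ast


def skeletonize(source: str) -> str:
    """Keep each function's signature and docstring; replace the body with `...`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kept = []
            if ast.get_docstring(node) is not None:
                kept.append(node.body[0])  # the docstring expression
            kept.append(ast.Expr(value=ast.Constant(value=...)))
            node.body = kept
    return ast.unparse(tree)


SAMPLE = '''
def load_config(path: str) -> dict:
    """Load config from disk."""
    with open(path) as fh:
        raw = fh.read()
    return parse(raw)
'''

skeleton = skeletonize(SAMPLE)
print(skeleton)
```

A real compressor would hand the original source to the reversible store before dropping bodies, so nothing is lost.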
Benchmark it on your own code — these numbers are not a brochure, they're a command. The same harness that produced the table ships with the plugin; point it at your files, logs, or piped output:
# your own files (type auto-detected per file):
uv run --with mcp --with tiktoken python brainspace_benchmark.py path/to/big.json server.log ./src
# or pipe a real command's output straight in:
pytest -v | uv run --with mcp --with tiktoken python brainspace_benchmark.py --stdin --hint=pytest
It measures in real tiktoken tokens, verifies lossless recovery on every placeholder it stashes, and prints the per-type table plus the area-under-curve projection below. Run with no arguments for an instant demo over Brainspace's own files. --json emits machine-readable output for your own dashboards.
A single compressed output is re-sent every turn it stays in context, so its saving is multiplied by the turns it survives. The live pytest log above, left in a 10-turn task:
turns in context raw cumulative compressed tokens saved
1 2,838 475 2,363
3 8,514 1,425 7,089
5 14,190 2,375 11,815
10 28,380 4,750 23,630
One log. 23,630 tokens saved across ten turns — and it round-trips losslessly the moment the model asks for it.
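The projection is a straight multiplication: the per-turn saving times the number of turns the output stays in context. A short sketch that reproduces the table from the two measured values (the function name is illustrative):

```python
RAW_TOKENS = 2_838        # live pytest log, uncompressed
COMPRESSED_TOKENS = 475   # same log after compression


def cumulative_saving(raw, compressed, turns):
    """Tokens saved when one output is re-sent for `turns` turns."""
    return (raw - compressed) * turns


print(f"{'turns':>5} {'raw':>8} {'compressed':>11} {'saved':>8}")
for turns in (1, 3, 5, 10):
    raw_total = RAW_TOKENS * turns
    compressed_total = COMPRESSED_TOKENS * turns
    saved = cumulative_saving(RAW_TOKENS, COMPRESSED_TOKENS, turns)
    print(f"{turns:>5} {raw_total:>8,} {compressed_total:>11,} {saved:>8,}")
```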
| Host | Mode | What's verified |
|---|---|---|
| Claude Code | Full — auto-compress hook + model-invoked tools | PostToolUse hook rewrote a real 3,226 -> 98 token output (97.0%) with the buried ValueError preserved, via the verified updatedToolOutput field |
| GitHub Copilot CLI | MCP-only — model-invoked compress / retrieve | The real copilot binary spawned the server via uv and invoked brainspace_compress end-to-end: 8,299 chars -> 303, tool footer 2,075 -> 55 tokens (97.3%) |
- Copilot is MCP-only. GitHub Copilot CLI has no documented output-rewriting hook, so it gets the model-invoked brainspace_compress / brainspace_retrieve tools but not transparent auto-compression. Claude Code gets the full layer. This is a real capability gap, surfaced in the installer summary rather than hidden.
- Prose is the weakest lever. As the table shows, the gain is ~0.5%. The optional LLMLingua-style ML path exists but is off by default because it trades losslessness for a marginal prose gain, which is the wrong trade for tool output.
- Tiny inputs can't be compressed profitably. Placeholder + footer overhead would dominate, so a centralized never-expand guard returns the original unchanged whenever compression wouldn't strictly shrink it. Compression must never cost tokens.
- Token proxy, not the host tokenizer. Numbers are tiktoken cl100k_base; the host model's exact tokenizer differs in absolute counts, but compression ratios are stable across tokenizers, and ratios are what's reported.
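The never-expand guard described in the list above can be sketched in a few lines. This is an illustration under assumptions, not the shipped router: `count_tokens` is a whitespace-split stand-in for the real token meter, and `guarded_compress` and `placeholder_compressor` are hypothetical names.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Stand-in token meter (whitespace split), used only for this sketch."""
    return len(text.split())


def guarded_compress(original: str, compress: Callable[[str], str]) -> str:
    """Return the compressed text only if it is strictly smaller in tokens."""
    candidate = compress(original)
    if count_tokens(candidate) < count_tokens(original):
        return candidate
    return original  # compression must never cost tokens


def placeholder_compressor(text: str) -> str:
    # Fixed-size placeholder plus footer: this is the overhead that can
    # outweigh the saving on tiny inputs.
    return "[[BR:0123456789ab|log; demo; 1L]] (retrieve with brainspace_retrieve)"


tiny = "ok"
large = "line " * 50

print(guarded_compress(tiny, placeholder_compressor) == tiny)    # left unchanged
print(guarded_compress(large, placeholder_compressor) != large)  # compressed
```

The comparison is made in tokens rather than characters, matching the invariant the unit tests are said to enforce.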
Provenance. Compression figures come from brainspace_benchmark.py (the same self-serve tool shipped with the plugin: real artifacts, real tokenizer); the dual-host figures from sandbox_brainspace/verify_dual_host.py plus a live copilot -p run against a sandboxed COPILOT_HOME (the real ~/.copilot is never touched). 104 unit tests cover the four compressors, the router, and the never-expand invariant (enforced in tokens, not characters, a distinction the benchmark itself surfaced).
TokenMaster supports two host CLIs — Claude Code and GitHub Copilot CLI. Install the
routing agent for whichever you use (the prerequisites below are shared). /token-master builds
the per-repo graph and installs the host-appropriate routing agent into your user-scope CLI home.
TokenMaster is distributed as a Claude Code plugin through a plugin marketplace:
/plugin marketplace add shyamsridhar123/TokenMasterX
/plugin install token-master@token-master
Then, inside any repository you want to index:
/token-master
The installer writes the routing agent to ~/.claude/agents/token-master.md and registers the
graph MCP server in ~/.claude.json. After the first install, restart Claude Code (or start it
with claude --agent token-master) for routing to take effect.
Copilot CLI reads the same plugin marketplace as Claude Code, so installation is the same two
commands. In an interactive copilot session:
/plugin marketplace add shyamsridhar123/TokenMasterX
/plugin install token-master@token-master
(Equivalently, from your shell: copilot plugin marketplace add shyamsridhar123/TokenMasterX
followed by copilot plugin install token-master@token-master.)
Then, inside any repository you want to index:
/token-master
This builds the per-repo graph and writes the routing agent (with its MCP servers declared inline)
to ~/.copilot/agents/token-master.agent.md. After the first install, restart Copilot (or start
it with copilot --agent token-master) for routing to take effect.
If you have both Claude Code and Copilot CLI installed, the /token-master installer can't tell which CLI launched it and defaults to Claude Code. Force the Copilot target by setting TOKEN_MASTER_HOST=copilot in your environment before running /token-master. As a manual alternative you can run the installer directly and pass the host explicitly:
```
python token-master-plugin/skills/token-master/setup.py --host=copilot
```
uv tool install graphifyy
- uv: the routing agent launches the graph server through it.
- node + npm (optional): only needed for the precise codegraph escalation backend. Without them, TokenMaster runs graphify-only and still works.
If a prerequisite is missing, /token-master tells you exactly what to install; install it, then re-run the command.
After setup, just ask structural questions normally:
- "Who calls force_str?"
- "What breaks if I change the signature of this method?"
- "What inherits from BaseValidator?"
The agent answers them from the graph. To confirm routing is active, ask a known structural question and check that the answer comes from a graph tool call rather than a grep sweep.
Re-run /token-master whenever the code has changed enough that the graph is stale.
Note: The routing agent loads at CLI startup. After the first install, restart your host CLI (or start it with --agent token-master) for routing to take effect. The setup summary prints the exact restart command for your host.
/token-master is conservative about your working tree:
- The code graph is stored at .token-master/graph.json inside the repo.
- .token-master/ and .codegraph/ are added to the repo's .gitignore.
- The routing agent and graph server are installed to your user-scope CLI home, not the repo.
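Adding ignore entries safely means doing it idempotently, so re-running setup never duplicates lines. A minimal sketch of one way an installer could do this; the function name `ensure_ignored` is hypothetical and this is not taken from setup.py:

```python
import tempfile
from pathlib import Path


def ensure_ignored(repo_root, entries=(".token-master/", ".codegraph/")):
    """Append missing entries to .gitignore; return the entries that were added."""
    gitignore = Path(repo_root) / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    present = {line.strip() for line in existing}
    missing = [entry for entry in entries if entry not in present]
    if missing:
        # Existing lines are preserved in order; new entries go at the end.
        gitignore.write_text("\n".join(existing + missing) + "\n")
    return missing


with tempfile.TemporaryDirectory() as repo:
    first_run = ensure_ignored(repo)
    second_run = ensure_ignored(repo)  # nothing left to add
    final_lines = (Path(repo) / ".gitignore").read_text().splitlines()

print(first_run, second_run, final_lines)
```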
- Not a universal speedup. TokenMaster wins on hard, multi-hop traversal. On short structural questions that grep answers in ~3 turns, it is correctly neutral. A harness that "wins everywhere" is measuring an artifact.
- graphify edges are inferred; codegraph is the last mile. The default backend infers call edges by name (~0.8 confidence): fast and cheap, and on well-named Python it answers correctly the large majority of the time. codegraph exists to buy the last mile of precision: AST-resolved edges for the cases inference can't be trusted on. That precision is not free. On the SWE-bench Lite pilot, codegraph cost ~3-4x more tokens than graphify and on the simpler caller/inheritor tasks frequently ran below the baseline; its resolved edge set diverged from graphify's inferred set on every compared cell (different, and exact, but not a free upgrade). The takeaway the data supports: graphify is the default; codegraph is the deliberate escalation when an exact edge is worth paying for, not an always-on replacement.
- Sparse call graphs. On some languages (notably JavaScript/TypeScript) graphify's call graph is sparse; setup detects this and prints a warning pointing you at the codegraph backend.
- Cumulative tokens, not dollars. TokenMaster optimizes the integral of context size over a task. Billing proxies (premium requests, total token counts) are explicitly not the metric.
token-master-plugin/ The plugin (this is the deliverable)
├── .claude-plugin/
│ └── plugin.json Plugin manifest
└── skills/token-master/
├── SKILL.md The /token-master command
├── setup.py Installer: builds the graph, installs the host agent
├── graphify_mcp.py Graph-query MCP server
├── agent.template.claude.md Routing agent template (Claude Code format)
├── agent.template.copilot.md Routing agent template (Copilot CLI format)
│
├── brainspace_setup.py Brainspace installer (dual-host: hook+MCP / MCP-only)
├── brainspace_mcp.py Compression MCP server (compress/retrieve/stats)
├── brainspace_posttooluse.py Claude Code auto-compress hook
├── brainspace_benchmark.py Self-serve benchmark — measure on your own files
└── brainspace/ The compression engine
├── ccr.py Content-addressed reversible store (the escape hatch)
├── router.py ContentRouter — type detection + never-expand guard
├── tokens.py tiktoken-or-heuristic token meter
├── benchmark.py Benchmark core (file/dir/stdin, JSON, area-under-curve)
├── compressors/ json / code / logs / prose compressors
└── tests/ 104 unit tests
sandbox_brainspace/ Isolated verification harnesses (never touch real config)
├── measure_impact.py Real-artifact compression measurement (README numbers)
├── mcp_stdio_smoke.py MCP protocol conformance via the official client
└── verify_dual_host.py Dual-host install + end-to-end checks
.claude-plugin/
└── marketplace.json Plugin marketplace manifest (the packager)
assets/
├── generate_art.py Deterministic, dependency-free SVG generator
└── tokenmaster-hero.svg The hero image above (reproducible from a seed)
The hero image is generative: it is the thesis. The faint tangle is brute-force search sprawl — context re-read turn after turn — and the single bright path is one bounded graph-routed query. Regenerate or remix it with:
python assets/generate_art.py --seed 42