📄 Read the paper (PDF) · 🌐 Project page
Routers, agents, serving stacks, and trainers look like four different engineering problems. They are not. They are four readings of one allocation problem, evaluated at four shadow prices that today no single layer can see.
The system should spend the next token on:
where
| Layer | Mechanism | Index | Prices observed |
|---|---|---|---|
| Demand | Routing as screening | model tier | |
| Action | Agent as principal–agent | plan / act / verify | |
| Supply | Serving as production | prefill / decode / KV | |
| Capital | Caches & RL as investment | rollout / store | |
The unified view turns failures into corner cases of one equation: over-routing, under-routing, over-delegation, under-verification, serving congestion, stale RL rollouts, and cache misuse.
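To make the "one equation" reading concrete, here is a minimal sketch of a marginal allocator. It is an illustration under stated assumptions, not the paper's formulation: the linear form (marginal value minus price-weighted resource use), the names `Candidate`, `net_value`, and `choose_next`, and all numbers are hypothetical. Only the four layer names come from the table above.

```python
from dataclasses import dataclass
from typing import Optional

# Layer names taken from the table; everything else here is an assumption.
LAYERS = ("demand", "action", "supply", "capital")


@dataclass(frozen=True)
class Candidate:
    """One possible use of the next token (hypothetical structure)."""
    name: str
    marginal_value: float    # expected gain from spending the next token here
    usage: dict              # resource use per layer, keyed by LAYERS


def net_value(c: Candidate, prices: dict) -> float:
    """Marginal value minus resource use priced at each layer's shadow price."""
    cost = sum(prices[layer] * c.usage.get(layer, 0.0) for layer in LAYERS)
    return c.marginal_value - cost


def choose_next(candidates: list, prices: dict) -> Optional[Candidate]:
    """Return the candidate with the highest positive net value, else None."""
    best = max(candidates, key=lambda c: net_value(c, prices), default=None)
    if best is None or net_value(best, prices) <= 0:
        return None  # no use of the token is worth its priced cost: stop
    return best


# Illustrative numbers only.
demo_prices = {"demand": 0.1, "action": 0.5, "supply": 0.2, "capital": 0.05}
demo_candidates = [
    Candidate("escalate_to_large_model", 1.0, {"demand": 4.0, "supply": 2.0}),
    Candidate("verify_before_acting", 0.9, {"action": 1.0, "supply": 0.5}),
    Candidate("extra_rollout", 0.3, {"capital": 4.0, "supply": 1.0}),
]

choice = choose_next(demo_candidates, demo_prices)
print(choice.name if choice else "stop")
```

In this toy setup, a router that sees only the demand price would compare candidates on incomplete costs, which is one way to read over-routing. Raising the `supply` price, as congestion would, can make every candidate's net value negative, in which case `choose_next` returns `None`.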
- Token-aware evaluation — report all four prices, not just dollar cost.
- Risk-adjusted routing — publish a regret bound or an incentive-compatible menu.
- Autonomy pricing — make action class explicit; price irreversible actions higher.
- Congestion-priced serving — expose shadow prices for prefill / decode / KV.
- RL token budgeting — equalize marginal capability gain across rollouts, verifiers, and updates.
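The RL token budgeting item above can be sketched as a greedy allocation: give each unit of budget to whichever bucket currently offers the largest marginal gain. The logarithmic gain curve, the `scales` values, and the function names are assumptions chosen for illustration. The source states only that marginal capability gain should be equalized across rollouts, verifiers, and updates.

```python
import math


def marginal_gain(scale: float, spent: int) -> float:
    """Gain from one more unit, assuming total gain = scale * log(1 + spent)."""
    return scale * (math.log(2 + spent) - math.log(1 + spent))


def allocate(scales: dict, budget: int) -> dict:
    """Greedily assign `budget` units to the bucket with the highest marginal gain.

    With diminishing returns (concave gains), this drives the buckets'
    marginal gains toward each other as the budget is spent.
    """
    spent = {name: 0 for name in scales}
    for _ in range(budget):
        best = max(spent, key=lambda k: marginal_gain(scales[k], spent[k]))
        spent[best] += 1
    return spent


# Hypothetical productivity scales for the three buckets named in the text.
demo_scales = {"rollouts": 3.0, "verifiers": 2.0, "updates": 1.0}
demo_plan = allocate(demo_scales, budget=60)

for name, units in demo_plan.items():
    print(name, units, round(marginal_gain(demo_scales[name], units), 4))
```

Under these assumptions, the printed marginal gains end up close to one another even though the unit counts differ, which is the equalization condition in discrete form. A real system would need measured gain curves rather than an assumed log shape.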
@misc{zhu2026marginaltoken,
title = {Agentic AI Systems Should Be Designed as Marginal Token Allocators},
author = {Siqi Zhu},
year = {2026},
note = {Position paper, preprint}
}

The website source (HTML/CSS) is released under the MIT license. The paper itself is © 2026 Siqi Zhu, all rights reserved (preprint distribution permitted).