Inference-native Tokenmaxxing Agent Harness for Loop Engineering
Prompting is no longer the whole interface. The frontier is Loop Engineering: give the model an objective, feedback, verification, memory, and tools, then let it self-correct until the work is proven.
But every loop is also an inference workload. As turns accumulate, prompt prefixes drift, cache reuse collapses, stale evidence fills context, model routing gets harder, and serving choices start to matter.
Inferoa is an Inference-native Tokenmaxxing Agent Harness for Loop Engineering:
- Inference-native: the loop sees serving, routing, context windows, prefix cache, multimodal endpoints, and self-hosted model paths.
- Tokenmaxxing: each turn is shaped to preserve cacheable prefixes, bound mutable context, expose token pressure, and pick the right inference path.
- Loop Engineering: `/loop` runs durable recursive loops that inspect, edit, test, verify, decide, remember, and continue across loop tasks.
Inferoa = Infer (Inference-native) + o (Tokenmaxxing Loop) + a (Agent Harness).
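The loop idea described above (objective, feedback, verification, memory, retry) can be sketched in a few lines. This is a minimal illustration, not Inferoa's implementation: `run_loop`, `Attempt`, `LoopResult`, and the toy `propose`/`verify` callables are hypothetical names introduced here, and a real loop would call a model and run real tests.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One pass through the loop: what was tried and what the verifier said."""
    number: int
    candidate: str
    passed: bool
    feedback: str


@dataclass
class LoopResult:
    proven: bool
    attempts: List[Attempt] = field(default_factory=list)


def run_loop(
    objective: str,
    propose: Callable[[str, List[Attempt]], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 5,
) -> LoopResult:
    """Propose, verify, remember, and retry until verified or out of budget."""
    result = LoopResult(proven=False)
    for number in range(1, max_attempts + 1):
        # The proposer sees earlier attempts, so feedback acts as memory.
        candidate = propose(objective, result.attempts)
        passed, feedback = verify(candidate)
        result.attempts.append(Attempt(number, candidate, passed, feedback))
        if passed:
            result.proven = True
            break
    return result


# Toy stand-ins for a model call and a test run (assumptions for the demo).
def toy_propose(objective: str, history: List[Attempt]) -> str:
    return "fix-" + str(len(history) + 1)


def toy_verify(candidate: str) -> Tuple[bool, str]:
    ok = candidate == "fix-3"
    return (ok, "tests pass" if ok else "tests fail")


if __name__ == "__main__":
    outcome = run_loop("make the tests pass", toy_propose, toy_verify)
    print(outcome.proven, len(outcome.attempts))
```

The point of the sketch is that completion is decided by the verifier, not by the proposer, and that the attempt history is the evidence the loop carries forward.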
Inferoa gives that loop an inference-native runtime:
- Loop/rubric-driven work: `/loop` carries an objective across loop tasks, verification, decisions, recovery, and completion evidence instead of stopping after the next answer.
- Independent feedback surfaces: plans, tests, tool results, research metrics, and completion evidence give the loop something concrete to improve against.
- Memory and context control: compression, summaries, graph-shaped repo context, bounded history, and bounded tool output keep useful evidence in the window without letting stale state take over.
- Prefix-cache discipline: prompt epochs, deterministic tool schemas, and bounded system sections protect reusable prefixes while the loop runs.
- Serving and routing remain visible: model paths can respond to cost, safety, privacy, capability, session pressure, multimodal needs, and whether a self-hosted vLLM path is enough.
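To make the prefix-cache point concrete, here is a small sketch of why deterministic tool schemas matter. It assumes a prompt built from a system section, a tool-schema section, and a growing history. Prefix caches generally reuse work only for the leading part of a prompt that is identical to an earlier one. The function names (`render_tool_schemas`, `build_prompt`, `prefix_key`) and the tool dictionaries are hypothetical and are not Inferoa's API.

```python
import hashlib
import json
from typing import Dict, List


def render_tool_schemas(tools: List[Dict]) -> str:
    """Serialize tool schemas deterministically: sorted tools, sorted keys."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


def build_prompt(system: str, tools: List[Dict], history: List[str]) -> str:
    """Put stable sections first (system, tools) and mutable history last."""
    return "\n".join([system, render_tool_schemas(tools), *history])


def prefix_key(system: str, tools: List[Dict]) -> str:
    """Hash the stable prefix; it changes only when system or tools change."""
    stable = system + "\n" + render_tool_schemas(tools)
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    tools_a = [
        {"name": "read", "params": {"path": "string"}},
        {"name": "edit", "params": {"path": "string", "patch": "string"}},
    ]
    # Same tools, different registration order and key order.
    tools_b = [
        {"params": {"patch": "string", "path": "string"}, "name": "edit"},
        {"params": {"path": "string"}, "name": "read"},
    ]
    print(prefix_key("system", tools_a) == prefix_key("system", tools_b))
```

Because the serialized tool section does not depend on registration order or dictionary key order, a later turn's prompt still begins with exactly the same text as an earlier turn's prompt, and only the appended history differs.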
Inferoa is built on top of the vLLM ecosystem and extends tokenmaxxing across the inference stack:
| Surface | Substrate | Inferoa role | Tokenmaxxing target |
|---|---|---|---|
| Loop Engineering | Loop Mode | Recursive long-horizon loops, loop tasks, attempts, verification, decisions, completion evidence, and recovery | Keep the engineering loop running until the work is proven |
| Agent Harness | Inferoa | Sessions, tools, plans, loops, resources, evidence, and prefix-cache discipline | Give the loop a durable runtime while preserving reusable prompt prefixes |
| Context Optimization | CodeGraph, RTK | Select evidence and shrink mutable context without losing task continuity | Spend fewer prompt and tool-output tokens |
| Intelligent routing | vLLM Semantic Router | Choose model paths by cost, safety, privacy, capability, and session pressure | Avoid one expensive path for every turn |
| Model Serving | vLLM Engine, vLLM Omni | Use high-throughput, memory-efficient serving and multimodal endpoints while respecting inference-engine optimization rules | Control cost, safety, privacy, and data sovereignty when an external frontier model is unnecessary |
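The routing row can be illustrated with a toy policy: filter model paths by hard constraints (capability, privacy, multimodal support), then take the cheapest one that remains. This is only a sketch under assumed data. It does not use or describe the vLLM Semantic Router API, and the path names, costs, and capability scale are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelPath:
    name: str
    cost_per_1k_tokens: float
    capability: int      # larger means more capable (hypothetical scale)
    self_hosted: bool
    multimodal: bool


@dataclass(frozen=True)
class TurnNeeds:
    min_capability: int
    private_data: bool = False
    needs_multimodal: bool = False


def choose_path(paths: List[ModelPath], needs: TurnNeeds) -> Optional[ModelPath]:
    """Return the cheapest path that satisfies every hard constraint."""
    eligible = [
        p for p in paths
        if p.capability >= needs.min_capability
        and (p.self_hosted or not needs.private_data)
        and (p.multimodal or not needs.needs_multimodal)
    ]
    # default=None covers the case where no path qualifies.
    return min(eligible, key=lambda p: p.cost_per_1k_tokens, default=None)


# Hypothetical catalogue of model paths.
PATHS = [
    ModelPath("local-small", 0.1, capability=2, self_hosted=True, multimodal=False),
    ModelPath("local-omni", 0.4, capability=3, self_hosted=True, multimodal=True),
    ModelPath("external-frontier", 3.0, capability=5, self_hosted=False, multimodal=True),
]

if __name__ == "__main__":
    print(choose_path(PATHS, TurnNeeds(min_capability=2)))
```

Under these assumptions a routine turn lands on the cheap self-hosted path, and the expensive external path is chosen only when the turn needs more capability and carries no private data.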
npm install -g inferoa@dev
The `@dev` dist-tag tracks the latest build published from `main`. The npm `latest` dist-tag is reserved for stable releases.
inferoa setup
inferoa
`inferoa setup` walks through endpoint, model, vault-backed API key, and Omni configuration. `inferoa` opens the TUI. Pass a prompt as an argument to start a session and submit it as the first user turn:
inferoa "Inspect this repository and list the test entrypoints."
Start a recursive long-horizon loop from inside the TUI:
/loop Improve this repository and prove it with tests.
Run a single non-interactive request without opening the TUI:
inferoa --print "Summarize the README in one paragraph."
- Quickstart and Architecture on the docs site for the full walk-through.
- CLI reference, Slash commands, and Configuration reference.
- The source tree under `docs/` holds internal design notes (roadmap, TUI product design, vLLM-Omni validation, public-source hygiene).
Use these commands as the task grows:
- `/loop` starts a recursive long-horizon loop: Inferoa keeps the objective, loop tasks, attempts, verification evidence, and decisions active until the work is proven.
- `/plan` turns ambiguous scope into an inspectable plan before execution.
- `/tokenmaxxing` shows token and cost pressure across prefix-cache reuse, context savings, recent turn usage, and model-selection pressure.
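As a rough illustration of what "token pressure" could mean numerically, the sketch below computes two simple ratios from per-turn usage counts: how much of the prompt was served from a prefix cache, and how full the context window is. The `TurnUsage` fields, both formulas, and the sample numbers are assumptions made for this example. They are not the metrics that `/tokenmaxxing` actually reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TurnUsage:
    prompt_tokens: int          # tokens sent as the prompt this turn
    cached_prompt_tokens: int   # portion of the prompt served from cache
    completion_tokens: int      # tokens generated this turn


def cache_reuse_ratio(turns: List[TurnUsage]) -> float:
    """Fraction of all prompt tokens that were served from the prefix cache."""
    prompt = sum(t.prompt_tokens for t in turns)
    cached = sum(t.cached_prompt_tokens for t in turns)
    return cached / prompt if prompt else 0.0


def context_pressure(latest: TurnUsage, context_window: int) -> float:
    """Fraction of the context window consumed by the latest turn."""
    used = latest.prompt_tokens + latest.completion_tokens
    return used / context_window


if __name__ == "__main__":
    # Hypothetical usage for three consecutive turns.
    turns = [
        TurnUsage(prompt_tokens=1000, cached_prompt_tokens=0, completion_tokens=100),
        TurnUsage(prompt_tokens=1200, cached_prompt_tokens=1000, completion_tokens=100),
        TurnUsage(prompt_tokens=1800, cached_prompt_tokens=1200, completion_tokens=200),
    ]
    print(cache_reuse_ratio(turns), context_pressure(turns[-1], 8000))
```

A falling reuse ratio would suggest the prompt prefix is drifting between turns, while a rising context-pressure value would suggest it is time to compress or summarize history.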
Inferoa is built for and with the vLLM ecosystem:
Thanks to the projects behind Inferoa's context optimization:
Agentic Intelligence Lab





