GitHub - agentic-in/inferoa: Inference-native Tokenmaxxing Agent Harness for Loop Engineering

Prompting is no longer the whole interface. The frontier is Loop Engineering: give the model an objective, feedback, verification, memory, and tools, then let it self-correct until the work is proven.

But every loop is also an inference workload. As turns accumulate, prompt prefixes drift, cache reuse collapses, stale evidence fills context, model routing gets harder, and serving choices start to matter.

Inferoa is an Inference-native Tokenmaxxing Agent Harness for Loop Engineering:

Inference-native: the loop sees serving, routing, context windows, prefix cache, multimodal endpoints, and self-hosted model paths.
Tokenmaxxing: each turn is shaped to preserve cacheable prefixes, bound mutable context, expose token pressure, and pick the right inference path.
Loop Engineering: /loop runs durable recursive loops that inspect, edit, test, verify, decide, remember, and continue across loop tasks.
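The loop shape described above (act, verify, remember, continue until proven) can be sketched in a few lines. This is a minimal illustration, not Inferoa's implementation: `runLoop`, `Attempt`, `LoopResult`, and the `act`/`verify` callbacks are hypothetical names invented for this example, and a real loop would call a model and real tools instead of plain functions.

```typescript
// Hypothetical types for illustration; none of these names come from Inferoa.
interface Attempt {
  turn: number;
  action: string;
  passed: boolean;
  feedback: string;
}

interface LoopResult {
  proven: boolean;
  attempts: Attempt[];
}

// Run act -> verify until the verifier passes or the turn budget is spent.
function runLoop(
  objective: string,
  act: (objective: string, memory: readonly Attempt[]) => string,
  verify: (action: string) => { passed: boolean; feedback: string },
  maxTurns: number,
): LoopResult {
  const attempts: Attempt[] = [];
  for (let turn = 1; turn <= maxTurns; turn++) {
    // Earlier attempts are passed back in so the next action can self-correct.
    const action = act(objective, attempts);
    const { passed, feedback } = verify(action);
    attempts.push({ turn, action, passed, feedback });
    if (passed) return { proven: true, attempts };
  }
  return { proven: false, attempts };
}
```

The turn budget is a design choice of this sketch: it stops a loop whose verifier never passes, and the recorded attempts double as completion evidence.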

Loop is All You Need

Loop Mode, Code Index, Plan Mode, Loop Research

Why Inferoa

Inferoa = Infer(Inference-native)o(Tokenmaxxing Loop)a(Agent Harness).

Inferoa gives the engineering loop an inference-native runtime:

Loop- and rubric-driven work: /loop carries an objective across loop tasks, verification, decisions, recovery, and completion evidence instead of stopping after the next answer.
Independent feedback surfaces: plans, tests, tool results, research metrics, and completion evidence give the loop something concrete to improve against.
Memory and context control: compression, summaries, graph-shaped repo context, bounded history, and bounded tool output keep useful evidence in the window without letting stale state take over.
Prefix-cache discipline: prompt epochs, deterministic tool schemas, and bounded system sections protect reusable prefixes while the loop runs.
Serving and routing remain visible: model paths can respond to cost, safety, privacy, capability, session pressure, multimodal needs, and whether a self-hosted vLLM path is enough.
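To make the prefix-cache point concrete, here is a minimal sketch of one way to keep a prompt prefix byte-stable across turns: serialize tool schemas with sorted keys, order tools by name, and append mutable context last. It assumes a serving engine that reuses cached work for identical leading text. `stableStringify`, `buildPrompt`, `Tool`, and `Json` are hypothetical names for this example and are not Inferoa's API.

```typescript
// Hypothetical shapes for illustration; Inferoa's real schema format is not shown here.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

interface Tool {
  name: string;
  schema: Json;
}

const byText = (a: string, b: string): number => (a < b ? -1 : a > b ? 1 : 0);

// Serialize with object keys sorted, so equal schemas always yield equal text.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => stableStringify(v)).join(",") + "]";
  const entries = Object.entries(value).sort(([a], [b]) => byText(a, b));
  return "{" + entries.map(([k, v]) => JSON.stringify(k) + ":" + stableStringify(v)).join(",") + "}";
}

// Stable sections first (system text, tools ordered by name), mutable context last.
function buildPrompt(system: string, tools: Tool[], mutableContext: string) {
  const ordered = [...tools].sort((a, b) => byText(a.name, b.name));
  const toolText = ordered.map((t) => t.name + " " + stableStringify(t.schema)).join("\n");
  const prefix = system + "\n" + toolText + "\n";
  return { prefix, prompt: prefix + mutableContext };
}
```

With this layout, two turns that register the same tools in a different order, or with differently ordered schema keys, still share an identical prefix, and only the tail changes between turns.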

The Tokenmaxxing Stack

Inferoa is built on top of the vLLM ecosystem and extends tokenmaxxing across the inference stack:

Surface	Substrate	Inferoa role	Tokenmaxxing target
Loop Engineering	Loop Mode	Recursive long-horizon loops, loop tasks, attempts, verification, decisions, completion evidence, and recovery	Keep the engineering loop running until the work is proven
Agent Harness	Inferoa	Sessions, tools, plans, loops, resources, evidence, and prefix-cache discipline	Give the loop a durable runtime while preserving reusable prompt prefixes
Context Optimization	CodeGraph, RTK	Select evidence and shrink mutable context without losing task continuity	Spend fewer prompt and tool-output tokens
Intelligent Routing	vLLM Semantic Router	Choose model paths by cost, safety, privacy, capability, and session pressure	Avoid one expensive path for every turn
Model Serving	vLLM Engine, vLLM Omni	Use high-throughput, memory-efficient serving and multimodal endpoints while respecting inference-engine optimization rules	Control cost, safety, privacy, and data sovereignty when an external frontier model is unnecessary
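As a rough illustration of the routing row above, the sketch below filters candidate model paths by hard constraints (capability, privacy, multimodal support) and then picks the cheapest one left. It is not the vLLM Semantic Router API: `ModelPath`, `TurnNeeds`, `choosePath`, and every field and score are assumptions made up for this example.

```typescript
// Hypothetical model paths and request fields; not the vLLM Semantic Router API.
interface ModelPath {
  name: string;
  selfHosted: boolean;
  costPerMTok: number; // illustrative price per million tokens
  capability: number; // illustrative score, higher means stronger
  multimodal: boolean;
}

interface TurnNeeds {
  minCapability: number;
  privateData: boolean;
  needsMultimodal: boolean;
}

// Filter by hard constraints, then take the cheapest path that remains.
function choosePath(paths: ModelPath[], needs: TurnNeeds): ModelPath | undefined {
  const eligible = paths.filter(
    (p) =>
      p.capability >= needs.minCapability &&
      (!needs.privateData || p.selfHosted) &&
      (!needs.needsMultimodal || p.multimodal),
  );
  return eligible.reduce<ModelPath | undefined>(
    (best, p) => (best === undefined || p.costPerMTok < best.costPerMTok ? p : best),
    undefined,
  );
}
```

Returning `undefined` when nothing qualifies keeps the failure explicit, so a caller can decide whether to relax a constraint or stop, instead of silently falling back to an expensive path.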

/tokenmaxxing inside a session 📽️

Installation

npm install -g inferoa@dev

The @dev dist-tag tracks the latest build published from main. The npm latest dist-tag is reserved for stable releases.

Quickstart

inferoa setup
inferoa

inferoa setup walks through endpoint, model, vault-backed API key, and Omni configuration. inferoa opens the TUI. Pass a prompt as an argument to start a session and submit it as the first user turn:

inferoa "Inspect this repository and list the test entrypoints."

Start a recursive long-horizon loop from inside the TUI:

/loop Improve this repository and prove it with tests.

Run a single non-interactive request without opening the TUI:

inferoa --print "Summarize the README in one paragraph."

Documentation

See Quickstart and Architecture on the docs site for the full walk-through.
Reference pages: CLI reference, Slash commands, and Configuration reference.
The source tree under docs/ holds internal design notes (roadmap, TUI product design, vLLM-Omni validation, public-source hygiene).

Core Slash Commands

Use these commands as the task grows:

/loop starts a recursive long-horizon loop: Inferoa keeps the objective, loop tasks, attempts, verification evidence, and decisions active until the work is proven.
/plan turns ambiguous scope into an inspectable plan before execution.
/tokenmaxxing shows token and cost pressure across prefix-cache reuse, context savings, recent turn usage, and model-selection pressure.
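The kind of signal a token-pressure view could report can be approximated from per-turn usage counts. The sketch below is an assumption-heavy illustration, not the logic behind /tokenmaxxing: `TurnUsage`, `Pressure`, `tokenPressure`, and all field names are hypothetical, and it assumes the serving endpoint reports how many prompt tokens were served from the prefix cache.

```typescript
// Hypothetical per-turn usage record; field names are assumptions, not Inferoa's.
interface TurnUsage {
  promptTokens: number;
  cachedPromptTokens: number; // prompt tokens served from the prefix cache
  completionTokens: number;
}

interface Pressure {
  cacheReuse: number; // share of prompt tokens that hit the cache, 0..1
  windowUsed: number; // latest prompt size as a share of the context window
  totalTokens: number; // prompt plus completion tokens across all turns
}

function tokenPressure(turns: TurnUsage[], contextWindow: number): Pressure {
  let prompt = 0;
  let cached = 0;
  let completion = 0;
  for (const t of turns) {
    prompt += t.promptTokens;
    cached += t.cachedPromptTokens;
    completion += t.completionTokens;
  }
  const last = turns[turns.length - 1];
  return {
    cacheReuse: prompt === 0 ? 0 : cached / prompt,
    windowUsed: last === undefined ? 0 : last.promptTokens / contextWindow,
    totalTokens: prompt + completion,
  };
}
```

A falling `cacheReuse` across a session would suggest the prefix is drifting, while a rising `windowUsed` would suggest mutable context needs compression.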

Acknowledgements

Inferoa is built for and with the vLLM ecosystem, including vLLM Engine, vLLM Omni, and vLLM Semantic Router.

Thanks to the projects behind Inferoa's context optimization: CodeGraph and RTK.

Contributors

Agentic Intelligence Lab
