agentic-in/inferoa

Inferoa

Inference-native Tokenmaxxing Agent Harness for Loop Engineering

GitHub · Docs · Blog

Prompting is no longer the whole interface. The frontier is Loop Engineering: give the model an objective, feedback, verification, memory, and tools, then let it self-correct until the work is proven.

But every loop is also an inference workload. As turns accumulate, prompt prefixes drift, cache reuse collapses, stale evidence fills context, model routing gets harder, and serving choices start to matter.
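
To make prefix drift concrete, here is a minimal sketch that measures how much of the next prompt is an unchanged prefix of the previous one. It is not part of Inferoa: the function name shared_prefix_ratio and the sample prompts are illustrative assumptions, and it compares characters as a rough proxy, whereas real prefix caches match on tokens or token blocks.

```python
from os.path import commonprefix


def shared_prefix_ratio(prev_prompt: str, next_prompt: str) -> float:
    """Fraction of next_prompt that is an unchanged leading prefix of prev_prompt.

    Hypothetical helper: a character-level stand-in for prefix-cache reuse.
    """
    if not next_prompt:
        return 0.0
    shared = len(commonprefix([prev_prompt, next_prompt]))
    return shared / len(next_prompt)


system = "You are a coding agent.\n"
stable = system + "turn 1: list files\n"

# Appending a new turn keeps everything before it byte-identical.
appended = stable + "turn 2: run tests\n"

# Injecting volatile text (here a timestamp) at the top breaks the prefix at once.
drifted = "[12:00:01] " + stable + "turn 2: run tests\n"

print(shared_prefix_ratio(stable, appended))
print(shared_prefix_ratio(stable, drifted))
```

The appended prompt reuses most of its length, while the drifted prompt shares nothing with the previous turn, even though the two carry almost the same information.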

Inferoa is an Inference-native Tokenmaxxing Agent Harness for Loop Engineering:

  • Inference-native: the loop sees serving, routing, context windows, prefix cache, multimodal endpoints, and self-hosted model paths.
  • Tokenmaxxing: each turn is shaped to preserve cacheable prefixes, bound mutable context, expose token pressure, and pick the right inference path.
  • Loop Engineering: /loop runs durable recursive loops that inspect, edit, test, verify, decide, remember, and continue across loop tasks.
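
One way to read "bound mutable context" is sketched below: cap how many recent turns are kept and clip oversized tool output before it enters the window. This is an assumption about the general technique, not Inferoa's implementation; clip_tool_output, BoundedHistory, and the default limits are hypothetical names and values.

```python
from collections import deque


def clip_tool_output(text: str, max_chars: int = 2000) -> str:
    """Keep the head and tail of long tool output and mark the omitted middle."""
    if len(text) <= max_chars:
        return text
    half = max(1, max_chars // 2)
    omitted = len(text) - 2 * half
    return f"{text[:half]}\n[... {omitted} chars omitted ...]\n{text[-half:]}"


class BoundedHistory:
    """Holds at most max_turns recent turns; older turns fall off the front."""

    def __init__(self, max_turns: int) -> None:
        self._turns: deque[tuple[str, str]] = deque(maxlen=max_turns)

    def add(self, role: str, content: str) -> None:
        # Clip on the way in so one huge tool result cannot fill the window.
        self._turns.append((role, clip_tool_output(content)))

    def render(self) -> list[tuple[str, str]]:
        return list(self._turns)


history = BoundedHistory(max_turns=2)
history.add("user", "a")
history.add("tool", "b")
history.add("assistant", "c")
print(history.render())
```

A production harness would more likely summarize dropped turns than discard them, which is what the compression and summaries mentioned in this README suggest.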

Loop is All You Need

Loop Mode

[Image: Inferoa loop mode]

Code Index

[Image: Inferoa code index]

Plan Mode

[Image: Inferoa plan mode]

Loop Research

[Image: Inferoa research loop]

Why Inferoa

Inferoa = Infer (Inference-native) + o (Tokenmaxxing Loop) + a (Agent Harness).

[Image: Why Inferoa: Inference-native Loop Tokenmaxxing]

Inferoa gives the agent loop an inference-native runtime:

  • Loop/rubric driven work: /loop carries an objective across loop tasks, verification, decisions, recovery, and completion evidence instead of stopping after the next answer.
  • Independent feedback surfaces: plans, tests, tool results, research metrics, and completion evidence give the loop something concrete to improve against.
  • Memory and context control: compression, summaries, graph-shaped repo context, bounded history, and bounded tool output keep useful evidence in the window without letting stale state take over.
  • Prefix-cache discipline: prompt epochs, deterministic tool schemas, and bounded system sections protect reusable prefixes while the loop runs.
  • Serving and routing remain visible: model paths can respond to cost, safety, privacy, capability, session pressure, multimodal needs, and whether a self-hosted vLLM path is enough.
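
The prefix-cache bullet above can be illustrated with a small sketch. It assumes that "deterministic tool schemas" means serializing tool definitions in a canonical order so that logically identical tool sets yield byte-identical prompt text, and that a change in the resulting fingerprint could mark the start of a new prompt epoch. Both readings are assumptions, and canonical_tools and prefix_fingerprint are hypothetical helpers rather than Inferoa APIs.

```python
import hashlib
import json


def canonical_tools(tools: list[dict]) -> str:
    """Serialize tool schemas with stable tool order, key order, and separators."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


def prefix_fingerprint(system_prompt: str, tools: list[dict]) -> str:
    """Hash the cacheable prefix; a changed hash means the prefix was invalidated."""
    payload = system_prompt + "\n" + canonical_tools(tools)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


tools_a = [
    {"name": "read_file", "description": "Read a file", "parameters": {"type": "object"}},
    {"name": "run_tests", "description": "Run tests", "parameters": {"type": "object"}},
]
# Same tools, registered in a different order and with different key order.
tools_b = [
    {"parameters": {"type": "object"}, "name": "run_tests", "description": "Run tests"},
    {"description": "Read a file", "parameters": {"type": "object"}, "name": "read_file"},
]

same = prefix_fingerprint("system", tools_a) == prefix_fingerprint("system", tools_b)
print(same)
```

Without canonical ordering, incidental differences such as dictionary insertion order would change the prompt bytes and defeat cache reuse even though nothing meaningful changed.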

The Tokenmaxxing Stack

Inferoa is built on top of the vLLM ecosystem and extends tokenmaxxing across the inference stack:

| Surface | Substrate | Inferoa role | Tokenmaxxing target |
| --- | --- | --- | --- |
| Loop Engineering | Loop Mode | Recursive long-horizon loops, loop tasks, attempts, verification, decisions, completion evidence, and recovery | Keep the engineering loop running until the work is proven |
| Agent Harness | Inferoa | Sessions, tools, plans, loops, resources, evidence, and prefix-cache discipline | Give the loop a durable runtime while preserving reusable prompt prefixes |
| Context Optimization | CodeGraph, RTK | Select evidence and shrink mutable context without losing task continuity | Spend fewer prompt and tool-output tokens |
| Intelligent Routing | vLLM Semantic Router | Choose model paths by cost, safety, privacy, capability, and session pressure | Avoid one expensive path for every turn |
| Model Serving | vLLM Engine, vLLM Omni | Use high-throughput, memory-efficient serving and multimodal endpoints while respecting inference-engine optimization rules | Control cost, safety, privacy, and data sovereignty when an external frontier model is unnecessary |
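
The routing row can be pictured as a filter-then-minimize decision: drop model paths that violate a hard constraint, then take the cheapest one that remains. The sketch below is purely illustrative and does not use the vLLM Semantic Router API; ModelPath, Turn, choose_path, and all field names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPath:
    name: str
    cost_per_1k_tokens: float
    self_hosted: bool
    capability: int  # arbitrary scale, higher means stronger


@dataclass(frozen=True)
class Turn:
    needs_capability: int
    private_data: bool


def choose_path(turn: Turn, paths: list[ModelPath]) -> ModelPath:
    """Pick the cheapest path that is capable enough and respects privacy."""
    candidates = [
        path
        for path in paths
        if path.capability >= turn.needs_capability
        and (path.self_hosted or not turn.private_data)
    ]
    if not candidates:
        raise LookupError("no model path satisfies this turn's constraints")
    return min(candidates, key=lambda path: path.cost_per_1k_tokens)


paths = [
    ModelPath("self-hosted-small", cost_per_1k_tokens=0.1, self_hosted=True, capability=2),
    ModelPath("external-frontier", cost_per_1k_tokens=3.0, self_hosted=False, capability=5),
]

easy = choose_path(Turn(needs_capability=1, private_data=False), paths)
hard = choose_path(Turn(needs_capability=4, private_data=False), paths)
print(easy.name, hard.name)
```

Routine turns land on the cheaper self-hosted path and only demanding turns pay for the external model, which is the "avoid one expensive path for every turn" target in the table.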

/tokenmaxxing inside a session 📽️

Installation

npm install -g inferoa@dev

The @dev dist-tag tracks the latest build published from main. The npm latest dist-tag is reserved for stable releases.

Quickstart

inferoa setup
inferoa

inferoa setup walks through endpoint, model, vault-backed API key, and Omni configuration. Running inferoa then opens the TUI. Pass a prompt as an argument to start a session and submit it as the first user turn:

inferoa "Inspect this repository and list the test entrypoints."

Start a recursive long-horizon loop from inside the TUI:

/loop Improve this repository and prove it with tests.

Run a single non-interactive request without opening the TUI:

inferoa --print "Summarize the README in one paragraph."

Documentation

Core Slash Commands

Use these commands as the task grows:

  • /loop starts a recursive long-horizon loop: Inferoa keeps the objective, loop tasks, attempts, verification evidence, and decisions active until the work is proven.
  • /plan turns ambiguous scope into an inspectable plan before execution.
  • /tokenmaxxing shows token and cost pressure across prefix-cache reuse, context savings, recent turn usage, and model-selection pressure.
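
The control flow that /loop describes, attempting, verifying, recording evidence, and continuing until the work is proven, can be sketched generically as follows. This is an illustration of the pattern under stated assumptions, not Inferoa's code: run_loop, its parameters, and the attempt budget are hypothetical.

```python
from typing import Callable


def run_loop(
    attempt: Callable[[int, list[str]], str],
    verify: Callable[[str], tuple[bool, str]],
    max_attempts: int = 5,
) -> tuple[bool, list[str]]:
    """Retry until verify() passes or the budget runs out, keeping evidence."""
    evidence: list[str] = []
    for number in range(1, max_attempts + 1):
        # Each attempt sees the evidence gathered so far and can self-correct.
        result = attempt(number, evidence)
        passed, note = verify(result)
        evidence.append(f"attempt {number}: {note}")
        if passed:
            return True, evidence
    return False, evidence


# Toy stand-ins: the "work" succeeds on the third attempt.
proven, evidence = run_loop(
    attempt=lambda number, _evidence: f"patch-{number}",
    verify=lambda result: (result == "patch-3", f"{result} tested"),
)
print(proven, evidence)
```

In a real harness, verify would be an independent feedback surface such as a test run, so the loop stops on evidence rather than on the model's own claim of success.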

Acknowledgements

Inferoa is built for and with the vLLM ecosystem: vLLM Engine, vLLM Omni, and vLLM Semantic Router.

Thanks to the projects behind Inferoa's context optimization: CodeGraph and RTK.

Contributors

Agentic Intelligence Lab