A JIT optimization runtime for multi-step LLM agent workloads.
Multi-agent systems cost 4-15x more tokens than single-agent baselines. Frameworks fire LLM calls and eat the cost; few systems optimize the application-level call trace before or while it runs. Agentc sits between agent frameworks and the LLM APIs, intercepts calls as they happen, and rewrites them to be cheaper without requiring application-code changes for supported adapters.
Agent Frameworks Claude Code · LangGraph · CrewAI · AutoGen
│ describe what to do
▼
Agentc intercepts calls, optimizes execution
│ decides how to do it cheaply
▼
LLM APIs Anthropic · OpenAI · Gemini
raw inference
Think of it like a compiler for agent workloads. Frameworks describe what to do. Agentc decides how to do it cheaply.
The runtime, profiler, semantic memoization layer, and JIT optimizer are all implemented and pass their test suites. The agentc CLI ships with record, traces, analyze, report, cache, and optimize subcommands.
V2 extends the per-call optimizer with a CompositionPlanner that classifies rules by cost driver (InputTokens / OutputTokens / ModelPrice / CallElimination / Structural), applies orthogonal rules in dependency order, and produces Plan::Composed audit rows. Three new rules ship in V2: PromptDedup, OutputBudget, and StructuredTruncation (last two not yet independently benchmarked). Toggle with AGENTC_COMPOSE=1 (default on).
Per-rule savings (purpose-built isolation workloads):
| Rule | Workload | n | Cost savings | Accuracy Δ | McNemar p |
|---|---|---|---|---|---|
| ModelDowngrade | gaia_router | 127 | 35.3% | −2.4pp (±3.1pp SE) | n/a (unpaired) |
| ContextCompress | long_context_qa | 100 | 34.8% input-tokens | −2pp (±3.5pp SE) | n/a (unpaired) |
| StateDrop | iterative_refiner | 50 | 6.0% cost / 9.6% input-tokens | −2pp | — |
V2 composition and comparison results:
| Experiment | n | Result | McNemar p |
|---|---|---|---|
| CC vs LLMLingua-2, HotpotQA distractor | 100 | CC: 68%→100% (FB=32, BF=0); LLMLingua-2: 68%→53% | CC: 4.7×10⁻¹⁰; LL2: 0.0013 |
| CC vs LLMLingua-2, Wikipedia natural prose | 39 | CC: 94.9%→94.9% (BF=0, FB=0, abstained); LL2: +2.6pp, 53.5% compression, 13.7s overhead | CC: 1.0; LL2: 1.0 |
| CC+StateDrop composition, multirule_qa | 30 | CC: 33.1% token savings; SD: 0.1%; CC+SD: 21.7% (gate picks CC on most calls; fixture-specific ratio) — confirmatory n=20 ablation: all-on=31.3% ≈ CC-only | all p≥0.48 |
| Planner ablation (V1 vs V2) | 50 | V1-CC+OB: −2pp (greedy wrong pick); V2-CC+OB: +0pp (gate corrects) | V2-CC: 0.0412 |
| Agent diversity (rag_summarizer + autogen_bridge) | — | CC fires 30–54% of hot calls; SD fires 9–24% | — |
| Provider generalization (Anthropic Claude, HF Llama) | 50 each | CC: 98% fire rate / 34% tok savings (HF); 0% (Anthropic single-msg); MD: 14.7% savings (Anthropic), 31.1% (HF); autogen_bridge on Llama matches OpenAI activation | — |
| StateDrop negative control (all-state-read variant) | 20 | 0/319 SD fires when all state writes have matching reads; confirms unread-state precondition | — |
| Optimizer overhead (1,818 plan decisions) | — | pass-through p50=76µs; rewrite p50=120µs; p99 tail from first-call load | — |
ParallelBranch ships and emits audit rows; the latency win currently comes from the user-side parallel_map ThreadPoolExecutor. CacheHit functions as a bridge between memoized and non-memoized callers; neither is a headline paper claim yet.
crates/ Rust workspace (7 crates)
├── agentc-core span schema, SQLite storage, hardening, embedding I/O
├── agentc-embed model2vec embeddings + LSH for semantic memoization
├── agentc-memo memoization cache: canonical keys, eviction, FFI
├── agentc-profiler PyO3 module: Python bindings to span writer
├── agentc-analyzer cost breakdown + waste detectors over stored traces
├── agentc-optimizer DAG IR, cost model, planner, rewrite rules, CompositionPlanner
└── agentc-cli `agentc` binary
python/agentc/ Python SDK
├── _patches/ wrapt-based monkey patches: anthropic, openai, google
├── _provenance_frameworks/ framework adapters: langgraph, crewai, autogen
├── _canonicalize/ per-vendor request canonicalization
└── _intercept.py optimizer entry point: plan → dispatch → observe
bench/ Evaluation harness
├── agents/ Reference agents — per-rule isolation + composition probes:
│ long_context_qa, long_context_qa_anthropic, long_context_qa_hf,
│ iterative_refiner, iterative_refiner_allread (negative control),
│ gaia_router, hotpot_qa, composition_qa, multirule_qa,
│ rag_summarizer, autogen_bridge, support_qa, swebench_planner
├── build_*_fixture.py public-dataset → JSON converters (hotpot, gaia,
│ long_context, wikipedia_qa)
├── fixtures/ (gitignored, regenerated by build_*_fixture.py)
├── optimizer_bench.py run an agent twice (optimizer off / on)
├── optimizer_ablation.py 11-config sweep per agent: shared baseline +
│ <rule>-off ×5 + <rule>-only ×5
├── paired_analysis.py McNemar exact test + bootstrap CI on per_task sidecars
├── paper_results/ committed result CSVs + summary txts
└── scripts/ driver shell scripts
specs/ Technical specifications
paper-intelligence/ Paper evidence, literature, venue, and experiment ledgers
tests/ Python unit tests (~250 tests)
Three pieces that work together in a feedback loop.
The profiler instruments any Python agent pipeline, captures every LLM call (tokens, latency, model, cost, full prompt/response, embedding), and produces structured execution traces in SQLite. It is implemented in Rust via PyO3, with wrapt-based zero-config monkey-patching of the OpenAI, Anthropic, and Google SDKs. Spec: specs/profiler.md.
Semantic memoization is opt-in caching that deduplicates LLM inference. It uses an exact-prompt hash lookup on the hot path, with LSH over 256-dim model2vec embeddings as a secondary tier for semantically similar prompts. Cache state piggybacks on the profiler's canonical traces.db. Spec: specs/memoization.md.
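The two-tier lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the agentc-memo implementation: the random-hyperplane LSH scheme, the 16-bit signature, the 0.9 cosine threshold, and every name below (`TwoTierCache`, `lsh_signature`, and so on) are assumptions made for the example. Only the exact-hash-first ordering and the 256-dim embedding width come from the text.

```python
import hashlib
import math
import random

DIM = 256      # embedding width named in the text
N_BITS = 16    # hypothetical LSH signature length


def exact_key(prompt: str) -> str:
    """Tier 1 key: a hash of the exact prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def make_hyperplanes(n_bits: int, dim: int, seed: int = 0):
    """Random hyperplanes; each contributes one bit to the signature."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, planes) -> int:
    """Pack the sign of each hyperplane projection into an integer."""
    bits = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0.0 else 0)
    return bits


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, planes, min_similarity: float = 0.9):
        self.planes = planes
        self.min_similarity = min_similarity  # hypothetical threshold
        self.exact = {}     # prompt hash -> response
        self.buckets = {}   # LSH signature -> [(embedding, response)]

    def put(self, prompt, embedding, response):
        self.exact[exact_key(prompt)] = response
        sig = lsh_signature(embedding, self.planes)
        self.buckets.setdefault(sig, []).append((embedding, response))

    def get(self, prompt, embedding):
        # Tier 1: exact-prompt hash on the hot path.
        hit = self.exact.get(exact_key(prompt))
        if hit is not None:
            return "exact", hit
        # Tier 2: same LSH bucket, then confirm with cosine similarity.
        sig = lsh_signature(embedding, self.planes)
        for stored, response in self.buckets.get(sig, []):
            if cosine(stored, embedding) >= self.min_similarity:
                return "semantic", response
        return "miss", None
```

The cosine check after the bucket lookup is a design choice of this sketch: a shared LSH bucket only suggests similarity, so candidates are confirmed before a cached response is replayed.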
The optimizer is a JIT runtime that intercepts LLM calls on hot call sites and applies cost-ranked rewrite rules subject to a per-rule accuracy budget. Eight rules ship across V1 and V2:
| Rule | Cost driver | What it does | Status |
|---|---|---|---|
| CacheHit | CallElimination | Replay output for a past prompt via the shared memoization cache | implemented, future benchmark |
| ContextCompress | InputTokens | Extractively drop low-attention messages from large prompts (8KB+ gate, IDF-weighted proxy) | headline-validated |
| ParallelBranch | Structural | Detect dependency-free sibling calls, emit Plan::Parallel for async dispatcher | implemented, observability only |
| ModelDowngrade | ModelPrice | Swap to a cheaper model when the cost model says accuracy holds | headline-validated |
| StateDrop | InputTokens | Prune state-tagged messages whose keys aren't in the current read window | validated (supporting) |
| PromptDedup | InputTokens | Remove near-duplicate message segments via per-call IDF | V2, benchmarked |
| OutputBudget | OutputTokens | Cap max_output_tokens at call-site p99 to prevent runaway generation | V2, benchmarked |
| StructuredTruncation | InputTokens | Project out unreferenced JSON tool-output fields | V2, not yet independently benchmarked |
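As one concrete instance of a rule from the table, the OutputBudget idea (cap `max_output_tokens` at the call site's p99) might look roughly like the sketch below. The nearest-rank percentile definition, the plain-dict request shape, and the function names are assumptions for illustration; the real rule lives in the Rust optimizer and may compute the percentile differently.

```python
def p99(values):
    """Nearest-rank 99th percentile of a non-empty list of ints."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    # ceil(0.99 * n) computed with integers to avoid float rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def apply_output_budget(request, observed_output_tokens):
    """Return a copy of `request` with max_output_tokens capped at p99.

    `request` is a hypothetical plain-dict view of an LLM call; the
    rule never raises an existing, tighter cap.
    """
    cap = p99(observed_output_tokens)
    current = request.get("max_output_tokens")
    if current is not None and current <= cap:
        return request
    return {**request, "max_output_tokens": cap}
```

A call site whose past completions ranged from 1 to 100 tokens would have a 4096-token limit lowered to 99 under this definition, while a request already capped at 50 would pass through unchanged.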
V2 CompositionPlanner: classifies rules by CostDriver and lets orthogonal rules (different drivers, hence non-overlapping Call fields) apply in a single pass as Plan::Composed. Same-driver rules are gated unless explicitly allowlisted (e.g., StateDrop → ContextCompress). Composition is controlled by AGENTC_COMPOSE=1 (the default); V1 first-match behavior remains available via AGENTC_COMPOSE=0.
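The gating logic can be pictured with a small Python sketch. It is an assumption-laden illustration rather than the Rust planner: the greedy highest-savings-first selection, the `Rule` dataclass, and the `est_savings` field are invented for the example. The driver names and the StateDrop → ContextCompress allowlist entry come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    driver: str          # e.g. "InputTokens", "OutputTokens", "ModelPrice"
    est_savings: float   # hypothetical per-call savings estimate


# Same-driver pairs that may compose, in the order they should run.
ORDERED_ALLOWLIST = [("StateDrop", "ContextCompress")]


def _allowed_together(a: Rule, b: Rule) -> bool:
    return ((a.name, b.name) in ORDERED_ALLOWLIST
            or (b.name, a.name) in ORDERED_ALLOWLIST)


def compose(candidates):
    """Pick rules greedily; reject same-driver clashes unless allowlisted."""
    chosen = []
    for rule in sorted(candidates, key=lambda r: r.est_savings, reverse=True):
        clash = any(p.driver == rule.driver and not _allowed_together(p, rule)
                    for p in chosen)
        if not clash:
            chosen.append(rule)
    names = [r.name for r in chosen]
    # Honor the allowlist ordering (first element runs before second).
    for first, second in ORDERED_ALLOWLIST:
        if first in names and second in names:
            i, j = names.index(first), names.index(second)
            if i > j:
                names[i], names[j] = names[j], names[i]
    return names
```

In this sketch, PromptDedup loses to a higher-ranked ContextCompress because both target InputTokens and the pair is not allowlisted, while OutputBudget and ModelDowngrade compose freely because they touch different drivers.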
Cold calls pass through; optimization engages after hot_threshold observations (default 3), when the empirical cost model has real per-call-site data. 2% shadow-mode sampling provides ground-truth divergence for the accuracy budget. Spec: specs/optimizer.md.
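A rough sketch of the cold/hot gate and shadow sampling follows. The `HotGate` class and the three decision labels are hypothetical, and treating a shadow call as one where the unmodified call is also run for comparison is one plausible reading of the text, not a description of the actual runtime. The defaults (threshold 3, 2% sampling) are the ones stated above.

```python
import random
from collections import Counter


class HotGate:
    """Decide per call whether to pass through, optimize, or shadow."""

    def __init__(self, hot_threshold=3, shadow_rate=0.02, rng=None):
        self.hot_threshold = hot_threshold
        self.shadow_rate = shadow_rate
        self.seen = Counter()              # call_site -> observations so far
        self.rng = rng or random.Random()

    def decide(self, call_site: str) -> str:
        observed = self.seen[call_site]
        self.seen[call_site] += 1
        if observed < self.hot_threshold:
            # Cold: too little per-call-site data for the cost model.
            return "pass_through"
        if self.rng.random() < self.shadow_rate:
            # Sampled: compare the rewrite against the original call.
            return "shadow"
        return "optimize"
```

With the defaults, the first three calls from a given call site pass through and the fourth is the first one eligible for rewriting.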
Build the Rust workspace and install the Python SDK:
cargo build --release
maturin develop --release # builds the PyO3 extension into the active venv
pip install -e ".[dev,openai]"
Profile any agent script:
agentc record -- python my_agent.py
agentc traces # list recent runs
agentc analyze <trace_id> # cost breakdown + waste detection
agentc report --last 20 # aggregate across runs
Inspect the optimizer:
agentc optimize report # rule firing rates, savings, accuracy
agentc optimize inspect <call_site> # cost model + ablation status per call site
agentc optimize disable --rule ModelDowngrade --call-site 'app.*' --hours 24
agentc optimize bench --agent path/to/agent.py
Run the reference benchmark suite end-to-end:
# 1. Build fixtures from public datasets (HF_TOKEN required for GAIA)
python -m bench.build_hotpot_fixture
python -m bench.build_gaia_fixture
python -m bench.build_long_context_fixture
# 2. Run baseline vs. optimized for one agent
python -m bench.optimizer_bench bench.agents.rag_summarizer
# 3. 11-config per-rule ablation for one agent
python -m bench.optimizer_ablation bench.agents.long_context_qa
# 4. Reproduce paper experiments (ContextCompress + StateDrop)
bash bench/scripts/run_paper_ablation.sh
The reference agents stub LLM calls when no API key is set, so the harness wires up cleanly without spending money. To generate real cost/accuracy numbers, set OPENAI_API_KEY (and HF_TOKEN for GAIA) in .env.
Models keep getting better at managing their own context and tool usage. But Agentc operates on things the model literally cannot see or control:
| Rule | Level | Can model training cover this by itself? |
|---|---|---|
| CacheHit | Memoization | Not without an external cache and invalidation policy |
| ContextCompress | Input shaping | Not without control over what gets sent to the model |
| ParallelBranch | Runtime scheduling | Not without visibility into sibling calls and side effects |
| ModelDowngrade | Routing | Partially; it overlaps with learned routing, but model choice happens outside the selected model |
| StateDrop | Memory management | Not without external state metadata and read/write policy |
KV cache scheduling, parallel dispatch, semantic memoization, and runtime model selection are infrastructure problems. The model usually sees one request, not the full execution environment.
Incoming LLM Call (intercepted via SDK monkey-patch)
│
▼
DAG Builder → adds typed node + edges to the execution graph
│
▼
Cost Model → scores token cost · latency · accuracy per strategy
│
▼
Optimizer → picks cheapest strategy within the accuracy budget
│
▼
Executor → runs the optimized call, instruments the result
│
▼
Cost Model ← feeds real execution data back (the loop closes here)
│
▼
Output
The cost model starts with conservative heuristics and is empirically calibrated with every execution. Agentc gets cheaper the more you use it.
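The select-then-calibrate loop can be illustrated with a short Python sketch. Everything here is assumed for the example: the `Estimate` dataclass, the running-mean update, treating the heuristic prior as one pseudo-observation, and the strategy names. The actual cost model is specified in specs/optimizer.md.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    cost: float            # current cost estimate for this strategy
    accuracy_drop: float   # estimated accuracy loss vs. baseline (fraction)
    n: int = 1             # the heuristic prior counts as one observation

    def update(self, observed_cost: float) -> None:
        """Fold one real execution into the running mean."""
        self.n += 1
        self.cost += (observed_cost - self.cost) / self.n


def pick(estimates, budget: float) -> str:
    """Cheapest strategy whose estimated accuracy loss fits the budget."""
    eligible = {k: e for k, e in estimates.items()
                if e.accuracy_drop <= budget}
    if not eligible:
        return "pass_through"   # nothing qualifies: leave the call alone
    return min(eligible, key=lambda k: eligible[k].cost)
```

Each executed call would then feed its measured cost back through `update`, so later `pick` decisions rest on observed data rather than the initial heuristic.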
The runtime fails open: every FFI boundary, every patch, every framework adapter is wrapped so a bug in Agentc never breaks the user's agent. The worst-case fallback is to pass the call through unmodified.
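The fail-open behavior can be sketched as a wrapper around the outgoing call. This is an illustrative pattern, not Agentc's actual patch code: `fail_open`, `rewrite`, and `send` are hypothetical names, and requests are modeled as plain dicts.

```python
import functools
import logging

log = logging.getLogger("agentc.sketch")   # hypothetical logger name


def fail_open(rewrite, send):
    """Wrap `send` so that any failure in `rewrite` is swallowed.

    `rewrite` stands in for the optimizer and `send` for the real SDK
    call. Errors raised by `send` itself still propagate to the caller.
    """
    @functools.wraps(send)
    def wrapper(request):
        try:
            candidate = rewrite(request)
        except Exception:
            # Worst case: pass the call through unmodified.
            log.exception("rewrite failed; passing call through unmodified")
            candidate = request
        return send(candidate)
    return wrapper
```

Only the optimizer path sits inside the `try` block, so a provider error surfaces exactly as it would without the wrapper.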
- Rust core for everything performance-critical: span writer, embeddings, memoization cache, optimizer planner, CLI.
- Python SDK and bindings for SDK instrumentation, framework adapters, evaluation harness.
- PyO3 + maturin bridges them. Single wheel, native speed where it matters.
Storage is three SQLite databases under ~/.agentc: traces.db (spans + cache entries), cost_model.db (per-call-site empirical cost model + shadow-mode divergence), and optimizer_audit.db (every plan decision the optimizer ever made, with the runtime input that produced it).
- specs/profiler.md - instrumentation, span schema, waste detectors
- specs/memoization.md - canonical keys, LSH, cache lifecycle
- specs/optimizer.md - DAG IR, cost model, rule definitions, accuracy budget
- specs/future-work.md - items intentionally out of scope
The closest system-level neighbors are Agentix/Autellix (serving-layer scheduler), Halo (batch workflow DAG optimizer over shared GPU), Murakkab (declarative resource allocation), and Cognify (offline autotuning loop) — each requires access to the serving stack, a declarative workflow format, or a labeled offline evaluator. Agentc operates at the Python SDK call site: it patches the SDK at import time and intercepts every LLM call regardless of which framework issued it, with no serving access, no application annotations beyond optional agentc.state_write(), and no offline evaluator.
Model routing (FrugalGPT, RouteLLM, LLMSelector), prompt compression (LLMLingua, LLMLingua-2, Selective Context), semantic caching (GPTCache, vCache), and parallel tool calling (LLMCompiler) are all active areas. Agentc's ModelDowngrade, ContextCompress, CacheHit, and ParallelBranch cover the same goals but as passes under one runtime policy rather than standalone systems.
On the direct compression baseline: LLMLingua-2 (token-level proxy classifier, 53% reduction) degrades accuracy from 68% to 53% on HotpotQA-distractor (McNemar p=0.0013); ContextCompress improves it 68%→100% (p=4.7×10⁻¹⁰) by operating at message granularity. On natural Wikipedia prose (no injected distractors), ContextCompress correctly abstains — identical outcomes to baseline on all 39 tasks. LLMLingua-2 still compresses 53.5% of tokens with no significant accuracy gain and 13.7s overhead per task. The difference is granularity: message-level extraction vs. token-level scoring.
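The reported ContextCompress p-value can be checked by hand with a two-sided exact McNemar test on the discordant counts (FB=32, BF=0). The function below is a minimal sketch of that standard test; whether bench/paired_analysis.py computes it in exactly this form is an assumption.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant pair falls on either
    side with probability 1/2, so the smaller count follows
    Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0   # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


p = mcnemar_exact(32, 0)   # 2 * 0.5**32 = 2**-31, about 4.7e-10
```

With no discordant pairs at all, as in the Wikipedia run (BF=0, FB=0), the same function returns 1.0, which matches the table.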
Agentc: agent compiler. The -c suffix nods to the compiler toolchain tradition (rustc, javac, gcc). Agentc occupies the same role in the agent stack that a compiler occupies in the software stack: it takes a high-level specification and produces an efficient execution plan.