Agentc

A JIT optimization runtime for multi-step LLM agent workloads.

Multi-agent systems consume 4-15x more tokens than single-agent baselines. Most frameworks issue LLM calls as written and absorb the cost; few systems optimize the application-level call trace before or while it runs. Agentc sits between agent frameworks and the LLM APIs, intercepts calls as they happen, and rewrites them to be cheaper. For supported adapters this requires no application-code changes.

  Agent Frameworks       Claude Code · LangGraph · CrewAI · AutoGen
         │               describe what to do
         ▼
      Agentc             intercepts calls, optimizes execution
         │               decides how to do it cheaply
         ▼
      LLM APIs           Anthropic · OpenAI · Gemini
                         raw inference

Think of it like a compiler for agent workloads. Frameworks describe what to do. Agentc decides how to do it cheaply.

Status

The runtime, profiler, semantic memoization layer, and JIT optimizer are all implemented and pass their test suites. The agentc CLI ships with record, traces, analyze, report, cache, and optimize subcommands.

V2 extends the per-call optimizer with a CompositionPlanner that classifies rules by cost driver (InputTokens / OutputTokens / ModelPrice / CallElimination / Structural), applies orthogonal rules in dependency order, and produces Plan::Composed audit rows. Three new rules ship in V2: PromptDedup, OutputBudget, and StructuredTruncation (last two not yet independently benchmarked). Toggle with AGENTC_COMPOSE=1 (default on).

Per-rule savings (purpose-built isolation workloads):

Rule	Workload	n	Cost savings	Accuracy Δ	McNemar p
`ModelDowngrade`	`gaia_router`	127	35.3%	−2.4pp (±3.1pp SE)	n/a (unpaired)
`ContextCompress`	`long_context_qa`	100	34.8% input-tokens	−2pp (±3.5pp SE)	n/a (unpaired)
`StateDrop`	`iterative_refiner`	50	6.0% cost / 9.6% input-tokens	−2pp	—

V2 composition and comparison results:

Experiment	n	Result	McNemar p
CC vs LLMLingua-2, HotpotQA distractor	100	CC: 68%→100% (FB=32, BF=0); LLMLingua-2: 68%→53%	CC: 4.7×10⁻¹⁰; LL2: 0.0013
CC vs LLMLingua-2, Wikipedia natural prose	39	CC: 94.9%→94.9% (BF=0, FB=0, abstained); LL2: +2.6pp, 53.5% compression, 13.7s overhead	CC: 1.0; LL2: 1.0
CC+StateDrop composition, multirule_qa	30	CC: 33.1% token savings; SD: 0.1%; CC+SD: 21.7% (gate picks CC on most calls; fixture-specific ratio) — confirmatory n=20 ablation: all-on=31.3% ≈ CC-only	all p≥0.48
Planner ablation (V1 vs V2)	50	V1-CC+OB: −2pp (greedy wrong pick); V2-CC+OB: +0pp (gate corrects)	V2-CC: 0.0412
Agent diversity (rag_summarizer + autogen_bridge)	—	CC fires 30–54% of hot calls; SD fires 9–24%	—
Provider generalization (Anthropic Claude, HF Llama)	50 each	CC: 98% fire rate / 34% tok savings (HF); 0% (Anthropic single-msg); MD: 14.7% savings (Anthropic), 31.1% (HF); autogen_bridge on Llama matches OpenAI activation	—
StateDrop negative control (all-state-read variant)	20	0/319 SD fires when all state writes have matching reads; confirms unread-state precondition	—
Optimizer overhead (1,818 plan decisions)	—	pass-through p50=76µs; rewrite p50=120µs; p99 tail from first-call load	—
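
The McNemar p column can be sanity-checked by hand. The sketch below is a generic two-sided exact McNemar test, not the code in bench/paired_analysis.py; it assumes FB and BF in the table are the two discordant-pair counts (tasks whose outcome flipped in each direction between baseline and optimized runs). The function name `mcnemar_exact` is hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    Under the null hypothesis each discordant pair is a fair coin flip,
    so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Probability of seeing at most k flips in the rarer direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# HotpotQA distractor row: 32 flips in one direction, 0 in the other.
p = mcnemar_exact(32, 0)
print(f"{p:.1e}")  # 4.7e-10
```

With 32 and 0 the result is 2 × 0.5³², which rounds to the 4.7×10⁻¹⁰ reported for ContextCompress. With no discordant pairs at all, as in the Wikipedia row, the test returns 1.0.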

ParallelBranch ships and emits audit rows; the latency win currently comes from the user-side parallel_map ThreadPoolExecutor. CacheHit functions as a bridge between memoized and non-memoized callers; neither is a headline paper claim yet.

crates/                      Rust workspace (7 crates)
├── agentc-core              span schema, SQLite storage, hardening, embedding I/O
├── agentc-embed             model2vec embeddings + LSH for semantic memoization
├── agentc-memo              memoization cache: canonical keys, eviction, FFI
├── agentc-profiler          PyO3 module: Python bindings to span writer
├── agentc-analyzer          cost breakdown + waste detectors over stored traces
├── agentc-optimizer         DAG IR, cost model, planner, rewrite rules, CompositionPlanner
└── agentc-cli               `agentc` binary

python/agentc/               Python SDK
├── _patches/                wrapt-based monkey patches: anthropic, openai, google
├── _provenance_frameworks/  framework adapters: langgraph, crewai, autogen
├── _canonicalize/           per-vendor request canonicalization
└── _intercept.py            optimizer entry point: plan → dispatch → observe

bench/                       Evaluation harness
├── agents/                  Reference agents — per-rule isolation + composition probes:
│                            long_context_qa, long_context_qa_anthropic, long_context_qa_hf,
│                            iterative_refiner, iterative_refiner_allread (negative control),
│                            gaia_router, hotpot_qa, composition_qa, multirule_qa,
│                            rag_summarizer, autogen_bridge, support_qa, swebench_planner
├── build_*_fixture.py       public-dataset → JSON converters (hotpot, gaia,
│                            long_context, wikipedia_qa)
├── fixtures/                (gitignored, regenerated by build_*_fixture.py)
├── optimizer_bench.py       run an agent twice (optimizer off / on)
├── optimizer_ablation.py    11-config sweep per agent: shared baseline +
│                            <rule>-off ×5 + <rule>-only ×5
├── paired_analysis.py       McNemar exact test + bootstrap CI on per_task sidecars
├── paper_results/           committed result CSVs + summary txts
└── scripts/                 driver shell scripts

specs/                       Technical specifications
paper-intelligence/          Paper evidence, literature, venue, and experiment ledgers
tests/                       Python unit tests (~250 tests)

The Three Components

The three pieces work together in a feedback loop.

1. Profiler

Instruments any Python agent pipeline, captures every LLM call (tokens, latency, model, cost, full prompt/response, embedding), and produces structured execution traces in SQLite. Implemented in Rust via PyO3, with wrapt-based zero-config monkey-patching of the OpenAI, Anthropic, and Google SDKs. Spec: specs/profiler.md.
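
As a rough picture of the capture step, the sketch below wraps one method of a client class and writes a row per call to SQLite. It is an illustration only: the real profiler is Rust behind PyO3 and patches with wrapt, while this sketch uses `functools.wraps` as a stand-in. `FakeClient`, `record_spans`, and the `spans` table layout are all hypothetical, not the schema from specs/profiler.md.

```python
import functools
import sqlite3
import time

# In-memory database so the sketch leaves nothing on disk.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE spans (model TEXT, prompt TEXT, response TEXT, latency_s REAL)"
)


class FakeClient:
    """Hypothetical stand-in for a vendor SDK client."""

    def complete(self, model: str, prompt: str) -> str:
        return f"echo: {prompt}"


def record_spans(method):
    """Wrap an SDK method so every call is timed and stored as a span row."""

    @functools.wraps(method)
    def wrapper(self, model, prompt):
        start = time.perf_counter()
        response = method(self, model, prompt)
        latency = time.perf_counter() - start
        db.execute(
            "INSERT INTO spans VALUES (?, ?, ?, ?)",
            (model, prompt, response, latency),
        )
        return response

    return wrapper


# Patch the class once; callers keep using the client unchanged.
FakeClient.complete = record_spans(FakeClient.complete)

reply = FakeClient().complete("small-model", "hello")
rows = db.execute("SELECT model, prompt, response FROM spans").fetchall()
print(reply, rows)
```

The point of patching at the class level is that application code does not change: the agent still calls `complete()` and the span is recorded as a side effect.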

2. Semantic Memoization

Opt-in caching that deduplicates LLM inference. Exact-prompt hash lookup on the hot path; LSH over 256-dim model2vec embeddings as a secondary tier for semantically-similar prompts. Cache state piggybacks on the profiler's canonical traces.db. Spec: specs/memoization.md.
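
The two-tier lookup can be sketched as follows, under several assumptions: SHA-256 stands in for the canonical key, random-hyperplane signatures stand in for the LSH scheme, and a cosine threshold of 0.95 guards the semantic tier. None of these choices are confirmed by the README, and `TwoTierCache` and the toy 8-dimensional vectors are hypothetical (the real embeddings are 256-dim model2vec vectors).

```python
import hashlib
import math
import random

DIM, BITS = 8, 6  # toy sizes for illustration only
_rng = random.Random(0)
PLANES = [[_rng.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]


def exact_key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def lsh_bucket(vec):
    """Sign pattern of the vector against each random hyperplane."""
    return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in PLANES)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, min_similarity: float = 0.95):
        self.exact = {}    # prompt hash -> response
        self.buckets = {}  # LSH signature -> [(embedding, response)]
        self.min_similarity = min_similarity

    def put(self, prompt, vec, response):
        self.exact[exact_key(prompt)] = response
        self.buckets.setdefault(lsh_bucket(vec), []).append((vec, response))

    def get(self, prompt, vec):
        # Tier 1: exact-prompt hash lookup on the hot path.
        hit = self.exact.get(exact_key(prompt))
        if hit is not None:
            return hit
        # Tier 2: only candidates in the same LSH bucket are compared.
        for cached_vec, response in self.buckets.get(lsh_bucket(vec), []):
            if cosine(cached_vec, vec) >= self.min_similarity:
                return response
        return None


cache = TwoTierCache()
v = [1.0, 0.5, -0.25, 2.0, 0.0, -1.0, 0.75, 0.1]
cache.put("What is the capital of France?", v, "Paris")
print(cache.get("What is the capital of France?", v))      # exact tier
print(cache.get("France's capital?", [2 * x for x in v]))  # semantic tier
```

The exact tier costs one hash and one dictionary lookup, so the embedding comparison only runs on a miss.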

3. Optimizer

JIT runtime that intercepts LLM calls on hot call sites and applies cost-ranked rewrite rules subject to a per-rule accuracy budget. Eight rules ship across V1 and V2:

Rule	Cost driver	What it does	Status
`CacheHit`	CallElimination	Replay output for a past prompt via the shared memoization cache	implemented, future benchmark
`ContextCompress`	InputTokens	Extractively drop low-attention messages from large prompts (8KB+ gate, IDF-weighted proxy)	headline-validated
`ParallelBranch`	Structural	Detect dependency-free sibling calls, emit `Plan::Parallel` for async dispatcher	implemented, observability only
`ModelDowngrade`	ModelPrice	Swap to a cheaper model when the cost model says accuracy holds	headline-validated
`StateDrop`	InputTokens	Prune state-tagged messages whose keys aren't in the current read window	validated (supporting)
`PromptDedup`	InputTokens	Remove near-duplicate message segments via per-call IDF	V2, benchmarked
`OutputBudget`	OutputTokens	Cap `max_output_tokens` at call-site p99 to prevent runaway generation	V2, benchmarked
`StructuredTruncation`	InputTokens	Project out unreferenced JSON tool-output fields	V2, not yet independently benchmarked

V2 CompositionPlanner: classifies rules by CostDriver, allows orthogonal rules (different drivers = non-overlapping Call fields) to apply in a single pass as Plan::Composed. Same-driver rules are gated unless explicitly allowlisted (e.g., StateDrop → ContextCompress). Controlled by AGENTC_COMPOSE=1 (default). V1 first-match behavior available via AGENTC_COMPOSE=0.
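
A minimal sketch of the gating idea, assuming candidate rules arrive already ranked and that the allowlist is a set of ordered (earlier, later) pairs. The driver assignments are copied from the rule table; `compose`, `plan`, and `ALLOWED_SAME_DRIVER` are hypothetical names, and the real planner's dependency ordering is not modeled here.

```python
from enum import Enum


class CostDriver(Enum):
    INPUT_TOKENS = "InputTokens"
    OUTPUT_TOKENS = "OutputTokens"
    MODEL_PRICE = "ModelPrice"
    CALL_ELIMINATION = "CallElimination"
    STRUCTURAL = "Structural"


# Driver per rule, as listed in the rule table.
RULE_DRIVER = {
    "CacheHit": CostDriver.CALL_ELIMINATION,
    "ContextCompress": CostDriver.INPUT_TOKENS,
    "ParallelBranch": CostDriver.STRUCTURAL,
    "ModelDowngrade": CostDriver.MODEL_PRICE,
    "StateDrop": CostDriver.INPUT_TOKENS,
    "PromptDedup": CostDriver.INPUT_TOKENS,
    "OutputBudget": CostDriver.OUTPUT_TOKENS,
    "StructuredTruncation": CostDriver.INPUT_TOKENS,
}

# Same-driver pairs that may still compose, as (earlier, later).
ALLOWED_SAME_DRIVER = {("StateDrop", "ContextCompress")}


def compose(candidates):
    """Keep each candidate that is orthogonal to, or allowlisted with,
    every rule already chosen."""
    chosen = []
    for rule in candidates:
        compatible = all(
            RULE_DRIVER[rule] != RULE_DRIVER[prev]
            or (prev, rule) in ALLOWED_SAME_DRIVER
            for prev in chosen
        )
        if compatible:
            chosen.append(rule)
    return chosen


def plan(candidates, compose_enabled=True):
    # compose_enabled=False mimics the V1 first-match behavior.
    return compose(candidates) if compose_enabled else candidates[:1]


print(plan(["ContextCompress", "OutputBudget", "ModelDowngrade"]))
print(plan(["StateDrop", "ContextCompress", "PromptDedup"]))
```

In the second call PromptDedup is gated out because it shares the InputTokens driver with StateDrop and that pair is not allowlisted.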

Cold calls pass through; optimization engages after hot_threshold observations (default 3), when the empirical cost model has real per-call-site data. 2% shadow-mode sampling provides ground-truth divergence for the accuracy budget. Spec: specs/optimizer.md.
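
The hot-path gate can be pictured as a per-call-site counter plus a random draw. This sketch assumes "after hot_threshold observations" means the first three calls at a site pass through and the fourth is the first eligible one, and that a shadow-sampled call is still optimized while a baseline is run alongside it for comparison. `CallSiteGate` and its return labels are hypothetical.

```python
import random
from collections import Counter

HOT_THRESHOLD = 3   # default from the README
SHADOW_RATE = 0.02  # 2% shadow-mode sampling


class CallSiteGate:
    def __init__(self, draw=random.random):
        self.seen = Counter()  # observations per call site
        self.draw = draw       # injectable so the behavior is testable

    def decide(self, call_site: str) -> str:
        """Return 'passthrough', 'optimize', or 'optimize+shadow'."""
        observed = self.seen[call_site]
        self.seen[call_site] += 1
        if observed < HOT_THRESHOLD:
            return "passthrough"  # cold: no per-site data yet
        if self.draw() < SHADOW_RATE:
            return "optimize+shadow"  # also run the baseline to compare
        return "optimize"


gate = CallSiteGate(draw=lambda: 0.5)  # fixed draw: never shadow-sampled
print([gate.decide("app.summarize") for _ in range(5)])
```

Counters are kept per call site, so a newly seen site starts cold even when other sites are already hot.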

Quick Start

Build the Rust workspace and install the Python SDK:

cargo build --release
maturin develop --release            # builds the PyO3 extension into the active venv
pip install -e ".[dev,openai]"

Profile any agent script:

agentc record -- python my_agent.py
agentc traces                         # list recent runs
agentc analyze <trace_id>             # cost breakdown + waste detection
agentc report --last 20               # aggregate across runs

Inspect the optimizer:

agentc optimize report                # rule firing rates, savings, accuracy
agentc optimize inspect <call_site>   # cost model + ablation status per call site
agentc optimize disable --rule ModelDowngrade --call-site 'app.*' --hours 24
agentc optimize bench --agent path/to/agent.py

Run the reference benchmark suite end-to-end:

# 1. Build fixtures from public datasets (HF_TOKEN required for GAIA)
python -m bench.build_hotpot_fixture
python -m bench.build_gaia_fixture
python -m bench.build_long_context_fixture

# 2. Run baseline vs. optimized for one agent
python -m bench.optimizer_bench bench.agents.rag_summarizer

# 3. 11-config per-rule ablation for one agent
python -m bench.optimizer_ablation bench.agents.long_context_qa

# 4. Reproduce paper experiments (ContextCompress + StateDrop)
bash bench/scripts/run_paper_ablation.sh

The reference agents stub LLM calls when no API key is set, so the harness wires up cleanly without spending money. To generate real cost/accuracy numbers, set OPENAI_API_KEY (and HF_TOKEN for GAIA) in .env.

Why Models Can't Replace This

Models keep getting better at managing their own context and tool usage, but Agentc operates on things the model mostly cannot see or control:

Rule	Level	Can model training cover this by itself?
`CacheHit`	Memoization	Not without an external cache and invalidation policy
`ContextCompress`	Input shaping	Not without control over what gets sent to the model
`ParallelBranch`	Runtime scheduling	Not without visibility into sibling calls and side effects
`ModelDowngrade`	Routing	Partially; it overlaps with learned routing, but model choice happens outside the selected model
`StateDrop`	Memory management	Not without external state metadata and read/write policy

KV cache scheduling, parallel dispatch, semantic memoization, and runtime model selection are infrastructure problems. The model usually sees one request, not the full execution environment.

Architecture

Incoming LLM Call (intercepted via SDK monkey-patch)
       │
       ▼
  DAG Builder       →   adds typed node + edges to the execution graph
       │
       ▼
  Cost Model        →   scores token cost · latency · accuracy per strategy
       │
       ▼
  Optimizer         →   picks cheapest strategy within the accuracy budget
       │
       ▼
  Executor          →   runs the optimized call, instruments the result
       │
       ▼
  Cost Model        ←   feeds real execution data back (the loop closes here)
       │
       ▼
    Output

The cost model starts with conservative heuristics and is recalibrated empirically after every execution, so Agentc's rewrites should get cheaper as a call site accumulates data.
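
One way to picture the loop is a table of per-strategy running means with a budgeted argmin on top. The sketch below assumes the heuristic prior counts as a single pseudo-observation and that accuracy loss is tracked as a fraction; `StrategyStats`, `pick`, and the numbers are hypothetical and not taken from specs/optimizer.md.

```python
from dataclasses import dataclass


@dataclass
class StrategyStats:
    cost: float           # running mean cost per call
    accuracy_drop: float  # running mean accuracy loss vs. baseline, 0..1
    n: int = 1            # the heuristic prior counts as one observation

    def observe(self, cost: float, accuracy_drop: float) -> None:
        """Fold one real execution into the running means."""
        self.n += 1
        self.cost += (cost - self.cost) / self.n
        self.accuracy_drop += (accuracy_drop - self.accuracy_drop) / self.n


def pick(strategies: dict, budget: float) -> str:
    """Cheapest strategy whose estimated accuracy loss fits the budget."""
    eligible = {k: s for k, s in strategies.items() if s.accuracy_drop <= budget}
    if not eligible:
        return "passthrough"
    return min(eligible, key=lambda k: eligible[k].cost)


strategies = {
    "passthrough": StrategyStats(cost=1.0, accuracy_drop=0.0),
    # Conservative prior: assume the cheaper model loses 5% accuracy.
    "downgrade": StrategyStats(cost=0.4, accuracy_drop=0.05),
}
print(pick(strategies, budget=0.02))  # prior is over budget
for _ in range(2):
    strategies["downgrade"].observe(cost=0.4, accuracy_drop=0.0)
print(pick(strategies, budget=0.02))  # estimate now fits the budget
```

A pessimistic prior keeps a rewrite switched off until observed executions pull its estimated accuracy loss under the budget.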

The runtime fails open: every FFI boundary, every patch, every framework adapter is wrapped so a bug in Agentc never breaks the user's agent. The worst-case fallback is to pass the call through unmodified.
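
A minimal sketch of the fail-open pattern, assuming the guarded step is a pure request rewrite that runs before the provider call. `fail_open` and the two example rewrites are hypothetical names; the README does not describe how the real wrappers are written.

```python
import functools
import logging

log = logging.getLogger("agentc.sketch")


def fail_open(rewrite):
    """Run a request rewrite; on any error return the request unmodified."""

    @functools.wraps(rewrite)
    def wrapper(request):
        try:
            return rewrite(request)
        except Exception:
            log.exception("rewrite failed; passing the call through")
            return request

    return wrapper


@fail_open
def buggy_rewrite(request):
    raise RuntimeError("bug in the optimizer")


@fail_open
def keep_last_message(request):
    return {**request, "messages": request["messages"][-1:]}


request = {"model": "some-model", "messages": ["old context", "question"]}
print(buggy_rewrite(request) is request)       # True: original object returned
print(keep_last_message(request)["messages"])  # ['question']
```

Guarding only the rewrite, rather than the whole dispatch, means the provider is still called exactly once and its own errors reach the application as they would without Agentc.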

Implementation

Rust core for everything performance-critical: span writer, embeddings, memoization cache, optimizer planner, CLI.
Python SDK and bindings for SDK instrumentation, framework adapters, evaluation harness.
PyO3 + maturin bridges them. Single wheel, native speed where it matters.

Storage is three SQLite databases under ~/.agentc: traces.db (spans + cache entries), cost_model.db (per-call-site empirical cost model + shadow-mode divergence), and optimizer_audit.db (every plan decision the optimizer ever made, with the runtime input that produced it).

Documentation

specs/profiler.md - instrumentation, span schema, waste detectors
specs/memoization.md - canonical keys, LSH, cache lifecycle
specs/optimizer.md - DAG IR, cost model, rule definitions, accuracy budget
specs/future-work.md - items intentionally out of scope

Related Work

The closest system-level neighbors are Agentix/Autellix (serving-layer scheduler), Halo (batch workflow DAG optimizer over shared GPU), Murakkab (declarative resource allocation), and Cognify (offline autotuning loop). Each requires access to the serving stack, a declarative workflow format, or a labeled offline evaluator. Agentc operates at the Python SDK call site: it patches the SDK at import time and intercepts every LLM call made through a patched SDK, regardless of which framework issued it, with no serving access, no application annotations beyond optional agentc.state_write(), and no offline evaluator.

Model routing (FrugalGPT, RouteLLM, LLMSelector), prompt compression (LLMLingua, LLMLingua-2, Selective Context), semantic caching (GPTCache, vCache), and parallel tool calling (LLMCompiler) are all active areas. Agentc's ModelDowngrade, ContextCompress, CacheHit, and ParallelBranch cover the same goals but as passes under one runtime policy rather than standalone systems.

On the direct compression baseline: LLMLingua-2 (token-level proxy classifier, 53% reduction) degrades accuracy from 68% to 53% on HotpotQA-distractor (McNemar p=0.0013); ContextCompress improves it 68%→100% (p=4.7×10⁻¹⁰) by operating at message granularity. On natural Wikipedia prose (no injected distractors), ContextCompress correctly abstains — identical outcomes to baseline on all 39 tasks. LLMLingua-2 still compresses 53.5% of tokens with no significant accuracy gain and 13.7s overhead per task. The difference is granularity: message-level extraction vs. token-level scoring.

Name

Agentc: agent compiler. The -c suffix nods to the compiler toolchain tradition (rustc, javac, gcc). Agentc occupies the same role in the agent stack that a compiler occupies in the software stack: it takes a high-level specification and produces an efficient execution plan.
