AveryClapp/Agentc

Agentc

A JIT optimization runtime for multi-step LLM agent workloads.

Multi-agent systems use 4-15x more tokens than single-agent baselines. Frameworks fire LLM calls and eat the cost; few systems optimize the application-level call trace before or while it runs. Agentc sits between agent frameworks and the LLM APIs, intercepts calls as they happen, and rewrites them to be cheaper. Supported adapters require no application-code changes.

  Agent Frameworks       Claude Code · LangGraph · CrewAI · AutoGen
         │               describe what to do
         ▼
      Agentc             intercepts calls, optimizes execution
         │               decides how to do it cheaply
         ▼
      LLM APIs           Anthropic · OpenAI · Gemini
                         raw inference

Think of it like a compiler for agent workloads. Frameworks describe what to do. Agentc decides how to do it cheaply.


Status

The runtime, profiler, semantic memoization layer, and JIT optimizer are all implemented and pass their test suites. The agentc CLI ships with record, traces, analyze, report, cache, and optimize subcommands.

V2 extends the per-call optimizer with a CompositionPlanner that classifies rules by cost driver (InputTokens / OutputTokens / ModelPrice / CallElimination / Structural), applies orthogonal rules in dependency order, and produces Plan::Composed audit rows. Three new rules ship in V2: PromptDedup, OutputBudget, and StructuredTruncation (last two not yet independently benchmarked). Composition is on by default (AGENTC_COMPOSE=1); set AGENTC_COMPOSE=0 to fall back to V1 first-match behavior.

Per-rule savings (purpose-built isolation workloads):

| Rule | Workload | n | Cost savings | Accuracy Δ | McNemar p |
|---|---|---|---|---|---|
| ModelDowngrade | gaia_router | 127 | 35.3% | −2.4pp (±3.1pp SE) | n/a (unpaired) |
| ContextCompress | long_context_qa | 100 | 34.8% input-tokens | −2pp (±3.5pp SE) | n/a (unpaired) |
| StateDrop | iterative_refiner | 50 | 6.0% cost / 9.6% input-tokens | −2pp | |

V2 composition and comparison results (CC = ContextCompress, SD = StateDrop, OB = OutputBudget, MD = ModelDowngrade, LL2 = LLMLingua-2):

| Experiment | n | Result | McNemar p |
|---|---|---|---|
| CC vs LLMLingua-2, HotpotQA distractor | 100 | CC: 68%→100% (FB=32, BF=0); LLMLingua-2: 68%→53% | CC: 4.7×10⁻¹⁰; LL2: 0.0013 |
| CC vs LLMLingua-2, Wikipedia natural prose | 39 | CC: 94.9%→94.9% (BF=0, FB=0, abstained); LL2: +2.6pp, 53.5% compression, 13.7s overhead | CC: 1.0; LL2: 1.0 |
| CC+StateDrop composition, multirule_qa | 30 | CC: 33.1% token savings; SD: 0.1%; CC+SD: 21.7% (gate picks CC on most calls; fixture-specific ratio); confirmatory n=20 ablation: all-on=31.3% ≈ CC-only | all p≥0.48 |
| Planner ablation (V1 vs V2) | 50 | V1-CC+OB: −2pp (greedy wrong pick); V2-CC+OB: +0pp (gate corrects) | V2-CC: 0.0412 |
| Agent diversity (rag_summarizer + autogen_bridge) | | CC fires 30–54% of hot calls; SD fires 9–24% | |
| Provider generalization (Anthropic Claude, HF Llama) | 50 each | CC: 98% fire rate / 34% tok savings (HF); 0% (Anthropic single-msg); MD: 14.7% savings (Anthropic), 31.1% (HF); autogen_bridge on Llama matches OpenAI activation | |
| StateDrop negative control (all-state-read variant) | 20 | 0/319 SD fires when all state writes have matching reads; confirms unread-state precondition | |
| Optimizer overhead (1,818 plan decisions) | | pass-through p50=76µs; rewrite p50=120µs; p99 tail from first-call load | |
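
The McNemar p-values above come from paired per-task outcomes. The following is a minimal sketch of the exact two-sided test, not the code in bench/paired_analysis.py; the function name is hypothetical, and it assumes FB and BF in the table are the two discordant-pair counts.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: the two runs agree on every task
    k = min(b, c)
    # Under H0 each discordant pair flips either way with probability 1/2,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_cc = mcnemar_exact(32, 0)      # 32 discordant pairs, all in one direction
p_abstain = mcnemar_exact(0, 0)  # identical outcomes on every task
print(f"{p_cc:.2e}", p_abstain)  # 4.66e-10 1.0
```

With 32 discordant pairs all in one direction the p-value is 2 × 0.5³² ≈ 4.7×10⁻¹⁰, which agrees with the HotpotQA row; with no discordant pairs it is 1.0, which agrees with the abstention row.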

ParallelBranch ships and emits audit rows; the latency win currently comes from the user-side parallel_map ThreadPoolExecutor. CacheHit functions as a bridge between memoized and non-memoized callers; neither is a headline paper claim yet.

crates/                      Rust workspace (7 crates)
├── agentc-core              span schema, SQLite storage, hardening, embedding I/O
├── agentc-embed             model2vec embeddings + LSH for semantic memoization
├── agentc-memo              memoization cache: canonical keys, eviction, FFI
├── agentc-profiler          PyO3 module: Python bindings to span writer
├── agentc-analyzer          cost breakdown + waste detectors over stored traces
├── agentc-optimizer         DAG IR, cost model, planner, rewrite rules, CompositionPlanner
└── agentc-cli               `agentc` binary

python/agentc/               Python SDK
├── _patches/                wrapt-based monkey patches: anthropic, openai, google
├── _provenance_frameworks/  framework adapters: langgraph, crewai, autogen
├── _canonicalize/           per-vendor request canonicalization
└── _intercept.py            optimizer entry point: plan → dispatch → observe

bench/                       Evaluation harness
├── agents/                  Reference agents — per-rule isolation + composition probes:
│                            long_context_qa, long_context_qa_anthropic, long_context_qa_hf,
│                            iterative_refiner, iterative_refiner_allread (negative control),
│                            gaia_router, hotpot_qa, composition_qa, multirule_qa,
│                            rag_summarizer, autogen_bridge, support_qa, swebench_planner
├── build_*_fixture.py       public-dataset → JSON converters (hotpot, gaia,
│                            long_context, wikipedia_qa)
├── fixtures/                (gitignored, regenerated by build_*_fixture.py)
├── optimizer_bench.py       run an agent twice (optimizer off / on)
├── optimizer_ablation.py    11-config sweep per agent: shared baseline +
│                            <rule>-off ×5 + <rule>-only ×5
├── paired_analysis.py       McNemar exact test + bootstrap CI on per_task sidecars
├── paper_results/           committed result CSVs + summary txts
└── scripts/                 driver shell scripts

specs/                       Technical specifications
paper-intelligence/          Paper evidence, literature, venue, and experiment ledgers
tests/                       Python unit tests (~250 tests)

The Three Components

Three pieces that work together in a feedback loop.

1. Profiler

Instruments any Python agent pipeline, captures every LLM call (tokens, latency, model, cost, full prompt/response, embedding), and produces structured execution traces in SQLite. Implemented in Rust via PyO3, with wrapt-based zero-config monkey-patching of the OpenAI, Anthropic, and Google SDKs. Spec: specs/profiler.md.
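
To illustrate the interception idea, here is a minimal sketch that patches a method on a stand-in client and writes one span per call to SQLite. It is not Agentc's implementation: it uses functools.wraps rather than wrapt, and FakeClient, install_patch, and the spans schema are all hypothetical.

```python
import functools
import sqlite3
import time

class FakeClient:
    """Stand-in for a vendor SDK client; not a real API."""
    def complete(self, model: str, prompt: str) -> dict:
        return {"text": prompt.upper(),
                "input_tokens": len(prompt.split()),
                "output_tokens": 1}

def install_patch(cls, method_name: str, db: sqlite3.Connection) -> None:
    """Replace cls.method_name with a wrapper that records one span per call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        result = original(self, *args, **kwargs)
        latency_ms = (time.perf_counter() - start) * 1000.0
        try:
            db.execute(
                "INSERT INTO spans (model, input_tokens, output_tokens, latency_ms) "
                "VALUES (?, ?, ?, ?)",
                (kwargs.get("model"), result["input_tokens"],
                 result["output_tokens"], latency_ms),
            )
        except Exception:
            pass  # recording a span must never break the caller
        return result

    setattr(cls, method_name, wrapper)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE spans (model TEXT, input_tokens INTEGER, "
           "output_tokens INTEGER, latency_ms REAL)")
install_patch(FakeClient, "complete", db)

reply = FakeClient().complete(model="demo-model", prompt="hello world")
rows = db.execute("SELECT model, input_tokens, output_tokens FROM spans").fetchall()
print(reply["text"], rows)  # HELLO WORLD [('demo-model', 2, 1)]
```

The caller's code is unchanged: it still calls complete() and gets the same reply, while the span lands in the database as a side effect.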

2. Semantic Memoization

Opt-in caching that deduplicates LLM inference. Exact-prompt hash lookup on the hot path; LSH over 256-dim model2vec embeddings as a secondary tier for semantically-similar prompts. Cache state piggybacks on the profiler's canonical traces.db. Spec: specs/memoization.md.
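
The two-tier lookup can be sketched as follows. This is an illustration under stated assumptions, not the agentc-memo implementation: toy_embed is a hashing-trick stand-in for model2vec, the random-hyperplane LSH scheme, the 16-bit signature, and the 0.9 similarity threshold are arbitrary choices, and every name is hypothetical.

```python
import hashlib
import math
import random

DIM, BITS = 256, 16  # 256-dim as in the text; 16 signature bits is an assumption

def toy_embed(text: str) -> list[float]:
    """Hashing-trick bag of words; a stand-in for a real embedding model."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM] += 1.0
    return vec

_rng = random.Random(0)
PLANES = [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(BITS)]

def lsh_signature(vec: list[float]) -> tuple:
    """One bit per random hyperplane: which side of the plane the vector is on."""
    return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in PLANES)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TwoTierCache:
    def __init__(self, min_similarity: float = 0.9):
        self.min_similarity = min_similarity
        self.exact = {}    # sha256(prompt) -> response
        self.buckets = {}  # LSH signature -> [(embedding, response), ...]

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    def put(self, prompt: str, response: str) -> None:
        self.exact[self._key(prompt)] = response
        vec = toy_embed(prompt)
        self.buckets.setdefault(lsh_signature(vec), []).append((vec, response))

    def get(self, prompt: str):
        key = self._key(prompt)
        if key in self.exact:  # tier 1: exact-prompt hash on the hot path
            return ("exact", self.exact[key])
        vec = toy_embed(prompt)  # tier 2: only embed on an exact miss
        for cand_vec, response in self.buckets.get(lsh_signature(vec), []):
            if cosine(vec, cand_vec) >= self.min_similarity:
                return ("semantic", response)
        return None

cache = TwoTierCache()
cache.put("What is the capital of France?", "Paris")
print(cache.get("What is the capital of France?"))  # ('exact', 'Paris')
print(cache.get("what is the capital of FRANCE?"))  # ('semantic', 'Paris')
```

The second lookup misses the exact tier because the string differs, but the toy embedding lowercases its input, so both prompts map to the same vector and the same LSH bucket.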

3. Optimizer

JIT runtime that intercepts LLM calls on hot call sites and applies cost-ranked rewrite rules subject to a per-rule accuracy budget. Eight rules ship across V1 and V2:

| Rule | Cost driver | What it does | Status |
|---|---|---|---|
| CacheHit | CallElimination | Replay output for a past prompt via the shared memoization cache | implemented, future benchmark |
| ContextCompress | InputTokens | Extractively drop low-attention messages from large prompts (8KB+ gate, IDF-weighted proxy) | headline-validated |
| ParallelBranch | Structural | Detect dependency-free sibling calls, emit Plan::Parallel for async dispatcher | implemented, observability only |
| ModelDowngrade | ModelPrice | Swap to a cheaper model when the cost model says accuracy holds | headline-validated |
| StateDrop | InputTokens | Prune state-tagged messages whose keys aren't in the current read window | validated (supporting) |
| PromptDedup | InputTokens | Remove near-duplicate message segments via per-call IDF | V2, benchmarked |
| OutputBudget | OutputTokens | Cap max_output_tokens at call-site p99 to prevent runaway generation | V2, benchmarked |
| StructuredTruncation | InputTokens | Project out unreferenced JSON tool-output fields | V2, not yet independently benchmarked |
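
As one concrete example from the table, the OutputBudget idea (cap max_output_tokens at the call-site p99) can be sketched like this. The nearest-rank percentile, the function names, and the request dict shape are assumptions for illustration; the actual rule lives in agentc-optimizer and may differ.

```python
def nearest_rank_percentile(samples: list[int], pct: int) -> int:
    """Smallest sample such that at least pct percent of samples are <= it."""
    if not samples:
        raise ValueError("need at least one observation")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point edge cases.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def apply_output_budget(request: dict, observed_output_tokens: list[int]) -> dict:
    """Return a copy of the request with max_output_tokens lowered to the p99.

    The cap is only ever lowered, never raised above what the caller asked for.
    """
    budget = nearest_rank_percentile(observed_output_tokens, 99)
    current = request.get("max_output_tokens")
    capped = budget if current is None else min(current, budget)
    return {**request, "max_output_tokens": capped}

history = list(range(1, 101))  # 100 observed completions of 1..100 tokens
request = {"model": "demo-model", "max_output_tokens": 4096}
print(apply_output_budget(request, history)["max_output_tokens"])  # 99
```

Returning a copy rather than mutating the request keeps the original available for a pass-through fallback.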

The V2 CompositionPlanner classifies rules by CostDriver and lets orthogonal rules (different drivers, and therefore non-overlapping Call fields) apply in a single pass as Plan::Composed. Same-driver rules are gated unless the pair is explicitly allowlisted (e.g., StateDrop → ContextCompress). Composition is controlled by AGENTC_COMPOSE and is on by default (AGENTC_COMPOSE=1); AGENTC_COMPOSE=0 restores V1 first-match behavior.
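
The gating logic described above can be sketched in a few lines. This is a simplified illustration, not the Rust planner: the greedy-by-savings selection, the Candidate type, and the savings figures are made up for the example; only the rule names, driver names, and the StateDrop → ContextCompress allowlist entry come from the text.

```python
from dataclasses import dataclass
from enum import Enum

class CostDriver(Enum):
    INPUT_TOKENS = "InputTokens"
    OUTPUT_TOKENS = "OutputTokens"
    MODEL_PRICE = "ModelPrice"
    CALL_ELIMINATION = "CallElimination"
    STRUCTURAL = "Structural"

@dataclass(frozen=True)
class Candidate:
    rule: str
    driver: CostDriver
    est_savings: float  # illustrative estimated fraction of call cost saved

# Same-driver pairs that may compose, written as (runs first, runs second).
ALLOWLIST = {("StateDrop", "ContextCompress")}

def compose(candidates: list[Candidate]) -> list[str]:
    """Pick rules greedily by savings; same-driver rules need an allowlisted pair."""
    chosen: list[Candidate] = []
    for cand in sorted(candidates, key=lambda c: c.est_savings, reverse=True):
        clashes = [c for c in chosen if c.driver is cand.driver]
        if all((c.rule, cand.rule) in ALLOWLIST or (cand.rule, c.rule) in ALLOWLIST
               for c in clashes):
            chosen.append(cand)
    names = [c.rule for c in chosen]
    # Respect the dependency order inside each allowlisted pair.
    for first, second in ALLOWLIST:
        if first in names and second in names:
            i, j = names.index(first), names.index(second)
            if i > j:
                names[i], names[j] = names[j], names[i]
    return names

plan = compose([
    Candidate("ContextCompress", CostDriver.INPUT_TOKENS, 0.25),
    Candidate("StateDrop", CostDriver.INPUT_TOKENS, 0.05),
    Candidate("PromptDedup", CostDriver.INPUT_TOKENS, 0.10),
    Candidate("OutputBudget", CostDriver.OUTPUT_TOKENS, 0.04),
    Candidate("ModelDowngrade", CostDriver.MODEL_PRICE, 0.30),
])
print(plan)  # ['ModelDowngrade', 'StateDrop', 'ContextCompress', 'OutputBudget']
```

PromptDedup is gated out because it shares the InputTokens driver with ContextCompress and the pair is not allowlisted, while StateDrop is admitted and ordered ahead of ContextCompress.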

Cold calls pass through; optimization engages after hot_threshold observations (default 3), when the empirical cost model has real per-call-site data. 2% shadow-mode sampling provides ground-truth divergence for the accuracy budget. Spec: specs/optimizer.md.
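
A minimal sketch of the hot-threshold gate and shadow sampling follows. The HotGate class and its decision labels are hypothetical, and the sketch assumes that a shadow-sampled call runs both the original and the rewritten request so their outputs can be compared; the spec may define this differently.

```python
import random
from collections import Counter

class HotGate:
    """Per-call-site gate: pass through while cold, optimize once hot."""

    def __init__(self, hot_threshold: int = 3, shadow_rate: float = 0.02, seed=None):
        self.hot_threshold = hot_threshold
        self.shadow_rate = shadow_rate
        self.seen = Counter()           # call_site -> observations so far
        self.rng = random.Random(seed)

    def decide(self, call_site: str) -> str:
        """Return 'pass_through', 'optimize', or 'shadow' for this call."""
        observed = self.seen[call_site]
        self.seen[call_site] += 1
        if observed < self.hot_threshold:
            return "pass_through"  # cold: the cost model has too little data
        if self.rng.random() < self.shadow_rate:
            return "shadow"        # sample: compare rewritten vs. original output
        return "optimize"

gate = HotGate(shadow_rate=0.0)  # sampling disabled to keep the demo deterministic
decisions = [gate.decide("app.summarize") for _ in range(5)]
print(decisions)
# ['pass_through', 'pass_through', 'pass_through', 'optimize', 'optimize']
```

Counting is per call site, so a hot site does not unlock optimization for a site the cost model has not yet observed.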


Quick Start

Build the Rust workspace and install the Python SDK:

cargo build --release
maturin develop --release            # builds the PyO3 extension into the active venv
pip install -e ".[dev,openai]"

Profile any agent script:

agentc record -- python my_agent.py
agentc traces                         # list recent runs
agentc analyze <trace_id>             # cost breakdown + waste detection
agentc report --last 20               # aggregate across runs

Inspect the optimizer:

agentc optimize report                # rule firing rates, savings, accuracy
agentc optimize inspect <call_site>   # cost model + ablation status per call site
agentc optimize disable --rule ModelDowngrade --call-site 'app.*' --hours 24
agentc optimize bench --agent path/to/agent.py

Run the reference benchmark suite end-to-end:

# 1. Build fixtures from public datasets (HF_TOKEN required for GAIA)
python -m bench.build_hotpot_fixture
python -m bench.build_gaia_fixture
python -m bench.build_long_context_fixture

# 2. Run baseline vs. optimized for one agent
python -m bench.optimizer_bench bench.agents.rag_summarizer

# 3. 11-config per-rule ablation for one agent
python -m bench.optimizer_ablation bench.agents.long_context_qa

# 4. Reproduce paper experiments (ContextCompress + StateDrop)
bash bench/scripts/run_paper_ablation.sh

The reference agents stub LLM calls when no API key is set, so the harness wires up cleanly without spending money. To generate real cost/accuracy numbers, set OPENAI_API_KEY (and HF_TOKEN for GAIA) in .env.


Why Models Can't Replace This

Models keep getting better at managing their own context and tool usage. But Agentc operates on things the model literally cannot see or control:

| Rule | Level | Can model training cover this by itself? |
|---|---|---|
| CacheHit | Memoization | Not without an external cache and invalidation policy |
| ContextCompress | Input shaping | Not without control over what gets sent to the model |
| ParallelBranch | Runtime scheduling | Not without visibility into sibling calls and side effects |
| ModelDowngrade | Routing | Partially; it overlaps with learned routing, but model choice happens outside the selected model |
| StateDrop | Memory management | Not without external state metadata and read/write policy |

KV cache scheduling, parallel dispatch, semantic memoization, and runtime model selection are infrastructure problems. The model usually sees one request, not the full execution environment.

Architecture

Incoming LLM Call (intercepted via SDK monkey-patch)
       │
       ▼
  DAG Builder       →   adds typed node + edges to the execution graph
       │
       ▼
  Cost Model        →   scores token cost · latency · accuracy per strategy
       │
       ▼
  Optimizer         →   picks cheapest strategy within the accuracy budget
       │
       ▼
  Executor          →   runs the optimized call, instruments the result
       │
       ▼
  Cost Model        ←   feeds real execution data back (the loop closes here)
       │
       ▼
    Output

The cost model starts with conservative heuristics and is empirically calibrated with every execution. Agentc gets cheaper the more you use it.
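
The feedback loop can be sketched as a running-mean cost model that starts from conservative priors and a selector that picks the cheapest strategy inside the accuracy budget. All names, the prior values, and the pseudo-observation weighting are assumptions for illustration, not the contents of cost_model.db.

```python
from dataclasses import dataclass

@dataclass
class StrategyStats:
    cost: float      # running mean cost per call (starts as a heuristic prior)
    accuracy: float  # running mean agreement with the unoptimized output
    n: int = 0       # real observations folded in so far

    def observe(self, cost: float, correct: bool) -> None:
        """Fold one execution in; the prior counts as one pseudo-observation."""
        self.n += 1
        w = 1.0 / (self.n + 1)
        self.cost += w * (cost - self.cost)
        self.accuracy += w * (float(correct) - self.accuracy)

def pick(stats: dict, baseline: str, budget: float) -> str:
    """Cheapest strategy whose accuracy is within `budget` of the baseline."""
    floor = stats[baseline].accuracy - budget
    eligible = {name: s for name, s in stats.items() if s.accuracy >= floor}
    return min(eligible, key=lambda name: eligible[name].cost)

stats = {
    "pass_through": StrategyStats(cost=0.010, accuracy=1.0),
    # Conservative prior: assume the rewrite hurts accuracy until data says otherwise.
    "model_downgrade": StrategyStats(cost=0.004, accuracy=0.90),
}
before = pick(stats, "pass_through", budget=0.05)

for _ in range(3):  # three observed executions that matched the baseline output
    stats["model_downgrade"].observe(cost=0.003, correct=True)
after = pick(stats, "pass_through", budget=0.05)

print(before, "->", after)  # pass_through -> model_downgrade
```

The pessimistic prior keeps the rewrite out of the eligible set until real observations lift its estimated accuracy above the floor, which is one way "gets cheaper the more you use it" could play out.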

The runtime fails open: every FFI boundary, every patch, every framework adapter is wrapped so a bug in Agentc never breaks the user's agent. The worst-case fallback is to pass the call through unmodified.
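
The fail-open pattern can be illustrated with a small decorator. This is a sketch of the general technique, not Agentc's wrapper code; fail_open and both rewrite functions are hypothetical.

```python
import functools
import logging

log = logging.getLogger("fail_open_demo")

def fail_open(optimize):
    """Wrap a rewrite step so any exception falls back to the unmodified request."""
    @functools.wraps(optimize)
    def guarded(request: dict) -> dict:
        try:
            return optimize(request)
        except Exception:
            log.exception("optimizer failed; passing the call through unmodified")
            return request
    return guarded

@fail_open
def buggy_rewrite(request: dict) -> dict:
    raise RuntimeError("simulated optimizer bug")

@fail_open
def trim_rewrite(request: dict) -> dict:
    # Keep only the last two messages; a toy stand-in for a real rewrite rule.
    return {**request, "messages": request["messages"][-2:]}

request = {"model": "demo-model", "messages": ["a", "b", "c"]}
print(buggy_rewrite(request) is request)   # True
print(trim_rewrite(request)["messages"])   # ['b', 'c']
```

The failing rewrite returns the very same request object, so the caller's agent proceeds as if the optimizer were not installed; the error is logged rather than raised.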


Implementation

  • Rust core for everything performance-critical: span writer, embeddings, memoization cache, optimizer planner, CLI.
  • Python SDK and bindings for SDK instrumentation, framework adapters, evaluation harness.
  • PyO3 + maturin bridges them. Single wheel, native speed where it matters.

Storage is three SQLite databases under ~/.agentc: traces.db (spans + cache entries), cost_model.db (per-call-site empirical cost model + shadow-mode divergence), and optimizer_audit.db (every plan decision the optimizer ever made, with the runtime input that produced it).


Related Work

The closest system-level neighbors are Agentix/Autellix (serving-layer scheduler), Halo (batch workflow DAG optimizer over shared GPU), Murakkab (declarative resource allocation), and Cognify (offline autotuning loop). Each requires access to the serving stack, a declarative workflow format, or a labeled offline evaluator. Agentc operates at the Python SDK call site instead: it patches the SDK at import time and intercepts every LLM call regardless of which framework issued it, with no serving access, no application annotations beyond optional agentc.state_write(), and no offline evaluator.

Model routing (FrugalGPT, RouteLLM, LLMSelector), prompt compression (LLMLingua, LLMLingua-2, Selective Context), semantic caching (GPTCache, vCache), and parallel tool calling (LLMCompiler) are all active areas. Agentc's ModelDowngrade, ContextCompress, CacheHit, and ParallelBranch cover the same goals but as passes under one runtime policy rather than standalone systems.

On the direct compression baseline: LLMLingua-2 (token-level proxy classifier, 53% reduction) degrades accuracy from 68% to 53% on HotpotQA-distractor (McNemar p=0.0013), while ContextCompress raises accuracy from 68% to 100% (p=4.7×10⁻¹⁰) by operating at message granularity. On natural Wikipedia prose (no injected distractors), ContextCompress correctly abstains: outcomes are identical to baseline on all 39 tasks. LLMLingua-2 still removes 53.5% of tokens there, with no significant accuracy gain and 13.7s of overhead per task. The difference is granularity: message-level extraction vs. token-level scoring.


Name

Agentc: agent compiler. The -c suffix nods to the compiler toolchain tradition (rustc, javac, gcc). Agentc occupies the same role in the agent stack that a compiler occupies in the software stack: it takes a high-level specification and produces an efficient execution plan.
