A tool-aware model router for agent loops — plus a content-addressed run graph that makes every part of the loop inspectable, branchable, and replayable.
Status: very early. Public design exploration. APIs will break.
$ forge run --tools web_search,execute_sql,screenshot \
--model claude-sonnet-4-6 \
--after-tool web_search:claude-haiku-4-5-20251001 \
--after-tool execute_sql:openai/ft:gpt-4o-mini:org:sql-v3 \
--after-tool screenshot:gemini/gemini-2.5-flash \
--prompt "..."Every tool boundary is a swap point. After a specific tool's result lands, the next turn uses a model picked for that tool:
- Cost — cheap models digest mechanical tool output; smart models plan and synthesize
- Specialization — SQL-fine-tuned model after
execute_sql, code-fine-tuned afterrun_tests, vision afterscreenshot - Privacy — local model after a tool that returned PII so the cloud model never sees the sensitive payload
- Modality — vision-only model only when an image was just produced
The router is keyed on the most recent ToolCall in the chain, evaluated fresh each cycle. Rules are tool:[provider/]model. No rule for a tool means the default (--model) handles that turn. Cross-provider rules work because the run graph is provider-neutral; each adapter translates the prefix into its own wire format.
$ forge audit --target-model claude-haiku-4-5 --judge-model claude-sonnet-4-6
auditing 52 run(s) against claude-haiku-4-5
✓ 47 EQUIVALENT — safe to downgrade
⚠ 5 DIFFERENT — keep on the original modelforge audit measures per-prompt downgrade safety against your real recorded runs, with an LLM judge. The audit and the routing close the loop: measure → derive rules → apply.
Routing isn't only cost arbitrage. Sometimes the cheap model genuinely fails the task — and routing rescues it with a single flag.
A planted-bug fixture lives at tests/fixtures/buggy/: multiply(a, b) -> a + b instead of a * b. Three unit tests pin it down. The agent gets three tools: run_tests, read_file, apply_patch.
Pure gemini-2.5-flash-lite, no routing. The agent calls run_tests, sees cargo's test result: FAILED. 1 failed; 2 passed, and replies:
"The crate already passes all tests. There is no bug to fix."
The cheap model hallucinates success in direct contradiction of the structured tool output. No patch applied. Tests still failing.
Same agent, same prompt, same tools — one flag added:
$ forge run --agent gemini --model gemini-2.5-flash-lite \
--tools run-tests,read-file,apply-patch \
--after-tool run_tests:gemini-2.5-flash \
--after-tool read_file:gemini-2.5-flash \
--prompt "..."
routed run: 5 cycle(s)
cycle 0 [after run_tests ] flash-lite +3 step(s) ← cheap: ran the failing tests
cycle 1 [after read_file ] flash +2 step(s) ← smart: read the source
cycle 2 [after apply_patch] flash +2 step(s) ← smart: applied a + b → a * b
cycle 3 [after run_tests ] flash-lite +2 step(s) ← cheap: re-ran tests
cycle 4 [after run_tests ] flash +1 step ← smart: synthesized FIXEDFinal assistant message: FIXED. cargo test independently confirms 3 passed; 0 failed. The fix the agent applied is in the chain — apply_patch(old="a + b", new="a * b"). Reproducible end-to-end on a real codebase, not synthetic.
The smart model only handled the two cycles where structured interpretation matters (read_file → understand source, run_tests → diagnose failure). Mechanical iteration stayed on the cheap model. Same agent loop. One config flag.
Every agent execution is a DAG of content-addressed steps. Each step (prompt, tool call, response, sandbox run) is a hash-keyed node. The graph is the source of truth.
That gives you five things you can't get from append-only logs:
- Fork. Branch any run from any step. Try a different model, a different prompt, a different tool — without re-running the prefix.
- Diff. Compare two runs and see exactly where they diverged. Tool calls, outputs, semantic drift.
- Bisect.
forge bisect <good> <bad>finds the first step that caused a failure by forking withgood's value at each divergent step and asking "did this recover?" Only possible because the graph is content-addressed and forks are free. - Replay. Re-execute a run deterministically against a different backend. Catch regressions before they ship.
- Resume. Crash mid-run, resume from the last hashed node. No lost work.
LLM agents are stateful, expensive, and non-deterministic. The default observability story is "log everything, scroll through JSONL." That doesn't help you answer the question every team actually has: what changed, and why?
Forge treats agent runs the way git treats source code: a commit graph you can inspect, branch, and diff.
Cargo workspace:
| crate | what it owns |
|---|---|
forge-core |
step types, hashes, graph invariants, Agent / Matcher / Tool traits + the built-in tool library (calculator, fs ops, run_tests, apply_patch, run_command) |
forge-storage |
Storage trait — MemoryStorage, SledStorage, PostgresStorage |
forge-anthropic |
Anthropic Messages API adapter (multi-turn tool use, prompt caching, SSE streaming) |
forge-openai |
OpenAI Chat Completions adapter (multi-turn tool use, SSE streaming) |
forge-gemini |
Google Gemini adapter — public API + Vertex AI bearer auth + Gemini 3.x thinking-mode tool continuations |
forge-mcp |
Model Context Protocol JSON-RPC client + Tool adapter — drop in any MCP server's tools as native Forge tools |
forge-rig |
Record a Rig conversation as a Forge step chain |
forge-recorder |
HTTP recorder proxy (Anthropic + OpenAI, auto-threaded via content-addressed prefix detection) |
forge-cli |
the forge binary (run, runs, replay, continue, fork, diff, bisect, audit, view, serve, web) |
Storage backends: in-memory (tests), sled (default, single-process), Postgres (multi-process / multi-machine, durable resume, LISTEN/NOTIFY for live web updates).
- Workspace skeleton, core types, in-memory storage
-
forge runagainst a stub agent harness (FakeAgent) - sled backend;
forge runsandforge replayread from disk - Anthropic Messages API adapter (single-shot, no tools yet)
-
Matchertrait sketch (interception / tripwires) -
forge fork <run> --at <step> --rewrite-text(Prompt/Message kinds) -
forge diffv0: pairwise chain walk, finds first divergence - Anthropic tool-use loop (multi-turn, tool calls executed locally)
-
Tooltrait + built-inCalculatortool -
forge fork ... --continue— drive a fresh agent forward from the fork -
forge diffv1: structural alignment via signature-based LCS, modified/only markers - Fork supports every step kind (text for prompt/message, JSON for tool_call/tool_result)
- Auto-execute tool after fork-on-tool-call when
--continueis set -
forge continue <run>+--max-turnson run/continue/fork — mid-run model handoffs - OpenAI Chat Completions adapter (
--agent openai) - Cross-provider continuations: run with Claude, continue with GPT-5
- Friendly error rendering (no Rust backtrace on missing API key)
-
forge viewTUI: timeline + content panes, single-run and aligned-diff modes - Anthropic SSE streaming with text deltas to stderr
-
forge serveHTTP recorder: drop-in proxy for any agent (Anthropic + OpenAI) - Streaming pass-through in the recorder (SSE tee)
- Auto-threading via content-addressed prefix detection
-
forge webembedded HTML viewer (single binary, no build pipeline) -
examples/with runnable scripts - Postgres backend (
--db postgres://...) for multi-process / multi-machine - OpenAI SSE streaming (matches Anthropic;
--streamworks on both) - Gemini adapter (
--agent gemini, multi-turn function calling, streaming) - Anthropic prompt caching (
cache_control: ephemeralon the latest input block) -
forge webtoken-level inline diff for modified steps - Friendly Postgres connect errors (one-line messages for auth / DNS / refused)
-
--publicflag onforge serveandforge webto opt-in to non-localhost bind - Live
forge webupdates via PostgresLISTEN/NOTIFY(broadcast for sled / memory) - Web UI fork/continue actions (localhost-only until auth ships)
- Anthropic cache-hit telemetry (
cache_creation_input_tokens/cache_read_input_tokens) -
forge-rigcrate to record a Rig conversation -
forge bisect— git-bisect for agent runs (find the step that caused a failure) -
forge audit— replay recorded runs against cheaper models, judge equivalence, surface safe downgrades -
forge run --after-tool— tool-aware per-turn model routing (cost / specialization / privacy) -
forge audit --by-tool— derive routing rules empirically from per-tool downgrade safety - HTTP recorder middleware (drop-in for any agent)
- Adapter for Rig (thin shim once tool-use lands)
-
forge diffv1: LLM-judge for "why did these diverge?" - Tripwires v0 (only useful once a real agent loop is intercepting; deferred)
- Deterministic replay (seeded where APIs allow, full request/response capture)
- Postgres backend (durable resume across machines, large blob dedup)
- TUI viewer (
ratatui)
git clone https://github.com/theohmwoa/forge && cd forge
cargo install --path crates/forge-cli --force
forge --helpThe binary lands in ~/.cargo/bin/forge. Make sure that's in your PATH.
Single binary, no runtime deps.
# record a 5-step scripted run (no API key required)
forge run --agent fake
# real Anthropic call with tool use
export ANTHROPIC_API_KEY=...
forge run --agent anthropic --tools calculator --prompt "what is 47 * 53?"
# OpenAI works the same way
export OPENAI_API_KEY=...
forge run --agent openai --model gpt-5 --tools calculator --prompt "what is 47 * 53?"
# Gemini, ditto
export GEMINI_API_KEY=...
forge run --agent gemini --model gemini-2.5-flash --tools calculator --prompt "what is 47 * 53?"
# Vertex AI (bearer auth via gcloud ADC) — same flags, different env
export FORGE_VERTEX=1 FORGE_VERTEX_PROJECT=my-gcp-project
export GOOGLE_VERTEX_ACCESS_TOKEN=$(gcloud auth application-default print-access-token)
forge run --agent gemini --model gemini-3.1-pro-preview --tools calculator --prompt "..."
# point forge at any MCP server; its tools register automatically
# (alongside built-in tools — no Rust changes needed)
forge run --agent anthropic \
--mcp-server "npx -y @modelcontextprotocol/server-filesystem ." \
--prompt "..."
# autonomous bug-hunting on the planted-bug fixture — see top of README
forge run --agent gemini --model gemini-2.5-flash-lite \
--tools run-tests,read-file,apply-patch \
--after-tool run_tests:gemini-2.5-flash \
--after-tool read_file:gemini-2.5-flash \
--prompt "Run tests on tests/fixtures/buggy. If any fail, find the bug, apply_patch it, re-run. Output FIXED or BROKEN."
# list recorded runs
forge runs
# walk a run back from disk
forge replay <head-prefix>
# fork at any step. Text for prompt/message, JSON for tool_call/tool_result.
forge fork <run> --at <prompt-prefix> --rewrite "ask differently"
forge fork <run> --at <tool-call-prefix> --rewrite '{"op":"mul","a":7,"b":8}'
forge fork <run> --at <tool-result-prefix> --rewrite '99'
# fork + continue: drive a fresh agent forward from the new step.
# For tool_call rewrites, the tool runs locally first to produce a real result.
forge fork <run> --at <step> --rewrite "..." --continue --tools calculator
# stop a run early so another model can pick up
forge run --agent anthropic --model claude-haiku-4-5-20251001 --max-turns 1 \
--prompt "what is 47 * 53?" --tools calculator
forge continue <head> --model claude-sonnet-4-6 --tools calculator
# diff two runs (shared prefix is content-addressed equal, so cheap)
forge diff <head-a> <head-b>
# git-bisect for agent runs: which step caused the failure?
# each trial forks `bad` at a divergent step with `good`'s value, drives
# the agent forward, and checks whether the run recovered.
forge bisect <good-head> <bad-head> --expect "sum is 5"
# audit recorded runs for cost: which can be safely downgraded to haiku?
# forge replays each prompt against the cheaper model and uses sonnet to
# judge equivalence; trial runs persist with tag audit-<model> for review.
forge audit --tag prod --target-model claude-haiku-4-5-20251001
# interactive TUI: timeline on the left, content on the right
forge view <head>
forge view <head-a> --diff <head-b> # aligned diff with j/k navigation
# local web viewer (single embedded HTML page, localhost only by default)
forge web --port 7879
# HTTP recorder — drop-in proxy for any existing agent
forge serve --port 7878
export ANTHROPIC_BASE_URL=http://127.0.0.1:7878
./my-existing-agent.py # every API call now records into ForgeForge has three integration paths, in increasing order of friction:
- HTTP recorder (
forge serve) — point your existing agent at the proxy viaANTHROPIC_BASE_URL/OPENAI_BASE_URL. Works with the Anthropic and OpenAI SDKs in Python, TypeScript, Go, etc., and anything built on top (LangChain, LlamaIndex, Rig, raw HTTP). Streaming responses are teed; multi-turn calls are auto-threaded via content-addressed prefix detection — no session id needed. - Native Rust integration — use
forge-anthropic/forge-openaicrates directly for the full agent + recording pipeline. - Direct graph access — the run graph is just a sled DB on disk. Walk it from any language without a Forge client.
See the examples/ directory for runnable scripts covering
each path.
End-to-end example (FakeAgent emits a 5-step conversation; we fork at the assistant's first message and diff):
$ forge run --agent fake
run complete: 5 steps
0 6952335a6f prompt[fake-model-v1]
1 fadbc99433 message[assistant] "I'll use the calculator tool."
2 ff2e474b78 tool_call[calculator]
3 2cec969b11 tool_result[call-1]
4 715ac59405 message[assistant] "The sum is 5."
$ forge fork 715ac594 --at fadbc994 --rewrite-text "Let me solve this without tools."
new head: cb7329bca5...
$ forge diff 715ac594 cb7329bc
shared prefix: 1 step(s)
legend: = same ~ modified - only in A + only in B
~ message[assistant]
- I'll use the calculator tool.
+ Let me solve this without tools.
- tool_call[calculator]
- {"a":2,"b":3,"op":"add"}
- tool_result[call-1]
- 5
- message[assistant]
- The sum is 5.
MIT.