Browser automation that learns. An MCP server that makes AI agents faster at repeated browser tasks by recording what works and replaying deterministic steps via Playwright.
47.4s → 8.8s on the second run. 5.4x faster. Zero config.
Every time an AI agent browses a website, it starts from scratch. Login flows, navigation, search bars, all rediscovered through expensive LLM calls. 3 to 5 seconds per step, every single time.
Forged fixes this. It sits between your AI assistant and the browser, records successful traces, extracts reusable templates, and replays deterministic steps at millisecond speed. The LLM only handles steps that genuinely need reasoning.
Run 1: Full AI agent. Every step through Claude. 47.4s
Run 2: 9 Playwright steps + 1 agent step. 8.8s (5.4x faster)
Run 5: 80% deterministic, 20% reasoning. ~3s
git clone https://github.com/sam-siavoshian/forged.git
cd forged
bash setup_mcp.shThe wizard finds your Python, installs dependencies (mcp, httpx), and registers Forged with Claude Code at user or project scope. Works with any MCP-compatible assistant (Cursor, Windsurf, etc).
After install, start the backend + frontend dev servers:
./dev.sh # backend on :8000, dashboard on :5173Set the keys from .env.example first: ANTHROPIC_API_KEY, OPENAI_API_KEY, BROWSER_USE_API_KEY, SUPABASE_URL / SUPABASE_KEY / SUPABASE_SERVICE_ROLE_KEY / SUPABASE_DB_URL.
When your AI calls run_browser_task("Search for keyboards on Amazon"), three things happen:
- Domain extraction. Regex pulls
amazon.comfrom the task. Falls back to Claude Haiku if no URL is present ("Amazon" →amazon.com). - Action classification. Haiku classifies the task into one of 7 categories:
search,purchase,form_fill,navigate,extract,login,interact. - Vector similarity search. The task is embedded via OpenAI
text-embedding-3-large(3072 dimensions) and compared against stored templates using pgvector's<=>cosine distance operator. Filtered by domain and action type before the vector search even runs.
If similarity is high enough (≥0.50), a template is matched. If not, the full agent runs and Forged learns a template from the trace.
| Band | Similarity | What happens |
|---|---|---|
| Very High | ≥ 90% | Execute all rocket steps immediately |
| High | 75-89% | Execute all rocket steps |
| Medium | 50-74% | LLM verifies the match before executing |
| No match | < 50% | Full agent, auto-learn template |
Template confidence updates after every execution: success adds +0.1 * (1 - confidence), failure subtracts 0.2 * confidence. Templates that keep working gain confidence. Templates that break lose it fast.
After a successful agent run, Forged analyzes the trace and classifies each step:
- FIXED. Same action and target every time. Navigate to URL, click a button, press Enter. Replayed via Playwright.
- PARAMETERIZED. Same action, different value. Typing a search query into a fixed input field. Playwright fills the extracted parameter.
- DYNAMIC. Requires reasoning. Choosing a product by price, answering a CAPTCHA, making a subjective decision. Handled by the AI agent.
The handoff_index marks where Playwright stops and the agent takes over. Everything before it is deterministic. Everything after needs intelligence.
Template: "Search for {{query}} on Amazon"
Step 0: navigate → amazon.com FIXED (Playwright)
Step 1: click → search input FIXED (Playwright)
Step 2: fill → search input with {{query}} PARAMETERIZED (Playwright)
Step 3: press → Enter FIXED (Playwright)
─── handoff_index = 3 ───
Step 4: click → product matching criteria DYNAMIC (Agent)
When a template has 10 steps but the task only needs 5, Forged doesn't blindly run all 10. Claude Haiku evaluates each step against the specific task and marks them EXECUTE or SKIP. If a step's parameter is missing from the task, it gets skipped. If the task is a subset of the template, only the relevant prefix executes.
One Amazon template handles both "search for keyboards" (steps 0-3) and "search for keyboards and buy the cheapest one" (steps 0-7).
Playwright connects to a cloud browser via CDP (Chrome DevTools Protocol) and replays template steps:
- Primary selector. CSS selector with full timeout budget (capped at 8s).
- Fallback selectors. Alternative CSS selectors with reduced timeout (1/3 budget, min 800ms).
- Role-based fallback.
get_by_role("option")for native<select>dropdowns when CSS breaks. - Text/aria fallback. Extracted aria-label from CSS selectors as last resort.
If a step fails, the on_failure strategy kicks in: retry (once after 500ms), continue (skip it), try_fallback (already tried, skip), or abort (hand off to agent with context of what completed).
After Playwright finishes, the agent gets a prompt describing exactly what was done: "Playwright finished 5 of 7 steps. Navigation and search are complete. The search results page is loaded. Complete the remaining work from the current page state."
Forged uses Browser Use's BaaS API for cloud browsers instead of local Playwright because:
- Handoff. Rocket connects via CDP, executes Playwright steps, disconnects with
pw.stop()(notbrowser.close(), which would kill the remote session), then the agent reconnects to the same browser. Local Playwright can't hand off a live session between two different automation libraries. - Live debugging. Every session gets a
live_urlwhere you can watch the browser in real-time. - Isolation. Each task gets a clean browser. No cross-task state contamination.
Rate limits are handled with exponential backoff: 5s → 10s → 20s → 40s → 80s → 120s (capped), with jitter, up to 6 retries.
┌─────────────────────────────────────────────────────────┐
│ Claude Code / Cursor / Windsurf │
│ (any MCP-compatible assistant) │
└──────────────────────┬──────────────────────────────────┘
│ MCP (stdio)
┌──────────────────────▼──────────────────────────────────┐
│ Forged MCP Server │
│ 2 tools: run_browser_task, list_learned_skills │
│ Thin proxy → polls HTTP API │
└──────────────────────┬──────────────────────────────────┘
│ HTTP
┌──────────────────────▼──────────────────────────────────┐
│ Forged Backend (FastAPI) │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Matcher │ │ Rocket │ │ Learner │ │
│ │ 3-layer │ │ Playwright │ │ Trace → tpl │ │
│ │ pipeline │ │ execution │ │ extraction │ │
│ └──────┬──────┘ └──────┬───────┘ └───────┬───────┘ │
│ │ │ │ │
│ ┌──────▼────────────────▼──────────────────▼───────┐ │
│ │ Supabase (PostgreSQL + pgvector) │ │
│ │ Templates, embeddings, traces, site knowledge │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────▼──────────────────────────┐ │
│ │ Browser Use BaaS (cloud browser via CDP) │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Execute any browser task. Forged handles routing automatically.
Input: task (string), e.g. "Go to news.ycombinator.com and get the top story"
Output: Result text, mode (rocket/baseline), duration, step breakdown, live browser URL
If a template matches, Playwright replays deterministic steps and the agent handles the rest. If no template exists, the full agent runs and a template is auto-learned for next time.
See what Forged has learned.
Input: domain (string, optional), filter by site (e.g. "amazon.com")
Output: Templates with confidence scores, step counts, usage stats, average speedup
PostgreSQL with pgvector on Supabase. Three tables:
task_templates. Learned browser automation templates. Each has a domain, action type, task pattern with {{param}} placeholders, step array (JSONB), handoff index, 3072-dim embedding, and confidence score that updates after every execution.
execution_traces. Every run recorded. Mode (rocket vs baseline), step-by-step actions, timing breakdown (rocket ms, agent ms, total ms), success/failure, error context. Used for analysis and debugging.
site_knowledge. Cached CSS selectors per domain. When a selector works, it's added to the fallback chain. Navigation patterns and page load signals stored for reliability.
The system uses dual connections: asyncpg pool for pgvector similarity search (fast, needs direct PG), Supabase REST client for CRUD (works through firewalls). If direct PG is blocked (campus WiFi), it falls back to REST + in-Python cosine similarity via NumPy.
src/
api.py FastAPI endpoints + session management
config.py All tunable constants (thresholds, timeouts, models)
models.py Pydantic models and dataclasses
orchestrator.py Task orchestration logic
browser/
rocket.py Playwright execution engine
agent.py Browser Use agent wrapper
agent_handoff.py Builds context-aware handoff prompts
cloud.py Cloud browser lifecycle (BaaS API)
matching/
matcher.py Three-layer matching pipeline
domain.py Domain extraction (regex + LLM)
action_type.py Action classification (LLM)
verifier.py Medium-confidence LLM verification
step_filter.py Adapts templates to specific tasks
template/
extractor.py Full extraction pipeline orchestrator
simplifier.py Trace cleanup (retry noise, dead ends)
analyzer.py LLM step classification (FIXED/PARAM/DYNAMIC)
generator.py Canonical template generation
validator.py Template validation (ERROR/WARNING checks)
db/
client.py Dual connection management (asyncpg + Supabase)
setup.py Schema creation and migration
templates.py Template CRUD + confidence learning
traces.py Execution trace recording
embeddings.py OpenAI embedding generation (3072-dim)
site_knowledge.py Selector caching per domain
frontend/ React 19 + TypeScript + Vite + Tailwind dashboard
mcp_server.py MCP server (2 tools, stdio transport)
setup_mcp.sh One-command setup wizard
dev.sh Local dev runner (backend + frontend)
tests/ pytest suite mirroring src/ layout
- Backend: Python 3.11+, FastAPI 0.115, Uvicorn 0.34
- Browser automation: Playwright 1.52 (rocket engine), Browser Use 0.2 (agent), Browser Use BaaS (cloud browsers)
- LLMs: Claude Sonnet 4.6 (agent execution, template analysis), Claude Haiku 4.5 (classification, verification, parameter extraction)
- Embeddings: OpenAI
text-embedding-3-large(3072 dimensions) - Database: Supabase PostgreSQL with pgvector
- Frontend: React 19, TypeScript, Vite 8, Tailwind 4, Bun
- MCP: Official Python MCP SDK 1.0, stdio transport
PRs welcome. See CONTRIBUTING.md. Run pytest tests/ -v before opening a PR. Open a draft early for non-trivial changes.
Built with ❤️ at Diamond Hack UCSD '26 by Saam Siavoshian.