AshtonVaughan/agentbrowser

AgentBrowser

A real, visible-cursor browser for AI agents. Real Chromium. Real humanlike physics. Real audit trail. Agents drive it the way a human would: drag, click, type, scroll - while every action is recorded, replayable, and verified.

Works with every major LLM: Anthropic Claude, OpenAI GPT, Google Gemini, Groq, Together, Fireworks, DeepInfra, Mistral, Cohere, xAI Grok, OpenRouter, Perplexity, Ollama (local), Ollama Cloud (hosted), vLLM, LM Studio, llama.cpp - or anything OpenAI-compatible. Zero lock-in.

┌──────────────────────────────────────────────────────────────────┐
│                          Agent (your code)                       │
│   "submit the payment form"  ──┐                                 │
└────────────────────────────────┼─────────────────────────────────┘
                                 │  HTTP POST /sessions/:id/plan
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                      AgentBrowser runtime                        │
│  Planner  ──▶  findAndClick (DOM ▶ vision-LLM)  ──▶  cursor      │
│     │              │                                  │          │
│     │              ▼                                  ▼          │
│     │     verifier (DOM diff)               Bezier trajectory    │
│     │              │                          + CDP raw events   │
│     ▼              ▼                                  │          │
│  action      action.completed event ──▶  recorder (JSONL)        │
│  memory                                              │           │
│  (skip LLM                                           ▼           │
│   on visit                                  WebSocket / SSE      │
│   #2+)                                       to operator UI      │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
                     Real visible cursor moves on real Chromium

Why this exists

Every existing browser-automation tool was built for humans first and retrofitted for agents. They speak DOM operations. They produce 8000-token HTML dumps. They re-learn each site every run. They have no audit trail. They get blocked by every cookie banner.

AgentBrowser inverts this. The cursor is real and visible. All input goes through CDP raw mouse events. The API speaks in goals, not selectors. Failed actions auto-recover. Every action gets verified. The system learns each site permanently and shares knowledge across domains.

Other tools                       │   AgentBrowser
─────────────────────────────────│──────────────────────────────────
8000 tokens of HTML              │   50 tokens of structured meaning
agent guesses #submit-btn-v2     │   { goal: "submit the form" }
no replay, no audit              │   JSONL trace, deterministic replay
re-learns every visit            │   action memory, 7x faster on visit 3
blocked by every cookie wall     │   auto-dismiss + force-removal
no captcha story                 │   hCaptcha / reCAPTCHA v2 / Turnstile via 2Captcha
no fingerprint defenses          │   per-context WebGL/canvas/audio noise
single integration: library      │   library + MCP + HTTP + WS + SSE + replay
locked to one LLM vendor         │   17 providers, one env var swap (Claude/GPT/
                                 │   Gemini/Ollama/vLLM/Groq/Together/Fireworks/...)

Five-minute demo

git clone https://github.com/AshtonVaughan/agentbrowser
cd agentbrowser
npm install && npx playwright install chromium && npm run build

# Pick ANY LLM provider:
ANTHROPIC_API_KEY=sk-ant-...   npm run http   # Claude (default)
# or
OPENAI_API_KEY=sk-...          npm run http   # GPT
# or
GOOGLE_API_KEY=...             npm run http   # Gemini
# or
GROQ_API_KEY=gsk_...           npm run http   # Groq (Llama on LPUs)
# or run fully local with Ollama:
ollama serve &
AGENTBROWSER_LLM_PROVIDER=ollama OLLAMA_MODEL=llama3.2-vision npm run http
# or use Ollama Cloud:
OLLAMA_CLOUD_API_KEY=... npm run http

In another terminal:

# Create a session
curl -X POST localhost:3100/api/v1/sessions
# → { "session_id": "abc123..." }

# Plan + execute a goal end-to-end
curl -X POST localhost:3100/api/v1/sessions/abc123/plan \
  -H 'content-type: application/json' \
  -d '{"goal":"go to news.ycombinator.com and click the top story"}'
# → { success: true, steps: [...], duration_ms: 4200 }

# Watch it live (open is the macOS command; on Linux use xdg-open)
open ui/operator/index.html

The agent's cursor moves across the screen in a humanlike way. Every cursor.move, click, and page change streams to the operator UI in real time. Every action is recorded to ~/.agentbrowser/traces/ for replay.
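The same two HTTP calls can be made from TypeScript. This is a minimal sketch, assuming only the endpoints and the session_id field shown in the curl demo; runGoal and FetchLike are hypothetical names, and the bearer header matters only when API keys are configured.

```typescript
// Hypothetical helper, not part of the AgentBrowser SDK.
// FetchLike is a narrow stand-in for a fetch-style function so the sketch can be tested offline.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body?: string },
) => Promise<{ json(): Promise<any> }>;

async function runGoal(baseUrl: string, goal: string, fetchImpl: FetchLike, apiKey?: string): Promise<any> {
  const auth: Record<string, string> = {};
  if (apiKey) auth['authorization'] = `Bearer ${apiKey}`;

  // 1. Create a session (no body, so no content-type header) and read its id.
  const created = await fetchImpl(`${baseUrl}/api/v1/sessions`, { method: 'POST', headers: auth });
  const { session_id } = await created.json();

  // 2. Ask the planner to decompose and execute the goal.
  const planned = await fetchImpl(`${baseUrl}/api/v1/sessions/${session_id}/plan`, {
    method: 'POST',
    headers: { ...auth, 'content-type': 'application/json' },
    body: JSON.stringify({ goal }),
  });
  return planned.json();
}
```

On Node 18+ the global fetch should be usable as the fetchImpl argument, for example runGoal('http://localhost:3100', 'go to news.ycombinator.com and click the top story', fetch).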


What you get

Visible humanlike cursor

  • SVG cursor sprite injected via context init script
  • Bezier-curve trajectories with jitter, ease-in-out, optional overshoot
  • All input via CDP Input.dispatchMouseEvent (not Playwright locators)
  • Click ripple animation, fading 14-point cursor trail
  • Per-trajectory deterministic seed for replay reproducibility
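
To make the trajectory bullets concrete, here is a minimal sketch of a seeded cubic Bezier path with ease-in-out timing and fading jitter. It is not the code in src/input/trajectory.ts; mulberry32, bezierPath, the control-point placement, and the default step and jitter values are all assumptions chosen for illustration.

```typescript
// Hypothetical sketch: these names are illustrative, not AgentBrowser exports.
type Point = { x: number; y: number };

// Small seeded PRNG: the same seed always produces the same sequence.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Ease-in-out: slow start, fast middle, slow finish.
function easeInOut(t: number): number {
  return t < 0.5 ? 2 * t * t : 1 - 2 * (1 - t) * (1 - t);
}

// One coordinate of a cubic Bezier curve at parameter t.
function cubic(p0: number, p1: number, p2: number, p3: number, t: number): number {
  const u = 1 - t;
  return u * u * u * p0 + 3 * u * u * t * p1 + 3 * u * t * t * p2 + t * t * t * p3;
}

function bezierPath(from: Point, to: Point, seed: number, steps = 30, jitterPx = 1.5): Point[] {
  const rand = mulberry32(seed);
  const dx = to.x - from.x;
  const dy = to.y - from.y;
  // Push both control points sideways by a seeded amount so the path bends.
  const bend = (rand() - 0.5) * 0.4;
  const c1 = { x: from.x + dx * 0.3 - dy * bend, y: from.y + dy * 0.3 + dx * bend };
  const c2 = { x: from.x + dx * 0.7 - dy * bend, y: from.y + dy * 0.7 + dx * bend };
  const path: Point[] = [];
  for (let i = 0; i <= steps; i++) {
    const t = easeInOut(i / steps);
    // Jitter is zero at both ends so the path starts and lands exactly on target.
    const fade = i === 0 || i === steps ? 0 : Math.sin(Math.PI * (i / steps));
    path.push({
      x: cubic(from.x, c1.x, c2.x, to.x, t) + (rand() - 0.5) * jitterPx * fade,
      y: cubic(from.y, c1.y, c2.y, to.y, t) + (rand() - 0.5) * jitterPx * fade,
    });
  }
  return path;
}
```

Because every random draw comes from the seeded generator, replaying with the same seed reproduces the same points. Each point would then be sent as a CDP Input.dispatchMouseEvent call with type 'mouseMoved'.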

Hybrid action layer

Layer                             │  What it does
──────────────────────────────────│──────────────────────────────────────────────
cursor.click(x, y)                │  Direct viewport click via CDP
cursor.clickBySelector(sel)       │  Bbox-resolve, scroll-into-view, humanlike click. Stale-element auto-recovery via accessible-name lookup.
cursor.clickByText(text)          │  Text disambiguation across visually similar elements
cursor.clickByRole(role, {name})  │  ARIA-driven targeting
findAndClick({goal, ...})         │  DOM selector → text → role → vision-LLM, every step verified
executor.executeAction(name)      │  Action-memory fast path → fallback to find-and-click
planner.planAndExecute(goal)      │  LLM goal decomposition → multi-step run with retry budget
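
The findAndClick chain (selector → text → role → vision-LLM, each step verified) is essentially an ordered fallback loop. The sketch below shows that shape only; Strategy, Outcome, and firstVerified are hypothetical names, and the real signatures in src/runtime/find.ts may differ.

```typescript
// Hypothetical types: a strategy tries one way of clicking and reports whether it acted.
type Strategy = { name: string; attempt: () => Promise<boolean> };
type Outcome = { success: boolean; via?: string; tried: string[] };

async function firstVerified(strategies: Strategy[], verify: () => Promise<boolean>): Promise<Outcome> {
  const tried: string[] = [];
  for (const strategy of strategies) {
    tried.push(strategy.name);
    try {
      // A strategy only counts if it acted AND the verifier saw the page change.
      if ((await strategy.attempt()) && (await verify())) {
        return { success: true, via: strategy.name, tried };
      }
    } catch {
      // A thrown error (for example a stale selector) is treated as a miss; fall through.
    }
  }
  return { success: false, tried };
}
```

Ordering the strategies from cheapest to most expensive means the vision-LLM call is only reached when the DOM-based attempts have all failed or gone unverified.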

Vision pipeline

  • extractElementBoxes(page) returns rich element catalog: id / role / tag / accessible name / value / bbox / selector / disabled
  • bboxScreenshot(sessionId) returns a viewport PNG with numbered cyan boxes drawn on every interactive target + the element list
  • VisionLLM.decide(goal, screenshot, elements) sends the goal, screenshot, and element list to the vision LLM (Claude Sonnet), then parses the reply into {element_id, action, rationale}
  • cursor.clickByBox(bbox) clicks vision-derived coordinates with the visible cursor

Self-healing

  • Action verifier snapshots ElementBoxes before every action, diffs after settle, declares verified=true on URL change / added / removed / textChanged / moved elements
  • Stale-element recovery in cursor.clickBySelector falls back to getByText(originalText) on selector failure
  • Modal interrupter detects fixed/absolute high-z-index dialogs at viewport center, classifies as blocker (cookie/consent/subscribe keywords) or user-relevant (login dialogs), auto-dismisses blockers and retries
  • Selector library learns from verified outcomes only - no entry in memory unless the click actually changed the page
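
The verifier's before/after comparison can be sketched as a plain snapshot diff. This is an illustration under stated assumptions: elements are keyed by selector, a 2 px threshold counts as "moved", and the ElementBox shape here is trimmed down from the catalog described above.

```typescript
// Hypothetical, trimmed-down shapes; the real ElementBox carries more fields.
type ElementBox = { selector: string; text: string; bbox: { x: number; y: number } };
type Snapshot = { url: string; elements: ElementBox[] };
type SnapshotDiff = {
  urlChanged: boolean;
  added: string[];
  removed: string[];
  textChanged: string[];
  moved: string[];
  verified: boolean;
};

function indexBySelector(elements: ElementBox[]): Map<string, ElementBox> {
  const map = new Map<string, ElementBox>();
  for (const el of elements) map.set(el.selector, el);
  return map;
}

function diffSnapshots(before: Snapshot, after: Snapshot, movePx = 2): SnapshotDiff {
  const prev = indexBySelector(before.elements);
  const next = indexBySelector(after.elements);
  const added = after.elements.filter((e) => !prev.has(e.selector)).map((e) => e.selector);
  const removed = before.elements.filter((e) => !next.has(e.selector)).map((e) => e.selector);
  const textChanged: string[] = [];
  const moved: string[] = [];
  for (const el of after.elements) {
    const old = prev.get(el.selector);
    if (!old) continue; // already counted as added
    if (old.text !== el.text) textChanged.push(el.selector);
    const shifted = Math.abs(old.bbox.x - el.bbox.x) > movePx || Math.abs(old.bbox.y - el.bbox.y) > movePx;
    if (shifted) moved.push(el.selector);
  }
  const urlChanged = before.url !== after.url;
  // Any observable change counts as evidence that the action did something.
  const verified = urlChanged || added.length + removed.length + textChanged.length + moved.length > 0;
  return { urlChanged, added, removed, textChanged, moved, verified };
}
```

A click that leaves the URL and every element untouched yields verified=false, which is what keeps no-op clicks out of the selector library.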

Site + action memory

  • SQLite WAL for concurrent reads. Per-domain selector library + page-model cache.
  • ActionMemory - SHA-1 page signature × goal hash → selector × success/fail counters. Visit #2 to a known page costs zero LLM calls.
  • Cross-domain transfer: recallByGoal(goal, excludeDomain) returns winning selectors from OTHER domains for the same logical goal. The system learned "submit payment → button#pay-btn" on stripe.com; it tries the same selector on paddle.com as a hypothesis.
  • decay(unusedSinceMs) halves stale entry counts so the library stays healthy as sites change.
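
A minimal in-memory sketch of the ActionMemory idea follows: key by SHA-1 page signature and goal hash, keep success/fail counters per selector, and halve counters for stale entries. The real store is SQLite-backed; the signature recipe (sorted role:name pairs), the class name, and the reading of unusedSinceMs as a cutoff timestamp are all assumptions.

```typescript
import { createHash } from 'node:crypto';

const sha1 = (s: string): string => createHash('sha1').update(s).digest('hex');

// Assumption: a page is identified by its sorted role:name pairs, so element order does not matter.
function pageSignature(elements: Array<{ role: string; name: string }>): string {
  return sha1(elements.map((e) => `${e.role}:${e.name}`).sort().join('|'));
}

type MemoryEntry = { selector: string; success: number; fail: number; lastUsedMs: number };

class ActionMemorySketch {
  private entries = new Map<string, MemoryEntry[]>();

  private key(signature: string, goal: string): string {
    return `${signature}:${sha1(goal.trim().toLowerCase())}`;
  }

  record(signature: string, goal: string, selector: string, ok: boolean, nowMs: number): void {
    const k = this.key(signature, goal);
    const list = this.entries.get(k) ?? [];
    let entry = list.find((e) => e.selector === selector);
    if (!entry) {
      entry = { selector, success: 0, fail: 0, lastUsedMs: nowMs };
      list.push(entry);
    }
    if (ok) entry.success++;
    else entry.fail++;
    entry.lastUsedMs = nowMs;
    this.entries.set(k, list);
  }

  // Returns the selector with the best net record, or undefined on a cache miss.
  recall(signature: string, goal: string): MemoryEntry | undefined {
    const list = this.entries.get(this.key(signature, goal)) ?? [];
    return list
      .filter((e) => e.success > e.fail)
      .sort((a, b) => b.success - b.fail - (a.success - a.fail))[0];
  }

  // Halve the counters of every entry last used before the cutoff timestamp.
  decay(unusedSinceMs: number): void {
    this.entries.forEach((list) =>
      list.forEach((e) => {
        if (e.lastUsedMs < unusedSinceMs) {
          e.success = Math.floor(e.success / 2);
          e.fail = Math.floor(e.fail / 2);
        }
      }),
    );
  }
}
```

A recall hit is what allows a repeat visit to skip the LLM: the executor can try the remembered selector first and only fall back to find-and-click on a miss.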

Recorder + replay

  • Every action streamed as JSONL to ~/.agentbrowser/traces/<session-id>.jsonl
  • ReplayEngine reads a trace, dispatches events to a fresh session at configurable speed
  • compactTrace() collapses 60-event cursor.move trajectories into 1, merges consecutive cursor.type events, drops micro-waits
  • Plan audit captures screenshot + element list at every step boundary - compliance-grade replay primitive
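
The three compaction rules listed for compactTrace() can be sketched in a single pass over the events. The event shapes below are simplified assumptions (only cursor.move and cursor.type are named in this README; 'wait' and 'cursor.click' are placeholders), and the 50 ms micro-wait threshold is illustrative.

```typescript
// Hypothetical, simplified event shapes for illustration.
type TraceEvent =
  | { type: 'cursor.move'; x: number; y: number }
  | { type: 'cursor.type'; text: string }
  | { type: 'cursor.click'; x: number; y: number }
  | { type: 'wait'; ms: number };

function compactTraceSketch(events: TraceEvent[], minWaitMs = 50): TraceEvent[] {
  const out: TraceEvent[] = [];
  for (const ev of events) {
    const last = out[out.length - 1];
    // Rule 3: drop waits too short to matter on replay.
    if (ev.type === 'wait' && ev.ms < minWaitMs) continue;
    // Rule 1: a run of moves collapses to its final position.
    if (ev.type === 'cursor.move' && last?.type === 'cursor.move') {
      out[out.length - 1] = ev;
      continue;
    }
    // Rule 2: consecutive typing merges into one event.
    if (ev.type === 'cursor.type' && last?.type === 'cursor.type') {
      out[out.length - 1] = { type: 'cursor.type', text: last.text + ev.text };
      continue;
    }
    out.push(ev);
  }
  return out;
}
```

Note that this keeps only the end point of each move run, so a compacted trace records where the cursor went but not the path it took.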

Skill library

  • A "skill" is a recorded trace with named slot tokens ($email, $password)
  • SkillLibrary.parameterize(events, slots) replaces literal values with tokens (longest-first to avoid partial-match bugs)
  • Save / load / list / delete via JSON files at ~/.agentbrowser/skills/
  • Portable .skill.json packages with format magic + version + metadata (author, license, tags) - bundle a skill with your agent code or publish to a registry; users import + bind their own credentials
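
The longest-first rule in parameterize matters when one recorded value is a prefix of another (a username inside an email address, say). Below is a small sketch of that idea plus the reverse binding step; parameterizeSketch, bindSketch, and the SkillEvent shape are hypothetical, not the SkillLibrary API.

```typescript
// Hypothetical event shape: only the typed text is parameterized here.
type SkillEvent = { type: string; text?: string };

const escapeRegExp = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// slots maps a slot name to the literal value that was recorded, e.g. { email: 'ann@example.com' }.
function parameterizeSketch(events: SkillEvent[], slots: Record<string, string>): SkillEvent[] {
  const names = Object.keys(slots)
    .filter((name) => slots[name].length > 0)
    .sort((a, b) => slots[b].length - slots[a].length); // longest literal first
  if (names.length === 0) return events;
  // Regex alternation tries alternatives in order, so the longest literal wins at each position.
  const pattern = new RegExp(names.map((n) => escapeRegExp(slots[n])).join('|'), 'g');
  return events.map((ev) => {
    if (ev.text === undefined) return ev;
    const text = ev.text.replace(pattern, (match) => {
      const name = names.find((n) => slots[n] === match);
      return name ? `$${name}` : match;
    });
    return { ...ev, text };
  });
}

// Replace $slot tokens with the caller's own values; unknown tokens are left untouched.
function bindSketch(events: SkillEvent[], bindings: Record<string, string>): SkillEvent[] {
  return events.map((ev) =>
    ev.text === undefined
      ? ev
      : { ...ev, text: ev.text.replace(/\$(\w+)/g, (token, name) => bindings[name] ?? token) },
  );
}
```

Replacing in a single regex pass also avoids a second class of bug: a short literal can never match inside a token that an earlier replacement just inserted.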

CAPTCHA

  • TwoCaptchaSolver for hCaptcha + reCAPTCHA v2 + Cloudflare Turnstile via 2Captcha API
  • Page-side DETECT_CAPTCHA_SCRIPT finds sitekeys for all three types
  • solveCaptchaIfPresent(page, solver) chain: detect → solve → inject token → fire change/input events → invoke data-callback
  • Pluggable via CaptchaSolver interface (drop in AntiCaptcha, CapMonster, etc.)

Anti-fingerprinting

  • applyFingerprintShield(context) per-context init script
  • Spoofs WebGL UNMASKED_VENDOR/RENDERER (5 GPU profiles), navigator.hardwareConcurrency, navigator.deviceMemory, AudioContext (1e-7 noise), Canvas toDataURL (~0.08% pixel jitter), navigator.plugins
  • Per-context deterministic seed - fingerprint stays stable within a session, differs across sessions

Multi-tab

  • engine.newTab(sessionId, url?) opens a tab in the same context (shares cookies/auth)
  • Each tab has its own HumanCursor. switchTab / closeTab / listTabs.
  • HTTP: GET/POST/DELETE /sessions/:id/tabs, POST /tabs/:tab/switch

Universal LLM provider support (zero lock-in)

  • 17 providers wired - swap any of them in by setting one env var
  • LLMProvider interface (complete() + completeWithImage()); analyzer + vision-LLM + planner all use this abstraction, never SDKs
  • Auto-detection at startup picks the right provider from env vars

Provider                    │  Set                                           │  Notes
────────────────────────────│────────────────────────────────────────────────│────────────────────────────────────────────
Anthropic Claude            │  ANTHROPIC_API_KEY                             │  default if nothing else set
OpenAI                      │  OPENAI_API_KEY                                │  gpt-4o-mini default
Google Gemini               │  GOOGLE_API_KEY or GEMINI_API_KEY              │  gemini-2.5-flash, vision native
Groq                        │  GROQ_API_KEY                                  │  super-fast Llama/Mixtral on LPUs
Together AI                 │  TOGETHER_API_KEY                              │  open-source models
Fireworks                   │  FIREWORKS_API_KEY                             │  open-source models
DeepInfra                   │  DEEPINFRA_API_KEY                             │  open-source models
Mistral                     │  MISTRAL_API_KEY                               │  mistral-large-latest
Cohere                      │  COHERE_API_KEY                                │  command-r-plus via /compatibility
xAI Grok                    │  XAI_API_KEY                                   │  grok-2 with vision
OpenRouter                  │  OPENROUTER_API_KEY                            │  300+ models behind one API
Perplexity                  │  PERPLEXITY_API_KEY                            │  online-search models
Azure OpenAI                │  AZURE_OPENAI_API_KEY + AZURE_OPENAI_BASE_URL  │  enterprise tenant
Ollama (local)              │  OLLAMA_BASE_URL (default localhost:11434)     │  llama3.2, llama3.2-vision, qwen2.5vl, etc.
Ollama Cloud                │  OLLAMA_CLOUD_API_KEY                          │  hosted Ollama with turbo models
vLLM                        │  VLLM_BASE_URL (default localhost:8000)        │  self-hosted production inference
LM Studio                   │  LMSTUDIO_BASE_URL (default localhost:1234)    │  desktop GUI
llama.cpp server            │  LLAMACPP_BASE_URL (default localhost:8080)    │  tiny self-hosted
Anything OpenAI-compatible  │  presets.openaiCompatible(url)                 │  drop in your URL

// Pick a provider explicitly (any of the 17):
import { AgentBrowserHttpServer, presets } from 'agentbrowser';

const server = new AgentBrowserHttpServer({
  llm_provider: presets.ollamaCloud(),   // or .openai() or .groq() etc
  // ...
});

Or just set AGENTBROWSER_LLM_PROVIDER=ollama and the server auto-wires. Override the model with <PROVIDER>_MODEL=<model-id>.
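
Env-var auto-detection of this kind usually comes down to a small precedence table. The sketch below is a guess at the shape, not the code in src/llm/index.ts: the env var names come from this README, but the provider ids (other than ollama), the precedence order, and the detectProvider name are assumptions.

```typescript
type Env = Record<string, string | undefined>;

// Env var names are from the provider table; ids and ordering here are assumptions.
const KEY_TO_PROVIDER: Array<[string, string]> = [
  ['ANTHROPIC_API_KEY', 'anthropic'],
  ['OPENAI_API_KEY', 'openai'],
  ['GOOGLE_API_KEY', 'gemini'],
  ['GEMINI_API_KEY', 'gemini'],
  ['GROQ_API_KEY', 'groq'],
  ['OLLAMA_CLOUD_API_KEY', 'ollama_cloud'],
];

function detectProvider(env: Env): { provider: string; model?: string } {
  // 1. An explicit choice always wins.
  let provider = env['AGENTBROWSER_LLM_PROVIDER'];
  // 2. Otherwise pick the first provider whose key is set, defaulting to Anthropic.
  if (!provider) {
    const hit = KEY_TO_PROVIDER.find(([envVar]) => Boolean(env[envVar]));
    provider = hit ? hit[1] : 'anthropic';
  }
  // 3. <PROVIDER>_MODEL overrides the provider's default model.
  const model = env[`${provider.toUpperCase()}_MODEL`];
  return model ? { provider, model } : { provider };
}
```

In a Node process this would be called as detectProvider(process.env); passing the environment in as a plain object keeps the function easy to test.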

Chrome extension (drive YOUR Chrome with YOUR cookies)

  • Install extensions/chrome/ in dev mode (chrome://extensions → Load unpacked)
  • Click the AgentBrowser icon, paste server URL + API key, click Connect
  • Now an agent calling POST /api/v1/agents/<your-id>/cmd drives YOUR Chrome with YOUR cookies and login state
  • Manifest v3 + chrome.debugger for real CDP mouse events + chrome.scripting for vision/extract
  • See extensions/chrome/README.md for the full security model

Operator + recorder + memory + skills UIs

  • ui/operator/index.html - live screenshot + cursor trail overlay + event timeline + quick actions panel
  • ui/recorder/index.html - real-time WebSocket event stream + multi-lane timeline canvas + replay scrubber + JSONL export
  • ui/memory/index.html - paginated action memory browser per domain + decay control + JSON export
  • docs/pricing.html - 4-tier pricing page wired to /api/v1/billing/checkout for self-serve Stripe payments
  • docs/skills.html - skills marketplace landing with 8 curated skills (login-stripe, login-google, amazon-add-to-cart, github-create-issue, etc.)
  • All are single-file vanilla HTML/CSS/JS. No build step.

Transcription helpers (agents that "watch" videos)

  • findCaptionTracks(page) detects HTML5 <track> + YouTube playerCaptionsTracklistRenderer + custom player markup
  • parseVTT(text) / parseJSON3(json) convert standard caption formats to typed TranscriptSegment[]
  • transcribeFromCaptions(page) one-shot: detect → fetch → parse
  • transcriptToText(segments) concatenate for LLM consumption
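
For a sense of what the VTT step involves, here is a minimal WebVTT cue parser. It is a sketch, not the library's parseVTT: it handles plain cues with optional identifiers and cue settings, strips inline tags, and ignores NOTE/STYLE blocks; the TranscriptSegment fields (start, end, text) are assumed.

```typescript
// Assumed segment shape; times are in seconds.
type TranscriptSegment = { start: number; end: number; text: string };

// Accepts "mm:ss.ttt" or "hh:mm:ss.ttt".
function toSeconds(timestamp: string): number {
  return timestamp.split(':').map(Number).reduce((acc, part) => acc * 60 + part, 0);
}

function parseVttSketch(vtt: string): TranscriptSegment[] {
  const segments: TranscriptSegment[] = [];
  // Cues are separated by blank lines; normalize line endings first.
  const blocks = vtt.replace(/\r\n?/g, '\n').split(/\n{2,}/);
  for (const block of blocks) {
    const lines = block.split('\n').filter((line) => line.trim() !== '');
    const timingIndex = lines.findIndex((line) => line.includes('-->'));
    if (timingIndex === -1) continue; // WEBVTT header, NOTE, or STYLE block
    const match = lines[timingIndex].match(/^(\S+)\s+-->\s+(\S+)/);
    if (!match) continue;
    // Everything after the timing line is cue text; strip inline tags like <b> or <c.yellow>.
    const text = lines.slice(timingIndex + 1).join(' ').replace(/<[^>]+>/g, '').trim();
    if (text) segments.push({ start: toSeconds(match[1]), end: toSeconds(match[2]), text });
  }
  return segments;
}

// Flatten segments into one string for LLM consumption.
function segmentsToText(segments: TranscriptSegment[]): string {
  return segments.map((s) => s.text).join(' ');
}
```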

Production runtime

  • HTTP REST API + WebSocket + SSE on Fastify with bearer-token auth + per-key rate limiting
  • 30+ endpoints, OpenAPI 3.1 spec at /api/v1/openapi.json
  • Stripe billing wired (Checkout + webhook + signature verification + license issuance)
  • Prometheus /metrics + readiness probe + dashboard summary endpoint
  • 4-tier license scaffold (free / pro / team / enterprise) with feature gates and quota tracking
  • Multi-stage Dockerfile, docker-compose with persistent volume + 1GB shm_size for Chromium
  • Python + TypeScript SDK clients

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  Operator UI  (browser-based dashboard)                         │
│   - live screenshot stream         - cursor trail viz           │
│   - action log + reasoning         - manual takeover            │
│  Recorder UI                                                    │
│   - timeline scrubber              - replay export              │
└─────────────────────────────────────────────────────────────────┘
                          ▲
                          │  WebSocket events + SSE frames + REST
┌─────────────────────────┴───────────────────────────────────────┐
│                   AgentBrowser HTTP Control Plane               │
│  ┌──────────────────┐  ┌──────────────────┐  ┌────────────────┐ │
│  │  REST + WS + SSE │  │  Bearer auth +   │  │  License +     │ │
│  │  Fastify         │  │  per-key rate    │  │  quota system  │ │
│  └──────────────────┘  └──────────────────┘  └────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                          ▲
                          │
┌─────────────────────────┴───────────────────────────────────────┐
│                    AgentBrowser Core Runtime                    │
│  ┌──────────────────┐  ┌──────────────────┐  ┌────────────────┐ │
│  │  Planner         │  │  findAndClick    │  │  Recorder +    │ │
│  │  (goal → steps)  │  │  hybrid action   │  │  Replay engine │ │
│  └──────────────────┘  └──────────────────┘  └────────────────┘ │
│  ┌──────────────────┐  ┌──────────────────┐  ┌────────────────┐ │
│  │  Verifier        │  │  Modal           │  │  Action memory │ │
│  │  (diff snapshots)│  │  interrupter     │  │  (skip LLM)    │ │
│  └──────────────────┘  └──────────────────┘  └────────────────┘ │
│  ┌──────────────────┐  ┌──────────────────┐  ┌────────────────┐ │
│  │  Vision pipeline │  │  HumanCursor     │  │  Site memory   │ │
│  │  bbox + annotate │  │  Bezier + CDP    │  │  WAL SQLite    │ │
│  │  + VisionLLM     │  │  trail + ripple  │  │  + selectors   │ │
│  └──────────────────┘  └──────────────────┘  └────────────────┘ │
│  ┌──────────────────┐  ┌──────────────────┐  ┌────────────────┐ │
│  │  Anti-fingerprint│  │  Captcha solver  │  │  LLM provider  │ │
│  │  (canvas/WebGL)  │  │  + auto-inject   │  │  (Anthropic /  │ │
│  │                  │  │                  │  │   OpenAI / etc)│ │
│  └──────────────────┘  └──────────────────┘  └────────────────┘ │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Browser Engine - Playwright + stealth + multi-tab     │    │
│  │  ↓                                                      │    │
│  │  Chromium (the real browser, with the visible cursor)  │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Quick start - 4 ways

As a TypeScript library

import { AgentBrowser } from 'agentbrowser';

const browser = new AgentBrowser({
  anthropic_api_key: process.env.ANTHROPIC_API_KEY,
  headless: false,    // watch the cursor move
  stealth: true,
});
await browser.launch();

const state = await browser.navigate('https://news.ycombinator.com');
console.log(state.page_type);          // 'listing'
console.log(state.available_actions);  // [{ name: 'navigate_to_new', ... }, ...]

await browser.action('navigate_to_new');

const data = await browser.extract({
  top_story: 'title of the top story',
  points: 'upvote count',
  author: 'submitter username',
});

await browser.close();

Via HTTP

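# Create a session and capture its id (requires jq)
ID=$(curl -s -X POST http://localhost:3100/api/v1/sessions | jq -r .session_id)
curl -X POST http://localhost:3100/api/v1/sessions/$ID/navigate \
  -H 'content-type: application/json' -d '{"url":"https://example.com"}'
curl -X POST http://localhost:3100/api/v1/sessions/$ID/find_and_click \
  -H 'content-type: application/json' -d '{"goal":"submit the form"}'
# Annotated PNG + element list
curl http://localhost:3100/api/v1/sessions/$ID/screenshot/bbox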

Python SDK

from agentbrowser import AgentBrowserClient

client = AgentBrowserClient("http://localhost:3100", api_key="...")
with client.create_session() as s:
    s.navigate("https://example.com")
    s.click(selector="a")
    png, elements = s.annotated_screenshot()

MCP server (Claude Code, etc.)

Add to ~/.claude.json:

{
  "mcpServers": {
    "agentbrowser": {
      "command": "node",
      "args": ["/path/to/agentbrowser/dist/server/mcp.js"],
      "env": { "ANTHROPIC_API_KEY": "sk-..." }
    }
  }
}

Repository structure

src/
├── engine/             Browser lifecycle, sessions, tabs
│   ├── browser.ts      Playwright wrapper, cursor wiring, popup handling
│   └── tabs.ts         Multi-tab manager
├── input/              Cursor + physics + fingerprint shield
│   ├── cursor.ts       HumanCursor: SVG overlay, CDP input, click/drag/type
│   ├── trajectory.ts   Bezier path generator with jitter + overshoot
│   └── fingerprint.ts  WebGL/canvas/audio anti-fingerprint patches
├── vision/             Vision pipeline
│   ├── bbox.ts         Element catalog extraction
│   ├── annotate.ts     Numbered-box screenshot annotator
│   ├── diff.ts         Snapshot diff (added/removed/moved/textChanged)
│   └── llm.ts          Claude vision integration
├── runtime/            Action execution + autonomy
│   ├── executor.ts     Action runner with action-memory hot path
│   ├── find.ts         Hybrid DOM-then-vision findAndClick
│   ├── verifier.ts     Action verification via snapshot diff
│   ├── modal-interrupter.ts  Cookie/consent/popup detection
│   ├── planner.ts      LLM goal decomposition + multi-step execution
│   ├── plan-audit.ts   Per-step screenshot + element capture
│   ├── recorder.ts     JSONL action stream
│   ├── replay.ts       Deterministic trace replay
│   ├── compactor.ts    Trace compaction (collapse cursor.move runs)
│   ├── events.ts       In-process event broker (pub/sub)
│   ├── captcha.ts      2Captcha API integration
│   └── captcha-solver.ts  Detect → solve → inject pipeline
├── memory/             Persistent storage
│   ├── store.ts        Site memory (page model cache + selector library)
│   └── action-memory.ts  Per-(page,goal) selector cache + cross-domain transfer
├── llm/                Provider abstraction
│   ├── provider.ts     LLMProvider interface
│   ├── anthropic.ts    Anthropic SDK wrapper
│   ├── openai.ts       OpenAI-compatible (works with vLLM/Ollama too)
│   └── index.ts        autoDetectProvider
├── semantic/           Page analysis
│   └── analyzer.ts     Page-to-SemanticPageModel via LLM
├── skills/             Skill library
│   ├── skills.ts       Parameterize/save/load/run
│   └── package.ts      .skill.json export/import format
├── server/             Production server surfaces
│   ├── http.ts         Fastify + REST + WS + SSE + auth + rate limit
│   ├── openapi.ts      OpenAPI 3.1 spec
│   ├── license.ts      Tier system + quota tracking
│   └── mcp.ts          MCP server for Claude Code et al.
├── bin/
│   └── http.ts         HTTP server entry point
├── util/
│   └── coords.ts       Viewport ↔ document ↔ screenshot coord conversions
├── types.ts            Shared types (SemanticPageModel, ActionDefinition, ...)
└── index.ts            Public API barrel

tests/                  Vitest suite (119 tests, 21 files)
clients/
├── python/             Python SDK (zero deps, stdlib urllib)
└── typescript/         TypeScript SDK (browser + Node compatible)
ui/
├── operator/           Live dashboard (single HTML file)
└── recorder/           Real-time trace timeline (single HTML file)
docs/                   Public docs site (single HTML file)
examples/               Working agent demos
.agentbrowser-meta/     Internal build planning + iteration log
Dockerfile              Multi-stage slim image (~700MB with Chromium)
docker-compose.yml      Local stack with persistent volume

API surface

The HTTP endpoints are listed below. All are bearer-authenticated when API keys are configured, except those marked (no auth). Full OpenAPI 3.1 at /api/v1/openapi.json.

Method Path Purpose
GET /health Liveness (no auth)
POST /api/v1/sessions Create session
DELETE /api/v1/sessions/:id Destroy session
POST /api/v1/sessions/:id/navigate Go to URL
POST /api/v1/sessions/:id/cursor/{move,click,drag,scroll,type,press} Cursor primitives
POST /api/v1/sessions/:id/find_and_click Hybrid DOM → vision-LLM action
POST /api/v1/sessions/:id/plan LLM goal decomposition + multi-step execute
POST /api/v1/sessions/:id/extract Schema-driven LLM extraction
POST /api/v1/sessions/:id/fill Fill a named form
GET /api/v1/sessions/:id/screenshot Viewport PNG
GET /api/v1/sessions/:id/screenshot/bbox Annotated PNG + element list
GET /api/v1/sessions/:id/screenshot/stream SSE PNG frames
WS /api/v1/sessions/:id/screenshot/ws Binary PNG frames
GET /api/v1/sessions/:id/state Current SemanticPageModel
GET /api/v1/sessions/:id/elements ElementBox[]
GET/POST/DELETE /api/v1/sessions/:id/tabs Multi-tab control
POST /api/v1/sessions/:id/solve_captcha Detect + solve + inject
WS /api/v1/sessions/:id/events Live event stream
POST /api/v1/sessions/:id/back History back
POST /api/v1/sessions/:id/forward History forward
POST /api/v1/sessions/:id/reload Reload current page
POST /api/v1/sessions/:id/dialog Accept/dismiss next native alert/confirm/prompt
GET/POST/DELETE /api/v1/sessions/:id/cookies Read/write/clear browser cookies
GET/POST/DELETE /api/v1/sessions/:id/storage Read/write/clear localStorage or sessionStorage (?kind=local|session)
POST /api/v1/sessions/:id/upload Set files on a file input ({selector, paths})
POST /api/v1/sessions/:id/print Render current page to PDF (returns application/pdf)
POST /api/v1/sessions/:id/route/block Block all requests matching a glob pattern
POST /api/v1/sessions/:id/route/headers Inject headers into requests matching a pattern
POST /api/v1/sessions/:id/route/mock Mock response body for matching requests
DELETE /api/v1/sessions/:id/route Remove all route handlers (passthrough)
POST /api/v1/sessions/:id/geolocation Override reported coords (or null to clear)
POST /api/v1/sessions/:id/viewport Resize viewport mid-session
POST /api/v1/sessions/:id/headers Set extra HTTP headers for all requests
POST /api/v1/sessions/:id/har/start Start HAR network capture
GET /api/v1/sessions/:id/har/peek Get current entries without stopping
POST /api/v1/sessions/:id/har/stop Stop and return all captured entries
POST /api/v1/sessions/:id/console/start|peek|stop Capture console messages + uncaught errors
POST /api/v1/sessions/:id/throttle/network Throttle network (downloadThroughput/uploadThroughput/latencyMs/offline)
POST /api/v1/sessions/:id/throttle/cpu CPU slowdown multiplier (1=native, 4=4x slower)
POST /api/v1/sessions/:id/locale Override navigator.language + Accept-Language
POST /api/v1/sessions/:id/timezone Override page timezone (e.g. "Asia/Tokyo")
POST/DELETE /api/v1/sessions/:id/permissions Grant/clear browser permissions (clipboard, notifications, etc.)
POST /api/v1/sessions/:id/record/start Begin in-memory recording for skill creation
GET /api/v1/sessions/:id/record/peek Live event count while recording
POST /api/v1/sessions/:id/record/stop Stop and (optionally) save events as a skill ({name, slots, description})
POST /api/v1/sessions/:id/har/replay Re-execute HAR entries and compare statuses
GET /api/v1/sessions/:id/service-workers List active service workers
GET/POST /api/v1/sessions/:id/snapshot Export full session state (cookies + storage + IDB); POST /snapshot/restore to import
POST/GET /api/v1/sessions/:id/downloads/start|stop Auto-capture all downloads to a session-tagged dir
GET/POST /api/v1/sessions/:id/clipboard Read/write the page's clipboard via navigator.clipboard
POST /api/v1/sessions/:id/wait/selector|text|network-idle|function Smart waiters with timeouts
GET /api/v1/sessions/:id/markdown Extract clean RAG-friendly markdown from current page
GET /api/v1/skills/:name/versions List archived versions of a skill
POST /api/v1/skills/:name/rollback Restore a previous version ({version: N})
GET /api/v1/sessions/:id/activity Idle time in ms for a session
POST /api/v1/sessions/:id/touch Reset idle counter
GET /api/v1/sessions/expired List sessions past idle timeout
POST /api/v1/snapshot/diff Diff two snapshots ({a, b}) returns added/removed/changed
POST /api/v1/sessions/:id/click_by_description Vision-only click ("the blue Submit button")
POST /api/v1/pool/warmup Pre-create N empty contexts for sub-100ms session creation
GET /api/v1/pool/status Warm pool size + oldest entry age
POST /api/v1/pool/drain Close all warm contexts
POST /api/v1/sessions/:id/copilot/install Inject highlight overlay for "AI is here" hints
POST /api/v1/sessions/:id/copilot/highlight Highlight a bbox with optional label
POST /api/v1/pool/auto_refill/start|stop Background job that keeps pool topped up
POST /api/v1/action_memory/predict Predict next action from memory ({url, elements, goal?})
GET /api/v1/skills/marketplace Static skill catalog (slug, tags, quality, runs, success_rate) for GitHub Pages hosting
GET /api/v1/vision/cache/stats Vision LLM cache hits/misses + persistent_size if SQLite-backed
POST /api/v1/vision/cache/clear Empty in-memory + on-disk vision cache
GET /api/v1/sessions/:id/har/export Captured HAR exported as standard HAR 1.2 (Chrome DevTools-importable)
POST /api/v1/vision/cache/prune Evict oldest entries beyond cache_max_disk_entries + VACUUM
POST /api/v1/traces/diff Diff two trace event arrays (regression testing)
POST /api/v1/sessions/:id/shortcut Execute named keyboard shortcut (newTab/copy/find/etc)
GET /api/v1/shortcuts List available named shortcuts
GET /api/v1/dump Single-call snapshot of all server state
POST/GET/DELETE /api/v1/plans[/:slug] Save / list / load / delete reusable Plan blueprints
POST /api/v1/sessions/:id/shortcut/chain Execute multiple named shortcuts in sequence
GET /api/v1/metrics/summary Per-histogram p50/p95/p99/mean/count percentiles
GET/POST/DELETE /api/v1/rate_limits Per-domain RPS limit (token-bucket throttle on navigate)
POST /api/v1/sessions/:id/plans/:slug/run Load + execute a saved Plan blueprint
POST /api/v1/skills/:name/to_plan Convert a recorded skill into a Plan ({save?, slug?})
GET /api/v1/action_memory/search?pattern=... Search action memory by selector substring
POST /api/v1/parallel/extract Spawn N parallel sessions, extract markdown from each URL
POST /api/v1/plans/compose Chain N saved plans into a super-plan
POST /api/v1/sessions/:id/form/autofill Auto-fill form inputs by name/label/placeholder match
POST /api/v1/skills/diff Compare two skills' events ({a, b})
GET/POST/DELETE /api/v1/webhooks[/:id] Subscribe to events, POST to external URL (HMAC-signed when secret set)
POST /api/v1/webhooks/:id/test Fire a test delivery to verify connectivity
POST /api/v1/batch/csv Process a CSV: navigate per row, extract markdown, return enriched results
GET /api/v1/webhooks/queue Pending webhook retry count
POST/GET /api/v1/skills/:name/cost Record / read per-skill LLM cost (token usage × rates)
GET /api/v1/skills/cost/leaderboard Top-N most expensive skills by total cost
POST /api/v1/skills/validate Validate a skill's structure before save ({skill})
POST /api/v1/skills/:name/auto_tag LLM-suggest 3-5 tags from skill description + events
POST /api/v1/traces/render Render TraceEvent[] as a self-contained HTML timeline page
POST /api/v1/skills/:name/auto_describe LLM-write a one-line description from skill events
GET /api/v1/skills/:name/suggested_selectors Cross-skill: selectors that worked for the same goal on other domains
GET /api/v1/action_memory/export.csv Download action memory as CSV (?domain=&limit=)
POST /api/v1/action_memory/query Composite filter+sort query (domain, goal substr, selector substr, min_success_rate, min_runs, sort_by)
GET /api/v1/sessions/:id/a11y Accessibility audit (missing alt/label, heading skips, empty links, missing lang)
GET /api/v1/skills/:name/bundle Export skill as .agbpkg (skill + plan + stats + readme)
POST /api/v1/skills/bundle/import Import an .agbpkg bundle
GET /api/v1/analytics/domain/:domain Per-domain action memory analytics + top selectors
GET/POST/DELETE /api/v1/schedules[/:id] Recurring skill execution ({skill_name, spec: "every 5m", bindings})
POST /api/v1/skills/recommend Recommend skills matching a goal text ({goal, limit?, min_score?})
POST /api/v1/plan_templates/:name/to_skill Convert a built-in plan template into a runnable skill
POST/GET /api/v1/sessions/:id/network/start|peek|stop Per-session network bytes tracking
GET /api/v1/sessions/:id/contrast WCAG color contrast audit
GET /api/v1/fingerprints List browser fingerprint presets (mac-chrome, iphone-15-pro, tokyo-iphone, etc.)
POST /api/v1/sessions/:id/fingerprint Apply a preset ({preset_id}) - viewport + UA + locale + timezone
GET /api/v1/sessions/:id/memory Per-session heap/rss/external delta from session creation
POST /api/v1/sessions/:id/memory/snapshot Re-baseline the session memory snapshot
GET /api/v1/health/full Detailed system health: process, engine, scheduler, billing, auth posture
GET /api/v1/sessions/:id/cpu Per-session CPU delta (user/system/wall ms + cpu_percent)
POST /api/v1/sessions/:id/cpu/snapshot Re-baseline the session CPU snapshot
POST /api/v1/action_memory/distill Top selectors across distinct domains (ship as starter packs)
POST /api/v1/action_memory/import_patterns Seed memory with distilled patterns from another deployment
POST /api/v1/sessions/:id/skills/auto_run Match goal to best skill and execute (no LLM round-trip)
GET/POST/DELETE /api/v1/skills/ab[/:key] Register weighted A/B routes between skill versions
POST /api/v1/sessions/:id/skills/ab/:key/run Run an A/B-routed skill (weighted variant pick)
WS /api/v1/skills/events/ws Live skill outcome firehose (?skill=<name> filter)
GET/DELETE /api/v1/skills/ab/:key/stats Aggregated success/failure stats per variant
POST /api/v1/skills/ab/:key/promote Auto-promote winning variant (z-test gated)
GET/DELETE /api/v1/skills/percentiles[?skill=] Per-skill p50/p95/p99 latency histograms
GET/POST/DELETE /api/v1/skills Skill library CRUD
GET /api/v1/skills/:name/export Download .skill.json
POST /api/v1/skills/import Import .skill.json
POST /api/v1/sessions/:id/skills/:name/run Replay skill with bindings
GET /api/v1/action_memory/stats Memory stats
GET /api/v1/action_memory/by_domain/:domain What does the system know about :domain
GET /api/v1/action_memory/selectors/:domain Top selectors per domain with success/fail stats
POST /api/v1/action_memory/recall_by_goal Cross-domain selector hypotheses
POST /api/v1/action_memory/similar TF-IDF / embedding action memory search
POST /api/v1/action_memory/similar_embedded Force embedding-only search
POST /api/v1/action_memory/decay Halve stale entry counts
POST /api/v1/page_similarity Score similarity between two page snapshots
POST /api/v1/skills/discover Suggest skills relevant for the current page
POST /api/v1/skills/compose Run multiple skills in sequence
GET /api/v1/skills/stats Per-skill success/fail history (filter by ?skill= for per-domain rows)
GET /api/v1/skills/leaderboard Top-N performing skills by success_count
GET /api/v1/skills/hot High-confidence skills (success_rate ≥ 0.9, ≥ 10 runs by default)
POST /api/v1/skills/prune Remove skills below a success threshold (dry-run by default)
DELETE /api/v1/skills/:name Delete a skill by name
GET /api/v1/plan_templates List built-in plan templates
GET /api/v1/diagnose Remote health diagnostic (mirrors agb-doctor)
POST /api/v1/traces/compact Compact a trace file
GET /api/v1/openapi.json Full OpenAPI 3.1 spec
GET /api/v1/dashboard Aggregated stats: site memory + action memory + skills + provider
GET /api/v1/billing/pricing Tier list with prices + limits (powers pricing page)
POST /api/v1/billing/checkout Stripe Checkout Session for {tier, email}
POST /api/v1/billing/webhook Stripe webhook receiver (signature-verified)
GET /metrics Prometheus scrape (no auth)
GET /ready Readiness probe (503 until engine launched)
GET /api/v1/agents List currently-connected Chrome extension agents
GET /api/v1/agents/:id/poll Long-poll for next command (used by Chrome extension)
POST /api/v1/agents/:id/cmd Send a command to a connected Chrome extension
POST /api/v1/agents/:id/result Extension posts back command results
DELETE /api/v1/agents/:id Drop agent state
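
The "z-test gated" promotion on /api/v1/skills/ab/:key/promote can be pictured as a standard two-proportion z-test on the variants' success rates. The sketch below is that textbook test, not the server's code: the function names, the 1.96 critical value (two-sided, 95%), and the minimum-runs guard are assumptions.

```typescript
type VariantStats = { success: number; total: number };

// Two-proportion z statistic using a pooled standard error.
function zScore(a: VariantStats, b: VariantStats): number {
  if (a.total === 0 || b.total === 0) return 0;
  const pA = a.success / a.total;
  const pB = b.success / b.total;
  const pooled = (a.success + b.success) / (a.total + b.total);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / a.total + 1 / b.total));
  return se === 0 ? 0 : (pA - pB) / se;
}

// Returns the variant to promote, or null when the gap is not statistically significant.
function pickWinner(a: VariantStats, b: VariantStats, zCritical = 1.96, minRuns = 30): 'a' | 'b' | null {
  if (a.total < minRuns || b.total < minRuns) return null; // too little data to judge
  const z = zScore(a, b);
  if (z >= zCritical) return 'a';
  if (z <= -zCritical) return 'b';
  return null;
}
```

For example, 90/100 successes against 70/100 gives a pooled rate of 0.8, a standard error of about 0.0566, and z ≈ 3.54, which clears the 1.96 gate; 50/100 against 52/100 gives z ≈ -0.28 and no promotion.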

Production deployment

Docker

docker compose up -d
# AgentBrowser on http://localhost:3100
# Memory + traces persist in named volume

Auth + rate limit

AGENTBROWSER_API_KEYS=key1,key2 \
ANTHROPIC_API_KEY=sk-ant-... \
node dist/bin/http.js

Rate limit defaults to 600 req/min/key. Override with AGENTBROWSER_RATE_LIMIT_PER_MINUTE.
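
Per-key limiting at 600 requests per minute can be sketched with a fixed-window counter. The README does not say which algorithm the server uses, so the fixed window, the class name, and the injectable clock are all assumptions for illustration.

```typescript
// Hypothetical fixed-window limiter: one counter per API key, reset every windowMs.
class PerKeyRateLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit = 600, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed; nowMs is injectable for testing.
  allow(key: string, nowMs: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (!current || nowMs - current.start >= this.windowMs) {
      // First request for this key, or the previous window has expired.
      this.windows.set(key, { start: nowMs, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false; // caller would answer 429
    current.count++;
    return true;
  }
}
```

A fixed window is the simplest option but allows short bursts across a window boundary; a token bucket or sliding window smooths that out at the cost of a little more state.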

Observability

  • Every action emits a typed SessionEvent to the in-process broker
  • WebSocket clients subscribe at /api/v1/sessions/:id/events (with 200-event replay buffer for reconnect)
  • JSONL traces at ~/.agentbrowser/traces/ are the audit log
  • compactTrace() keeps long traces small by collapsing cursor.move runs, merging consecutive cursor.type events, and dropping micro-waits
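
The 200-event replay buffer for reconnecting WebSocket clients can be sketched as a bounded queue with sequence numbers. The names below and the idea of resuming from a last-seen sequence number are assumptions; the README only states that the buffer holds 200 events.

```typescript
// Hypothetical event envelope: seq lets a reconnecting client say where it left off.
type BufferedEvent = { seq: number; type: string; payload?: unknown };

class ReplayBuffer {
  private events: BufferedEvent[] = [];
  private nextSeq = 1;
  private capacity: number;

  constructor(capacity = 200) {
    this.capacity = capacity;
  }

  // Store an event, evicting the oldest once the buffer is full.
  publish(type: string, payload?: unknown): BufferedEvent {
    const event: BufferedEvent = { seq: this.nextSeq++, type, payload };
    this.events.push(event);
    if (this.events.length > this.capacity) this.events.shift();
    return event;
  }

  // Events a client missed since the last sequence number it saw (0 = everything buffered).
  since(lastSeq = 0): BufferedEvent[] {
    return this.events.filter((e) => e.seq > lastSeq);
  }
}
```

A client that was disconnected for longer than 200 events cannot catch up from the buffer alone and would need to fall back to the JSONL trace.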

Tested at

Capability                │  Status
──────────────────────────│──────────────────────────────────────────────────
Type check (tsc)          │  clean
Test suite                │  119 / 119 passing across 21 files
Build (npm run build)     │  clean
Real Chromium navigation  │  verified
Bot detection bypass      │  passes Cloudflare interstitial, OneTrust, Cookiebot, Funding Choices, Reddit GDPR, Stack Overflow signup wall
Cursor click → DOM event  │  verified end-to-end
Action verifier           │  verified on URL change + element add/remove/text change
Vision pipeline           │  annotated PNG verified by magic-byte + element list
HTTP API                  │  13 integration tests against full Fastify stack
Replay determinism        │  trajectory generator deterministic per seed

License

MIT. Commercial use encouraged.

For acquisition or partnership inquiries: ashtonluca@gmail.com.

About

Browser runtime purpose-built for AI agents. Semantic tools, site memory, self-healing execution, MCP server.
