Browser automation benchmarks for Subconscious AI agents.
Runs 100 WebArena tasks (sandboxed web apps) and 30 BU Bench tasks (live internet) against the Subconscious API, producing JSON results with full reasoning traces.
All results use the tim-gpt-heavy engine with BrowserBase cloud browsers.
WebArena: 100 tasks across 4 sandboxed web applications. Evaluation is fully deterministic (string match, URL match, and DOM content checks), with no LLM judge.
| Site | Tasks | Passed | Failed | Pass Rate |
|---|---|---|---|---|
| GitLab | 25 | 20 | 5 | 80.0% ████████░░ |
| Reddit | 25 | 20 | 5 | 80.0% ████████░░ |
| Shopping Admin | 25 | 17 | 8 | 68.0% ██████▊░░░ |
| Shopping | 25 | 12 | 13 | 48.0% ████▊░░░░░ |
| Total | 100 | 69 | 31 | 69.0% ██████▉░░░ |
BU Bench: 30 tasks on the live internet. Evaluation uses Gemini 2.5 Flash as an LLM judge with screenshots.
| Category | Tasks | Passed | Failed | Pass Rate |
|---|---|---|---|---|
| WebBenchREAD | 11 | 9 | 2 | 81.8% ████████▏░ |
| InteractionTests | 19 | 12 | 7 | 63.2% ██████▍░░░ |
| Total | 30 | 21 | 9 | 70.0% ███████░░░ |
Note: BU Bench tasks run on live websites. Results may vary slightly between runs due to changing page content, network conditions, and anti-bot measures.
| Benchmark | Tasks | Pass Rate | Evaluation |
|---|---|---|---|
| WebArena | 100 | 69.0% | Deterministic (string/URL/HTML match) |
| BU Bench | 30 | 70.0% | Gemini LLM judge + screenshots |
| Overall | 130 | 69.2% | Mixed |
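The overall figure is a pooled, task-weighted pass rate rather than the mean of the two benchmark rates (which would be 69.5%). A minimal sketch of that arithmetic, using only the counts reported in the tables above; the helper name `pass_rate` is illustrative and not part of the CLI:

```python
# Pass/total counts copied from the result tables in this README.
results = {
    "WebArena": {"passed": 69, "total": 100},
    "BU Bench": {"passed": 21, "total": 30},
}


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / total, 1)


per_benchmark = {
    name: pass_rate(r["passed"], r["total"]) for name, r in results.items()
}

# Pool the raw counts so that every task carries equal weight.
total_passed = sum(r["passed"] for r in results.values())
total_tasks = sum(r["total"] for r in results.values())
overall = pass_rate(total_passed, total_tasks)

print(per_benchmark, overall)
```

Because WebArena contributes 100 of the 130 tasks, the pooled rate sits closer to the WebArena rate than to the BU Bench rate.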
- Python 3.10+
- cloudflared (`brew install cloudflared`), which auto-tunnels the local controller so the Subconscious cloud API can reach it
- A Subconscious API key
- A BrowserBase API key + project ID
- A Google AI API key (for BU Bench Gemini judge)
- For WebArena: self-hosted WebArena sites via WebArena Docker images
pip install -e .
playwright install chromium
cp .env.example .env # fill in your keys
# Preflight check: validate all API keys, connectivity, and tunnel
subconscious-browser-bench check
# Start the browser controller (separate terminal)
# Automatically creates a cloudflared tunnel for the Subconscious API to call back
subconscious-browser-bench controller
# Run WebArena benchmark
subconscious-browser-bench run webarena # all 100 tasks
subconscious-browser-bench run webarena --tasks 0,1,2 # specific tasks
subconscious-browser-bench run webarena --range 0-25 # index range
# Run BU Bench benchmark
subconscious-browser-bench run bu-bench # all 30 tasks
subconscious-browser-bench run bu-bench --range 0-10 # index range
# View results
subconscious-browser-bench results # list result files
subconscious-browser-bench results <filename> # show summary

Controller options:

| Flag | Description |
|---|---|
| `--port` | Controller port (default: 7391) |
| `--no-tunnel` | Disable automatic cloudflared tunnel |

Run options:

| Flag | Description |
|---|---|
| `--model` | Model name (default: tim-gpt-heavy) |
| `--timeout` | Max seconds per task (default: 900 for WebArena, 1800 for BU Bench) |
| `--append-to` | Append results to an existing JSON file |
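
How the run options listed above might combine can be sketched with `argparse`. This is an illustration under stated assumptions, not the tool's actual parser: the function names are hypothetical, and the README does not say whether the upper bound of `--range` is inclusive, so the sketch assumes a half-open range.

```python
import argparse

# Default per-task timeouts in seconds, as documented for --timeout.
DEFAULT_TIMEOUTS = {"webarena": 900, "bu-bench": 1800}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented `run` flags (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="subconscious-browser-bench run")
    parser.add_argument("benchmark", choices=sorted(DEFAULT_TIMEOUTS))
    parser.add_argument("--model", default="tim-gpt-heavy")
    parser.add_argument("--timeout", type=int, default=None)
    parser.add_argument("--append-to", dest="append_to", default=None)
    parser.add_argument("--tasks", default=None)  # e.g. "0,1,2"
    parser.add_argument("--range", dest="task_range", default=None)  # e.g. "0-25"
    return parser


def select_tasks(args: argparse.Namespace, n_tasks: int) -> list[int]:
    """Resolve --tasks / --range into a list of task indices."""
    if args.tasks:
        return [int(i) for i in args.tasks.split(",")]
    if args.task_range:
        start, stop = (int(x) for x in args.task_range.split("-"))
        return list(range(start, stop))  # ASSUMPTION: upper bound is exclusive
    return list(range(n_tasks))


def effective_timeout(args: argparse.Namespace) -> int:
    """Use --timeout when given, otherwise the per-benchmark default."""
    if args.timeout is not None:
        return args.timeout
    return DEFAULT_TIMEOUTS[args.benchmark]


args = build_parser().parse_args(["webarena", "--range", "0-25"])
print(effective_timeout(args), len(select_tasks(args, 100)))
```

Keeping the timeout default as `None` in the parser lets the per-benchmark default be resolved after the benchmark name is known.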
[CLI] ── run tasks ──> [Runner]
                          │
                 POST /v1/runs/stream
                          │
                 [Subconscious API]
                       (cloud)
                          │
                   tool calls via ────────┐
                 cloudflared tunnel       │
                                          │
[Controller :7391] <──────────────────────┘
        │  (FastAPI, runs locally)
        │
        └──> [BrowserBase cloud browser]
                          │
          [WebArena sites / Live internet]
The controller runs locally and is automatically exposed to the internet via a cloudflared quick tunnel. When the Subconscious cloud API needs to call a browser tool (navigate, click, type, etc.), it reaches the local controller through this tunnel. No manual URL configuration is needed.
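
The controller's role as the receiving end of tool calls can be pictured as a dispatch table. The sketch below is plain Python rather than the real FastAPI app: `FakeBrowser`, `handle_tool_call`, and the payload shape (`{"tool": ..., "args": ...}`) are all hypothetical and only illustrate the routing idea.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for a remote browser session; it only records actions."""

    url: str = "about:blank"
    log: list = field(default_factory=list)

    def navigate(self, url: str) -> str:
        self.url = url
        self.log.append(("navigate", url))
        return f"navigated to {url}"

    def click(self, selector: str) -> str:
        self.log.append(("click", selector))
        return f"clicked {selector}"

    def type_text(self, selector: str, text: str) -> str:
        self.log.append(("type", selector, text))
        return f"typed into {selector}"


def handle_tool_call(browser: FakeBrowser, call: dict) -> dict:
    """Route one tool call (assumed payload shape) to the matching browser action."""
    handlers = {
        "navigate": browser.navigate,
        "click": browser.click,
        "type": browser.type_text,
    }
    handler = handlers.get(call.get("tool"))
    if handler is None:
        return {"ok": False, "error": f"unknown tool: {call.get('tool')}"}
    return {"ok": True, "result": handler(**call.get("args", {}))}


browser = FakeBrowser()
reply = handle_tool_call(
    browser, {"tool": "navigate", "args": {"url": "http://localhost:7770"}}
)
print(reply)
```

In the real setup each such call would arrive over HTTP through the tunnel, and the handler would drive a BrowserBase session instead of an in-memory object.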
Full reasoning traces for all 130 tasks are available on Google Drive:
View Traces — Individual JSON files with complete Subconscious reasoning steps, tool calls, agent answers, and evaluation details.
Results are saved to results/ as JSON files. Each file contains:
- Run metadata — model, timestamps, pass rate, total/completed/error counts
- Per-task results — task intent, agent answer, score, duration
- Reasoning traces — full Subconscious reasoning steps for every task
- Evaluation details — WebArena: deterministic string/URL/HTML match; BU Bench: Gemini judge verdict with reasoning
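
A result file with the contents listed above could be summarized along these lines. The key names (`metadata`, `tasks`, `score`, `error`, `duration_s`) are assumptions made for illustration; the actual schema of the files in results/ may differ.

```python
import json

# Hypothetical result document; key names are illustrative, not the repo's schema.
example = {
    "metadata": {"model": "tim-gpt-heavy", "benchmark": "webarena"},
    "tasks": [
        {"task_id": 0, "intent": "example intent A", "answer": "42",
         "score": 1.0, "duration_s": 61.5, "error": None},
        {"task_id": 1, "intent": "example intent B", "answer": "7",
         "score": 0.0, "duration_s": 120.0, "error": None},
        {"task_id": 2, "intent": "example intent C", "answer": "",
         "score": 0.0, "duration_s": 900.0, "error": "timeout"},
    ],
}


def summarize(doc: dict) -> dict:
    """Compute total/completed/error counts and the pass rate for one run."""
    tasks = doc["tasks"]
    total = len(tasks)
    errors = sum(1 for t in tasks if t.get("error"))
    passed = sum(1 for t in tasks if not t.get("error") and t["score"] >= 1.0)
    rate = round(100 * passed / total, 1) if total else 0.0
    return {
        "total": total,
        "completed": total - errors,
        "errors": errors,
        "passed": passed,
        "pass_rate": rate,
    }


# Round-trip through JSON to mimic reading a saved file.
loaded = json.loads(json.dumps(example))
summary = summarize(loaded)
print(summary)
```

Errored tasks count against the pass rate here; whether the benchmark scores them the same way is not stated in this README.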
WebArena (Zhou et al., 2023) is a benchmark of realistic web tasks on self-hosted sandboxed websites. This repo includes a curated set of 100 tasks across 4 sites (Shopping, Shopping Admin, Reddit, GitLab) using only deterministic evaluators — no LLM-based evaluation.
Evaluation methods:
- `string_match`: compare the agent's final answer to reference strings
- `url_match`: check the browser's final URL
- `program_html`: navigate to target URLs and check DOM content via Playwright
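
To make the deterministic checks concrete, here is a minimal sketch of what `string_match` and `url_match` style evaluators could look like. The exact matching rules (case-insensitive substring matching, ignoring a trailing slash) are assumptions for illustration and may not match the evaluators shipped in this repo; `program_html` is omitted because it needs a live browser.

```python
from urllib.parse import urlsplit


def string_match(answer: str, must_include: list[str]) -> bool:
    """Pass if every reference string appears in the answer (case-insensitive)."""
    normalized = answer.casefold()
    return all(ref.casefold() in normalized for ref in must_include)


def url_match(final_url: str, reference_url: str) -> bool:
    """Pass if the final URL equals the reference, ignoring a trailing slash."""

    def key(url: str) -> tuple:
        parts = urlsplit(url)
        return (parts.scheme, parts.netloc, parts.path.rstrip("/"), parts.query)

    return key(final_url) == key(reference_url)


print(string_match("The total is 42 USD", ["42", "usd"]))
print(url_match("http://localhost:8023/issues/", "http://localhost:8023/issues"))
```

Both checks are pure functions of the agent's output, which is what makes repeated scoring of the same trace reproducible.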
BU Bench (Browser-Use Benchmark V1) is a benchmark of web tasks on the live internet. This repo includes 30 tasks from two categories:
- InteractionTests (19 tasks) — form filling, DOM manipulation, drag interactions, multi-step challenges
- WebBenchREAD (11 tasks) — reading and extracting information from live websites
Evaluation: a Gemini 2.5 Flash LLM judge receives the task intent, the agent's final answer, the reasoning trace, and a screenshot of the final browser state.
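
The judge inputs described above could be bundled and turned into a prompt roughly as follows. This sketch makes no API call: `JudgeInput`, `build_judge_prompt`, `parse_verdict`, and the JSON verdict format are hypothetical, and the screenshot is assumed to be sent to the model as a separate image part alongside the text.

```python
import json
from dataclasses import dataclass


@dataclass
class JudgeInput:
    """Hypothetical bundle of the four inputs the judge receives."""

    intent: str
    answer: str
    trace: list[str]
    screenshot_png: bytes  # assumed to be attached separately as an image


def build_judge_prompt(item: JudgeInput) -> str:
    """Assemble the text portion of the judge request."""
    steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(item.trace))
    return (
        "Decide whether the agent completed the task.\n"
        f"Task: {item.intent}\n"
        f"Agent answer: {item.answer}\n"
        f"Reasoning trace:\n{steps}\n"
        'Reply as JSON: {"passed": true|false, "reasoning": "..."}'
    )


def parse_verdict(raw: str) -> tuple[bool, str]:
    """Parse the assumed JSON verdict into (passed, reasoning)."""
    data = json.loads(raw)
    return bool(data["passed"]), str(data.get("reasoning", ""))


item = JudgeInput(
    intent="Find the page title",
    answer="Example Domain",
    trace=["navigate to the site", "read the title"],
    screenshot_png=b"",
)
prompt = build_judge_prompt(item)
verdict = parse_verdict('{"passed": true, "reasoning": "title matches"}')
print(verdict)
```

Requesting a structured verdict keeps the pass/fail decision machine-readable while preserving the judge's reasoning for the result file.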
| Service | Purpose | Required |
|---|---|---|
| Subconscious API | AI agent reasoning engine | Yes |
| BrowserBase | Cloud Chromium browsers with proxies + CAPTCHA solving | Yes |
| Google Gemini API | LLM judge for BU Bench scoring | BU Bench only |
| WebArena Docker images | Self-hosted sandboxed web apps | WebArena only |
| cloudflared | Tunnel for local controller | Install manually: brew install cloudflared |
Built by the Subconscious team:
- Wei Fang
- Hongyin Luo
- Owen Stepan
This benchmark builds on the work of:
- WebArena — Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig. "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024.
- Browser-Use — The BU Bench V1 task suite for evaluating browser automation agents on live internet tasks.
- BrowserBase — Cloud browser infrastructure with stealth mode, residential proxies, and CAPTCHA solving.
License: MIT