subconscious-browser-bench

Browser automation benchmarks for Subconscious AI agents.

Runs 100 WebArena tasks (sandboxed web apps) and 30 BU Bench tasks (live internet) against the Subconscious API, producing JSON results with full reasoning traces.

Results

All results use the tim-gpt-heavy engine with BrowserBase cloud browsers.

WebArena — 69% pass rate

100 tasks across 4 sandboxed web applications. Evaluation is fully deterministic (string match, URL match, DOM content checks — no LLM judge).

Site	Tasks	Passed	Failed	Pass Rate
GitLab	25	20	5	80.0% ████████░░
Reddit	25	20	5	80.0% ████████░░
Shopping Admin	25	17	8	68.0% ██████▊░░░
Shopping	25	12	13	48.0% ████▊░░░░░
Total	100	69	31	69.0% ██████▉░░░

BU Bench — 70% pass rate

30 tasks on the live internet. Evaluation uses Gemini 2.5 Flash as an LLM judge with screenshots.

Category	Tasks	Passed	Failed	Pass Rate
WebBenchREAD	11	9	2	81.8% ████████▏░
InteractionTests	19	12	7	63.2% ██████▍░░░
Total	30	21	9	70.0% ███████░░░

Note: BU Bench tasks run on live websites. Results may vary slightly between runs due to changing page content, network conditions, and anti-bot measures.

Combined

Benchmark	Tasks	Pass Rate	Evaluation
WebArena	100	69.0%	Deterministic (string/URL/HTML match)
BU Bench	30	70.0%	Gemini LLM judge + screenshots
Overall	130	69.2%	Mixed
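
The overall figure is weighted by task count rather than being the mean of the two percentages (an unweighted mean of 69.0% and 70.0% would be 69.5%). A quick check in Python, using only the counts from the tables above; the helper name `pass_rate` is just for illustration:

```python
# (passed, total) counts taken from the WebArena and BU Bench tables above.
results = {
    "WebArena": (69, 100),
    "BU Bench": (21, 30),
}


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


total_passed = sum(passed for passed, _ in results.values())
total_tasks = sum(total for _, total in results.values())
overall = pass_rate(total_passed, total_tasks)

print(f"Overall: {total_passed}/{total_tasks} = {overall}%")
# prints: Overall: 90/130 = 69.2%
```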

Prerequisites

Python 3.10+
cloudflared (brew install cloudflared) — auto-tunnels the local controller so the Subconscious cloud API can reach it
A Subconscious API key
A BrowserBase API key + project ID
A Google AI API key (for BU Bench Gemini judge)
For WebArena: self-hosted WebArena sites via WebArena Docker images

Setup

pip install -e .
playwright install chromium
cp .env.example .env   # fill in your keys
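
Before running anything, it can help to confirm that every key has actually been filled in. The sketch below is not the repo's `check` command; it is a minimal stand-alone illustration, and the variable names are hypothetical (the real ones are listed in `.env.example`). It also assumes the values have already been loaded into the process environment.

```python
import os
from typing import Mapping

# Hypothetical variable names; consult .env.example for the real ones.
REQUIRED_VARS = [
    "SUBCONSCIOUS_API_KEY",
    "BROWSERBASE_API_KEY",
    "BROWSERBASE_PROJECT_ID",
]
# Only needed when running BU Bench (Gemini judge).
BU_BENCH_VARS = ["GOOGLE_API_KEY"]


def missing_vars(env: Mapping[str, str], include_bu_bench: bool = True) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    required = list(REQUIRED_VARS)
    if include_bu_bench:
        required += BU_BENCH_VARS
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    print("missing:", ", ".join(missing) if missing else "none")
```

Passing the environment in as a mapping keeps the function easy to test with a plain dictionary.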

Usage

# Preflight check — validate all API keys, connectivity, and tunnel
subconscious-browser-bench check

# Start the browser controller (separate terminal)
# Automatically creates a cloudflared tunnel for the Subconscious API to call back
subconscious-browser-bench controller

# Run WebArena benchmark
subconscious-browser-bench run webarena                  # all 100 tasks
subconscious-browser-bench run webarena --tasks 0,1,2    # specific tasks
subconscious-browser-bench run webarena --range 0-25     # index range

# Run BU Bench benchmark
subconscious-browser-bench run bu-bench                  # all 30 tasks
subconscious-browser-bench run bu-bench --range 0-10     # index range

# View results
subconscious-browser-bench results                       # list result files
subconscious-browser-bench results <filename>            # show summary

Controller Options

Flag	Description
`--port`	Controller port (default: `7391`)
`--no-tunnel`	Disable automatic cloudflared tunnel

Run Options

Flag	Description
`--model`	Model name (default: `tim-gpt-heavy`)
`--timeout`	Max seconds per task (default: 900 for WebArena, 1800 for BU Bench)
`--append-to`	Append results to an existing JSON file
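
When scripting runs from Python, the flags above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of this repo; it only builds the argument list from the documented flags and does not execute anything. Whether `--tasks` is also accepted for `bu-bench` is an assumption, since the usage examples only show it with `webarena`.

```python
import shlex
from typing import Optional

BENCHMARKS = ("webarena", "bu-bench")


def build_run_command(
    benchmark: str,
    tasks: Optional[list[int]] = None,
    task_range: Optional[str] = None,
    model: Optional[str] = None,
    timeout: Optional[int] = None,
    append_to: Optional[str] = None,
) -> list[str]:
    """Build the argv for one `run` invocation from the documented flags."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark!r}")
    cmd = ["subconscious-browser-bench", "run", benchmark]
    if tasks:
        cmd += ["--tasks", ",".join(str(t) for t in tasks)]
    if task_range:
        cmd += ["--range", task_range]
    if model:
        cmd += ["--model", model]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if append_to:
        cmd += ["--append-to", append_to]
    return cmd


cmd = build_run_command("webarena", task_range="0-25", timeout=600)
print(shlex.join(cmd))
# prints: subconscious-browser-bench run webarena --range 0-25 --timeout 600
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the controller is up.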

Architecture

[CLI] ── run tasks ──> [Runner]
                          │
                  POST /v1/runs/stream
                          │
                 [Subconscious API]
                   (cloud)
                          │
                  tool calls via ──────────┐
                  cloudflared tunnel       │
                                           │
[Controller :7391]  <──────────────────────┘
     │ (FastAPI, runs locally)
     │
     └──> [BrowserBase cloud browser]
                  │
          [WebArena sites / Live internet]

The controller runs locally and is automatically exposed to the internet through a cloudflared quick tunnel. When the Subconscious cloud API needs to call a browser tool (navigate, click, type, etc.), it reaches the local controller through this tunnel, so no manual URL configuration is needed.
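
To make the controller's role concrete, here is a minimal, dependency-free sketch of the dispatch step: a tool call arrives as a name plus arguments and is routed to a browser action. Everything in it is illustrative. `FakeBrowser`, `handle_tool_call`, and the argument names are hypothetical, and the real controller is a FastAPI service driving a BrowserBase session rather than an in-memory stub.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeBrowser:
    """Stand-in for a real browser session; it only records what was requested."""

    url: str = "about:blank"
    log: list[str] = field(default_factory=list)

    def navigate(self, url: str) -> dict[str, Any]:
        self.url = url
        self.log.append(f"navigate {url}")
        return {"ok": True, "url": self.url}

    def click(self, selector: str) -> dict[str, Any]:
        self.log.append(f"click {selector}")
        return {"ok": True}

    def type_text(self, selector: str, text: str) -> dict[str, Any]:
        self.log.append(f"type {selector} {text!r}")
        return {"ok": True}


def handle_tool_call(browser: FakeBrowser, name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """Route one tool call (hypothetical shape) to the matching browser action."""
    handlers = {
        "navigate": browser.navigate,
        "click": browser.click,
        "type": browser.type_text,
    }
    handler = handlers.get(name)
    if handler is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    return handler(**arguments)


browser = FakeBrowser()
first = handle_tool_call(browser, "navigate", {"url": "http://localhost:7770/"})
second = handle_tool_call(browser, "type", {"selector": "#search", "text": "shoes"})
unknown = handle_tool_call(browser, "scroll", {})
print(first, second, unknown, browser.log, sep="\n")
```

Returning a structured error for unknown tools, instead of raising, lets the calling agent see the failure and try a different action.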

Reasoning Traces

Full reasoning traces for all 130 tasks are available on Google Drive as individual JSON files containing the complete Subconscious reasoning steps, tool calls, agent answers, and evaluation details.

Result Format

Results are saved to results/ as JSON files. Each file contains:

Run metadata — model, timestamps, pass rate, total/completed/error counts
Per-task results — task intent, agent answer, score, duration
Reasoning traces — full Subconscious reasoning steps for every task
Evaluation details — WebArena: deterministic string/URL/HTML match; BU Bench: Gemini judge verdict with reasoning
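
A result file can be summarized with a few lines of Python. The sketch below assumes a schema, since the exact field names are not documented here: `model`, `tasks`, `score`, and `duration` are hypothetical keys, and the sample data is made up purely to exercise the code. Adjust the keys to match an actual file in results/.

```python
import json

# Made-up sample using hypothetical field names; not real benchmark output.
SAMPLE = """
{
  "model": "tim-gpt-heavy",
  "tasks": [
    {"task_id": 0, "intent": "example task A", "answer": "42", "score": 1.0, "duration": 12.5},
    {"task_id": 1, "intent": "example task B", "answer": "",   "score": 0.0, "duration": 30.0},
    {"task_id": 2, "intent": "example task C", "answer": "ok", "score": 1.0, "duration": 8.0}
  ]
}
"""


def summarize(run: dict) -> dict:
    """Compute pass count, pass rate, and mean duration from per-task results."""
    tasks = run["tasks"]
    total = len(tasks)
    passed = sum(1 for t in tasks if t["score"] >= 1.0)
    return {
        "model": run["model"],
        "total": total,
        "passed": passed,
        "pass_rate": round(100 * passed / total, 1) if total else 0.0,
        "mean_duration": round(sum(t["duration"] for t in tasks) / total, 1) if total else 0.0,
    }


summary = summarize(json.loads(SAMPLE))
print(summary)
```

To run it against a real file, replace `json.loads(SAMPLE)` with `json.load(open(path))` for a path under results/.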

Benchmarks

WebArena

WebArena (Zhou et al., 2023) is a benchmark of realistic web tasks on self-hosted sandboxed websites. This repo includes a curated set of 100 tasks across 4 sites (Shopping, Shopping Admin, Reddit, GitLab) using only deterministic evaluators — no LLM-based evaluation.

Evaluation methods:

string_match — compare agent's final answer to reference strings
url_match — check the browser's final URL
program_html — navigate to target URLs and check DOM content via Playwright
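
The first two evaluators can be sketched without a browser. This is a simplified illustration, not the repo's evaluator code: the normalization rules (case folding, whitespace collapsing, ignoring a trailing slash) and the function names are assumptions, and `program_html` is omitted because it needs a live Playwright page.

```python
from urllib.parse import urlsplit


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.strip().lower().split())


def string_match_exact(answer: str, reference: str) -> bool:
    """Pass only if the normalized answer equals the normalized reference."""
    return normalize(answer) == normalize(reference)


def string_match_must_include(answer: str, required: list[str]) -> bool:
    """Pass only if every required phrase appears in the normalized answer."""
    haystack = normalize(answer)
    return all(normalize(phrase) in haystack for phrase in required)


def url_match(final_url: str, reference_url: str) -> bool:
    """Compare scheme, host, path (ignoring a trailing slash), and query string."""
    a, b = urlsplit(final_url), urlsplit(reference_url)
    return (a.scheme, a.netloc.lower(), a.path.rstrip("/"), a.query) == (
        b.scheme,
        b.netloc.lower(),
        b.path.rstrip("/"),
        b.query,
    )


print(string_match_exact("  Hello   World ", "hello world"))
print(string_match_must_include("The total is $25.00 for 3 items", ["$25.00", "3 items"]))
print(url_match("http://localhost:8023/explore/", "http://LOCALHOST:8023/explore"))
```

Because each check is a pure function of the answer or URL, the same task always receives the same score for the same agent output.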

BU Bench

BU Bench (Browser-Use Benchmark V1) is a benchmark of web tasks on the live internet. This repo includes 30 tasks from two categories:

InteractionTests (19 tasks) — form filling, DOM manipulation, drag interactions, multi-step challenges
WebBenchREAD (11 tasks) — reading and extracting information from live websites

Evaluation: Gemini 2.5 Flash LLM judge receives the task intent, agent's final answer, reasoning trace, and a screenshot of the final browser state.

External Dependencies

Service	Purpose	Required
Subconscious API	AI agent reasoning engine	Yes
BrowserBase	Cloud Chromium browsers with proxies + CAPTCHA solving	Yes
Google Gemini API	LLM judge for BU Bench scoring	BU Bench only
WebArena Docker images	Self-hosted sandboxed web apps	WebArena only
cloudflared	Tunnel for local controller	Yes (install manually: `brew install cloudflared`)

Credits

Built by the Subconscious team:

Wei Fang
Hongyin Luo
Owen Stepan

Acknowledgments

This benchmark builds on the work of:

WebArena — Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig. "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024.
Browser-Use — The BU Bench V1 task suite for evaluating browser automation agents on live internet tasks.
BrowserBase — Cloud browser infrastructure with stealth mode, residential proxies, and CAPTCHA solving.

License

MIT
