subconscious-browser-bench

Browser automation benchmarks for Subconscious AI agents.

Runs 100 WebArena tasks (sandboxed web apps) and 30 BU Bench tasks (live internet) against the Subconscious API, producing JSON results with full reasoning traces.


Results

All results use the tim-gpt-heavy engine with BrowserBase cloud browsers.

WebArena — 69% pass rate

100 tasks across 4 sandboxed web applications. Evaluation is fully deterministic (string match, URL match, DOM content checks — no LLM judge).

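| Site | Tasks | Passed | Failed | Pass Rate | |
| --- | --- | --- | --- | --- | --- |
| GitLab | 25 | 20 | 5 | 80.0% | ████████░░ |
| Reddit | 25 | 20 | 5 | 80.0% | ████████░░ |
| Shopping Admin | 25 | 17 | 8 | 68.0% | ██████▊░░░ |
| Shopping | 25 | 12 | 13 | 48.0% | ████▊░░░░░ |
| Total | 100 | 69 | 31 | 69.0% | ██████▉░░░ |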

BU Bench — 70% pass rate

30 tasks on the live internet. Evaluation uses Gemini 2.5 Flash as an LLM judge with screenshots.

| Category | Tasks | Passed | Failed | Pass Rate | |
| --- | --- | --- | --- | --- | --- |
| WebBenchREAD | 11 | 9 | 2 | 81.8% | ████████▏░ |
| InteractionTests | 19 | 12 | 7 | 63.2% | ██████▍░░░ |
| Total | 30 | 21 | 9 | 70.0% | ███████░░░ |

Note: BU Bench tasks run on live websites. Results may vary slightly between runs due to changing page content, network conditions, and anti-bot measures.

Combined

| Benchmark | Tasks | Pass Rate | Evaluation |
| --- | --- | --- | --- |
| WebArena | 100 | 69.0% | Deterministic (string/URL/HTML match) |
| BU Bench | 30 | 70.0% | Gemini LLM judge + screenshots |
| Overall | 130 | 69.2% | Mixed |
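
The overall figure is pooled over all 130 tasks rather than averaged across the two benchmark rates. The short sketch below reproduces that arithmetic from the passed/total counts reported in the tables above; the helper names `pass_rate` and `overall` are illustrative and are not part of this repo.

```python
# (passed, total) counts copied from the WebArena and BU Bench result tables.
RESULTS = {
    "WebArena": (69, 100),
    "BU Bench": (21, 30),
}


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


def overall(results: dict[str, tuple[int, int]]) -> tuple[int, int, float]:
    """Pool passed and total counts across benchmarks, then compute one rate."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    return passed, total, pass_rate(passed, total)


if __name__ == "__main__":
    for name, (p, t) in RESULTS.items():
        print(f"{name}: {p}/{t} = {pass_rate(p, t)}%")
    p, t, rate = overall(RESULTS)
    print(f"Overall: {p}/{t} = {rate}%")
```

Pooling gives 90/130, which is 69.2%. A plain average of 69.0% and 70.0% would give 69.5%, which overweights the smaller BU Bench set.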

Prerequisites

Setup

pip install -e .
playwright install chromium
cp .env.example .env   # fill in your keys

Usage

# Preflight check — validate all API keys, connectivity, and tunnel
subconscious-browser-bench check

# Start the browser controller (separate terminal)
# Automatically creates a cloudflared tunnel for the Subconscious API to call back
subconscious-browser-bench controller

# Run WebArena benchmark
subconscious-browser-bench run webarena                  # all 100 tasks
subconscious-browser-bench run webarena --tasks 0,1,2    # specific tasks
subconscious-browser-bench run webarena --range 0-25     # index range

# Run BU Bench benchmark
subconscious-browser-bench run bu-bench                  # all 30 tasks
subconscious-browser-bench run bu-bench --range 0-10     # index range

# View results
subconscious-browser-bench results                       # list result files
subconscious-browser-bench results <filename>            # show summary

Controller Options

| Flag | Description |
| --- | --- |
| --port | Controller port (default: 7391) |
| --no-tunnel | Disable automatic cloudflared tunnel |

Run Options

| Flag | Description |
| --- | --- |
| --model | Model name (default: tim-gpt-heavy) |
| --timeout | Max seconds per task (default: 900 for WebArena, 1800 for BU Bench) |
| --append-to | Append results to an existing JSON file |
--append-to Append results to an existing JSON file

Architecture

[CLI] ── run tasks ──> [Runner]
                          │
                  POST /v1/runs/stream
                          │
                 [Subconscious API]
                   (cloud)
                          │
                  tool calls via ──────────┐
                  cloudflared tunnel       │
                                           │
[Controller :7391]  <──────────────────────┘
     │ (FastAPI, runs locally)
     │
     └──> [BrowserBase cloud browser]
                  │
          [WebArena sites / Live internet]

The controller runs locally and is automatically exposed to the internet via a cloudflared quick tunnel. When the Subconscious cloud API needs to call browser tools (navigate, click, type, etc.), it reaches the local controller through this tunnel. No manual URL configuration is needed.
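
To make the tool-call step concrete, here is a minimal plain-Python sketch of how an incoming tool call could be routed to a browser action. It is not the repo's controller, which is a FastAPI app whose request format is not shown here. The payload shape (`{"tool": ..., "args": {...}}`), the `FakeBrowser` class, and the `dispatch` function are all assumptions made for illustration; only the tool names navigate, click, and type come from the description above.

```python
from typing import Callable


class FakeBrowser:
    """Hypothetical stand-in for a browser session; it records actions only."""

    def __init__(self) -> None:
        self.url = "about:blank"
        self.log: list[str] = []

    def navigate(self, url: str) -> str:
        self.url = url
        self.log.append(f"navigate {url}")
        return f"navigated to {url}"

    def click(self, selector: str) -> str:
        self.log.append(f"click {selector}")
        return f"clicked {selector}"

    def type_text(self, selector: str, text: str) -> str:
        self.log.append(f"type {selector} {text!r}")
        return f"typed into {selector}"


def dispatch(browser: FakeBrowser, call: dict) -> dict:
    """Route one tool call (assumed shape: {'tool': str, 'args': dict})."""
    handlers: dict[str, Callable[..., str]] = {
        "navigate": browser.navigate,
        "click": browser.click,
        "type": browser.type_text,
    }
    tool = call.get("tool")
    handler = handlers.get(tool)
    if handler is None:
        # Unknown tools are reported back instead of raising.
        return {"ok": False, "error": f"unknown tool: {tool}"}
    return {"ok": True, "result": handler(**call.get("args", {}))}


if __name__ == "__main__":
    browser = FakeBrowser()
    print(dispatch(browser, {"tool": "navigate", "args": {"url": "http://example.com"}}))
    print(dispatch(browser, {"tool": "scroll", "args": {}}))
```

In a real deployment the handlers would drive the BrowserBase session, and the dispatch function would sit behind the HTTP endpoint that the tunnel exposes.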

Reasoning Traces

Full reasoning traces for all 130 tasks are available on Google Drive:

View Traces — Individual JSON files with complete Subconscious reasoning steps, tool calls, agent answers, and evaluation details.

Result Format

Results are saved to results/ as JSON files. Each file contains:

  • Run metadata — model, timestamps, pass rate, total/completed/error counts
  • Per-task results — task intent, agent answer, score, duration
  • Reasoning traces — full Subconscious reasoning steps for every task
  • Evaluation details — WebArena: deterministic string/URL/HTML match; BU Bench: Gemini judge verdict with reasoning
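
As a rough illustration of working with a result file, the sketch below tallies passes per site from an already-loaded result object. The key names used here (`tasks`, `site`, `score`) and the convention that a score of 1.0 means a pass are assumptions; check an actual file in results/ for the real schema before relying on them.

```python
import json
from collections import defaultdict


def summarize(run: dict) -> dict[str, dict[str, float]]:
    """Group per-task results by site and compute a pass rate for each group.

    Assumes a hypothetical schema: run["tasks"] is a list of dicts that each
    carry a "site" string and a numeric "score" where 1.0 means pass.
    """
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [passed, total]
    for task in run["tasks"]:
        bucket = counts[task["site"]]
        bucket[1] += 1
        if task["score"] >= 1.0:
            bucket[0] += 1
    return {
        site: {
            "passed": passed,
            "total": total,
            "pass_rate": round(100.0 * passed / total, 1),
        }
        for site, (passed, total) in counts.items()
    }


if __name__ == "__main__":
    # Inline sample in the assumed shape; a real run would use
    # json.load(open("results/<filename>")) instead.
    sample = json.loads(
        '{"tasks": [{"site": "gitlab", "score": 1.0},'
        ' {"site": "gitlab", "score": 0.0},'
        ' {"site": "reddit", "score": 1.0}]}'
    )
    print(summarize(sample))
```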

Benchmarks

WebArena

WebArena (Zhou et al., 2023) is a benchmark of realistic web tasks on self-hosted sandboxed websites. This repo includes a curated set of 100 tasks across 4 sites (Shopping, Shopping Admin, Reddit, GitLab) using only deterministic evaluators — no LLM-based evaluation.

Evaluation methods:

  • string_match — compare agent's final answer to reference strings
  • url_match — check the browser's final URL
  • program_html — navigate to target URLs and check DOM content via Playwright
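
The first two methods can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the evaluator code used in this repo: the exact matching rules are not specified here, so the sketch assumes that `string_match` passes when every reference string appears in the answer (case-insensitive) and that `url_match` compares scheme, host, and path while ignoring a trailing slash. The third method, program_html, needs a live browser and is left out.

```python
from urllib.parse import urlsplit


def string_match(answer: str, references: list[str]) -> bool:
    """Pass if every reference string occurs in the answer, ignoring case."""
    normalized = answer.casefold()
    return all(ref.casefold() in normalized for ref in references)


def url_match(final_url: str, expected_url: str) -> bool:
    """Pass if scheme, host, and path agree, ignoring a trailing slash."""
    actual, expected = urlsplit(final_url), urlsplit(expected_url)
    return (actual.scheme, actual.netloc, actual.path.rstrip("/")) == (
        expected.scheme,
        expected.netloc,
        expected.path.rstrip("/"),
    )


if __name__ == "__main__":
    print(string_match("The total is 42 USD", ["42", "usd"]))
    print(url_match("http://localhost:7780/admin/", "http://localhost:7780/admin"))
```

Both checks are pure functions of the agent's output and the task's reference data, which is what makes the WebArena scores reproducible without an LLM judge.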

BU Bench

BU Bench (Browser-Use Benchmark V1) is a benchmark of web tasks on the live internet. This repo includes 30 tasks from two categories:

  • InteractionTests (19 tasks) — form filling, DOM manipulation, drag interactions, multi-step challenges
  • WebBenchREAD (11 tasks) — reading and extracting information from live websites

Evaluation: Gemini 2.5 Flash LLM judge receives the task intent, agent's final answer, reasoning trace, and a screenshot of the final browser state.

External Dependencies

| Service | Purpose | Required |
| --- | --- | --- |
| Subconscious API | AI agent reasoning engine | Yes |
| BrowserBase | Cloud Chromium browsers with proxies + CAPTCHA solving | Yes |
| Google Gemini API | LLM judge for BU Bench scoring | BU Bench only |
| WebArena Docker images | Self-hosted sandboxed web apps | WebArena only |
| cloudflared | Tunnel for local controller | Install manually: brew install cloudflared |

Credits

Built by the Subconscious team:

  • Wei Fang
  • Hongyin Luo
  • Owen Stepan

Acknowledgments

This benchmark builds on the work of:

  • WebArena — Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig. "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024.
  • Browser-Use — The BU Bench V1 task suite for evaluating browser automation agents on live internet tasks.
  • BrowserBase — Cloud browser infrastructure with stealth mode, residential proxies, and CAPTCHA solving.

License

MIT
