Merged
18 changes: 16 additions & 2 deletions README.md
@@ -33,7 +33,7 @@ The common unit is not "coding task" or "office task". The common unit is:
```text
task family + fixtures + allowed tools + expected artifact/state + scorer + run comparison
```

The public v0/v0.5 implementation includes a small starter suite and five hardened task-family patterns. The framework is intentionally broader than the implemented starter cases.
The public v0/v0.7 implementation includes a small starter suite, five hardened task-family patterns, lifecycle gates, and a public-safe research radar. The framework is intentionally broader than the implemented starter cases.

## Relationship to consumer applications

@@ -71,6 +71,20 @@ make hardening-check

See [Benchmark lifecycle](docs/16-benchmark-lifecycle.md), [Mutation and exploit gates](docs/17-mutation-and-exploit-gates.md), [Suite strategy](docs/18-suite-strategy.md), and [Report schema v1 guidance](docs/19-report-schema-v1.md).

## Research Radar

Research Radar keeps Agent Bench Lab aligned with external benchmark and eval methodology without turning the repo into a news feed.

It tracks benchmark mechanics: oracles, hidden splits, replay, trace policy, scoring contracts, exploitability, contamination, standards, and eval-framework changes.

```text
research/
```

Public `research/` files contain watchlists, source maps, queries, and daily/weekly templates only. Raw feeds, private notes, customer observations, private holdouts, and protected scorer details stay out of the public repo.

See [Research Radar](docs/20-research-radar.md) and [research/README.md](research/README.md).

## Current status

This repository is a **v0 public starter**. It contains:
@@ -81,7 +95,7 @@ This repository is a **v0 public starter**. It contains:
- minimal Python CLI scaffolding;
- sample public fixtures;
- sample scorers plus hardened IF-01, DATA-01, DOC-01, SUP-01, and API-01 artifact/state-based scorers;
- documentation for benchmark design, metrics, anti-overfitting, lifecycle status, and hardening gates.
- documentation for benchmark design, metrics, anti-overfitting, lifecycle status, hardening gates, and research radar process.

It intentionally does **not** contain private holdout tasks, production secrets, personal data, or benchmark answers for real evaluation runs.

102 changes: 102 additions & 0 deletions docs/20-research-radar.md
@@ -0,0 +1,102 @@
# Research Radar

Agent Bench Lab needs a research radar because benchmark methodology changes quickly. Static
research snapshots are useful, but they are not enough to keep a benchmark standard current.

The radar is a public-safe process for tracking benchmark mechanics, not a generic AI-news feed.

## What To Monitor

Monitor sources that can change Agent Bench Lab design:

- verified splits and benchmark repair;
- deterministic and audited scoring;
- state-diff and trace-policy oracles;
- replay and snapshot environments;
- private holdout and redacted-feedback methods;
- benchmark contamination and exploitability;
- prompt-injection and tool-output trust-boundary evals;
- cost, latency, pass^k, and repeatability reporting;
- eval-framework and standards updates.
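
The `pass^k` item above can be made concrete. The following is a minimal sketch, assuming the common definition in which `pass^k` is the probability that all `k` independent trials of a task succeed. Under that assumption, the per-task estimate from `n` trials with `c` successes is `C(c, k) / C(n, k)`, averaged over tasks. The function names are hypothetical and are not part of Agent Bench Lab.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials of one task all pass.

    Assumes the combinatorial estimator C(c, k) / C(n, k), where n is the
    number of trials that were run and c is the number that passed.
    """
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # math.comb returns 0 when k > num_successes, so the estimate is 0.0 there.
    return comb(num_successes, k) / comb(num_trials, k)


def suite_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (num_trials, num_successes) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)
```

A task that passes 2 of 4 trials scores 0.5 on plain pass rate but only 1/6 on `pass^2` under this estimator, which is why repeatability reporting is tracked separately from accuracy.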

## Cadence

| Loop | Timebox | Output |
|---|---:|---|
| Daily radar | 15 minutes | short triage brief |
| Weekly synthesis | 45 minutes | roadmap decision or explicit no-change |
| Monthly pruning | 30 minutes | watchlist cleanup |

Daily radar should answer: did anything important change?

Weekly synthesis should answer: should Agent Bench Lab change its roadmap, open issues, or run
follow-up research?

## Public And Private Boundary

Public-safe:

- watchlists;
- public source maps;
- query sets;
- templates;
- curated public summaries;
- decision logs.

Do not commit:

- raw feeds;
- dedupe caches;
- private eval material;
- hidden holdouts;
- answer keys;
- customer data;
- private rubrics;
- protected scorer configs;
- personal notes;
- consumer-application observations.
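
The boundary above could be enforced mechanically before a commit. The sketch below is one possible approach, not something this repository ships. Every glob pattern and the function name `find_boundary_violations` are hypothetical and would need to match the real private-material layout.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for material that must stay out of the public tree.
# These are illustrative assumptions, not paths defined by Agent Bench Lab.
FORBIDDEN_PATTERNS = (
    "research/raw/*",
    "research/cache/*",
    "research/private/*",
    "*holdout*",
    "*answer-key*",
)


def find_boundary_violations(paths: list[str]) -> list[str]:
    """Return the repo-relative paths that match any forbidden pattern.

    fnmatchcase is used so matching is case-sensitive on every platform;
    note that its `*` also matches `/`, so patterns cover nested files.
    """
    return [
        path
        for path in paths
        if any(fnmatchcase(path, pattern) for pattern in FORBIDDEN_PATTERNS)
    ]
```

A pre-commit hook or CI step could pass the list of staged paths to this function and fail when the result is non-empty. Pattern matching only catches files by name, so it complements review rather than replacing it.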

## Action Categories

Every item should end in one of:

- `ignore`
- `read later`
- `add to watchlist`
- `open issue`
- `update roadmap`
- `run follow-up research`
- `prototype after review`

Most items should be ignored or queued. The radar prevents stale decisions; it should not create
constant churn.
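
The rule that every item ends in exactly one action can be checked in a few lines. This is a minimal sketch: the action strings come from the list above, while the `validate_action` helper is a hypothetical name and not an existing part of the CLI.

```python
# The seven action categories listed in this document.
ALLOWED_ACTIONS = frozenset(
    {
        "ignore",
        "read later",
        "add to watchlist",
        "open issue",
        "update roadmap",
        "run follow-up research",
        "prototype after review",
    }
)


def validate_action(action: str) -> str:
    """Return the normalized action, or raise ValueError if it is not allowed."""
    normalized = action.strip().lower()
    if normalized not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown radar action: {action!r}")
    return normalized
```

Rejecting free-form values such as an unedited `ignore / read later / ...` template cell keeps daily briefs comparable from week to week.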

## When To Open An Issue

Open an issue when a source introduces:

- a benchmark-hardening method Agent Bench Lab should adopt;
- a new scorer/oracle pattern;
- a verified split or benchmark repair lesson;
- a private/public split pattern;
- a replay/snapshot method;
- a concrete exploit or contamination risk;
- a report schema or lifecycle convention worth standardizing.

Do not open issues for leaderboard movement or generic model news unless the evaluation method
changes.

## Files

Research Radar files live under:

```text
research/
```

Start with:

- `research/watchlist.md`
- `research/source-map.csv`
- `research/daily-brief-template.md`
- `research/weekly-synthesis-template.md`
7 changes: 4 additions & 3 deletions docs/README.md
@@ -25,6 +25,7 @@ Start here:
21. [Mutation and exploit gates](17-mutation-and-exploit-gates.md)
22. [Suite strategy](18-suite-strategy.md)
23. [Report schema v1 guidance](19-report-schema-v1.md)
24. [v0 roadmap](roadmap-v0.md)
25. [Public release checklist](public-release-checklist.md)
26. [Decision log template](decision-log-template.md)
24. [Research Radar](20-research-radar.md)
25. [v0 roadmap](roadmap-v0.md)
26. [Public release checklist](public-release-checklist.md)
27. [Decision log template](decision-log-template.md)
64 changes: 64 additions & 0 deletions research/README.md
@@ -0,0 +1,64 @@
# Research Radar

Research Radar is the public-safe benchmark intelligence layer for Agent Bench Lab.

It is not a generic AI-news feed. It tracks external work that can change benchmark design:

- scoring and oracle patterns;
- private/public split methods;
- replay and snapshot environments;
- trace policy and tool-use checks;
- benchmark hardening, exploitability, and contamination;
- cost, latency, pass^k, and repeatability methods;
- standards and eval-framework updates.

The goal is to turn external benchmark/eval signals into roadmap decisions without chasing hype.

## Cadence

| Loop | Timebox | Purpose |
|---|---:|---|
| Daily radar | 15 minutes | Collect and triage high-signal source changes |
| Weekly synthesis | 45 minutes | Decide whether the roadmap changes |
| Monthly pruning | 30 minutes | Remove noisy sources and add better ones |

Daily briefs are triage artifacts. Weekly synthesis is where roadmap decisions happen.

## Public-Safe Rule

Public `research/` files may contain:

- public watchlists;
- public source maps;
- public search queries;
- daily and weekly templates;
- curated public benchmark notes;
- public weekly summaries.

Do not commit:

- raw feeds or crawler caches;
- private eval material;
- private holdouts;
- hidden answer keys;
- protected scorer configs;
- customer data;
- personal notes;
- private roadmap doubts;
- consumer-application observations.

Private or noisy working notes belong outside the public repo.

## Action Categories

Every radar item should end in one action:

- `ignore`
- `read later`
- `add to watchlist`
- `open issue`
- `update roadmap`
- `run follow-up research`
- `prototype after review`

Most days should not change the roadmap.
61 changes: 61 additions & 0 deletions research/daily-brief-template.md
@@ -0,0 +1,61 @@
# Agent Bench Lab Daily Research Brief

Date:

Window:

## Executive Summary

-
-
-

## New Or Updated Papers

| Item | Source | Why it matters | Action |
|---|---|---|---|
| | | | ignore / read later / add to watchlist / open issue / update roadmap / run follow-up research / prototype after review |

## New Or Updated Repos

| Item | Source | Change | Action |
|---|---|---|---|
| | | | |

## Methodology Changes

- Scoring:
- Replay:
- Private split:
- Canary:
- Exploit:
- Trace policy:
- Reporting:

## Risks

- Contamination:
- Eval awareness:
- Leakage:
- Live-service drift:
- LLM judge overuse:

## Recommended Actions

| Action | Target | Reason | Owner |
|---|---|---|---|
| ignore / read later / add to watchlist / open issue / update roadmap / run follow-up research / prototype after review | | | |

## Watchlist Updates

- Add:
- Remove:
- Change cadence:

## Bottom Line

Should Agent Bench Lab change direction today?

Answer:

Reason:
19 changes: 19 additions & 0 deletions research/people.md
@@ -0,0 +1,19 @@
# People And Teams

Research Radar should not over-index on influencers. Monitor people and teams only when they publish
benchmarks, eval frameworks, harnesses, or hardening work.

| Name or team | Affiliation | Why watch | Projects | Cadence | Tags |
|---|---|---|---|---|---|
| SWE-bench maintainers | Princeton / broader open-source ecosystem | Verified splits and coding-agent benchmark repair | SWE-bench, SWE-bench Verified | weekly | code, verified |
| ServiceNow Frontier AI / BrowserGym team | ServiceNow Research | Browser and workplace-agent benchmark harnesses | BrowserGym, AgentLab, WorkArena | weekly | browser, workflow |
| WebArena maintainers | Academic open benchmark ecosystem | Browser workflow environments and verification updates | WebArena, WebArena-Verified | weekly | browser, verified |
| OSWorld / XLANG team | XLANG / academic collaborators | Desktop-agent environments and verified task repair | OSWorld, OSWorld-Verified | weekly | os, desktop |
| AppWorld authors | Stony Brook / AI2-linked research ecosystem | API/app simulation and state-diff benchmark design | AppWorld | weekly | api, state-diff |
| AgentDojo / ETH security authors | ETH Zurich security research | Prompt injection, tool-output trust, agent-safety evals | AgentDojo | weekly | security, injection |
| MCP benchmark authors | Public MCP/tool-use benchmark ecosystem | Tool registry, protocol, and tool-routing eval patterns | MCP-Bench, MCP-Atlas, MCP-Universe searches | weekly | mcp, tools |
| Deep research benchmark authors | Public research-agent benchmark ecosystem | Citation, source grounding, claim checking, long-horizon research | DeepResearch Bench, LiveDRBench, DRBench, PaperBench searches | weekly | research, citations |
| UK AISI Inspect AI team | UK AI Security Institute | Eval harness, scorer, and audit patterns | Inspect AI | weekly | framework, safety |
| NIST AI RMF and eval standards teams | NIST | Public risk/evaluation governance signals | AI RMF, standards updates | monthly | standards, governance |

Add individuals only when a named author repeatedly produces high-signal benchmark mechanics work.
16 changes: 16 additions & 0 deletions research/queries/arxiv.txt
@@ -0,0 +1,16 @@
"agent benchmark"
"tool-use benchmark"
"MCP benchmark"
"browser agent benchmark"
"computer-use agent"
"deep research benchmark"
"research agent benchmark"
"prompt injection benchmark" AND agent
"benchmark contamination" AND agent
"benchmark exploitation" AND agent
"reward hacking" AND ("agent" OR "tool use")
"private holdout" AND evaluation
"state-diff scoring" AND agent
"trace-policy scoring" AND agent
"LLM judge" AND benchmark AND reliability
"verified benchmark" AND agent
14 changes: 14 additions & 0 deletions research/queries/github.txt
@@ -0,0 +1,14 @@
agent benchmark language:Python
tool-use benchmark agent
MCP benchmark OR MCP-Bench OR MCP-Universe OR MCP-Atlas
browser agent benchmark OR computer-use agent
deep research benchmark OR research agent benchmark
prompt injection benchmark agent
agent safety benchmark tool use
benchmark contamination agent
benchmark exploit reward hacking
private holdout evaluation benchmark
state diff scorer agent benchmark
trace policy scorer tool calls
verified benchmark agent tasks
replay environment agent benchmark
14 changes: 14 additions & 0 deletions research/queries/scholar.txt
@@ -0,0 +1,14 @@
"agent benchmark" "tool use"
"browser agent benchmark"
"computer-use agent benchmark"
"deep research benchmark"
"research agent benchmark" citations
"prompt injection benchmark" "tool use"
"benchmark contamination" "large language models"
"benchmark exploitation" "large language models"
"reward hacking" "agent"
"private holdout" "evaluation"
"state-based evaluation" "agent"
"trace policy" "tool use"
"verified benchmark" "agent"
"LLM-as-a-judge" reliability benchmark
8 changes: 8 additions & 0 deletions research/reading-queue.md
@@ -0,0 +1,8 @@
# Reading Queue

Public-safe reading queue only. Do not add private notes, customer observations, hidden holdout
details, private scorer configs, or raw feeds.

| Priority | Title | Source | Type | Why read | Status | Owner | Next action |
|---|---|---|---|---|---|---|---|
| | | | paper / repo / standard / issue / release | | queued / reading / summarized / ignored | | |
25 changes: 25 additions & 0 deletions research/repos.md
@@ -0,0 +1,25 @@
# Repos And Projects

Watch modes:

- `releases`: stable updates only;
- `issues`: validation, exploit, and benchmark-repair discussion;
- `commits`: task/scorer methodology changes;
- `all activity`: every event, reserved for high-signal core sources.

| Project or repo | Why watch | Watch mode | Priority | Tags |
|---|---|---|---|---|
| SWE-bench / SWE-bench Verified | Coding benchmark repair, verified splits, hidden-test methodology | releases, issues | P0 | code, verified |
| Terminal-Bench | Terminal workflow benchmark and executable task design | releases, issues | P0 | terminal, execution |
| ServiceNow/BrowserGym | Browser-agent harness and replay patterns | releases, issues | P0 | browser, harness |
| ServiceNow/WorkArena | Workplace-style browser workflow tasks | releases, issues | P0 | browser, workflow |
| WebArena / WebArena-Verified | Browser task validation and environment issues | releases, issues | P0 | browser, verified |
| OSWorld / OSWorld-Verified | Desktop tasks, environment reproducibility, verified repairs | releases, issues | P0 | os, desktop |
| AppWorld | Local app/API simulation and state-diff evaluation | releases, issues | P0 | api, state-diff |
| AgentDojo | Prompt-injection and tool-output trust boundary tasks | releases, issues | P0 | security, injection |
| Inspect AI | Eval framework, scorer patterns, reporting mechanics | releases | P0 | framework, scoring |
| MCP benchmark repos | MCP/tool registry benchmark patterns | releases, issues | P0 | mcp, tools |
| Deep research benchmark repos | Source-grounded research, citation, and claim scoring | releases, issues | P0 | research, citations |
| Benchmark hardening paper repos | Exploit, contamination, and reward-hacking test harnesses | releases, issues | P1 | hardening, exploit |

Do not mirror repos or cache raw feeds in the public tree.
16 changes: 16 additions & 0 deletions research/source-map.csv
@@ -0,0 +1,16 @@
priority,source_type,name,why_monitor,channel_or_url,cadence,signal_quality,relevance,tags
P0,repo/site,SWE-bench ecosystem,"Verified splits, coding task curation, hidden-test methodology",https://www.swebench.com/,daily,high,high,"code;verified;hidden-tests"
P0,repo/site,Terminal-Bench,"Executable terminal tasks and shell workflow scoring",https://www.tbench.ai/,daily,high,high,"terminal;execution"
P0,repo,ServiceNow BrowserGym,"Browser-agent harness and replay-oriented task design",https://github.com/ServiceNow/BrowserGym,daily,high,high,"browser;harness;replay"
P0,repo,ServiceNow WorkArena,"Workplace/browser workflow benchmark patterns",https://github.com/ServiceNow/WorkArena,daily,high,high,"browser;workflow"
P0,site,WebArena / WebArena-Verified,"Browser workflow tasks, verification updates, environment lessons",https://webarena.dev/,daily,high,high,"browser;verified"
P0,repo/site,OSWorld / OSWorld-Verified,"OS and desktop-agent tasks, replay and verification patterns",https://os-world.github.io/,daily,high,high,"os;desktop;verified"
P0,repo,AppWorld,"API/app simulation and state-diff scoring patterns",https://github.com/stonybrooknlp/appworld,daily,high,high,"api;state-diff;tools"
P0,search cluster,MCP benchmark cluster,"MCP and tool-registry benchmark methods","GitHub/arXiv search: MCP benchmark, MCP-Bench, MCP-Universe",daily,medium,high,"mcp;tool-use"
P0,search cluster,Deep Research benchmark cluster,"Fixed-corpus research, citation scoring, claim checks","GitHub/arXiv search: DeepResearch Bench, LiveDRBench, DRBench, PaperBench",daily,medium,high,"research;citations"
P0,repo,AgentDojo,"Prompt-injection and tool-output trust-boundary benchmark design",https://github.com/ethz-spylab/agentdojo,daily,high,high,"security;injection"
P0,standards,NIST AI Risk Management Framework,"Evaluation governance and risk framing","NIST AI RMF official page",weekly,high,medium,"standards;governance"
P0,framework,Inspect AI,"Public eval harness and scoring framework patterns",https://inspect.aisi.org.uk/,weekly,high,high,"framework;scoring"
P0,search cluster,Benchmark hardening papers,"Contamination, exploitability, reward hacking, eval awareness","arXiv/Semantic Scholar query set",daily,medium,high,"hardening;exploit;contamination"
P1,conference,Agent/eval/safety workshops,"New accepted papers and benchmark tracks","NeurIPS, ICML, ICLR, ACL workshop pages",weekly,medium,medium,"conference;workshop"
P1,framework,Evals and harness releases,"Report, trace, scoring, and lifecycle ideas","GitHub release watchlist",weekly,medium,medium,"framework;reports"