diff --git a/README.md b/README.md
index e46cf79..6999e51 100644
--- a/README.md
+++ b/README.md
@@ -33,7 +33,7 @@ The common unit is not "coding task" or "office task". The common unit is:
 task family + fixtures + allowed tools + expected artifact/state + scorer + run comparison
 ```
 
-The public v0/v0.5 implementation includes a small starter suite and five hardened task-family patterns. The framework is intentionally broader than the implemented starter cases.
+The public v0/v0.7 implementation includes a small starter suite, five hardened task-family patterns, lifecycle gates, and a public-safe research radar. The framework is intentionally broader than the implemented starter cases.
 
 ## Relationship to consumer applications
 
@@ -71,6 +71,20 @@ make hardening-check
 
 See [Benchmark lifecycle](docs/16-benchmark-lifecycle.md), [Mutation and exploit gates](docs/17-mutation-and-exploit-gates.md), [Suite strategy](docs/18-suite-strategy.md), and [Report schema v1 guidance](docs/19-report-schema-v1.md).
 
+## Research Radar
+
+Research Radar keeps Agent Bench Lab aligned with external benchmark and eval methodology without turning the repo into a news feed.
+
+It tracks benchmark mechanics: oracles, hidden splits, replay, trace policy, scoring contracts, exploitability, contamination, standards, and eval-framework changes.
+
+```text
+research/
+```
+
+Public `research/` files contain watchlists, source maps, queries, and daily/weekly templates only. Raw feeds, private notes, customer observations, private holdouts, and protected scorer details stay out of the public repo.
+
+See [Research Radar](docs/20-research-radar.md) and [research/README.md](research/README.md).
+
 ## Current status
 
 This repository is a **v0 public starter**. It contains:
@@ -81,7 +95,7 @@ This repository is a **v0 public starter**. It contains:
 - minimal Python CLI scaffolding;
 - sample public fixtures;
 - sample scorers plus hardened IF-01, DATA-01, DOC-01, SUP-01, and API-01 artifact/state-based scorers;
-- documentation for benchmark design, metrics, anti-overfitting, lifecycle status, and hardening gates.
+- documentation for benchmark design, metrics, anti-overfitting, lifecycle status, hardening gates, and research radar process.
 
 It intentionally does **not** contain private holdout tasks, production secrets, personal data, or benchmark answers for real evaluation runs.
 
diff --git a/docs/20-research-radar.md b/docs/20-research-radar.md
new file mode 100644
index 0000000..661a1ff
--- /dev/null
+++ b/docs/20-research-radar.md
@@ -0,0 +1,102 @@
+# Research Radar
+
+Agent Bench Lab needs a research radar because benchmark methodology changes quickly. Static
+research snapshots are useful, but they are not enough to keep a benchmark standard current.
+
+The radar is a public-safe process for tracking benchmark mechanics, not a generic AI-news feed.
+
+## What To Monitor
+
+Monitor sources that can change Agent Bench Lab design:
+
+- verified splits and benchmark repair;
+- deterministic and audited scoring;
+- state-diff and trace-policy oracles;
+- replay and snapshot environments;
+- private holdout and redacted-feedback methods;
+- benchmark contamination and exploitability;
+- prompt-injection and tool-output trust-boundary evals;
+- cost, latency, pass^k, and repeatability reporting;
+- eval-framework and standards updates.
+
+## Cadence
+
+| Loop | Timebox | Output |
+|---|---:|---|
+| Daily radar | 15 minutes | short triage brief |
+| Weekly synthesis | 45 minutes | roadmap decision or explicit no-change |
+| Monthly pruning | 30 minutes | watchlist cleanup |
+
+Daily radar should answer: did anything important change?
+
+Weekly synthesis should answer: should Agent Bench Lab change roadmap, open issues, or run
+follow-up research?
+
+## Public And Private Boundary
+
+Public-safe:
+
+- watchlists;
+- public source maps;
+- query sets;
+- templates;
+- curated public summaries;
+- decision logs.
+
+Do not commit:
+
+- raw feeds;
+- dedupe caches;
+- private eval material;
+- hidden holdouts;
+- answer keys;
+- customer data;
+- private rubrics;
+- protected scorer configs;
+- personal notes;
+- consumer-application observations.
+
+## Action Categories
+
+Every item should end in one of:
+
+- `ignore`
+- `read later`
+- `add to watchlist`
+- `open issue`
+- `update roadmap`
+- `run follow-up research`
+- `prototype after review`
+
+Most items should be ignored or queued. The radar prevents stale decisions; it should not create
+constant churn.
+
+## When To Open An Issue
+
+Open an issue when a source introduces:
+
+- a benchmark-hardening method Agent Bench Lab should adopt;
+- a new scorer/oracle pattern;
+- a verified split or benchmark repair lesson;
+- a private/public split pattern;
+- a replay/snapshot method;
+- a concrete exploit or contamination risk;
+- a report schema or lifecycle convention worth standardizing.
+
+Do not open issues for leaderboard movement or generic model news unless the evaluation method
+changes.
+
+## Files
+
+Research Radar files live under:
+
+```text
+research/
+```
+
+Start with:
+
+- `research/watchlist.md`
+- `research/source-map.csv`
+- `research/daily-brief-template.md`
+- `research/weekly-synthesis-template.md`
diff --git a/docs/README.md b/docs/README.md
index d8c06cb..473b8de 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -25,6 +25,7 @@ Start here:
 21. [Mutation and exploit gates](17-mutation-and-exploit-gates.md)
 22. [Suite strategy](18-suite-strategy.md)
 23. [Report schema v1 guidance](19-report-schema-v1.md)
-24. [v0 roadmap](roadmap-v0.md)
-25. [Public release checklist](public-release-checklist.md)
-26. [Decision log template](decision-log-template.md)
+24. [Research Radar](20-research-radar.md)
+25. [v0 roadmap](roadmap-v0.md)
+26. [Public release checklist](public-release-checklist.md)
+27. [Decision log template](decision-log-template.md)
diff --git a/research/README.md b/research/README.md
new file mode 100644
index 0000000..5fd2ae1
--- /dev/null
+++ b/research/README.md
@@ -0,0 +1,64 @@
+# Research Radar
+
+Research Radar is the public-safe benchmark intelligence layer for Agent Bench Lab.
+
+It is not a generic AI-news feed. It tracks external work that can change benchmark design:
+
+- scoring and oracle patterns;
+- private/public split methods;
+- replay and snapshot environments;
+- trace policy and tool-use checks;
+- benchmark hardening, exploitability, and contamination;
+- cost, latency, pass^k, and repeatability methods;
+- standards and eval-framework updates.
+
+The goal is to turn external benchmark/eval signals into roadmap decisions without chasing hype.
+
+## Cadence
+
+| Loop | Timebox | Purpose |
+|---|---:|---|
+| Daily radar | 15 minutes | Collect and triage high-signal source changes |
+| Weekly synthesis | 45 minutes | Decide whether the roadmap changes |
+| Monthly pruning | 30 minutes | Remove noisy sources and add better ones |
+
+Daily briefs are triage artifacts. Weekly synthesis is where roadmap decisions happen.
+
+## Public-Safe Rule
+
+Public `research/` files may contain:
+
+- public watchlists;
+- public source maps;
+- public search queries;
+- daily and weekly templates;
+- curated public benchmark notes;
+- public weekly summaries.
+
+Do not commit:
+
+- raw feeds or crawler caches;
+- private eval material;
+- private holdouts;
+- hidden answer keys;
+- protected scorer configs;
+- customer data;
+- personal notes;
+- private roadmap doubts;
+- consumer-application observations.
+
+Private or noisy working notes belong outside the public repo.
+
+## Action Categories
+
+Every radar item should end in one action:
+
+- `ignore`
+- `read later`
+- `add to watchlist`
+- `open issue`
+- `update roadmap`
+- `run follow-up research`
+- `prototype after review`
+
+Most days should not change the roadmap.
diff --git a/research/daily-brief-template.md b/research/daily-brief-template.md
new file mode 100644
index 0000000..0500e15
--- /dev/null
+++ b/research/daily-brief-template.md
@@ -0,0 +1,61 @@
+# Agent Bench Lab Daily Research Brief
+
+Date:
+
+Window:
+
+## Executive Summary
+
+-
+-
+-
+
+## New Or Updated Papers
+
+| Item | Source | Why it matters | Action |
+|---|---|---|---|
+| | | | ignore / read later / add to watchlist / open issue / update roadmap / run follow-up research / prototype after review |
+
+## New Or Updated Repos
+
+| Item | Source | Change | Action |
+|---|---|---|---|
+| | | | |
+
+## Methodology Changes
+
+- Scoring:
+- Replay:
+- Private split:
+- Canary:
+- Exploit:
+- Trace policy:
+- Reporting:
+
+## Risks
+
+- Contamination:
+- Eval awareness:
+- Leakage:
+- Live-service drift:
+- LLM judge overuse:
+
+## Recommended Actions
+
+| Action | Target | Reason | Owner |
+|---|---|---|---|
+| ignore / read later / add to watchlist / open issue / update roadmap / run follow-up research / prototype after review | | | |
+
+## Watchlist Updates
+
+- Add:
+- Remove:
+- Change cadence:
+
+## Bottom Line
+
+Should Agent Bench Lab change direction today?
+
+Answer:
+
+Reason:
diff --git a/research/people.md b/research/people.md
new file mode 100644
index 0000000..3d62c7a
--- /dev/null
+++ b/research/people.md
@@ -0,0 +1,19 @@
+# People And Teams
+
+Research Radar should not over-index on influencers. Monitor people and teams because they publish
+benchmarks, eval frameworks, harnesses, or hardening work.
+
+| Name or team | Affiliation | Why watch | Projects | Cadence | Tags |
+|---|---|---|---|---|---|
+| SWE-bench maintainers | Princeton / broader open-source ecosystem | Verified splits and coding-agent benchmark repair | SWE-bench, SWE-bench Verified | weekly | code, verified |
+| ServiceNow Frontier AI / BrowserGym team | ServiceNow Research | Browser and workplace-agent benchmark harnesses | BrowserGym, AgentLab, WorkArena | weekly | browser, workflow |
+| WebArena maintainers | Academic open benchmark ecosystem | Browser workflow environments and verification updates | WebArena, WebArena-Verified | weekly | browser, verified |
+| OSWorld / XLANG team | XLANG / academic collaborators | Desktop-agent environments and verified task repair | OSWorld, OSWorld-Verified | weekly | os, desktop |
+| AppWorld authors | Stony Brook / AI2-linked research ecosystem | API/app simulation and state-diff benchmark design | AppWorld | weekly | api, state-diff |
+| AgentDojo / ETH security authors | ETH Zurich security research | Prompt injection, tool-output trust, agent-safety evals | AgentDojo | weekly | security, injection |
+| MCP benchmark authors | Public MCP/tool-use benchmark ecosystem | Tool registry, protocol, and tool-routing eval patterns | MCP-Bench, MCP-Atlas, MCP-Universe searches | weekly | mcp, tools |
+| Deep research benchmark authors | Public research-agent benchmark ecosystem | Citation, source grounding, claim checking, long-horizon research | DeepResearch Bench, LiveDRBench, DRBench, PaperBench searches | weekly | research, citations |
+| UK AISI Inspect AI team | UK AI Security Institute | Eval harness, scorer, and audit patterns | Inspect AI | weekly | framework, safety |
+| NIST AI RMF and eval standards teams | NIST | Public risk/evaluation governance signals | AI RMF, standards updates | monthly | standards, governance |
+
+Add individuals only when a named author repeatedly produces high-signal benchmark mechanics work.
diff --git a/research/queries/arxiv.txt b/research/queries/arxiv.txt
new file mode 100644
index 0000000..ffdea0b
--- /dev/null
+++ b/research/queries/arxiv.txt
@@ -0,0 +1,16 @@
+"agent benchmark"
+"tool-use benchmark"
+"MCP benchmark"
+"browser agent benchmark"
+"computer-use agent"
+"deep research benchmark"
+"research agent benchmark"
+"prompt injection benchmark" AND agent
+"benchmark contamination" AND agent
+"benchmark exploitation" AND agent
+"reward hacking" AND ("agent" OR "tool use")
+"private holdout" AND evaluation
+"state-diff scoring" AND agent
+"trace-policy scoring" AND agent
+"LLM judge" AND benchmark AND reliability
+"verified benchmark" AND agent
diff --git a/research/queries/github.txt b/research/queries/github.txt
new file mode 100644
index 0000000..42b7460
--- /dev/null
+++ b/research/queries/github.txt
@@ -0,0 +1,14 @@
+agent benchmark language:Python
+tool-use benchmark agent
+MCP benchmark OR MCP-Bench OR MCP-Universe OR MCP-Atlas
+browser agent benchmark OR computer-use agent
+deep research benchmark OR research agent benchmark
+prompt injection benchmark agent
+agent safety benchmark tool use
+benchmark contamination agent
+benchmark exploit reward hacking
+private holdout evaluation benchmark
+state diff scorer agent benchmark
+trace policy scorer tool calls
+verified benchmark agent tasks
+replay environment agent benchmark
diff --git a/research/queries/scholar.txt b/research/queries/scholar.txt
new file mode 100644
index 0000000..6e26756
--- /dev/null
+++ b/research/queries/scholar.txt
@@ -0,0 +1,14 @@
+"agent benchmark" "tool use"
+"browser agent benchmark"
+"computer-use agent benchmark"
+"deep research benchmark"
+"research agent benchmark" citations
+"prompt injection benchmark" "tool use"
+"benchmark contamination" "large language models"
+"benchmark exploitation" "large language models"
+"reward hacking" "agent"
+"private holdout" "evaluation"
+"state-based evaluation" "agent"
+"trace policy" "tool use"
+"verified benchmark" "agent"
+"LLM-as-a-judge" reliability benchmark
diff --git a/research/reading-queue.md b/research/reading-queue.md
new file mode 100644
index 0000000..2ebef21
--- /dev/null
+++ b/research/reading-queue.md
@@ -0,0 +1,8 @@
+# Reading Queue
+
+Public-safe reading queue only. Do not add private notes, customer observations, hidden holdout
+details, private scorer configs, or raw feeds.
+
+| Priority | Title | Source | Type | Why read | Status | Owner | Next action |
+|---|---|---|---|---|---|---|---|
+| | | | paper / repo / standard / issue / release | | queued / reading / summarized / ignored | | |
diff --git a/research/repos.md b/research/repos.md
new file mode 100644
index 0000000..0b88e02
--- /dev/null
+++ b/research/repos.md
@@ -0,0 +1,25 @@
+# Repos And Projects
+
+Watch modes:
+
+- `releases`: stable updates only;
+- `issues`: validation, exploit, and benchmark-repair discussion;
+- `commits`: task/scorer methodology changes;
+- `all activity`: high-signal core source.
+
+| Project or repo | Why watch | Watch mode | Priority | Tags |
+|---|---|---|---|---|
+| SWE-bench / SWE-bench Verified | Coding benchmark repair, verified splits, hidden-test methodology | releases, issues | P0 | code, verified |
+| Terminal-Bench | Terminal workflow benchmark and executable task design | releases, issues | P0 | terminal, execution |
+| ServiceNow/BrowserGym | Browser-agent harness and replay patterns | releases, issues | P0 | browser, harness |
+| ServiceNow/WorkArena | Workplace-style browser workflow tasks | releases, issues | P0 | browser, workflow |
+| WebArena / WebArena-Verified | Browser task validation and environment issues | releases, issues | P0 | browser, verified |
+| OSWorld / OSWorld-Verified | Desktop tasks, environment reproducibility, verified repairs | releases, issues | P0 | os, desktop |
+| AppWorld | Local app/API simulation and state-diff evaluation | releases, issues | P0 | api, state-diff |
+| AgentDojo | Prompt-injection and tool-output trust boundary tasks | releases, issues | P0 | security, injection |
+| Inspect AI | Eval framework, scorer patterns, reporting mechanics | releases | P0 | framework, scoring |
+| MCP benchmark repos | MCP/tool registry benchmark patterns | releases, issues | P0 | mcp, tools |
+| Deep research benchmark repos | Source-grounded research, citation, and claim scoring | releases, issues | P0 | research, citations |
+| Benchmark hardening paper repos | Exploit, contamination, and reward-hacking test harnesses | releases, issues | P1 | hardening, exploit |
+
+Do not mirror repos or cache raw feeds in the public tree.
diff --git a/research/source-map.csv b/research/source-map.csv
new file mode 100644
index 0000000..f32ba50
--- /dev/null
+++ b/research/source-map.csv
@@ -0,0 +1,16 @@
+priority,source_type,name,why_monitor,channel_or_url,cadence,signal_quality,relevance,tags
+P0,repo/site,SWE-bench ecosystem,"Verified splits, coding task curation, hidden-test methodology",https://www.swebench.com/,daily,high,high,"code;verified;hidden-tests"
+P0,repo/site,Terminal-Bench,"Executable terminal tasks and shell workflow scoring",https://www.tbench.ai/,daily,high,high,"terminal;execution"
+P0,repo,ServiceNow BrowserGym,"Browser-agent harness and replay-oriented task design",https://github.com/ServiceNow/BrowserGym,daily,high,high,"browser;harness;replay"
+P0,repo,ServiceNow WorkArena,"Workplace/browser workflow benchmark patterns",https://github.com/ServiceNow/WorkArena,daily,high,high,"browser;workflow"
+P0,site,WebArena / WebArena-Verified,"Browser workflow tasks, verification updates, environment lessons",https://webarena.dev/,daily,high,high,"browser;verified"
+P0,repo/site,OSWorld / OSWorld-Verified,"OS and desktop-agent tasks, replay and verification patterns",https://os-world.github.io/,daily,high,high,"os;desktop;verified"
+P0,repo,AppWorld,"API/app simulation and state-diff scoring patterns",https://github.com/stonybrooknlp/appworld,daily,high,high,"api;state-diff;tools"
+P0,search cluster,MCP benchmark cluster,"MCP and tool-registry benchmark methods","GitHub/arXiv search: MCP benchmark, MCP-Bench, MCP-Universe",daily,medium,high,"mcp;tool-use"
+P0,search cluster,Deep Research benchmark cluster,"Fixed-corpus research, citation scoring, claim checks","GitHub/arXiv search: DeepResearch Bench, LiveDRBench, DRBench, PaperBench",daily,medium,high,"research;citations"
+P0,repo,AgentDojo,"Prompt-injection and tool-output trust-boundary benchmark design",https://github.com/ethz-spylab/agentdojo,daily,high,high,"security;injection"
+P0,standards,NIST AI Risk Management Framework,"Evaluation governance and risk framing","NIST AI RMF official page",weekly,high,medium,"standards;governance"
+P0,framework,Inspect AI,"Public eval harness and scoring framework patterns",https://inspect.aisi.org.uk/,weekly,high,high,"framework;scoring"
+P0,search cluster,Benchmark hardening papers,"Contamination, exploitability, reward hacking, eval awareness","arXiv/Semantic Scholar query set",daily,medium,high,"hardening;exploit;contamination"
+P1,conference,Agent/eval/safety workshops,"New accepted papers and benchmark tracks","NeurIPS, ICML, ICLR, ACL workshop pages",weekly,medium,medium,"conference;workshop"
+P1,framework,Evals and harness releases,"Report, trace, scoring, and lifecycle ideas","GitHub release watchlist",weekly,medium,medium,"framework;reports"
diff --git a/research/watchlist.md b/research/watchlist.md
new file mode 100644
index 0000000..8839be8
--- /dev/null
+++ b/research/watchlist.md
@@ -0,0 +1,54 @@
+# Research Radar Watchlist
+
+Priorities:
+
+- `P0 daily`: high-signal benchmark mechanics sources.
+- `P1 weekly`: useful secondary sources and teams.
+- `P2 monthly`: standards, conference tracks, and lower-cadence sources.
+
+## P0 Daily
+
+| Source group | Why monitor | Tags |
+|---|---|---|
+| SWE-bench ecosystem | Verified splits, issue curation, coding-agent evaluation methodology | code, verified, hidden-tests |
+| Terminal-Bench | Terminal workflow tasks, executable environments, shell-based scoring | terminal, execution, replay |
+| BrowserGym / AgentLab / WorkArena / WebArena-Verified | Browser and workplace-agent benchmarks, replay and verified environment patterns | browser, workflow, verified |
+| OSWorld / OSWorld-Verified | Desktop/OS agent tasks, environment replay, verification issues | os, desktop, replay |
+| AppWorld | API/app simulation, state-diff scoring, tool-use planning | api, state-diff, tools |
+| MCP benchmark cluster | MCP/tool registry benchmarks and local tool-use evaluation patterns | mcp, tool-use, registry |
+| Deep Research benchmark cluster | Fixed-corpus research, citation checks, claim scoring, long-horizon research tasks | research, citations, claims |
+| AgentDojo / security cluster | Prompt injection, tool-output trust boundary, agent safety evals | security, injection, leaks |
+| NIST / Inspect AI / standards | Eval framework practices, auditability, safety and policy measurement | standards, harnesses |
+| Benchmark-hardening and exploit papers | Contamination, reward hacking, eval awareness, public benchmark exploitation | hardening, exploit, contamination |
+
+## P1 Weekly
+
+| Source group | Why monitor | Tags |
+|---|---|---|
+| Official benchmark issue trackers | Failure reports, validation fixes, exploit reports | bugs, validation |
+| Eval framework releases | Harness, scoring, report, and trace-format ideas | framework, reporting |
+| Author pages for benchmark maintainers | New papers before repos are updated | papers, authors |
+| Workshop pages for agents/evals/safety | New accepted papers and benchmark tracks | conference, workshop |
+| Model-system eval reports | Useful methodology, even when benchmark code is unavailable | methodology, reports |
+
+## P2 Monthly
+
+| Source group | Why monitor | Tags |
+|---|---|---|
+| Conference proceedings | Broad benchmark trends and new task families | survey, papers |
+| Standards organizations | Slow-moving guidance for eval governance and reporting | standards, governance |
+| Public leaderboards | Detect benchmark saturation or exploit incentives | leaderboard, integrity |
+| Archived benchmark repos | Deprecation lessons and reproducibility failures | lifecycle, deprecation |
+
+## Alert Rules
+
+Open an issue when a source introduces:
+
+- a new verified split;
+- a new exploit or contamination finding;
+- a new deterministic/state-based oracle pattern;
+- a new replay/snapshot environment pattern;
+- a new redacted-feedback or private-holdout method;
+- a benchmark lifecycle or deprecation policy worth adapting.
+
+Do not open issues for generic model score changes unless the method changes.
diff --git a/research/weekly-synthesis-template.md b/research/weekly-synthesis-template.md
new file mode 100644
index 0000000..10531b2
--- /dev/null
+++ b/research/weekly-synthesis-template.md
@@ -0,0 +1,69 @@
+# Agent Bench Lab Weekly Research Synthesis
+
+Week:
+
+## What Actually Changed?
+
+-
+-
+-
+
+## Evidence By Area
+
+| Area | Evidence | Source | Confidence |
+|---|---|---|---|
+| Scoring/oracles | | | |
+| Private/public split | | | |
+| Replay/snapshot environments | | | |
+| Trace policy/tool use | | | |
+| Benchmark hardening/exploits | | | |
+| Reporting/lifecycle | | | |
+
+## Roadmap Impact
+
+| Decision | Rationale | Action |
+|---|---|---|
+| Keep current roadmap / update roadmap / run follow-up research / prototype after review | | |
+
+## Issues To Open
+
+| Title | Why now | Scope |
+|---|---|---|
+| | | |
+
+## Issues To Close Or Deprioritize
+
+| Issue | Reason |
+|---|---|
+| | |
+
+## Experiments To Run
+
+| Experiment | Expected learning | Timebox |
+|---|---|---|
+| | | |
+
+## Source Quality Review
+
+| Source | Keep / downgrade / remove | Reason |
+|---|---|---|
+| | | |
+
+## Proposed Decision Log Entries
+
+```text
+Date:
+Project: Agent Bench Lab
+Decision:
+Reason:
+What this prevents:
+Review date:
+```
+
+## Next Week Focus
+
+Maximum three:
+
+1.
+2.
+3.
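
Review note: `research/source-map.csv` in this diff has a fixed header row and a small set of repeated values, so it could be linted before merge. The sketch below is hypothetical and not part of this change. `validate_source_map`, `EXPECTED_COLUMNS`, and `ALLOWED` are illustrative names. The allowed value sets are inferred from the rows and watchlist priorities in this diff, with `low` added as an assumption; they do not come from a documented schema.

```python
import csv
import io

# Column names copied from the header row of research/source-map.csv in this diff.
EXPECTED_COLUMNS = [
    "priority", "source_type", "name", "why_monitor", "channel_or_url",
    "cadence", "signal_quality", "relevance", "tags",
]

# ASSUMPTION: allowed values are inferred from the rows in this diff
# ("low" is a guess); the repo does not document a schema.
ALLOWED = {
    "priority": {"P0", "P1", "P2"},
    "cadence": {"daily", "weekly", "monthly"},
    "signal_quality": {"high", "medium", "low"},
    "relevance": {"high", "medium", "low"},
}


def validate_source_map(text):
    """Return a list of human-readable problems found in source-map CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    for row in reader:
        line_no = reader.line_num  # physical line the row ended on
        # DictReader stores extra fields under the key None and fills
        # missing fields with the value None.
        if None in row or any(value is None for value in row.values()):
            problems.append(f"line {line_no}: wrong number of fields")
            continue
        for column, allowed in ALLOWED.items():
            if row[column] not in allowed:
                problems.append(f"line {line_no}: {column}={row[column]!r} not allowed")
        if not row["tags"].strip():
            problems.append(f"line {line_no}: tags is empty")
    return problems
```

If something like this were adopted, it could read the file with `pathlib.Path("research/source-map.csv").read_text()` and fail a check when the returned list is non-empty. Whether that belongs in `make hardening-check` is a design question this diff does not settle.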