heurema · t3chn · May 25, 2026
diff --git a/README.md b/README.md
@@ -33,7 +33,7 @@ The common unit is not "coding task" or "office task". The common unit is:
 task family + fixtures + allowed tools + expected artifact/state + scorer + run comparison
 ```
 
-The public v0/v0.5 implementation includes a small starter suite and five hardened task-family patterns. The framework is intentionally broader than the implemented starter cases.
+The public v0/v0.7 implementation includes a small starter suite, five hardened task-family patterns, lifecycle gates, and a public-safe research radar. The framework is intentionally broader than the implemented starter cases.
 
 ## Relationship to consumer applications
 
@@ -71,6 +71,20 @@ make hardening-check
 
 See [Benchmark lifecycle](docs/16-benchmark-lifecycle.md), [Mutation and exploit gates](docs/17-mutation-and-exploit-gates.md), [Suite strategy](docs/18-suite-strategy.md), and [Report schema v1 guidance](docs/19-report-schema-v1.md).
 
+## Research Radar
+
+Research Radar keeps Agent Bench Lab aligned with external benchmark and eval methodology without turning the repo into a news feed.
+
+It tracks benchmark mechanics: oracles, hidden splits, replay, trace policy, scoring contracts, exploitability, contamination, standards, and eval-framework changes.
+
+```text
+research/
+```
+
+Public `research/` files contain watchlists, source maps, queries, and daily/weekly templates only. Raw feeds, private notes, customer observations, private holdouts, and protected scorer details stay out of the public repo.
+
+See [Research Radar](docs/20-research-radar.md) and [research/README.md](research/README.md).
+
 ## Current status
 
 This repository is a **v0 public starter**. It contains:
@@ -81,7 +95,7 @@ This repository is a **v0 public starter**. It contains:
 - minimal Python CLI scaffolding;
 - sample public fixtures;
 - sample scorers plus hardened IF-01, DATA-01, DOC-01, SUP-01, and API-01 artifact/state-based scorers;
-- documentation for benchmark design, metrics, anti-overfitting, lifecycle status, and hardening gates.
+- documentation for benchmark design, metrics, anti-overfitting, lifecycle status, hardening gates, and research radar process.
 
 It intentionally does **not** contain private holdout tasks, production secrets, personal data, or benchmark answers for real evaluation runs.
 

diff --git a/docs/20-research-radar.md b/docs/20-research-radar.md
@@ -0,0 +1,102 @@
+# Research Radar
+
+Agent Bench Lab needs a research radar because benchmark methodology changes quickly. Static
+research snapshots are useful, but they are not enough to keep a benchmark standard current.
+
+The radar is a public-safe process for tracking benchmark mechanics, not a generic AI-news feed.
+
+## What To Monitor
+
+Monitor sources that can change Agent Bench Lab design:
+
+- verified splits and benchmark repair;
+- deterministic and audited scoring;
+- state-diff and trace-policy oracles;
+- replay and snapshot environments;
+- private holdout and redacted-feedback methods;
+- benchmark contamination and exploitability;
+- prompt-injection and tool-output trust-boundary evals;
+- cost, latency, pass^k, and repeatability reporting;
+- eval-framework and standards updates.
+
+## Cadence
+
+| Loop | Timebox | Output |
+|---|---:|---|
+| Daily radar | 15 minutes | short triage brief |
+| Weekly synthesis | 45 minutes | roadmap decision or explicit no-change |
+| Monthly pruning | 30 minutes | watchlist cleanup |
+
+Daily radar should answer: did anything important change?
+
+Weekly synthesis should answer: should Agent Bench Lab change the roadmap, open issues, or run
+follow-up research?
+
+## Public And Private Boundary
+
+Public-safe:
+
+- watchlists;
+- public source maps;
+- query sets;
+- templates;
+- curated public summaries;
+- decision logs.
+
+Do not commit:
+
+- raw feeds;
+- dedupe caches;
+- private eval material;
+- hidden holdouts;
+- answer keys;
+- customer data;
+- private rubrics;
+- protected scorer configs;
+- personal notes;
+- consumer-application observations.
+
+## Action Categories
+
+Every item should end in exactly one of these actions:
+
+- `ignore`
+- `read later`
+- `add to watchlist`
+- `open issue`
+- `update roadmap`
+- `run follow-up research`
+- `prototype after review`
+
+Most items should be ignored or queued. The radar prevents stale decisions; it should not create
+constant churn.
+
+## When To Open An Issue
+
+Open an issue when a source introduces:
+
+- a benchmark-hardening method Agent Bench Lab should adopt;
+- a new scorer/oracle pattern;
+- a verified split or benchmark repair lesson;
+- a private/public split pattern;
+- a replay/snapshot method;
+- a concrete exploit or contamination risk;
+- a report schema or lifecycle convention worth standardizing.
+
+Do not open issues for leaderboard movement or generic model news unless the evaluation method
+changes.
+
+## Files
+
+Research Radar files live under:
+
+```text
+research/
+```
+
+Start with:
+
+- `research/watchlist.md`
+- `research/source-map.csv`
+- `research/daily-brief-template.md`
+- `research/weekly-synthesis-template.md`
diff --git a/docs/README.md b/docs/README.md
@@ -25,6 +25,7 @@ Start here:
 21. [Mutation and exploit gates](17-mutation-and-exploit-gates.md)
 22. [Suite strategy](18-suite-strategy.md)
 23. [Report schema v1 guidance](19-report-schema-v1.md)
-24. [v0 roadmap](roadmap-v0.md)
-25. [Public release checklist](public-release-checklist.md)
-26. [Decision log template](decision-log-template.md)
+24. [Research Radar](20-research-radar.md)
+25. [v0 roadmap](roadmap-v0.md)
+26. [Public release checklist](public-release-checklist.md)
+27. [Decision log template](decision-log-template.md)
diff --git a/research/README.md b/research/README.md
@@ -0,0 +1,64 @@
+# Research Radar
+
+Research Radar is the public-safe benchmark intelligence layer for Agent Bench Lab.
+
+It is not a generic AI-news feed. It tracks external work that can change benchmark design:
+
+- scoring and oracle patterns;
+- private/public split methods;
+- replay and snapshot environments;
+- trace policy and tool-use checks;
+- benchmark hardening, exploitability, and contamination;
+- cost, latency, pass^k, and repeatability methods;
+- standards and eval-framework updates.
+
+The goal is to turn external benchmark/eval signals into roadmap decisions without chasing hype.
+
+## Cadence
+
+| Loop | Timebox | Purpose |
+|---|---:|---|
+| Daily radar | 15 minutes | Collect and triage high-signal source changes |
+| Weekly synthesis | 45 minutes | Decide whether the roadmap changes |
+| Monthly pruning | 30 minutes | Remove noisy sources and add better ones |
+
+Daily briefs are triage artifacts. Weekly synthesis is where roadmap decisions happen.
+
+## Public-Safe Rule
+
+Public `research/` files may contain:
+
+- public watchlists;
+- public source maps;
+- public search queries;
+- daily and weekly templates;
+- curated public benchmark notes;
+- public weekly summaries.
+
+Do not commit:
+
+- raw feeds or crawler caches;
+- private eval material;
+- private holdouts;
+- hidden answer keys;
+- protected scorer configs;
+- customer data;
+- personal notes;
+- private roadmap doubts;
+- consumer-application observations.
+
+Private or noisy working notes belong outside the public repo.
+
+## Action Categories
+
+Every radar item should end in one action:
+
+- `ignore`
+- `read later`
+- `add to watchlist`
+- `open issue`
+- `update roadmap`
+- `run follow-up research`
+- `prototype after review`
+
+Most days should not change the roadmap.
diff --git a/research/daily-brief-template.md b/research/daily-brief-template.md
@@ -0,0 +1,61 @@
+# Agent Bench Lab Daily Research Brief
+
+Date:
+
+Window:
+
+## Executive Summary
+
+-
+-
+-
+
+## New Or Updated Papers
+
+| Item | Source | Why it matters | Action |
+|---|---|---|---|
+|  |  |  | ignore / read later / add to watchlist / open issue / update roadmap / run follow-up research / prototype after review |
+
+## New Or Updated Repos
+
+| Item | Source | Change | Action |
+|---|---|---|---|
+|  |  |  |  |
+
+## Methodology Changes
+
+- Scoring:
+- Replay:
+- Private split:
+- Canary:
+- Exploit:
+- Trace policy:
+- Reporting:
+
+## Risks
+
+- Contamination:
+- Eval awareness:
+- Leakage:
+- Live-service drift:
+- LLM judge overuse:
+
+## Recommended Actions
+
+| Action | Target | Reason | Owner |
+|---|---|---|---|
+| ignore / read later / add to watchlist / open issue / update roadmap / run follow-up research / prototype after review |  |  |  |
+
+## Watchlist Updates
+
+- Add:
+- Remove:
+- Change cadence:
+
+## Bottom Line
+
+Should Agent Bench Lab change direction today?
+
+Answer:
+
+Reason:
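
The template's Action cells list every category as a placeholder, so a filled brief should replace each one with a single value. A minimal sketch of how that could be checked, assuming briefs keep the pipe-table layout shown above; the function names `action_cells` and `invalid_actions` are hypothetical and not part of the repository:

```python
# Hypothetical helper: the repo does not ship this check.
# The seven categories are copied from the Action Categories list.
ACTIONS = frozenset({
    "ignore",
    "read later",
    "add to watchlist",
    "open issue",
    "update roadmap",
    "run follow-up research",
    "prototype after review",
})


def action_cells(markdown: str) -> list[str]:
    """Return the Action cell of every data row in tables that have an Action column."""
    found = []
    column = None  # None: outside a table; -1: current table has no Action column
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            column = None  # any non-table line ends the current table
            continue
        cells = [c.strip() for c in line.removeprefix("|").removesuffix("|").split("|")]
        if column is None:
            # First row of a table is its header.
            column = cells.index("Action") if "Action" in cells else -1
        elif all(c and set(c) <= set("-:") for c in cells):
            continue  # separator row such as |---|---:|
        elif 0 <= column < len(cells):
            found.append(cells[column])
    return found


def invalid_actions(markdown: str) -> list[str]:
    """Return Action cells that are not exactly one allowed category."""
    return [cell for cell in action_cells(markdown) if cell not in ACTIONS]
```

Under these assumptions, an unfilled placeholder such as `ignore / read later / ...` and an empty Action cell are both reported, which matches the rule that every item ends in one action.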
diff --git a/research/people.md b/research/people.md
@@ -0,0 +1,19 @@
+# People And Teams
+
+Research Radar should not over-index on influencers. Monitor people and teams for what they publish:
+benchmarks, eval frameworks, harnesses, or hardening work.
+
+| Name or team | Affiliation | Why watch | Projects | Cadence | Tags |
+|---|---|---|---|---|---|
+| SWE-bench maintainers | Princeton / broader open-source ecosystem | Verified splits and coding-agent benchmark repair | SWE-bench, SWE-bench Verified | weekly | code, verified |
+| ServiceNow Frontier AI / BrowserGym team | ServiceNow Research | Browser and workplace-agent benchmark harnesses | BrowserGym, AgentLab, WorkArena | weekly | browser, workflow |
+| WebArena maintainers | Academic open benchmark ecosystem | Browser workflow environments and verification updates | WebArena, WebArena-Verified | weekly | browser, verified |
+| OSWorld / XLANG team | XLANG / academic collaborators | Desktop-agent environments and verified task repair | OSWorld, OSWorld-Verified | weekly | os, desktop |
+| AppWorld authors | Stony Brook / AI2-linked research ecosystem | API/app simulation and state-diff benchmark design | AppWorld | weekly | api, state-diff |
+| AgentDojo / ETH security authors | ETH Zurich security research | Prompt injection, tool-output trust, agent-safety evals | AgentDojo | weekly | security, injection |
+| MCP benchmark authors | Public MCP/tool-use benchmark ecosystem | Tool registry, protocol, and tool-routing eval patterns | MCP-Bench, MCP-Atlas, MCP-Universe searches | weekly | mcp, tools |
+| Deep research benchmark authors | Public research-agent benchmark ecosystem | Citation, source grounding, claim checking, long-horizon research | DeepResearch Bench, LiveDRBench, DRBench, PaperBench searches | weekly | research, citations |
+| UK AISI Inspect AI team | UK AI Security Institute | Eval harness, scorer, and audit patterns | Inspect AI | weekly | framework, safety |
+| NIST AI RMF and eval standards teams | NIST | Public risk/evaluation governance signals | AI RMF, standards updates | monthly | standards, governance |
+
+Add individuals only when a named author repeatedly produces high-signal benchmark mechanics work.
diff --git a/research/queries/arxiv.txt b/research/queries/arxiv.txt
@@ -0,0 +1,16 @@
+"agent benchmark"
+"tool-use benchmark"
+"MCP benchmark"
+"browser agent benchmark"
+"computer-use agent"
+"deep research benchmark"
+"research agent benchmark"
+"prompt injection benchmark" AND agent
+"benchmark contamination" AND agent
+"benchmark exploitation" AND agent
+"reward hacking" AND ("agent" OR "tool use")
+"private holdout" AND evaluation
+"state-diff scoring" AND agent
+"trace-policy scoring" AND agent
+"LLM judge" AND benchmark AND reliability
+"verified benchmark" AND agent
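
The arXiv query lines above can be turned into request URLs for the public arXiv API. The following is a rough sketch, not something this PR ships: it assumes every bare term and quoted phrase should search all fields (the `all:` prefix), and the names `to_search_query` and `build_url` are hypothetical. It only builds the URL and makes no network call.

```python
import re
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"
OPERATORS = {"AND", "OR", "ANDNOT", "(", ")"}
# Quoted phrase, single parenthesis, or bare word.
TOKEN = re.compile(r'"[^"]*"|[()]|[^\s()]+')


def to_search_query(line: str) -> str:
    """Prefix each term in a query line with `all:` and keep Boolean operators as-is."""
    tokens = TOKEN.findall(line)
    joined = " ".join(t if t in OPERATORS else f"all:{t}" for t in tokens)
    # Keep parentheses tight against their contents.
    return joined.replace("( ", "(").replace(" )", ")")


def build_url(line: str, max_results: int = 25) -> str:
    """Build an arXiv API URL that returns the newest submissions first."""
    params = {
        "search_query": to_search_query(line),
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"
```

Sorting by submission date suits the daily loop, where only items newer than the last brief matter; the `max_results` default of 25 is an arbitrary choice for illustration.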
diff --git a/research/queries/github.txt b/research/queries/github.txt
@@ -0,0 +1,14 @@
+agent benchmark language:Python
+tool-use benchmark agent
+MCP benchmark OR MCP-Bench OR MCP-Universe OR MCP-Atlas
+browser agent benchmark OR computer-use agent
+deep research benchmark OR research agent benchmark
+prompt injection benchmark agent
+agent safety benchmark tool use
+benchmark contamination agent
+benchmark exploit reward hacking
+private holdout evaluation benchmark
+state diff scorer agent benchmark
+trace policy scorer tool calls
+verified benchmark agent tasks
+replay environment agent benchmark
diff --git a/research/queries/scholar.txt b/research/queries/scholar.txt
@@ -0,0 +1,14 @@
+"agent benchmark" "tool use"
+"browser agent benchmark"
+"computer-use agent benchmark"
+"deep research benchmark"
+"research agent benchmark" citations
+"prompt injection benchmark" "tool use"
+"benchmark contamination" "large language models"
+"benchmark exploitation" "large language models"
+"reward hacking" "agent"
+"private holdout" "evaluation"
+"state-based evaluation" "agent"
+"trace policy" "tool use"
+"verified benchmark" "agent"
+"LLM-as-a-judge" reliability benchmark
diff --git a/research/reading-queue.md b/research/reading-queue.md
@@ -0,0 +1,8 @@
+# Reading Queue
+
+Public-safe reading queue only. Do not add private notes, customer observations, hidden holdout
+details, private scorer configs, or raw feeds.
+
+| Priority | Title | Source | Type | Why read | Status | Owner | Next action |
+|---|---|---|---|---|---|---|---|
+|  |  |  | paper / repo / standard / issue / release |  | queued / reading / summarized / ignored |  |  |
diff --git a/research/repos.md b/research/repos.md
@@ -0,0 +1,25 @@
+# Repos And Projects
+
+Watch modes:
+
+- `releases`: stable updates only;
+- `issues`: validation, exploit, and benchmark-repair discussion;
+- `commits`: task/scorer methodology changes;
+- `all activity`: releases, issues, and commits, for high-signal core sources.
+
+| Project or repo | Why watch | Watch mode | Priority | Tags |
+|---|---|---|---|---|
+| SWE-bench / SWE-bench Verified | Coding benchmark repair, verified splits, hidden-test methodology | releases, issues | P0 | code, verified |
+| Terminal-Bench | Terminal workflow benchmark and executable task design | releases, issues | P0 | terminal, execution |
+| ServiceNow/BrowserGym | Browser-agent harness and replay patterns | releases, issues | P0 | browser, harness |
+| ServiceNow/WorkArena | Workplace-style browser workflow tasks | releases, issues | P0 | browser, workflow |
+| WebArena / WebArena-Verified | Browser task validation and environment issues | releases, issues | P0 | browser, verified |
+| OSWorld / OSWorld-Verified | Desktop tasks, environment reproducibility, verified repairs | releases, issues | P0 | os, desktop |
+| AppWorld | Local app/API simulation and state-diff evaluation | releases, issues | P0 | api, state-diff |
+| AgentDojo | Prompt-injection and tool-output trust boundary tasks | releases, issues | P0 | security, injection |
+| Inspect AI | Eval framework, scorer patterns, reporting mechanics | releases | P0 | framework, scoring |
+| MCP benchmark repos | MCP/tool registry benchmark patterns | releases, issues | P0 | mcp, tools |
+| Deep research benchmark repos | Source-grounded research, citation, and claim scoring | releases, issues | P0 | research, citations |
+| Benchmark hardening paper repos | Exploit, contamination, and reward-hacking test harnesses | releases, issues | P1 | hardening, exploit |
+
+Do not mirror repos or cache raw feeds in the public tree.
diff --git a/research/source-map.csv b/research/source-map.csv
@@ -0,0 +1,16 @@
+priority,source_type,name,why_monitor,channel_or_url,cadence,signal_quality,relevance,tags
+P0,repo/site,SWE-bench ecosystem,"Verified splits, coding task curation, hidden-test methodology",https://www.swebench.com/,daily,high,high,"code;verified;hidden-tests"
+P0,repo/site,Terminal-Bench,"Executable terminal tasks and shell workflow scoring",https://www.tbench.ai/,daily,high,high,"terminal;execution"
+P0,repo,ServiceNow BrowserGym,"Browser-agent harness and replay-oriented task design",https://github.com/ServiceNow/BrowserGym,daily,high,high,"browser;harness;replay"
+P0,repo,ServiceNow WorkArena,"Workplace/browser workflow benchmark patterns",https://github.com/ServiceNow/WorkArena,daily,high,high,"browser;workflow"
+P0,site,WebArena / WebArena-Verified,"Browser workflow tasks, verification updates, environment lessons",https://webarena.dev/,daily,high,high,"browser;verified"
+P0,repo/site,OSWorld / OSWorld-Verified,"OS and desktop-agent tasks, replay and verification patterns",https://os-world.github.io/,daily,high,high,"os;desktop;verified"
+P0,repo,AppWorld,"API/app simulation and state-diff scoring patterns",https://github.com/stonybrooknlp/appworld,daily,high,high,"api;state-diff;tools"
+P0,search cluster,MCP benchmark cluster,"MCP and tool-registry benchmark methods","GitHub/arXiv search: MCP benchmark, MCP-Bench, MCP-Universe",daily,medium,high,"mcp;tool-use"
+P0,search cluster,Deep Research benchmark cluster,"Fixed-corpus research, citation scoring, claim checks","GitHub/arXiv search: DeepResearch Bench, LiveDRBench, DRBench, PaperBench",daily,medium,high,"research;citations"
+P0,repo,AgentDojo,"Prompt-injection and tool-output trust-boundary benchmark design",https://github.com/ethz-spylab/agentdojo,daily,high,high,"security;injection"
+P0,standards,NIST AI Risk Management Framework,"Evaluation governance and risk framing","NIST AI RMF official page",weekly,high,medium,"standards;governance"
+P0,framework,Inspect AI,"Public eval harness and scoring framework patterns",https://inspect.aisi.org.uk/,weekly,high,high,"framework;scoring"
+P0,search cluster,Benchmark hardening papers,"Contamination, exploitability, reward hacking, eval awareness","arXiv/Semantic Scholar query set",daily,medium,high,"hardening;exploit;contamination"
+P1,conference,Agent/eval/safety workshops,"New accepted papers and benchmark tracks","NeurIPS, ICML, ICLR, ACL workshop pages",weekly,medium,medium,"conference;workshop"
+P1,framework,Evals and harness releases,"Report, trace, scoring, and lifecycle ideas","GitHub release watchlist",weekly,medium,medium,"framework;reports"
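
Because `research/source-map.csv` is hand-edited, a small consistency check can catch malformed rows before the monthly pruning. A minimal sketch under stated assumptions: the allowed values are inferred from the rows in this PR, `low` is assumed to be a valid level even though no row uses it yet, and `validate_source_map` is a hypothetical name.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "priority", "source_type", "name", "why_monitor", "channel_or_url",
    "cadence", "signal_quality", "relevance", "tags",
]
PRIORITIES = {"P0", "P1"}
CADENCES = {"daily", "weekly", "monthly"}
LEVELS = {"high", "medium", "low"}  # "low" is an assumption; no current row uses it


def validate_source_map(text: str) -> list[str]:
    """Return human-readable problems found in the CSV text (empty list if clean)."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    for number, row in enumerate(reader, start=2):  # line 1 is the header
        # DictReader stores extra fields under the key None and fills missing ones with None.
        if None in row or None in row.values():
            problems.append(f"line {number}: wrong number of fields")
            continue
        if row["priority"] not in PRIORITIES:
            problems.append(f"line {number}: priority {row['priority']!r}")
        if row["cadence"] not in CADENCES:
            problems.append(f"line {number}: cadence {row['cadence']!r}")
        for column in ("signal_quality", "relevance"):
            if row[column] not in LEVELS:
                problems.append(f"line {number}: {column} {row[column]!r}")
        if not all(tag.strip() for tag in row["tags"].split(";")):
            problems.append(f"line {number}: empty tag")
    return problems
```

The function takes text rather than a path so it can be exercised without touching the filesystem; a caller would pass the file contents, for example `validate_source_map(Path("research/source-map.csv").read_text())`.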