Releases: heurema/agent-bench-lab

v0.7.1 Harden Local Agent Runner contracts

26 May 11:52
a483127

Hardens the Local Agent Runner runtime contracts with golden run/trace/score fixtures, schema validation, task-packet visibility tests, and redacted/bounded command-output checks. No new task family, provider adapter, browser/MCP runner, private bundle runtime, or v0.8 roadmap decision is included.
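
As an illustration of what a redacted/bounded command-output check might look like, here is a minimal sketch. The pattern list, the `redact_and_bound` name, and the `[REDACTED]` and `[truncated]` markers are hypothetical; the release notes do not show the project's actual rules or limits.

```python
import re

# Hypothetical patterns: the project's real redaction rules are not shown here.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[=:]\s*\S+"),
]


def redact_and_bound(output: str, max_chars: int = 2000) -> str:
    """Mask secret-looking assignments, then cap the captured output length."""
    for pattern in SECRET_PATTERNS:
        # Keep the key name so the log stays readable, drop the value.
        output = pattern.sub(lambda m: f"{m.group(1)}=[REDACTED]", output)
    if len(output) > max_chars:
        # The marker is appended after the cut, so the result can exceed
        # max_chars by the marker's length.
        output = output[:max_chars] + "\n[truncated]"
    return output
```

A test along these lines would assert that no secret value survives redaction and that oversized output always ends with the truncation marker.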

v0.7.0 Research Radar and Benchmark Intelligence

25 May 14:35
b88bcfe

Adds a public-safe Research Radar layer for tracking AI-agent benchmark and evaluation work. Includes watchlists, source maps, query sets, daily brief and weekly synthesis templates, and research process documentation. No automation, scraping, private eval material, or new task family is added.

v0.6.0 Benchmark lifecycle and hardening gates

25 May 11:52
0224c23

Adds lifecycle statuses, decision-grade and verified criteria, mutation smoke gates, exploit gate declarations, suite strategy guidance, report schema v1 guidance, and lightweight validation scripts. No new task family is added.

v0.5.0 API-01 decision-grade task family

25 May 07:55
351adfc

Adds API-01 as the fifth decision-grade task family for local internal API/tool-registry workflows. API-01 uses synthetic fixtures, scorer-side API simulation, deterministic state-diff checks, call-order validation, forbidden endpoint detection, and a tools-local suite without live SaaS or MCP dependencies.
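
The checks named above (call-order validation, forbidden endpoint detection, state-diff checks) could be sketched roughly as follows. This is an illustrative sketch only: `Call`, `check_calls`, `state_diff`, and the endpoint strings are hypothetical and are not taken from the API-01 scorer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    method: str
    endpoint: str


def check_calls(calls, required_order, forbidden):
    """Return a list of violation strings; an empty list means the trace passes."""
    violations = [
        f"forbidden endpoint: {c.endpoint}" for c in calls if c.endpoint in forbidden
    ]
    # required_order must appear as a subsequence of the observed endpoints.
    # Sharing one iterator across lookups enforces the ordering.
    remaining = iter(c.endpoint for c in calls)
    for endpoint in required_order:
        if not any(seen == endpoint for seen in remaining):
            violations.append(f"missing or out-of-order call: {endpoint}")
            break
    return violations


def state_diff(before: dict, after: dict) -> dict:
    """Map each changed key to (old, new); a key absent on one side shows as None."""
    keys = before.keys() | after.keys()
    return {
        k: (before.get(k), after.get(k))
        for k in sorted(keys)
        if before.get(k) != after.get(k)
    }
```

Because both functions operate on plain in-memory data, a scorer built this way stays deterministic and needs no live service, which matches the stated goal of avoiding SaaS or MCP dependencies.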

v0.4.0 SUP-01 decision-grade task family

25 May 07:15
b596cc7

Adds SUP-01 as the fourth decision-grade task family for synthetic support inbox triage, policy-compliant draft replies, escalation decisions, and customer-style workflow evaluation. Introduces the ops-local suite while keeping core focused.

v0.3.0 DOC-01 decision-grade task family

25 May 06:34
3e8ec69

Adds DOC-01 as the third decision-grade task family for fixed-corpus document QA, grounded answers, citation validation, unsupported-claim detection, and stale/distractor source checks using synthetic public fixtures and deterministic scoring.
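
To make citation validation and stale-source checks concrete, here is a small sketch. It assumes a bracketed `[doc-id]` citation syntax and a `check_citations` helper, both hypothetical; the notes do not describe DOC-01's actual answer format or scorer.

```python
import re

# Assumed citation syntax: "[doc-id]" inline in the answer text.
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def check_citations(answer: str, corpus: dict, stale_ids: set) -> dict:
    """Report cited IDs missing from the corpus, stale citations, and uncited answers."""
    cited = CITATION.findall(answer)
    return {
        "unknown": sorted({c for c in cited if c not in corpus}),
        "stale": sorted({c for c in cited if c in stale_ids}),
        "uncited": not cited,
    }
```

With a fixed corpus, a check like this returns the same result on every run, which is what deterministic scoring requires.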

v0.2.0 DATA-01 decision-grade task family

25 May 05:48
31ff201

Adds DATA-01 as the second decision-grade task family, using the standard scorer-contract model for exact metrics, schema validation, factual report checks, and deterministic chart-spec validation with synthetic public fixtures.

v0.1.2 Harden public reporting and leak gates

25 May 05:44
d733fbd

Adds lightweight redaction for public compare reports and strengthens public leak checks with tracked-file denylist scanning. Keeps raw local score records stable while reducing risk of exposing scorer-only/private evaluation content through generated public reports.
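
A tracked-file denylist scan can be sketched in a few lines. The glob patterns and the `find_leaks` name below are hypothetical, chosen for illustration; the repository's real denylist is not shown in these notes.

```python
from fnmatch import fnmatch

# Hypothetical denylist; note that fnmatch's "*" also matches "/" in paths.
DENYLIST = ["private/*", "*.scorer-only.json", "*/hidden_answers/*"]


def find_leaks(tracked_paths, denylist=DENYLIST):
    """Return the tracked paths that match any denylist pattern, sorted."""
    return sorted(
        path
        for path in tracked_paths
        if any(fnmatch(path, pattern) for pattern in denylist)
    )
```

In practice the path list could come from `git ls-files`, so the scan covers only what would actually be published; a non-empty result would fail the gate.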

v0.1.1 Standard-layer boundary docs

25 May 05:13
a913087

Adds product-neutral Private Eval Layer guidance, scorer type contracts, visibility matrix, decision-grade graduation criteria, and redacted feedback rules. This prepares DATA-01 and future task families to use shared scoring contracts instead of one-off scorers.

v0.1.0 IF-01 decision-grade task family

24 May 15:48
15e6c3c

Adds IF-01 as the first deterministic decision-grade task family for strict instruction following and artifact-contract compliance. Includes public synthetic cases, config-driven scoring, mutation support, tests, documentation, and if01-smoke.