Releases: heurema/agent-bench-lab
v0.7.1 Harden Local Agent Runner contracts
Hardens the Local Agent Runner runtime contracts with golden run/trace/score fixtures, schema validation, task-packet visibility tests, and redacted/bounded command-output checks. No new task family, provider adapter, browser/MCP runner, private bundle runtime, or v0.8 roadmap decision is included.
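As an illustration of what a redacted and bounded command-output check might look like, here is a minimal Python sketch. The names `MAX_OUTPUT_CHARS`, `SECRET_PATTERN`, and `redact_and_bound` are hypothetical, and the limit and pattern are assumptions; the release notes do not describe the runner's actual implementation.

```python
import re

# Hypothetical limit and pattern; the project's real values are not stated in these notes.
MAX_OUTPUT_CHARS = 200
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[=:]\s*\S+")


def redact_and_bound(output: str, limit: int = MAX_OUTPUT_CHARS) -> str:
    """Mask secret-looking assignments, then cap the length of the output."""
    # Keep the key name so the log stays readable, but drop the value.
    redacted = SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", output)
    if len(redacted) <= limit:
        return redacted
    omitted = len(redacted) - limit
    # Record how much was cut so a reader knows the output is incomplete.
    return redacted[:limit] + f"\n[truncated {omitted} chars]"
```

Redacting before truncating means a secret cannot survive by being split across the cut point.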
v0.7.0 Research Radar and Benchmark Intelligence
Adds a public-safe Research Radar layer for tracking AI-agent benchmark and evaluation work. Includes watchlists, source maps, query sets, daily brief and weekly synthesis templates, and research process documentation. No automation, scraping, private eval material, or new task family is added.
v0.6.0 Benchmark lifecycle and hardening gates
Adds lifecycle statuses, decision-grade and verified criteria, mutation smoke gates, exploit gate declarations, suite strategy guidance, report schema v1 guidance, and lightweight validation scripts. No new task family is added.
v0.5.0 API-01 decision-grade task family
Adds API-01 as the fifth decision-grade task family for local internal API/tool-registry workflows. API-01 uses synthetic fixtures, scorer-side API simulation, deterministic state-diff checks, call-order validation, forbidden endpoint detection, and a tools-local suite without live SaaS or MCP dependencies.
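The three scorer-side checks named above (forbidden endpoint detection, call-order validation, and state-diff checks) can be sketched in a few lines of Python. This is an illustrative sketch only: the `Call` record and the function names are hypothetical and are not taken from the API-01 scorer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One recorded API call in an agent trace (hypothetical shape)."""
    method: str
    endpoint: str


def forbidden_calls(trace, forbidden):
    """Return every recorded call whose endpoint is on the forbidden list."""
    return [c for c in trace if c.endpoint in forbidden]


def respects_order(trace, required):
    """True if the required endpoints appear in the trace as an ordered subsequence."""
    remaining = iter(c.endpoint for c in trace)
    # Each membership test consumes the iterator up to the match,
    # so later endpoints must appear after earlier ones.
    return all(endpoint in remaining for endpoint in required)


def state_diff(before, after):
    """Map each changed key to its (old, new) pair; missing keys show as None."""
    keys = sorted(set(before) | set(after))
    return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}
```

Because every check is a pure function of the recorded trace and the simulated state, the same run always produces the same result, which is the property a deterministic scorer needs.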
v0.4.0 SUP-01 decision-grade task family
Adds SUP-01 as the fourth decision-grade task family for synthetic support inbox triage, policy-compliant draft replies, escalation decisions, and customer-style workflow evaluation. Introduces the ops-local suite while keeping core focused.
v0.3.0 DOC-01 decision-grade task family
Adds DOC-01 as the third decision-grade task family for fixed-corpus document QA, grounded answers, citation validation, unsupported-claim detection, and stale/distractor source checks using synthetic public fixtures and deterministic scoring.
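A deterministic citation check against a fixed corpus could look roughly like the Python sketch below. The function name, the id format, and the rule that an answer with no citations fails are all assumptions made for illustration; they are not taken from the DOC-01 scorer.

```python
def check_citations(cited_ids, corpus_ids, stale_ids):
    """Classify the source ids cited by an answer against a fixed corpus.

    Hypothetical rule set: a citation is "unknown" if it is not in the corpus
    and "stale" if it points at a source marked as outdated.
    """
    unknown = sorted(set(cited_ids) - set(corpus_ids))
    stale = sorted(set(cited_ids) & set(stale_ids))
    return {
        "unknown": unknown,
        "stale": stale,
        # An answer with no citations is treated as unsupported in this sketch.
        "passed": bool(cited_ids) and not unknown and not stale,
    }
```

For example, `check_citations(["doc-1", "doc-9"], {"doc-1", "doc-2"}, {"doc-2"})` reports `doc-9` as unknown and does not pass.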
v0.2.0 DATA-01 decision-grade task family
Adds DATA-01 as the second decision-grade task family, using the standard scorer-contract model for exact metrics, schema validation, factual report checks, and deterministic chart-spec validation with synthetic public fixtures.
v0.1.2 Harden public reporting and leak gates
Adds lightweight redaction for public compare reports and strengthens public leak checks with tracked-file denylist scanning. Keeps raw local score records stable while reducing the risk of exposing scorer-only or private evaluation content through generated public reports.
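Denylist scanning of tracked files can be sketched as follows in Python. The denylist terms and the function names are hypothetical; the release notes do not list the project's actual denylist or the script that applies it.

```python
from pathlib import Path

# Hypothetical denylist terms, chosen only for illustration.
DENYLIST = ("scorer-only", "private-eval")


def scan_text(name, text, denylist=DENYLIST):
    """Return (name, line number, term) for every denylisted term found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()  # case-insensitive match
        for term in denylist:
            if term in lowered:
                hits.append((name, lineno, term))
    return hits


def scan_files(paths, denylist=DENYLIST):
    """Scan each file on disk and collect all denylist hits."""
    hits = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        hits.extend(scan_text(str(path), text, denylist))
    return hits
```

One way to supply `paths` is the output of `git ls-files`, so that only tracked files are checked; a non-empty result would then fail the leak gate.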
v0.1.1 Standard-layer boundary docs
Adds product-neutral Private Eval Layer guidance, scorer type contracts, visibility matrix, decision-grade graduation criteria, and redacted feedback rules. This prepares DATA-01 and future task families to use shared scoring contracts instead of one-off scorers.
v0.1.0 IF-01 decision-grade task family
Adds IF-01 as the first deterministic decision-grade task family for strict instruction following and artifact-contract compliance. Includes public synthetic cases, config-driven scoring, mutation support, tests, documentation, and if01-smoke.