Releases · heurema/agent-bench-lab

26 May 11:52

t3chn

v0.7.1

a483127

v0.7.1 Harden Local Agent Runner contracts Latest

Hardens the Local Agent Runner runtime contracts with golden run/trace/score fixtures, schema validation, task-packet visibility tests, and redacted/bounded command-output checks. No new task family, provider adapter, browser/MCP runner, private bundle runtime, or v0.8 roadmap decision is included.
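
As an illustration of what a redacted/bounded command-output check could look like, here is a minimal Python sketch. The secret pattern, the 200-character limit, and the function name `redact_and_bound` are assumptions made for this example; the release note does not state the runner's actual rules.

```python
import re

# Hypothetical pattern and limit; the real runner's rules are not given in the release note.
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token|password)=(\S+)")
MAX_OUTPUT_CHARS = 200
TRUNCATION_MARKER = "...[truncated]"


def redact_and_bound(output: str, limit: int = MAX_OUTPUT_CHARS) -> str:
    """Mask secret-looking values, then cap the result at `limit` characters.

    Assumes `limit` is at least as long as the truncation marker.
    """
    # Keep the key name so the log stays readable, but drop the value.
    redacted = SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", output)
    if len(redacted) > limit:
        redacted = redacted[: limit - len(TRUNCATION_MARKER)] + TRUNCATION_MARKER
    return redacted
```

Redacting before truncating matters in this sketch: cutting first could split a secret in half and leave a fragment the pattern no longer matches.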

25 May 14:35

t3chn

v0.7.0

b88bcfe

v0.7.0 Research Radar and Benchmark Intelligence

Adds a public-safe Research Radar layer for tracking AI-agent benchmark and evaluation work. Includes watchlists, source maps, query sets, daily brief and weekly synthesis templates, and research process documentation. No automation, scraping, private eval material, or new task family is added.

25 May 11:52

t3chn

v0.6.0

0224c23

v0.6.0 Benchmark lifecycle and hardening gates

Adds lifecycle statuses, decision-grade and verified criteria, mutation smoke gates, exploit gate declarations, suite strategy guidance, report schema v1 guidance, and lightweight validation scripts. No new task family is added.

25 May 07:55

t3chn

v0.5.0

351adfc

v0.5.0 API-01 decision-grade task family

Adds API-01 as the fifth decision-grade task family for local internal API/tool-registry workflows. API-01 uses synthetic fixtures, scorer-side API simulation, deterministic state-diff checks, call-order validation, forbidden endpoint detection, and a tools-local suite without live SaaS or MCP dependencies.
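
The checks named above (forbidden endpoint detection, call-order validation, state-diff checks) can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions: the function names, the string format of recorded calls, and the flat-dictionary state are hypothetical and are not taken from the API-01 scorer.

```python
from typing import Any


def find_forbidden_calls(calls: list[str], forbidden: set[str]) -> list[str]:
    """Return every recorded call that hits a forbidden endpoint, in order."""
    return [call for call in calls if call in forbidden]


def follows_required_order(calls: list[str], required: list[str]) -> bool:
    """True if `required` occurs within `calls` as an ordered subsequence."""
    remaining = iter(calls)
    # `step in remaining` consumes the iterator, so each match must come
    # after the previous one.
    return all(step in remaining for step in required)


def state_diff(before: dict[str, Any], after: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Map each changed key to its (before, after) pair.

    A missing key is reported as None, so this sketch does not distinguish
    "absent" from "present with value None".
    """
    keys = sorted(set(before) | set(after))
    return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}
```

Because all three functions are pure and operate on recorded data, a scorer built this way gives the same verdict on every run, which is one plausible reading of "deterministic" in the release note.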

25 May 07:15

t3chn

v0.4.0

b596cc7

v0.4.0 SUP-01 decision-grade task family

Adds SUP-01 as the fourth decision-grade task family for synthetic support inbox triage, policy-compliant draft replies, escalation decisions, and customer-style workflow evaluation. Introduces ops-local while keeping core focused.

25 May 06:34

t3chn

v0.3.0

3e8ec69

v0.3.0 DOC-01 decision-grade task family

Adds DOC-01 as the third decision-grade task family for fixed-corpus document QA, grounded answers, citation validation, unsupported-claim detection, and stale/distractor source checks using synthetic public fixtures and deterministic scoring.

25 May 05:48

t3chn

v0.2.0

31ff201

v0.2.0 DATA-01 decision-grade task family

Adds DATA-01 as the second decision-grade task family, using the standard scorer-contract model for exact metrics, schema validation, factual report checks, and deterministic chart-spec validation with synthetic public fixtures.

25 May 05:44

t3chn

v0.1.2

d733fbd

v0.1.2 Harden public reporting and leak gates

Adds lightweight redaction for public compare reports and strengthens public leak checks with tracked-file denylist scanning. Keeps raw local score records stable while reducing risk of exposing scorer-only/private evaluation content through generated public reports.
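
A tracked-file denylist scan of this kind might look roughly like the following Python sketch. The denylist patterns and function names are hypothetical placeholders; the project's actual patterns and tooling are not described in the release note. The only external command used is `git ls-files`.

```python
import fnmatch
import subprocess

# Hypothetical patterns; the project's real denylist is not given in the release note.
DENYLIST = ["private/*", "*.scorer.json", "*secret*"]


def tracked_files() -> list[str]:
    """List files tracked by git in the current repository."""
    result = subprocess.run(
        ["git", "ls-files", "-z"], capture_output=True, text=True, check=True
    )
    # -z separates paths with NUL bytes, which is safe for unusual file names.
    return [path for path in result.stdout.split("\0") if path]


def denylisted(paths: list[str], patterns: list[str] = DENYLIST) -> list[str]:
    """Return the sorted paths that match any denylist pattern."""
    return sorted(
        path
        for path in paths
        if any(fnmatch.fnmatchcase(path, pattern) for pattern in patterns)
    )
```

A leak gate could then fail the build whenever `denylisted(tracked_files())` is non-empty. Note that `fnmatch` lets `*` match across `/`, so `private/*` also covers nested paths.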

25 May 05:13

t3chn

v0.1.1

a913087

v0.1.1 Standard-layer boundary docs

Adds product-neutral Private Eval Layer guidance, scorer type contracts, visibility matrix, decision-grade graduation criteria, and redacted feedback rules. This prepares DATA-01 and future task families to use shared scoring contracts instead of one-off scorers.

24 May 15:48

t3chn

v0.1.0

15e6c3c

v0.1.0 IF-01 decision-grade task family

Adds IF-01 as the first deterministic decision-grade task family for strict instruction following and artifact-contract compliance. Includes public synthetic cases, config-driven scoring, mutation support, tests, documentation, and if01-smoke.
