feat(runner): add local command runner MVP by t3chn · Pull Request #23 · heurema/agent-bench-lab

t3chn · 2026-05-25T16:58:12Z

Summary

Closes #22 (Add Local Agent Runner MVP).
Adds agent-bench run as a product-neutral command adapter for external local agent setups.
Creates agent-visible task_packet/, artifacts/, run.json, trace.jsonl, and score.json per run.
Keeps scorer-only files such as check_config.json, answer keys, hidden labels, canaries, private scorer configs, and expected values out of the task packet.
Adds a mock smoke agent, make run-smoke, CI wiring, runner tests, and local runner docs.
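The visibility boundary described above could be sketched roughly as follows. This is a minimal illustration, not the code in `src/agent_bench_lab/runner.py`: the helper names `is_scorer_only` and `build_task_packet`, and the name hints used to spot private-looking files, are assumptions made for the example.

```python
import shutil
from pathlib import Path

# Hypothetical deny-list: the PR names check_config.json explicitly; the hint
# substrings below are illustrative guesses at "private-looking" file names.
SCORER_ONLY_NAMES = {"check_config.json"}
SCORER_ONLY_HINTS = ("answer_key", "hidden", "canary", "private", "expected")


def is_scorer_only(name: str) -> bool:
    """Return True if a file or directory name looks scorer-only."""
    lowered = name.lower()
    return lowered in SCORER_ONLY_NAMES or any(h in lowered for h in SCORER_ONLY_HINTS)


def build_task_packet(fixture_dir: Path, run_dir: Path) -> list[str]:
    """Copy agent-visible fixture files into run_dir/task_packet.

    Returns the relative paths that were copied. Scorer-only files stay in the
    original fixture, which the scorer reads directly.
    """
    packet = run_dir / "task_packet"
    packet.mkdir(parents=True, exist_ok=True)
    (run_dir / "artifacts").mkdir(exist_ok=True)

    copied = []
    for src in sorted(fixture_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(fixture_dir)
        # Skip the file if any component of its path looks scorer-only.
        if any(is_scorer_only(part) for part in rel.parts):
            continue
        dest = packet / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        copied.append(rel.as_posix())
    return copied
```

An allow-list would be a stricter alternative to the deny-list shown here; the PR description does not say which approach the runner takes.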

Non-goals

No new benchmark task family.
No provider-specific adapter.
No OpenAI/Anthropic integration.
No browser or MCP runner.
No private bundle runtime.
No scheduled evals or repo-level automation.

Test plan

make validate
make test
make smoke
make compare-smoke
make if01-smoke
make data01-smoke
make doc01-smoke
make sup01-smoke
make api01-smoke
make lifecycle-check
make mutation-smoke
make hardening-check
make run-smoke
make leak-check
python3 -m ruff check .
git diff --check
tracked-file audit for private/generated/sensitive paths

Review focus

src/agent_bench_lab/runner.py: task packet visibility boundary and command execution flow.
tests/test_runner.py: timeout, missing artifacts, redaction, CLI, and scorer-only exclusion coverage.
docs/21-local-agent-runner.md: ensure this is framed as a command adapter, not a provider/runtime layer.
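For reviewers, the command execution flow and the timeout and missing-artifact cases could look something like the sketch below. It is an assumption-laden illustration: the function name `run_agent_command`, the status strings, and the trace event names are hypothetical and may not match the schema in this PR.

```python
import json
import subprocess
import time
from pathlib import Path


def run_agent_command(cmd: list[str], run_dir: Path, timeout_s: float) -> dict:
    """Run a user-supplied local command inside run_dir and record the outcome.

    Status values ("completed", "failed", "timeout", "missing_artifacts") and
    event names are illustrative, not the project's actual schema.
    """
    run_dir.mkdir(parents=True, exist_ok=True)
    artifacts = run_dir / "artifacts"
    artifacts.mkdir(exist_ok=True)
    trace_path = run_dir / "trace.jsonl"

    def log(event: str, **fields) -> None:
        # Append one JSON object per line to trace.jsonl.
        with trace_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps({"event": event, "ts": time.time(), **fields}) + "\n")

    log("command_started", argv=cmd)
    try:
        proc = subprocess.run(
            cmd, cwd=run_dir, capture_output=True, text=True, timeout=timeout_s
        )
        exit_code = proc.returncode
        status = "completed" if exit_code == 0 else "failed"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        status, exit_code = "timeout", None

    # A clean exit that produced nothing is still not a usable run.
    if status == "completed" and not any(artifacts.iterdir()):
        status = "missing_artifacts"

    log("command_finished", status=status, exit_code=exit_code)
    result = {"status": status, "exit_code": exit_code}
    (run_dir / "run.json").write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result
```

Passing the command as an argument list rather than a shell string avoids shell interpolation, which seems a reasonable default for a harness that executes user-supplied commands.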

Risks

Moderate: this intentionally executes a local command supplied by the user. It should remain a local harness primitive, not a provider/browser/MCP adapter.
Future adapters should preserve the task_packet != scorer fixture boundary.
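One way a future adapter could assert that boundary before handing a packet to an agent is a small leak scan like the one below. This is a sketch under assumptions: `find_boundary_leaks` is a hypothetical helper, and it is not a description of what `make leak-check` actually does.

```python
from pathlib import Path


def find_boundary_leaks(
    task_packet: Path, scorer_only_names: set[str], canaries: list[str]
) -> list[str]:
    """Return human-readable findings for scorer material found in a task packet.

    Flags files whose name is on the scorer-only list and files whose text
    contains any canary string. An empty list means no leak was detected.
    """
    leaks = []
    for path in sorted(task_packet.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(task_packet).as_posix()
        if path.name in scorer_only_names:
            leaks.append(f"{rel}: scorer-only file name")
            continue
        # Decode leniently so binary files do not abort the scan.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for canary in canaries:
            if canary in text:
                leaks.append(f"{rel}: contains canary")
                break
    return leaks
```

An adapter could refuse to start the agent command whenever this returns a non-empty list.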

Breaking changes

None.

Follow-ups

Decide after the Research Radar weekly synthesis whether this lands as v0.7.1 or becomes part of a later milestone.

Merge strategy recommendation

Recommended: squash.
Reason: this is one logical infrastructure slice with one commit.

Why:
- Agent Bench Lab needs a product-neutral way to run real external agent setups against existing task families, not only score prebuilt sample artifacts.
- The runner must preserve the benchmark visibility boundary by separating agent-visible task packets from scorer-visible fixtures.

What changed:
- Add `agent-bench run` and a local command runner that creates `task_packet/`, `artifacts/`, `run.json`, `trace.jsonl`, and `score.json`.
- Exclude scorer-only/private-looking files from task packets, including `check_config.json`, while scoring against the original fixture and produced artifacts.
- Add a mock smoke agent, runner tests, `make run-smoke`, CI wiring, schema event/status updates, and docs.

Testing:
- `make validate`
- `make test`
- `make smoke`
- `make compare-smoke`
- `make if01-smoke`
- `make data01-smoke`
- `make doc01-smoke`
- `make sup01-smoke`
- `make api01-smoke`
- `make lifecycle-check`
- `make mutation-smoke`
- `make hardening-check`
- `make run-smoke`
- `make leak-check`
- `python3 -m ruff check .`
- `git diff --check`
- tracked-file audit for private/generated/sensitive paths

Risk:
- moderate: command execution is intentionally local and generic; future provider/browser/MCP adapters should not bypass the task-packet visibility boundary.

Related: #22

t3chn merged commit a868016 into main May 25, 2026
1 check passed

t3chn deleted the feat/local-agent-runner-mvp branch May 25, 2026 17:09

t3chn merged 1 commit into main from feat/local-agent-runner-mvp
