feat(runner): add local command runner MVP #23

Merged
t3chn merged 1 commit into main from feat/local-agent-runner-mvp on May 25, 2026

@t3chn
Contributor

@t3chn t3chn commented May 25, 2026

Summary

  • Closes #22 (Add Local Agent Runner MVP).
  • Adds agent-bench run as a product-neutral command adapter for external local agent setups.
  • Creates agent-visible task_packet/, artifacts/, run.json, trace.jsonl, and score.json per run.
  • Keeps scorer-only files such as check_config.json, answer keys, hidden labels, canaries, private scorer configs, and expected values out of the task packet.
  • Adds a mock smoke agent, make run-smoke, CI wiring, runner tests, and local runner docs.
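
The task-packet boundary described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/agent_bench_lab/runner.py: the names `build_task_packet`, `is_scorer_only`, `SCORER_ONLY_NAMES`, and `PRIVATE_HINTS` are hypothetical, and the substring matching is only one possible way to detect "private-looking" files. Only `check_config.json` and the `task_packet/` directory name come from this PR.

```python
import shutil
from pathlib import Path

# Hypothetical deny-list; only check_config.json is named in this PR.
SCORER_ONLY_NAMES = {"check_config.json"}
# Hypothetical substrings that mark a path component as private-looking.
PRIVATE_HINTS = ("answer_key", "hidden", "canary", "expected", "private")


def is_scorer_only(rel: Path) -> bool:
    """Return True if a fixture-relative path must stay out of the packet."""
    parts = [p.lower() for p in rel.parts]
    if parts[-1] in SCORER_ONLY_NAMES:
        return True
    return any(hint in part for part in parts for hint in PRIVATE_HINTS)


def build_task_packet(fixture_dir: Path, run_dir: Path) -> list[str]:
    """Copy agent-visible fixture files into run_dir/task_packet.

    Returns the copied paths relative to the packet root.
    """
    packet = run_dir / "task_packet"
    packet.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(fixture_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(fixture_dir)
        if is_scorer_only(rel):
            continue  # scorer-only files never reach the agent
        dest = packet / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        copied.append(rel.as_posix())
    return copied
```

In this sketch the scorer would still read the original fixture directory, so excluded files remain available for scoring. A deny-list is used here for brevity; an explicit allow-list per task family may be a stricter choice.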

Non-goals

  • No new benchmark task family.
  • No provider-specific adapter.
  • No OpenAI/Anthropic integration.
  • No browser or MCP runner.
  • No private bundle runtime.
  • No scheduled evals or repo-level automation.

Test plan

  • make validate
  • make test
  • make smoke
  • make compare-smoke
  • make if01-smoke
  • make data01-smoke
  • make doc01-smoke
  • make sup01-smoke
  • make api01-smoke
  • make lifecycle-check
  • make mutation-smoke
  • make hardening-check
  • make run-smoke
  • make leak-check
  • python3 -m ruff check .
  • git diff --check
  • tracked-file audit for private/generated/sensitive paths

Review focus

  • src/agent_bench_lab/runner.py: task packet visibility boundary and command execution flow.
  • tests/test_runner.py: timeout, missing artifacts, redaction, CLI, and scorer-only exclusion coverage.
  • docs/21-local-agent-runner.md: ensure this is framed as a command adapter, not a provider/runtime layer.
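
For reviewers, the command execution flow with timeout and missing-artifact handling might look something like the sketch below. It is an assumption-laden illustration, not the merged implementation: `run_agent_command`, `missing_artifacts`, and the status strings `completed`, `failed`, and `timeout` are hypothetical names chosen for this example.

```python
import subprocess
import time
from pathlib import Path


def run_agent_command(command: list[str], cwd: Path, timeout_s: float) -> dict:
    """Run a user-supplied local command and summarize the outcome.

    Status values are hypothetical, not the project's schema.
    """
    started = time.monotonic()
    try:
        proc = subprocess.run(
            command,
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # child is killed if it exceeds this
        )
        status = "completed" if proc.returncode == 0 else "failed"
        exit_code = proc.returncode
    except subprocess.TimeoutExpired:
        status, exit_code = "timeout", None
    return {
        "status": status,
        "exit_code": exit_code,
        "duration_s": round(time.monotonic() - started, 3),
    }


def missing_artifacts(artifacts_dir: Path, required: list[str]) -> list[str]:
    """List required artifact names the agent did not produce."""
    return [name for name in required if not (artifacts_dir / name).is_file()]
```

Passing the command as a list avoids shell interpolation, which seems a reasonable default for a harness that runs arbitrary local commands.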

Risks

  • Moderate: this intentionally executes a local command supplied by the user. It should remain a local harness primitive, not a provider/browser/MCP adapter.
  • Future adapters should preserve the task_packet != scorer fixture boundary.

Breaking changes

  • None.

Follow-ups

  • Decide after the Research Radar weekly synthesis whether this lands as v0.7.1 or becomes part of a later milestone.

Merge strategy recommendation

  • Recommended: squash.
  • Reason: this is one logical infrastructure slice with one commit.

Why:
- Agent Bench Lab needs a product-neutral way to run real external agent setups against existing task families, not only score prebuilt sample artifacts.
- The runner must preserve the benchmark visibility boundary by separating agent-visible task packets from scorer-visible fixtures.

What changed:
- Add `agent-bench run` and a local command runner that creates `task_packet/`, `artifacts/`, `run.json`, `trace.jsonl`, and `score.json`.
- Exclude scorer-only/private-looking files from task packets, including `check_config.json`, while scoring against the original fixture and produced artifacts.
- Add a mock smoke agent, runner tests, `make run-smoke`, CI wiring, schema event/status updates, and docs.
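
As a rough illustration of the per-run `trace.jsonl` file, the sketch below appends one JSON object per line and reads them back. The helper names (`append_event`, `read_events`), the field names `ts` and `event`, and the example event names are assumptions made for this example; the actual event and status vocabulary is defined by the schema updates in this commit.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(trace_path: Path, event: str, **fields) -> None:
    """Append one JSON record to the trace file (JSON Lines format)."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,  # hypothetical event name
        **fields,
    }
    with trace_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def read_events(trace_path: Path) -> list[dict]:
    """Load all records from a JSON Lines trace, skipping blank lines."""
    with trace_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

A hypothetical usage would be `append_event(run_dir / "trace.jsonl", "command_started", command=["python3", "agent.py"])` before the agent runs, followed by a matching finish event afterwards.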

Testing:
- `make validate`
- `make test`
- `make smoke`
- `make compare-smoke`
- `make if01-smoke`
- `make data01-smoke`
- `make doc01-smoke`
- `make sup01-smoke`
- `make api01-smoke`
- `make lifecycle-check`
- `make mutation-smoke`
- `make hardening-check`
- `make run-smoke`
- `make leak-check`
- `python3 -m ruff check .`
- `git diff --check`
- tracked-file audit for private/generated/sensitive paths

Risk:
- moderate - command execution is intentionally local and generic; future provider/browser/MCP adapters should not bypass the task-packet visibility boundary.

Related: #22
t3chn merged commit a868016 into main on May 25, 2026
1 check passed
t3chn deleted the feat/local-agent-runner-mvp branch on May 25, 2026 17:09
Development

Successfully merging this pull request may close these issues.

Add Local Agent Runner MVP

1 participant