feat(agentic): agentic capability suite (Slice 3): Foreman runner + records + report #2

Open
Defilan wants to merge 10 commits into main from feat/agentic-bench-runner

@Defilan (Member) commented Jun 26, 2026

What

Adds an agentic/ Python package to llmkube-bench: a thin end-to-end runner that takes a fixed set of (model x task) pairs, runs each task through Foreman on one hardware tier, emits a structured JSON record per run, runs a client-side perf load sweep per model, and renders a markdown comparison report.
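As a rough illustration of the "one structured JSON record per run" idea, here is a minimal sketch. The `RunRecord` fields, `run_pairs`, `to_json`, and the `run_task` callback are hypothetical names chosen for this example; the actual schema is whatever record.py defines.

```python
import json
from dataclasses import asdict, dataclass
from itertools import product


@dataclass
class RunRecord:
    # Hypothetical fields for illustration; the real schema lives in record.py.
    model: str
    task: str
    hardware: str
    outcome: str


def run_pairs(models, tasks, hardware, run_task):
    """Run every (model, task) pair and return one record per run."""
    records = []
    for model, task in product(models, tasks):
        outcome = run_task(model, task)  # stand-in for a real Foreman run
        records.append(RunRecord(model, task, hardware, outcome))
    return records


def to_json(record):
    """Serialize one record as a stable, diff-friendly JSON document."""
    return json.dumps(asdict(record), sort_keys=True)
```

For example, `run_pairs(["model-a", "model-b"], ["issue-1"], "tier-x", lambda m, t: "GO")` yields two records, each of which `to_json` turns into a single JSON object with sorted keys.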

This is the agentic capability companion to the existing serving bake-off at the repo root (llama.cpp vs vLLM throughput). That suite answers "how fast does the stack go?"; this answers "is the model actually good at my work, as deployed in a real harness?"

Modules (all under agentic/runner/):

  • config.py: typed YAML config (Hardware/Model/Task/Perf) with a duplicate-model guard
  • record.py: the canonical RunRecord schema + contamination_free flag (task filed after model cutoff)
  • behavior.py: Foreman transcript + AgenticTask status -> turns, tool-call mix, files edited, gate-fix attempts, failure-mode classification
  • perf.py: streaming concurrency sweep (TTFT, per-request tok/s, p50/p95 latency); HTTP call behind an injectable seam
  • cluster.py: a Cluster protocol with a KubectlCluster (scale/serve, repoint the ModelRouter coder backend, dispatch AgenticTask, collect transcript) and a FakeCluster for tests
  • orchestrate.py: run_matrix, which serves each model alone (GPU serialization), runs the perf sweep, dispatches each corpus task on a distinct branch, and assembles records
  • report.py: perf table + capability matrix + contamination-free count
  • cli.py: run and report subcommands
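Two of the checks named in the list above are small enough to sketch: the contamination_free flag and the duplicate-model guard. The function names and signatures below are assumptions for illustration, not the code in record.py or config.py, and the strict "filed after cutoff" comparison is one reading of the description.

```python
from collections import Counter
from datetime import date


def contamination_free(task_filed: date, model_cutoff: date) -> bool:
    """True when the task was filed strictly after the model's training cutoff."""
    return task_filed > model_cutoff


def check_unique_models(names: list[str]) -> None:
    """Raise ValueError if any model name appears more than once in the config."""
    dupes = sorted(name for name, count in Counter(names).items() if count > 1)
    if dupes:
        raise ValueError(f"duplicate model entries: {', '.join(dupes)}")
```

A task filed on the cutoff date itself counts as potentially contaminated under this reading; a looser policy would use `>=` instead.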

Why

The goal is a benchmark that is contamination-free (own/post-cutoff issues), harness-in-the-loop (Foreman is a real versioned harness, not a bare model), agentic on a real maintained codebase with a real gate, and reproducible. The serving suite already covers throughput; capability was the missing axis. Design and reference data live in the private internal specs/research.

How

TDD, one module per commit. Pure logic (config, record, behavior, perf stats, report) is unit-tested with fixtures and a FakeCluster; no cluster or credentials are needed to run the test suite. The cluster-driving paths shell out to kubectl/gh and are validated by hand against a live Strix (single-model perf smoke + a full matrix run), not in CI. The runner reproduces the qualitative story observed by hand: the clean issue lands a gate-verified GO, and the gotcha issue is contained as INCOMPLETE.
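The protocol-plus-fake pattern that lets the orchestration logic be tested without a cluster might look roughly like this. The method names (`serve`, `dispatch`), the branch naming scheme, and the result dict shape are assumptions for this sketch; the real Cluster protocol in cluster.py has more operations.

```python
from typing import Protocol


class Cluster(Protocol):
    # Hypothetical two-method surface; the real protocol is larger.
    def serve(self, model: str) -> None: ...
    def dispatch(self, task: str, branch: str) -> str: ...


class FakeCluster:
    """In-memory stand-in that records calls instead of shelling out."""

    def __init__(self, outcome: str = "GO") -> None:
        self.calls: list[tuple[str, ...]] = []
        self.outcome = outcome

    def serve(self, model: str) -> None:
        self.calls.append(("serve", model))

    def dispatch(self, task: str, branch: str) -> str:
        self.calls.append(("dispatch", task, branch))
        return self.outcome


def run_matrix(cluster: Cluster, models: list[str], tasks: list[str]) -> list[dict]:
    """Serve one model at a time, then dispatch each task on its own branch."""
    results = []
    for model in models:
        cluster.serve(model)  # only one model is served at any moment
        for task in tasks:
            branch = f"bench/{model}/{task}"  # distinct branch per run
            outcome = cluster.dispatch(task, branch)
            results.append({"model": model, "task": task,
                            "branch": branch, "outcome": outcome})
    return results
```

Because `run_matrix` depends only on the protocol, a test can pass a `FakeCluster` and assert on `fake.calls` to check ordering, while a kubectl-backed implementation can be swapped in for live runs.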

cd agentic
python3 -m pip install -r requirements.txt
python3 -m pytest -q          # 14 passed, no cluster needed
python3 -m runner.cli report --records records --out reports/latest.md
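To give a feel for what the report step produces, here is a minimal sketch of rendering a capability matrix from records as a markdown table. `render_matrix` and the record keys (`model`, `task`, `outcome`) are hypothetical; the real report.py also emits a perf table and a contamination-free count.

```python
def render_matrix(records):
    """Render a markdown table: one row per model, one column per task."""
    models = sorted({r["model"] for r in records})
    tasks = sorted({r["task"] for r in records})
    outcome = {(r["model"], r["task"]): r["outcome"] for r in records}
    lines = [
        "| model | " + " | ".join(tasks) + " |",
        "|---" * (len(tasks) + 1) + "|",
    ]
    for model in models:
        # Missing (model, task) pairs are shown as "-" rather than dropped.
        cells = [outcome.get((model, task), "-") for task in tasks]
        lines.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```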

Checklist

  • Tests pass locally (python3 -m pytest -q -> 14 passed)
  • No cluster/credentials required for the test suite
  • Existing serving bake-off untouched (additive agentic/ subtree)
  • Live Strix smoke (gateway JWT + GPU): manual, run before relying on the live paths

@Defilan changed the title from "feat(agentic): agentic capability suite (Slice 3) — Foreman runner + records + report" to "feat(agentic): agentic capability suite (Slice 3): Foreman runner + records + report" on Jun 26, 2026