feat(agentic): agentic capability suite (Slice 3): Foreman runner + records + report by Defilan · Pull Request #2 · defilantech/llmkube-bench

Defilan · 2026-06-26T16:49:19Z

What

Adds an agentic/ Python package to llmkube-bench: a thin end-to-end runner that takes a fixed set of (model x task) pairs, runs each task through Foreman on one hardware tier, emits a structured JSON record per run, runs a client-side perf load sweep per model, and renders a markdown comparison report.

This is the agentic capability companion to the existing serving bake-off at the repo root (llama.cpp vs vLLM throughput). That suite answers "how fast does the stack go?"; this answers "is the model actually good at my work, as deployed in a real harness?"

Modules (all under agentic/runner/):

config.py: typed YAML config (Hardware/Model/Task/Perf) with a duplicate-model guard
record.py: the canonical RunRecord schema + contamination_free flag (task filed after model cutoff)
behavior.py: Foreman transcript + AgenticTask status -> turns, tool-call mix, files edited, gate-fix attempts, failure-mode classification
perf.py: streaming concurrency sweep (TTFT, per-request tok/s, p50/p95 latency); HTTP call behind an injectable seam
cluster.py: a Cluster protocol with a KubectlCluster (scale/serve, repoint the ModelRouter coder backend, dispatch AgenticTask, collect transcript) and a FakeCluster for tests
orchestrate.py: run_matrix: serve each model alone (GPU serialization), perf sweep, dispatch each corpus task on a distinct branch, assemble records
report.py: perf table + capability matrix + contamination-free count
cli.py: run and report subcommands
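To make the perf.py entry concrete, here is a minimal sketch of a concurrency sweep whose request function is passed in as a parameter (the "injectable seam"), so the statistics can be tested without any HTTP call. The names RequestSample, percentile, summarize and sweep are hypothetical and not taken from the PR. The nearest-rank percentile and the definition of tok/s as tokens divided by total latency are assumptions made for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from statistics import fmean
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class RequestSample:
    """One streamed request (hypothetical shape, not the PR's schema)."""
    ttft_s: float     # time to first token, seconds
    latency_s: float  # total wall time for the request, seconds
    tokens: int       # completion tokens received


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (assumed method; the PR may use another)."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: List[RequestSample]) -> Dict[str, float]:
    """Reduce the samples of one concurrency level to the reported stats."""
    latencies = [s.latency_s for s in samples]
    return {
        "ttft_mean_s": fmean(s.ttft_s for s in samples),
        # Assumption: tok/s is tokens over total request latency.
        "tok_per_s_mean": fmean(s.tokens / s.latency_s for s in samples),
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
    }


def sweep(
    send: Callable[[], RequestSample],
    levels: Iterable[int],
    requests_per_level: int = 8,
) -> Dict[int, Dict[str, float]]:
    """Run `send` at each concurrency level; `send` is the injectable seam."""
    results = {}
    for concurrency in levels:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            futures = [pool.submit(send) for _ in range(requests_per_level)]
            samples = [f.result() for f in futures]
        results[concurrency] = summarize(samples)
    return results
```

In a test, `send` can be a lambda returning a fixed RequestSample; in a live run it would wrap the real streaming HTTP request.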

Why

The goal is a benchmark that is contamination-free (own/post-cutoff issues), harness-in-the-loop (Foreman is a real versioned harness, not a bare model), agentic on a real maintained codebase with a real gate, and reproducible. The serving suite already covers throughput; capability was the missing axis. Design and reference data live in the private internal specs/research.
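The contamination-free criterion (the task was filed after the model's cutoff) could be expressed as a derived flag on the run record. The sketch below is an illustration only: the field names, the date types, and the to_json helper are assumptions, not the actual RunRecord schema from record.py.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical subset of a per-run record; field names are assumed."""
    model: str
    task_id: str
    model_cutoff: date  # training-data cutoff of the model
    task_filed: date    # date the issue behind the task was filed
    verdict: str        # e.g. "GO" or "INCOMPLETE"

    @property
    def contamination_free(self) -> bool:
        # The task cannot be in the training data if it was filed
        # strictly after the model's cutoff.
        return self.task_filed > self.model_cutoff

    def to_json(self) -> str:
        payload = asdict(self)
        # Dates are not JSON-serializable, so emit ISO 8601 strings.
        payload["model_cutoff"] = self.model_cutoff.isoformat()
        payload["task_filed"] = self.task_filed.isoformat()
        payload["contamination_free"] = self.contamination_free
        return json.dumps(payload, sort_keys=True)
```

Deriving the flag from the two dates, rather than storing it by hand, keeps a record from claiming to be contamination-free when its dates say otherwise.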

How

TDD, one module per commit. Pure logic (config, record, behavior, perf stats, report) is unit-tested with fixtures and a FakeCluster; no cluster or credentials needed to run the suite. The cluster-driving paths shell out to kubectl/gh and are validated against a live Strix by hand (single-model perf smoke + a full matrix run), not in CI. The runner reproduces the qualitative story observed by hand: the clean issue lands a gate-verified GO, the gotcha issue is contained as INCOMPLETE.
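As an illustration of how the orchestration can be tested without a cluster, the sketch below pairs a structural Cluster protocol with an in-memory fake and a simplified run_matrix. All method names, the branch naming scheme, and the record shape are hypothetical; the real KubectlCluster and run_matrix do more (for example the perf sweep, which is omitted here).

```python
from typing import Dict, List, Protocol, Set, Tuple


class Cluster(Protocol):
    """Assumed minimal surface; the PR's protocol may differ."""
    def serve(self, model: str) -> None: ...
    def stop(self, model: str) -> None: ...
    def dispatch(self, task_id: str, branch: str) -> str: ...


class FakeCluster:
    """In-memory stand-in that records calls and returns canned verdicts."""

    def __init__(self, verdicts: Dict[str, str]) -> None:
        self.verdicts = verdicts
        self.serving: Set[str] = set()
        self.calls: List[Tuple[str, ...]] = []

    def serve(self, model: str) -> None:
        # Enforce GPU serialization: only one model may be up at a time.
        if self.serving:
            raise RuntimeError("another model is already being served")
        self.serving.add(model)
        self.calls.append(("serve", model))

    def stop(self, model: str) -> None:
        self.serving.discard(model)
        self.calls.append(("stop", model))

    def dispatch(self, task_id: str, branch: str) -> str:
        self.calls.append(("dispatch", task_id, branch))
        return self.verdicts.get(task_id, "INCOMPLETE")


def run_matrix(cluster: Cluster, models: List[str], tasks: List[str]) -> List[dict]:
    """Serve each model alone, dispatch every task on its own branch."""
    records = []
    for model in models:
        cluster.serve(model)
        try:
            for task in tasks:
                branch = f"bench/{model}/{task}"  # hypothetical naming
                verdict = cluster.dispatch(task, branch)
                records.append(
                    {"model": model, "task": task, "branch": branch, "verdict": verdict}
                )
        finally:
            # Always free the GPU, even if a dispatch raises.
            cluster.stop(model)
    return records
```

Because run_matrix depends only on the protocol, a test can pass a FakeCluster and assert on the recorded calls, while a live run passes the kubectl-backed implementation.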

cd agentic
python3 -m pip install -r requirements.txt
python3 -m pytest -q          # 14 passed, no cluster needed
python3 -m runner.cli report --records records --out reports/latest.md

Checklist

Tests pass locally (python3 -m pytest -q -> 14 passed)
No cluster/credentials required for the test suite
Existing serving bake-off untouched (additive agentic/ subtree)
Live Strix smoke (gateway JWT + GPU): manual, run before relying on the live paths

Defilan added 10 commits June 26, 2026 09:24

feat(agentic): scaffold the agentic bench runner package

41c73b3

feat(agentic): config model + example

92ddb57

feat(agentic): run-record schema with contamination flag

0df45ba

feat(agentic): transcript+status -> behavior extraction

70e0c79

feat(agentic): client-side perf load sweep + stats

026796d

feat(agentic): cluster seam (kubectl/gh) + fake

5870271

feat(agentic): per-(model,task) orchestration with GPU serialization + distinct branches

f83d52a

feat(agentic): markdown report (perf + capability matrix + contamination)

b0c8785

feat(agentic): CLI run + report

a908522

docs(agentic): usage, config walkthrough, JWT + GPU-serialization caveats

0a4f5b6

Defilan changed the title ~~feat(agentic): agentic capability suite (Slice 3) — Foreman runner + records + report~~ feat(agentic): agentic capability suite (Slice 3): Foreman runner + records + report Jun 26, 2026

Defilan wants to merge 10 commits into main from feat/agentic-bench-runner
