feat(agentic): agentic capability suite (Slice 3): Foreman runner + records + report #2
Open
Defilan wants to merge 10 commits into
What
Adds an `agentic/` Python package to llmkube-bench: a thin end-to-end runner that takes a fixed set of (model x task) pairs, runs each task through Foreman on one hardware tier, emits a structured JSON record per run, runs a client-side perf load sweep per model, and renders a markdown comparison report.

This is the agentic capability companion to the existing serving bake-off at the repo root (llama.cpp vs vLLM throughput). That suite answers "how fast does the stack go?"; this one answers "is the model actually good at my work, as deployed in a real harness?"
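As a rough illustration of the "one structured JSON record per run" idea, here is a minimal sketch. The `SketchRecord` fields, the `run_pairs` helper, the model and task names, and the use of a full cross product are assumptions for illustration only; the real `RunRecord` schema in `record.py` is not reproduced here.

```python
import json
from dataclasses import asdict, dataclass
from itertools import product


@dataclass
class SketchRecord:
    # Hypothetical fields; the real RunRecord schema is richer.
    model: str
    task: str
    hardware: str
    verdict: str


def run_pairs(models, tasks, hardware, run_task):
    """Run every (model, task) pair and return one record per run."""
    records = []
    for model, task in product(models, tasks):
        verdict = run_task(model, task)  # injected, so no cluster is needed here
        records.append(SketchRecord(model, task, hardware, verdict))
    return records


def fake_run(model, task):
    # Stand-in for a real Foreman run: the clean issue passes, anything else does not.
    return "GO" if task == "clean-issue" else "INCOMPLETE"


records = run_pairs(
    ["model-a", "model-b"], ["clean-issue", "gotcha-issue"], "strix", fake_run
)
# One JSON document per run, ready to be written to disk or fed to a report step.
json_lines = [json.dumps(asdict(r), sort_keys=True) for r in records]
```

Passing the run function in as an argument mirrors the injectable-seam approach the description mentions for the perf HTTP call.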
Modules (all under `agentic/runner/`):

- `config.py`: typed YAML config (Hardware/Model/Task/Perf) with a duplicate-model guard
- `record.py`: the canonical `RunRecord` schema + `contamination_free` flag (task filed after model cutoff)
- `behavior.py`: Foreman transcript + AgenticTask status -> turns, tool-call mix, files edited, gate-fix attempts, failure-mode classification
- `perf.py`: streaming concurrency sweep (TTFT, per-request tok/s, p50/p95 latency); HTTP call behind an injectable seam
- `cluster.py`: a `Cluster` protocol with a `KubectlCluster` (scale/serve, repoint the ModelRouter coder backend, dispatch AgenticTask, collect transcript) and a `FakeCluster` for tests
- `orchestrate.py`: `run_matrix`: serve each model alone (GPU serialization), perf sweep, dispatch each corpus task on a distinct branch, assemble records
- `report.py`: perf table + capability matrix + contamination-free count
- `cli.py`: `run` and `report` subcommands

Why
The goal is a benchmark that is contamination-free (own/post-cutoff issues), harness-in-the-loop (Foreman is a real versioned harness, not a bare model), agentic on a real maintained codebase with a real gate, and reproducible. The serving suite already covers throughput; capability was the missing axis. Design and reference data live in the private internal specs/research.
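The contamination-free criterion (a task filed after the model's cutoff) reduces to a date comparison. The sketch below assumes both values are available as `datetime.date` objects; the function signature and the example dates are hypothetical and not taken from `record.py`.

```python
from datetime import date


def contamination_free(task_filed: date, model_cutoff: date) -> bool:
    """True when the task was filed strictly after the model's training cutoff."""
    return task_filed > model_cutoff


# Hypothetical dates, for illustration only.
after_cutoff = contamination_free(date(2025, 6, 1), date(2025, 1, 31))
before_cutoff = contamination_free(date(2024, 12, 1), date(2025, 1, 31))
same_day = contamination_free(date(2025, 1, 31), date(2025, 1, 31))
```

A strict comparison is used so that a task filed on the cutoff date itself is not counted as contamination-free; whether the actual flag treats that edge case the same way is an assumption.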
How
TDD, one module per commit. Pure logic (config, record, behavior, perf stats, report) is unit-tested with fixtures and a `FakeCluster`; no cluster or credentials are needed to run the suite. The cluster-driving paths shell out to `kubectl`/`gh` and are validated against a live Strix by hand (single-model perf smoke + a full matrix run), not in CI. The runner reproduces the qualitative story observed by hand: the clean issue lands a gate-verified GO, the gotcha issue is contained as INCOMPLETE.

Checklist
- `python3 -m pytest -q` -> 14 passed
- `agentic/` subtree
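The testing approach described under How (orchestration logic exercised against a `FakeCluster` rather than a live cluster) might look roughly like the sketch below. The method names `serve` and `dispatch`, the branch naming scheme, and the `run_matrix_sketch` signature are all assumptions; the real `Cluster` protocol and `run_matrix` are not shown in this description.

```python
from typing import Protocol


class Cluster(Protocol):
    # Hypothetical minimal surface; the real protocol has more operations.
    def serve(self, model: str) -> None: ...
    def dispatch(self, task: str, branch: str) -> str: ...


class FakeCluster:
    """In-memory stand-in that records calls instead of shelling out."""

    def __init__(self, verdicts):
        self.verdicts = verdicts
        self.calls = []

    def serve(self, model):
        self.calls.append(("serve", model))

    def dispatch(self, task, branch):
        self.calls.append(("dispatch", task, branch))
        return self.verdicts.get(task, "INCOMPLETE")


def run_matrix_sketch(cluster: Cluster, models, tasks):
    """Serve one model at a time, then dispatch each task on its own branch."""
    results = {}
    for model in models:
        cluster.serve(model)  # models are serialized, never served together
        for task in tasks:
            branch = f"bench/{model}/{task}"  # assumed naming; distinct per run
            results[(model, task)] = cluster.dispatch(task, branch)
    return results


fake = FakeCluster({"clean": "GO"})
results = run_matrix_sketch(fake, ["m1", "m2"], ["clean", "gotcha"])
branches = [call[2] for call in fake.calls if call[0] == "dispatch"]
```

Because `FakeCluster` satisfies the protocol structurally, the same orchestration function could accept a real `kubectl`-backed implementation without changes.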