feat: runnable three-arm benchmark harness by forkadarshp · Pull Request #5 · forkadarshp/MPort

forkadarshp · 2026-05-29T00:30:24Z

What

Adds a runnable, closed-loop benchmark harness under harness/ that
operationalizes the methodology in references/benchmarking.md (previously docs
only). Run a migration scenario, read the leaderboard + failing cases, tighten
the enhanced system, re-run.

Three arms, same eval set + graders

A baseline — old model + baseline prompt
B naive swap — new model + baseline prompt (raw model delta)
C ModelPort — new model + enhanced prompt (skill's added value)

Components

run.py — orchestrates the arms, prints the leaderboard + attribution (model delta / skill delta / net)
graders.py — provider-agnostic scoring (contract, tool, task) that runs on real output strings
providers.py — SimProvider (offline, deterministic) + AnthropicProvider (real Messages API)
scenarios/support_triage.json — bundled fixture
tests/ — 9 unit/smoke tests, now run in CI

The loop, demonstrated

Tuning the bundled scenario, a vague enhanced prompt scored a negative skill
delta (cost up, quality flat). Making the contract explicit moved ModelPort
from last place to a clear win (composite 0.57 → 0.75; skill delta +0.15). The
harness even surfaced a flaw in its own composite (min-max over near-equal costs
swamped quality) — now fixed with baseline-relative, bounded normalization.

Notes

The sim provider's numbers are illustrative (prompt-explicitness × per-model literalness); the grading pipeline is real, so --provider anthropic yields measured results unchanged.
Stdlib only for the offline path; anthropic imported lazily for real runs.

Validation

python3 scripts/validate_skill.py . → valid
markdownlint-cli2 "**/*.md" → 0 errors
python3 -m unittest discover -s harness/tests → 9 passed

A closed-loop evaluator that operationalizes references/benchmarking.md: runs baseline / naive-swap / ModelPort-enhanced arms over an eval set, grades output-contract conformance, tool-calling accuracy, and task success, and prints a leaderboard with attribution (model delta, skill delta, net). - harness/run.py — orchestrates the three arms + leaderboard - harness/graders.py — provider-agnostic scoring (real, runs on actual output) - harness/providers.py — SimProvider (offline, deterministic) + AnthropicProvider (real Messages API; needs ANTHROPIC_API_KEY) - harness/scenarios/support_triage.json — bundled scenario fixture - harness/tests/ — 9 unit/smoke tests, wired into CI - harness/README.md — usage + the iterate-on-failures loop The simulator's numbers are illustrative (driven by prompt explicitness × a per-model literalness knob); the grading/scoring pipeline is real, so the Anthropic provider yields measured results with no other changes.

forkadarshp merged commit e2283af into main May 29, 2026
1 check passed

forkadarshp mentioned this pull request May 29, 2026

fix: sync remaining harness + README work into main #7

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: runnable three-arm benchmark harness#5

feat: runnable three-arm benchmark harness#5
forkadarshp merged 1 commit into
mainfrom
feat/benchmark-harness

forkadarshp commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

forkadarshp commented May 29, 2026

What

Three arms, same eval set + graders

Components

The loop, demonstrated

Notes

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant