Skip to content

feat: runnable three-arm benchmark harness#5

Merged
forkadarshp merged 1 commit into
mainfrom
feat/benchmark-harness
May 29, 2026
Merged

feat: runnable three-arm benchmark harness#5
forkadarshp merged 1 commit into
mainfrom
feat/benchmark-harness

Conversation

@forkadarshp

Copy link
Copy Markdown
Owner

What

Adds a runnable, closed-loop benchmark harness under harness/ that
operationalizes the methodology in references/benchmarking.md (previously docs
only). Run a migration scenario, read the leaderboard + failing cases, tighten
the enhanced system, re-run.

Three arms, same eval set + graders

  • A baseline — old model + baseline prompt
  • B naive swap — new model + baseline prompt (raw model delta)
  • C ModelPort — new model + enhanced prompt (skill's added value)

Components

  • run.py — orchestrates the arms, prints the leaderboard + attribution (model delta / skill delta / net)
  • graders.py — provider-agnostic scoring (contract, tool, task) that runs on real output strings
  • providers.pySimProvider (offline, deterministic) + AnthropicProvider (real Messages API)
  • scenarios/support_triage.json — bundled fixture
  • tests/ — 9 unit/smoke tests, now run in CI

The loop, demonstrated

Tuning the bundled scenario, a vague enhanced prompt scored a negative skill
delta
(cost up, quality flat). Making the contract explicit moved ModelPort
from last place to a clear win (composite 0.57 → 0.75; skill delta +0.15). The
harness even surfaced a flaw in its own composite (min-max over near-equal costs
swamped quality) — now fixed with baseline-relative, bounded normalization.

Notes

  • The sim provider's numbers are illustrative (prompt-explicitness × per-model literalness); the grading pipeline is real, so --provider anthropic yields measured results unchanged.
  • Stdlib only for the offline path; anthropic imported lazily for real runs.

Validation

  • python3 scripts/validate_skill.py . → valid
  • markdownlint-cli2 "**/*.md" → 0 errors
  • python3 -m unittest discover -s harness/tests → 9 passed

A closed-loop evaluator that operationalizes references/benchmarking.md:
runs baseline / naive-swap / ModelPort-enhanced arms over an eval set, grades
output-contract conformance, tool-calling accuracy, and task success, and
prints a leaderboard with attribution (model delta, skill delta, net).

- harness/run.py — orchestrates the three arms + leaderboard
- harness/graders.py — provider-agnostic scoring (real, runs on actual output)
- harness/providers.py — SimProvider (offline, deterministic) + AnthropicProvider
  (real Messages API; needs ANTHROPIC_API_KEY)
- harness/scenarios/support_triage.json — bundled scenario fixture
- harness/tests/ — 9 unit/smoke tests, wired into CI
- harness/README.md — usage + the iterate-on-failures loop

The simulator's numbers are illustrative (driven by prompt explicitness × a
per-model literalness knob); the grading/scoring pipeline is real, so the
Anthropic provider yields measured results with no other changes.
@forkadarshp forkadarshp merged commit e2283af into main May 29, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant