feat: LLM-as-judge, agentic-trace evals, and benchmarking (v0.2.0)#1

Merged
LesterALeong merged 1 commit into master from feat/v0.2.0-judge-agentic-bench on Jun 2, 2026

Conversation

@LesterALeong

Owner

Summary

Expands llm-evalgate from deterministic eval gates into a full eval surface. Everything uses the existing Dimension interface, so the new pieces compose in one harness next to the gates.

LLM-as-judge + jury (judge/)

  • JudgeDimension: model-backed grading via an injected complete() callable. Offline-testable, no model SDK imported at package load, and it fails closed (a judge error scores 0.0 rather than crashing the pipeline).
  • JuryDimension: aggregate several judges (mean / median / majority) and report their agreement.
  • anthropic_judge(): optional live adapter behind the new [judge] extra (lazy import).
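The fail-closed and jury behaviour described above can be sketched in plain Python. This is a minimal standalone illustration, not the package's actual classes: `judge_score`, `jury_score`, the `Complete` alias, the convention that the model replies with a bare number in [0, 1], and the 0.5 pass threshold are all assumptions made for the example.

```python
from statistics import mean, median
from typing import Callable, Sequence, Tuple

# Hypothetical signature for the injected callable: prompt in, raw model text out.
Complete = Callable[[str], str]


def judge_score(complete: Complete, prompt: str) -> float:
    """Return a score in [0, 1]; any failure scores 0.0 (fail closed)."""
    try:
        score = float(complete(prompt).strip())
    except Exception:
        # Transport errors and unparseable replies both count as a failed grade.
        return 0.0
    if not 0.0 <= score <= 1.0:
        # Out-of-range values (and NaN) are treated as failures too.
        return 0.0
    return score


def jury_score(
    scores: Sequence[float], how: str = "mean", threshold: float = 0.5
) -> Tuple[float, float]:
    """Aggregate judge scores and report agreement.

    Agreement here is the fraction of judges on the larger side of the
    pass/fail threshold (an assumed definition for this sketch).
    """
    if not scores:
        return 0.0, 0.0
    pass_frac = sum(s >= threshold for s in scores) / len(scores)
    agreement = max(pass_frac, 1.0 - pass_frac)
    if how == "mean":
        aggregate = mean(scores)
    elif how == "median":
        aggregate = median(scores)
    elif how == "majority":
        aggregate = 1.0 if pass_frac > 0.5 else 0.0
    else:
        raise ValueError(f"unknown aggregation: {how!r}")
    return aggregate, agreement
```

Because `complete` is injected, a test can pass a stub such as `lambda prompt: "0.8"` and never touch a network or a model SDK.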

Agentic evals (agentic/)

  • AgentTrace / AgentStep / ToolCall models + AgentEvalHarness.
  • Dimensions: tool selection, argument validity, step efficiency, goal completion, and judge-backed trajectory coherence. Scores the trajectory, not just the final answer.
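To make "scores the trajectory, not just the final answer" concrete, here is a small self-contained sketch. The class names mirror those listed above, but every field, both scoring functions, and their formulas are assumptions for illustration; the real models and dimensions in `agentic/` may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


@dataclass
class AgentStep:
    tool_call: Optional[ToolCall] = None  # None for a pure reasoning step
    output: str = ""


@dataclass
class AgentTrace:
    goal: str
    steps: List[AgentStep] = field(default_factory=list)
    final_answer: str = ""


def tool_selection_score(trace: AgentTrace, expected_tools: List[str]) -> float:
    """Fraction of the expected tools that the trace actually called."""
    if not expected_tools:
        return 1.0
    called = {s.tool_call.name for s in trace.steps if s.tool_call is not None}
    return sum(t in called for t in expected_tools) / len(expected_tools)


def step_efficiency_score(trace: AgentTrace, optimal_steps: int) -> float:
    """1.0 at or below the optimal step count, decaying as optimal / actual."""
    actual = len(trace.steps)
    if actual == 0:
        return 0.0
    return min(1.0, optimal_steps / actual)


# A trace that reaches the goal but repeats a search along the way.
trace = AgentTrace(
    goal="total of two invoices",
    steps=[
        AgentStep(ToolCall("search", {"q": "invoice A"})),
        AgentStep(ToolCall("search", {"q": "invoice A"})),  # redundant call
        AgentStep(ToolCall("calculator", {"expr": "120 + 80"})),
    ],
    final_answer="200",
)
```

A final-answer check alone would pass this trace; the efficiency score is what surfaces the redundant step.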

Benchmarking (bench/)

  • Metrics: accuracy, precision/recall/f1, Cohen's kappa, regression-catch-rate.
  • BenchmarkRunner over a bundled 24-sample labeled golden set.
  • Result: deterministic gates alone catch 0.667 of regressions (they miss the semantic ones); adding an LLM judge raises the catch rate to 1.000. Pinned by a CI test.
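For readers unfamiliar with two of these metrics, the sketch below shows one plausible way to compute them. The function names and signatures are assumptions, not the `bench/` API, and the labels are toy data chosen to illustrate the arithmetic, not the bundled golden set.

```python
from typing import Hashable, Sequence


def regression_catch_rate(
    is_regression: Sequence[bool], flagged: Sequence[bool]
) -> float:
    """Of the samples labeled as regressions, the fraction the evaluator flagged."""
    hits = [f for r, f in zip(is_regression, flagged) if r]
    if not hits:
        raise ValueError("catch rate is undefined without any labeled regressions")
    return sum(hits) / len(hits)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters' labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    a, b = list(a), list(b)
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    p_expected = sum((a.count(l) / n) * (b.count(l) / n) for l in set(a) | set(b))
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example: three regressions, one of them semantic (missed by the gates).
labels = [True, True, True, False]
gates_only = [True, True, False, False]
gates_plus_judge = [True, True, True, False]
```

On this toy data the gates alone catch 2 of 3 regressions and the combined evaluator catches all 3, which is the same shape as the result reported above.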

Quality

  • 81 tests, 97% coverage, ruff clean.
  • Wheel + sdist build at 0.2.0 with the golden dataset bundled (verified loadable from an installed, non-source-tree context).
  • Backwards compatible: existing gates and reliability primitives unchanged.

Packaging

  • Version 0.1.1 -> 0.2.0
  • New optional extra: pip install "llm-evalgate[judge]"
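An optional extra usually pairs with a lazy import, so that the base install never needs the heavier dependency. The helper below is a generic sketch of that pattern; `require_extra` is a hypothetical name and is not claimed to exist in llm-evalgate.

```python
import importlib
from types import ModuleType


def require_extra(
    module_name: str, extra: str, package: str = "llm-evalgate"
) -> ModuleType:
    """Import a module on first use, pointing at the extra if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f'Install it with: pip install "{package}[{extra}]"'
        ) from exc
```

Calling such a helper inside the adapter function, rather than at module top level, keeps `import` of the package itself free of the optional dependency.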

Three new layers on top of the deterministic eval gates, all using the
existing Dimension interface so they compose in one harness:

- judge/: JudgeDimension + JuryDimension. Model-backed grading via an
  injected complete() callable (offline-testable, no SDK at import time),
  agreement signals across a jury, and graceful failure (a judge error
  scores 0.0 instead of crashing the pipeline). Lazy Anthropic adapter
  behind the new [judge] extra.

- agentic/: AgentTrace/AgentStep/ToolCall models + AgentEvalHarness with
  ToolSelection, ToolArgValidity, StepEfficiency, GoalCompletion, and
  judge-backed TrajectoryCoherence dimensions. Scores the trajectory,
  not just the final answer.

- bench/: metrics (accuracy, precision/recall/f1, Cohen's kappa,
  regression-catch-rate), BenchmarkRunner over labeled data, and a
  bundled 24-sample golden dataset. Deterministic gates alone catch
  0.667 of regressions; adding a judge raises the catch rate to 1.000.
  That number is pinned by a CI test so it cannot silently drift.

Packaging: version 0.2.0, [judge] optional extra, golden dataset bundled
into the wheel and sdist. 81 tests, 97% coverage, ruff clean.
@LesterALeong LesterALeong merged commit 5e7e651 into master Jun 2, 2026
4 checks passed
@LesterALeong LesterALeong deleted the feat/v0.2.0-judge-agentic-bench branch June 2, 2026 18:54