feat: LLM-as-judge, agentic-trace evals, and benchmarking (v0.2.0) by LesterALeong · Pull Request #1 · LesterALeong/llm-evalgate

LesterALeong · 2026-06-02T18:50:35Z

Summary

Expands llm-evalgate from deterministic eval gates into a full eval surface. Everything uses the existing Dimension interface, so the new pieces compose in one harness next to the gates.

LLM-as-judge + jury (`judge/`)

JudgeDimension: model-backed grading via an injected complete() callable. Offline-testable, no model SDK imported at package load, and it fails closed (a judge error scores 0.0 rather than crashing the pipeline).
JuryDimension: aggregate several judges (mean / median / majority) and report their agreement.
anthropic_judge(): optional live adapter behind the new [judge] extra (lazy import).
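
The fail-closed and jury behavior described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: `JudgeSketch`, `jury_score`, the prompt text, and the 0.5 pass threshold are all hypothetical names and values, and the agreement measure (share of judges siding with the majority verdict) is only one plausible definition.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from statistics import mean, median
from typing import Callable, Sequence


@dataclass
class JudgeSketch:
    """Hypothetical stand-in for a judge dimension; the real API may differ."""

    complete: Callable[[str], str]  # injected model call, so tests need no SDK
    threshold: float = 0.5          # assumed pass/fail cut-off

    def score(self, output: str) -> float:
        prompt = f"Rate this answer from 0 to 1. Reply with a number only.\n\n{output}"
        try:
            value = float(self.complete(prompt).strip())
        except Exception:
            # Fail closed: a backend error or unparsable reply scores 0.0
            # instead of raising into the surrounding pipeline.
            return 0.0
        if math.isnan(value):
            return 0.0
        return min(1.0, max(0.0, value))  # clamp into [0, 1]


def jury_score(
    judges: Sequence[JudgeSketch], output: str, how: str = "mean"
) -> tuple[float, float]:
    """Return (aggregate score, agreement) across several judges."""
    if not judges:
        raise ValueError("a jury needs at least one judge")
    scores = [judge.score(output) for judge in judges]
    votes = [s >= judge.threshold for judge, s in zip(judges, scores)]
    passes = sum(votes)
    # Agreement: fraction of judges that side with the majority verdict.
    agreement = max(passes, len(votes) - passes) / len(votes)
    if how == "mean":
        aggregate = mean(scores)
    elif how == "median":
        aggregate = median(scores)
    elif how == "majority":
        aggregate = 1.0 if passes * 2 > len(votes) else 0.0
    else:
        raise ValueError(f"unknown aggregation: {how!r}")
    return aggregate, agreement


def flaky(prompt: str) -> str:
    raise RuntimeError("judge backend unavailable")


# Two healthy fake judges and one that always errors.
judges = [JudgeSketch(lambda p: "0.9"), JudgeSketch(lambda p: "0.8"), JudgeSketch(flaky)]
score, agreement = jury_score(judges, "candidate answer", how="median")
```

Because `complete` is just a callable, the fake judges here run fully offline; a live adapter would pass a function that calls the model instead.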

Agentic evals (`agentic/`)

AgentTrace / AgentStep / ToolCall models + AgentEvalHarness.
Dimensions: tool selection, argument validity, step efficiency, goal completion, and judge-backed trajectory coherence. Scores the trajectory, not just the final answer.
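
To make "scores the trajectory" concrete, here is a rough sketch of three of the deterministic checks. All names (`ToolCallSketch`, `TraceSketch`, the scoring functions) and the scoring formulas are assumptions for illustration; the actual AgentTrace models and dimensions in the package may be shaped differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToolCallSketch:
    """Hypothetical tool call: a tool name plus its keyword arguments."""

    name: str
    args: dict = field(default_factory=dict)


@dataclass
class TraceSketch:
    """Hypothetical agent trace: ordered tool calls and the final answer."""

    calls: list[ToolCallSketch]
    final_answer: str = ""


def tool_selection_score(trace: TraceSketch, allowed: set[str]) -> float:
    """Fraction of calls that used a tool from the allowed set."""
    if not trace.calls:
        return 1.0  # assumption: no calls means no wrong tool was chosen
    good = sum(call.name in allowed for call in trace.calls)
    return good / len(trace.calls)


def arg_validity_score(trace: TraceSketch, required: dict[str, set[str]]) -> float:
    """Fraction of calls that supply every required argument for their tool."""
    if not trace.calls:
        return 1.0
    # Assumption: tools without a schema entry are treated as valid.
    valid = sum(
        required.get(call.name, set()) <= set(call.args) for call in trace.calls
    )
    return valid / len(trace.calls)


def step_efficiency_score(trace: TraceSketch, optimal_steps: int) -> float:
    """Ratio of the reference step count to the steps actually taken, capped at 1."""
    if not trace.calls:
        return 1.0 if optimal_steps == 0 else 0.0
    return min(1.0, optimal_steps / len(trace.calls))


trace = TraceSketch(
    calls=[
        ToolCallSketch("search", {"query": "capital of France"}),
        ToolCallSketch("search"),               # missing the required "query"
        ToolCallSketch("shell", {"cmd": "ls"}),  # not an allowed tool
    ],
    final_answer="Paris",
)
selection = tool_selection_score(trace, allowed={"search", "calculator"})
validity = arg_validity_score(trace, required={"search": {"query"}})
efficiency = step_efficiency_score(trace, optimal_steps=2)
```

The final answer is correct in this trace, yet each dimension scores 2/3 because of a wrong tool, a malformed call, and one wasted step. That is the kind of problem a check on the final answer alone would not surface.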

Benchmarking (`bench/`)

Metrics: accuracy, precision/recall/f1, Cohen's kappa, regression-catch-rate.
BenchmarkRunner over a bundled 24-sample labeled golden set.
Result: deterministic gates alone catch 0.667 of regressions (they miss the semantic ones); adding an LLM judge raises the catch rate to 1.000. Pinned by a CI test.
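
For reference, two of the metrics can be sketched as below. The function names and the six-sample toy data are made up for illustration and are not the bundled golden set. Cohen's kappa uses the standard definition `(p_o - p_e) / (1 - p_e)`. Regression-catch-rate is assumed here to mean the share of true regressions that were flagged.

```python
from __future__ import annotations

from typing import Sequence


def regression_catch_rate(is_regression: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Share of true regressions that the evaluator flagged (assumed definition)."""
    hits = [f for r, f in zip(is_regression, flagged) if r]
    if not hits:
        raise ValueError("no regressions in the labeled data")
    return sum(hits) / len(hits)


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two label sequences."""
    a, b = list(a), list(b)
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    labels = set(a) | set(b)
    # Expected agreement if both raters labeled independently at their own rates.
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy data only: three regressions, three clean samples.
truth = [True, True, True, False, False, False]
gates_only = [True, True, False, False, False, False]  # misses one regression
gates_plus_judge = [True, True, True, False, False, False]

gate_rate = regression_catch_rate(truth, gates_only)
full_rate = regression_catch_rate(truth, gates_plus_judge)
gate_kappa = cohens_kappa(truth, gates_only)

# A pinning test in CI could be as small as this assertion.
assert round(gate_rate, 3) == 0.667
```

Pinning the rounded value in a test means a change to the gates or the dataset that shifts the catch rate fails the build rather than going unnoticed.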

Quality

81 tests, 97% coverage, ruff clean.
Wheel + sdist build at 0.2.0 with the golden dataset bundled (verified loadable from an installed, non-source-tree context).
Backwards compatible: existing gates and reliability primitives unchanged.

Packaging

Version 0.1.1 -> 0.2.0
New optional extra: pip install "llm-evalgate[judge]"

Three new layers on top of the deterministic eval gates, all using the existing Dimension interface so they compose in one harness:

- judge/: JudgeDimension + JuryDimension. Model-backed grading via an injected complete() callable (offline-testable, no SDK at import time), agreement signals across a jury, and graceful failure (a judge error scores 0.0 instead of crashing the pipeline). Lazy Anthropic adapter behind the new [judge] extra.
- agentic/: AgentTrace/AgentStep/ToolCall models + AgentEvalHarness with ToolSelection, ToolArgValidity, StepEfficiency, GoalCompletion, and judge-backed TrajectoryCoherence dimensions. Scores the trajectory, not just the final answer.
- bench/: metrics (accuracy, precision/recall/f1, Cohen's kappa, regression-catch-rate), BenchmarkRunner over labeled data, and a bundled 24-sample golden dataset. Deterministic gates alone catch 0.667 of regressions; adding a judge closes it to 1.000. That number is pinned by a CI test so it cannot silently drift.

Packaging: version 0.2.0, [judge] optional extra, golden dataset bundled into the wheel and sdist. 81 tests, 97% coverage, ruff clean.

LesterALeong merged commit 5e7e651 into master Jun 2, 2026
4 checks passed

LesterALeong deleted the feat/v0.2.0-judge-agentic-bench branch June 2, 2026 18:54

LesterALeong merged 1 commit into master from feat/v0.2.0-judge-agentic-bench
