feat: LLM-as-judge, agentic-trace evals, and benchmarking (v0.2.0) #1
Merged
Three new layers on top of the deterministic eval gates, all using the existing Dimension interface so they compose in one harness:

- judge/: JudgeDimension + JuryDimension. Model-backed grading via an injected complete() callable (offline-testable, no SDK at import time), agreement signals across a jury, and graceful failure (a judge error scores 0.0 instead of crashing the pipeline). Lazy Anthropic adapter behind the new [judge] extra.
- agentic/: AgentTrace/AgentStep/ToolCall models + AgentEvalHarness with ToolSelection, ToolArgValidity, StepEfficiency, GoalCompletion, and judge-backed TrajectoryCoherence dimensions. Scores the trajectory, not just the final answer.
- bench/: metrics (accuracy, precision/recall/f1, Cohen's kappa, regression-catch-rate), BenchmarkRunner over labeled data, and a bundled 24-sample golden dataset. Deterministic gates alone catch 0.667 of regressions; adding a judge raises the catch rate to 1.000. That number is pinned by a CI test so it cannot silently drift.

Packaging: version 0.2.0, [judge] optional extra, golden dataset bundled into the wheel and sdist. 81 tests, 97% coverage, ruff clean.
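For readers unfamiliar with the two less common bench/ metrics, here is a minimal sketch of Cohen's kappa and a regression-catch-rate. The function names, signatures, and the reading of "regression-catch-rate" as recall over known-regression samples are assumptions for illustration, not the package's actual API, and the sample labels are toy data rather than the bundled golden dataset.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two binary label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share of positives from rater a
    p_b = sum(b) / n  # share of positives from rater b
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both raters used one identical constant label; kappa is 0/0 here.
        # Returning 1.0 is a convention chosen for this sketch (assumption).
        return 1.0
    return (observed - expected) / (1 - expected)


def regression_catch_rate(is_regression: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Fraction of known regressions that the evaluator flagged (assumed definition)."""
    total = sum(is_regression)
    if total == 0:
        raise ValueError("no regression samples to measure against")
    caught = sum(1 for r, f in zip(is_regression, flagged) if r and f)
    return caught / total


# Toy labels (hypothetical): three regressions and one clean sample.
truth = [True, True, True, False]
gates_only = [True, True, False, False]   # deterministic gates miss one regression
gates_plus_judge = [True, True, True, False]

gates_rate = regression_catch_rate(truth, gates_only)
judge_rate = regression_catch_rate(truth, gates_plus_judge)
```

Pinning a value like this in CI then amounts to a plain equality assertion on the computed rate, which is what keeps the headline number from drifting unnoticed.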
Summary
Expands llm-evalgate from deterministic eval gates into a full eval surface. Everything uses the existing `Dimension` interface, so the new pieces compose in one harness next to the gates.

LLM-as-judge + jury (`judge/`)
- `JudgeDimension`: model-backed grading via an injected `complete()` callable. Offline-testable, no model SDK imported at package load, and it fails closed (a judge error scores 0.0 rather than crashing the pipeline).
- `JuryDimension`: aggregates several judges (mean / median / majority) and reports their agreement.
- `anthropic_judge()`: optional live adapter behind the new `[judge]` extra (lazy import).

Agentic evals (`agentic/`)
- `AgentTrace` / `AgentStep` / `ToolCall` models + `AgentEvalHarness`.

Benchmarking (`bench/`)
- `BenchmarkRunner` over a bundled 24-sample labeled golden set.

Quality
Packaging
pip install "llm-evalgate[judge]"
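
The injected-callable and jury ideas from the summary can be sketched without installing the `[judge]` extra. Everything below (`SketchJudge`, `jury_score`, `agreement`, the prompt text, the 0.5 pass threshold) is hypothetical and only illustrates the pattern; the real `JudgeDimension` and `JuryDimension` signatures may differ.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Callable, Sequence

# Assumed shape of the injected callable: prompt in, raw model text out.
Complete = Callable[[str], str]


@dataclass
class SketchJudge:
    complete: Complete

    def score(self, output: str) -> float:
        """Return a score in [0, 1]; any judge failure scores 0.0 (fails closed)."""
        try:
            raw = self.complete(f"Rate this answer from 0 to 1:\n{output}")
            value = float(raw.strip())
        except Exception:
            return 0.0  # a broken judge must not crash the pipeline
        return min(1.0, max(0.0, value))


def jury_score(scores: Sequence[float], how: str = "mean", threshold: float = 0.5) -> float:
    """Aggregate per-judge scores by mean, median, or majority vote."""
    if not scores:
        raise ValueError("jury needs at least one score")
    if how == "mean":
        return mean(scores)
    if how == "median":
        return median(scores)
    if how == "majority":
        passes = sum(s >= threshold for s in scores)
        return 1.0 if passes * 2 > len(scores) else 0.0
    raise ValueError(f"unknown aggregation: {how}")


def agreement(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of judges on the majority side of the pass threshold."""
    passes = sum(s >= threshold for s in scores)
    return max(passes, len(scores) - passes) / len(scores)


def broken(prompt: str) -> str:
    raise RuntimeError("simulated provider outage")


# Offline usage: stub callables stand in for a live model.
jury = [SketchJudge(lambda p: "0.9"), SketchJudge(lambda p: "0.7"), SketchJudge(broken)]
scores = [judge.score("candidate answer") for judge in jury]
```

Because the callable is injected, tests can pass stubs like the lambdas above, and a live adapter only needs to import its SDK inside the function that builds the callable.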