feat: add LabelModelGrader support for OpenAI Evals backend #137
Open
mesutoezdil wants to merge 1 commit into agentevals-dev:main from
Conversation
Closes #97
Adds label_model as a second grader type next to text_similarity.
Config validates the model, input, labels, and passing_labels fields. Items sent to the Evals API include only actual_response for label_model, since the expected behavior is encoded in the grader config. The expected_invocations check is now type-aware, so label_model runs work without a golden set. Result details include model and passing_labels instead of evaluation_metric.
Example config:
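A minimal sketch of what a label_model grader config might look like, based on the fields named above; the nesting, the template syntax, and the concrete label values are illustrative assumptions, not taken from the repo.

```python
# Illustrative sketch only: key nesting, the {{actual_response}} template
# syntax, and the label values are assumptions, not the repo's actual API.
grader_config = {
    "type": "label_model",
    "model": "gpt-4o-mini",  # model that assigns the label
    "input": [
        {
            "role": "user",
            "content": "Grade this response: {{actual_response}}",
        },
    ],
    "labels": ["pass", "borderline", "fail"],  # closed set the model picks from
    "passing_labels": ["pass"],                # labels counted as a passing result
}
```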
The diff shows 184 insertions, but 137 of those are in the new test file; the actual code change is about 33 lines across config.py and openai_eval_backend.py.
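For orientation, a sketch of the kind of type-aware branching that change amounts to; the function and field names here are hypothetical, not the actual code in the diff.

```python
# Hypothetical sketch of type-aware item construction; build_item and the
# field names are illustrative, not functions confirmed to exist in the PR.
def build_item(grader_type: str, actual: str, expected: str | None = None) -> dict:
    if grader_type == "label_model":
        # Expected behavior lives in the grader's labels/passing_labels,
        # so the item sent to the Evals API carries only the actual output.
        return {"actual_response": actual}
    # text_similarity still compares against a golden response.
    return {"actual_response": actual, "expected_response": expected}
```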
Tests cover config validation, criteria shape, jsonl item structure, schema selection, and the evaluate flow with and without expected invocations.
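A sketch of what a couple of those tests might look like, reusing the hypothetical build_item helper above; the test names and assertions are assumptions about the suite, not copied from it.

```python
# Hypothetical tests; build_item is the illustrative helper sketched above.
def test_label_model_item_contains_only_actual_response():
    item = build_item("label_model", actual="The answer is 4.")
    assert item == {"actual_response": "The answer is 4."}

def test_text_similarity_item_includes_expected_response():
    item = build_item("text_similarity", actual="4", expected="4")
    assert item["expected_response"] == "4"
```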