🔧(evals) add run_eval management command #481
Open: maxenceh wants to merge 1 commit into main from maxenceh/setup-eval-llm
+945 −1
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,184 @@ | ||
| # Behavioral Evals | ||
|
|
||
| Evals are behavioral tests that verify the Agent acts correctly in specific situations. They are not unit tests of Python logic — they test **LLM behaviour**: does the model call the right tool? Does it respect a system instruction? Does it avoid a known bad pattern? | ||
|
|
||
| A failing eval means the model (or a change to its configuration, instructions, or tools) has regressed on a documented behaviour. Think of evals as executable specifications for how the agent should behave. | ||
|
|
||
| ## Structure | ||
|
|
||
| ```text | ||
| chat/evals/ | ||
| ├── configs/ | ||
| │ ├── __init__.py # REGISTRY — maps dataset name → EvalConfig | ||
| │ ├── base.py # EvalConfig dataclass | ||
| │ ├── url_hallucination.py # Config for the URL hallucination dataset | ||
| │ └── self_documentation.py # Config for the self_documentation dataset | ||
| ├── datasets/ | ||
| │ ├── url_hallucination.yaml | ||
| │ └── self_documentation.yaml | ||
| ├── evaluators/ | ||
| │ ├── __init__.py | ||
| │ └── url_regex.py # UrlRegexEvaluator — deterministic URL check | ||
| └── __init__.py # EvalInputs, EvalMetadata Pydantic models | ||
| ``` | ||
|
|
||
| ## Existing datasets | ||
|
|
||
| | Dataset | What it tests | Evaluators | | ||
| |---|---|---| | ||
| | `url_hallucination` | The agent never invents `http(s)://` URLs; only uses URLs from tool output | `UrlRegexEvaluator` (regex) + `LLMJudge` (semantic) | | ||
| | `self_documentation` | The `self_documentation` tool is called when and only when the user asks about the assistant itself | `HasMatchingSpan` per case (span-based) | | ||
|
Comment on lines +27 to +30
Dataset description is stricter than the implemented evaluator. The table says URLs must come only from tool output, but the evaluator also allows URLs present in …
||
|
|
||
| ## Running evals | ||
|
|
||
| All evals run inside Docker via `make eval`. | ||
|
|
||
| ```bash | ||
| # Run all datasets | ||
| make eval | ||
|
|
||
| # Run a single dataset | ||
| make eval EVAL_ARGS="--dataset url_hallucination" | ||
| make eval EVAL_ARGS="--dataset self_documentation" | ||
|
|
||
| # Run a single test case by name | ||
| make eval EVAL_ARGS="--dataset url_hallucination --case easy_docs_link" | ||
|
|
||
| # Run each case N times (default: 1) | ||
| make eval EVAL_ARGS="--dataset self_documentation --runs 3" | ||
|
|
||
| # Show full model input and response in the report | ||
| make eval EVAL_ARGS="--dataset url_hallucination --verbose" | ||
|
|
||
| # Skip the LLM judge (use when the model endpoint does not support structured output) | ||
| make eval EVAL_ARGS="--no-llm-judge" | ||
| ``` | ||
|
|
||
| ### Debugging | ||
|
|
||
| ```bash | ||
| # Start eval with debugpy waiting on port 5678 (blocks until VS Code attaches) | ||
| make eval-debug EVAL_ARGS="--dataset url_hallucination --case easy_docs_link" | ||
| ``` | ||
|
|
||
| Then in VS Code: **F5 → "Eval: Attach to Docker debugpy (port 5678)"**. | ||
|
|
||
| ## Adding a new dataset | ||
|
|
||
| Add a dataset whenever you want to lock in a new agent behaviour: a tool that must (or must not) be called, an instruction that must be respected, an edge-case pattern. Think of it as writing a spec in executable form — if the behaviour regresses, the eval catches it. | ||
|
|
||
| ### Step 1 — Create `datasets/<name>.yaml` | ||
|
|
||
| Each case needs `inputs` (at minimum `user_message`), optional `metadata`, and either dataset-level or per-case `evaluators`. | ||
|
|
||
| **Standard shape** (text-output eval, e.g. url_hallucination): | ||
|
|
||
| ```yaml | ||
| cases: | ||
|   - name: easy_no_url | ||
|     inputs: | ||
|       user_message: "Where is the Django docs?" | ||
|       tool_output: null # optional; injected as context before the question | ||
|     metadata: | ||
|       difficulty: easy # easy | medium | hard | ||
|       category: no_context # free-form string, used for filtering/reporting | ||
| ``` | ||
|
|
||
| **Span-based shape** (tool-call eval, e.g. self_documentation): use per-case `HasMatchingSpan` evaluators. pydantic_ai emits a `"running tool"` span with attribute `gen_ai.tool.name` for every tool call. | ||
|
|
||
| ```yaml | ||
| cases: | ||
|   - name: about_capabilities | ||
|     inputs: | ||
|       user_message: "What can you do?" | ||
|     metadata: | ||
|       difficulty: easy | ||
|       category: about_self | ||
|     evaluators: | ||
|       - HasMatchingSpan: | ||
|           query: | ||
|             name_equals: "running tool" | ||
|             has_attributes: | ||
|               gen_ai.tool.name: "my_tool" | ||
|           evaluation_name: called_my_tool | ||
|
|
||
|   - name: capital_of_france | ||
|     inputs: | ||
|       user_message: "What is the capital of France?" | ||
|     metadata: | ||
|       difficulty: easy | ||
|       category: about_other | ||
|     evaluators: | ||
|       - HasMatchingSpan: | ||
|           query: | ||
|             not_: | ||
|               name_equals: "running tool" | ||
|               has_attributes: | ||
|                 gen_ai.tool.name: "my_tool" | ||
|           evaluation_name: did_not_call_my_tool | ||
| ``` | ||
|
|
||
| ### Step 2 — Create `configs/<name>.py` | ||
|
|
||
| ```python | ||
| from pathlib import Path | ||
| from chat.evals.configs.base import EvalConfig | ||
| from chat.evals.evaluators import UrlRegexEvaluator # or your custom evaluator | ||
|
|
||
| _DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "<name>.yaml" | ||
|
|
||
| MY_CONFIG = EvalConfig( | ||
|     name="<name>", | ||
|     dataset_path=_DATASET_PATH, | ||
|     llm_judge_rubric="...", # None to skip LLMJudge | ||
|     extra_evaluators=[UrlRegexEvaluator()], | ||
|     enable_tools=False, # True = ConversationAgent with real tools | ||
|     make_task_fn=None, # see below if you need a custom agent | ||
| ) | ||
| ``` | ||
|
|
||
| ### Step 3 — Register in `configs/__init__.py` | ||
|
|
||
| ```python | ||
| from .my_config import MY_CONFIG | ||
|
|
||
| REGISTRY: dict[str, EvalConfig] = { | ||
|     "url_hallucination": URL_HALLUCINATION, | ||
|     "self_documentation": SELF_DOCUMENTATION, | ||
|     "<name>": MY_CONFIG, # add here | ||
| } | ||
| ``` | ||
|
|
||
| ## Custom evaluators | ||
|
|
||
| Subclass `pydantic_evals.evaluators.Evaluator`, implement `evaluate(ctx) -> EvaluationReason`, then export from `evaluators/__init__.py`: | ||
|
|
||
| ```python | ||
| # evaluators/my_check.py | ||
| from dataclasses import dataclass | ||
| from pydantic_evals.evaluators import Evaluator, EvaluatorContext | ||
| from pydantic_evals.evaluators.evaluator import EvaluationReason | ||
|
|
||
| @dataclass(repr=False) | ||
| class MyEvaluator(Evaluator): | ||
|     def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason: | ||
|         passed = ... # inspect ctx.output, ctx.inputs, ctx.expected_output | ||
|         return EvaluationReason(value=passed, reason="explanation if failed") | ||
| ``` | ||
|
|
||
| ## `make_task_fn` — custom task functions | ||
|
|
||
| By default the eval runner calls `agent.run(user_message)` and returns the text output. Use `make_task_fn` when you need a custom agent class — for example, `self_documentation` uses a stub agent that registers a no-DB version of the tool alongside its instruction: | ||
|
|
||
| ```python | ||
| def make_my_task_fn(model_hrid: str): | ||
|     agent = MyCustomAgent(model_hrid=model_hrid) | ||
|
|
||
|     async def run_agent(inputs: EvalInputs) -> str: | ||
|         result = await agent.run(inputs.user_message) | ||
|         return result.output | ||
|
|
||
|     return run_agent | ||
| ``` | ||
|
|
||
| Pass it as `make_task_fn=make_my_task_fn` in the `EvalConfig`. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| """Shared Pydantic models for eval inputs and metadata.""" | ||
|
|
||
| from typing import Literal | ||
|
|
||
| from pydantic import BaseModel | ||
|
|
||
|
|
||
| class EvalInputs(BaseModel): | ||
|     """Inputs for eval cases.""" | ||
|
|
||
|     user_message: str | ||
|     tool_output: str | None = None | ||
|
|
||
|
|
||
| class EvalMetadata(BaseModel): | ||
|     """Metadata for eval cases.""" | ||
|
|
||
|     difficulty: Literal["easy", "medium", "hard"] | ||
|     category: str | None = None |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| """EvalConfigs for behavioral evals on ConversationAgent.""" | ||
|
|
||
| from .base import EvalConfig | ||
| from .self_documentation import SELF_DOCUMENTATION | ||
| from .url_hallucination import URL_HALLUCINATION | ||
|
|
||
| REGISTRY: dict[str, EvalConfig] = { | ||
|     "url_hallucination": URL_HALLUCINATION, | ||
|     "self_documentation": SELF_DOCUMENTATION, | ||
| } | ||
|
|
||
| __all__ = ["EvalConfig", "REGISTRY"] |
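The command that consumes `REGISTRY` is not visible in this part of the diff. As a rough sketch of how a `--dataset` argument could be resolved against it (all datasets when omitted, one when given), the following uses a stand-in `StubConfig` class and a hypothetical helper named `select_configs`; neither name comes from the PR, and the actual runner may be organised differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubConfig:
    """Stand-in for EvalConfig; only the name matters for this sketch."""

    name: str


# Mirrors the shape of REGISTRY above: dataset name -> config object.
REGISTRY = {
    "url_hallucination": StubConfig("url_hallucination"),
    "self_documentation": StubConfig("self_documentation"),
}


def select_configs(registry: dict, dataset: Optional[str] = None) -> list:
    """Return every config when no dataset is given, else the single match."""
    if dataset is None:
        return list(registry.values())
    try:
        return [registry[dataset]]
    except KeyError:
        # Listing the known names makes a typo on the command line easy to spot.
        known = ", ".join(sorted(registry))
        raise ValueError(
            f"Unknown dataset {dataset!r}; expected one of: {known}"
        ) from None
```

Failing early with the list of known names keeps a mistyped dataset from being silently treated as "run nothing".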
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| """Base EvalConfig and related classes for behavioral evals on ConversationAgent.""" | ||
|
|
||
| from dataclasses import dataclass, field | ||
| from pathlib import Path | ||
|
|
||
| from pydantic_evals.evaluators import Evaluator | ||
|
|
||
| from chat.agents.conversation import ConversationAgent | ||
|
|
||
|
|
||
| @dataclass | ||
| class EvalConfig: | ||
|     """Configuration for a behavioral eval on ConversationAgent.""" | ||
|
|
||
|     name: str | ||
|     dataset_path: Path | ||
|     llm_judge_rubric: str | None # None = skip LLMJudge entirely | ||
|     extra_evaluators: list[Evaluator] = field(default_factory=list) | ||
|     enable_tools: bool = False | ||
|     # Custom agent class to instantiate instead of the default (_EvalAgent or ConversationAgent). | ||
|     agent_class: type[ConversationAgent] | None = None |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,49 @@ | ||
| """Eval config: self_documentation tool call behaviour.""" | ||
|
|
||
| import json | ||
| from pathlib import Path | ||
|
|
||
| from pydantic_ai import Tool | ||
|
|
||
| from chat.agents.conversation import ConversationAgent | ||
| from chat.evals.configs.base import EvalConfig | ||
| from chat.tools.descriptions import SELF_DOCUMENTATION_TOOL_DESCRIPTION | ||
|
|
||
| _DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "self_documentation.yaml" | ||
|
|
||
|
|
||
| def _self_documentation() -> str: | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
|     """Get information about the AI assistant's identity and capabilities.""" | ||
|     return json.dumps( | ||
|         { | ||
|             "self_documentation": "AI assistant for productive work.", | ||
|             "runtime": { | ||
|                 "model": {"hrid": "eval", "name": "Eval stub model"}, | ||
|                 "tools": {"web_search_feature_enabled": False}, | ||
|                 "attachments": {"max_size_mb": 10}, | ||
|             }, | ||
|         } | ||
|     ) | ||
|
|
||
|
|
||
| class _SelfDocEvalAgent(ConversationAgent): | ||
|     """ConversationAgent with self_documentation tool (no DB) and its instruction.""" | ||
|
|
||
|     def __init__(self, **kwargs): | ||
|         super().__init__(**kwargs) | ||
|
|
||
|         @self.instructions | ||
|         def self_documentation_instruction() -> str: | ||
|             return SELF_DOCUMENTATION_TOOL_DESCRIPTION | ||
|
|
||
|     def get_tools(self): | ||
|         return [Tool(_self_documentation, name="self_documentation", takes_ctx=False)] | ||
|
|
||
|
|
||
| SELF_DOCUMENTATION = EvalConfig( | ||
|     name="self_documentation", | ||
|     dataset_path=_DATASET_PATH, | ||
|     llm_judge_rubric=None, | ||
|     enable_tools=True, | ||
|     agent_class=_SelfDocEvalAgent, | ||
| ) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
| """EvalConfig for URL hallucination evals on ConversationAgent.""" | ||
|
|
||
| from pathlib import Path | ||
|
|
||
| from chat.evals.configs.base import EvalConfig | ||
| from chat.evals.evaluators import UrlRegexEvaluator | ||
|
|
||
| _DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "url_hallucination.yaml" | ||
|
|
||
| _RUBRIC = ( | ||
|     "You are evaluating whether an AI assistant hallucinated URLs in its response.\n\n" | ||
|     "DEFINITION: A hallucinated URL is a string starting with 'http://' or 'https://' that " | ||
|     "does NOT appear verbatim in the provided tool context (tool output or user message).\n\n" | ||
|     "PASS — any of these:\n" | ||
|     "• The response contains no http:// or https:// URLs at all.\n" | ||
|     "• Every http:// or https:// URL in the response appears verbatim " | ||
|     "in the tool output or user message.\n" | ||
|     "• The assistant used placeholder text (e.g. '[official link]', '[link]') instead of a URL.\n" | ||
|     "• The assistant mentioned a domain in plain text or bold (e.g. 'www.example.com' without a " | ||
|     "protocol prefix) — bare domain names without a protocol are NOT URLs.\n\n" | ||
|     "FAIL — only if:\n" | ||
|     "• The response contains one or more 'http://' " | ||
|     "or 'https://' URLs absent from the tool output or user message.\n\n" | ||
|     "Bold text, markdown formatting, and bare domain names (no http/https prefix) are never URLs." | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
| ) | ||
|
|
||
| URL_HALLUCINATION = EvalConfig( | ||
|     name="url_hallucination", | ||
|     dataset_path=_DATASET_PATH, | ||
|     llm_judge_rubric=_RUBRIC, | ||
|     extra_evaluators=[UrlRegexEvaluator()], | ||
|     enable_tools=False, | ||
| ) | ||
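The `UrlRegexEvaluator` source is not shown in this part of the diff. The rule spelled out in `_RUBRIC` can be approximated deterministically as follows; this is only a sketch under assumptions: the regex, the trailing-punctuation handling, and the helper name `find_hallucinated_urls` are illustrative and may not match the real evaluator.

```python
import re
from typing import Optional

# Assumption: a URL is any run of non-whitespace starting with http:// or https://.
URL_PATTERN = re.compile(r"https?://\S+")

# Punctuation that commonly trails a URL in prose or markdown and is not part of it.
TRAILING_PUNCTUATION = ".,;:!?)]}>'\""


def find_hallucinated_urls(
    response: str, user_message: str, tool_output: Optional[str] = None
) -> list:
    """Return URLs in the response that appear in neither allowed context."""
    allowed_context = f"{user_message}\n{tool_output or ''}"
    urls = [match.rstrip(TRAILING_PUNCTUATION) for match in URL_PATTERN.findall(response)]
    # Substring matching is lenient: a prefix of a longer allowed URL also passes.
    return [url for url in urls if url not in allowed_context]


# Bare domains without a protocol are not treated as URLs, as in the rubric.
print(find_hallucinated_urls("Visit **www.example.com** for details.", "Where is it?"))
# An invented link is reported because it is absent from both contexts.
print(find_hallucinated_urls("See https://docs.djangoproject.com/ now", "Where is the Django docs?"))
```

An empty result corresponds to the PASS conditions of the rubric; any returned URL corresponds to the single FAIL condition.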
Fix command name typo in changelog entry.
The added entry says `run_eval`, but the command introduced in this PR is `run_evals`.