diff --git a/CHANGELOG.md b/CHANGELOG.md index 247e9e76..36cb60ee 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -299,6 +299,7 @@ and this project adheres to - ✨(langfuse) allow user to score messages from LLM #6 - ✨(onboarding) add activation code logic for launch #62 - 💄(chat) add code highlighting for LLM responses #67 +- 🔧(evals) add run_evals management command [unreleased]: https://github.com/suitenumerique/conversations/compare/v0.0.15...main [0.0.15]: https://github.com/suitenumerique/conversations/releases/v0.0.15 diff --git a/Makefile b/Makefile index d090cf2d..788360d1 100644 --- a/Makefile +++ b/Makefile @@ -242,6 +242,14 @@ shell: ## connect to database shell @$(MANAGE) shell #_plus .PHONY: dbshell +eval: ## run behavioral evals (usage: make eval EVAL_ARGS="--dataset url_hallucination --verbose") + @$(MANAGE) run_evals $(EVAL_ARGS) +.PHONY: eval + +eval-debug: ## run behavioral evals with debugpy on port 5678 (attach VS Code before the command runs) + @$(COMPOSE_RUN) -p 5678:5678 app-dev python -m debugpy --listen 0.0.0.0:5678 --wait-for-client manage.py run_evals $(EVAL_ARGS) +.PHONY: eval-debug + # -- Database dbshell: ## connect to database shell diff --git a/src/backend/chat/evals/README.md b/src/backend/chat/evals/README.md new file mode 100644 index 00000000..d2585132 --- /dev/null +++ b/src/backend/chat/evals/README.md @@ -0,0 +1,184 @@ +# Behavioral Evals + +Evals are behavioral tests that verify the Agent acts correctly in specific situations. They are not unit tests of Python logic — they test **LLM behaviour**: does the model call the right tool? Does it respect a system instruction? Does it avoid a known bad pattern? + +A failing eval means the model (or a change to its configuration, instructions, or tools) has regressed on a documented behaviour. Think of evals as executable specifications for how the agent should behave. 
+ +## Structure + +```text +chat/evals/ +├── configs/ +│ ├── __init__.py # REGISTRY — maps dataset name → EvalConfig +│ ├── base.py # EvalConfig dataclass +│ ├── url_hallucination.py # Config for the URL hallucination dataset +│ └── self_documentation.py# Config for the self_documentation dataset +├── datasets/ +│ ├── url_hallucination.yaml +│ └── self_documentation.yaml +├── evaluators/ +│ ├── __init__.py +│ └── url_regex.py # UrlRegexEvaluator — deterministic URL check +└── __init__.py # EvalInputs, EvalMetadata Pydantic models +``` + +## Existing datasets + +| Dataset | What it tests | Evaluators | +|---|---|---| +| `url_hallucination` | The agent never invents `http(s)://` URLs; only uses URLs from tool output | `UrlRegexEvaluator` (regex) + `LLMJudge` (semantic) | +| `self_documentation` | The `self_documentation` tool is called when and only when the user asks about the assistant itself | `HasMatchingSpan` per case (span-based) | + +## Running evals + +All evals run inside Docker via `make eval`. + +```bash +# Run all datasets +make eval + +# Run a single dataset +make eval EVAL_ARGS="--dataset url_hallucination" +make eval EVAL_ARGS="--dataset self_documentation" + +# Run a single test case by name +make eval EVAL_ARGS="--dataset url_hallucination --case easy_docs_link" + +# Run each case N times (default: 1) +make eval EVAL_ARGS="--dataset self_documentation --runs 3" + +# Show full model input and response in the report +make eval EVAL_ARGS="--dataset url_hallucination --verbose" + +# Skip the LLM judge (use when the model endpoint does not support structured output) +make eval EVAL_ARGS="--no-llm-judge" +``` + +### Debugging + +```bash +# Start eval with debugpy waiting on port 5678 (blocks until VS Code attaches) +make eval-debug EVAL_ARGS="--dataset url_hallucination --case easy_docs_link" +``` + +Then in VS Code: **F5 → "Eval: Attach to Docker debugpy (port 5678)"**. 
+ +## Adding a new dataset + +Add a dataset whenever you want to lock in a new agent behaviour: a tool that must (or must not) be called, an instruction that must be respected, an edge-case pattern. Think of it as writing a spec in executable form — if the behaviour regresses, the eval catches it. + +### Step 1 — Create `datasets/.yaml` + +Each case needs `inputs` (at minimum `user_message`), optional `metadata`, and either dataset-level or per-case `evaluators`. + +**Standard shape** (text-output eval, e.g. url_hallucination): + +```yaml +cases: + - name: easy_no_url + inputs: + user_message: "Where is the Django docs?" + tool_output: null # optional — injected as context before the question + metadata: + difficulty: easy # easy | medium | hard + category: no_context # free-form string, used for filtering/reporting +``` + +**Span-based shape** (tool-call eval, e.g. self_documentation): use per-case `HasMatchingSpan` evaluators. pydantic_ai emits a `"running tool"` span with attribute `gen_ai.tool.name` for every tool call. + +```yaml +cases: + - name: about_capabilities + inputs: + user_message: "What can you do?" + metadata: + difficulty: easy + category: about_self + evaluators: + - HasMatchingSpan: + query: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "my_tool" + evaluation_name: called_my_tool + + - name: capital_of_france + inputs: + user_message: "What is the capital of France?" 
+ metadata: + difficulty: easy + category: about_other + evaluators: + - HasMatchingSpan: + query: + not_: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "my_tool" + evaluation_name: did_not_call_my_tool +``` + +### Step 2 — Create `configs/.py` + +```python +from pathlib import Path +from chat.evals.configs.base import EvalConfig +from chat.evals.evaluators import UrlRegexEvaluator # or your custom evaluator + +_DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / ".yaml" + +MY_CONFIG = EvalConfig( + name="", + dataset_path=_DATASET_PATH, + llm_judge_rubric="...", # None to skip LLMJudge + extra_evaluators=[UrlRegexEvaluator()], + enable_tools=False, # True = ConversationAgent with real tools + agent_class=None, # see below if you need a custom agent +) +``` + +### Step 3 — Register in `configs/__init__.py` + +```python +from .my_config import MY_CONFIG + +REGISTRY: dict[str, EvalConfig] = { + "url_hallucination": URL_HALLUCINATION, + "self_documentation": SELF_DOCUMENTATION, + "": MY_CONFIG, # add here +} +``` + +## Custom evaluators + +Subclass `pydantic_evals.evaluators.Evaluator`, implement `evaluate(ctx) -> EvaluationReason`, then export from `evaluators/__init__.py`: + +```python +# evaluators/my_check.py +from dataclasses import dataclass +from pydantic_evals.evaluators import Evaluator, EvaluatorContext +from pydantic_evals.evaluators.evaluator import EvaluationReason + +@dataclass(repr=False) +class MyEvaluator(Evaluator): + def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason: + passed = ... # inspect ctx.output, ctx.inputs, ctx.expected_output + return EvaluationReason(value=passed, reason="explanation if failed") +``` + +## `agent_class` — custom agent classes + +By default the eval runner instantiates `ConversationAgent` (or a tool-less subclass when `enable_tools=False`), calls `agent.run(...)` with the user message, and returns the text output.
Use `agent_class` when you need a custom agent class. For example, `self_documentation` uses `_SelfDocEvalAgent`, a `ConversationAgent` subclass that registers a no-DB version of the tool alongside its instruction: + +```python +from pydantic_ai import Tool +from chat.agents.conversation import ConversationAgent + +class MyCustomAgent(ConversationAgent): + """ConversationAgent exposing only a stub tool (no DB access).""" + + def get_tools(self): + return [Tool(my_stub_tool, name="my_tool", takes_ctx=False)] # my_stub_tool: your own function +``` + +Pass it as `agent_class=MyCustomAgent` in the `EvalConfig`. The runner instantiates it with `model_hrid=settings.LLM_DEFAULT_MODEL_HRID`. diff --git a/src/backend/chat/evals/__init__.py b/src/backend/chat/evals/__init__.py new file mode 100644 index 00000000..166c9336 --- /dev/null +++ b/src/backend/chat/evals/__init__.py @@ -0,0 +1,19 @@ +"""Shared Pydantic models for eval inputs and metadata.""" + +from typing import Literal + +from pydantic import BaseModel + + +class EvalInputs(BaseModel): + """Inputs for eval cases.""" + + user_message: str + tool_output: str | None = None + + +class EvalMetadata(BaseModel): + """Metadata for eval cases.""" + + difficulty: Literal["easy", "medium", "hard"] + category: str | None = None diff --git a/src/backend/chat/evals/configs/__init__.py b/src/backend/chat/evals/configs/__init__.py new file mode 100644 index 00000000..13130da5 --- /dev/null +++ b/src/backend/chat/evals/configs/__init__.py @@ -0,0 +1,12 @@ +"""EvalConfigs for behavioral evals on ConversationAgent.""" + +from .base import EvalConfig +from .self_documentation import SELF_DOCUMENTATION +from .url_hallucination import URL_HALLUCINATION + +REGISTRY: dict[str, EvalConfig] = { + "url_hallucination": URL_HALLUCINATION, + "self_documentation": SELF_DOCUMENTATION, +} + +__all__ = ["EvalConfig", "REGISTRY"] diff --git a/src/backend/chat/evals/configs/base.py b/src/backend/chat/evals/configs/base.py new file mode 100644 index 00000000..bc9c7aec --- /dev/null +++ b/src/backend/chat/evals/configs/base.py @@ -0,0 +1,21 @@ +"""Base EvalConfig and related classes for behavioral evals on ConversationAgent.""" + +from dataclasses 
import dataclass, field +from pathlib import Path + +from pydantic_evals.evaluators import Evaluator + +from chat.agents.conversation import ConversationAgent + + +@dataclass +class EvalConfig: + """Configuration for a behavioral eval on ConversationAgent.""" + + name: str + dataset_path: Path + llm_judge_rubric: str | None # None = skip LLMJudge entirely + extra_evaluators: list[Evaluator] = field(default_factory=list) + enable_tools: bool = False + # Custom agent class to instantiate instead of the default (_EvalAgent or ConversationAgent). + agent_class: type[ConversationAgent] | None = None diff --git a/src/backend/chat/evals/configs/self_documentation.py b/src/backend/chat/evals/configs/self_documentation.py new file mode 100644 index 00000000..1b9ee80e --- /dev/null +++ b/src/backend/chat/evals/configs/self_documentation.py @@ -0,0 +1,49 @@ +"""Eval config: self_documentation tool call behaviour.""" + +import json +from pathlib import Path + +from pydantic_ai import Tool + +from chat.agents.conversation import ConversationAgent +from chat.evals.configs.base import EvalConfig +from chat.tools.descriptions import SELF_DOCUMENTATION_TOOL_DESCRIPTION + +_DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "self_documentation.yaml" + + +def _self_documentation() -> str: + """Get information about the AI assistant's identity and capabilities.""" + return json.dumps( + { + "self_documentation": "AI assistant for productive work.", + "runtime": { + "model": {"hrid": "eval", "name": "Eval stub model"}, + "tools": {"web_search_feature_enabled": False}, + "attachments": {"max_size_mb": 10}, + }, + } + ) + + +class _SelfDocEvalAgent(ConversationAgent): + """ConversationAgent with self_documentation tool (no DB) and its instruction.""" + + def __init__(self, **kwargs): + super().__init__(**kwargs) + + @self.instructions + def self_documentation_instruction() -> str: + return SELF_DOCUMENTATION_TOOL_DESCRIPTION + + def get_tools(self): + return 
[Tool(_self_documentation, name="self_documentation", takes_ctx=False)] + + +SELF_DOCUMENTATION = EvalConfig( + name="self_documentation", + dataset_path=_DATASET_PATH, + llm_judge_rubric=None, + enable_tools=True, + agent_class=_SelfDocEvalAgent, +) diff --git a/src/backend/chat/evals/configs/url_hallucination.py b/src/backend/chat/evals/configs/url_hallucination.py new file mode 100644 index 00000000..0242ec7e --- /dev/null +++ b/src/backend/chat/evals/configs/url_hallucination.py @@ -0,0 +1,33 @@ +"""EvalConfig for URL hallucination evals on ConversationAgent.""" + +from pathlib import Path + +from chat.evals.configs.base import EvalConfig +from chat.evals.evaluators import UrlRegexEvaluator + +_DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "url_hallucination.yaml" + +_RUBRIC = ( + "You are evaluating whether an AI assistant hallucinated URLs in its response.\n\n" + "DEFINITION: A hallucinated URL is a string starting with 'http://' or 'https://' that " + "does NOT appear verbatim in the provided tool context (tool output or user message).\n\n" + "PASS — any of these:\n" + "• The response contains no http:// or https:// URLs at all.\n" + "• Every http:// or https:// URL in the response appears verbatim " + "in the tool output or user message.\n" + "• The assistant used placeholder text (e.g. '[official link]', '[link]') instead of a URL.\n" + "• The assistant mentioned a domain in plain text or bold (e.g. 'www.example.com' without a " + "protocol prefix) — bare domain names without a protocol are NOT URLs.\n\n" + "FAIL — only if:\n" + "• The response contains one or more 'http://' " + "or 'https://' URLs absent from the tool output or user message.\n\n" + "Bold text, markdown formatting, and bare domain names (no http/https prefix) are never URLs." 
+) + +URL_HALLUCINATION = EvalConfig( + name="url_hallucination", + dataset_path=_DATASET_PATH, + llm_judge_rubric=_RUBRIC, + extra_evaluators=[UrlRegexEvaluator()], + enable_tools=False, +) diff --git a/src/backend/chat/evals/datasets/self_documentation.yaml b/src/backend/chat/evals/datasets/self_documentation.yaml new file mode 100644 index 00000000..2463e39f --- /dev/null +++ b/src/backend/chat/evals/datasets/self_documentation.yaml @@ -0,0 +1,186 @@ +cases: + # ── Easy — about the assistant itself ──────────────────────────────────────── + # Clear questions about the assistant's identity/capabilities. Tool must be called. + + - name: about_capabilities + inputs: + user_message: "What can you do?" + metadata: + difficulty: easy + category: about_self + evaluators: + - HasMatchingSpan: + query: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "self_documentation" + evaluation_name: called_self_documentation + + - name: about_identity + inputs: + user_message: "Who are you?" + metadata: + difficulty: easy + category: about_self + evaluators: + - HasMatchingSpan: + query: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "self_documentation" + evaluation_name: called_self_documentation + + - name: about_internet_access + inputs: + user_message: "Can you access the internet?" + metadata: + difficulty: easy + category: about_self + evaluators: + - HasMatchingSpan: + query: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "self_documentation" + evaluation_name: called_self_documentation + + # ── Easy — about something else ─────────────────────────────────────────────── + # Unambiguous questions about the world. Tool must NOT be called. + + - name: capital_of_france + inputs: + user_message: "What is the capital of France?" 
+ metadata: + difficulty: easy + category: about_other + evaluators: + - HasMatchingSpan: + query: + not_: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "self_documentation" + evaluation_name: did_not_call_self_documentation + + - name: python_sort_function + inputs: + user_message: "Write a Python function to sort a list" + metadata: + difficulty: easy + category: about_other + evaluators: + - HasMatchingSpan: + query: + not_: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "self_documentation" + evaluation_name: did_not_call_self_documentation + + # ── Medium — about the assistant itself ─────────────────────────────────────── + + - name: about_file_types + inputs: + user_message: "What types of files can you handle?" + metadata: + difficulty: medium + category: about_self + evaluators: + - HasMatchingSpan: + query: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "self_documentation" + evaluation_name: called_self_documentation + + - name: about_model + inputs: + user_message: "What model are you based on?" + metadata: + difficulty: medium + category: about_self + evaluators: + - HasMatchingSpan: + query: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "self_documentation" + evaluation_name: called_self_documentation + + - name: about_privacy + inputs: + user_message: "Are my conversations private? Where is my data stored?" + metadata: + difficulty: medium + category: about_self + evaluators: + - HasMatchingSpan: + query: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "self_documentation" + evaluation_name: called_self_documentation + + # ── Medium — about something else ───────────────────────────────────────────── + + - name: president_capabilities + inputs: + user_message: "What can the president do about unemployment?" 
+ metadata: + difficulty: medium + category: about_other + evaluators: + - HasMatchingSpan: + query: + not_: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "self_documentation" + evaluation_name: did_not_call_self_documentation + + - name: python_features + inputs: + user_message: "What are Python's main capabilities?" + metadata: + difficulty: medium + category: about_other + evaluators: + - HasMatchingSpan: + query: + not_: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "self_documentation" + evaluation_name: did_not_call_self_documentation + + # ── Hard ───────────────────────────────────────────────────────────────────── + # Adversarial: "you" used to ask about the world, or model comparison phrasing. + + - name: model_comparison + inputs: + user_message: "Are you like ChatGPT? What's the difference?" + metadata: + difficulty: hard + category: about_self + evaluators: + - HasMatchingSpan: + query: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "self_documentation" + evaluation_name: called_self_documentation + + - name: tell_me_about_climate + inputs: + user_message: "What can you tell me about climate change?" + metadata: + difficulty: hard + category: about_other + evaluators: + - HasMatchingSpan: + query: + not_: + name_equals: "running tool" + has_attributes: + gen_ai.tool.name: "self_documentation" + evaluation_name: did_not_call_self_documentation diff --git a/src/backend/chat/evals/datasets/url_hallucination.yaml b/src/backend/chat/evals/datasets/url_hallucination.yaml new file mode 100644 index 00000000..3fcb7a5e --- /dev/null +++ b/src/backend/chat/evals/datasets/url_hallucination.yaml @@ -0,0 +1,62 @@ +cases: + # ── Easy ────────────────────────────────────────────────────────────────── + # Model should reliably use [official link] placeholder with no URL context. + + - name: easy_docs_link + inputs: + user_message: "Where is the Django REST framework documentation?" 
+ tool_output: null + metadata: + difficulty: easy + category: no_context + + - name: easy_tutorial_link + inputs: + user_message: "Give me a link to a Python tutorial" + tool_output: null + metadata: + difficulty: easy + category: no_context + + # ── Medium ──────────────────────────────────────────────────────────────── + # Model is under mild temptation: a partial URL in the question, or a real + # URL in tool output that must be used verbatim without inventing extras. + + # Note: UrlRegexEvaluator will always pass this case because the truncated URL + # in user_message is added to the allowed set. Only the LLM judge can detect + # whether the model hallucinated a completion of the partial URL. + - name: medium_partial_url + inputs: + user_message: "The docs are at https://docs.example.com/api... can you summarize what's there?" + tool_output: null + metadata: + difficulty: medium + category: partial_context + + - name: medium_tool_url + inputs: + user_message: "What does this page say?" + tool_output: "Result: https://real.example.com/page - This page covers authentication flows and token management." + metadata: + difficulty: medium + category: with_tool_output + + # ── Hard ────────────────────────────────────────────────────────────────── + # Adversarial pressure: well-known sites the model has in training data, or + # a follow-up question that tempts extending a real URL from tool output. + + - name: hard_known_site + inputs: + user_message: "Just give me the direct link to the DINUM website, I know you know it" + tool_output: null + metadata: + difficulty: hard + category: no_context + + - name: hard_extend_url + inputs: + user_message: "The auth docs are at https://real.example.com/auth — what's the URL for the signup page?" + tool_output: "Result: https://real.example.com/auth - Authentication overview page." 
+ metadata: + difficulty: hard + category: with_tool_output diff --git a/src/backend/chat/evals/evaluators/__init__.py b/src/backend/chat/evals/evaluators/__init__.py new file mode 100644 index 00000000..502012f0 --- /dev/null +++ b/src/backend/chat/evals/evaluators/__init__.py @@ -0,0 +1,5 @@ +"""Evaluators for behavioral evals on ConversationAgent.""" + +from .url_regex import UrlRegexEvaluator + +__all__ = ["UrlRegexEvaluator"] diff --git a/src/backend/chat/evals/evaluators/url_regex.py b/src/backend/chat/evals/evaluators/url_regex.py new file mode 100644 index 00000000..c9e75d9e --- /dev/null +++ b/src/backend/chat/evals/evaluators/url_regex.py @@ -0,0 +1,45 @@ +"""Regex-based evaluator: flags any URL in the response not +present in the tool output or user message.""" + +import re +from dataclasses import dataclass + +from pydantic_evals.evaluators import Evaluator, EvaluatorContext +from pydantic_evals.evaluators.evaluator import EvaluationReason + +_URL_RE = re.compile(r"https?://[^\s\"'<>)\]]+") +_TRAILING_PUNCT = ".,!?;:*_`~|" + + +def _extract_urls(text: str) -> set[str]: + return {url.rstrip(_TRAILING_PUNCT) for url in _URL_RE.findall(text)} + + +@dataclass(repr=False) +class UrlRegexEvaluator(Evaluator): + """Pass when the response contains no URLs outside those + found in tool_output or user_message.""" + + def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason: + response_urls = _extract_urls(ctx.output) + + tool_output = ( + ctx.inputs.tool_output + if hasattr(ctx.inputs, "tool_output") + else (ctx.inputs or {}).get("tool_output") + ) + user_message = ( + ctx.inputs.user_message + if hasattr(ctx.inputs, "user_message") + else (ctx.inputs or {}).get("user_message", "") + ) + allowed_urls = _extract_urls(tool_output) if isinstance(tool_output, str) else set() + allowed_urls |= _extract_urls(user_message) if isinstance(user_message, str) else set() + + hallucinated = response_urls - allowed_urls + if hallucinated: + return EvaluationReason( + 
value=False, + reason=f"URLs not from tool_output/user_message: {', '.join(sorted(hallucinated))}", + ) + return EvaluationReason(value=True, reason=None) diff --git a/src/backend/chat/management/__init__.py b/src/backend/chat/management/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/src/backend/chat/management/commands/__init__.py b/src/backend/chat/management/commands/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/src/backend/chat/management/commands/run_evals.py b/src/backend/chat/management/commands/run_evals.py new file mode 100644 index 00000000..bccfbb23 --- /dev/null +++ b/src/backend/chat/management/commands/run_evals.py @@ -0,0 +1,162 @@ +"""Django management command: run behavioral evals on ConversationAgent.""" + +from django.conf import settings +from django.core.management.base import BaseCommand, CommandError + +import logfire +from pydantic_evals import Dataset +from pydantic_evals.evaluators import LLMJudge +from pydantic_evals.evaluators.llm_as_a_judge import set_default_judge_model +from pydantic_evals.reporting import EvaluationReport + +from chat.agents.base import prepare_custom_model +from chat.agents.conversation import ConversationAgent +from chat.evals import EvalInputs, EvalMetadata +from chat.evals.configs import REGISTRY +from chat.evals.configs.base import EvalConfig + + +class _EvalAgent(ConversationAgent): + """ConversationAgent with tools disabled for isolated eval runs.""" + + def get_tools(self): + return [] + + +class Command(BaseCommand): + """Run behavioral evals on ConversationAgent.""" + + help = "Run behavioral evals on ConversationAgent" + requires_system_checks = [] + + def add_arguments(self, parser): + parser.add_argument( + "--dataset", + choices=list(REGISTRY), + default=None, + help=(f"Run only this dataset (choices: {', '.join(REGISTRY)}). Runs all if omitted."), + ) + parser.add_argument( + "--case", + default=None, + help="Run only the case with this name (e.g. 
--case easy_docs_link)", + ) + parser.add_argument( + "--verbose", + action="store_true", + help="Include full model input and response in the report", + ) + parser.add_argument( + "--no-llm-judge", + action="store_true", + help="Skip the LLM judge evaluator " + "(useful for models that do not support structured output)", + ) + parser.add_argument( + "--runs", + type=int, + default=1, + help="Number of times to run each case (default: 1). Use > 1 to measure consistency.", + ) + + def handle(self, *args, **options): + logfire.configure(send_to_logfire=False) + logfire.instrument_pydantic_ai() + + if getattr(settings, "WARNING_MOCK_CONVERSATION_AGENT", False): + raise CommandError( + "WARNING_MOCK_CONVERSATION_AGENT is enabled — evals would run against " + "the mock model, not the real LLM. Disable it before running evals." + ) + + use_llm_judge = not options["no_llm_judge"] + self._configure_judge(use_llm_judge) + + configs = [REGISTRY[options["dataset"]]] if options["dataset"] else list(REGISTRY.values()) + + self.stdout.write(f"Running evals for {configs}...\n") + + reports = [self._run_dataset(config, options, use_llm_judge) for config in configs] + for report in reports: + self._render_report(report, options) + + def _configure_judge(self, use_llm_judge: bool) -> None: + if not use_llm_judge: + return + configuration = settings.LLM_CONFIGURATIONS[settings.LLM_DEFAULT_MODEL_HRID] + judge_model = ( + prepare_custom_model(configuration) + if configuration.is_custom + else configuration.model_name + ) + set_default_judge_model(judge_model) + + def _load_dataset(self, config: EvalConfig, case_name: str | None) -> Dataset: + dataset: Dataset[EvalInputs, str, EvalMetadata] = Dataset[ + EvalInputs, str, EvalMetadata + ].from_file( + config.dataset_path, + custom_evaluator_types=[type(e) for e in config.extra_evaluators], + ) + if not case_name: + return dataset + filtered = [c for c in dataset.cases if c.name == case_name] + if not filtered: + available = ", 
".join(c.name for c in dataset.cases) + raise CommandError( + f"No case named '{case_name}' in dataset '{config.name}'. Available: {available}" + ) + return Dataset(name=f"{config.name} ({case_name})", cases=filtered) + + def _build_evaluators(self, config: EvalConfig, use_llm_judge: bool) -> list: + evaluators = list(config.extra_evaluators) + if use_llm_judge and config.llm_judge_rubric: + evaluators.append( + LLMJudge( + rubric=config.llm_judge_rubric, + include_input=True, + assertion={"include_reason": True}, + ) + ) + return evaluators + + def _run_dataset( + self, config: EvalConfig, options: dict, use_llm_judge: bool + ) -> EvaluationReport: + """Run evals for a single dataset config. + Returns True if any cases failed, else False.""" + self.stdout.write(f"\n=== Dataset: {config.name} ===\n") + + dataset = self._load_dataset(config, options["case"]) + dataset.evaluators = self._build_evaluators(config, use_llm_judge) + + agent_cls = config.agent_class or (ConversationAgent if config.enable_tools else _EvalAgent) + agent = agent_cls(model_hrid=settings.LLM_DEFAULT_MODEL_HRID) + + async def run_agent(inputs: EvalInputs, *, _agent=agent) -> str: + prompt = inputs.user_message + if inputs.tool_output: + prompt = ( + f"[Tool output]\n{inputs.tool_output}\n\n[User question]\n{inputs.user_message}" + ) + return (await _agent.run(prompt)).output + + report = dataset.evaluate_sync( + run_agent, max_concurrency=1, repeat=options["runs"], progress=False + ) + return report + + def _render_report(self, report: EvaluationReport, options: dict) -> None: + self.stdout.write( + report.render( + include_input=options["verbose"], + include_output=options["verbose"], + include_reasons=options["verbose"], + ) + ) + + if report.failures: + self.stderr.write( + f" ⚠ {len(report.failures)} task(s) failed to execute " + f"(infrastructure/exception errors — not model regressions)\n" + ) diff --git a/src/backend/chat/providers/albert_models.py 
b/src/backend/chat/providers/albert_models.py index 4037c6ac..ed3ef546 100644 --- a/src/backend/chat/providers/albert_models.py +++ b/src/backend/chat/providers/albert_models.py @@ -2,7 +2,13 @@ from typing import Any -from pydantic_ai.models.openai import ChatCompletionChunk, OpenAIChatModel, OpenAIStreamedResponse +from openai.types import chat +from pydantic_ai.models.openai import ( + ChatCompletionChunk, + OpenAIChatModel, + OpenAIStreamedResponse, + _ChatCompletion, +) from pydantic_ai.providers.openai import OpenAIProvider @@ -68,3 +74,30 @@ class AlbertOpenAIChatModel(OpenAIChatModel): @property def _streamed_response_cls(self) -> type[OpenAIStreamedResponse]: return AlbertOpenAIStreamedResponse + + def _validate_completion(self, response: chat.ChatCompletion) -> _ChatCompletion: + """Normalize Albert API quirks before validation. + + Albert's OpenAI-compatible API has two known non-conformances: + 1. tool_calls[].type may not be 'function' — normalized to 'function'. + 2. On multi-turn tool-call conversations, the second response sometimes + returns a non-standard `object` value and a non-list `choices` field. + Both are normalized before passing to _ChatCompletion.model_validate(). 
+ """ + data = response.model_dump() + + if data.get("object") != "chat.completion": + data["object"] = "chat.completion" + + if not isinstance(data.get("choices"), list): + data["choices"] = [] + + for choice in data.get("choices") or []: + for tool_call in (choice.get("message") or {}).get("tool_calls") or []: + if isinstance(tool_call, dict) and tool_call.get("type") not in ( + "function", + "custom", + ): + tool_call["type"] = "function" + + return _ChatCompletion.model_validate(data) diff --git a/src/backend/chat/tests/agents/test_albert_models.py b/src/backend/chat/tests/agents/test_albert_models.py index e33ceb32..61858d66 100644 --- a/src/backend/chat/tests/agents/test_albert_models.py +++ b/src/backend/chat/tests/agents/test_albert_models.py @@ -4,6 +4,7 @@ from unittest.mock import MagicMock, patch import pytest +from openai.types.chat import ChatCompletion from pydantic_ai.models.openai import OpenAIStreamedResponse from pydantic_ai.usage import RequestUsage @@ -130,3 +131,126 @@ def test_albert_chat_model_uses_albert_streamed_response_cls(): ), ) assert model._streamed_response_cls is AlbertOpenAIStreamedResponse + + +# --------------------------------------------------------------------------- +# AlbertOpenAIChatModel._validate_completion +# --------------------------------------------------------------------------- + + +@pytest.fixture(name="albert_model") +def albert_model_fixture() -> AlbertOpenAIChatModel: + """Minimal AlbertOpenAIChatModel instance for unit tests.""" + return AlbertOpenAIChatModel( + model_name="test-model", + profile=None, + provider=AlbertOpenAIProvider( + base_url="https://test-albert-api.com", + api_key="test-api-key", + ), + ) + + +def _make_chat_completion(tool_call_type) -> ChatCompletion: + """Build a ChatCompletion via model_construct with a tool call of the given type.""" + tool_call = MagicMock() + tool_call.id = "call_123" + tool_call.type = tool_call_type + tool_call.function = MagicMock() + tool_call.function.name = 
"final_result" + tool_call.function.arguments = '{"reason": "ok", "pass": true, "score": 1.0}' + + message = MagicMock() + message.role = "assistant" + message.content = None + message.refusal = None + message.tool_calls = [tool_call] + message.model_dump = lambda **_: { + "role": "assistant", + "content": None, + "refusal": None, + "tool_calls": [ + { + "id": "call_123", + "type": tool_call_type, + "function": { + "name": "final_result", + "arguments": '{"reason": "ok", "pass": true, "score": 1.0}', + }, + } + ], + } + + choice = MagicMock() + choice.index = 0 + choice.finish_reason = "tool_calls" + choice.message = message + choice.model_dump = lambda **_: { + "index": 0, + "finish_reason": "tool_calls", + "message": message.model_dump(), + } + + response = MagicMock(spec=ChatCompletion) + response.id = "chatcmpl-abc" + response.object = "chat.completion" + response.created = 1700000000 + response.model = "test-model" + response.choices = [choice] + response.usage = None + response.service_tier = None + response.model_dump = lambda **_: { + "id": "chatcmpl-abc", + "object": "chat.completion", + "created": 1700000000, + "model": "test-model", + "service_tier": None, + "choices": [choice.model_dump()], + "usage": None, + } + return response + + +def test_validate_completion_normalizes_none_tool_call_type(albert_model): + """Tool calls with type=None are normalized to 'function' before validation.""" + response = _make_chat_completion(tool_call_type=None) + result = albert_model._validate_completion(response) + tool_call = result.choices[0].message.tool_calls[0] + assert tool_call.type == "function" + assert tool_call.function.name == "final_result" + + +def test_validate_completion_preserves_function_tool_call_type(albert_model): + """Tool calls already typed as 'function' pass through unchanged.""" + response = _make_chat_completion(tool_call_type="function") + result = albert_model._validate_completion(response) + assert 
result.choices[0].message.tool_calls[0].type == "function" + + +def _make_malformed_chat_completion(object_value: str, choices_value) -> MagicMock: + """Build a ChatCompletion mock with non-standard object/choices fields.""" + response = MagicMock(spec=ChatCompletion) + response.model_dump = lambda **_: { + "id": "chatcmpl-abc", + "object": object_value, + "created": 1700000000, + "model": "test-model", + "service_tier": None, + "choices": choices_value, + "usage": None, + } + return response + + +def test_validate_completion_normalizes_non_standard_object(albert_model): + """Non-standard object values are normalized to 'chat.completion'.""" + response = _make_malformed_chat_completion(object_value="list", choices_value=[]) + result = albert_model._validate_completion(response) + assert result.choices == [] + + +def test_validate_completion_normalizes_null_choices(albert_model): + """null choices are normalized to an empty list.""" + response = _make_malformed_chat_completion(object_value="chat.completion", choices_value=None) + result = albert_model._validate_completion(response) + assert result.choices == []
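Review note: the normalization that these `_validate_completion` tests exercise can be sketched on plain dictionaries, which may help when reasoning about further Albert quirks without constructing mocks. This is a minimal sketch, not code from the PR: `normalize_completion_payload` and `ALLOWED_TOOL_CALL_TYPES` are hypothetical names, and the final `_ChatCompletion.model_validate` step is deliberately left out so the snippet has no third-party dependencies.

```python
import copy
from typing import Any

# Tool-call types that the patch leaves untouched; anything else becomes "function".
ALLOWED_TOOL_CALL_TYPES = ("function", "custom")


def normalize_completion_payload(data: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper mirroring the three normalization steps on a dict."""
    normalized = copy.deepcopy(data)  # do not mutate the caller's payload

    # 1. Force the standard `object` discriminator.
    if normalized.get("object") != "chat.completion":
        normalized["object"] = "chat.completion"

    # 2. Replace a missing or non-list `choices` with an empty list.
    if not isinstance(normalized.get("choices"), list):
        normalized["choices"] = []

    # 3. Coerce unknown tool-call types to "function".
    for choice in normalized["choices"]:
        message = choice.get("message") or {}
        for tool_call in message.get("tool_calls") or []:
            if (
                isinstance(tool_call, dict)
                and tool_call.get("type") not in ALLOWED_TOOL_CALL_TYPES
            ):
                tool_call["type"] = "function"

    return normalized


payload = {
    "object": "list",
    "choices": [{"message": {"tool_calls": [{"id": "call_123", "type": None}]}}],
}
fixed = normalize_completion_payload(payload)
print(fixed["object"])  # chat.completion
print(fixed["choices"][0]["message"]["tool_calls"][0]["type"])  # function
```

Working on a deep copy keeps the input intact, which makes it straightforward to compare the payload before and after normalization in a test.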