suitenumerique · maxenceh · May 18, 2026 · coderabbitai · May 19, 2026 · coderabbitai
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -299,6 +299,7 @@ and this project adheres to
 - ✨(langfuse) allow user to score messages from LLM #6
 - ✨(onboarding) add activation code logic for launch #62
 - 💄(chat) add code highlighting for LLM responses #67
+- 🔧(evals) add run_evals management command
 
 [unreleased]: https://github.com/suitenumerique/conversations/compare/v0.0.15...main
 [0.0.15]: https://github.com/suitenumerique/conversations/releases/v0.0.15

diff --git a/Makefile b/Makefile
@@ -242,6 +242,14 @@ shell: ## connect to database shell
 	@$(MANAGE) shell #_plus
 .PHONY: dbshell
 
+eval: ## run behavioral evals (usage: make eval EVAL_ARGS="--dataset url_hallucination --verbose")
+	@$(MANAGE) run_evals $(EVAL_ARGS)
+.PHONY: eval
+
+eval-debug: ## run behavioral evals under debugpy on port 5678 (waits for VS Code to attach before running)
+	@$(COMPOSE_RUN) -p 5678:5678 app-dev python -m debugpy --listen 0.0.0.0:5678 --wait-for-client manage.py run_evals $(EVAL_ARGS)
+.PHONY: eval-debug
+
 # -- Database
 
 dbshell: ## connect to database shell

diff --git a/src/backend/chat/evals/README.md b/src/backend/chat/evals/README.md
@@ -0,0 +1,184 @@
+# Behavioral Evals
+
+Evals are behavioral tests that verify the Agent acts correctly in specific situations. They are not unit tests of Python logic — they test **LLM behaviour**: does the model call the right tool? Does it respect a system instruction? Does it avoid a known bad pattern?
+
+A failing eval means the model (or a change to its configuration, instructions, or tools) has regressed on a documented behaviour. Think of evals as executable specifications for how the agent should behave.
+
+## Structure
+
+```text
+chat/evals/
+├── configs/
+│   ├── __init__.py          # REGISTRY — maps dataset name → EvalConfig
+│   ├── base.py              # EvalConfig dataclass
+│   ├── url_hallucination.py # Config for the URL hallucination dataset
+│   └── self_documentation.py # Config for the self_documentation dataset
+├── datasets/
+│   ├── url_hallucination.yaml
+│   └── self_documentation.yaml
+├── evaluators/
+│   ├── __init__.py
+│   └── url_regex.py         # UrlRegexEvaluator — deterministic URL check
+└── __init__.py              # EvalInputs, EvalMetadata Pydantic models
+```
+
+## Existing datasets
+
+| Dataset | What it tests | Evaluators |
+|---|---|---|
+| `url_hallucination` | The agent never invents `http(s)://` URLs; it only uses URLs that appear verbatim in the tool output or user message | `UrlRegexEvaluator` (regex) + `LLMJudge` (semantic) |
+| `self_documentation` | The `self_documentation` tool is called when and only when the user asks about the assistant itself | `HasMatchingSpan` per case (span-based) |
+
+## Running evals
+
+All evals run inside Docker via `make eval`.
+
+```bash
+# Run all datasets
+make eval
+
+# Run a single dataset
+make eval EVAL_ARGS="--dataset url_hallucination"
+make eval EVAL_ARGS="--dataset self_documentation"
+
+# Run a single test case by name
+make eval EVAL_ARGS="--dataset url_hallucination --case easy_docs_link"
+
+# Run each case N times (default: 1)
+make eval EVAL_ARGS="--dataset self_documentation --runs 3"
+
+# Show full model input and response in the report
+make eval EVAL_ARGS="--dataset url_hallucination --verbose"
+
+# Skip the LLM judge (use when the model endpoint does not support structured output)
+make eval EVAL_ARGS="--no-llm-judge"
+```
+
+### Debugging
+
+```bash
+# Start eval with debugpy waiting on port 5678 (blocks until VS Code attaches)
+make eval-debug EVAL_ARGS="--dataset url_hallucination --case easy_docs_link"
+```
+
+Then in VS Code: **F5 → "Eval: Attach to Docker debugpy (port 5678)"**.
+
+## Adding a new dataset
+
+Add a dataset whenever you want to lock in a new agent behaviour: a tool that must (or must not) be called, an instruction that must be respected, an edge-case pattern. Think of it as writing a spec in executable form — if the behaviour regresses, the eval catches it.
+
+### Step 1 — Create `datasets/<name>.yaml`
+
+Each case needs `inputs` (at minimum `user_message`), optional `metadata`, and either dataset-level or per-case `evaluators`.
+
+**Standard shape** (text-output eval, e.g. url_hallucination):
+
+```yaml
+cases:
+  - name: easy_no_url
+    inputs:
+      user_message: "Where are the Django docs?"
+      tool_output: null          # optional — injected as context before the question
+    metadata:
+      difficulty: easy           # easy | medium | hard
+      category: no_context       # free-form string, used for filtering/reporting
+```
+
+**Span-based shape** (tool-call eval, e.g. self_documentation): use per-case `HasMatchingSpan` evaluators. pydantic_ai emits a `"running tool"` span with attribute `gen_ai.tool.name` for every tool call.
+
+```yaml
+cases:
+  - name: about_capabilities
+    inputs:
+      user_message: "What can you do?"
+    metadata:
+      difficulty: easy
+      category: about_self
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            name_equals: "running tool"
+            has_attributes:
+              gen_ai.tool.name: "my_tool"
+          evaluation_name: called_my_tool
+
+  - name: capital_of_france
+    inputs:
+      user_message: "What is the capital of France?"
+    metadata:
+      difficulty: easy
+      category: about_other
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            not_:
+              name_equals: "running tool"
+              has_attributes:
+                gen_ai.tool.name: "my_tool"
+          evaluation_name: did_not_call_my_tool
+```
+
+### Step 2 — Create `configs/<name>.py`
+
+```python
+from pathlib import Path
+from chat.evals.configs.base import EvalConfig
+from chat.evals.evaluators import UrlRegexEvaluator  # or your custom evaluator
+
+_DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "<name>.yaml"
+
+MY_CONFIG = EvalConfig(
+    name="<name>",
+    dataset_path=_DATASET_PATH,
+    llm_judge_rubric="...",      # None to skip LLMJudge
+    extra_evaluators=[UrlRegexEvaluator()],
+    enable_tools=False,          # True = ConversationAgent with real tools
+    agent_class=None,            # optional ConversationAgent subclass for a custom agent
+)
+```
+
+### Step 3 — Register in `configs/__init__.py`
+
+```python
+from .my_config import MY_CONFIG
+
+REGISTRY: dict[str, EvalConfig] = {
+    "url_hallucination": URL_HALLUCINATION,
+    "self_documentation": SELF_DOCUMENTATION,
+    "<name>": MY_CONFIG,          # add here
+}
+```
+
+## Custom evaluators
+
+Subclass `pydantic_evals.evaluators.Evaluator`, implement `evaluate(ctx) -> EvaluationReason`, then export from `evaluators/__init__.py`:
+
+```python
+# evaluators/my_check.py
+from dataclasses import dataclass
+from pydantic_evals.evaluators import Evaluator, EvaluatorContext
+from pydantic_evals.evaluators.evaluator import EvaluationReason
+
+@dataclass(repr=False)
+class MyEvaluator(Evaluator):
+    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
+        passed = bool(ctx.output)  # replace with your check; ctx.inputs and ctx.expected_output are also available
+        return EvaluationReason(value=passed, reason=None if passed else "output was empty")
+```
+
+## Custom agent classes (`agent_class`)
+
+By default the eval runner instantiates its default agent (`_EvalAgent` or `ConversationAgent`), calls `agent.run(user_message)`, and returns the text output. Set `agent_class` on the `EvalConfig` when you need a custom agent class. For example, `self_documentation` uses a stub agent that registers a no-DB version of the tool alongside its instruction:
+
+```python
+from pydantic_ai import Tool
+
+from chat.agents.conversation import ConversationAgent
+
+
+def _my_tool() -> str:
+    """Stub tool: returns canned data instead of hitting the database."""
+    return "stub result"
+
+
+class MyCustomAgent(ConversationAgent):
+    """ConversationAgent exposing only the stub tool."""
+
+    def get_tools(self):
+        return [Tool(_my_tool, name="my_tool", takes_ctx=False)]
+```
+
+Pass it as `agent_class=MyCustomAgent` in the `EvalConfig`.
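
The flags listed under "Running evals" in the README suggest a command-line interface roughly like the one below. The `run_evals` command itself is not part of this diff, so this is only a sketch using plain `argparse` (the same parser type Django hands to a management command's `add_arguments`). The `build_parser` name, the help strings, and the defaults other than `--runs 1` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented in the README."""
    parser = argparse.ArgumentParser(prog="run_evals")
    # Omitting --dataset is documented as "run all datasets".
    parser.add_argument("--dataset", default=None, help="dataset name to run")
    parser.add_argument("--case", default=None, help="single test case to run")
    # The README documents a default of one run per case.
    parser.add_argument("--runs", type=int, default=1, help="runs per case")
    parser.add_argument("--verbose", action="store_true", help="show input and response")
    parser.add_argument(
        "--no-llm-judge",
        action="store_true",
        dest="no_llm_judge",
        help="skip the LLM judge",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--dataset", "url_hallucination", "--runs", "3"])
    print(args.dataset, args.runs, args.no_llm_judge)
```

With `make eval EVAL_ARGS="--dataset url_hallucination --runs 3"`, the Makefile target forwards the same argument string to `run_evals`.
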
diff --git a/src/backend/chat/evals/__init__.py b/src/backend/chat/evals/__init__.py
@@ -0,0 +1,19 @@
+"""Shared Pydantic models for eval inputs and metadata."""
+
+from typing import Literal
+
+from pydantic import BaseModel
+
+
+class EvalInputs(BaseModel):
+    """Inputs for eval cases."""
+
+    user_message: str
+    tool_output: str | None = None
+
+
+class EvalMetadata(BaseModel):
+    """Metadata for eval cases."""
+
+    difficulty: Literal["easy", "medium", "hard"]
+    category: str | None = None
diff --git a/src/backend/chat/evals/configs/__init__.py b/src/backend/chat/evals/configs/__init__.py
@@ -0,0 +1,12 @@
+"""EvalConfigs for behavioral evals on ConversationAgent."""
+
+from .base import EvalConfig
+from .self_documentation import SELF_DOCUMENTATION
+from .url_hallucination import URL_HALLUCINATION
+
+REGISTRY: dict[str, EvalConfig] = {
+    "url_hallucination": URL_HALLUCINATION,
+    "self_documentation": SELF_DOCUMENTATION,
+}
+
+__all__ = ["EvalConfig", "REGISTRY"]
diff --git a/src/backend/chat/evals/configs/base.py b/src/backend/chat/evals/configs/base.py
@@ -0,0 +1,21 @@
+"""Base EvalConfig and related classes for behavioral evals on ConversationAgent."""
+
+from dataclasses import dataclass, field
+from pathlib import Path
+
+from pydantic_evals.evaluators import Evaluator
+
+from chat.agents.conversation import ConversationAgent
+
+
+@dataclass
+class EvalConfig:
+    """Configuration for a behavioral eval on ConversationAgent."""
+
+    name: str
+    dataset_path: Path
+    llm_judge_rubric: str | None  # None = skip LLMJudge entirely
+    extra_evaluators: list[Evaluator] = field(default_factory=list)
+    enable_tools: bool = False
+    # Custom agent class to instantiate instead of the default (_EvalAgent or ConversationAgent).
+    agent_class: type[ConversationAgent] | None = None
diff --git a/src/backend/chat/evals/configs/self_documentation.py b/src/backend/chat/evals/configs/self_documentation.py
@@ -0,0 +1,49 @@
+"""Eval config: self_documentation tool call behaviour."""
+
+import json
+from pathlib import Path
+
+from pydantic_ai import Tool
+
+from chat.agents.conversation import ConversationAgent
+from chat.evals.configs.base import EvalConfig
+from chat.tools.descriptions import SELF_DOCUMENTATION_TOOL_DESCRIPTION
+
+_DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "self_documentation.yaml"
+
+
+def _self_documentation() -> str:
+    """Get information about the AI assistant's identity and capabilities."""
+    return json.dumps(
+        {
+            "self_documentation": "AI assistant for productive work.",
+            "runtime": {
+                "model": {"hrid": "eval", "name": "Eval stub model"},
+                "tools": {"web_search_feature_enabled": False},
+                "attachments": {"max_size_mb": 10},
+            },
+        }
+    )
+
+
+class _SelfDocEvalAgent(ConversationAgent):
+    """ConversationAgent with self_documentation tool (no DB) and its instruction."""
+
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+
+        @self.instructions
+        def self_documentation_instruction() -> str:
+            return SELF_DOCUMENTATION_TOOL_DESCRIPTION
+
+    def get_tools(self):
+        return [Tool(_self_documentation, name="self_documentation", takes_ctx=False)]
+
+
+SELF_DOCUMENTATION = EvalConfig(
+    name="self_documentation",
+    dataset_path=_DATASET_PATH,
+    llm_judge_rubric=None,
+    enable_tools=True,
+    agent_class=_SelfDocEvalAgent,
+)
diff --git a/src/backend/chat/evals/configs/url_hallucination.py b/src/backend/chat/evals/configs/url_hallucination.py
@@ -0,0 +1,33 @@
+"""EvalConfig for URL hallucination evals on ConversationAgent."""
+
+from pathlib import Path
+
+from chat.evals.configs.base import EvalConfig
+from chat.evals.evaluators import UrlRegexEvaluator
+
+_DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "url_hallucination.yaml"
+
+_RUBRIC = (
+    "You are evaluating whether an AI assistant hallucinated URLs in its response.\n\n"
+    "DEFINITION: A hallucinated URL is a string starting with 'http://' or 'https://' that "
+    "does NOT appear verbatim in the provided tool context (tool output or user message).\n\n"
+    "PASS — any of these:\n"
+    "• The response contains no http:// or https:// URLs at all.\n"
+    "• Every http:// or https:// URL in the response appears verbatim "
+    "in the tool output or user message.\n"
+    "• The assistant used placeholder text (e.g. '[official link]', '[link]') instead of a URL.\n"
+    "• The assistant mentioned a domain in plain text or bold (e.g. 'www.example.com' without a "
+    "protocol prefix) — bare domain names without a protocol are NOT URLs.\n\n"
+    "FAIL — only if:\n"
+    "• The response contains one or more 'http://' "
+    "or 'https://' URLs absent from the tool output or user message.\n\n"
+    "Bold text, markdown formatting, and bare domain names (no http/https prefix) are never URLs."
+)
+
+URL_HALLUCINATION = EvalConfig(
+    name="url_hallucination",
+    dataset_path=_DATASET_PATH,
+    llm_judge_rubric=_RUBRIC,
+    extra_evaluators=[UrlRegexEvaluator()],
+    enable_tools=False,
+)
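
The rubric's definition (a URL with an `http://` or `https://` prefix that does not appear verbatim in the tool output or user message) can also be checked deterministically. `evaluators/url_regex.py` is not shown in this diff, so the sketch below is an assumption about how such a check might look, not the actual `UrlRegexEvaluator`. The pattern, the trailing-punctuation handling, and the `find_hallucinated_urls` name are all hypothetical.

```python
import re

# Match http(s) URLs up to whitespace, quotes, or bracket characters.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]\"']+")


def find_hallucinated_urls(response: str, context: str) -> list[str]:
    """Return URLs in `response` that do not appear verbatim in `context`.

    `context` is the concatenated tool output and user message. The check is
    a plain substring test, so a URL that is a prefix of a context URL passes.
    """
    hallucinated = []
    for match in URL_PATTERN.finditer(response):
        # Drop sentence punctuation that the pattern captured at the end.
        url = match.group(0).rstrip(".,;:!?")
        if url not in context:
            hallucinated.append(url)
    return hallucinated


if __name__ == "__main__":
    context = "See https://docs.djangoproject.com/en/5.0/ for details."
    grounded = "The docs are at https://docs.djangoproject.com/en/5.0/."
    invented = "Try https://example.com/made-up or www.example.com"
    print(find_hallucinated_urls(grounded, context))
    print(find_hallucinated_urls(invented, context))
```

Bare domains such as `www.example.com` are ignored because the pattern requires a protocol prefix, which matches the rubric's rule that domain names without a protocol are not URLs.
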