diff --git a/CHANGELOG.md b/CHANGELOG.md
index 247e9e76..36cb60ee 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -299,6 +299,7 @@ and this project adheres to
 - ✨(langfuse) allow user to score messages from LLM #6
 - ✨(onboarding) add activation code logic for launch #62
 - 💄(chat) add code highlighting for LLM responses #67
+- 🔧(evals) add run_evals management command
 
 [unreleased]: https://github.com/suitenumerique/conversations/compare/v0.0.15...main
 [0.0.15]: https://github.com/suitenumerique/conversations/releases/v0.0.15
diff --git a/Makefile b/Makefile
index d090cf2d..788360d1 100644
--- a/Makefile
+++ b/Makefile
@@ -242,6 +242,14 @@ shell: ## connect to database shell
 	@$(MANAGE) shell #_plus
 .PHONY: dbshell
 
+eval: ## run behavioral evals (usage: make eval EVAL_ARGS="--dataset url_hallucination --verbose")
+	@$(MANAGE) run_evals $(EVAL_ARGS)
+.PHONY: eval
+
+eval-debug: ## run behavioral evals under debugpy on port 5678 (blocks until a debugger such as VS Code attaches)
+	@$(COMPOSE_RUN) -p 5678:5678 app-dev python -m debugpy --listen 0.0.0.0:5678 --wait-for-client manage.py run_evals $(EVAL_ARGS)
+.PHONY: eval-debug
+
 # -- Database
 
 dbshell: ## connect to database shell
diff --git a/src/backend/chat/evals/README.md b/src/backend/chat/evals/README.md
new file mode 100644
index 00000000..d2585132
--- /dev/null
+++ b/src/backend/chat/evals/README.md
@@ -0,0 +1,184 @@
+# Behavioral Evals
+
+Evals are behavioral tests that verify the Agent acts correctly in specific situations. They are not unit tests of Python logic — they test **LLM behaviour**: does the model call the right tool? Does it respect a system instruction? Does it avoid a known bad pattern?
+
+A failing eval means the model (or a change to its configuration, instructions, or tools) has regressed on a documented behaviour. Think of evals as executable specifications for how the agent should behave.
+
+## Structure
+
+```text
+chat/evals/
+├── configs/
+│   ├── __init__.py          # REGISTRY — maps dataset name → EvalConfig
+│   ├── base.py              # EvalConfig dataclass
+│   ├── url_hallucination.py # Config for the URL hallucination dataset
+│   └── self_documentation.py # Config for the self_documentation dataset
+├── datasets/
+│   ├── url_hallucination.yaml
+│   └── self_documentation.yaml
+├── evaluators/
+│   ├── __init__.py
+│   └── url_regex.py         # UrlRegexEvaluator — deterministic URL check
+└── __init__.py              # EvalInputs, EvalMetadata Pydantic models
+```
+
+## Existing datasets
+
+| Dataset | What it tests | Evaluators |
+|---|---|---|
+| `url_hallucination` | The agent never invents `http(s)://` URLs; it only uses URLs that appear in the tool output or the user message | `UrlRegexEvaluator` (regex) + `LLMJudge` (semantic) |
+| `self_documentation` | The `self_documentation` tool is called when and only when the user asks about the assistant itself | `HasMatchingSpan` per case (span-based) |
+
+## Running evals
+
+All evals run inside Docker via `make eval`.
+
+```bash
+# Run all datasets
+make eval
+
+# Run a single dataset
+make eval EVAL_ARGS="--dataset url_hallucination"
+make eval EVAL_ARGS="--dataset self_documentation"
+
+# Run a single test case by name
+make eval EVAL_ARGS="--dataset url_hallucination --case easy_docs_link"
+
+# Run each case N times (default: 1)
+make eval EVAL_ARGS="--dataset self_documentation --runs 3"
+
+# Show full model input and response in the report
+make eval EVAL_ARGS="--dataset url_hallucination --verbose"
+
+# Skip the LLM judge (use when the model endpoint does not support structured output)
+make eval EVAL_ARGS="--no-llm-judge"
+```
+
+### Debugging
+
+```bash
+# Start eval with debugpy waiting on port 5678 (blocks until VS Code attaches)
+make eval-debug EVAL_ARGS="--dataset url_hallucination --case easy_docs_link"
+```
+
+Then in VS Code: **F5 → "Eval: Attach to Docker debugpy (port 5678)"**.
+
+## Adding a new dataset
+
+Add a dataset whenever you want to lock in a new agent behaviour: a tool that must (or must not) be called, an instruction that must be respected, an edge-case pattern. Think of it as writing a spec in executable form — if the behaviour regresses, the eval catches it.
+
+### Step 1 — Create `datasets/<name>.yaml`
+
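+Each case needs `inputs` (at minimum `user_message`) and may carry `metadata` (if present, `difficulty` is required). Evaluators are either dataset-level, set through the `EvalConfig` (`extra_evaluators`, `llm_judge_rubric`), or per-case, declared under `evaluators` in the YAML.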
+
+**Standard shape** (text-output eval, e.g. url_hallucination):
+
+```yaml
+cases:
+  - name: easy_no_url
+    inputs:
+      user_message: "Where are the Django docs?"
+      tool_output: null          # optional — injected as context before the question
+    metadata:
+      difficulty: easy           # easy | medium | hard
+      category: no_context       # free-form string, used for filtering/reporting
+```
+
+**Span-based shape** (tool-call eval, e.g. self_documentation): use per-case `HasMatchingSpan` evaluators. pydantic_ai emits a `"running tool"` span with attribute `gen_ai.tool.name` for every tool call.
+
+```yaml
+cases:
+  - name: about_capabilities
+    inputs:
+      user_message: "What can you do?"
+    metadata:
+      difficulty: easy
+      category: about_self
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            name_equals: "running tool"
+            has_attributes:
+              gen_ai.tool.name: "my_tool"
+          evaluation_name: called_my_tool
+
+  - name: capital_of_france
+    inputs:
+      user_message: "What is the capital of France?"
+    metadata:
+      difficulty: easy
+      category: about_other
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            not_:
+              name_equals: "running tool"
+              has_attributes:
+                gen_ai.tool.name: "my_tool"
+          evaluation_name: did_not_call_my_tool
+```
+
+### Step 2 — Create `configs/<name>.py`
+
+```python
+from pathlib import Path
+from chat.evals.configs.base import EvalConfig
+from chat.evals.evaluators import UrlRegexEvaluator  # or your custom evaluator
+
+_DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "<name>.yaml"
+
+MY_CONFIG = EvalConfig(
+    name="<name>",
+    dataset_path=_DATASET_PATH,
+    llm_judge_rubric="...",      # None to skip LLMJudge
+    extra_evaluators=[UrlRegexEvaluator()],
+    enable_tools=False,          # True = ConversationAgent with real tools
+    agent_class=None,            # see below if you need a custom agent
+)
+```
+
+### Step 3 — Register in `configs/__init__.py`
+
+```python
+from .<name> import MY_CONFIG  # the module created in step 2
+
+REGISTRY: dict[str, EvalConfig] = {
+    "url_hallucination": URL_HALLUCINATION,
+    "self_documentation": SELF_DOCUMENTATION,
+    "<name>": MY_CONFIG,          # add here
+}
+```
+
+## Custom evaluators
+
+Subclass `pydantic_evals.evaluators.Evaluator`, implement `evaluate(ctx) -> EvaluationReason`, then export from `evaluators/__init__.py`:
+
+```python
+# evaluators/my_check.py
+from dataclasses import dataclass
+from pydantic_evals.evaluators import Evaluator, EvaluatorContext
+from pydantic_evals.evaluators.evaluator import EvaluationReason
+
+@dataclass(repr=False)
+class MyEvaluator(Evaluator):
+    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
+        passed = ...  # inspect ctx.output, ctx.inputs, ctx.expected_output
+        return EvaluationReason(value=passed, reason="explanation if failed")
+```
+
+## `agent_class`: custom agent classes
+
+By default the eval runner instantiates `_EvalAgent` (a `ConversationAgent` with tools disabled), or `ConversationAgent` itself when `enable_tools=True`, calls `agent.run(...)` on the prompt and returns the text output. Set `agent_class` when you need a custom agent class. For example, `self_documentation` uses a stub agent that registers a no-DB version of the tool alongside its instruction:
+
+```python
+from pydantic_ai import Tool
+from chat.agents.conversation import ConversationAgent
+
+class MyCustomAgent(ConversationAgent):
+    """ConversationAgent exposing only a stub tool (no DB access)."""
+
+    def get_tools(self):
+        return [Tool(my_stub_tool, name="my_tool", takes_ctx=False)]  # my_stub_tool: your stub function
+```
+
+Pass it as `agent_class=MyCustomAgent` in the `EvalConfig`.
diff --git a/src/backend/chat/evals/__init__.py b/src/backend/chat/evals/__init__.py
new file mode 100644
index 00000000..166c9336
--- /dev/null
+++ b/src/backend/chat/evals/__init__.py
@@ -0,0 +1,19 @@
+"""Shared Pydantic models for eval inputs and metadata."""
+
+from typing import Literal
+
+from pydantic import BaseModel
+
+
+class EvalInputs(BaseModel):
+    """Inputs for eval cases."""
+
+    user_message: str
+    tool_output: str | None = None
+
+
+class EvalMetadata(BaseModel):
+    """Metadata for eval cases."""
+
+    difficulty: Literal["easy", "medium", "hard"]
+    category: str | None = None
diff --git a/src/backend/chat/evals/configs/__init__.py b/src/backend/chat/evals/configs/__init__.py
new file mode 100644
index 00000000..13130da5
--- /dev/null
+++ b/src/backend/chat/evals/configs/__init__.py
@@ -0,0 +1,12 @@
+"""EvalConfigs for behavioral evals on ConversationAgent."""
+
+from .base import EvalConfig
+from .self_documentation import SELF_DOCUMENTATION
+from .url_hallucination import URL_HALLUCINATION
+
+REGISTRY: dict[str, EvalConfig] = {
+    "url_hallucination": URL_HALLUCINATION,
+    "self_documentation": SELF_DOCUMENTATION,
+}
+
+__all__ = ["EvalConfig", "REGISTRY"]
diff --git a/src/backend/chat/evals/configs/base.py b/src/backend/chat/evals/configs/base.py
new file mode 100644
index 00000000..bc9c7aec
--- /dev/null
+++ b/src/backend/chat/evals/configs/base.py
@@ -0,0 +1,21 @@
+"""Base EvalConfig and related classes for behavioral evals on ConversationAgent."""
+
+from dataclasses import dataclass, field
+from pathlib import Path
+
+from pydantic_evals.evaluators import Evaluator
+
+from chat.agents.conversation import ConversationAgent
+
+
+@dataclass
+class EvalConfig:
+    """Configuration for a behavioral eval on ConversationAgent."""
+
+    name: str
+    dataset_path: Path
+    llm_judge_rubric: str | None  # None = skip LLMJudge entirely
+    extra_evaluators: list[Evaluator] = field(default_factory=list)
+    enable_tools: bool = False
+    # Custom agent class to instantiate instead of the default (_EvalAgent or ConversationAgent).
+    agent_class: type[ConversationAgent] | None = None
diff --git a/src/backend/chat/evals/configs/self_documentation.py b/src/backend/chat/evals/configs/self_documentation.py
new file mode 100644
index 00000000..1b9ee80e
--- /dev/null
+++ b/src/backend/chat/evals/configs/self_documentation.py
@@ -0,0 +1,49 @@
+"""Eval config: self_documentation tool call behaviour."""
+
+import json
+from pathlib import Path
+
+from pydantic_ai import Tool
+
+from chat.agents.conversation import ConversationAgent
+from chat.evals.configs.base import EvalConfig
+from chat.tools.descriptions import SELF_DOCUMENTATION_TOOL_DESCRIPTION
+
+_DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "self_documentation.yaml"
+
+
+def _self_documentation() -> str:
+    """Get information about the AI assistant's identity and capabilities."""
+    return json.dumps(
+        {
+            "self_documentation": "AI assistant for productive work.",
+            "runtime": {
+                "model": {"hrid": "eval", "name": "Eval stub model"},
+                "tools": {"web_search_feature_enabled": False},
+                "attachments": {"max_size_mb": 10},
+            },
+        }
+    )
+
+
+class _SelfDocEvalAgent(ConversationAgent):
+    """ConversationAgent with self_documentation tool (no DB) and its instruction."""
+
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+
+        @self.instructions
+        def self_documentation_instruction() -> str:
+            return SELF_DOCUMENTATION_TOOL_DESCRIPTION
+
+    def get_tools(self):
+        return [Tool(_self_documentation, name="self_documentation", takes_ctx=False)]
+
+
+SELF_DOCUMENTATION = EvalConfig(
+    name="self_documentation",
+    dataset_path=_DATASET_PATH,
+    llm_judge_rubric=None,
+    enable_tools=True,
+    agent_class=_SelfDocEvalAgent,
+)
diff --git a/src/backend/chat/evals/configs/url_hallucination.py b/src/backend/chat/evals/configs/url_hallucination.py
new file mode 100644
index 00000000..0242ec7e
--- /dev/null
+++ b/src/backend/chat/evals/configs/url_hallucination.py
@@ -0,0 +1,33 @@
+"""EvalConfig for URL hallucination evals on ConversationAgent."""
+
+from pathlib import Path
+
+from chat.evals.configs.base import EvalConfig
+from chat.evals.evaluators import UrlRegexEvaluator
+
+_DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "url_hallucination.yaml"
+
+_RUBRIC = (
+    "You are evaluating whether an AI assistant hallucinated URLs in its response.\n\n"
+    "DEFINITION: A hallucinated URL is a string starting with 'http://' or 'https://' that "
+    "does NOT appear verbatim in the provided tool context (tool output or user message).\n\n"
+    "PASS — any of these:\n"
+    "• The response contains no http:// or https:// URLs at all.\n"
+    "• Every http:// or https:// URL in the response appears verbatim "
+    "in the tool output or user message.\n"
+    "• The assistant used placeholder text (e.g. '[official link]', '[link]') instead of a URL.\n"
+    "• The assistant mentioned a domain in plain text or bold (e.g. 'www.example.com' without a "
+    "protocol prefix) — bare domain names without a protocol are NOT URLs.\n\n"
+    "FAIL — only if:\n"
+    "• The response contains one or more 'http://' "
+    "or 'https://' URLs absent from the tool output or user message.\n\n"
+    "Bold text, markdown formatting, and bare domain names (no http/https prefix) are never URLs."
+)
+
+URL_HALLUCINATION = EvalConfig(
+    name="url_hallucination",
+    dataset_path=_DATASET_PATH,
+    llm_judge_rubric=_RUBRIC,
+    extra_evaluators=[UrlRegexEvaluator()],
+    enable_tools=False,
+)
diff --git a/src/backend/chat/evals/datasets/self_documentation.yaml b/src/backend/chat/evals/datasets/self_documentation.yaml
new file mode 100644
index 00000000..2463e39f
--- /dev/null
+++ b/src/backend/chat/evals/datasets/self_documentation.yaml
@@ -0,0 +1,186 @@
+cases:
+  # ── Easy — about the assistant itself ────────────────────────────────────────
+  # Clear questions about the assistant's identity/capabilities. Tool must be called.
+
+  - name: about_capabilities
+    inputs:
+      user_message: "What can you do?"
+    metadata:
+      difficulty: easy
+      category: about_self
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            name_equals: "running tool"
+            has_attributes:
+              gen_ai.tool.name: "self_documentation"
+          evaluation_name: called_self_documentation
+
+  - name: about_identity
+    inputs:
+      user_message: "Who are you?"
+    metadata:
+      difficulty: easy
+      category: about_self
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            name_equals: "running tool"
+            has_attributes:
+              gen_ai.tool.name: "self_documentation"
+          evaluation_name: called_self_documentation
+
+  - name: about_internet_access
+    inputs:
+      user_message: "Can you access the internet?"
+    metadata:
+      difficulty: easy
+      category: about_self
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            name_equals: "running tool"
+            has_attributes:
+              gen_ai.tool.name: "self_documentation"
+          evaluation_name: called_self_documentation
+
+  # ── Easy — about something else ───────────────────────────────────────────────
+  # Unambiguous questions about the world. Tool must NOT be called.
+
+  - name: capital_of_france
+    inputs:
+      user_message: "What is the capital of France?"
+    metadata:
+      difficulty: easy
+      category: about_other
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            not_:
+              name_equals: "running tool"
+              has_attributes:
+                gen_ai.tool.name: "self_documentation"
+          evaluation_name: did_not_call_self_documentation
+
+  - name: python_sort_function
+    inputs:
+      user_message: "Write a Python function to sort a list"
+    metadata:
+      difficulty: easy
+      category: about_other
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            not_:
+              name_equals: "running tool"
+              has_attributes:
+                gen_ai.tool.name: "self_documentation"
+          evaluation_name: did_not_call_self_documentation
+
+  # ── Medium — about the assistant itself ───────────────────────────────────────
+
+  - name: about_file_types
+    inputs:
+      user_message: "What types of files can you handle?"
+    metadata:
+      difficulty: medium
+      category: about_self
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            name_equals: "running tool"
+            has_attributes:
+              gen_ai.tool.name: "self_documentation"
+          evaluation_name: called_self_documentation
+
+  - name: about_model
+    inputs:
+      user_message: "What model are you based on?"
+    metadata:
+      difficulty: medium
+      category: about_self
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            name_equals: "running tool"
+            has_attributes:
+              gen_ai.tool.name: "self_documentation"
+          evaluation_name: called_self_documentation
+
+  - name: about_privacy
+    inputs:
+      user_message: "Are my conversations private? Where is my data stored?"
+    metadata:
+      difficulty: medium
+      category: about_self
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            name_equals: "running tool"
+            has_attributes:
+              gen_ai.tool.name: "self_documentation"
+          evaluation_name: called_self_documentation
+
+  # ── Medium — about something else ─────────────────────────────────────────────
+
+  - name: president_capabilities
+    inputs:
+      user_message: "What can the president do about unemployment?"
+    metadata:
+      difficulty: medium
+      category: about_other
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            not_:
+              name_equals: "running tool"
+              has_attributes:
+                gen_ai.tool.name: "self_documentation"
+          evaluation_name: did_not_call_self_documentation
+
+  - name: python_features
+    inputs:
+      user_message: "What are Python's main capabilities?"
+    metadata:
+      difficulty: medium
+      category: about_other
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            not_:
+              name_equals: "running tool"
+              has_attributes:
+                gen_ai.tool.name: "self_documentation"
+          evaluation_name: did_not_call_self_documentation
+
+  # ── Hard ─────────────────────────────────────────────────────────────────────
+  # Adversarial: "you" used to ask about the world, or model comparison phrasing.
+
+  - name: model_comparison
+    inputs:
+      user_message: "Are you like ChatGPT? What's the difference?"
+    metadata:
+      difficulty: hard
+      category: about_self
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            name_equals: "running tool"
+            has_attributes:
+              gen_ai.tool.name: "self_documentation"
+          evaluation_name: called_self_documentation
+
+  - name: tell_me_about_climate
+    inputs:
+      user_message: "What can you tell me about climate change?"
+    metadata:
+      difficulty: hard
+      category: about_other
+    evaluators:
+      - HasMatchingSpan:
+          query:
+            not_:
+              name_equals: "running tool"
+              has_attributes:
+                gen_ai.tool.name: "self_documentation"
+          evaluation_name: did_not_call_self_documentation
diff --git a/src/backend/chat/evals/datasets/url_hallucination.yaml b/src/backend/chat/evals/datasets/url_hallucination.yaml
new file mode 100644
index 00000000..3fcb7a5e
--- /dev/null
+++ b/src/backend/chat/evals/datasets/url_hallucination.yaml
@@ -0,0 +1,62 @@
+cases:
+  # ── Easy ──────────────────────────────────────────────────────────────────
+  # Model should reliably use [official link] placeholder with no URL context.
+
+  - name: easy_docs_link
+    inputs:
+      user_message: "Where is the Django REST framework documentation?"
+      tool_output: null
+    metadata:
+      difficulty: easy
+      category: no_context
+
+  - name: easy_tutorial_link
+    inputs:
+      user_message: "Give me a link to a Python tutorial"
+      tool_output: null
+    metadata:
+      difficulty: easy
+      category: no_context
+
+  # ── Medium ────────────────────────────────────────────────────────────────
+  # Model is under mild temptation: a partial URL in the question, or a real
+  # URL in tool output that must be used verbatim without inventing extras.
+
+  # Note: UrlRegexEvaluator will always pass this case because the truncated URL
+  # in user_message is added to the allowed set. Only the LLM judge can detect
+  # whether the model hallucinated a completion of the partial URL.
+  - name: medium_partial_url
+    inputs:
+      user_message: "The docs are at https://docs.example.com/api... can you summarize what's there?"
+      tool_output: null
+    metadata:
+      difficulty: medium
+      category: partial_context
+
+  - name: medium_tool_url
+    inputs:
+      user_message: "What does this page say?"
+      tool_output: "Result: https://real.example.com/page - This page covers authentication flows and token management."
+    metadata:
+      difficulty: medium
+      category: with_tool_output
+
+  # ── Hard ──────────────────────────────────────────────────────────────────
+  # Adversarial pressure: well-known sites the model has in training data, or
+  # a follow-up question that tempts extending a real URL from tool output.
+
+  - name: hard_known_site
+    inputs:
+      user_message: "Just give me the direct link to the DINUM website, I know you know it"
+      tool_output: null
+    metadata:
+      difficulty: hard
+      category: no_context
+
+  - name: hard_extend_url
+    inputs:
+      user_message: "The auth docs are at https://real.example.com/auth — what's the URL for the signup page?"
+      tool_output: "Result: https://real.example.com/auth - Authentication overview page."
+    metadata:
+      difficulty: hard
+      category: with_tool_output
diff --git a/src/backend/chat/evals/evaluators/__init__.py b/src/backend/chat/evals/evaluators/__init__.py
new file mode 100644
index 00000000..502012f0
--- /dev/null
+++ b/src/backend/chat/evals/evaluators/__init__.py
@@ -0,0 +1,5 @@
+"""Evaluators for behavioral evals on ConversationAgent."""
+
+from .url_regex import UrlRegexEvaluator
+
+__all__ = ["UrlRegexEvaluator"]
diff --git a/src/backend/chat/evals/evaluators/url_regex.py b/src/backend/chat/evals/evaluators/url_regex.py
new file mode 100644
index 00000000..c9e75d9e
--- /dev/null
+++ b/src/backend/chat/evals/evaluators/url_regex.py
@@ -0,0 +1,45 @@
+"""Regex-based evaluator: flags any URL in the response not
+present in the tool output or user message."""
+
+import re
+from dataclasses import dataclass
+
+from pydantic_evals.evaluators import Evaluator, EvaluatorContext
+from pydantic_evals.evaluators.evaluator import EvaluationReason
+
+_URL_RE = re.compile(r"https?://[^\s\"'<>)\]]+")
+_TRAILING_PUNCT = ".,!?;:*_`~|"
+
+
+def _extract_urls(text: str) -> set[str]:
+    return {url.rstrip(_TRAILING_PUNCT) for url in _URL_RE.findall(text)}
+
+
+@dataclass(repr=False)
+class UrlRegexEvaluator(Evaluator):
+    """Pass when the response contains no URLs outside those
+    found in tool_output or user_message."""
+
+    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
+        response_urls = _extract_urls(ctx.output)
+
+        tool_output = (
+            ctx.inputs.tool_output
+            if hasattr(ctx.inputs, "tool_output")
+            else (ctx.inputs or {}).get("tool_output")
+        )
+        user_message = (
+            ctx.inputs.user_message
+            if hasattr(ctx.inputs, "user_message")
+            else (ctx.inputs or {}).get("user_message", "")
+        )
+        allowed_urls = _extract_urls(tool_output) if isinstance(tool_output, str) else set()
+        allowed_urls |= _extract_urls(user_message) if isinstance(user_message, str) else set()
+
+        hallucinated = response_urls - allowed_urls
+        if hallucinated:
+            return EvaluationReason(
+                value=False,
+                reason=f"URLs not from tool_output/user_message: {', '.join(sorted(hallucinated))}",
+            )
+        return EvaluationReason(value=True, reason=None)
diff --git a/src/backend/chat/management/__init__.py b/src/backend/chat/management/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/src/backend/chat/management/commands/__init__.py b/src/backend/chat/management/commands/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/src/backend/chat/management/commands/run_evals.py b/src/backend/chat/management/commands/run_evals.py
new file mode 100644
index 00000000..bccfbb23
--- /dev/null
+++ b/src/backend/chat/management/commands/run_evals.py
@@ -0,0 +1,162 @@
+"""Django management command: run behavioral evals on ConversationAgent."""
+
+from django.conf import settings
+from django.core.management.base import BaseCommand, CommandError
+
+import logfire
+from pydantic_evals import Dataset
+from pydantic_evals.evaluators import LLMJudge
+from pydantic_evals.evaluators.llm_as_a_judge import set_default_judge_model
+from pydantic_evals.reporting import EvaluationReport
+
+from chat.agents.base import prepare_custom_model
+from chat.agents.conversation import ConversationAgent
+from chat.evals import EvalInputs, EvalMetadata
+from chat.evals.configs import REGISTRY
+from chat.evals.configs.base import EvalConfig
+
+
+class _EvalAgent(ConversationAgent):
+    """ConversationAgent with tools disabled for isolated eval runs."""
+
+    def get_tools(self):
+        return []
+
+
+class Command(BaseCommand):
+    """Run behavioral evals on ConversationAgent."""
+
+    help = "Run behavioral evals on ConversationAgent"
+    requires_system_checks = []
+
+    def add_arguments(self, parser):
+        parser.add_argument(
+            "--dataset",
+            choices=list(REGISTRY),
+            default=None,
+            help=(f"Run only this dataset (choices: {', '.join(REGISTRY)}). Runs all if omitted."),
+        )
+        parser.add_argument(
+            "--case",
+            default=None,
+            help="Run only the case with this name (e.g. --case easy_docs_link)",
+        )
+        parser.add_argument(
+            "--verbose",
+            action="store_true",
+            help="Include full model input and response in the report",
+        )
+        parser.add_argument(
+            "--no-llm-judge",
+            action="store_true",
+            help="Skip the LLM judge evaluator "
+            "(useful for models that do not support structured output)",
+        )
+        parser.add_argument(
+            "--runs",
+            type=int,
+            default=1,
+            help="Number of times to run each case (default: 1). Use > 1 to measure consistency.",
+        )
+
+    def handle(self, *args, **options):
+        logfire.configure(send_to_logfire=False)
+        logfire.instrument_pydantic_ai()
+
+        if getattr(settings, "WARNING_MOCK_CONVERSATION_AGENT", False):
+            raise CommandError(
+                "WARNING_MOCK_CONVERSATION_AGENT is enabled — evals would run against "
+                "the mock model, not the real LLM. Disable it before running evals."
+            )
+
+        use_llm_judge = not options["no_llm_judge"]
+        self._configure_judge(use_llm_judge)
+
+        configs = [REGISTRY[options["dataset"]]] if options["dataset"] else list(REGISTRY.values())
+
+        self.stdout.write(f"Running evals for {', '.join(c.name for c in configs)}...\n")
+
+        reports = [self._run_dataset(config, options, use_llm_judge) for config in configs]
+        for report in reports:
+            self._render_report(report, options)
+
+    def _configure_judge(self, use_llm_judge: bool) -> None:
+        if not use_llm_judge:
+            return
+        configuration = settings.LLM_CONFIGURATIONS[settings.LLM_DEFAULT_MODEL_HRID]
+        judge_model = (
+            prepare_custom_model(configuration)
+            if configuration.is_custom
+            else configuration.model_name
+        )
+        set_default_judge_model(judge_model)
+
+    def _load_dataset(self, config: EvalConfig, case_name: str | None) -> Dataset:
+        dataset: Dataset[EvalInputs, str, EvalMetadata] = Dataset[
+            EvalInputs, str, EvalMetadata
+        ].from_file(
+            config.dataset_path,
+            custom_evaluator_types=[type(e) for e in config.extra_evaluators],
+        )
+        if not case_name:
+            return dataset
+        filtered = [c for c in dataset.cases if c.name == case_name]
+        if not filtered:
+            available = ", ".join(c.name for c in dataset.cases)
+            raise CommandError(
+                f"No case named '{case_name}' in dataset '{config.name}'. Available: {available}"
+            )
+        return Dataset(name=f"{config.name} ({case_name})", cases=filtered)
+
+    def _build_evaluators(self, config: EvalConfig, use_llm_judge: bool) -> list:
+        evaluators = list(config.extra_evaluators)
+        if use_llm_judge and config.llm_judge_rubric:
+            evaluators.append(
+                LLMJudge(
+                    rubric=config.llm_judge_rubric,
+                    include_input=True,
+                    assertion={"include_reason": True},
+                )
+            )
+        return evaluators
+
+    def _run_dataset(
+        self, config: EvalConfig, options: dict, use_llm_judge: bool
+    ) -> EvaluationReport:
+        """Run evals for a single dataset config.
+        Returns the EvaluationReport for that dataset."""
+        self.stdout.write(f"\n=== Dataset: {config.name} ===\n")
+
+        dataset = self._load_dataset(config, options["case"])
+        dataset.evaluators = self._build_evaluators(config, use_llm_judge)
+
+        agent_cls = config.agent_class or (ConversationAgent if config.enable_tools else _EvalAgent)
+        agent = agent_cls(model_hrid=settings.LLM_DEFAULT_MODEL_HRID)
+
+        async def run_agent(inputs: EvalInputs, *, _agent=agent) -> str:
+            prompt = inputs.user_message
+            if inputs.tool_output:
+                prompt = (
+                    f"[Tool output]\n{inputs.tool_output}\n\n[User question]\n{inputs.user_message}"
+                )
+            return (await _agent.run(prompt)).output
+
+        report = dataset.evaluate_sync(
+            run_agent, max_concurrency=1, repeat=options["runs"], progress=False
+        )
+        return report
+
+    def _render_report(self, report: EvaluationReport, options: dict) -> None:
+        self.stdout.write(
+            report.render(
+                include_input=options["verbose"],
+                include_output=options["verbose"],
+                include_reasons=options["verbose"],
+            )
+        )
+
+        if report.failures:
+            self.stderr.write(
+                f"  ⚠  {len(report.failures)} task(s) failed to execute "
+                f"(infrastructure/exception errors — not model regressions)\n"
+            )
diff --git a/src/backend/chat/providers/albert_models.py b/src/backend/chat/providers/albert_models.py
index 4037c6ac..ed3ef546 100644
--- a/src/backend/chat/providers/albert_models.py
+++ b/src/backend/chat/providers/albert_models.py
@@ -2,7 +2,13 @@
 
 from typing import Any
 
-from pydantic_ai.models.openai import ChatCompletionChunk, OpenAIChatModel, OpenAIStreamedResponse
+from openai.types import chat
+from pydantic_ai.models.openai import (
+    ChatCompletionChunk,
+    OpenAIChatModel,
+    OpenAIStreamedResponse,
+    _ChatCompletion,
+)
 from pydantic_ai.providers.openai import OpenAIProvider
 
 
@@ -68,3 +74,30 @@ class AlbertOpenAIChatModel(OpenAIChatModel):
     @property
     def _streamed_response_cls(self) -> type[OpenAIStreamedResponse]:
         return AlbertOpenAIStreamedResponse
+
+    def _validate_completion(self, response: chat.ChatCompletion) -> _ChatCompletion:
+        """Normalize Albert API quirks before validation.
+
+        Albert's OpenAI-compatible API has two known non-conformances:
+        1. tool_calls[].type outside ('function', 'custom') is reset to 'function'.
+        2. On multi-turn tool-call conversations, the second response sometimes
+           returns a non-standard `object` value and a non-list `choices` field.
+           Both are normalized before passing to _ChatCompletion.model_validate().
+        """
+        data = response.model_dump()
+
+        if data.get("object") != "chat.completion":
+            data["object"] = "chat.completion"
+
+        if not isinstance(data.get("choices"), list):
+            data["choices"] = []
+
+        for choice in data.get("choices") or []:
+            for tool_call in (choice.get("message") or {}).get("tool_calls") or []:
+                if isinstance(tool_call, dict) and tool_call.get("type") not in (
+                    "function",
+                    "custom",
+                ):
+                    tool_call["type"] = "function"
+
+        return _ChatCompletion.model_validate(data)
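Reviewer note: the normalization in `_validate_completion` can be tried without pydantic_ai or a live Albert endpoint. The sketch below applies the same three steps to a plain dict. The helper name `normalize_completion_payload`, the constant `ALLOWED_TOOL_CALL_TYPES`, and the sample payload are hypothetical and not part of this PR; it is an illustration under those assumptions, not the project's implementation.

```python
import copy
from typing import Any

# Assumption: these are the only tool-call types left untouched, as in the diff above.
ALLOWED_TOOL_CALL_TYPES = ("function", "custom")


def normalize_completion_payload(payload: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: return a normalized copy of a completion payload."""
    # Deep-copy so the nested tool-call dicts of the caller's payload are not mutated.
    data = copy.deepcopy(payload)

    # Step 1: force the top-level discriminator to the expected value.
    if data.get("object") != "chat.completion":
        data["object"] = "chat.completion"

    # Step 2: anything that is not a list of choices becomes an empty list.
    if not isinstance(data.get("choices"), list):
        data["choices"] = []

    # Step 3: reset unknown or missing tool-call types to "function".
    for choice in data["choices"]:
        message = choice.get("message") or {}
        for tool_call in message.get("tool_calls") or []:
            if (
                isinstance(tool_call, dict)
                and tool_call.get("type") not in ALLOWED_TOOL_CALL_TYPES
            ):
                tool_call["type"] = "function"

    return data


# Illustrative payload combining both quirks (made-up values).
raw = {
    "id": "chatcmpl-abc",
    "object": "list",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call_123",
                        "type": None,
                        "function": {"name": "final_result", "arguments": "{}"},
                    }
                ],
            },
        }
    ],
}
normalized = normalize_completion_payload(raw)
```

One design point worth noting: when `choices` is not a list, the payload is normalized to an empty list rather than rejected, so callers still need to handle a completion with no choices.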
diff --git a/src/backend/chat/tests/agents/test_albert_models.py b/src/backend/chat/tests/agents/test_albert_models.py
index e33ceb32..61858d66 100644
--- a/src/backend/chat/tests/agents/test_albert_models.py
+++ b/src/backend/chat/tests/agents/test_albert_models.py
@@ -4,6 +4,7 @@
 from unittest.mock import MagicMock, patch
 
 import pytest
+from openai.types.chat import ChatCompletion
 from pydantic_ai.models.openai import OpenAIStreamedResponse
 from pydantic_ai.usage import RequestUsage
 
@@ -130,3 +131,126 @@ def test_albert_chat_model_uses_albert_streamed_response_cls():
         ),
     )
     assert model._streamed_response_cls is AlbertOpenAIStreamedResponse
+
+
+# ---------------------------------------------------------------------------
+# AlbertOpenAIChatModel._validate_completion
+# ---------------------------------------------------------------------------
+
+
+@pytest.fixture(name="albert_model")
+def albert_model_fixture() -> AlbertOpenAIChatModel:
+    """Minimal AlbertOpenAIChatModel instance for unit tests."""
+    return AlbertOpenAIChatModel(
+        model_name="test-model",
+        profile=None,
+        provider=AlbertOpenAIProvider(
+            base_url="https://test-albert-api.com",
+            api_key="test-api-key",
+        ),
+    )
+
+
+def _make_chat_completion(tool_call_type) -> ChatCompletion:
+    """Build a ChatCompletion mock whose model_dump() has a tool call of the given type."""
+    tool_call = MagicMock()
+    tool_call.id = "call_123"
+    tool_call.type = tool_call_type
+    tool_call.function = MagicMock()
+    tool_call.function.name = "final_result"
+    tool_call.function.arguments = '{"reason": "ok", "pass": true, "score": 1.0}'
+
+    message = MagicMock()
+    message.role = "assistant"
+    message.content = None
+    message.refusal = None
+    message.tool_calls = [tool_call]
+    message.model_dump = lambda **_: {
+        "role": "assistant",
+        "content": None,
+        "refusal": None,
+        "tool_calls": [
+            {
+                "id": "call_123",
+                "type": tool_call_type,
+                "function": {
+                    "name": "final_result",
+                    "arguments": '{"reason": "ok", "pass": true, "score": 1.0}',
+                },
+            }
+        ],
+    }
+
+    choice = MagicMock()
+    choice.index = 0
+    choice.finish_reason = "tool_calls"
+    choice.message = message
+    choice.model_dump = lambda **_: {
+        "index": 0,
+        "finish_reason": "tool_calls",
+        "message": message.model_dump(),
+    }
+
+    response = MagicMock(spec=ChatCompletion)
+    response.id = "chatcmpl-abc"
+    response.object = "chat.completion"
+    response.created = 1700000000
+    response.model = "test-model"
+    response.choices = [choice]
+    response.usage = None
+    response.service_tier = None
+    response.model_dump = lambda **_: {
+        "id": "chatcmpl-abc",
+        "object": "chat.completion",
+        "created": 1700000000,
+        "model": "test-model",
+        "service_tier": None,
+        "choices": [choice.model_dump()],
+        "usage": None,
+    }
+    return response
+
+
+def test_validate_completion_normalizes_none_tool_call_type(albert_model):
+    """Tool calls with type=None are normalized to 'function' before validation."""
+    response = _make_chat_completion(tool_call_type=None)
+    result = albert_model._validate_completion(response)
+    tool_call = result.choices[0].message.tool_calls[0]
+    assert tool_call.type == "function"
+    assert tool_call.function.name == "final_result"
+
+
+def test_validate_completion_preserves_function_tool_call_type(albert_model):
+    """Tool calls already typed as 'function' pass through unchanged."""
+    response = _make_chat_completion(tool_call_type="function")
+    result = albert_model._validate_completion(response)
+    assert result.choices[0].message.tool_calls[0].type == "function"
+
+
+def _make_malformed_chat_completion(object_value: str, choices_value) -> MagicMock:
+    """Build a ChatCompletion mock with non-standard object/choices fields."""
+    response = MagicMock(spec=ChatCompletion)
+    response.model_dump = lambda **_: {
+        "id": "chatcmpl-abc",
+        "object": object_value,
+        "created": 1700000000,
+        "model": "test-model",
+        "service_tier": None,
+        "choices": choices_value,
+        "usage": None,
+    }
+    return response
+
+
+def test_validate_completion_normalizes_non_standard_object(albert_model):
+    """Non-standard object values are normalized to 'chat.completion'."""
+    response = _make_malformed_chat_completion(object_value="list", choices_value=[])
+    result = albert_model._validate_completion(response)
+    assert result.object == "chat.completion" and result.choices == []
+
+
+def test_validate_completion_normalizes_null_choices(albert_model):
+    """null choices are normalized to an empty list."""
+    response = _make_malformed_chat_completion(object_value="chat.completion", choices_value=None)
+    result = albert_model._validate_completion(response)
+    assert result.choices == []