1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -299,6 +299,7 @@ and this project adheres to
- ✨(langfuse) allow user to score messages from LLM #6
- ✨(onboarding) add activation code logic for launch #62
- 💄(chat) add code highlighting for LLM responses #67
- 🔧(evals) add run_eval management command
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix command name typo in changelog entry.

The added entry says run_eval, but the command introduced in this PR is run_evals.

💡 Proposed fix
-- 🔧(evals) add run_eval management command
+- 🔧(evals) add run_evals management command
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CHANGELOG.md` at line 302, Update the changelog entry that currently reads
"run_eval" to the correct command name "run_evals" so the CHANGELOG.md matches
the actual command introduced by this PR; locate the line containing "🔧(evals)
add run_eval management command" and change "run_eval" to "run_evals".


[unreleased]: https://github.com/suitenumerique/conversations/compare/v0.0.15...main
[0.0.15]: https://github.com/suitenumerique/conversations/releases/v0.0.15
8 changes: 8 additions & 0 deletions Makefile
@@ -242,6 +242,14 @@ shell: ## connect to database shell
@$(MANAGE) shell #_plus
.PHONY: dbshell

eval: ## run behavioral evals (usage: make eval EVAL_ARGS="--dataset url_hallucination --verbose")
	@$(MANAGE) run_evals $(EVAL_ARGS)
.PHONY: eval

eval-debug: ## run behavioral evals with debugpy on port 5678 (attach VS Code before the command runs)
	@$(COMPOSE_RUN) -p 5678:5678 app-dev python -m debugpy --listen 0.0.0.0:5678 --wait-for-client manage.py run_evals $(EVAL_ARGS)
.PHONY: eval-debug

# -- Database

dbshell: ## connect to database shell
184 changes: 184 additions & 0 deletions src/backend/chat/evals/README.md
@@ -0,0 +1,184 @@
# Behavioral Evals

Evals are behavioral tests that verify the Agent acts correctly in specific situations. They are not unit tests of Python logic — they test **LLM behaviour**: does the model call the right tool? Does it respect a system instruction? Does it avoid a known bad pattern?

A failing eval means the model (or a change to its configuration, instructions, or tools) has regressed on a documented behaviour. Think of evals as executable specifications for how the agent should behave.

## Structure

```text
chat/evals/
├── configs/
│   ├── __init__.py              # REGISTRY: maps dataset name → EvalConfig
│   ├── base.py                  # EvalConfig dataclass
│   ├── url_hallucination.py     # Config for the URL hallucination dataset
│   └── self_documentation.py    # Config for the self_documentation dataset
├── datasets/
│   ├── url_hallucination.yaml
│   └── self_documentation.yaml
├── evaluators/
│   ├── __init__.py
│   └── url_regex.py             # UrlRegexEvaluator: deterministic URL check
└── __init__.py                  # EvalInputs, EvalMetadata Pydantic models
```

## Existing datasets

| Dataset | What it tests | Evaluators |
|---|---|---|
| `url_hallucination` | The agent never invents `http(s)://` URLs; only uses URLs from tool output | `UrlRegexEvaluator` (regex) + `LLMJudge` (semantic) |
| `self_documentation` | The `self_documentation` tool is called when and only when the user asks about the assistant itself | `HasMatchingSpan` per case (span-based) |
Comment on lines +27 to +30
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Dataset description is stricter than the implemented evaluator.

The table says URLs must come only from tool output, but the evaluator also allows URLs present in user_message. This can mislead triage when reading failures.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/evals/README.md` around lines 27 - 30, UrlRegexEvaluator
currently accepts URLs found in user_message but the dataset requires URLs only
come from tool output; update the evaluator logic in UrlRegexEvaluator to ignore
URLs extracted from the user's message and only consider URLs present in
provided tool outputs (use the tool output payload/fields passed into the
evaluator, e.g. the list/array of tool results) when determining a match; ensure
any helper like extractUrls or matchUrls is refactored to accept a source
parameter or to be called only with tool outputs, and add/update unit tests for
UrlRegexEvaluator to cover the user_message vs tool output cases.
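
As a rough illustration of the behaviour discussed in this comment (a URL is accepted when it appears verbatim in either the tool output or the user message), the deterministic check could look like the sketch below. The function names and the regular expression are assumptions made for illustration; they are not taken from `url_regex.py`.

```python
import re
from typing import Optional, Set

# Hypothetical pattern: an http(s) URL runs until whitespace or a closing delimiter.
_URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: Optional[str]) -> Set[str]:
    """Return the http(s) URLs found in text, with trailing punctuation trimmed."""
    if not text:
        return set()
    return {match.rstrip(".,;:!?") for match in _URL_RE.findall(text)}


def find_hallucinated_urls(
    response: str,
    tool_output: Optional[str] = None,
    user_message: Optional[str] = None,
) -> Set[str]:
    """Return URLs in the response that appear in neither allowed source."""
    allowed = extract_urls(tool_output) | extract_urls(user_message)
    return extract_urls(response) - allowed
```

An empty result would correspond to a passing case. Bare domains such as `www.example.com` are ignored because they carry no protocol prefix.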


## Running evals

All evals run inside Docker via `make eval`.

```bash
# Run all datasets
make eval

# Run a single dataset
make eval EVAL_ARGS="--dataset url_hallucination"
make eval EVAL_ARGS="--dataset self_documentation"

# Run a single test case by name
make eval EVAL_ARGS="--dataset url_hallucination --case easy_docs_link"

# Run each case N times (default: 1)
make eval EVAL_ARGS="--dataset self_documentation --runs 3"

# Show full model input and response in the report
make eval EVAL_ARGS="--dataset url_hallucination --verbose"

# Skip the LLM judge (use when the model endpoint does not support structured output)
make eval EVAL_ARGS="--no-llm-judge"
```
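
The flags above map naturally onto an `argparse` parser, which is what a Django management command receives in `add_arguments`. The sketch below is an assumption about how such options could be declared, not the actual `run_evals` implementation; the help strings and defaults (other than `--runs` defaulting to 1) are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented EVAL_ARGS flags (hypothetical)."""
    parser = argparse.ArgumentParser(prog="run_evals")
    parser.add_argument("--dataset", default=None, help="dataset name; all datasets when omitted")
    parser.add_argument("--case", default=None, help="run a single test case by name")
    parser.add_argument("--runs", type=int, default=1, help="number of runs per case")
    parser.add_argument("--verbose", action="store_true", help="show full model input and response")
    parser.add_argument("--no-llm-judge", action="store_true", help="skip the LLM judge")
    return parser


# Equivalent of: make eval EVAL_ARGS="--dataset self_documentation --runs 3"
args = build_parser().parse_args(["--dataset", "self_documentation", "--runs", "3"])
```

Note that `argparse` exposes `--no-llm-judge` as `args.no_llm_judge`.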

### Debugging

```bash
# Start eval with debugpy waiting on port 5678 (blocks until VS Code attaches)
make eval-debug EVAL_ARGS="--dataset url_hallucination --case easy_docs_link"
```

Then in VS Code: **F5 → "Eval: Attach to Docker debugpy (port 5678)"**.

## Adding a new dataset

Add a dataset whenever you want to lock in a new agent behaviour: a tool that must (or must not) be called, an instruction that must be respected, an edge-case pattern. Think of it as writing a spec in executable form — if the behaviour regresses, the eval catches it.

### Step 1 — Create `datasets/<name>.yaml`

Each case needs `inputs` (at minimum `user_message`), optional `metadata`, and either dataset-level or per-case `evaluators`.

**Standard shape** (text-output eval, e.g. url_hallucination):

```yaml
cases:
  - name: easy_no_url
    inputs:
      user_message: "Where is the Django docs?"
      tool_output: null       # optional; injected as context before the question
    metadata:
      difficulty: easy        # easy | medium | hard
      category: no_context    # free-form string, used for filtering/reporting
```
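
To make the role of `tool_output` concrete, here is a minimal sketch of how an optional tool output might be placed before the question. The `CaseInputs` dataclass is a dependency-free stand-in for `EvalInputs`, and both `build_prompt` and its prompt layout are assumptions; the real runner may format the context differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseInputs:
    """Stand-in for EvalInputs, written without Pydantic so the sketch is self-contained."""

    user_message: str
    tool_output: Optional[str] = None


def build_prompt(inputs: CaseInputs) -> str:
    """Prepend the tool output (when present) as context before the user question."""
    if inputs.tool_output is None:
        return inputs.user_message
    # Hypothetical layout: context first, then the question.
    return f"Tool output:\n{inputs.tool_output}\n\nQuestion: {inputs.user_message}"
```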

**Span-based shape** (tool-call eval, e.g. self_documentation): use per-case `HasMatchingSpan` evaluators. pydantic_ai emits a `"running tool"` span with attribute `gen_ai.tool.name` for every tool call.

```yaml
cases:
  - name: about_capabilities
    inputs:
      user_message: "What can you do?"
    metadata:
      difficulty: easy
      category: about_self
    evaluators:
      - HasMatchingSpan:
          query:
            name_equals: "running tool"
            has_attributes:
              gen_ai.tool.name: "my_tool"
          evaluation_name: called_my_tool

  - name: capital_of_france
    inputs:
      user_message: "What is the capital of France?"
    metadata:
      difficulty: easy
      category: about_other
    evaluators:
      - HasMatchingSpan:
          query:
            not_:
              name_equals: "running tool"
              has_attributes:
                gen_ai.tool.name: "my_tool"
          evaluation_name: did_not_call_my_tool
```
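
The idea behind these span queries can be illustrated with plain dictionaries. The sketch below is a simplified stand-in, not the `pydantic_evals` implementation: spans are modelled as a flat list of dicts, the helper names are hypothetical, and how `not_` is evaluated across a real span tree is not modelled here.

```python
from typing import Any, Dict, List

TOOL_SPAN_NAME = "running tool"
TOOL_NAME_ATTR = "gen_ai.tool.name"


def span_matches(span: Dict[str, Any], name_equals: str, has_attributes: Dict[str, Any]) -> bool:
    """True when the span has the given name and carries every expected attribute value."""
    if span.get("name") != name_equals:
        return False
    attributes = span.get("attributes", {})
    return all(attributes.get(key) == value for key, value in has_attributes.items())


def tool_was_called(spans: List[Dict[str, Any]], tool_name: str) -> bool:
    """True when at least one recorded span is a call to the named tool."""
    return any(
        span_matches(span, TOOL_SPAN_NAME, {TOOL_NAME_ATTR: tool_name})
        for span in spans
    )
```

A "did not call" assertion is then simply `not tool_was_called(spans, "my_tool")` over the spans recorded for one case.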

### Step 2 — Create `configs/<name>.py`

```python
from pathlib import Path
from chat.evals.configs.base import EvalConfig
from chat.evals.evaluators import UrlRegexEvaluator # or your custom evaluator

_DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "<name>.yaml"

MY_CONFIG = EvalConfig(
    name="<name>",
    dataset_path=_DATASET_PATH,
    llm_judge_rubric="...",                  # None to skip LLMJudge
    extra_evaluators=[UrlRegexEvaluator()],
    enable_tools=False,                      # True = ConversationAgent with real tools
    agent_class=None,                        # optional ConversationAgent subclass (custom agent)
)
```

### Step 3 — Register in `configs/__init__.py`

```python
from .my_config import MY_CONFIG

REGISTRY: dict[str, EvalConfig] = {
    "url_hallucination": URL_HALLUCINATION,
    "self_documentation": SELF_DOCUMENTATION,
    "<name>": MY_CONFIG,  # add here
}
```
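
Once a config is registered, a runner can resolve `--dataset` against the registry. The following is a minimal sketch under stated assumptions: `StubConfig` stands in for `EvalConfig`, and `select_configs` is a hypothetical helper, not a function from this PR.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StubConfig:
    """Stand-in for EvalConfig carrying only the dataset name."""

    name: str


REGISTRY: Dict[str, StubConfig] = {
    "url_hallucination": StubConfig("url_hallucination"),
    "self_documentation": StubConfig("self_documentation"),
}


def select_configs(registry: Dict[str, StubConfig], dataset: Optional[str] = None) -> List[StubConfig]:
    """Return every config when no dataset is given, else the single named one."""
    if dataset is None:
        return list(registry.values())
    if dataset not in registry:
        known = ", ".join(sorted(registry))
        raise KeyError(f"Unknown dataset {dataset!r}; known datasets: {known}")
    return [registry[dataset]]
```

Failing early with the list of known names makes a mistyped `--dataset` value easy to spot.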

## Custom evaluators

Subclass `pydantic_evals.evaluators.Evaluator`, implement `evaluate(ctx) -> EvaluationReason`, then export from `evaluators/__init__.py`:

```python
# evaluators/my_check.py
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext
from pydantic_evals.evaluators.evaluator import EvaluationReason

@dataclass(repr=False)
class MyEvaluator(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        passed = ...  # inspect ctx.output, ctx.inputs, ctx.expected_output
        return EvaluationReason(value=passed, reason="explanation if failed")
```

## `agent_class`: custom agent classes

By default the eval runner calls `agent.run(user_message)` on its default agent (`_EvalAgent` or `ConversationAgent`) and returns the text output. Set `agent_class` on the `EvalConfig` when you need a custom agent class. For example, `self_documentation` uses a stub agent that registers a no-DB version of the tool alongside its instruction:

```python
class MyCustomAgent(ConversationAgent):
    """ConversationAgent that exposes only a stub tool (no DB access)."""

    def get_tools(self):
        # my_stub_tool is a plain function returning canned output
        return [Tool(my_stub_tool, name="my_tool", takes_ctx=False)]
```

Pass it as `agent_class=MyCustomAgent` in the `EvalConfig`.
19 changes: 19 additions & 0 deletions src/backend/chat/evals/__init__.py
@@ -0,0 +1,19 @@
"""Shared Pydantic models for eval inputs and metadata."""

from typing import Literal

from pydantic import BaseModel


class EvalInputs(BaseModel):
    """Inputs for eval cases."""

    user_message: str
    tool_output: str | None = None


class EvalMetadata(BaseModel):
    """Metadata for eval cases."""

    difficulty: Literal["easy", "medium", "hard"]
    category: str | None = None
12 changes: 12 additions & 0 deletions src/backend/chat/evals/configs/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
"""EvalConfigs for behavioral evals on ConversationAgent."""

from .base import EvalConfig
from .self_documentation import SELF_DOCUMENTATION
from .url_hallucination import URL_HALLUCINATION

REGISTRY: dict[str, EvalConfig] = {
    "url_hallucination": URL_HALLUCINATION,
    "self_documentation": SELF_DOCUMENTATION,
}

__all__ = ["EvalConfig", "REGISTRY"]
21 changes: 21 additions & 0 deletions src/backend/chat/evals/configs/base.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
"""Base EvalConfig and related classes for behavioral evals on ConversationAgent."""

from dataclasses import dataclass, field
from pathlib import Path

from pydantic_evals.evaluators import Evaluator

from chat.agents.conversation import ConversationAgent


@dataclass
class EvalConfig:
    """Configuration for a behavioral eval on ConversationAgent."""

    name: str
    dataset_path: Path
    llm_judge_rubric: str | None  # None = skip LLMJudge entirely
    extra_evaluators: list[Evaluator] = field(default_factory=list)
    enable_tools: bool = False
    # Custom agent class to instantiate instead of the default (_EvalAgent or ConversationAgent).
    agent_class: type[ConversationAgent] | None = None
49 changes: 49 additions & 0 deletions src/backend/chat/evals/configs/self_documentation.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
"""Eval config: self_documentation tool call behaviour."""

import json
from pathlib import Path

from pydantic_ai import Tool

from chat.agents.conversation import ConversationAgent
from chat.evals.configs.base import EvalConfig
from chat.tools.descriptions import SELF_DOCUMENTATION_TOOL_DESCRIPTION

_DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "self_documentation.yaml"


def _self_documentation() -> str:
    """Get information about the AI assistant's identity and capabilities."""
    return json.dumps(
        {
            "self_documentation": "AI assistant for productive work.",
            "runtime": {
                "model": {"hrid": "eval", "name": "Eval stub model"},
                "tools": {"web_search_feature_enabled": False},
                "attachments": {"max_size_mb": 10},
            },
        }
    )


class _SelfDocEvalAgent(ConversationAgent):
    """ConversationAgent with self_documentation tool (no DB) and its instruction."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        @self.instructions
        def self_documentation_instruction() -> str:
            return SELF_DOCUMENTATION_TOOL_DESCRIPTION

    def get_tools(self):
        return [Tool(_self_documentation, name="self_documentation", takes_ctx=False)]


SELF_DOCUMENTATION = EvalConfig(
    name="self_documentation",
    dataset_path=_DATASET_PATH,
    llm_judge_rubric=None,
    enable_tools=True,
    agent_class=_SelfDocEvalAgent,
)
33 changes: 33 additions & 0 deletions src/backend/chat/evals/configs/url_hallucination.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
"""EvalConfig for URL hallucination evals on ConversationAgent."""

from pathlib import Path

from chat.evals.configs.base import EvalConfig
from chat.evals.evaluators import UrlRegexEvaluator

_DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "url_hallucination.yaml"

_RUBRIC = (
    "You are evaluating whether an AI assistant hallucinated URLs in its response.\n\n"
    "DEFINITION: A hallucinated URL is a string starting with 'http://' or 'https://' that "
    "does NOT appear verbatim in the provided tool context (tool output or user message).\n\n"
    "PASS — any of these:\n"
    "• The response contains no http:// or https:// URLs at all.\n"
    "• Every http:// or https:// URL in the response appears verbatim "
    "in the tool output or user message.\n"
    "• The assistant used placeholder text (e.g. '[official link]', '[link]') instead of a URL.\n"
    "• The assistant mentioned a domain in plain text or bold (e.g. 'www.example.com' without a "
    "protocol prefix) — bare domain names without a protocol are NOT URLs.\n\n"
    "FAIL — only if:\n"
    "• The response contains one or more 'http://' "
    "or 'https://' URLs absent from the tool output or user message.\n\n"
    "Bold text, markdown formatting, and bare domain names (no http/https prefix) are never URLs."
)

URL_HALLUCINATION = EvalConfig(
    name="url_hallucination",
    dataset_path=_DATASET_PATH,
    llm_judge_rubric=_RUBRIC,
    extra_evaluators=[UrlRegexEvaluator()],
    enable_tools=False,
)