🔧(evals) add run_eval management command by maxenceh · Pull Request #481 · suitenumerique/conversations

maxenceh · 2026-05-19T10:27:03Z

Purpose

This PR introduces a behavioral eval system for ConversationAgent.
These evals test LLM behaviour end-to-end:

- Does the model call the right tool?
- Does it respect a system instruction?

A failing eval means a documented behaviour has regressed.
The system is designed to grow: adding a new dataset requires one YAML file, one config file, and a registry entry.

⚠️ Additional change: add a wrapper for the Albert model (in non-stream mode) so that the eval receives the expected output.

Proposal

- Add make eval / make eval-debug Makefile targets: they run evals inside Docker; debug mode exposes debugpy on port 5678 for VS Code remote attach.
- Add run_evals Django management command with --dataset, --case, --runs, --verbose, --no-llm-judge arguments.
- Add EvalConfig dataclass registry: each dataset declares its rubric, evaluators, and optional custom agent class.
- Dataset url_hallucination: verifies the agent never invents http(s):// URLs; uses UrlRegexEvaluator (deterministic regex) + LLMJudge (semantic, skippable with --no-llm-judge for Albert API compatibility).
- Dataset self_documentation: verifies the self_documentation tool is called when and only when the user asks about the assistant itself; uses per-case HasMatchingSpan evaluators on OpenTelemetry spans (gen_ai.tool.name = "self_documentation").
- Add UrlRegexEvaluator: strips trailing punctuation before URL comparison to avoid false positives.
- Add README.md in chat/evals/ documenting how to run evals and how to add new datasets.
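
To make the UrlRegexEvaluator idea concrete, here is a minimal sketch of deterministic URL checking with trailing-punctuation stripping. It is not the PR's implementation: the pattern, the `extract_urls` and `find_hallucinated_urls` helpers, and the set of stripped characters are all assumptions chosen for illustration.

```python
import re

# Assumed pattern: an http(s) URL runs until whitespace or a bracket-like character.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")
# Assumed set of characters that commonly trail a URL in prose.
TRAILING_PUNCTUATION = ".,;:!?\"'"


def extract_urls(text: str) -> set[str]:
    """Return the http(s) URLs found in text, with trailing punctuation stripped."""
    return {match.rstrip(TRAILING_PUNCTUATION) for match in URL_PATTERN.findall(text)}


def find_hallucinated_urls(answer: str, *allowed_sources: str) -> set[str]:
    """Return URLs in the answer that appear in none of the allowed source texts."""
    allowed: set[str] = set()
    for source in allowed_sources:
        allowed |= extract_urls(source)
    return extract_urls(answer) - allowed
```

Without the stripping step, an answer ending in `https://example.org/docs.` would be compared with its final period attached and wrongly flagged as invented, which is the false positive the proposal mentions.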

Summary by CodeRabbit

New Features
- Behavioral evaluation framework for conversational agents with self-documentation and URL-hallucination test suites.
- CLI command to run evaluations with dataset/case selection, verbose output, judge toggle, and repeat runs.
- Makefile targets to run and debug evaluations.
Bug Fixes
- Improved normalization of tool-call types in the Albert models provider to ensure consistent validation.
Documentation
- Added a comprehensive guide for running, debugging, and authoring behavioral evals.
Tests
- Added unit tests covering tool-call normalization behavior.

coderabbitai · 2026-05-19T10:27:11Z


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 400f972e-b56d-4174-b805-72f7b0b7f2ea

📥 Commits

Reviewing files that changed from the base of the PR and between 1b29c2f and 4235b6a.

📒 Files selected for processing (17)

CHANGELOG.md
Makefile
src/backend/chat/evals/README.md
src/backend/chat/evals/__init__.py
src/backend/chat/evals/configs/__init__.py
src/backend/chat/evals/configs/base.py
src/backend/chat/evals/configs/self_documentation.py
src/backend/chat/evals/configs/url_hallucination.py
src/backend/chat/evals/datasets/self_documentation.yaml
src/backend/chat/evals/datasets/url_hallucination.yaml
src/backend/chat/evals/evaluators/__init__.py
src/backend/chat/evals/evaluators/url_regex.py
src/backend/chat/management/__init__.py
src/backend/chat/management/commands/__init__.py
src/backend/chat/management/commands/run_evals.py
src/backend/chat/providers/albert_models.py
src/backend/chat/tests/agents/test_albert_models.py

Walkthrough

Adds a behavioral evals subsystem (types, EvalConfig + REGISTRY, two eval configs, evaluators, datasets, README, Make targets) and a Django run_evals command to execute datasets; also normalizes Albert ChatCompletion tool-call types and adds tests.

Changes

Behavioral Evals Framework

| Layer | File(s) | Summary |
|---|---|---|
| Evals schema and documentation | `src/backend/chat/evals/__init__.py`, `src/backend/chat/evals/README.md` | Shared Pydantic models `EvalInputs` and `EvalMetadata`; README documents evals layout, running, debugging, dataset and evaluator authoring. |
| EvalConfig base and registry | `src/backend/chat/evals/configs/base.py`, `src/backend/chat/evals/configs/__init__.py` | `EvalConfig` dataclass (dataset_path, llm_judge_rubric, extra_evaluators, enable_tools, agent_class) and typed `REGISTRY` mapping names to configs. |
| Self-documentation eval implementation | `src/backend/chat/evals/configs/self_documentation.py`, `src/backend/chat/evals/datasets/self_documentation.yaml` | Implements a `_self_documentation` tool and `_SelfDocEvalAgent`, registers `SELF_DOCUMENTATION` EvalConfig, and adds a dataset asserting tool-call presence/absence across cases. |
| URL hallucination eval with custom evaluator | `src/backend/chat/evals/evaluators/*`, `src/backend/chat/evals/configs/url_hallucination.py`, `src/backend/chat/evals/datasets/url_hallucination.yaml` | Adds `UrlRegexEvaluator` to detect hallucinated http/https links, wires it into `URL_HALLUCINATION` EvalConfig with rubric, and supplies tiered dataset cases. |
| Run evals management command | `src/backend/chat/management/commands/run_evals.py` | Django command that loads configs from `REGISTRY`, composes evaluators (optionally LLMJudge), selects agent class (config override / ConversationAgent / `_EvalAgent`), runs `dataset.evaluate_sync` (max_concurrency=1, repeat runs), and prints verbosity-controlled reports (does not raise on model assertion failures). |
| Makefile and changelog | `Makefile`, `CHANGELOG.md` | Adds `eval` and `eval-debug` Make targets and documents the new management command in CHANGELOG. |
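
The config-plus-registry layout summarized above could look roughly like the sketch below. Only the field names come from the walkthrough; the field types, defaults, the `DATASETS_DIR` path, the placeholder rubric, the `enable_tools` values, and the `get_config` helper are assumptions for illustration, not the PR's code.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional


@dataclass(frozen=True)
class EvalConfig:
    """Illustrative stand-in for the PR's EvalConfig; field types are assumed."""

    dataset_path: Path
    llm_judge_rubric: Optional[str] = None
    extra_evaluators: tuple[Any, ...] = ()
    enable_tools: bool = False
    agent_class: Optional[type] = None


DATASETS_DIR = Path("chat/evals/datasets")  # assumed relative location

# Maps dataset name to its config; the values below are placeholders.
REGISTRY: dict[str, EvalConfig] = {
    "url_hallucination": EvalConfig(
        dataset_path=DATASETS_DIR / "url_hallucination.yaml",
        llm_judge_rubric="The answer contains no invented http(s) URLs.",
    ),
    "self_documentation": EvalConfig(
        dataset_path=DATASETS_DIR / "self_documentation.yaml",
        enable_tools=True,
    ),
}


def get_config(name: str) -> EvalConfig:
    """Look up a dataset config, listing the known names on failure."""
    try:
        return REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(REGISTRY))
        raise ValueError(f"Unknown dataset {name!r}; known datasets: {known}") from None
```

With this shape, adding a dataset amounts to one YAML file, one config, and one registry entry, which matches the growth path stated in the PR description.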

Albert ChatCompletion Tool-Call Normalization

| Layer | File(s) | Summary |
|---|---|---|
| Albert ChatCompletion validation override | `src/backend/chat/providers/albert_models.py` | Adds `_validate_completion` to normalize malformed tool-call `type` values to allowed values before `_ChatCompletion.model_validate`. |
| Tests for ChatCompletion normalization | `src/backend/chat/tests/agents/test_albert_models.py` | Fixture and tests asserting normalization of `type=None` to `"function"` and preservation of existing `"function"` types. |
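
As a rough illustration of the normalization step, the sketch below works on a plain dict payload rather than on the provider's response object. The `normalize_tool_call_types` helper is hypothetical, and treating every value outside `"function"` and `"custom"` as `"function"` is an assumption; the review thread only confirms that `None` becomes `"function"` and that `"function"` and `"custom"` are accepted.

```python
from copy import deepcopy
from typing import Any

# Values the thread describes as valid tool-call types.
ALLOWED_TOOL_CALL_TYPES = {"function", "custom"}


def normalize_tool_call_types(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a chat-completion payload whose tool calls all have a valid type."""
    normalized = deepcopy(payload)  # leave the caller's payload untouched
    for choice in normalized.get("choices", []):
        message = choice.get("message") or {}
        for tool_call in message.get("tool_calls") or []:
            if tool_call.get("type") not in ALLOWED_TOOL_CALL_TYPES:
                # Assumption: any missing or unknown type falls back to "function".
                tool_call["type"] = "function"
    return normalized
```

Running the normalization before schema validation means a response with `type=None` no longer fails validation, which is what lets the evals run against the Albert provider in non-stream mode.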

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

✨(evals) behavioral eval framework for ConversationAgent #484 — PR implements the behavioral evals framework and run_evals command described by the issue.

Suggested labels

enhancement, backend

Suggested reviewers

providenz
qbey

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title mentions 'add run_eval management command' which is the primary change, but uses an emoji prefix and doesn't capture the broader scope of the PR which introduces a full behavioral evaluation system with datasets, configs, evaluators, and tooling. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/backend/chat/evals/README.md (1)

169-185: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

make_task_fn docs are out of sync with the current config API.

This section instructs passing make_task_fn into EvalConfig, but EvalConfig in this PR exposes agent_class and does not define make_task_fn.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/evals/README.md` around lines 169 - 185, The docs for
make_task_fn are out of date: they tell readers to pass make_task_fn into
EvalConfig but the current API uses agent_class (and no make_task_fn). Update
the README section to reflect the new config API by explaining how to provide a
custom agent via EvalConfig.agent_class (mentioning agent_class and
MyCustomAgent/self_documentation as examples) and show the new usage pattern
(describe that EvalConfig now accepts agent_class=MyCustomAgent instead of
make_task_fn). Also remove or mark deprecated references to make_task_fn to
avoid confusion.

🧹 Nitpick comments (1)

src/backend/chat/tests/agents/test_albert_models.py (1)

223-228: ⚡ Quick win

Add coverage for "custom" tool-call type pass-through.

_validate_completion treats "custom" as valid, but this branch is currently untested. Adding one test will lock that contract.

Proposed test addition

 def test_validate_completion_preserves_function_tool_call_type(albert_model):
     """Tool calls already typed as 'function' pass through unchanged."""
     response = _make_chat_completion(tool_call_type="function")
     result = albert_model._validate_completion(response)
     assert result.choices[0].message.tool_calls[0].type == "function"
+
+
+def test_validate_completion_preserves_custom_tool_call_type(albert_model):
+    """Tool calls typed as 'custom' pass through unchanged."""
+    response = _make_chat_completion(tool_call_type="custom")
+    result = albert_model._validate_completion(response)
+    assert result.choices[0].message.tool_calls[0].type == "custom"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/tests/agents/test_albert_models.py` around lines 223 - 228,
Add a new test mirroring
test_validate_completion_preserves_function_tool_call_type to cover the "custom"
tool-call type: call _make_chat_completion(tool_call_type="custom"), pass the
response into albert_model._validate_completion(response), and assert that
result.choices[0].message.tool_calls[0].type == "custom"; name the test e.g.
test_validate_completion_preserves_custom_tool_call_type to make the purpose
explicit and ensure the "custom" branch is exercised.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.md`:
- Line 302: Update the changelog entry that currently reads "run_eval" to the
correct command name "run_evals" so the CHANGELOG.md matches the actual command
introduced by this PR; locate the line containing "🔧(evals) add run_eval
management command" and change "run_eval" to "run_evals".

In `@src/backend/chat/evals/configs/self_documentation.py`:
- Line 15: The Tool created from the function _self_documentation is getting the
function's name (including the leading underscore) as the tool name, causing
eval assertions to expect "self_documentation" to fail; fix this by passing an
explicit name="self_documentation" when constructing the Tool (i.e., set the
Tool's name parameter where _self_documentation is wrapped/registered) so the
tool name matches the eval span contract.

In `@src/backend/chat/evals/configs/url_hallucination.py`:
- Around line 12-23: The rubric in url_hallucination.py currently only accepts
URLs that appear verbatim in the tool output, but this eval also allows verbatim
URLs from the user message; update the rubric string (the URL hallucination
prompt/definition text in url_hallucination.py) to accept URLs that appear
verbatim in either the tool output OR the user message as valid (adjust the PASS
bullets and the FAIL condition to only mark as FAIL when a http/https URL
appears that is absent from both tool output and the user message), and add a
clarifying sentence that user-provided verbatim URLs count as non-hallucinated.

In `@src/backend/chat/evals/evaluators/url_regex.py`:
- Line 1: Update the module docstring and the evaluator's failure message to
list all allowed URL sources (both tool_output and user_message) instead of only
"tool output"; locate the url_regex.py evaluator (module docstring at top) and
the place where it constructs the failure reason (e.g., the evaluator's
evaluate()/generate_failure_message()/reason variable) and change wording to
explicitly mention both "tool_output" and "user_message" as allowed sources.

In `@src/backend/chat/evals/README.md`:
- Around line 27-30: UrlRegexEvaluator currently accepts URLs found in
user_message but the dataset requires URLs only come from tool output; update
the evaluator logic in UrlRegexEvaluator to ignore URLs extracted from the
user's message and only consider URLs present in provided tool outputs (use the
tool output payload/fields passed into the evaluator, e.g. the list/array of
tool results) when determining a match; ensure any helper like extractUrls or
matchUrls is refactored to accept a source parameter or to be called only with
tool outputs, and add/update unit tests for UrlRegexEvaluator to cover the
user_message vs tool output cases.
- Around line 9-23: Update the fenced "Structure" code block in the README so
the opening fence includes a language identifier (e.g., change ``` to ```text)
to satisfy markdownlint MD040; locate the block that shows the directory tree
under the chat/evals header and modify its opening backticks to use "text" (or
another appropriate language) while leaving the block content unchanged.

In `@src/backend/chat/management/commands/run_evals.py`:
- Around line 54-58: The --runs argparse option currently allows zero or
negative values; update the add_argument for "--runs" (where type=int and
default=1 are set) to validate that the provided value is a strictly positive
integer (e.g., use a custom argparse type or validate args.runs after parsing
and raise argparse.ArgumentTypeError or exit with a clear error) and ensure any
use of args.runs (such as passed into repeat) assumes >0; this will prevent
passing 0/negative values into repeat and produce a clear user-facing error when
the flag is invalid.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6d19e392-57fe-42f7-92db-b19a0e6d564d

📥 Commits

Reviewing files that changed from the base of the PR and between 5e0e408 and 2b2ad04.


coderabbitai · 2026-05-19T12:49:05Z

 - ✨(langfuse) allow user to score messages from LLM #6
 - ✨(onboarding) add activation code logic for launch #62
 - 💄(chat) add code highlighting for LLM responses #67
+- 🔧(evals) add run_eval management command


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix command name typo in changelog entry.

The added entry says run_eval, but the command introduced in this PR is run_evals.

💡 Proposed fix

-- 🔧(evals) add run_eval management command
+- 🔧(evals) add run_evals management command


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@CHANGELOG.md` at line 302, Update the changelog entry that currently reads "run_eval" to the correct command name "run_evals" so the CHANGELOG.md matches the actual command introduced by this PR; locate the line containing "🔧(evals) add run_eval management command" and change "run_eval" to "run_evals".

coderabbitai · 2026-05-19T12:49:06Z

+| Dataset | What it tests | Evaluators |
+|---|---|---|
+| `url_hallucination` | The agent never invents `http(s)://` URLs; only uses URLs from tool output | `UrlRegexEvaluator` (regex) + `LLMJudge` (semantic) |
+| `self_documentation` | The `self_documentation` tool is called when and only when the user asks about the assistant itself | `HasMatchingSpan` per case (span-based) |


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Dataset description is stricter than the implemented evaluator.

The table says URLs must come only from tool output, but the evaluator also allows URLs present in user_message. This can mislead triage when reading failures.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/backend/chat/evals/README.md` around lines 27 - 30, UrlRegexEvaluator currently accepts URLs found in user_message but the dataset requires URLs only come from tool output; update the evaluator logic in UrlRegexEvaluator to ignore URLs extracted from the user's message and only consider URLs present in provided tool outputs (use the tool output payload/fields passed into the evaluator, e.g. the list/array of tool results) when determining a match; ensure any helper like extractUrls or matchUrls is refactored to accept a source parameter or to be called only with tool outputs, and add/update unit tests for UrlRegexEvaluator to cover the user_message vs tool output cases.

coderabbitai · 2026-05-19T12:49:06Z

+            "--runs",
+            type=int,
+            default=1,
+            help="Number of times to run each case (default: 1). Use > 1 to measure consistency.",
+        )


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate --runs as a strictly positive integer.

--runs currently accepts 0 or negative values, which can lead to misleading/no-op evaluations when passed to repeat.

💡 Proposed fix

 def handle(self, *args, **options):
     logfire.configure(send_to_logfire=False)
+    if options["runs"] < 1:
+        raise CommandError("--runs must be >= 1.")

Also applies to: 131-131

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/backend/chat/management/commands/run_evals.py` around lines 54 - 58, The --runs argparse option currently allows zero or negative values; update the add_argument for "--runs" (where type=int and default=1 are set) to validate that the provided value is a strictly positive integer (e.g., use a custom argparse type or validate args.runs after parsing and raise argparse.ArgumentTypeError or exit with a clear error) and ensure any use of args.runs (such as passed into repeat) assumes >0; this will prevent passing 0/negative values into repeat and produce a clear user-facing error when the flag is invalid.
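
The alternative mentioned in the prompt above, a custom argparse type, can be sketched as follows. The `positive_int` and `build_parser` names are hypothetical; in a Django management command the same `type=` argument would be passed inside `add_arguments`.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type that accepts only integers >= 1."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer") from None
    if number < 1:
        raise argparse.ArgumentTypeError(f"{value!r} must be >= 1")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-alone parser exposing only the --runs option."""
    parser = argparse.ArgumentParser(prog="run_evals")
    parser.add_argument(
        "--runs",
        type=positive_int,
        default=1,
        help="Number of times to run each case (default: 1). Use > 1 to measure consistency.",
    )
    return parser
```

Validating at parse time rejects `--runs 0` with a usage error before any dataset is loaded, whereas the `CommandError` check in `handle` reports the problem slightly later; either approach keeps invalid values away from `repeat`.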

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

src/backend/chat/evals/README.md (2)
9-23: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language identifier to the fenced code block.

The code fence at line 9 still lacks a language identifier, triggering markdownlint MD040. Adding text or plaintext will resolve the linting warning.
📝 Proposed fix
-```
+```text
 chat/evals/
 ├── configs/
 │   ├── __init__.py          # REGISTRY — maps dataset name → EvalConfig
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/evals/README.md` around lines 9 - 23, The fenced code block
in src/backend/chat/evals/README.md is missing a language identifier (triggering
MD040); update the opening triple-backtick of the directory tree block to
include a language token such as ```text or ```plaintext so the block becomes a
proper fenced code block with a language identifier; locate the block showing
"chat/evals/" and change its opening fence accordingly.
29-29: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify that URLs from user input are also allowed.

The description states URLs must come "only from tool output," but if the evaluator also accepts URLs present in user_message (as flagged in previous review), this description is misleading. Consider updating to "only uses URLs from tool output or user input" to accurately reflect the evaluator behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/evals/README.md` at line 29, Update the README description
for the `url_hallucination` check to reflect that allowed URLs may come from
both tool outputs and user input: change the phrase "only uses URLs from tool
output" to "only uses URLs from tool output or user input" and ensure the
`UrlRegexEvaluator` and `LLMJudge` references remain intact so readers know
those components perform regex and semantic checks respectively (refer to
`url_hallucination`, `UrlRegexEvaluator`, and `LLMJudge` in the same line).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/backend/chat/management/commands/run_evals.py`:
- Around line 123-126: The _run_dataset function currently has a bool return
annotation and docstring but actually returns an EvaluationReport at the end
(see the return of EvaluationReport around the block that used to be line 145);
update the function signature and docstring of _run_dataset to reflect that it
returns an EvaluationReport (not bool), adjust the type hint to
EvaluationReport, and update the docstring to describe the EvaluationReport
contents so callers and type checks are correct.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0e639ce5-50e8-4a14-a9a1-c4f2d02fb4e4

📥 Commits

Reviewing files that changed from the base of the PR and between 2b2ad04 and 1b29c2f.


✅ Files skipped from review due to trivial changes (1)

CHANGELOG.md

🚧 Files skipped from review as they are similar to previous changes (10)

src/backend/chat/evals/datasets/self_documentation.yaml
Makefile
src/backend/chat/evals/__init__.py
src/backend/chat/evals/configs/url_hallucination.py
src/backend/chat/evals/configs/__init__.py
src/backend/chat/evals/configs/base.py
src/backend/chat/providers/albert_models.py
src/backend/chat/evals/evaluators/__init__.py
src/backend/chat/tests/agents/test_albert_models.py
src/backend/chat/evals/configs/self_documentation.py


sonarqubecloud · 2026-05-19T15:11:08Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

maxenceh force-pushed the maxenceh/setup-eval-llm branch 3 times, most recently from 3baa473 to 2b2ad04 Compare May 19, 2026 12:33

maxenceh changed the title ~~Maxenceh/setup eval llm~~ 🔧(evals) add run_eval management command May 19, 2026

maxenceh marked this pull request as ready for review May 19, 2026 12:39

coderabbitai Bot reviewed May 19, 2026


maxenceh force-pushed the maxenceh/setup-eval-llm branch from 2b2ad04 to 41dd4c6 Compare May 19, 2026 12:59

maxenceh mentioned this pull request May 19, 2026

🐛(backend) improve url hallucination instruction #482

Merged

maxenceh linked an issue May 19, 2026 that may be closed by this pull request

✨(evals) behavioral eval framework for ConversationAgent #484

Open

maxenceh force-pushed the maxenceh/setup-eval-llm branch from 41dd4c6 to 1b29c2f Compare May 19, 2026 14:44

coderabbitai Bot reviewed May 19, 2026


Comment thread src/backend/chat/management/commands/run_evals.py Outdated

🔧(evals) add run_eval management command

4235b6a

Create dataset to evaluate url hallucination and self documentation tool call
Run inside docker with Make target
Add fix for albert non-stream completion to support evaluating with albert provider

maxenceh force-pushed the maxenceh/setup-eval-llm branch from 1b29c2f to 4235b6a Compare May 19, 2026 15:10

maxenceh requested a review from providenz May 25, 2026 08:41

maxenceh wants to merge 1 commit into main from maxenceh/setup-eval-llm
