🔧(evals) add run_eval management command #481

Open
maxenceh wants to merge 1 commit into main from maxenceh/setup-eval-llm

Conversation

@maxenceh
Collaborator

@maxenceh maxenceh commented May 19, 2026

Purpose

This PR introduces a behavioral eval system for ConversationAgent.
These evals test LLM behaviour end-to-end:

  • does the model call the right tool?
  • does it respect a system instruction?

A failing eval means a documented behaviour has regressed.
The system is designed to grow: adding a new dataset requires one YAML file, one config file, and a registry entry.

⚠️ Additional change: adds a wrapper for the Albert model (in non-streaming mode) so that the eval receives the expected output.

Proposal

  • Add make eval / make eval-debug Makefile targets — runs evals inside
    Docker; debug mode exposes debugpy on port 5678 for VS Code remote attach
  • Add run_evals Django management command with --dataset, --case, --runs,
    --verbose, --no-llm-judge arguments
  • Add EvalConfig dataclass registry — each dataset declares its rubric,
    evaluators, and optional custom agent class
  • Dataset: url_hallucination — verifies the agent never invents
    http(s):// URLs; uses UrlRegexEvaluator (deterministic regex) + LLMJudge
    (semantic, skippable with --no-llm-judge for Albert API compatibility)
  • Dataset: self_documentation — verifies the self_documentation tool is called when and only when the user asks about the assistant itself; uses per-case HasMatchingSpan evaluators on OpenTelemetry spans
    (gen_ai.tool.name = "self_documentation")
  • Add UrlRegexEvaluator — strips trailing punctuation before URL
    comparison to avoid false positives
  • Add README.md in chat/evals/ documenting how to run evals and how to add new datasets
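
The registry idea in the proposal can be illustrated with a minimal sketch. The field names come from the walkthrough in this PR (dataset_path, llm_judge_rubric, extra_evaluators, enable_tools, agent_class); the types, defaults, relative paths, rubric text, and the `get_config` helper are assumptions made for illustration and may not match the code in `configs/base.py`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class EvalConfig:
    """Declarative description of one eval dataset (types and defaults are assumed)."""

    dataset_path: Path
    llm_judge_rubric: Optional[str] = None
    extra_evaluators: tuple = ()
    enable_tools: bool = False
    agent_class: Optional[type] = None


# One registry entry per dataset; paths and rubric text are placeholders.
REGISTRY: dict[str, EvalConfig] = {
    "url_hallucination": EvalConfig(
        dataset_path=Path("datasets/url_hallucination.yaml"),
        llm_judge_rubric="(placeholder) The answer contains no invented URL.",
    ),
    "self_documentation": EvalConfig(
        dataset_path=Path("datasets/self_documentation.yaml"),
        enable_tools=True,
    ),
}


def get_config(name: str) -> EvalConfig:
    """Hypothetical lookup helper that fails loudly on an unknown dataset name."""
    try:
        return REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(REGISTRY))
        raise ValueError(f"Unknown dataset {name!r}; known datasets: {known}") from None
```

With this shape, adding a dataset amounts to writing the YAML file, declaring one `EvalConfig`, and adding one key to `REGISTRY`.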

Summary by CodeRabbit

  • New Features

    • Behavioral evaluation framework for conversational agents with self-documentation and URL-hallucination test suites.
    • CLI command to run evaluations with dataset/case selection, verbose output, judge toggle, and repeat runs.
    • Makefile targets to run and debug evaluations.
  • Bug Fixes

    • Improved normalization of tool-call types in the Albert models provider to ensure consistent validation.
  • Documentation

    • Added a comprehensive guide for running, debugging, and authoring behavioral evals.
  • Tests

    • Added unit tests covering tool-call normalization behavior.

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 19, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 400f972e-b56d-4174-b805-72f7b0b7f2ea

📥 Commits

Reviewing files that changed from the base of the PR and between 1b29c2f and 4235b6a.

📒 Files selected for processing (17)
  • CHANGELOG.md
  • Makefile
  • src/backend/chat/evals/README.md
  • src/backend/chat/evals/__init__.py
  • src/backend/chat/evals/configs/__init__.py
  • src/backend/chat/evals/configs/base.py
  • src/backend/chat/evals/configs/self_documentation.py
  • src/backend/chat/evals/configs/url_hallucination.py
  • src/backend/chat/evals/datasets/self_documentation.yaml
  • src/backend/chat/evals/datasets/url_hallucination.yaml
  • src/backend/chat/evals/evaluators/__init__.py
  • src/backend/chat/evals/evaluators/url_regex.py
  • src/backend/chat/management/__init__.py
  • src/backend/chat/management/commands/__init__.py
  • src/backend/chat/management/commands/run_evals.py
  • src/backend/chat/providers/albert_models.py
  • src/backend/chat/tests/agents/test_albert_models.py

Walkthrough

Adds a behavioral evals subsystem (types, EvalConfig + REGISTRY, two eval configs, evaluators, datasets, README, Make targets) and a Django run_evals command to execute datasets; also normalizes Albert ChatCompletion tool-call types and adds tests.

Changes

Behavioral Evals Framework

| Layer | File(s) | Summary |
|---|---|---|
| Evals schema and documentation | `src/backend/chat/evals/__init__.py`, `src/backend/chat/evals/README.md` | Shared Pydantic models EvalInputs and EvalMetadata; README documents evals layout, running, debugging, dataset and evaluator authoring. |
| EvalConfig base and registry | `src/backend/chat/evals/configs/base.py`, `src/backend/chat/evals/configs/__init__.py` | EvalConfig dataclass (dataset_path, llm_judge_rubric, extra_evaluators, enable_tools, agent_class) and typed REGISTRY mapping names to configs. |
| Self-documentation eval implementation | `src/backend/chat/evals/configs/self_documentation.py`, `src/backend/chat/evals/datasets/self_documentation.yaml` | Implements a `_self_documentation` tool and `_SelfDocEvalAgent`, registers SELF_DOCUMENTATION EvalConfig, and adds a dataset asserting tool-call presence/absence across cases. |
| URL hallucination eval with custom evaluator | `src/backend/chat/evals/evaluators/*`, `src/backend/chat/evals/configs/url_hallucination.py`, `src/backend/chat/evals/datasets/url_hallucination.yaml` | Adds UrlRegexEvaluator to detect hallucinated http/https links, wires it into URL_HALLUCINATION EvalConfig with rubric, and supplies tiered dataset cases. |
| Run evals management command | `src/backend/chat/management/commands/run_evals.py` | Django command that loads configs from REGISTRY, composes evaluators (optionally LLMJudge), selects agent class (config override / ConversationAgent / `_EvalAgent`), runs dataset.evaluate_sync (max_concurrency=1, repeat runs), and prints verbosity-controlled reports (does not raise on model assertion failures). |
| Makefile and changelog | `Makefile`, `CHANGELOG.md` | Adds eval and eval-debug Make targets and documents the new management command in CHANGELOG. |

Albert ChatCompletion Tool-Call Normalization

| Layer | File(s) | Summary |
|---|---|---|
| Albert ChatCompletion validation override | `src/backend/chat/providers/albert_models.py` | Adds `_validate_completion` to normalize malformed tool-call type values to allowed values before `_ChatCompletion.model_validate`. |
| Tests for ChatCompletion normalization | `src/backend/chat/tests/agents/test_albert_models.py` | Fixture and tests asserting normalization of type=None to "function" and preservation of existing "function" types. |
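
To make the normalization step concrete, here is a minimal sketch operating on a plain dict payload. It is not the PR's `_validate_completion`; the function name, the dict-based input, and the choice to fall back to "function" for any value outside the allowed set are assumptions. The allowed values "function" and "custom" are taken from the review comments on this PR.

```python
import copy

# Values treated as valid tool-call types, per the review discussion.
ALLOWED_TOOL_CALL_TYPES = {"function", "custom"}


def normalize_tool_call_types(payload: dict) -> dict:
    """Return a copy of a chat-completion payload with tool-call types repaired.

    Any tool call whose "type" is missing, None, or otherwise not allowed is
    rewritten to "function" (assumed fallback). The input is left unmodified.
    """
    normalized = copy.deepcopy(payload)
    for choice in normalized.get("choices", []):
        message = choice.get("message") or {}
        for tool_call in message.get("tool_calls") or []:
            if tool_call.get("type") not in ALLOWED_TOOL_CALL_TYPES:
                tool_call["type"] = "function"
    return normalized
```

Running the repair before schema validation means a strict validator only ever sees values it accepts, which appears to be the intent of normalizing ahead of `model_validate`.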

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

enhancement, backend

Suggested reviewers

  • providenz
  • qbey
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title mentions 'add run_eval management command' which is the primary change, but uses an emoji prefix and doesn't capture the broader scope of the PR which introduces a full behavioral evaluation system with datasets, configs, evaluators, and tooling. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@maxenceh maxenceh force-pushed the maxenceh/setup-eval-llm branch 3 times, most recently from 3baa473 to 2b2ad04, May 19, 2026 12:33
@maxenceh maxenceh changed the title Maxenceh/setup eval llm 🔧(evals) add run_eval management command May 19, 2026
@maxenceh maxenceh marked this pull request as ready for review May 19, 2026 12:39

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/backend/chat/evals/README.md (1)

169-185: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

make_task_fn docs are out of sync with the current config API.

This section instructs passing make_task_fn into EvalConfig, but EvalConfig in this PR exposes agent_class and does not define make_task_fn.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/evals/README.md` around lines 169 - 185, The docs for
make_task_fn are out of date: they tell readers to pass make_task_fn into
EvalConfig but the current API uses agent_class (and no make_task_fn). Update
the README section to reflect the new config API by explaining how to provide a
custom agent via EvalConfig.agent_class (mentioning agent_class and
MyCustomAgent/self_documentation as examples) and show the new usage pattern
(describe that EvalConfig now accepts agent_class=MyCustomAgent instead of
make_task_fn). Also remove or mark deprecated references to make_task_fn to
avoid confusion.
🧹 Nitpick comments (1)
src/backend/chat/tests/agents/test_albert_models.py (1)

223-228: ⚡ Quick win

Add coverage for "custom" tool-call type pass-through.

_validate_completion treats "custom" as valid, but this branch is currently untested. Adding one test will lock that contract.

Proposed test addition
 def test_validate_completion_preserves_function_tool_call_type(albert_model):
     """Tool calls already typed as 'function' pass through unchanged."""
     response = _make_chat_completion(tool_call_type="function")
     result = albert_model._validate_completion(response)
     assert result.choices[0].message.tool_calls[0].type == "function"
+
+
+def test_validate_completion_preserves_custom_tool_call_type(albert_model):
+    """Tool calls typed as 'custom' pass through unchanged."""
+    response = _make_chat_completion(tool_call_type="custom")
+    result = albert_model._validate_completion(response)
+    assert result.choices[0].message.tool_calls[0].type == "custom"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/tests/agents/test_albert_models.py` around lines 223 - 228,
Add a new test mirroring
test_validate_completion_preserves_function_tool_call_type to cover the "custom"
tool-call type: call _make_chat_completion(tool_call_type="custom"), pass the
response into albert_model._validate_completion(response), and assert that
result.choices[0].message.tool_calls[0].type == "custom"; name the test e.g.
test_validate_completion_preserves_custom_tool_call_type to make the purpose
explicit and ensure the "custom" branch is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.md`:
- Line 302: Update the changelog entry that currently reads "run_eval" to the
correct command name "run_evals" so the CHANGELOG.md matches the actual command
introduced by this PR; locate the line containing "🔧(evals) add run_eval
management command" and change "run_eval" to "run_evals".

In `@src/backend/chat/evals/configs/self_documentation.py`:
- Line 15: The Tool created from the function _self_documentation is getting the
function's name (including the leading underscore) as the tool name, causing
eval assertions to expect "self_documentation" to fail; fix this by passing an
explicit name="self_documentation" when constructing the Tool (i.e., set the
Tool's name parameter where _self_documentation is wrapped/registered) so the
tool name matches the eval span contract.

In `@src/backend/chat/evals/configs/url_hallucination.py`:
- Around line 12-23: The rubric in url_hallucination.py currently only accepts
URLs that appear verbatim in the tool output, but this eval also allows verbatim
URLs from the user message; update the rubric string (the URL hallucination
prompt/definition text in url_hallucination.py) to accept URLs that appear
verbatim in either the tool output OR the user message as valid (adjust the PASS
bullets and the FAIL condition to only mark as FAIL when a http/https URL
appears that is absent from both tool output and the user message), and add a
clarifying sentence that user-provided verbatim URLs count as non-hallucinated.

In `@src/backend/chat/evals/evaluators/url_regex.py`:
- Line 1: Update the module docstring and the evaluator's failure message to
list all allowed URL sources (both tool_output and user_message) instead of only
"tool output"; locate the url_regex.py evaluator (module docstring at top) and
the place where it constructs the failure reason (e.g., the evaluator's
evaluate()/generate_failure_message()/reason variable) and change wording to
explicitly mention both "tool_output" and "user_message" as allowed sources.

In `@src/backend/chat/evals/README.md`:
- Around line 27-30: UrlRegexEvaluator currently accepts URLs found in
user_message but the dataset requires URLs only come from tool output; update
the evaluator logic in UrlRegexEvaluator to ignore URLs extracted from the
user's message and only consider URLs present in provided tool outputs (use the
tool output payload/fields passed into the evaluator, e.g. the list/array of
tool results) when determining a match; ensure any helper like extractUrls or
matchUrls is refactored to accept a source parameter or to be called only with
tool outputs, and add/update unit tests for UrlRegexEvaluator to cover the
user_message vs tool output cases.
- Around line 9-23: Update the fenced "Structure" code block in the README so
the opening fence includes a language identifier (e.g., change ``` to ```text)
to satisfy markdownlint MD040; locate the block that shows the directory tree
under the chat/evals header and modify its opening backticks to use "text" (or
another appropriate language) while leaving the block content unchanged.

In `@src/backend/chat/management/commands/run_evals.py`:
- Around line 54-58: The --runs argparse option currently allows zero or
negative values; update the add_argument for "--runs" (where type=int and
default=1 are set) to validate that the provided value is a strictly positive
integer (e.g., use a custom argparse type or validate args.runs after parsing
and raise argparse.ArgumentTypeError or exit with a clear error) and ensure any
use of args.runs (such as passed into repeat) assumes >0; this will prevent
passing 0/negative values into repeat and produce a clear user-facing error when
the flag is invalid.
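
The tool-naming finding for `self_documentation.py` in the list above can be reproduced with a small stand-in. `SketchTool` is hypothetical and only mimics the common pattern where a tool wrapper defaults its name to the wrapped function's `__name__`; it is not the agent framework's real class, and the function body is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SketchTool:
    """Hypothetical tool wrapper: the name defaults to the function's __name__."""

    function: Callable[..., str]
    name: Optional[str] = None

    def __post_init__(self) -> None:
        if self.name is None:
            self.name = self.function.__name__


def _self_documentation() -> str:
    """Placeholder body standing in for the real tool."""
    return "I am an assistant."


# Without an explicit name, the leading underscore leaks into the tool name.
implicit = SketchTool(_self_documentation)
# An explicit name matches what the dataset asserts on the span attribute.
explicit = SketchTool(_self_documentation, name="self_documentation")
```

Under this assumption, a span assertion on `gen_ai.tool.name = "self_documentation"` would only match the `explicit` variant.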

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6d19e392-57fe-42f7-92db-b19a0e6d564d

📥 Commits

Reviewing files that changed from the base of the PR and between 5e0e408 and 2b2ad04.

Comment thread CHANGELOG.md
- ✨(langfuse) allow user to score messages from LLM #6
- ✨(onboarding) add activation code logic for launch #62
- 💄(chat) add code highlighting for LLM responses #67
- 🔧(evals) add run_eval management command

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix command name typo in changelog entry.

The added entry says run_eval, but the command introduced in this PR is run_evals.

💡 Proposed fix
-- 🔧(evals) add run_eval management command
+- 🔧(evals) add run_evals management command
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CHANGELOG.md` at line 302, Update the changelog entry that currently reads
"run_eval" to the correct command name "run_evals" so the CHANGELOG.md matches
the actual command introduced by this PR; locate the line containing "🔧(evals)
add run_eval management command" and change "run_eval" to "run_evals".

Comment thread src/backend/chat/evals/configs/self_documentation.py
Comment thread src/backend/chat/evals/configs/url_hallucination.py
Comment thread src/backend/chat/evals/evaluators/url_regex.py Outdated
Comment thread src/backend/chat/evals/README.md Outdated
Comment on lines +27 to +30
| Dataset | What it tests | Evaluators |
|---|---|---|
| `url_hallucination` | The agent never invents `http(s)://` URLs; only uses URLs from tool output | `UrlRegexEvaluator` (regex) + `LLMJudge` (semantic) |
| `self_documentation` | The `self_documentation` tool is called when and only when the user asks about the assistant itself | `HasMatchingSpan` per case (span-based) |

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Dataset description is stricter than the implemented evaluator.

The table says URLs must come only from tool output, but the evaluator also allows URLs present in user_message. This can mislead triage when reading failures.
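
As a rough illustration of the behaviour described here, the sketch below extracts URLs with a regex, strips trailing punctuation before comparison (as the PR description states), and treats URLs from either the tool output or the user message as allowed. The function names, the exact regex, and the punctuation set are assumptions; the real `UrlRegexEvaluator` may differ.

```python
import re

# Deliberately simple pattern: scheme plus everything up to whitespace or angle brackets.
URL_RE = re.compile(r"https?://[^\s<>]+")
# Characters that commonly trail a URL in prose but are not part of it (assumed set).
TRAILING_PUNCTUATION = ".,;:!?)]}'\""


def extract_urls(text: str) -> set[str]:
    """Return the URLs found in text, with trailing punctuation removed."""
    return {match.rstrip(TRAILING_PUNCTUATION) for match in URL_RE.findall(text)}


def hallucinated_urls(answer: str, tool_output: str, user_message: str) -> set[str]:
    """Return URLs in the answer that appear in neither allowed source."""
    allowed = extract_urls(tool_output) | extract_urls(user_message)
    return extract_urls(answer) - allowed
```

An empty result means the case passes; any remaining URL is one the agent introduced on its own. Stripping the punctuation on both sides keeps "https://example.org/docs." in a sentence from being reported as different from "https://example.org/docs" in the tool output.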

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/evals/README.md` around lines 27 - 30, UrlRegexEvaluator
currently accepts URLs found in user_message but the dataset requires URLs only
come from tool output; update the evaluator logic in UrlRegexEvaluator to ignore
URLs extracted from the user's message and only consider URLs present in
provided tool outputs (use the tool output payload/fields passed into the
evaluator, e.g. the list/array of tool results) when determining a match; ensure
any helper like extractUrls or matchUrls is refactored to accept a source
parameter or to be called only with tool outputs, and add/update unit tests for
UrlRegexEvaluator to cover the user_message vs tool output cases.

Comment on lines +54 to +58
"--runs",
type=int,
default=1,
help="Number of times to run each case (default: 1). Use > 1 to measure consistency.",
)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate --runs as a strictly positive integer.

--runs currently accepts 0 or negative values, which can lead to misleading/no-op evaluations when passed to repeat.

💡 Proposed fix
     def handle(self, *args, **options):
         logfire.configure(send_to_logfire=False)
+        if options["runs"] < 1:
+            raise CommandError("--runs must be >= 1.")

Also applies to: 131-131
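
An alternative to checking inside `handle` is to validate at parse time with a custom argparse type, which the review prompt also mentions as an option. The sketch below uses a standalone `argparse.ArgumentParser` rather than a Django command so that it runs on its own; `positive_int` and `build_parser` are illustrative names, not code from this PR.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type that accepts only integers >= 1."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer") from None
    if number < 1:
        raise argparse.ArgumentTypeError(f"must be >= 1, got {number}")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Standalone parser mirroring the --runs option of the command."""
    parser = argparse.ArgumentParser(prog="run_evals")
    parser.add_argument(
        "--runs",
        type=positive_int,
        default=1,
        help="Number of times to run each case (default: 1).",
    )
    return parser
```

Because a Django command's `add_arguments` receives an argparse-based parser, the same `type=positive_int` argument should carry over unchanged. With this approach an invalid value is rejected with a usage error before `handle` runs.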

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/management/commands/run_evals.py` around lines 54 - 58, The
--runs argparse option currently allows zero or negative values; update the
add_argument for "--runs" (where type=int and default=1 are set) to validate
that the provided value is a strictly positive integer (e.g., use a custom
argparse type or validate args.runs after parsing and raise
argparse.ArgumentTypeError or exit with a clear error) and ensure any use of
args.runs (such as passed into repeat) assumes >0; this will prevent passing
0/negative values into repeat and produce a clear user-facing error when the
flag is invalid.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/backend/chat/evals/README.md (2)

9-23: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language identifier to the fenced code block.

The code fence at line 9 still lacks a language identifier, triggering markdownlint MD040. Adding text or plaintext will resolve the linting warning.

📝 Proposed fix
-```
+```text
 chat/evals/
 ├── configs/
 │   ├── __init__.py          # REGISTRY — maps dataset name → EvalConfig
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/evals/README.md` around lines 9 - 23, The fenced code block
in src/backend/chat/evals/README.md is missing a language identifier (triggering
MD040); update the opening triple-backtick of the directory tree block to
include a language token such as ```text or ```plaintext so the block becomes a
proper fenced code block with a language identifier; locate the block showing
"chat/evals/" and change its opening fence accordingly.

29-29: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify that URLs from user input are also allowed.

The description states URLs must come "only from tool output," but if the evaluator also accepts URLs present in user_message (as flagged in previous review), this description is misleading. Consider updating to "only uses URLs from tool output or user input" to accurately reflect the evaluator behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/chat/evals/README.md` at line 29, Update the README description
for the `url_hallucination` check to reflect that allowed URLs may come from
both tool outputs and user input: change the phrase "only uses URLs from tool
output" to "only uses URLs from tool output or user input" and ensure the
`UrlRegexEvaluator` and `LLMJudge` references remain intact so readers know
those components perform regex and semantic checks respectively (refer to
`url_hallucination`, `UrlRegexEvaluator`, and `LLMJudge` in the same line).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/backend/chat/management/commands/run_evals.py`:
- Around line 123-126: The _run_dataset function currently has a bool return
annotation and docstring but actually returns an EvaluationReport at the end
(see the return of EvaluationReport around the block that used to be line 145);
update the function signature and docstring of _run_dataset to reflect that it
returns an EvaluationReport (not bool), adjust the type hint to
EvaluationReport, and update the docstring to describe the EvaluationReport
contents so callers and type checks are correct.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0e639ce5-50e8-4a14-a9a1-c4f2d02fb4e4

📥 Commits

Reviewing files that changed from the base of the PR and between 2b2ad04 and 1b29c2f.

✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (10)
  • src/backend/chat/evals/datasets/self_documentation.yaml
  • Makefile
  • src/backend/chat/evals/__init__.py
  • src/backend/chat/evals/configs/url_hallucination.py
  • src/backend/chat/evals/configs/__init__.py
  • src/backend/chat/evals/configs/base.py
  • src/backend/chat/providers/albert_models.py
  • src/backend/chat/evals/evaluators/__init__.py
  • src/backend/chat/tests/agents/test_albert_models.py
  • src/backend/chat/evals/configs/self_documentation.py

Comment thread src/backend/chat/management/commands/run_evals.py Outdated
Create dataset to evaluate url hallucination
and self documentation tool call
Run inside docker with Make target
Add fix for albert non-stream completion to support
evaluating with albert provider
@maxenceh maxenceh force-pushed the maxenceh/setup-eval-llm branch from 1b29c2f to 4235b6a, May 19, 2026 15:10
@maxenceh maxenceh requested a review from providenz May 25, 2026 08:41

Development

Successfully merging this pull request may close these issues.

✨(evals) behavioral eval framework for ConversationAgent

1 participant