🔧(evals) add run_eval management command#481
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (17)
WalkthroughAdds a behavioral evals subsystem (types, EvalConfig + REGISTRY, two eval configs, evaluators, datasets, README, Make targets) and a Django ChangesBehavioral Evals Framework
Albert ChatCompletion Tool-Call Normalization
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
3baa473 to
2b2ad04
Compare
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/backend/chat/evals/README.md (1)
169-185:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
make_task_fndocs are out of sync with the current config API.This section instructs passing
make_task_fnintoEvalConfig, butEvalConfigin this PR exposesagent_classand does not definemake_task_fn.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/backend/chat/evals/README.md` around lines 169 - 185, The docs for make_task_fn are out of date: they tell readers to pass make_task_fn into EvalConfig but the current API uses agent_class (and no make_task_fn). Update the README section to reflect the new config API by explaining how to provide a custom agent via EvalConfig.agent_class (mentioning agent_class and MyCustomAgent/self_documentation as examples) and show the new usage pattern (describe that EvalConfig now accepts agent_class=MyCustomAgent instead of make_task_fn). Also remove or mark deprecated references to make_task_fn to avoid confusion.
🧹 Nitpick comments (1)
src/backend/chat/tests/agents/test_albert_models.py (1)
223-228: ⚡ Quick winAdd coverage for
"custom"tool-call type pass-through.
_validate_completiontreats"custom"as valid, but this branch is currently untested. Adding one test will lock that contract.Proposed test addition
def test_validate_completion_preserves_function_tool_call_type(albert_model): """Tool calls already typed as 'function' pass through unchanged.""" response = _make_chat_completion(tool_call_type="function") result = albert_model._validate_completion(response) assert result.choices[0].message.tool_calls[0].type == "function" + + +def test_validate_completion_preserves_custom_tool_call_type(albert_model): + """Tool calls typed as 'custom' pass through unchanged.""" + response = _make_chat_completion(tool_call_type="custom") + result = albert_model._validate_completion(response) + assert result.choices[0].message.tool_calls[0].type == "custom"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/backend/chat/tests/agents/test_albert_models.py` around lines 223 - 228, Add a new test mirroring test_validate_completion_preserves_function_tool_call_type to cover the "custom" tool-call type: call _make_chat_completion(tool_call_type="custom"), pass the response into albert_model._validate_completion(response), and assert that result.choices[0].message.tool_calls[0].type == "custom"; name the test e.g. test_validate_completion_preserves_custom_tool_call_type to make the purpose explicit and ensure the "custom" branch is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CHANGELOG.md`:
- Line 302: Update the changelog entry that currently reads "run_eval" to the
correct command name "run_evals" so the CHANGELOG.md matches the actual command
introduced by this PR; locate the line containing "🔧(evals) add run_eval
management command" and change "run_eval" to "run_evals".
In `@src/backend/chat/evals/configs/self_documentation.py`:
- Line 15: The Tool created from the function _self_documentation is getting the
function's name (including the leading underscore) as the tool name, causing
eval assertions to expect "self_documentation" to fail; fix this by passing an
explicit name="self_documentation" when constructing the Tool (i.e., set the
Tool's name parameter where _self_documentation is wrapped/registered) so the
tool name matches the eval span contract.
In `@src/backend/chat/evals/configs/url_hallucination.py`:
- Around line 12-23: The rubric in url_hallucination.py currently only accepts
URLs that appear verbatim in the tool output, but this eval also allows verbatim
URLs from the user message; update the rubric string (the URL hallucination
prompt/definition text in url_hallucination.py) to accept URLs that appear
verbatim in either the tool output OR the user message as valid (adjust the PASS
bullets and the FAIL condition to only mark as FAIL when a http/https URL
appears that is absent from both tool output and the user message), and add a
clarifying sentence that user-provided verbatim URLs count as non-hallucinated.
In `@src/backend/chat/evals/evaluators/url_regex.py`:
- Line 1: Update the module docstring and the evaluator's failure message to
list all allowed URL sources (both tool_output and user_message) instead of only
"tool output"; locate the url_regex.py evaluator (module docstring at top) and
the place where it constructs the failure reason (e.g., the evaluator's
evaluate()/generate_failure_message()/reason variable) and change wording to
explicitly mention both "tool_output" and "user_message" as allowed sources.
In `@src/backend/chat/evals/README.md`:
- Around line 27-30: UrlRegexEvaluator currently accepts URLs found in
user_message but the dataset requires URLs only come from tool output; update
the evaluator logic in UrlRegexEvaluator to ignore URLs extracted from the
user's message and only consider URLs present in provided tool outputs (use the
tool output payload/fields passed into the evaluator, e.g. the list/array of
tool results) when determining a match; ensure any helper like extractUrls or
matchUrls is refactored to accept a source parameter or to be called only with
tool outputs, and add/update unit tests for UrlRegexEvaluator to cover the
user_message vs tool output cases.
- Around line 9-23: Update the fenced "Structure" code block in the README so
the opening fence includes a language identifier (e.g., change ``` to ```text)
to satisfy markdownlint MD040; locate the block that shows the directory tree
under the chat/evals header and modify its opening backticks to use "text" (or
another appropriate language) while leaving the block content unchanged.
In `@src/backend/chat/management/commands/run_evals.py`:
- Around line 54-58: The --runs argparse option currently allows zero or
negative values; update the add_argument for "--runs" (where type=int and
default=1 are set) to validate that the provided value is a strictly positive
integer (e.g., use a custom argparse type or validate args.runs after parsing
and raise argparse.ArgumentTypeError or exit with a clear error) and ensure any
use of args.runs (such as passed into repeat) assumes >0; this will prevent
passing 0/negative values into repeat and produce a clear user-facing error when
the flag is invalid.
---
Outside diff comments:
In `@src/backend/chat/evals/README.md`:
- Around line 169-185: The docs for make_task_fn are out of date: they tell
readers to pass make_task_fn into EvalConfig but the current API uses
agent_class (and no make_task_fn). Update the README section to reflect the new
config API by explaining how to provide a custom agent via
EvalConfig.agent_class (mentioning agent_class and
MyCustomAgent/self_documentation as examples) and show the new usage pattern
(describe that EvalConfig now accepts agent_class=MyCustomAgent instead of
make_task_fn). Also remove or mark deprecated references to make_task_fn to
avoid confusion.
---
Nitpick comments:
In `@src/backend/chat/tests/agents/test_albert_models.py`:
- Around line 223-228: Add a new test mirroring
test_validate_completion_preserves_function_tool_call_type to cover the "custom"
tool-call type: call _make_chat_completion(tool_call_type="custom"), pass the
response into albert_model._validate_completion(response), and assert that
result.choices[0].message.tool_calls[0].type == "custom"; name the test e.g.
test_validate_completion_preserves_custom_tool_call_type to make the purpose
explicit and ensure the "custom" branch is exercised.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 6d19e392-57fe-42f7-92db-b19a0e6d564d
📒 Files selected for processing (17)
CHANGELOG.mdMakefilesrc/backend/chat/evals/README.mdsrc/backend/chat/evals/__init__.pysrc/backend/chat/evals/configs/__init__.pysrc/backend/chat/evals/configs/base.pysrc/backend/chat/evals/configs/self_documentation.pysrc/backend/chat/evals/configs/url_hallucination.pysrc/backend/chat/evals/datasets/self_documentation.yamlsrc/backend/chat/evals/datasets/url_hallucination.yamlsrc/backend/chat/evals/evaluators/__init__.pysrc/backend/chat/evals/evaluators/url_regex.pysrc/backend/chat/management/__init__.pysrc/backend/chat/management/commands/__init__.pysrc/backend/chat/management/commands/run_evals.pysrc/backend/chat/providers/albert_models.pysrc/backend/chat/tests/agents/test_albert_models.py
| - ✨(langfuse) allow user to score messages from LLM #6 | ||
| - ✨(onboarding) add activation code logic for launch #62 | ||
| - 💄(chat) add code highlighting for LLM responses #67 | ||
| - 🔧(evals) add run_eval management command |
There was a problem hiding this comment.
Fix command name typo in changelog entry.
The added entry says run_eval, but the command introduced in this PR is run_evals.
💡 Proposed fix
-- 🔧(evals) add run_eval management command
+- 🔧(evals) add run_evals management command📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - 🔧(evals) add run_eval management command | |
| - 🔧(evals) add run_evals management command |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@CHANGELOG.md` at line 302, Update the changelog entry that currently reads
"run_eval" to the correct command name "run_evals" so the CHANGELOG.md matches
the actual command introduced by this PR; locate the line containing "🔧(evals)
add run_eval management command" and change "run_eval" to "run_evals".
| | Dataset | What it tests | Evaluators | | ||
| |---|---|---| | ||
| | `url_hallucination` | The agent never invents `http(s)://` URLs; only uses URLs from tool output | `UrlRegexEvaluator` (regex) + `LLMJudge` (semantic) | | ||
| | `self_documentation` | The `self_documentation` tool is called when and only when the user asks about the assistant itself | `HasMatchingSpan` per case (span-based) | |
There was a problem hiding this comment.
Dataset description is stricter than the implemented evaluator.
The table says URLs must come only from tool output, but the evaluator also allows URLs present in user_message. This can mislead triage when reading failures.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/backend/chat/evals/README.md` around lines 27 - 30, UrlRegexEvaluator
currently accepts URLs found in user_message but the dataset requires URLs only
come from tool output; update the evaluator logic in UrlRegexEvaluator to ignore
URLs extracted from the user's message and only consider URLs present in
provided tool outputs (use the tool output payload/fields passed into the
evaluator, e.g. the list/array of tool results) when determining a match; ensure
any helper like extractUrls or matchUrls is refactored to accept a source
parameter or to be called only with tool outputs, and add/update unit tests for
UrlRegexEvaluator to cover the user_message vs tool output cases.
| "--runs", | ||
| type=int, | ||
| default=1, | ||
| help="Number of times to run each case (default: 1). Use > 1 to measure consistency.", | ||
| ) |
There was a problem hiding this comment.
Validate --runs as a strictly positive integer.
--runs currently accepts 0 or negative values, which can lead to misleading/no-op evaluations when passed to repeat.
💡 Proposed fix
def handle(self, *args, **options):
logfire.configure(send_to_logfire=False)
+ if options["runs"] < 1:
+ raise CommandError("--runs must be >= 1.")Also applies to: 131-131
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/backend/chat/management/commands/run_evals.py` around lines 54 - 58, The
--runs argparse option currently allows zero or negative values; update the
add_argument for "--runs" (where type=int and default=1 are set) to validate
that the provided value is a strictly positive integer (e.g., use a custom
argparse type or validate args.runs after parsing and raise
argparse.ArgumentTypeError or exit with a clear error) and ensure any use of
args.runs (such as passed into repeat) assumes >0; this will prevent passing
0/negative values into repeat and produce a clear user-facing error when the
flag is invalid.
2b2ad04 to
41dd4c6
Compare
41dd4c6 to
1b29c2f
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/backend/chat/evals/README.md (2)
9-23:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd a language identifier to the fenced code block.
The code fence at line 9 still lacks a language identifier, triggering markdownlint MD040. Adding
textorplaintextwill resolve the linting warning.📝 Proposed fix
-``` +```text chat/evals/ ├── configs/ │ ├── __init__.py # REGISTRY — maps dataset name → EvalConfig🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/backend/chat/evals/README.md` around lines 9 - 23, The fenced code block in src/backend/chat/evals/README.md is missing a language identifier (triggering MD040); update the opening triple-backtick of the directory tree block to include a language token such as ```text or ```plaintext so the block becomes a proper fenced code block with a language identifier; locate the block showing "chat/evals/" and change its opening fence accordingly.
29-29:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winClarify that URLs from user input are also allowed.
The description states URLs must come "only from tool output," but if the evaluator also accepts URLs present in
user_message(as flagged in previous review), this description is misleading. Consider updating to "only uses URLs from tool output or user input" to accurately reflect the evaluator behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/backend/chat/evals/README.md` at line 29, Update the README description for the `url_hallucination` check to reflect that allowed URLs may come from both tool outputs and user input: change the phrase "only uses URLs from tool output" to "only uses URLs from tool output or user input" and ensure the `UrlRegexEvaluator` and `LLMJudge` references remain intact so readers know those components perform regex and semantic checks respectively (refer to `url_hallucination`, `UrlRegexEvaluator`, and `LLMJudge` in the same line).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/backend/chat/management/commands/run_evals.py`:
- Around line 123-126: The _run_dataset function currently has a bool return
annotation and docstring but actually returns an EvaluationReport at the end
(see the return of EvaluationReport around the block that used to be line 145);
update the function signature and docstring of _run_dataset to reflect that it
returns an EvaluationReport (not bool), adjust the type hint to
EvaluationReport, and update the docstring to describe the EvaluationReport
contents so callers and type checks are correct.
---
Duplicate comments:
In `@src/backend/chat/evals/README.md`:
- Around line 9-23: The fenced code block in src/backend/chat/evals/README.md is
missing a language identifier (triggering MD040); update the opening
triple-backtick of the directory tree block to include a language token such as
```text or ```plaintext so the block becomes a proper fenced code block with a
language identifier; locate the block showing "chat/evals/" and change its
opening fence accordingly.
- Line 29: Update the README description for the `url_hallucination` check to
reflect that allowed URLs may come from both tool outputs and user input: change
the phrase "only uses URLs from tool output" to "only uses URLs from tool output
or user input" and ensure the `UrlRegexEvaluator` and `LLMJudge` references
remain intact so readers know those components perform regex and semantic checks
respectively (refer to `url_hallucination`, `UrlRegexEvaluator`, and `LLMJudge`
in the same line).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 0e639ce5-50e8-4a14-a9a1-c4f2d02fb4e4
📒 Files selected for processing (17)
CHANGELOG.mdMakefilesrc/backend/chat/evals/README.mdsrc/backend/chat/evals/__init__.pysrc/backend/chat/evals/configs/__init__.pysrc/backend/chat/evals/configs/base.pysrc/backend/chat/evals/configs/self_documentation.pysrc/backend/chat/evals/configs/url_hallucination.pysrc/backend/chat/evals/datasets/self_documentation.yamlsrc/backend/chat/evals/datasets/url_hallucination.yamlsrc/backend/chat/evals/evaluators/__init__.pysrc/backend/chat/evals/evaluators/url_regex.pysrc/backend/chat/management/__init__.pysrc/backend/chat/management/commands/__init__.pysrc/backend/chat/management/commands/run_evals.pysrc/backend/chat/providers/albert_models.pysrc/backend/chat/tests/agents/test_albert_models.py
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (10)
- src/backend/chat/evals/datasets/self_documentation.yaml
- Makefile
- src/backend/chat/evals/init.py
- src/backend/chat/evals/configs/url_hallucination.py
- src/backend/chat/evals/configs/init.py
- src/backend/chat/evals/configs/base.py
- src/backend/chat/providers/albert_models.py
- src/backend/chat/evals/evaluators/init.py
- src/backend/chat/tests/agents/test_albert_models.py
- src/backend/chat/evals/configs/self_documentation.py
Create dataset to evaluate url hallucination and self documentation tool call Run inside docker with Make target Add fix for albert non-stream completion to support evaluating with albert provider
1b29c2f to
4235b6a
Compare
|



Purpose
This PR introduces a behavioral eval system for ConversationAgent. Unlike
These evals test LLM behaviour end-to-end:
A failing eval means a documented behaviour has regressed.
The system is designed to grow: adding a new dataset requires one YAML file, one config file, and a registry entry.
Proposal
Docker; debug mode exposes debugpy on port 5678 for VS Code remote attach
--verbose, --no-llm-judge arguments
evaluators, and optional custom agent class
http(s):// URLs; uses UrlRegexEvaluator (deterministic regex) + LLMJudge
(semantic, skippable with --no-llm-judge for Albert API compatibility)
(gen_ai.tool.name = "self_documentation")
comparison to avoid false positives
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests