187 changes: 138 additions & 49 deletions docs/concepts/evaluation.md
@@ -1,79 +1,168 @@
# Evaluation

An agent that worked yesterday may not work today — the model
changed, a tool changed, the prompt got tweaked. locus ships an
evaluation harness so regressions are tests, not surprises.
changed, a tool was renamed, the prompt got a one-line tweak. locus
ships a small evaluation harness so regressions become **failing
tests**, not customer tickets.

```python
from locus.evaluation import EvalCase, EvalRunner, EvalReport
from locus.evaluation import EvalCase, EvalRunner

cases = [
    EvalCase(
        name="books-real-flight",
        prompt="Book TK-12 for customer C-42.",
        expected={
            "tool_calls": ["book_flight"],
            "tool_args": {"book_flight": {"flight_id": "TK-12"}},
            "final_message": lambda m: "TK-12" in m,
        },
    ),
    EvalCase(
        name="rejects-unknown-flight",
        prompt="Book ZZ-999.",
        expected={
            "tool_calls_lt": 2,
            "final_message": lambda m: "not found" in m.lower(),
        },
        name="weather_lookup",
        prompt="What's the weather in NYC?",
        expected_tools=["get_weather"],
        expected_output_contains=["temperature", "New York"],
        max_iterations=5,
    ),
]

report: EvalReport = EvalRunner(agent_factory=build_agent).run(cases)
print(report.summary()) # pass-rate, p50/p95 latency, token cost
report.save_html("evals/2026-04-27.html")
report = EvalRunner(agent=agent).run(cases)
print(report.summary())
```

## When to reach for an eval suite

| Situation | Run evals? |
|---|---|
| You changed a tool's signature, default args, or system prompt | **yes — every commit that touches it** |
| You're swapping models (gpt-4o → gpt-5, llama-3.3 → llama-4) | **yes — same suite, two providers, diff the report** |
| You're debating "is the agent better than last week?" | **yes — nightly soak with `n=20` per case to see variance** |
| One-shot exploration, scratch agent | no — overhead's not worth it |
| Heavy LLM-as-judge needed (open-ended quality) | the harness covers structural checks; pair it with a custom judge tool for free-text grading |
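
The "same suite, two providers, diff the report" row can be sketched as a comparison of per-case outcomes. This is only an illustration: it assumes each report has already been reduced to a plain `{case_name: passed}` mapping, and `diff_reports` plus the sample data are hypothetical helpers, not part of locus.

```python
def diff_reports(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Compare two {case_name: passed} mappings produced by the same suite."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Passed before, fails now.
        "regressed": sorted(n for n in shared if baseline[n] and not candidate[n]),
        # Failed before, passes now.
        "fixed": sorted(n for n in shared if not baseline[n] and candidate[n]),
        # Cases present in one run but missing from the other.
        "only_in_baseline": sorted(baseline.keys() - candidate.keys()),
        "only_in_candidate": sorted(candidate.keys() - baseline.keys()),
    }


# Hypothetical outcomes for the same suite run against two models.
old_model = {"books_real_flight": True, "rejects_unknown_flight": True}
new_model = {"books_real_flight": True, "rejects_unknown_flight": False}

changes = diff_reports(old_model, new_model)
print(changes["regressed"])  # ['rejects_unknown_flight']
```

Failing the build only on a non-empty `regressed` list keeps a model swap from being blocked by cases that were already red.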

## Getting started

### 1. Define cases

`EvalCase` is a Pydantic model — every field is optional except
`name` and `prompt`. The runner only checks fields you set.

```python
from locus.evaluation import EvalCase

books_real = EvalCase(
    name="books_real_flight",
    prompt="Book TK-12 for customer C-42.",
    expected_tools=["book_flight"],
    expected_output_contains=["TK-12", "booked"],
    max_iterations=4,
)

rejects_unknown = EvalCase(
    name="rejects_unknown_flight",
    prompt="Book ZZ-999.",
    expected_output_contains=["not found"],
    expected_output_not_contains=["booked", "confirmed"],
)
```

### 2. Run them

```python
from locus.evaluation import EvalRunner

runner = EvalRunner(agent=agent)
report = runner.run([books_real, rejects_unknown])

print(report.summary())
# Eval Report: 2/2 passed (avg score: 1.00)
# Total duration: 4321ms
# [PASS] books_real_flight (score: 1.00, 1872ms)
# [PASS] rejects_unknown_flight (score: 1.00, 2449ms)
```

`run()` returns an `EvalReport`: a Pydantic model with per-case
results, aggregate pass/fail counts, average score, and total
duration. It is JSON-serialisable, so it can be stored as a CI artifact.

### 3. Wire it into CI

```python
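# tests/test_agent_evals.py
import pytest
from locus.evaluation import EvalRunner

# `agent` is expected to be a pytest fixture and `load_cases()` a helper
# returning a list of EvalCase objects; both live in your own test suite.
def test_agent_passes_eval_suite(agent):
    report = EvalRunner(agent=agent).run(load_cases())
    # Collect failing results; the summary becomes the assertion message.
    failures = [r for r in report.results if not r.passed]
    assert not failures, report.summary()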
```

## What an `EvalCase` checks
## Built-in checks

- **Tool trace** — which tools fired, in what order, with which args.
- **Final message** — exact match, regex, or a custom predicate.
- **Termination reason** — did the agent stop because the work was done
or because it hit a budget?
- **Latency / token cost** — within a budget per case.
- **Anything custom** — pass an `evaluators=[...]` list of callables.
Every check runs only when the corresponding field is set on the
case. Each check contributes equally to the per-case score.

## Reports
| Field | Passes when |
|---|---|
| `expected_tools` | All listed tools appear in the run's tool executions. |
| `expected_output_contains` | Every string is a case-insensitive substring of the final message. |
| `expected_output_not_contains` | None of the strings appear in the final message. |
| `max_iterations` | The run finished in ≤ N ReAct turns. |
| `max_duration_ms` | Wall-clock duration ≤ N milliseconds. |

`EvalReport` is JSON-serialisable; the HTML view is a static page you
can drop into CI artifacts. Pass-rate per case, latency histogram,
token-cost trend, and a diff against the previous report.
A case **passes** when every check passed; the **score** is the
fraction of checks that passed (handy for partial-credit scoring
across a soak).
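
The pass and score rules above can be sketched in a few lines. This is not the locus implementation, only a minimal illustration of "every check must pass" versus "fraction of checks that passed"; `contains_all` and `score_case` are hypothetical names, and treating a case with no checks as a pass is an assumption.

```python
def contains_all(message: str, needles: list[str]) -> bool:
    """Case-insensitive substring check, as described for expected_output_contains."""
    lowered = message.lower()
    return all(needle.lower() in lowered for needle in needles)


def score_case(checks: dict[str, bool]) -> tuple[bool, float]:
    """Return (passed, score) for one case from its pass/fail map."""
    if not checks:
        # Assumption: a case with no checks set trivially passes.
        return True, 1.0
    score = sum(checks.values()) / len(checks)
    return all(checks.values()), score


final_message = "The temperature in New York is 18°C."
checks = {
    "expected_output_contains": contains_all(final_message, ["temperature", "new york"]),
    # Hypothetical run that took 7 turns against a budget of 5.
    "max_iterations": 7 <= 5,
}

passed, score = score_case(checks)
print(passed, score)  # False 0.5
```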

## Custom evaluators
## Tags and filtering

The `expected` dict on each `EvalCase` accepts callables, so the
simplest way to add a custom check is a lambda or function reference:
```python
EvalCase(name="...", prompt="...", tags=["smoke", "happy-path"])
EvalCase(name="...", prompt="...", tags=["adversarial"])

# Run only smoke cases on every commit; full suite nightly.
smoke = [c for c in all_cases if "smoke" in c.tags]
runner.run(smoke)
```

`tags` is just a list — slice it however your CI matrix expects.
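
One way to let a CI matrix choose the slice is to read the wanted tags from the environment. The sketch below uses a stand-in `Case` dataclass instead of the real `EvalCase` so it runs on its own; the `EVAL_TAGS` variable and `select_cases` helper are hypothetical.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Case:
    """Stand-in for EvalCase with only the fields this sketch needs."""
    name: str
    tags: list[str] = field(default_factory=list)


def select_cases(cases: list[Case], wanted: set[str]) -> list[Case]:
    """Keep cases carrying at least one wanted tag; an empty filter keeps all."""
    if not wanted:
        return list(cases)
    return [c for c in cases if wanted & set(c.tags)]


all_cases = [
    Case("books_real_flight", ["smoke", "happy-path"]),
    Case("prompt_injection", ["adversarial"]),
]

# EVAL_TAGS is a hypothetical variable, e.g. EVAL_TAGS=smoke on every commit
# and unset (run everything) for the nightly job.
wanted = {t for t in os.environ.get("EVAL_TAGS", "").split(",") if t}
selected = select_cases(all_cases, wanted)
```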

## LLM-as-judge for open-ended quality

The built-in checks are structural ("did the right tool fire?", "did
the answer mention 'temperature'?"). For free-text quality
("is this answer empathetic?", "is the explanation correct?"), wrap a
judge model as a tool and key on its verdict:

```python
def cited(message: str) -> bool:
    """Pass if every expected citation appears in the final message."""
    return all(c in message for c in ["[1]", "[2]", "[3]"])
from locus.tools.decorator import tool

@tool
def judge(answer: str) -> dict:
    """LLM-graded quality verdict (0.0–1.0 + reasoning)."""
    return judge_model.run_sync(f"Grade this answer: {answer}").message

# Then in the case:
EvalCase(
    name="research-with-citations",
    prompt="Summarise the Q3 results with citations.",
    expected={"final_message": cited},
    name="empathetic_response",
    prompt="My order is late and I'm upset.",
    expected_tools=["judge"],
    expected_output_contains=["sorry"],  # at minimum
)
```

## When to run
A future locus release may bundle a typed judge directly into
`EvalCase`; until then, use this pattern.

## Common gotchas

- On every commit that touches an agent's prompt, tools, or model.
- Before swapping a model.
- As a nightly soak with `n=20` per case to see variance.
| Symptom | Likely cause |
|---|---|
| Case passes locally, fails in CI | Non-deterministic model. Pin the model id, lower `temperature`, run with `n=5` and look at variance. |
| `max_duration_ms` flakes | Cold-start network latency. Use a wall-clock budget at the suite level, not per-case, or bump the per-case budget by 2×. |
| `expected_tools` reports failure even though the tool ran | Case-sensitive name match — `book_flight` != `Book_Flight`. |
| Score is 0.5 every time | One of two checks is consistently failing. Read `result.checks` — it carries the full pass/fail map. |
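
The "run with `n=5` and look at variance" advice can be sketched as a per-case summary over repeated runs. This assumes you have collected one pass/fail outcome per run for each case; `summarise_runs` and the sample outcomes are hypothetical, not part of locus.

```python
import statistics


def summarise_runs(runs: dict[str, list[bool]]) -> dict[str, dict[str, float]]:
    """Per-case pass rate and population standard deviation over repeated runs."""
    summary = {}
    for name, outcomes in runs.items():
        values = [1.0 if ok else 0.0 for ok in outcomes]
        summary[name] = {
            "pass_rate": statistics.fmean(values),
            "stdev": statistics.pstdev(values),
        }
    return summary


# Hypothetical outcomes from five runs of each case.
runs = {
    "books_real_flight": [True, True, True, True, True],
    "rejects_unknown_flight": [True, False, True, True, False],
}

summary = summarise_runs(runs)
for name, stats in summary.items():
    # A case is flaky when it neither always passes nor always fails.
    flaky = 0.0 < stats["pass_rate"] < 1.0
    print(f"{name}: pass_rate={stats['pass_rate']:.2f} flaky={flaky}")
```

A case that sits strictly between 0 and 1 is a candidate for a pinned model id or a lower `temperature` rather than a code fix.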

## Tutorial
## Source and tutorial

[`tutorial_26_evaluation.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_26_evaluation.py).
- [`tutorial_26_evaluation.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_26_evaluation.py) — runnable end-to-end suite.
- [`locus.evaluation.framework`](https://github.com/oracle-samples/locus/blob/main/src/locus/evaluation/framework.py) — `EvalCase`, `EvalRunner`, `EvalReport`.

## Source
## See also

`src/locus/evaluation/`.
- [Reasoning](reasoning.md) — `reflexion=True` and `grounding=True` reduce the kind of failures you'd otherwise catch only in evals.
- [Termination](termination.md) — `max_iterations` on `EvalCase` mirrors `MaxIterations` on the agent.
- [Hooks](hooks.md) — record per-eval traces with a `TelemetryHook` for offline review.