diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md index 9aa3289..a4ae4df 100644 --- a/docs/concepts/evaluation.md +++ b/docs/concepts/evaluation.md @@ -1,79 +1,168 @@ # Evaluation An agent that worked yesterday may not work today — the model -changed, a tool changed, the prompt got tweaked. locus ships an -evaluation harness so regressions are tests, not surprises. +changed, a tool was renamed, the prompt got a one-line tweak. locus +ships a small evaluation harness so regressions become **failing +tests**, not customer tickets. ```python -from locus.evaluation import EvalCase, EvalRunner, EvalReport +from locus.evaluation import EvalCase, EvalRunner cases = [ EvalCase( - name="books-real-flight", - prompt="Book TK-12 for customer C-42.", - expected={ - "tool_calls": ["book_flight"], - "tool_args": {"book_flight": {"flight_id": "TK-12"}}, - "final_message": lambda m: "TK-12" in m, - }, - ), - EvalCase( - name="rejects-unknown-flight", - prompt="Book ZZ-999.", - expected={ - "tool_calls_lt": 2, - "final_message": lambda m: "not found" in m.lower(), - }, + name="weather_lookup", + prompt="What's the weather in NYC?", + expected_tools=["get_weather"], + expected_output_contains=["temperature", "New York"], + max_iterations=5, ), ] -report: EvalReport = EvalRunner(agent_factory=build_agent).run(cases) -print(report.summary()) # pass-rate, p50/p95 latency, token cost -report.save_html("evals/2026-04-27.html") +report = EvalRunner(agent=agent).run(cases) +print(report.summary()) +``` + +## When to reach for an eval suite + +| Situation | Run evals? | +|---|---| +| You changed a tool's signature, default args, or system prompt | **yes — every commit that touches it** | +| You're swapping models (gpt-4o → gpt-5, llama-3.3 → llama-4) | **yes — same suite, two providers, diff the report** | +| You're debating "is the agent better than last week?" 
| **yes — nightly soak with `n=20` per case to see variance** | +| One-shot exploration, scratch agent | no — overhead's not worth it | +| Heavy LLM-as-judge needed (open-ended quality) | the harness covers structural checks; pair it with a custom judge tool for free-text grading | + +## Getting started + +### 1. Define cases + +`EvalCase` is a Pydantic model — every field is optional except +`name` and `prompt`. The runner only checks fields you set. + +```python +from locus.evaluation import EvalCase + +books_real = EvalCase( + name="books_real_flight", + prompt="Book TK-12 for customer C-42.", + expected_tools=["book_flight"], + expected_output_contains=["TK-12", "booked"], + max_iterations=4, +) + +rejects_unknown = EvalCase( + name="rejects_unknown_flight", + prompt="Book ZZ-999.", + expected_output_contains=["not found"], + expected_output_not_contains=["booked", "confirmed"], +) +``` + +### 2. Run them + +```python +from locus.evaluation import EvalRunner + +runner = EvalRunner(agent=agent) +report = runner.run([books_real, rejects_unknown]) + +print(report.summary()) +# Eval Report: 2/2 passed (avg score: 1.00) +# Total duration: 4321ms +# [PASS] books_real_flight (score: 1.00, 1872ms) +# [PASS] rejects_unknown_flight (score: 1.00, 2449ms) +``` + +`run()` returns an `EvalReport` — a Pydantic model with per-case +results, aggregate pass/fail counts, average score, and total +duration. JSON-serialisable, drop into CI artifacts. + +### 3. Wire it into CI + +```python +# tests/test_agent_evals.py +import pytest +from locus.evaluation import EvalRunner + +def test_agent_passes_eval_suite(agent): + report = EvalRunner(agent=agent).run(load_cases()) + failures = [r for r in report.results if not r.passed] + assert not failures, report.summary() ``` -## What an `EvalCase` checks +## Built-in checks -- **Tool trace** — which tools fired, in what order, with which args. -- **Final message** — exact match, regex, or a custom predicate. 
-- **Termination reason** — did the agent stop because the work was done - or because it hit a budget? -- **Latency / token cost** — within a budget per case. -- **Anything custom** — pass an `evaluators=[...]` list of callables. +Every check runs only when the corresponding field is set on the +case. Each check contributes equally to the per-case score. -## Reports +| Field | Passes when | +|---|---| +| `expected_tools` | All listed tools appear in the run's tool executions. | +| `expected_output_contains` | Every string is a case-insensitive substring of the final message. | +| `expected_output_not_contains` | None of the strings appear in the final message. | +| `max_iterations` | The run finished in ≤ N ReAct turns. | +| `max_duration_ms` | Wall-clock duration ≤ N milliseconds. | -`EvalReport` is JSON-serialisable; the HTML view is a static page you -can drop into CI artifacts. Pass-rate per case, latency histogram, -token-cost trend, and a diff against the previous report. +A case **passes** when every check passed; the **score** is the +fraction of checks that passed (handy for partial-credit scoring +across a soak). -## Custom evaluators +## Tags and filtering -The `expected` dict on each `EvalCase` accepts callables, so the -simplest way to add a custom check is a lambda or function reference: +```python +EvalCase(name="...", prompt="...", tags=["smoke", "happy-path"]) +EvalCase(name="...", prompt="...", tags=["adversarial"]) + +# Run only smoke cases on every commit; full suite nightly. +smoke = [c for c in all_cases if "smoke" in c.tags] +runner.run(smoke) +``` + +`tags` is just a list — slice it however your CI matrix expects. + +## LLM-as-judge for open-ended quality + +The built-in checks are structural ("did the right tool fire?", "did +the answer mention 'temperature'?"). 
For free-text quality +("is this answer empathetic?", "is the explanation correct?"), wrap a +judge model as a tool and key on its verdict: ```python -def cited(message: str) -> bool: - """Pass if every expected citation appears in the final message.""" - return all(c in message for c in ["[1]", "[2]", "[3]"]) +from locus.tools.decorator import tool +@tool +def judge(answer: str) -> dict: + """LLM-graded quality verdict (0.0–1.0 + reasoning).""" + return judge_model.run_sync(f"Grade this answer: {answer}").message + +# Then in the case: EvalCase( - name="research-with-citations", - prompt="Summarise the Q3 results with citations.", - expected={"final_message": cited}, + name="empathetic_response", + prompt="My order is late and I'm upset.", + expected_tools=["judge"], + expected_output_contains=["sorry"], # at minimum ) ``` -## When to run +A future locus release may bundle a typed judge directly into +`EvalCase`; for today, this pattern is the path. + +## Common gotchas -- On every commit that touches an agent's prompt, tools, or model. -- Before swapping a model. -- As a nightly soak with `n=20` per case to see variance. +| Symptom | Likely cause | +|---|---| +| Case passes locally, fails in CI | Non-deterministic model. Pin the model id, lower `temperature`, run with `n=5` and look at variance. | +| `max_duration_ms` flakes | Cold-start network latency. Use a wall-clock budget at the suite level, not per-case, or bump the per-case budget by 2×. | +| `expected_tools` reports failure even though the tool ran | Case-sensitive name match — `book_flight` != `Book_Flight`. | +| Score is 0.5 every time | One of two checks is consistently failing. Read `result.checks` — it carries the full pass/fail map. | -## Tutorial +## Source and tutorial -[`tutorial_26_evaluation.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_26_evaluation.py). 
+- [`tutorial_26_evaluation.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_26_evaluation.py) — runnable end-to-end suite. +- [`locus.evaluation.framework`](https://github.com/oracle-samples/locus/blob/main/src/locus/evaluation/framework.py) — `EvalCase`, `EvalRunner`, `EvalReport`. -## Source +## See also -`src/locus/evaluation/`. +- [Reasoning](reasoning.md) — `reflexion=True` and `grounding=True` reduce the kind of failures you'd otherwise catch only in evals. +- [Termination](termination.md) — `max_iterations` on `EvalCase` mirrors `MaxIterations` on the agent. +- [Hooks](hooks.md) — record per-eval traces with a `TelemetryHook` for offline review. diff --git a/docs/concepts/events.md b/docs/concepts/events.md index bd74559..8ab3cc8 100644 --- a/docs/concepts/events.md +++ b/docs/concepts/events.md @@ -1,63 +1,210 @@ -# Events & streaming +# Events -Every observable step of a run is a typed Pydantic event, not a -callback. `agent.run(...)` is an `AsyncIterator[LocusEvent]`. +Every observable step of an agent run is a typed Pydantic event. Not +a dict, not a callback, not a string — a frozen class with named +fields you can `match` on. + +This is the reference page. For the *how* (consuming the stream, +SSE, hooks), see [Streaming](streaming.md). For the *why* (frozen, +typed, write-protected), see [Agent loop](agent-loop.md). 
```python -from locus import Agent from locus.core.events import ( ThinkEvent, ToolStartEvent, ToolCompleteEvent, TerminateEvent, ) async for event in agent.run("Plan a trip"): match event: - case ThinkEvent(thought=t): - print("thinking:", t) + case ThinkEvent(reasoning=r) if r: + print("💭", r) case ToolStartEvent(tool_name=n, arguments=a): - print(f"calling {n}({a})") + print(f"🔧 {n}({a})") case ToolCompleteEvent(tool_name=n, result=r, error=e): - print(f"done {n}: {e or r}") + print(f" ↳ {e or r}") case TerminateEvent(reason=r, final_message=m): print(f"[{r}] {m}") ``` -## Event types +## Common fields + +Every event inherits from `LocusEvent` and carries: + +| Field | Type | Meaning | +|---|---|---| +| `event_type` | `Literal[...]` | Discriminator string — `"think"`, `"tool_start"`, etc. | +| `timestamp` | `datetime` | UTC, populated at emit time. | + +Events are **frozen** Pydantic models. A hook can read every field; +it cannot mutate one. To steer a run, use the explicit method on the +event (`event.cancel()`, `event.replace_arguments(...)`) — the intent +is visible in code review. -| Event | When | +## Core events + +### `ThinkEvent` + +The model emitted reasoning, optionally with tool calls. + +| Field | Meaning | |---|---| -| `ThinkEvent` | Model produced reasoning (+ optional tool calls) | -| `ToolStartEvent` | About to invoke a tool | -| `ToolCompleteEvent` | Tool returned (or errored) | -| `ReflectEvent` | Reflexion cycle finished with new confidence | -| `GroundingEvent` | Grounding verified / disputed a claim | -| `ModelChunkEvent` | Streaming token from the LLM provider | -| `InterruptEvent` | A hook requested human-in-the-loop | -| `TerminateEvent` | Run ended (with `reason` and `final_message`) | +| `iteration` | ReAct turn index (0-based) | +| `reasoning` | The model's chain-of-thought, if the provider exposed it | +| `tool_calls` | Tool calls the model decided to make this turn | -## SSE +Render this as a "thinking…" bubble. 
Most providers return `None` +unless extended thinking is enabled (Claude 4 / o-series). -For HTTP deployments, the FastAPI wrapper emits the event stream as -Server-Sent Events. Each event becomes one SSE frame with its JSON -payload. +### `ToolStartEvent` -## Termination conditions +The agent is about to invoke a tool. -Termination is also typed and composable. `|` is OR, `&` is AND: +| Field | Meaning | +|---|---| +| `tool_name` | Tool registered with `@tool` | +| `tool_call_id` | Provider-issued id, used to correlate with the matching `ToolCompleteEvent` | +| `arguments` | The validated arguments dict | -```python -from locus.core.termination import ( - MaxIterations, TokenLimit, TextMention, TimeLimit, ToolCalled, -) +Show a "calling X" indicator. -# Stop after 10 iterations OR when the model says "DONE". -condition = MaxIterations(10) | TextMention("DONE") +### `ToolCompleteEvent` -# Stop when BOTH: the confidence is high AND a specific tool was called. -condition = ConfidenceMet(0.9) & ToolCalled("send_summary") +A tool returned, errored, or was cancelled. -agent = Agent(..., termination=condition) -``` +| Field | Meaning | +|---|---| +| `tool_name` | Same name as the matching start event | +| `tool_call_id` | Pairs with `ToolStartEvent.tool_call_id` | +| `result` | The serialised return value, or `None` on error | +| `error` | Exception message, or `None` on success | +| `duration_ms` | How long the body actually ran | + +Always check `error` first — a non-`None` `error` means `result` is +`None`. + +### `ModelChunkEvent` + +One streamed chunk from the LLM provider — the granularity that +drives token-by-token rendering. + +| Field | Meaning | +|---|---| +| `content` | Text delta (may be `None` for tool-call-only chunks) | +| `tool_calls` | Tool-call deltas, if the provider streams those | +| `done` | `True` on the final chunk of a turn | + +`None`-guard before printing: `if e.content: print(e.content, end="")`. 
+ +### `ModelCompleteEvent` + +A full model response was received (paired with the chunks above). + +| Field | Meaning | +|---|---| +| `content` | The complete text | +| `tool_calls` | All tool calls in this turn | +| `usage` | `{"input_tokens": ..., "output_tokens": ...}` | +| `stop_reason` | Provider-specific stop reason | + +Telemetry hooks key off `usage` for cost tracking. + +### `ReflectEvent` + +[Reflexion](reasoning.md#reflexion) emitted a self-evaluation. + +| Field | Meaning | +|---|---| +| `iteration` | Which turn this reflection concerns | +| `assessment` | `"on_track"`, `"stuck"`, `"new_findings"`, or `"loop_detected"` | +| `confidence_delta` | Change vs the previous turn | +| `new_confidence` | Current value, 0.0–1.0 | +| `guidance` | Free-text steering for the next turn | + +Pair `new_confidence` with [`ConfidenceMet`](termination.md) for early +stopping. + +### `GroundingEvent` + +[Grounding](reasoning.md#grounding) finished evaluating claims. + +| Field | Meaning | +|---|---| +| `score` | 0.0–1.0, fraction of claims supported | +| `claims_evaluated` | How many claims the judge looked at | +| `ungrounded_claims` | The text of every unsupported claim | +| `requires_replan` | `True` if the run should re-research | + +### `InterruptEvent` + +A tool requested human-in-the-loop input. The run pauses; resume by +calling the agent with the user's reply. + +| Field | Meaning | +|---|---| +| `question` | What to ask the human | +| `options` | If multiple-choice, the allowed answers | +| `interrupt_id` | Pass back to resume | +| `metadata` | Free-form context for the UI | + +See [Interrupts](interrupts.md). + +### `TerminateEvent` + +The run finished. 
+ +| Field | Meaning | +|---|---| +| `reason` | Which termination condition fired (its `repr`) | +| `iterations_used` | How many ReAct turns ran | +| `final_confidence` | Reflexion confidence at end of run | +| `total_tool_calls` | Distinct tool invocations | +| `final_message` | The assistant's last text, if any | + +Always emitted exactly once per run. + +## Multi-agent events + +These appear when an `Orchestrator`, `Swarm`, or `StateGraph` is +running. + +| Event | Fired when | +|---|---| +| `SpecialistStartEvent` | Orchestrator dispatched to a specialist | +| `SpecialistCompleteEvent` | Specialist returned a result | +| `OrchestratorDecisionEvent` | Orchestrator picked its next step (`invoke_specialist`, `correlate`, `summarize`, `finalize`) | + +See [Multi-agent](multi-agent.md). + +## Causal-reasoning events + +When `causal=True`, the agent emits node and edge events as the graph +grows. + +| Event | Fired when | +|---|---| +| `CausalNodeEvent` | A new entity entered the cause-effect graph (root cause / symptom / intermediate) | +| `CausalEdgeEvent` | A causal link was added between two nodes | + +## Hook events + +`BeforeInvocationEvent`, `AfterInvocationEvent`, `BeforeToolCallEvent`, +`AfterToolCallEvent` — emitted *to hooks* around the same lifecycle +points the user-visible events come from. See [Hooks](hooks.md). + +## Common gotchas + +| Symptom | Likely cause | +|---|---| +| `match` is non-exhaustive at the type checker | Add a `case _: pass` fallthrough or handle the missing variant. | +| `ModelChunkEvent.content` is `None` | Tool-call-only chunk. Guard with `if event.content:`. | +| `TerminateEvent` never arrives | Generator was cancelled mid-stream. Check the consumer for exceptions. | +| Hook tried to mutate `event.tool_name` and got `ValidationError` | Frozen by design — use `event.replace_arguments(...)` or `event.cancel()` instead. 
| + +## Source + +- [`locus.core.events`](https://github.com/oracle-samples/locus/blob/main/src/locus/core/events.py) — every event class. + +## See also -Built-in conditions: `MaxIterations`, `TokenLimit`, `TextMention`, -`TimeLimit`, `ToolCalled`, `ConfidenceMet`, `NoToolCalls`, -`CustomCondition`. +- [Streaming](streaming.md) — how to consume the event stream. +- [Hooks](hooks.md) — observe the same events from inside the loop. +- [Agent server](server.md) — re-emit events over Server-Sent Events. diff --git a/docs/concepts/hooks.md b/docs/concepts/hooks.md index 32fc117..06d7090 100644 --- a/docs/concepts/hooks.md +++ b/docs/concepts/hooks.md @@ -61,7 +61,7 @@ no-op defaults from the base class. ```python agent = Agent( - model="oci:openai.gpt-5.5", + model="oci:openai.gpt-5", tools=[search, book_flight], hooks=[AuditHook()], ) @@ -86,7 +86,7 @@ from locus.hooks.builtin import ( ) agent = Agent( - model="oci:openai.gpt-5.5", + model="oci:openai.gpt-5", tools=[...], hooks=[ StructuredLoggingHook(), # JSON logs at every phase @@ -132,7 +132,7 @@ call is higher than the cost of a second model round-trip. ```python agent = Agent( ..., - hooks=[SteeringHook(approver="oci:openai.gpt-5.5")], + hooks=[SteeringHook(approver="oci:openai.gpt-5")], ) ``` diff --git a/docs/concepts/mcp.md b/docs/concepts/mcp.md index f0fec86..5b9f0d1 100644 --- a/docs/concepts/mcp.md +++ b/docs/concepts/mcp.md @@ -50,7 +50,7 @@ stdin/stdout, and discovers what tools the server exposes. 
from locus import Agent agent = Agent( - model="oci:openai.gpt-5.5", + model="oci:openai.gpt-5", tools=[*fs.tools()], # MCP tools become locus tools system_prompt="You can read files in /data.", ) @@ -136,7 +136,7 @@ analytics = LocusMCPServer( # producer side analytics.run_http(port=7400, in_background=True) agent_a = Agent( - model="oci:openai.gpt-5.5", + model="oci:openai.gpt-5", tools=[*fs.tools(), summarise_csv, plot_histogram], ) ``` diff --git a/docs/concepts/reasoning.md b/docs/concepts/reasoning.md index 93fd531..270be24 100644 --- a/docs/concepts/reasoning.md +++ b/docs/concepts/reasoning.md @@ -1,59 +1,141 @@ # Reasoning -A model that loops without thinking is a model that pays you to be -wrong faster. locus ships three reasoning add-ons that are each a -single argument on `Agent(...)`. +A model that loops without thinking just pays you to be wrong faster. +locus ships three reasoning add-ons that catch wrong premises *before* +the next tool call, not in the post-mortem: + +- **Reflexion** — after each turn, the agent self-evaluates and + re-plans if the last step was wrong. +- **Grounding** — every factual claim is checked against tool results + by an LLM-as-judge before the answer goes out. +- **Causal reasoning** — a running cause-effect graph that surfaces + contradictions linear chat history hides. + +Each is a single argument on `Agent(...)`. You can combine them. 
+ +## When to pick which + +| Situation | Add-on | +|---|---| +| Agent loops endlessly or stacks tool calls on a wrong premise | `reflexion=True` | +| Customer-facing answers where hallucinated facts cost money (drug names, prices, account numbers) | `grounding=True` | +| Multi-step diagnosis or root-cause analysis where one bad assumption poisons the chain | `causal=True` | +| All three apply — production research agent, compliance-sensitive answer | turn them all on | +| Quick prototype, low-stakes Q&A | leave them off — extra model calls are wasted | + +The cost is more model round-trips. The win is fewer wrong answers. +For short tasks the math doesn't pencil out. For runs of 5+ tool calls +or anything that ships to a customer, it almost always does. + +## Getting started + +### Reflexion + +Self-evaluate per turn. ```python +from locus import Agent + agent = Agent( model="oci:openai.gpt-5", - tools=[search, summarise, validate_claim], - reflexion=True, # self-evaluate per turn - grounding=True, # LLM-as-judge claim verification - causal=True, # cause-effect chain analysis + tools=[search, summarise], + reflexion=True, ) + +result = agent.run_sync("Find Q3 revenue and explain the YoY change.") +print(result.metrics.reflexion_iterations) ``` -## Reflexion +After each tool result, the agent is asked: *given this, was the last +step right?* If the answer is "no", the next turn rewrites the plan +instead of stacking another tool call on top. Streamed as +`ReflectEvent` — render it in your UI and the user can literally watch +the agent change its mind. -After each tool result, the agent is asked: *"given this, was your -last step right?"* If the answer is "no", the next turn rewrites the -plan instead of stacking another tool call on top of a wrong premise. +### Grounding -Source: [Shinn et al., 2023](https://arxiv.org/abs/2303.11366) plus a -locus-native execution loop. Implementation in -`src/locus/reasoning/reflexion.py`. Streamed as `ReflectEvent`. 
+Verify claims before answering. + +```python +agent = Agent( + model="oci:openai.gpt-5", + tools=[search_pricing, lookup_inventory], + grounding=True, +) -## Grounding +result = agent.run_sync("What's the cheapest GPU instance with 80GB?") +for claim in result.grounding_report.unsupported: + print(f"DROPPED: {claim.text}") +``` Before the agent finalises an answer, every factual claim is checked against the conversation's tool results. A second model — the judge — -reads each claim and the supporting tool output and emits "supported / -unsupported / partially supported". Unsupported claims are removed or -sent back for re-research. +reads each claim and the supporting tool output and emits *supported / +unsupported / partially supported*. Unsupported claims are dropped or +sent back for re-research. Streamed as `GroundingEvent`. -Source: `src/locus/reasoning/grounding.py`. +### Causal + +Track cause-effect chains. + +```python +agent = Agent( + model="oci:openai.gpt-5", + tools=[fetch_logs, query_metrics, traceback], + causal=True, +) + +result = agent.run_sync("Why is checkout p99 latency up 4x since 14:00?") +print(result.causal_chain.root_causes) +``` + +The agent maintains a running cause-effect graph — *X happened +because Y; Y because Z* — and validates new conclusions against it. +Cycles, contradictions, and unsupported jumps surface as the chain +grows. Particularly useful for incident triage where the linear chat +log doesn't show that turn 3's "fix" contradicts turn 1's "root +cause". + +## Combining them + +```python +agent = Agent( + model="oci:openai.gpt-5", + tools=[...], + reflexion=True, + grounding=True, + causal=True, +) +``` -## Causal +The order is fixed: reflect first (was the last step right?), build +the causal graph as you go, ground only at the end (don't waste +judge tokens on intermediate claims). All three are observable as +their own event types. 
-The agent maintains a running cause-effect chain — *"did X because Y; -Y because Z"* — and checks new conclusions against it. Surfaces -contradictions that the linear chat history hides. +## Common gotchas -Source: `src/locus/reasoning/causal.py`. +| Symptom | Likely cause | +|---|---| +| Reflexion loops forever | The model can't agree with itself. Cap with `MaxIterations` in your termination condition. | +| Grounding flags everything as unsupported | The judge model is stricter than the answerer. Use the same model for both, or lower the threshold. | +| Causal graph has many disconnected nodes | The model isn't naming entities consistently across turns. Sharpen the system prompt to name entities the same way each time. | +| Reasoning add-ons feel slow | They're extra model calls — that's the trade. Keep them for runs that ship to a human, drop them for hot paths. | -## When to use +## Source and tutorial -- **Reflexion** — agents that loop, especially research and - long-running planning. -- **Grounding** — anything customer-facing where hallucinated facts - are bad. Drug names. Account numbers. Prices. -- **Causal** — multi-step explanations where a wrong root assumption - silently poisons everything downstream. +- [`tutorial_14_reasoning_patterns.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_14_reasoning_patterns.py) — all three add-ons end-to-end. +- [`locus.reasoning.reflexion`](https://github.com/oracle-samples/locus/blob/main/src/locus/reasoning/reflexion.py) +- [`locus.reasoning.grounding`](https://github.com/oracle-samples/locus/blob/main/src/locus/reasoning/grounding.py) +- [`locus.reasoning.causal`](https://github.com/oracle-samples/locus/blob/main/src/locus/reasoning/causal.py) +- [`ReflectNode`](https://github.com/oracle-samples/locus/blob/main/src/locus/loop/nodes.py) in the ReAct loop — where reflection plugs in. -You can combine all three. The cost is more model calls; the win is -fewer wrong answers. 
+Reflexion: [Shinn et al., 2023](https://arxiv.org/abs/2303.11366). +Grounding-Stratified Adaptive Replanning: see [GSAR](gsar.md) for the +typed-evidence variant locus also ships. -## Tutorial +## See also -[`tutorial_14_reasoning_patterns.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_14_reasoning_patterns.py). +- [GSAR](gsar.md) — typed-grounding layer with weighted scoring and tiered replanning. +- [Events](events.md) — `ReflectEvent`, `GroundingEvent`, causal node/edge events. +- [Termination](termination.md) — combine `ConfidenceMet` with reflexion to early-stop on high-confidence answers. diff --git a/docs/concepts/state.md b/docs/concepts/state.md index d2828db..08821f4 100644 --- a/docs/concepts/state.md +++ b/docs/concepts/state.md @@ -1,9 +1,14 @@ # State -`AgentState` is the single typed record of everything a run knows. It -is an immutable Pydantic model — every mutation returns a new instance -— which means the state round-trips through JSON cleanly, survives -checkpointing, and can be compared across turns. +`AgentState` is the single typed record of everything a run knows. +It's an **immutable Pydantic model** — every mutation returns a new +instance, every collection is a `tuple` or `frozenset`, and the whole +thing round-trips through JSON without loss. + +That immutability is load-bearing: it's why checkpoints are +deterministic, why two parallel branches in a graph can each "modify" +the state without stepping on each other, and why a hook reading +`state.tool_executions` can't accidentally corrupt the run. ```python from locus.core.state import AgentState @@ -12,50 +17,117 @@ from locus.core.messages import Message, Role state = AgentState(agent_id="my-agent", max_iterations=20) state = state.with_message(Message(role=Role.USER, content="hi")) state = state.with_confidence(0.85) + +# The original is untouched. 
+assert state.confidence == 0.85 ``` +## When you'll touch state directly + +Most of the time you don't — `Agent.run(...)` builds and threads it +for you. Reach for it when: + +| Situation | What to do | +|---|---| +| You're writing a custom hook and want to inspect the conversation so far | Read `state.messages`, `state.tool_executions`, `state.confidence` | +| You're persisting a run and rehydrating later | `state.to_checkpoint()` / `AgentState.from_checkpoint(d)` — every checkpointer does this internally | +| You're writing a custom termination predicate | `CustomCondition(lambda s: ...)` — `s` is `AgentState` | +| You're building a multi-agent graph | Reducers compose new `AgentState` from parallel branches (see below) | +| You want to seed a run from a previous transcript | Construct `AgentState(messages=(...))` and pass to `agent.run(...)` | + ## Fields | Field | Type | Meaning | |---|---|---| -| `agent_id` | `str` | Identifier carried across turns. | | `run_id` | `str` (UUID) | Unique to this run. | -| `messages` | `list[Message]` | Full conversation, in order. | -| `tool_executions` | `list[ToolExecution]` | Every tool call with its arguments, result, and duration. | -| `reasoning_steps` | `list[ReasoningStep]` | Think / Execute / Reflect steps. | +| `agent_id` | `str \| None` | Stable identifier carried across runs of the same agent. | +| `messages` | `tuple[Message, ...]` | Full conversation, in order. | | `iteration` | `int` | Current ReAct iteration index. | | `max_iterations` | `int` | Upper bound before termination. | -| `confidence` | `float` | Reflexion signal 0.0–1.0. | -| `confidence_threshold` | `float` | Early-stop threshold. | -| `terminal_tools` | `frozenset[str]` | Tool names that end the run. | -| `token_budget` | `int \| None` | Optional token cap. | -| `total_tokens_used` | `int` | Running total. | -| `errors` | `list[str]` | Any tool/model errors. | -| `metadata` | `dict[str, Any]` | User-supplied context. 
| +| `tool_executions` | `tuple[ToolExecution, ...]` | Every tool call: name, args, result/error, duration, idempotent-cache hit flag. | +| `reasoning_steps` | `tuple[ReasoningStep, ...]` | Per-iteration think → execute → reflect record. | +| `confidence` | `float` | Reflexion signal, 0.0–1.0. | +| `confidence_threshold` | `float` | Threshold used by `ConfidenceMet`. | +| `confidence_history` | `tuple[float, ...]` | Confidence at each iteration — useful for plotting. | +| `tool_history` | `tuple[str, ...]` | Just the tool names, in order. Powers loop detection. | +| `tool_loop_threshold` | `int` | How many identical consecutive calls qualify as a loop. | +| `terminal_tools` | `frozenset[str]` | Tool names that auto-end the run when called. | +| `total_tokens_used` | `int` | Running total. `prompt_tokens_used` + `completion_tokens_used`. | +| `token_budget` | `int \| None` | Optional cap; `TokenLimit` reads this. | +| `errors` | `tuple[str, ...]` | Tool/model error messages encountered this run. | +| `metadata` | `dict[str, Any]` | Free-form context you can attach. | +| `started_at`, `updated_at` | `datetime` | UTC timestamps. | + +## Builder methods + +Every "mutation" returns a new state. Helpers exist for the common +cases — you rarely need to construct an `AgentState` from scratch: + +```python +state = ( + state + .with_message(Message(role=Role.ASSISTANT, content="...")) + .with_tool_execution(execution) + .with_iteration(state.iteration + 1) + .with_confidence(0.78) + .with_error("rate-limited") + .with_metadata("user_tz", "America/New_York") + .with_token_usage(prompt_tokens=312, completion_tokens=87) +) +``` + +The full set: `with_message`, `with_messages`, `with_iteration`, +`with_tool_execution`, `with_reasoning_step`, `with_confidence`, +`with_error`, `with_metadata`, `with_token_usage`. 
## Round-trip through JSON ```python -data = state.to_checkpoint() # → dict[str, Any] +data: dict = state.to_checkpoint() # plain dict, JSON-safe restored = AgentState.from_checkpoint(data) assert restored == state ``` -Every checkpointer uses this pair under the hood. If you build a custom -checkpointer, all you have to do is serialize `to_checkpoint()` and -rehydrate with `from_checkpoint()`. +Every checkpointer in `locus.memory.backends` uses this pair under the +hood. If you're writing a custom backend, all you need to do is +serialize whatever `to_checkpoint()` returns and rehydrate with +`from_checkpoint()` on resume. + +## Reducers (for graphs only) + +When two branches of a [StateGraph](multi-agent/graph.md) modify the +state in parallel, locus needs to know how to merge them. That's +what reducers do: + +| Reducer | Combines two values by… | +|---|---| +| `add_messages` | extending the message tuple | +| `merge_dict` / `deep_merge_dict` | shallow / recursive dict merge | +| `append_list` / `unique_append_list` | concatenating, optionally deduping | +| `add_numbers`, `max_value`, `min_value` | arithmetic / extremum | +| `first_value`, `last_value` | take one branch's value | +| `set_union` | union the two sets | + +Reducers are **opt-in at the graph level** — a plain `agent.run(...)` +doesn't use them. See `locus.core.reducers` for the source. + +## Common gotchas + +| Symptom | Likely cause | +|---|---| +| `state.messages.append(...)` raises | Tuples are immutable. Use `state.with_message(m)`. | +| `to_checkpoint()` round-trip drops a field | The field's value isn't JSON-serialisable (e.g., a custom class in `metadata`). Stash a serialisable form, or extend the checkpointer. | +| Two branches in a graph clobber each other's messages | You forgot to declare the reducer for `messages`. Use `add_messages`. 
| +| `confidence_history` has fewer entries than iterations | Reflexion isn't running (`reflexion=True` not set), or the run terminated before the first reflect step. | -## Reducers +## Source -When running multi-agent graphs, you sometimes want two parallel -branches to each modify the state, then merge the result. Locus ships -with reducers for that: +- [`locus.core.state`](https://github.com/oracle-samples/locus/blob/main/src/locus/core/state.py) — `AgentState`, `ToolExecution`, `ReasoningStep`. +- [`locus.core.reducers`](https://github.com/oracle-samples/locus/blob/main/src/locus/core/reducers.py) — graph-level merge helpers. -- `add_messages` — extend message list -- `merge_dict` / `deep_merge_dict` -- `append_list` / `unique_append_list` -- `add_numbers`, `max_value`, `min_value`, `first_value`, `last_value` -- `set_union` +## See also -Reducers are opt-in at the graph level — a plain agent run doesn't use -them. See `locus.core.reducers`. +- [Checkpointers](checkpointers.md) — durable persistence of `AgentState`. +- [Events](events.md) — what gets emitted as state changes. +- [Termination](termination.md) — `CustomCondition(fn)` is `(state) -> bool`. +- [Multi-agent: StateGraph](multi-agent/graph.md) — where reducers earn their keep. diff --git a/docs/concepts/termination.md b/docs/concepts/termination.md index 5894057..02364c3 100644 --- a/docs/concepts/termination.md +++ b/docs/concepts/termination.md @@ -1,65 +1,138 @@ -# Termination algebra +# Termination -When does an agent stop? locus answers that with **composable -conditions** — small classes that return `True` when the run is done, -combined with `And` / `Or`. +When does an agent stop? locus answers that with a typed, composable +**algebra of stop conditions** — small classes that each return `True` +when the run should end, combined with `&` (and) and `|` (or). 
```python from locus.core.termination import ( - MaxIterations, TokenLimit, TimeLimit, - NoToolCalls, ToolCalled, ConfidenceMet, - TextMention, CustomCondition, + MaxIterations, ToolCalled, ConfidenceMet, TextMention, ) +termination = ( + (ToolCalled("send_summary") & ConfidenceMet(0.9)) + | TextMention(r"\bDONE\b") + | MaxIterations(10) +) +``` + +Read it left to right: *stop when we sent the summary and we're +confident, **or** the model said "DONE", **or** we hit ten iterations*. + +This is one of locus's signature primitives. Every stop condition is +inspectable, unit-testable, and serialisable — no hand-rolled `if` +ladders sprinkled through the loop. + +## When to pick which condition + +| Situation | Use | +|---|---| +| Hard cap on cost / runaway protection | `MaxIterations`, `TokenLimit`, `TimeLimit` | +| The work is "done" when one specific tool fires | `ToolCalled("submit_order")` | +| The model is confident and Reflexion agrees | `ConfidenceMet(0.85)` (requires `reflexion=True`) | +| The agent is supposed to write text, not call more tools | `NoToolCalls()` | +| The run ends when the model emits a sentinel | `TextMention(r"\bSHIP\b")` | +| Custom predicate over `AgentState` | `CustomCondition(fn)` | + +## Getting started + +### 1. Pick one condition + +```python +from locus import Agent +from locus.core.termination import MaxIterations + agent = Agent( - model=..., - tools=[search, send], - termination=( - # the work happened AND we believe it - (ToolCalled("send") & ConfidenceMet(0.9)) - # … or we hit the safety cap - | MaxIterations(10) - ), + model="oci:openai.gpt-5", + tools=[search, summarise], + termination=MaxIterations(8), +) +``` + +A single condition is a perfectly fine starting point. `MaxIterations` +is the safety net every production agent should have. + +### 2. 
Combine with `&` and `|` + +```python +from locus.core.termination import ( + MaxIterations, ToolCalled, ConfidenceMet, ) + +termination = ( + ToolCalled("send_summary") # the work happened + & ConfidenceMet(0.85) # we believe the result +) | MaxIterations(8) # …or the safety cap +``` + +`&` and `|` are real Python operator overloads (`__and__` / `__or__`) +on `TerminationCondition`, so the result is a typed +`AndCondition` / `OrCondition` you can keep composing, log, or pass +through tests. + +### 3. Inspect what stopped the run + +```python +result = agent.run_sync(prompt) +print(result.termination_reason) +# → "ToolCalled('send_summary') and ConfidenceMet(0.85)" ``` +Each condition has a `__repr__` that round-trips to its constructor, +so logs and traces tell you *exactly* which branch of the algebra +fired. + ## Built-in conditions -| Condition | Trigger | +| Condition | Triggers when | |---|---| -| `MaxIterations(n)` | n ReAct turns reached. | -| `TokenLimit(n)` | Cumulative model tokens exceed n. | +| `MaxIterations(n)` | The ReAct loop has run `n` turns. | +| `TokenLimit(n)` | Cumulative model tokens exceed `n`. | | `TimeLimit(seconds)` | Wall-clock budget exceeded. | -| `NoToolCalls()` | Last turn produced text and no tool calls. | -| `ToolCalled(name)` | A specific tool fired (with optional args predicate). | -| `ConfidenceMet(threshold)` | Reflexion / self-eval clears the bar. | +| `NoToolCalls()` | The most recent turn produced text and zero tool calls. | +| `ToolCalled(name, args=None)` | A specific tool fired (with optional args predicate). | +| `ConfidenceMet(threshold)` | Reflexion confidence ≥ threshold. | | `TextMention(pattern)` | Final message contains a regex match. | -| `CustomCondition(fn)` | Anything you can write as `(state) -> bool`. | +| `CustomCondition(fn)` | `fn(state) -> bool` — anything you can write in Python. | -## Composition +Every condition takes `AgentState` and returns `bool`. 
They run after +each iteration; the first `True` wins. -Compose with the `&` (And) and `|` (Or) operators directly on the -condition objects. The result is a typed `AndCondition` / -`OrCondition` you can keep composing: +## Custom conditions + +Write any predicate over `AgentState`: ```python -termination=( - ToolCalled("submit") - & (ConfidenceMet(0.85) | MaxIterations(5)) -) +from locus.core.termination import CustomCondition + +def revenue_extracted(state) -> bool: + return any( + "revenue_usd" in (e.result or {}) + for e in state.tool_executions + ) + +termination = CustomCondition(revenue_extracted) | MaxIterations(15) ``` -## Why algebra? +Custom conditions compose with built-ins exactly the same way — `&` +and `|` work across the whole hierarchy. -Real agents have multiple stopping criteria — *"finish when X is done -**and** we're confident, **or** time's up"*. Hand-rolling that as `if` -statements gets painful fast. Termination conditions are explicit, -inspectable, and unit-testable as ordinary classes. +## Common gotchas + +| Symptom | Likely cause | +|---|---| +| Agent always stops at `MaxIterations` | The "happy-path" condition never fires — model isn't calling the tool you keyed on, or confidence never reaches the threshold. Lower the threshold or check the tool name. | +| `&` / `\|` precedence surprises | Python's normal precedence applies: `&` binds tighter than `\|`. Add parentheses when in doubt — `(A & B) \| C` reads cleaner anyway. | +| `ConfidenceMet` never trips | `reflexion=True` is required — without it, confidence stays at the default. | +| `ToolCalled("x")` fires before the tool finishes | It checks the *call*, not the *result*. Pair with `ConfidenceMet` or a `CustomCondition` that inspects `tool_executions`. | -## Tutorial +## Source and tutorial -[`tutorial_37_termination.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_37_termination.py). 
+- [`tutorial_37_termination.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_37_termination.py) — runnable algebra examples. +- [`locus.core.termination`](https://github.com/oracle-samples/locus/blob/main/src/locus/core/termination.py) — every condition class, plus `__or__` / `__and__`. -## Source +## See also -`src/locus/core/termination.py`. +- [Reasoning](reasoning.md) — pair `ConfidenceMet` with `reflexion=True`. +- [Events](events.md) — `TerminateEvent.reason` carries the condition's `repr`. +- [Agent loop](agent-loop.md) — where conditions evaluate inside the ReAct cycle. diff --git a/docs/concepts/tools.md b/docs/concepts/tools.md index 0d80735..3271f86 100644 --- a/docs/concepts/tools.md +++ b/docs/concepts/tools.md @@ -40,7 +40,7 @@ mark optional parameters. ### 2. Pass to the agent ```python -agent = Agent(model="oci:openai.gpt-5.5", tools=[search]) +agent = Agent(model="oci:openai.gpt-5", tools=[search]) ``` That's the wiring. The model now sees `search` in its tool list and