diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md
index 9aa3289..a4ae4df 100644
--- a/docs/concepts/evaluation.md
+++ b/docs/concepts/evaluation.md
@@ -1,79 +1,168 @@
 # Evaluation
 
 An agent that worked yesterday may not work today — the model
-changed, a tool changed, the prompt got tweaked. locus ships an
-evaluation harness so regressions are tests, not surprises.
+changed, a tool was renamed, the prompt got a one-line tweak. locus
+ships a small evaluation harness so regressions become **failing
+tests**, not customer tickets.
 
 ```python
-from locus.evaluation import EvalCase, EvalRunner, EvalReport
+from locus.evaluation import EvalCase, EvalRunner
 
 cases = [
     EvalCase(
-        name="books-real-flight",
-        prompt="Book TK-12 for customer C-42.",
-        expected={
-            "tool_calls": ["book_flight"],
-            "tool_args": {"book_flight": {"flight_id": "TK-12"}},
-            "final_message": lambda m: "TK-12" in m,
-        },
-    ),
-    EvalCase(
-        name="rejects-unknown-flight",
-        prompt="Book ZZ-999.",
-        expected={
-            "tool_calls_lt": 2,
-            "final_message": lambda m: "not found" in m.lower(),
-        },
+        name="weather_lookup",
+        prompt="What's the weather in NYC?",
+        expected_tools=["get_weather"],
+        expected_output_contains=["temperature", "New York"],
+        max_iterations=5,
     ),
 ]
 
-report: EvalReport = EvalRunner(agent_factory=build_agent).run(cases)
-print(report.summary())          # pass-rate, p50/p95 latency, token cost
-report.save_html("evals/2026-04-27.html")
+report = EvalRunner(agent=agent).run(cases)
+print(report.summary())
+```
+
+## When to reach for an eval suite
+
+| Situation | Run evals? |
+|---|---|
+| You changed a tool's signature or default args, or the system prompt | **yes, on every commit that touches it** |
+| You're swapping models (gpt-4o → gpt-5, llama-3.3 → llama-4) | **yes — same suite, two providers, diff the report** |
+| You're debating "is the agent better than last week?" | **yes — nightly soak with `n=20` per case to see variance** |
+| One-shot exploration, scratch agent | no — overhead's not worth it |
+| Heavy LLM-as-judge needed (open-ended quality) | the harness covers structural checks; pair it with a custom judge tool for free-text grading |
+
+## Getting started
+
+### 1. Define cases
+
+`EvalCase` is a Pydantic model — every field is optional except
+`name` and `prompt`. The runner only checks fields you set.
+
+```python
+from locus.evaluation import EvalCase
+
+books_real = EvalCase(
+    name="books_real_flight",
+    prompt="Book TK-12 for customer C-42.",
+    expected_tools=["book_flight"],
+    expected_output_contains=["TK-12", "booked"],
+    max_iterations=4,
+)
+
+rejects_unknown = EvalCase(
+    name="rejects_unknown_flight",
+    prompt="Book ZZ-999.",
+    expected_output_contains=["not found"],
+    expected_output_not_contains=["booked", "confirmed"],
+)
+```
+
+### 2. Run them
+
+```python
+from locus.evaluation import EvalRunner
+
+runner = EvalRunner(agent=agent)
+report = runner.run([books_real, rejects_unknown])
+
+print(report.summary())
+# Eval Report: 2/2 passed (avg score: 1.00)
+# Total duration: 4321ms
+#   [PASS] books_real_flight (score: 1.00, 1872ms)
+#   [PASS] rejects_unknown_flight (score: 1.00, 2449ms)
+```
+
+`run()` returns an `EvalReport` — a Pydantic model with per-case
+results, aggregate pass/fail counts, average score, and total
+duration. It is JSON-serialisable, so you can save it as a CI artifact.
+
+### 3. Wire it into CI
+
+```python
+# tests/test_agent_evals.py
+import pytest
+from locus.evaluation import EvalRunner
+
+def test_agent_passes_eval_suite(agent):  # `agent`: a pytest fixture you define
+    report = EvalRunner(agent=agent).run(load_cases())  # `load_cases`: your own helper
+    failures = [r for r in report.results if not r.passed]
+    assert not failures, report.summary()
 ```
 
-## What an `EvalCase` checks
+## Built-in checks
 
-- **Tool trace** — which tools fired, in what order, with which args.
-- **Final message** — exact match, regex, or a custom predicate.
-- **Termination reason** — did the agent stop because the work was done
-  or because it hit a budget?
-- **Latency / token cost** — within a budget per case.
-- **Anything custom** — pass an `evaluators=[...]` list of callables.
+Every check runs only when the corresponding field is set on the
+case. Each check contributes equally to the per-case score.
 
-## Reports
+| Field | Passes when |
+|---|---|
+| `expected_tools` | All listed tools appear in the run's tool executions. |
+| `expected_output_contains` | Every string is a case-insensitive substring of the final message. |
+| `expected_output_not_contains` | None of the strings appear in the final message. |
+| `max_iterations` | The run finished in ≤ N ReAct turns. |
+| `max_duration_ms` | Wall-clock duration ≤ N milliseconds. |
 
-`EvalReport` is JSON-serialisable; the HTML view is a static page you
-can drop into CI artifacts. Pass-rate per case, latency histogram,
-token-cost trend, and a diff against the previous report.
+A case **passes** when every check passed; the **score** is the
+fraction of checks that passed (handy for partial-credit scoring
+across a soak).
 
-## Custom evaluators
+## Tags and filtering
 
-The `expected` dict on each `EvalCase` accepts callables, so the
-simplest way to add a custom check is a lambda or function reference:
+```python
+EvalCase(name="...", prompt="...", tags=["smoke", "happy-path"])
+EvalCase(name="...", prompt="...", tags=["adversarial"])
+
+# Run only smoke cases on every commit; full suite nightly.
+smoke = [c for c in all_cases if "smoke" in c.tags]
+runner.run(smoke)
+```
+
+`tags` is just a list — slice it however your CI matrix expects.
+
+## LLM-as-judge for open-ended quality
+
+The built-in checks are structural ("did the right tool fire?", "did
+the answer mention 'temperature'?"). For free-text quality
+("is this answer empathetic?", "is the explanation correct?"), wrap a
+judge model as a tool and key on its verdict:
 
 ```python
-def cited(message: str) -> bool:
-    """Pass if every expected citation appears in the final message."""
-    return all(c in message for c in ["[1]", "[2]", "[3]"])
+from locus.tools.decorator import tool
 
+@tool
+def judge(answer: str) -> dict:
+    """LLM-graded quality verdict (0.0–1.0 + reasoning)."""
+    return judge_model.run_sync(f"Grade this answer: {answer}").message
+
+# Then in the case:
 EvalCase(
-    name="research-with-citations",
-    prompt="Summarise the Q3 results with citations.",
-    expected={"final_message": cited},
+    name="empathetic_response",
+    prompt="My order is late and I'm upset.",
+    expected_tools=["judge"],  # checks that the judge ran, not what it returned
+    expected_output_contains=["sorry"],  # at minimum
 )
 ```
 
-## When to run
+A future locus release may bundle a typed judge directly into
+`EvalCase`; until then, this pattern is the workaround.
+
+## Common gotchas
 
-- On every commit that touches an agent's prompt, tools, or model.
-- Before swapping a model.
-- As a nightly soak with `n=20` per case to see variance.
+| Symptom | Likely cause |
+|---|---|
+| Case passes locally, fails in CI | Non-deterministic model. Pin the model id, lower `temperature`, run with `n=5` and look at variance. |
+| `max_duration_ms` flakes | Cold-start network latency. Use a wall-clock budget at the suite level, not per-case, or bump the per-case budget by 2×. |
+| `expected_tools` reports failure even though the tool ran | Case-sensitive name match — `book_flight` != `Book_Flight`. |
+| Score is 0.5 every time | One of two checks is consistently failing. Read `result.checks` — it carries the full pass/fail map. |
 
-## Tutorial
+## Source and tutorial
 
-[`tutorial_26_evaluation.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_26_evaluation.py).
+- [`tutorial_26_evaluation.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_26_evaluation.py) — runnable end-to-end suite.
+- [`locus.evaluation.framework`](https://github.com/oracle-samples/locus/blob/main/src/locus/evaluation/framework.py) — `EvalCase`, `EvalRunner`, `EvalReport`.
 
-## Source
+## See also
 
-`src/locus/evaluation/`.
+- [Reasoning](reasoning.md) — `reflexion=True` and `grounding=True` reduce the kind of failures you'd otherwise catch only in evals.
+- [Termination](termination.md) — `max_iterations` on `EvalCase` mirrors `MaxIterations` on the agent.
+- [Hooks](hooks.md) — record per-eval traces with a `TelemetryHook` for offline review.
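
Note on the `evaluation.md` change above: the rule in "Built-in checks" (a check runs only when its field is set, a case passes when every check passes, and the score is the fraction of checks that passed) can be sketched in plain Python. This is a stand-alone illustration, not the locus implementation. `output_checks` and `score_case` are hypothetical names, and treating the negative substring check as case-insensitive is an assumption.

```python
def output_checks(final_message, must_contain=(), must_not_contain=()):
    """Build a check-name -> bool map; a check exists only if its field was set."""
    text = final_message.lower()
    checks = {}
    if must_contain:
        checks["expected_output_contains"] = all(
            s.lower() in text for s in must_contain
        )
    if must_not_contain:
        # Assumption: the negative check is case-insensitive too.
        checks["expected_output_not_contains"] = not any(
            s.lower() in text for s in must_not_contain
        )
    return checks


def score_case(checks):
    """Return (passed, score): passing needs every check; score is the passing fraction."""
    if not checks:
        return True, 1.0  # assumption: no checks set means a trivial pass
    passed_count = sum(checks.values())
    return passed_count == len(checks), passed_count / len(checks)


checks = output_checks(
    "Flight TK-12 is booked.",
    must_contain=["tk-12", "confirmed"],
    must_not_contain=["not found"],
)
print(score_case(checks))
```

Here one of the two checks fails ("confirmed" is missing), so the case does not pass and scores 0.5, which is the situation the "Score is 0.5 every time" gotcha row describes.
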
diff --git a/docs/concepts/events.md b/docs/concepts/events.md
index bd74559..8ab3cc8 100644
--- a/docs/concepts/events.md
+++ b/docs/concepts/events.md
@@ -1,63 +1,210 @@
-# Events & streaming
+# Events
 
-Every observable step of a run is a typed Pydantic event, not a
-callback. `agent.run(...)` is an `AsyncIterator[LocusEvent]`.
+Every observable step of an agent run is a typed Pydantic event. Not
+a dict, not a callback, not a string — a frozen class with named
+fields you can `match` on.
+
+This is the reference page. For the *how* (consuming the stream,
+SSE, hooks), see [Streaming](streaming.md). For the *why* (frozen,
+typed, write-protected), see [Agent loop](agent-loop.md).
 
 ```python
-from locus import Agent
 from locus.core.events import (
     ThinkEvent, ToolStartEvent, ToolCompleteEvent, TerminateEvent,
 )
 
 async for event in agent.run("Plan a trip"):
     match event:
-        case ThinkEvent(thought=t):
-            print("thinking:", t)
+        case ThinkEvent(reasoning=r) if r:
+            print("💭", r)
         case ToolStartEvent(tool_name=n, arguments=a):
-            print(f"calling {n}({a})")
+            print(f"🔧 {n}({a})")
         case ToolCompleteEvent(tool_name=n, result=r, error=e):
-            print(f"done {n}: {e or r}")
+            print(f"   ↳ {e or r}")
         case TerminateEvent(reason=r, final_message=m):
             print(f"[{r}] {m}")
 ```
 
-## Event types
+## Common fields
+
+Every event inherits from `LocusEvent` and carries:
+
+| Field | Type | Meaning |
+|---|---|---|
+| `event_type` | `Literal[...]` | Discriminator string — `"think"`, `"tool_start"`, etc. |
+| `timestamp` | `datetime` | UTC, populated at emit time. |
+
+Events are **frozen** Pydantic models. A hook can read every field;
+it cannot mutate one. To steer a run, use the explicit method on the
+event (`event.cancel()`, `event.replace_arguments(...)`) — the intent
+is visible in code review.
 
-| Event | When |
+## Core events
+
+### `ThinkEvent`
+
+The model emitted reasoning, optionally with tool calls.
+
+| Field | Meaning |
 |---|---|
-| `ThinkEvent` | Model produced reasoning (+ optional tool calls) |
-| `ToolStartEvent` | About to invoke a tool |
-| `ToolCompleteEvent` | Tool returned (or errored) |
-| `ReflectEvent` | Reflexion cycle finished with new confidence |
-| `GroundingEvent` | Grounding verified / disputed a claim |
-| `ModelChunkEvent` | Streaming token from the LLM provider |
-| `InterruptEvent` | A hook requested human-in-the-loop |
-| `TerminateEvent` | Run ended (with `reason` and `final_message`) |
+| `iteration` | ReAct turn index (0-based) |
+| `reasoning` | The model's chain-of-thought, if the provider exposed it |
+| `tool_calls` | Tool calls the model decided to make this turn |
 
-## SSE
+Render this as a "thinking…" bubble. For most providers `reasoning` is
+`None` unless extended thinking is enabled (Claude 4 / o-series).
 
-For HTTP deployments, the FastAPI wrapper emits the event stream as
-Server-Sent Events. Each event becomes one SSE frame with its JSON
-payload.
+### `ToolStartEvent`
 
-## Termination conditions
+The agent is about to invoke a tool.
 
-Termination is also typed and composable. `|` is OR, `&` is AND:
+| Field | Meaning |
+|---|---|
+| `tool_name` | Tool registered with `@tool` |
+| `tool_call_id` | Provider-issued id, used to correlate with the matching `ToolCompleteEvent` |
+| `arguments` | The validated arguments dict |
 
-```python
-from locus.core.termination import (
-    MaxIterations, TokenLimit, TextMention, TimeLimit, ToolCalled,
-)
+Show a "calling X" indicator.
 
-# Stop after 10 iterations OR when the model says "DONE".
-condition = MaxIterations(10) | TextMention("DONE")
+### `ToolCompleteEvent`
 
-# Stop when BOTH: the confidence is high AND a specific tool was called.
-condition = ConfidenceMet(0.9) & ToolCalled("send_summary")
+A tool returned, errored, or was cancelled.
 
-agent = Agent(..., termination=condition)
-```
+| Field | Meaning |
+|---|---|
+| `tool_name` | Same name as the matching start event |
+| `tool_call_id` | Pairs with `ToolStartEvent.tool_call_id` |
+| `result` | The serialised return value, or `None` on error |
+| `error` | Exception message, or `None` on success |
+| `duration_ms` | How long the body actually ran |
+
+Always check `error` first — a non-`None` `error` means `result` is
+`None`.
+
+### `ModelChunkEvent`
+
+One streamed chunk from the LLM provider — the granularity that
+drives token-by-token rendering.
+
+| Field | Meaning |
+|---|---|
+| `content` | Text delta (may be `None` for tool-call-only chunks) |
+| `tool_calls` | Tool-call deltas, if the provider streams those |
+| `done` | `True` on the final chunk of a turn |
+
+`None`-guard before printing: `if e.content: print(e.content, end="")`.
+
+### `ModelCompleteEvent`
+
+A full model response was received (paired with the chunks above).
+
+| Field | Meaning |
+|---|---|
+| `content` | The complete text |
+| `tool_calls` | All tool calls in this turn |
+| `usage` | `{"input_tokens": ..., "output_tokens": ...}` |
+| `stop_reason` | Provider-specific stop reason |
+
+Telemetry hooks key off `usage` for cost tracking.
+
+### `ReflectEvent`
+
+[Reflexion](reasoning.md#reflexion) emitted a self-evaluation.
+
+| Field | Meaning |
+|---|---|
+| `iteration` | Which turn this reflection concerns |
+| `assessment` | `"on_track"`, `"stuck"`, `"new_findings"`, or `"loop_detected"` |
+| `confidence_delta` | Change vs the previous turn |
+| `new_confidence` | Current value, 0.0–1.0 |
+| `guidance` | Free-text steering for the next turn |
+
+Pair `new_confidence` with [`ConfidenceMet`](termination.md) for early
+stopping.
+
+### `GroundingEvent`
+
+[Grounding](reasoning.md#grounding) finished evaluating claims.
+
+| Field | Meaning |
+|---|---|
+| `score` | 0.0–1.0, fraction of claims supported |
+| `claims_evaluated` | How many claims the judge looked at |
+| `ungrounded_claims` | The text of every unsupported claim |
+| `requires_replan` | `True` if the run should re-research |
+
+### `InterruptEvent`
+
+A tool requested human-in-the-loop input. The run pauses; resume by
+calling the agent with the user's reply.
+
+| Field | Meaning |
+|---|---|
+| `question` | What to ask the human |
+| `options` | If multiple-choice, the allowed answers |
+| `interrupt_id` | Pass back to resume |
+| `metadata` | Free-form context for the UI |
+
+See [Interrupts](interrupts.md).
+
+### `TerminateEvent`
+
+The run finished.
+
+| Field | Meaning |
+|---|---|
+| `reason` | Which termination condition fired (its `repr`) |
+| `iterations_used` | How many ReAct turns ran |
+| `final_confidence` | Reflexion confidence at end of run |
+| `total_tool_calls` | Distinct tool invocations |
+| `final_message` | The assistant's last text, if any |
+
+Emitted exactly once per run, unless the consumer cancels the stream first.
+
+## Multi-agent events
+
+These appear when an `Orchestrator`, `Swarm`, or `StateGraph` is
+running.
+
+| Event | Fired when |
+|---|---|
+| `SpecialistStartEvent` | Orchestrator dispatched to a specialist |
+| `SpecialistCompleteEvent` | Specialist returned a result |
+| `OrchestratorDecisionEvent` | Orchestrator picked its next step (`invoke_specialist`, `correlate`, `summarize`, `finalize`) |
+
+See [Multi-agent](multi-agent.md).
+
+## Causal-reasoning events
+
+When `causal=True`, the agent emits node and edge events as the graph
+grows.
+
+| Event | Fired when |
+|---|---|
+| `CausalNodeEvent` | A new entity entered the cause-effect graph (root cause / symptom / intermediate) |
+| `CausalEdgeEvent` | A causal link was added between two nodes |
+
+## Hook events
+
+`BeforeInvocationEvent`, `AfterInvocationEvent`, `BeforeToolCallEvent`,
+and `AfterToolCallEvent` are delivered *to hooks* at the same lifecycle
+points that produce the user-visible events. See [Hooks](hooks.md).
+
+## Common gotchas
+
+| Symptom | Likely cause |
+|---|---|
+| Type checker reports the `match` as non-exhaustive | Add a `case _: pass` fallthrough or handle the missing variant. |
+| `ModelChunkEvent.content` is `None` | Tool-call-only chunk. Guard with `if event.content:`. |
+| `TerminateEvent` never arrives | Generator was cancelled mid-stream. Check the consumer for exceptions. |
+| Hook tried to mutate `event.tool_name` and got `ValidationError` | Frozen by design — use `event.replace_arguments(...)` or `event.cancel()` instead. |
+
+## Source
+
+- [`locus.core.events`](https://github.com/oracle-samples/locus/blob/main/src/locus/core/events.py) — every event class.
+
+## See also
 
-Built-in conditions: `MaxIterations`, `TokenLimit`, `TextMention`,
-`TimeLimit`, `ToolCalled`, `ConfidenceMet`, `NoToolCalls`,
-`CustomCondition`.
+- [Streaming](streaming.md) — how to consume the event stream.
+- [Hooks](hooks.md) — observe the same events from inside the loop.
+- [Agent server](server.md) — re-emit events over Server-Sent Events.
diff --git a/docs/concepts/hooks.md b/docs/concepts/hooks.md
index 32fc117..06d7090 100644
--- a/docs/concepts/hooks.md
+++ b/docs/concepts/hooks.md
@@ -61,7 +61,7 @@ no-op defaults from the base class.
 
 ```python
 agent = Agent(
-    model="oci:openai.gpt-5.5",
+    model="oci:openai.gpt-5",
     tools=[search, book_flight],
     hooks=[AuditHook()],
 )
@@ -86,7 +86,7 @@ from locus.hooks.builtin import (
 )
 
 agent = Agent(
-    model="oci:openai.gpt-5.5",
+    model="oci:openai.gpt-5",
     tools=[...],
     hooks=[
         StructuredLoggingHook(),       # JSON logs at every phase
@@ -132,7 +132,7 @@ call is higher than the cost of a second model round-trip.
 ```python
 agent = Agent(
     ...,
-    hooks=[SteeringHook(approver="oci:openai.gpt-5.5")],
+    hooks=[SteeringHook(approver="oci:openai.gpt-5")],
 )
 ```
 
diff --git a/docs/concepts/mcp.md b/docs/concepts/mcp.md
index f0fec86..5b9f0d1 100644
--- a/docs/concepts/mcp.md
+++ b/docs/concepts/mcp.md
@@ -50,7 +50,7 @@ stdin/stdout, and discovers what tools the server exposes.
 from locus import Agent
 
 agent = Agent(
-    model="oci:openai.gpt-5.5",
+    model="oci:openai.gpt-5",
     tools=[*fs.tools()],          # MCP tools become locus tools
     system_prompt="You can read files in /data.",
 )
@@ -136,7 +136,7 @@ analytics = LocusMCPServer(              # producer side
 analytics.run_http(port=7400, in_background=True)
 
 agent_a = Agent(
-    model="oci:openai.gpt-5.5",
+    model="oci:openai.gpt-5",
     tools=[*fs.tools(), summarise_csv, plot_histogram],
 )
 ```
diff --git a/docs/concepts/reasoning.md b/docs/concepts/reasoning.md
index 93fd531..270be24 100644
--- a/docs/concepts/reasoning.md
+++ b/docs/concepts/reasoning.md
@@ -1,59 +1,141 @@
 # Reasoning
 
-A model that loops without thinking is a model that pays you to be
-wrong faster. locus ships three reasoning add-ons that are each a
-single argument on `Agent(...)`.
+A model that loops without thinking just charges you to be wrong faster.
+locus ships three reasoning add-ons that catch wrong premises *before*
+the next tool call, not in the post-mortem:
+
+- **Reflexion** — after each turn, the agent self-evaluates and
+  re-plans if the last step was wrong.
+- **Grounding** — every factual claim is checked against tool results
+  by an LLM-as-judge before the answer goes out.
+- **Causal reasoning** — a running cause-effect graph that surfaces
+  contradictions linear chat history hides.
+
+Each is a single argument on `Agent(...)`. You can combine them.
+
+## When to pick which
+
+| Situation | Add-on |
+|---|---|
+| Agent loops endlessly or stacks tool calls on a wrong premise | `reflexion=True` |
+| Customer-facing answers where hallucinated facts cost money (drug names, prices, account numbers) | `grounding=True` |
+| Multi-step diagnosis or root-cause analysis where one bad assumption poisons the chain | `causal=True` |
+| All three apply — production research agent, compliance-sensitive answer | turn them all on |
+| Quick prototype, low-stakes Q&A | leave them off — extra model calls are wasted |
+
+The cost is more model round-trips. The win is fewer wrong answers.
+For short tasks the math doesn't pencil out. For runs of 5+ tool calls
+or anything that ships to a customer, it almost always does.
+
+## Getting started
+
+### Reflexion
+
+Self-evaluate per turn.
 
 ```python
+from locus import Agent
+
 agent = Agent(
     model="oci:openai.gpt-5",
-    tools=[search, summarise, validate_claim],
-    reflexion=True,    # self-evaluate per turn
-    grounding=True,    # LLM-as-judge claim verification
-    causal=True,       # cause-effect chain analysis
+    tools=[search, summarise],
+    reflexion=True,
 )
+
+result = agent.run_sync("Find Q3 revenue and explain the YoY change.")
+print(result.metrics.reflexion_iterations)
 ```
 
-## Reflexion
+After each tool result, the agent is asked: *given this, was the last
+step right?* If the answer is "no", the next turn rewrites the plan
+instead of stacking another tool call on top. Streamed as
+`ReflectEvent` — render it in your UI and the user can literally watch
+the agent change its mind.
 
-After each tool result, the agent is asked: *"given this, was your
-last step right?"* If the answer is "no", the next turn rewrites the
-plan instead of stacking another tool call on top of a wrong premise.
+### Grounding
 
-Source: [Shinn et al., 2023](https://arxiv.org/abs/2303.11366) plus a
-locus-native execution loop. Implementation in
-`src/locus/reasoning/reflexion.py`. Streamed as `ReflectEvent`.
+Verify claims before answering.
+
+```python
+agent = Agent(
+    model="oci:openai.gpt-5",
+    tools=[search_pricing, lookup_inventory],
+    grounding=True,
+)
 
-## Grounding
+result = agent.run_sync("What's the cheapest GPU instance with 80GB?")
+for claim in result.grounding_report.unsupported:
+    print(f"DROPPED: {claim.text}")
+```
 
 Before the agent finalises an answer, every factual claim is checked
 against the conversation's tool results. A second model — the judge —
-reads each claim and the supporting tool output and emits "supported /
-unsupported / partially supported". Unsupported claims are removed or
-sent back for re-research.
+reads each claim and the supporting tool output and emits *supported /
+unsupported / partially supported*. Unsupported claims are dropped or
+sent back for re-research. Streamed as `GroundingEvent`.
 
-Source: `src/locus/reasoning/grounding.py`.
+### Causal
+
+Track cause-effect chains.
+
+```python
+agent = Agent(
+    model="oci:openai.gpt-5",
+    tools=[fetch_logs, query_metrics, traceback],
+    causal=True,
+)
+
+result = agent.run_sync("Why is checkout p99 latency up 4x since 14:00?")
+print(result.causal_chain.root_causes)
+```
+
+The agent maintains a running cause-effect graph — *X happened
+because Y; Y because Z* — and validates new conclusions against it.
+Cycles, contradictions, and unsupported jumps surface as the chain
+grows. Particularly useful for incident triage where the linear chat
+log doesn't show that turn 3's "fix" contradicts turn 1's "root
+cause".
+
+## Combining them
+
+```python
+agent = Agent(
+    model="oci:openai.gpt-5",
+    tools=[...],
+    reflexion=True,
+    grounding=True,
+    causal=True,
+)
+```
 
-## Causal
+The order is fixed: reflect first (was the last step right?), build
+the causal graph as you go, ground only at the end (don't waste
+judge tokens on intermediate claims). All three are observable as
+their own event types.
 
-The agent maintains a running cause-effect chain — *"did X because Y;
-Y because Z"* — and checks new conclusions against it. Surfaces
-contradictions that the linear chat history hides.
+## Common gotchas
 
-Source: `src/locus/reasoning/causal.py`.
+| Symptom | Likely cause |
+|---|---|
+| Reflexion loops forever | The model can't agree with itself. Cap with `MaxIterations` in your termination condition. |
+| Grounding flags everything as unsupported | The judge model is stricter than the answerer. Use the same model for both, or lower the threshold. |
+| Causal graph has many disconnected nodes | The model isn't naming entities consistently across turns. Sharpen the system prompt to name entities the same way each time. |
+| Reasoning add-ons feel slow | They're extra model calls — that's the trade. Keep them for runs that ship to a human, drop them for hot paths. |
 
-## When to use
+## Source and tutorial
 
-- **Reflexion** — agents that loop, especially research and
-  long-running planning.
-- **Grounding** — anything customer-facing where hallucinated facts
-  are bad. Drug names. Account numbers. Prices.
-- **Causal** — multi-step explanations where a wrong root assumption
-  silently poisons everything downstream.
+- [`tutorial_14_reasoning_patterns.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_14_reasoning_patterns.py) — all three add-ons end-to-end.
+- [`locus.reasoning.reflexion`](https://github.com/oracle-samples/locus/blob/main/src/locus/reasoning/reflexion.py)
+- [`locus.reasoning.grounding`](https://github.com/oracle-samples/locus/blob/main/src/locus/reasoning/grounding.py)
+- [`locus.reasoning.causal`](https://github.com/oracle-samples/locus/blob/main/src/locus/reasoning/causal.py)
+- [`ReflectNode`](https://github.com/oracle-samples/locus/blob/main/src/locus/loop/nodes.py) in the ReAct loop — where reflection plugs in.
 
-You can combine all three. The cost is more model calls; the win is
-fewer wrong answers.
+Reflexion: [Shinn et al., 2023](https://arxiv.org/abs/2303.11366).
+Grounding-Stratified Adaptive Replanning: see [GSAR](gsar.md) for the
+typed-evidence variant locus also ships.
 
-## Tutorial
+## See also
 
-[`tutorial_14_reasoning_patterns.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_14_reasoning_patterns.py).
+- [GSAR](gsar.md) — typed-grounding layer with weighted scoring and tiered replanning.
+- [Events](events.md) — `ReflectEvent`, `GroundingEvent`, causal node/edge events.
+- [Termination](termination.md) — combine `ConfidenceMet` with reflexion to early-stop on high-confidence answers.
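
Note on the `reasoning.md` change above: the Grounding section describes a judge that labels each claim *supported / unsupported / partially supported*, and `events.md` lists `score`, `claims_evaluated`, `ungrounded_claims`, and `requires_replan` on `GroundingEvent`. A minimal sketch of how per-claim verdicts could roll up into those fields follows. It is not the locus implementation: `grounding_report`, the `replan_below` threshold, and counting only fully supported claims toward the score are all assumptions, and the verdicts are hard-coded where a real run would ask a judge model.

```python
def grounding_report(verdicts, replan_below=0.5):
    """Roll per-claim verdicts up into a summary.

    verdicts maps claim text to one of:
    "supported", "partially supported", "unsupported".
    """
    total = len(verdicts)
    supported = [c for c, v in verdicts.items() if v == "supported"]
    ungrounded = [c for c, v in verdicts.items() if v == "unsupported"]
    # Assumption: only fully supported claims count toward the score.
    score = len(supported) / total if total else 1.0
    return {
        "score": score,
        "claims_evaluated": total,
        "ungrounded_claims": ungrounded,
        # Assumption: a low score triggers re-research instead of answering.
        "requires_replan": score < replan_below,
    }


report = grounding_report(
    {
        "The A100 instance has 80GB of memory": "supported",
        "It costs $1.10 per hour": "unsupported",
        "It is available in every region": "unsupported",
        "It is the cheapest option listed": "supported",
    }
)
print(report["score"], report["requires_replan"])
for claim in report["ungrounded_claims"]:
    print("DROPPED:", claim)
```
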
diff --git a/docs/concepts/state.md b/docs/concepts/state.md
index d2828db..08821f4 100644
--- a/docs/concepts/state.md
+++ b/docs/concepts/state.md
@@ -1,9 +1,14 @@
 # State
 
-`AgentState` is the single typed record of everything a run knows. It
-is an immutable Pydantic model — every mutation returns a new instance
-— which means the state round-trips through JSON cleanly, survives
-checkpointing, and can be compared across turns.
+`AgentState` is the single typed record of everything a run knows.
+It's an **immutable Pydantic model** — every mutation returns a new
+instance, list-like fields are `tuple`s or `frozenset`s, and the whole
+thing round-trips through JSON without loss.
+
+That immutability is load-bearing: it's why checkpoints are
+deterministic, why two parallel branches in a graph can each "modify"
+the state without stepping on each other, and why a hook reading
+`state.tool_executions` can't accidentally corrupt the run.
 
 ```python
 from locus.core.state import AgentState
@@ -12,50 +17,117 @@ from locus.core.messages import Message, Role
 state = AgentState(agent_id="my-agent", max_iterations=20)
 state = state.with_message(Message(role=Role.USER, content="hi"))
 state = state.with_confidence(0.85)
+
+# `state` was rebound to each new instance; no earlier instance was mutated.
+assert state.confidence == 0.85
 ```
 
+## When you'll touch state directly
+
+Most of the time you don't — `Agent.run(...)` builds and threads it
+for you. Reach for it when:
+
+| Situation | What to do |
+|---|---|
+| You're writing a custom hook and want to inspect the conversation so far | Read `state.messages`, `state.tool_executions`, `state.confidence` |
+| You're persisting a run and rehydrating later | `state.to_checkpoint()` / `AgentState.from_checkpoint(d)` — every checkpointer does this internally |
+| You're writing a custom termination predicate | `CustomCondition(lambda s: ...)` — `s` is `AgentState` |
+| You're building a multi-agent graph | Reducers compose new `AgentState` from parallel branches (see below) |
+| You want to seed a run from a previous transcript | Construct `AgentState(messages=(...))` and pass to `agent.run(...)` |
+
 ## Fields
 
 | Field | Type | Meaning |
 |---|---|---|
-| `agent_id` | `str` | Identifier carried across turns. |
 | `run_id` | `str` (UUID) | Unique to this run. |
-| `messages` | `list[Message]` | Full conversation, in order. |
-| `tool_executions` | `list[ToolExecution]` | Every tool call with its arguments, result, and duration. |
-| `reasoning_steps` | `list[ReasoningStep]` | Think / Execute / Reflect steps. |
+| `agent_id` | `str \| None` | Stable identifier carried across runs of the same agent. |
+| `messages` | `tuple[Message, ...]` | Full conversation, in order. |
 | `iteration` | `int` | Current ReAct iteration index. |
 | `max_iterations` | `int` | Upper bound before termination. |
-| `confidence` | `float` | Reflexion signal 0.0–1.0. |
-| `confidence_threshold` | `float` | Early-stop threshold. |
-| `terminal_tools` | `frozenset[str]` | Tool names that end the run. |
-| `token_budget` | `int \| None` | Optional token cap. |
-| `total_tokens_used` | `int` | Running total. |
-| `errors` | `list[str]` | Any tool/model errors. |
-| `metadata` | `dict[str, Any]` | User-supplied context. |
+| `tool_executions` | `tuple[ToolExecution, ...]` | Every tool call: name, args, result/error, duration, idempotent-cache hit flag. |
+| `reasoning_steps` | `tuple[ReasoningStep, ...]` | Per-iteration think → execute → reflect record. |
+| `confidence` | `float` | Reflexion signal, 0.0–1.0. |
+| `confidence_threshold` | `float` | Threshold used by `ConfidenceMet`. |
+| `confidence_history` | `tuple[float, ...]` | Confidence at each iteration — useful for plotting. |
+| `tool_history` | `tuple[str, ...]` | Just the tool names, in order. Powers loop detection. |
+| `tool_loop_threshold` | `int` | How many identical consecutive calls qualify as a loop. |
+| `terminal_tools` | `frozenset[str]` | Tool names that auto-end the run when called. |
+| `total_tokens_used` | `int` | Running total: the sum of `prompt_tokens_used` and `completion_tokens_used`. |
+| `token_budget` | `int \| None` | Optional cap; `TokenLimit` reads this. |
+| `errors` | `tuple[str, ...]` | Tool/model error messages encountered this run. |
+| `metadata` | `dict[str, Any]` | Free-form context you can attach. |
+| `started_at`, `updated_at` | `datetime` | UTC timestamps. |
+
+## Builder methods
+
+Every "mutation" returns a new state. Helpers exist for the common
+cases — you rarely need to construct an `AgentState` from scratch:
+
+```python
+state = (
+    state
+    .with_message(Message(role=Role.ASSISTANT, content="..."))
+    .with_tool_execution(execution)
+    .with_iteration(state.iteration + 1)
+    .with_confidence(0.78)
+    .with_error("rate-limited")
+    .with_metadata("user_tz", "America/New_York")
+    .with_token_usage(prompt_tokens=312, completion_tokens=87)
+)
+```
+
+The full set: `with_message`, `with_messages`, `with_iteration`,
+`with_tool_execution`, `with_reasoning_step`, `with_confidence`,
+`with_error`, `with_metadata`, `with_token_usage`.
 
 ## Round-trip through JSON
 
 ```python
-data = state.to_checkpoint()           # → dict[str, Any]
+data: dict = state.to_checkpoint()           # plain dict, JSON-safe
 restored = AgentState.from_checkpoint(data)
 assert restored == state
 ```
 
-Every checkpointer uses this pair under the hood. If you build a custom
-checkpointer, all you have to do is serialize `to_checkpoint()` and
-rehydrate with `from_checkpoint()`.
+Every checkpointer in `locus.memory.backends` uses this pair under the
+hood. If you're writing a custom backend, all you need to do is
+serialize whatever `to_checkpoint()` returns and rehydrate with
+`from_checkpoint()` on resume.
+
+## Reducers (for graphs only)
+
+When two branches of a [StateGraph](multi-agent/graph.md) modify the
+state in parallel, locus needs to know how to merge them. That's
+what reducers do:
+
+| Reducer | Combines two values by… |
+|---|---|
+| `add_messages` | extending the message tuple |
+| `merge_dict` / `deep_merge_dict` | merging dicts shallowly / recursively |
+| `append_list` / `unique_append_list` | concatenating lists, optionally deduplicating |
+| `add_numbers`, `max_value`, `min_value` | summing, or keeping the larger / smaller value |
+| `first_value`, `last_value` | taking one branch's value |
+| `set_union` | taking the union of the two sets |
+
+Reducers are **opt-in at the graph level** — a plain `agent.run(...)`
+doesn't use them. See `locus.core.reducers` for the source.
+
+## Common gotchas
+
+| Symptom | Likely cause |
+|---|---|
+| `state.messages.append(...)` raises | Tuples are immutable. Use `state.with_message(m)`. |
+| `to_checkpoint()` round-trip drops a field | The field's value isn't JSON-serialisable (e.g., a custom class in `metadata`). Stash a serialisable form, or extend the checkpointer. |
+| Two branches in a graph clobber each other's messages | You forgot to declare the reducer for `messages`. Use `add_messages`. |
+| `confidence_history` has fewer entries than iterations | Reflexion isn't running (`reflexion=True` not set), or the run terminated before the first reflect step. |
 
-## Reducers
+## Source
 
-When running multi-agent graphs, you sometimes want two parallel
-branches to each modify the state, then merge the result. Locus ships
-with reducers for that:
+- [`locus.core.state`](https://github.com/oracle-samples/locus/blob/main/src/locus/core/state.py) — `AgentState`, `ToolExecution`, `ReasoningStep`.
+- [`locus.core.reducers`](https://github.com/oracle-samples/locus/blob/main/src/locus/core/reducers.py) — graph-level merge helpers.
 
-- `add_messages` — extend message list
-- `merge_dict` / `deep_merge_dict`
-- `append_list` / `unique_append_list`
-- `add_numbers`, `max_value`, `min_value`, `first_value`, `last_value`
-- `set_union`
+## See also
 
-Reducers are opt-in at the graph level — a plain agent run doesn't use
-them. See `locus.core.reducers`.
+- [Checkpointers](checkpointers.md) — durable persistence of `AgentState`.
+- [Events](events.md) — what gets emitted as state changes.
+- [Termination](termination.md) — `CustomCondition(fn)` is `(state) -> bool`.
+- [Multi-agent: StateGraph](multi-agent/graph.md) — where reducers earn their keep.
diff --git a/docs/concepts/termination.md b/docs/concepts/termination.md
index 5894057..02364c3 100644
--- a/docs/concepts/termination.md
+++ b/docs/concepts/termination.md
@@ -1,65 +1,138 @@
-# Termination algebra
+# Termination
 
-When does an agent stop? locus answers that with **composable
-conditions** — small classes that return `True` when the run is done,
-combined with `And` / `Or`.
+When does an agent stop? locus answers that with a typed, composable
+**algebra of stop conditions** — small classes that each return `True`
+when the run should end, combined with `&` (and) and `|` (or).
 
 ```python
 from locus.core.termination import (
-    MaxIterations, TokenLimit, TimeLimit,
-    NoToolCalls, ToolCalled, ConfidenceMet,
-    TextMention, CustomCondition,
+    MaxIterations, ToolCalled, ConfidenceMet, TextMention,
 )
 
+termination = (
+    (ToolCalled("send_summary") & ConfidenceMet(0.9))
+    | TextMention(r"\bDONE\b")
+    | MaxIterations(10)
+)
+```
+
+Read it left to right: *stop when the summary has been sent and we're
+confident, **or** the model said "DONE", **or** we hit ten iterations*.
+
+This is one of locus's signature primitives. Every stop condition is
+inspectable, unit-testable, and serialisable — no hand-rolled `if`
+ladders sprinkled through the loop.
+
+## When to pick which condition
+
+| Situation | Use |
+|---|---|
+| Hard cap on cost / runaway protection | `MaxIterations`, `TokenLimit`, `TimeLimit` |
+| The work is "done" when one specific tool fires | `ToolCalled("submit_order")` |
+| Reflexion's confidence score has cleared your threshold | `ConfidenceMet(0.85)` (requires `reflexion=True`) |
+| The agent is supposed to write text, not call more tools | `NoToolCalls()` |
+| The run ends when the model emits a sentinel | `TextMention(r"\bSHIP\b")` |
+| Custom predicate over `AgentState` | `CustomCondition(fn)` |
+
+## Getting started
+
+### 1. Pick one condition
+
+```python
+from locus import Agent
+from locus.core.termination import MaxIterations
+
 agent = Agent(
-    model=...,
-    tools=[search, send],
-    termination=(
-        # the work happened AND we believe it
-        (ToolCalled("send") & ConfidenceMet(0.9))
-        # … or we hit the safety cap
-        | MaxIterations(10)
-    ),
+    model="oci:openai.gpt-5",
+    tools=[search, summarise],
+    termination=MaxIterations(8),
+)
+```
+
+A single condition is a perfectly fine starting point. `MaxIterations`
+is the safety net every production agent should have.
+
+### 2. Combine with `&` and `|`
+
+```python
+from locus.core.termination import (
+    MaxIterations, ToolCalled, ConfidenceMet,
 )
+
+termination = (
+    ToolCalled("send_summary")        # the work happened
+    & ConfidenceMet(0.85)             # we believe the result
+) | MaxIterations(8)                  # …or the safety cap
+```
+
+`&` and `|` are real Python operator overloads (`__and__` / `__or__`)
+on `TerminationCondition`, so the result is a typed
+`AndCondition` / `OrCondition` that you can keep composing, log, or
+assert on in tests.
+
+### 3. Inspect what stopped the run
+
+```python
+result = agent.run_sync(prompt)
+print(result.termination_reason)
+# → "ToolCalled('send_summary') and ConfidenceMet(0.85)"
 ```
 
+Each leaf condition's `__repr__` mirrors its constructor call, so
+logs and traces tell you *exactly* which branch of the algebra
+fired.
+
 ## Built-in conditions
 
-| Condition | Trigger |
+| Condition | Triggers when |
 |---|---|
-| `MaxIterations(n)` | n ReAct turns reached. |
-| `TokenLimit(n)` | Cumulative model tokens exceed n. |
+| `MaxIterations(n)` | The ReAct loop has run `n` turns. |
+| `TokenLimit(n)` | Cumulative model tokens exceed `n`. |
 | `TimeLimit(seconds)` | Wall-clock budget exceeded. |
-| `NoToolCalls()` | Last turn produced text and no tool calls. |
-| `ToolCalled(name)` | A specific tool fired (with optional args predicate). |
-| `ConfidenceMet(threshold)` | Reflexion / self-eval clears the bar. |
+| `NoToolCalls()` | The most recent turn produced text and zero tool calls. |
+| `ToolCalled(name, args=None)` | A specific tool fired (with optional args predicate). |
+| `ConfidenceMet(threshold)` | Reflexion confidence ≥ threshold. |
 | `TextMention(pattern)` | Final message contains a regex match. |
-| `CustomCondition(fn)` | Anything you can write as `(state) -> bool`. |
+| `CustomCondition(fn)` | `fn(state) -> bool` — anything you can write in Python. |
 
-## Composition
+Every condition takes `AgentState` and returns `bool`. They run after
+each iteration; the first `True` wins.
 
-Compose with the `&` (And) and `|` (Or) operators directly on the
-condition objects. The result is a typed `AndCondition` /
-`OrCondition` you can keep composing:
+## Custom conditions
+
+Write any predicate over `AgentState`:
 
 ```python
-termination=(
-    ToolCalled("submit")
-    & (ConfidenceMet(0.85) | MaxIterations(5))
-)
+from locus.core.termination import CustomCondition
+
+def revenue_extracted(state) -> bool:
+    return any(
+        "revenue_usd" in (e.result or {})
+        for e in state.tool_executions
+    )
+
+termination = CustomCondition(revenue_extracted) | MaxIterations(15)
 ```
 
-## Why algebra?
+Custom conditions compose with built-ins exactly the same way — `&`
+and `|` work across the whole hierarchy.
 
-Real agents have multiple stopping criteria — *"finish when X is done
-**and** we're confident, **or** time's up"*. Hand-rolling that as `if`
-statements gets painful fast. Termination conditions are explicit,
-inspectable, and unit-testable as ordinary classes.
+## Common gotchas
+
+| Symptom | Likely cause |
+|---|---|
+| Agent always stops at `MaxIterations` | The "happy-path" condition never fires: either the model isn't calling the tool you keyed on, or confidence never reaches the threshold. Check the tool name first, then consider lowering the threshold. |
+| `&` / `\|` precedence surprises | Python's normal precedence applies: `&` binds tighter than `\|`. Add parentheses when in doubt — `(A & B) \| C` reads cleaner anyway. |
+| `ConfidenceMet` never trips | `reflexion=True` is required — without it, confidence stays at the default. |
+| `ToolCalled("x")` fires before the tool finishes | It checks the *call*, not the *result*. Pair with `ConfidenceMet` or a `CustomCondition` that inspects `tool_executions`. |
 
-## Tutorial
+## Source and tutorial
 
-[`tutorial_37_termination.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_37_termination.py).
+- [`tutorial_37_termination.py`](https://github.com/oracle-samples/locus/blob/main/examples/tutorial_37_termination.py) — runnable algebra examples.
+- [`locus.core.termination`](https://github.com/oracle-samples/locus/blob/main/src/locus/core/termination.py) — every condition class, plus `__or__` / `__and__`.
 
-## Source
+## See also
 
-`src/locus/core/termination.py`.
+- [Reasoning](reasoning.md) — pair `ConfidenceMet` with `reflexion=True`.
+- [Events](events.md) — `TerminateEvent.reason` carries the condition's `repr`.
+- [Agent loop](agent-loop.md) — where conditions evaluate inside the ReAct cycle.
diff --git a/docs/concepts/tools.md b/docs/concepts/tools.md
index 0d80735..3271f86 100644
--- a/docs/concepts/tools.md
+++ b/docs/concepts/tools.md
@@ -40,7 +40,7 @@ mark optional parameters.
 ### 2. Pass to the agent
 
 ```python
-agent = Agent(model="oci:openai.gpt-5.5", tools=[search])
+agent = Agent(model="oci:openai.gpt-5", tools=[search])
 ```
 
 That's the wiring. The model now sees `search` in its tool list and