From f86c1ada20125349541f8d26119694cecde77cf6 Mon Sep 17 00:00:00 2001
From: Cursor Agent
Date: Mon, 30 Mar 2026 16:11:04 +0000
Subject: [PATCH] docs(lifecycle): add resume integrity troubleshooting runbooks

Co-authored-by: Sara Loera
---
 docs/guides/human-in-the-loop.mdx | 22 ++++++++++++++++++++++
 docs/reference/python-api.mdx     | 22 +++++++++++++++++++++-
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/docs/guides/human-in-the-loop.mdx b/docs/guides/human-in-the-loop.mdx
index 31a5bf4..64a291f 100644
--- a/docs/guides/human-in-the-loop.mdx
+++ b/docs/guides/human-in-the-loop.mdx
@@ -59,6 +59,28 @@ def run_with_approval(task: str, using):
 
 Prefer `resume_run(...)` over rerunning from scratch after approval. It preserves a continuous, append-only run history.
 
+## Resume troubleshooting runbook
+
+When approval is granted but continuation fails, use this checklist before retrying:
+
+1. Confirm the run is still unsealed:
+   - `final.json` present means the run is already sealed; lifecycle writes fail with `RunSealedError`.
+2. Confirm checkpoint artifacts are still anchored:
+   - `checkpoints//checkpoint.json` must still match current `events.jsonl` and `state.json`.
+3. Confirm no artifact drift after pause:
+   - checkpoint validation compares event/history anchors and artifact digests; edits to run artifacts can trigger `CheckpointConsistencyError`.
+4. Confirm adapter continuity:
+   - for non-minimal runs, `ns.resume_run(..., using=...)` must match the adapter label persisted in checkpoint/state.
+
+Common failure modes:
+
+| Error | Typical cause | What to do |
+| --- | --- | --- |
+| `RunSealedError` | run already terminal (`final.json` exists) | resume from a new run, not this sealed run |
+| `CheckpointNotFoundError` | wrong or missing `checkpoint_id` | list checkpoint directories and retry with the correct id |
+| `CheckpointConsistencyError` | checkpoint/event/state/artifact digest mismatch | treat run artifacts as drifted; restore known-good artifacts or checkpoint again before resume |
+| `ResumeAdapterMismatchError` | `using=` differs from checkpoint adapter | pass the same adapter used when the checkpoint was created |
+
 ## Policy-driven approval
 
 Create a policy that flags operations for approval:
diff --git a/docs/reference/python-api.mdx b/docs/reference/python-api.mdx
index 6f699ad..8ab2268 100644
--- a/docs/reference/python-api.mdx
+++ b/docs/reference/python-api.mdx
@@ -219,13 +219,33 @@ Adapter continuity:
 - `RunSealedError`: lifecycle writes and resume attempts are rejected once `final.json` seals the run.
 - `CheckpointNotFoundError`: `resume`/`resume_run` reference a checkpoint that does not exist.
 - `MissingCausalParentError`: checkpoint/interrupt cannot anchor to a causal parent event.
-- `CheckpointConsistencyError`: checkpoint anchor (`event_offset`, `last_event_id`, `state_hash`) no longer matches artifacts.
+- `CheckpointConsistencyError`: checkpoint no longer matches run artifacts (`event_offset`, `last_event_id`, `state_hash`, or `artifact_manifest_hash`).
 - `RunLifecycleTransitionError`: lifecycle mutation violates the run state-machine contract.
 - `ResumeAdapterRequiredError`: `resume_run` requires explicit `using` for non-minimal checkpoints.
 - `ResumeAdapterMismatchError`: `resume_run` adapter does not match checkpoint adapter contract.
 
 All errors above are defined in `noesis.domain.run_lifecycle`.
 
+### Resume integrity runbook
+
+Use this checklist when `resume(...)` or `resume_run(...)` fails with lifecycle consistency errors.
+
+1. Verify the run is still unsealed:
+   - `final.json` present means the run is sealed and lifecycle writes will fail with `RunSealedError`.
+2. Verify checkpoint metadata is still anchored:
+   - `checkpoints//checkpoint.json` should still match current `events.jsonl` and `state.json`.
+3. Verify artifacts were not edited between checkpoint and resume:
+   - `events.jsonl`, `state.json`, and (when present) `manifest.json` are part of checkpoint consistency validation.
+4. Verify explicit causal anchors:
+   - if you pass `caused_by=` to `ns.resume(...)`, it must match either the checkpoint `last_event_id` or the latest event id.
+
+Typical `CheckpointConsistencyError` causes:
+
+- checkpoint causal anchor drift (`last_event_id` mismatch)
+- `state.json` hash drift after checkpoint
+- artifact digest drift (for example `manifest.json` added/removed/changed after checkpoint)
+- malformed checkpoint payload (`checkpoint.json` invalid JSON/object shape)
+
 ### Verification helpers
 
 Use these helpers to build verification specs for `verify=...`.
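
The file-level checks in the runbooks added by this patch (sealed run, missing checkpoint, malformed `checkpoint.json`, missing artifacts) can be approximated offline before retrying a resume. The following is a minimal sketch and is not part of the patch or of the library: `preflight_resume` is a hypothetical helper, and the layout it assumes (`final.json`, `events.jsonl`, `state.json`, and one directory per checkpoint id under `checkpoints/`, all inside a single run directory) is inferred from the file names in the runbook. It does not recompute any hashes, so it cannot detect drift; that remains the job of the library's own checkpoint validation.

```python
import json
from pathlib import Path


def preflight_resume(run_dir, checkpoint_id):
    """Return a list of problems likely to block a resume (empty if none found).

    Hypothetical helper: the directory layout is an assumption, not a documented contract.
    """
    run = Path(run_dir)
    problems = []

    # A present final.json means the run is sealed.
    if (run / "final.json").exists():
        problems.append("sealed: final.json exists (expect RunSealedError)")

    # The checkpoint payload must exist and parse as a JSON object.
    root = run / "checkpoints"
    payload = root / checkpoint_id / "checkpoint.json"
    if not payload.is_file():
        known = sorted(p.name for p in root.iterdir() if p.is_dir()) if root.is_dir() else []
        problems.append(f"missing checkpoint {checkpoint_id!r}; found: {known}")
    else:
        try:
            data = json.loads(payload.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append("malformed checkpoint.json: invalid JSON")
        else:
            if not isinstance(data, dict):
                problems.append("malformed checkpoint.json: not a JSON object")

    # The artifacts a checkpoint is validated against should still be present.
    for name in ("events.jsonl", "state.json"):
        if not (run / name).is_file():
            problems.append(f"missing artifact: {name}")

    return problems
```

An empty result only means none of these file-level problems were found; a resume can still fail with `CheckpointConsistencyError` or `ResumeAdapterMismatchError`, which this sketch does not attempt to predict.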