A run is an append-only event log. Any run resumes after a crash from its last completed step, replays byte-identically, and never re-bills a completed model call. Bring your own framework — KEEL is the runtime that makes it survive production.
KEEL is to LLM agents what Temporal / Inngest / DBOS are to ordinary services: durable execution. The difference is the LLM-specific hard parts those tools don't address — non-deterministic model calls, token-metered cost, streaming, tool side effects, and human-in-the-loop gates — made to work inside a deterministic replay model.
Project status. Early-stage, single-maintainer project. The runtime is functional and every PR is CI-gated, but it has not had an independent security audit, a live multi-node soak test, or production deployments. See docs/WHEN-NOT-TO-USE.md for honest non-fit cases and maturity. Every capability claim below is backed by a test or benchmark in CI; if it isn't proven, it isn't claimed.
$ pip install keel
$ python -m examples.crash_resume_demo
KEEL — crash/resume signature demo
clean run cost ............. $0.001000
crashed after 'research' commits (a kill before step.completed)
resumed in a fresh runtime: 3 model calls on resume (research replayed from the log, not re-issued)
cost after resume .......... $0.001000
cost-of-resume == clean-run cost? PASS ✅
(the completed model call was replayed from the log and never re-billed)

A 4-step agent makes billed model calls, "crashes" mid-run, and resumes in a fresh
runtime — the completed call is replayed from the log (cost unchanged) and only the
remaining work executes. The cost equality is a checked assertion, gated in CI
(tests/chaos/test_crash_resume_demo.py), not a screenshot. Browse the same run in the
viewer: keel view.
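The replay-without-re-billing behaviour described above can be illustrated with a minimal sketch. This is not KEEL's implementation: `RecordingModelPort`, the `model.completed` event type, the step names other than `research`, and the per-call cost of $0.00025 are all hypothetical, chosen so that four calls add up to the $0.001000 shown in the demo output.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingModelPort:
    """Hypothetical model port: records live calls, replays completed ones."""

    log: list = field(default_factory=list)  # append-only event log
    live_calls: int = 0                       # calls actually sent to the fake provider

    def complete(self, step_id: str, prompt: str) -> str:
        # Replay path: a completed call for this step is already in the log.
        for event in self.log:
            if event["type"] == "model.completed" and event["step_id"] == step_id:
                return event["output"]
        # Live path: call the fake provider, then record output and cost.
        self.live_calls += 1
        output = f"answer to: {prompt}"
        self.log.append({"type": "model.completed", "step_id": step_id,
                         "output": output, "cost_usd": 0.00025})  # assumed price
        return output


def total_cost(log: list) -> float:
    """Cost is derived from the log, so replayed calls add nothing."""
    return sum(e["cost_usd"] for e in log if e["type"] == "model.completed")


STEPS = ["research", "outline", "write", "review"]  # hypothetical step names

# Clean run: all four steps execute live.
clean = RecordingModelPort()
for step in STEPS:
    clean.complete(step, step)

# Crashed run: only 'research' completed before the process died.
crashed = RecordingModelPort()
crashed.complete("research", "research")
surviving_log = list(crashed.log)

# Resume in a fresh port that starts from the surviving log.
resumed = RecordingModelPort(log=surviving_log)
for step in STEPS:
    resumed.complete(step, step)

assert resumed.live_calls == 3  # 'research' was replayed, not re-issued
assert total_cost(resumed.log) == total_cost(clean.log)
```

Because cost is computed from the log rather than from calls made, the resumed run and the clean run arrive at the same total by construction.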
- Durability you can see. Kill the process mid-run and resume in a fresh process: completed model/tool calls are replayed from the log (cost unchanged) and only the remaining work executes. Replay is byte-identical for the recorded run.
- Any recorded run is a regression test. keel regress record freezes a run into a self-contained bundle (graph + event log + blobs); a GitHub Action replays it byte-identically on every PR, with no API key, and blocks the merge on determinism or behavioural drift. KEEL dogfoods this on its own runs. See docs/REGRESSION.md.
Supporting features — explicit per-run budgets, OTel GenAI export, an out-of-process tool sandbox, policy + RBAC, and a hash-chained audit log — exist because a runtime in the execution path needs them, not as headliners.
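For readers unfamiliar with the term, a hash-chained log links every entry to the hash of the previous one, so editing any earlier entry invalidates everything after it. The sketch below shows the general technique only; the entry layout, the SHA-256 choice, and the function names are assumptions, not KEEL's audit format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes to the same bytes.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev,
                  "hash": entry_hash(prev, payload)})


def verify(chain: list) -> bool:
    # Recompute every hash from the start; any edit breaks the chain.
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


audit = []
append(audit, {"actor": "alice", "action": "gate.approve", "run": "demo"})
append(audit, {"actor": "bob", "action": "run.resume", "run": "demo"})
assert verify(audit)

# Tampering with an earlier entry is detected on verification.
audit[0]["payload"]["actor"] = "mallory"
assert not verify(audit)
```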
L4 KIR — the intermediate representation the executor runs (graphs of typed nodes)
L3 Services — model router, budgeter, tool gateway, eval harness, policy engine
L2 Durable Executor — event-sourced state machine; resume == normal scheduling
L1 Substrate — trace bus, storage adapters, OTel export, and the clock/id/rng/model/
tool *ports* through which all nondeterminism flows (record + replay)
Current state is a pure fold over the event log. Resume after a crash and normal
scheduling are the same code path — fold the log, compute the runnable frontier,
schedule it. Every nondeterminism source is funnelled through an L1 port that records
live and replays deterministically; the exact guarantees (and their current limits)
are the written contract in docs/DETERMINISM.md.
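The "fold the log, compute the runnable frontier, schedule it" loop can be sketched in a few lines. This is an illustration under assumptions, not the executor's code: the `State` shape, the `apply`/`fold`/`frontier` names, the `step.started` event type, and the dependency-map representation are hypothetical; only `step.completed` appears in the demo output above.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class State:
    completed: frozenset = frozenset()


def apply(state: State, event: dict) -> State:
    """Pure transition: the same state and event always give the same result."""
    if event["type"] == "step.completed":
        return State(completed=state.completed | {event["node"]})
    return state  # other event types do not change completion in this sketch


def fold(log: list) -> State:
    return reduce(apply, log, State())


def frontier(deps: dict, state: State) -> list:
    """Nodes that are not completed and whose dependencies all are."""
    return sorted(
        node for node, needs in deps.items()
        if node not in state.completed and set(needs) <= state.completed
    )


deps = {"research": [], "write": ["research"]}

# A fresh run and a resumed run go through the same two calls.
assert frontier(deps, fold([])) == ["research"]

crashed_log = [
    {"type": "step.started", "node": "research"},
    {"type": "step.completed", "node": "research"},
    {"type": "step.started", "node": "write"},  # killed before step.completed
]
assert frontier(deps, fold(crashed_log)) == ["write"]
```

A step that started but never recorded completion stays on the frontier, which is why resume needs no special case in this model.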
The L5 Agent/Task/Crew DSL is an optional convenience, not the product: it compiles to KIR like anything else. The headline path is KIR or a framework adapter. See examples/authoring_dsl.py.
pip install keel # SQLite + content-addressed blobs, zero extra services
pip install 'keel[viewer]' # adds the local trace viewer
keel run --mock examples/research_pipeline.py # durable, traced, budgeted — no key
keel ls # list runs
keel show <run_id> # the full event timeline (the trace)
keel view # the dashboard: runs/steps/prompts/tokens/$

The dashboard is a single-file SPA (no build step) styled like macOS and laid out like Kibana Discover: an Overview with KPI cards and inline charts, a faceted Discover event explorer (filter by type/node, full-text, histogram, payload drill-down), a Costs rollup, and inline gate approve/reject. Light/dark follows the OS.
The example is plain KIR — a graph of typed nodes, the thing the executor runs:
from keel.kir.schema import Graph, Node, Edge, NodeType

graph = Graph(
    graph_id="research_pipeline",
    nodes=[
        Node(id="research", type=NodeType.LLM_STEP,
             config={"model": "anthropic:claude-haiku-4-5", "prompt": "Research the topic."}),
        Node(id="write", type=NodeType.LLM_STEP,
             config={"model": "anthropic:claude-haiku-4-5", "prompt": "Write it up."}),
    ],
    edges=[Edge.model_validate({"from": "research", "to": "write"})],
)

→ Full tour: the quickstart walkthrough (run, pause at a human gate, resume in a fresh process, budget it, replay it byte-identically, and turn the run into a regression test).
Keep LangGraph / CrewAI / Pydantic-AI / the OpenAI Agents SDK / the Anthropic SDK — gain
durability, tracing, budgets, and byte-identical replay by running it under KEEL.
Adapters route the framework's model and tool calls through KEEL; no graph rewrite. One
conformance suite covers them all (docs/ADAPTERS.md), and adding a new one is a small,
well-scoped contribution (docs/ADAPTER-AUTHORS.md):
from keel.adapters import run_agent, AgentNode

async def research(model, inputs):
    return (await model.complete([{"role": "user", "content": "research"}])).encode()

async def write(model, inputs):
    return (await model.complete([{"role": "user", "content": "write it up"}])).encode()

# my_model stands for your model client (not defined here);
# the await below must run inside an async function.
run = await run_agent("agent",
    [AgentNode("research", research), AgentNode("write", write, deps=["research"])],
    model=my_model)  # durable, traced, budgeted, replayable

keel run --mock examples/research_pipeline.py --run-id demo
keel replay demo # re-drive from the log: byte-identical
keel diff demo other # where two runs diverge (route/cost/payload)

Resume and normal scheduling are the same fold over the log; completed model calls are replayed and never re-billed (asserted in CI, not screenshotted).
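To make "where two runs diverge" concrete, here is a minimal sketch of locating the first differing event between two logs and naming the fields that differ. It does not reproduce keel diff; the event fields (`route`, `cost_usd`) and the function name are hypothetical.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for "key absent" or "log ended"


def first_divergence(log_a: list, log_b: list):
    """Return (index, differing_fields) for the first mismatch, or None."""
    for index, (a, b) in enumerate(zip_longest(log_a, log_b, fillvalue=_MISSING)):
        if a == b:
            continue
        a_fields = {} if a is _MISSING else a
        b_fields = {} if b is _MISSING else b
        differing = sorted(
            key for key in set(a_fields) | set(b_fields)
            if a_fields.get(key, _MISSING) != b_fields.get(key, _MISSING)
        )
        return index, differing
    return None


demo = [
    {"type": "step.completed", "node": "research", "route": "haiku", "cost_usd": 0.00025},
    {"type": "step.completed", "node": "write", "route": "haiku", "cost_usd": 0.00025},
]
other = [
    {"type": "step.completed", "node": "research", "route": "haiku", "cost_usd": 0.00025},
    {"type": "step.completed", "node": "write", "route": "haiku", "cost_usd": 0.00040},
]

assert first_divergence(demo, demo) is None
assert first_divergence(demo, other) == (1, ["cost_usd"])
```

When one log is simply shorter, the first missing position is reported with every field of the extra event listed as differing.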
keel run | ls | show | resume | approve | replay | diff | simulate | test | audit | migrate | import | view
A first-party distroless, non-root image (runner + viewer), signed and SBOM-attested via the org security workflow:
docker build -t keel:local . # ~140 MB, runs as UID 65532
docker run --rm keel:local run --mock examples/research_pipeline.py
docker run --rm -p 8321:8321 keel:local # viewer on :8321 (default CMD)

Run state goes to the /data volume (KEEL_DATA_DIR). Published images are at ghcr.io/bobcatsfan33/keel; verify with cosign verify --certificate-identity-regexp 'github.com/Bobcatsfan33' ….
The product is the substrate (L1/L2): determinism and durability. Frameworks are
distribution, not competition — KEEL aims to run underneath LangGraph, CrewAI, the
OpenAI Agents SDK, and the Anthropic SDK. The strategy, milestones, and per-milestone
proof are in docs/STRATEGY.md; the determinism contract is
docs/DETERMINISM.md; reproducible numbers are in
docs/BENCHMARKS.md.
CI gates every PR on ruff + mypy --strict + import-linter layers +
unit/property/chaos tests + a nondeterminism lint gate + trace-overhead, viewer-render,
and latency-percentile (p50/p95/p99) benchmarks + a multi-worker crash soak (0 re-bills)
+ byte-identical regression replay. Apache-2.0.
Contributions are welcome — start with CONTRIBUTING.md. Adding a
framework adapter is the most welcome and best-scoped contribution
(docs/ADAPTER-AUTHORS.md). See
GOVERNANCE.md for how decisions are made and how to become a
maintainer, SECURITY.md to report a vulnerability, and
docs/STABILITY.md for the stable-surface and schema-evolution
guarantees (a run recorded on schema n replays on n+1, gated by the golden corpus).
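One common way to let a log recorded on schema n replay on n+1 is to upcast old events at read time, one version step at a time. The sketch below illustrates that general pattern only; whether KEEL works this way is not stated here, and the `schema` field, the `cost` to `cost_usd` rename, and all names are hypothetical.

```python
CURRENT_SCHEMA = 2  # assumed current version for this sketch


def v1_to_v2(event: dict) -> dict:
    # Hypothetical change: v2 renames 'cost' to 'cost_usd'.
    upgraded = dict(event)  # never mutate the recorded event
    upgraded["cost_usd"] = upgraded.pop("cost", 0.0)
    upgraded["schema"] = 2
    return upgraded


# One upcaster per version step; n -> n+1 only.
UPCASTERS = {1: v1_to_v2}


def upcast(event: dict) -> dict:
    """Apply upcasters in sequence until the event is at CURRENT_SCHEMA."""
    while event["schema"] < CURRENT_SCHEMA:
        event = UPCASTERS[event["schema"]](event)
    return event


old_event = {"schema": 1, "type": "model.completed", "cost": 0.00025}
new_event = upcast(old_event)

assert new_event == {"schema": 2, "type": "model.completed", "cost_usd": 0.00025}
assert old_event["schema"] == 1  # the stored log entry is untouched
```

Keeping the stored event immutable and upgrading a copy preserves the append-only property of the log.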