Your agent passes the demo. Does it pass the floor?
floorline measures the reliability floor of long-horizon agents, not the
capability ceiling everyone already reports. Feed it repeated runs of your
tasks; it tells you how often the agent succeeds every time — the number
that actually decides whether you can ship.
- Zero dependencies. Pure Python, stdlib only.
- Deterministic. No model calls, no randomness — the same logs always produce the same report.
- Framework-agnostic. Reads plain JSONL; works with any harness (Inspect, Langfuse, LangSmith, your own loop).
- One self-contained HTML report. No JS, no CDN — open it anywhere.
Benchmarks report pass@k: did the agent succeed at least once in k tries? That number looks great in a demo. But you don't deploy an agent that works sometimes — you deploy one that works every time. That's pass^k: the probability that all k runs succeed.
The two diverge violently as tasks get longer. A recent study found a frontier model at 78% pass@10 but only ~36% pass^10 on computer-use tasks (arXiv:2604.17849). The demo says 78%. The floor says 36%. floorline reports the floor.

| k | pass@k (ceiling) | pass^k (floor) | gap |
|---|---|---|---|
| 1 | 55.8% | 55.8% | 0.0% |
| 2 | 70.7% | 40.9% | 29.8% |
| 5 | 85.5% | 24.3% | 61.2% |
| 10 | 95.8% | 16.7% | 79.2% (the 79-point gap your demo hid) |

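
To see how quickly the ceiling and the floor separate, both estimators can be worked by hand for a single task. The sketch below uses the standard combinatorial forms, 1 − C(n−c, k)/C(n, k) for pass@k and C(c, k)/C(n, k) for pass^k, where n is the number of recorded runs and c the number that succeeded. The function names `pass_at_k` and `pass_hat_k` and the 10-run, 7-success task are illustrative assumptions, not floorline's API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k runs, drawn from n recorded runs, succeeded."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when there are fewer than k failures to draw from.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k runs, drawn from n recorded runs, succeeded."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Hypothetical task: 10 recorded runs, 7 of them successful.
n, c = 10, 7
for k in (1, 2, 5):
    ceiling = pass_at_k(n, c, k)
    floor = pass_hat_k(n, c, k)
    print(f"k={k}  pass@k={ceiling:.1%}  pass^k={floor:.1%}  gap={ceiling - floor:.1%}")
```

With 7 successes in 10 runs, both metrics start at 70% for k=1; by k=5, pass@k reaches 100% while pass^k falls to about 8%.
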

```shell
pip install floorline
```

Log your runs as JSONL, one line per run. Only `task_id` and `success` are required:

```json
{"task_id": "checkout", "success": true, "steps": 14, "reward": 1.0, "actions": ["read","plan","click","submit"]}
{"task_id": "checkout", "success": false, "steps": 14, "reward": 0.6, "actions": ["read","plan","click","click","click"]}
```

Then:

```shell
floorline summary runs.jsonl              # text metrics to stdout
floorline report runs.jsonl -o out.html   # self-contained HTML report
```

Or from Python:

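```python
from floorline import load_runs, build_report, render_html

# Parse the JSONL log and compute the metrics.
report = build_report(load_runs("runs.jsonl"))

# Write the self-contained HTML report, closing the file when done.
with open("report.html", "w", encoding="utf-8") as f:
    f.write(render_html(report))

print(report.floor_at_k[10])  # the pass^10 reliability floor
```

Try it on the bundled example:
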
```shell
python examples/make_example.py
floorline report examples/runs.jsonl -o examples/report.html
```

| Field | Type | Required | Unlocks |
|---|---|---|---|
| `task_id` | str | yes | grouping repeated runs |
| `success` | bool | yes | pass@k, pass^k, reliability floor |
| `steps` | int | no | reliability decay curve (horizon) |
| `reward` | float 0–1 | no | graceful degradation score |
| `actions` | list[str] | no | meltdown-onset detector |
| `run_id` | str | no | cosmetic identifier |

Unknown keys are ignored — adapting another tool's logs is usually a one-line rename.
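
As a rough illustration of that rename, the sketch below converts another harness's JSONL log into the field names floorline reads. The source keys `task` and `passed`, the `RENAMES` mapping, and the helper names are hypothetical; substitute whatever your own logs use.

```python
import json

# Hypothetical key names from some other harness; adjust to match your logs.
RENAMES = {"task": "task_id", "passed": "success"}


def adapt_record(record: dict) -> dict:
    """Rename foreign keys to floorline's field names; leave other keys as-is."""
    return {RENAMES.get(key, key): value for key, value in record.items()}


def adapt_jsonl(src_path: str, dst_path: str) -> int:
    """Rewrite a JSONL file line by line and return the number of runs written."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            dst.write(json.dumps(adapt_record(json.loads(line))) + "\n")
            count += 1
    return count


print(adapt_record({"task": "checkout", "passed": True, "latency_ms": 812}))
```

Extra keys such as `latency_ms` pass through untouched, which is harmless because unknown keys are ignored.
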
All definitions are deterministic arithmetic over outcomes. The metric family follows Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents (arXiv:2603.29231); the exact operationalizations below are floorline's own, documented so you can check them by hand.
- pass@k: `1 − C(n−c, k)/C(n, k)`, where n is the number of recorded runs of a task and c is the number that succeeded. Probability at least one of k runs succeeds (unbiased estimator, Chen et al. 2021). The capability ceiling.
- pass^k: `C(c, k)/C(n, k)`. Probability all k runs succeed. The reliability floor.
- Reliability decay curve: pass^k bucketed by task horizon (step count). The downward slope is the long-horizon failure mode: per-step error compounds, so longer tasks are disproportionately less reliable.
- Variance Amplification Factor (VAF): observed variance of per-task success rates ÷ the variance expected if every task were equally reliable. VAF ≈ 1 means outcomes look like coin flips at one shared success rate; VAF ≫ 1 means tasks cluster into reliable vs brittle, and a single average is lying to you.
- Graceful Degradation Score (GDS): average partial `reward` on runs that failed. Near 1: failures got most of the way. Near 0: failures are catastrophic.
- Meltdown onset: slides a window over each run's `actions` and flags the first point where action-distribution entropy collapses (the agent stops exploring and loops). Reports the rate of looping runs and how early it starts.
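
The meltdown-onset idea can be sketched in a few lines of stdlib Python: compute the Shannon entropy of the actions inside each sliding window and report the first window where it drops below a cutoff. The window size of 3, the 0.5-bit threshold, and the function names are assumptions chosen for illustration; floorline's actual parameters may differ.

```python
from collections import Counter
from math import log2


def shannon_entropy(window: list[str]) -> float:
    """Entropy in bits of the action distribution inside one window."""
    total = len(window)
    return -sum((n / total) * log2(n / total) for n in Counter(window).values())


def meltdown_onset(actions: list[str], window: int = 3, threshold: float = 0.5):
    """Return the start index of the first low-entropy window, or None.

    `window` and `threshold` are illustrative defaults, not floorline's.
    """
    for start in range(len(actions) - window + 1):
        if shannon_entropy(actions[start:start + window]) < threshold:
            return start
    return None


healthy = ["read", "plan", "click", "submit"]
looping = ["read", "plan", "click", "click", "click"]
print(meltdown_onset(healthy))  # None: every window mixes three distinct actions
print(meltdown_onset(looping))  # 2: the window starting at index 2 is all "click"
```

Applied to the two example runs from the JSONL snippet, the successful run never triggers, while the failed run is flagged at index 2, where the window first contains nothing but `click`.
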
pass@k rewards getting lucky once. Every product decision — can I let this
agent touch production, run unattended overnight, handle the long task without a
human — depends on it working repeatedly, which is pass^k. As agents move to
multi-hour, multi-step autonomy, the floor, not the ceiling, is the deployment
gate. floorline exists to put that one number in front of you, with the decay
curve and brittleness that explain it.
MIT © 2026 Vlad Moroz