floorline

Your agent passes the demo. Does it pass the floor?

floorline measures the reliability floor of long-horizon agents, not the capability ceiling everyone already reports. Feed it repeated runs of your tasks and it tells you how often the agent succeeds on every run rather than on at least one. That is the number that actually decides whether you can ship.

Zero dependencies. Pure Python, stdlib only.
Deterministic. No model calls, no randomness — the same logs always produce the same report.
Framework-agnostic. Reads plain JSONL; works with any harness (Inspect, Langfuse, LangSmith, your own loop).
One self-contained HTML report. No JS, no CDN — open it anywhere.

The one idea

Benchmarks report pass@k: did the agent succeed at least once in k tries? That number looks great in a demo. But you don't deploy an agent that works sometimes — you deploy one that works every time. That's pass^k: the probability that all k runs succeed.

The two diverge violently as tasks get longer. A recent study found a frontier model at 78% pass@10 but only ~36% pass^10 on computer-use tasks (arXiv:2604.17849). The demo says 78%. The floor says 36%. floorline reports the floor.

k    pass@k (ceiling)   pass^k (floor)   gap
1       55.8%              55.8%          0.0%
2       70.7%              40.9%         29.8%
5       85.5%              24.3%         61.2%
10      95.8%              16.7%         79.2%   <- the 79-point gap your demo hid
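Both columns can be computed by hand from run counts. The sketch below uses the combinatorial estimators listed under "What it computes", for a single task with n recorded runs of which c succeeded. The function names `pass_at_k` and `pass_hat_k` are illustrative and are not claimed to be part of floorline's API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k runs succeeds) from n runs with c successes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k runs failed, so the result is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k runs succeed) from n runs with c successes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) is 0 when fewer than k runs succeeded, so the result is 0.0.
    return comb(c, k) / comb(n, k)


# Illustrative counts only: one task, 10 runs, 6 successes.
n, c = 10, 6
for k in (1, 2, 5):
    ceiling = pass_at_k(n, c, k)
    floor = pass_hat_k(n, c, k)
    print(f"k={k}  pass@k={ceiling:.3f}  pass^k={floor:.3f}  gap={ceiling - floor:.3f}")
```

At k = 1 the two estimates coincide (both equal c/n). They separate as k grows, which is the divergence the table shows.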

Install

pip install floorline

Quickstart

Log your runs as JSONL — one line per run. Only task_id and success are required:

{"task_id": "checkout", "success": true,  "steps": 14, "reward": 1.0, "actions": ["read","plan","click","submit"]}
{"task_id": "checkout", "success": false, "steps": 14, "reward": 0.6, "actions": ["read","plan","click","click","click"]}

Then:

floorline summary runs.jsonl              # text metrics to stdout
floorline report  runs.jsonl -o out.html  # self-contained HTML report

Or from Python:

from floorline import load_runs, build_report, render_html

report = build_report(load_runs("runs.jsonl"))
with open("report.html", "w", encoding="utf-8") as fh:   # closes the file on exit
    fh.write(render_html(report))
print(report.floor_at_k[10])   # the pass^10 reliability floor

Try it on the bundled example:

python examples/make_example.py
floorline report examples/runs.jsonl -o examples/report.html
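If your harness does not already emit JSONL, a log in the format shown above can be produced with the standard library alone. This is a minimal sketch: `append_run` is a hypothetical helper, not part of floorline, and the field values simply repeat the two example records.

```python
import json
import tempfile
from pathlib import Path


def append_run(path, task_id, success, **optional):
    """Append one run as a single JSON line (hypothetical helper, not floorline API)."""
    record = {"task_id": task_id, "success": bool(success), **optional}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


# A throwaway directory keeps the demo side-effect free; use your own path in practice.
log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"

append_run(log_path, "checkout", True, steps=14, reward=1.0,
           actions=["read", "plan", "click", "submit"])
append_run(log_path, "checkout", False, steps=14, reward=0.6,
           actions=["read", "plan", "click", "click", "click"])

print(log_path.read_text(encoding="utf-8"))
```

Appending one line per finished run means a crashed harness still leaves a usable partial log.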

Input fields

Field       Type        Required   Unlocks
`task_id`   str         yes        grouping repeated runs
`success`   bool        yes        pass@k, pass^k, reliability floor
`steps`     int         no         reliability decay curve (horizon)
`reward`    float 0–1   no         graceful degradation score
`actions`   list[str]   no         meltdown-onset detector
`run_id`    str         no         cosmetic identifier

Unknown keys are ignored — adapting another tool's logs is usually a one-line rename.
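As an illustration of such a rename, the sketch below maps another tool's field names onto the documented ones. The source keys (`episode`, `passed`, `num_steps`, `latency_ms`) are invented for the example and do not come from any specific harness. `adapt_line` is likewise a hypothetical helper.

```python
import json

# Hypothetical mapping from a foreign log schema to the documented field names.
RENAMES = {"episode": "task_id", "passed": "success", "num_steps": "steps"}


def adapt_line(line: str, renames=RENAMES) -> str:
    """Rename keys in one JSON line; keys without a mapping pass through unchanged."""
    raw = json.loads(line)
    record = {renames.get(key, key): value for key, value in raw.items()}
    record["success"] = bool(record["success"])  # coerce 0/1 flags to a real bool
    return json.dumps(record)


foreign = '{"episode": "checkout", "passed": 1, "num_steps": 14, "latency_ms": 5210}'
print(adapt_line(foreign))
# → {"task_id": "checkout", "success": true, "steps": 14, "latency_ms": 5210}
```

The leftover `latency_ms` key needs no special handling, since unknown keys are ignored.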

What it computes

All definitions are deterministic arithmetic over outcomes. The metric family follows Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents (arXiv:2603.29231); the exact operationalizations below are floorline's own, documented so you can check them by hand.

pass@k: 1 − C(n−c, k)/C(n, k), where n is the number of recorded runs of a task, c is how many of them succeeded, C is the binomial coefficient, and k ≤ n. Probability at least one of k runs succeeds (unbiased estimator, Chen et al. 2021). The capability ceiling.
pass^k: C(c, k)/C(n, k), with the same n, c, and k. Probability all k runs succeed. The reliability floor.
Reliability decay curve: pass^k bucketed by task horizon (step count). The downward slope is the long-horizon failure mode: per-step error compounds, so longer tasks are disproportionately less reliable.
Variance Amplification Factor (VAF): observed variance of per-task success rates ÷ the variance expected if every task were equally reliable. VAF ≈ 1 means outcomes look like flips of one shared coin, with every task succeeding at roughly the same rate; VAF ≫ 1 means tasks cluster into reliable vs brittle, and a single average is lying to you.
Graceful Degradation Score (GDS): average partial reward on runs that failed. Near 1: failures got most of the way. Near 0: failures are catastrophic.
Meltdown onset: slides a window over each run's actions and flags the first point where action-distribution entropy collapses (the agent stops exploring and loops). Reports the rate of looping runs and how early it starts.
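To make the last three definitions concrete, here is one plausible reading of each. It is a sketch under stated assumptions, not floorline's actual implementation. Assumptions: every task has the same number of runs n, the "expected" variance is the binomial value p(1−p)/n at the pooled rate p, the meltdown window is 3 actions, and "collapse" means entropy at or below 0.5 bits. The function names and both thresholds are hypothetical.

```python
from collections import Counter
from math import log2
from statistics import pvariance


def vaf(runs_by_task):
    """Observed variance of per-task success rates over the binomial expectation."""
    sizes = {len(outcomes) for outcomes in runs_by_task.values()}
    if len(sizes) != 1 or 0 in sizes:
        raise ValueError("sketch assumes the same non-zero run count for every task")
    n = sizes.pop()
    rates = [sum(outcomes) / n for outcomes in runs_by_task.values()]
    p = sum(rates) / len(rates)       # pooled success rate
    expected = p * (1 - p) / n        # assumed model: every task shares rate p
    return pvariance(rates) / expected if expected else None


def gds(runs):
    """Mean reward over failed runs that carry a reward; None if there are none."""
    rewards = [r["reward"] for r in runs if not r["success"] and "reward" in r]
    return sum(rewards) / len(rewards) if rewards else None


def window_entropy(actions):
    """Shannon entropy (bits) of the action distribution in one window."""
    total = len(actions)
    return -sum((c / total) * log2(c / total) for c in Counter(actions).values())


def meltdown_onset(actions, window=3, threshold=0.5):
    """Start index of the first window whose entropy is <= threshold, else None."""
    for start in range(len(actions) - window + 1):
        if window_entropy(actions[start:start + window]) <= threshold:
            return start
    return None


runs = [
    {"task_id": "checkout", "success": True, "reward": 1.0,
     "actions": ["read", "plan", "click", "submit"]},
    {"task_id": "checkout", "success": False, "reward": 0.6,
     "actions": ["read", "plan", "click", "click", "click"]},
]
print(vaf({"a": [True] * 4, "b": [False] * 4}))     # one reliable task, one brittle
print(gds(runs))                                    # only the failed run counts
print([meltdown_onset(r["actions"]) for r in runs])
```

With one task that always passes and one that always fails, the observed variance is four times the binomial expectation, which is the "reliable vs brittle" clustering the VAF is meant to expose. The failed example run ends in three identical clicks, so the window starting at index 2 has zero entropy and is flagged.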

Why pass^k, not pass@k, is the number that ships

pass@k rewards getting lucky once. Every product decision — can I let this agent touch production, run unattended overnight, handle the long task without a human — depends on it working repeatedly, which is pass^k. As agents move to multi-hour, multi-step autonomy, the floor, not the ceiling, is the deployment gate. floorline exists to put that one number in front of you, with the decay curve and brittleness that explain it.
