floorline

Your agent passes the demo. Does it pass the floor?

floorline measures the reliability floor of long-horizon agents, not the capability ceiling everyone already reports. Feed it repeated runs of your tasks and it tells you how often the agent succeeds on every one of k runs: the number that actually decides whether you can ship.

  • Zero dependencies. Pure Python, stdlib only.
  • Deterministic. No model calls, no randomness — the same logs always produce the same report.
  • Framework-agnostic. Reads plain JSONL; works with any harness (Inspect, Langfuse, LangSmith, your own loop).
  • One self-contained HTML report. No JS, no CDN — open it anywhere.

The one idea

Benchmarks report pass@k: did the agent succeed at least once in k tries? That number looks great in a demo. But you don't deploy an agent that works sometimes — you deploy one that works every time. That's pass^k: the probability that all k runs succeed.
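To see why the two numbers pull apart, here is a minimal sketch of the simplest possible case. It assumes, purely for illustration, that every run succeeds independently with the same probability p, so pass@k = 1 − (1 − p)^k and pass^k = p^k. Real agents are not this tidy (tasks differ in difficulty), and the function names are made up for this example rather than taken from floorline.

```python
def ceiling_iid(p: float, k: int) -> float:
    """pass@k when each of k runs succeeds independently with probability p."""
    return 1.0 - (1.0 - p) ** k


def floor_iid(p: float, k: int) -> float:
    """pass^k under the same independence assumption: all k runs succeed."""
    return p ** k


if __name__ == "__main__":
    p = 0.9  # hypothetical per-run success rate
    for k in (1, 2, 5, 10):
        print(f"k={k:>2}  pass@k={ceiling_iid(p, k):.3f}  pass^k={floor_iid(p, k):.3f}")
```

Even at a 90% per-run success rate, the ceiling saturates near 100% while the floor keeps falling as k grows.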

The two diverge violently as tasks get longer. A recent study found a frontier model at 78% pass@10 but only ~36% pass^10 on computer-use tasks (arXiv:2604.17849). The demo says 78%. The floor says 36%. floorline reports the floor.

k    pass@k (ceiling)   pass^k (floor)   gap
1       55.8%              55.8%          0.0%
2       70.7%              40.9%         29.8%
5       85.5%              24.3%         61.2%
10      95.8%              16.7%         79.2%   <- the 79-point gap your demo hid

Install

pip install floorline

Quickstart

Log your runs as JSONL — one line per run. Only task_id and success are required:

{"task_id": "checkout", "success": true,  "steps": 14, "reward": 1.0, "actions": ["read","plan","click","submit"]}
{"task_id": "checkout", "success": false, "steps": 14, "reward": 0.6, "actions": ["read","plan","click","click","click"]}
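If your harness does not already emit this format, a few lines of stdlib Python are enough. The sketch below is one way to do it; `append_run` is a hypothetical helper for your own loop, not part of floorline.

```python
import json


def append_run(path, task_id, success, **optional):
    """Append one run to a JSONL file as a single line.

    Hypothetical helper: task_id and success are the required fields;
    optional keyword arguments (steps, reward, actions, run_id) pass through.
    """
    record = {"task_id": task_id, "success": bool(success), **optional}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Call it once per run, for example `append_run("runs.jsonl", "checkout", True, steps=14, reward=1.0)`.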

Then:

floorline summary runs.jsonl              # text metrics to stdout
floorline report  runs.jsonl -o out.html  # self-contained HTML report

Or from Python:

from floorline import load_runs, build_report, render_html

report = build_report(load_runs("runs.jsonl"))
# Write the self-contained HTML report; the with-block closes the file.
with open("report.html", "w", encoding="utf-8") as f:
    f.write(render_html(report))
print(report.floor_at_k[10])   # the pass^10 reliability floor

Try it on the bundled example:

python examples/make_example.py
floorline report examples/runs.jsonl -o examples/report.html

Input fields

Field     Type        Required   Unlocks
task_id   str         yes        grouping repeated runs
success   bool        yes        pass@k, pass^k, reliability floor
steps     int         no         reliability decay curve (horizon)
reward    float 0–1   no         graceful degradation score
actions   list[str]   no         meltdown-onset detector
run_id    str         no         cosmetic identifier

Unknown keys are ignored — adapting another tool's logs is usually a one-line rename.
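As a rough sketch of such an adapter: the source key names below (`episode`, `passed`, `num_steps`) are invented for illustration, so swap in whatever your tool actually writes.

```python
import json

# Hypothetical mapping from another tool's keys to floorline's field names.
RENAMES = {"episode": "task_id", "passed": "success", "num_steps": "steps"}


def adapt_line(line: str) -> str:
    """Rename known keys in one JSONL record; every other key passes through."""
    record = json.loads(line)
    return json.dumps({RENAMES.get(key, key): value for key, value in record.items()})


def adapt_file(src: str, dst: str) -> None:
    """Rewrite a whole JSONL file line by line, skipping blank lines."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip():
                fout.write(adapt_line(line) + "\n")
```

Keys that are not in the mapping are left in place, which is harmless because unknown keys are ignored.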


What it computes

All definitions are deterministic arithmetic over outcomes. The metric family follows Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents (arXiv:2603.29231); the exact operationalizations below are floorline's own, documented so you can check them by hand.

  • pass@k = 1 − C(n−c, k)/C(n, k), where n is the number of recorded runs of a task and c is how many of them succeeded. Probability at least one of k runs succeeds (unbiased estimator, Chen et al. 2021). The capability ceiling.
  • pass^k = C(c, k)/C(n, k). Probability all k runs succeed. The reliability floor.
  • Reliability decay curve — pass^k bucketed by task horizon (step count). The downward slope is the long-horizon failure mode: per-step error compounds, so longer tasks are disproportionately less reliable.
  • Variance Amplification Factor (VAF) — observed variance of per-task success rates ÷ the variance expected if every task were equally reliable. VAF ≈ 1 means outcomes look like fair coin flips at one rate; VAF ≫ 1 means tasks cluster into reliable vs brittle, and a single average is lying to you.
  • Graceful Degradation Score (GDS) — average partial reward on runs that failed. Near 1: failures got most of the way. Near 0: failures are catastrophic.
  • Meltdown onset — slides a window over each run's actions and flags the first point where action-distribution entropy collapses (the agent stops exploring and loops). Reports the rate of looping runs and how early it starts.
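To check two of these by hand, here is a minimal sketch of the per-task estimators and of a sliding-window entropy check. It is not floorline's source: the function names, the window size, and the entropy threshold are assumptions chosen for illustration, and floorline's own detector may define "collapse" differently.

```python
from collections import Counter
from math import comb, log2


def pass_at_k(n: int, c: int, k: int) -> float:
    """Ceiling: 1 - C(n-c, k)/C(n, k), for c successes in n runs."""
    if not 0 < k <= n:
        raise ValueError("need 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Floor: C(c, k)/C(n, k), the chance that k runs drawn from the n all succeeded."""
    if not 0 < k <= n:
        raise ValueError("need 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def window_entropy(actions) -> float:
    """Shannon entropy in bits of the action distribution within one window."""
    counts = Counter(actions)
    total = len(actions)
    return sum((m / total) * log2(total / m) for m in counts.values())


def meltdown_onset(actions, window=3, threshold=0.5):
    """Start index of the first window whose entropy falls below threshold, else None.

    window and threshold are illustrative defaults, not floorline's values.
    """
    for start in range(len(actions) - window + 1):
        if window_entropy(actions[start:start + window]) < threshold:
            return start
    return None
```

For a task with 7 successes in 10 runs, `pass_at_k(10, 7, 2)` is 1 − 3/45 while `pass_hat_k(10, 7, 2)` is 21/45. On the failing Quickstart run, `meltdown_onset(["read", "plan", "click", "click", "click"])` returns 2, the index where the repeated clicks begin.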

Why pass^k, not pass@k, is the number that ships

pass@k rewards getting lucky once. Every product decision — can I let this agent touch production, run unattended overnight, handle the long task without a human — depends on it working repeatedly, which is pass^k. As agents move to multi-hour, multi-step autonomy, the floor, not the ceiling, is the deployment gate. floorline exists to put that one number in front of you, with the decay curve and brittleness that explain it.


License

MIT © 2026 Vlad Moroz
