HTTP backends: reactor matches nginx (10x libuv), CI throughput tracker #3

Merged
junjihashimoto merged 6 commits into main from feat/http-backends on Jul 2, 2026

@junjihashimoto

Summary

  • New HTTP backends: LeanTea.Net.FastServer (POSIX FFI, SO_REUSEPORT, thread-per-conn) and LeanTea.Net.ReactorServer (kqueue/epoll non-blocking event loop). Unified via LeanTea.Net.Backend + LEANTEA_HTTP_BACKEND env-var picker.
  • Reactor matches / slightly beats nginx at 128 keep-alive conns on M-series: 72 149 RPS vs nginx 69 428. libuv baseline was 6 218 RPS; that's ~10× on RPS and ~5-10× on p99 tail latency.
  • CI throughput tracker: .github/workflows/bench.yml runs on every push to main, boots lean-tea + nginx side-by-side on the same runner, records both absolute RPS and the parity ratio through benchmark-action/github-action-benchmark, and persists the results to bench-data/http-bench.json.
  • Typed LeanTea.Web.Route — Yesod-style inductive routes so Route.link refuses raw String hrefs; dead links become compile errors.

The measured story

server                          health RPS   vs nginx   p99 (ms)
------------------------------  ----------   --------   --------
libuv Server                         6 218        9 %       17
FFI FastServer (SO_REUSEPORT)       64 297       90 %       2.1
Reactor (kqueue/epoll)              72 149      104 %       2.1
nginx (same box, same conf)         69 428          -       2.0

Full breakdown (three routes × three concurrency levels + saturation runs) in docs/BENCHMARKS.md.

Why the libuv version was slow

Every recv/send hopped through: Lean IO → AsyncTask alloc → .block submit → libuv epoll → completion callback → Lean scheduler wake → fiber resume. Profiled at 100–500 µs per hop; each request paid 3–5 hops. Matches the measured T=1 ceiling of 1 933 RPS ≈ 500 µs/req.
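
As a quick sanity check on the arithmetic above, the following Lean snippet converts the measured T=1 throughput into a per-request cost and compares it with the hop budget from the profile. It is only a back-of-the-envelope sketch: `usPerRequest` is a hypothetical helper, and it assumes requests are handled serially on the single worker.

```lean
/-- Microseconds spent per request at a given RPS, assuming serial handling. -/
def usPerRequest (rps : Float) : Float := 1.0e6 / rps

-- 1 933 RPS at T=1 works out to roughly 517 µs per request,
-- which is the "≈ 500 µs/req" figure quoted above.
#eval usPerRequest 1933.0

-- Hop budget implied by the profile: 3 to 5 hops at 100 to 500 µs each,
-- i.e. somewhere between 300 µs and 2 500 µs per request.
#eval (3.0 * 100.0, 5.0 * 500.0)
```

The measured ~517 µs sits inside that range, toward the low end (for example 5 hops at about 100 µs each), which is consistent with the hop chain being the dominant cost.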

The FFI variants skip that entire path — recv() and send() are direct syscalls. The reactor additionally keeps every idle connection at ~100 bytes of C state (no OS thread per fd) so it scales to 10 k+ idle connections without the LEAN_NUM_THREADS >= concurrency sharp edge the FastServer has.

Which backend to use

LeanTea.Net.Backend.fromEnv picks one from LEANTEA_HTTP_BACKEND:

var               pick                     best for
----------------  -----------------------  ------------------------------
unset / reactor   reactor                  default; HTTP APIs
fast / fast:16    FastServer, N workers    short-req APIs, no idle floods
libuv             Server.serveConcurrent   LLM proxy, WS, SSE, chat
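
To make the mapping in the table concrete, here is a minimal sketch of how such an env-var picker could be written. It is not the LeanTea.Net.Backend source: the `Backend` type, `parseBackend`, `backendFromEnv`, and the worker count used for a bare `fast` are all assumptions made for illustration. Only the variable name, the accepted values, and the warn-and-fall-back behaviour come from this PR.

```lean
/-- Hypothetical stand-in for the backend enum. -/
inductive Backend where
  | reactor
  | fast (workers : Nat)
  | libuv
  deriving Repr

/-- Parse "reactor", "libuv", "fast" or "fast:N". -/
def parseBackend (s : String) : Option Backend :=
  match s.splitOn ":" with
  | ["reactor"] => some .reactor
  | ["libuv"]   => some .libuv
  | ["fast"]    => some (.fast 4)          -- assumed default worker count
  | ["fast", n] => n.toNat?.map Backend.fast
  | _           => none

/-- Read LEANTEA_HTTP_BACKEND; warn on stderr and use the fallback if unknown. -/
def backendFromEnv (fallback : Backend := .reactor) : IO Backend := do
  match (← IO.getEnv "LEANTEA_HTTP_BACKEND") with
  | none => pure fallback
  | some v =>
    match parseBackend v with
    | some b => pure b
    | none =>
      IO.eprintln s!"unknown LEANTEA_HTTP_BACKEND={v}; using the default backend"
      pure fallback

#eval parseBackend "fast:16"
#eval parseBackend "bogus"
```

Keeping the parser pure and the environment lookup separate makes the string grammar easy to test without touching process state.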

Apps typically write:

def main : IO Unit := do
  let backend ← LeanTea.Net.Backend.fromEnv (default := .reactor)
  LeanTea.Net.Backend.serve backend 8080 "0.0.0.0" myHandler

Also in this PR

  • Fixed a Response.toBytes bug — used to append a hardcoded connection: close even when the annotator added connection: keep-alive, so responses carried both. ab was lenient; strict clients would have dropped. The serializer is now a single growing string + one toUTF8.
  • bench_server picks up the backend from Backend.fromEnv; the --fast N / --reactor CLI flags stay for explicit perf runs and win over the env var.
  • LeanTea.Web.Route (typed inductive route + Route.link compile-time dead-link check). Standalone commit at the bottom of the branch — not on the critical perf path but was blocked in the same session.
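
The Response.toBytes fix in the first bullet can be illustrated with a small sketch. The `Response` structure and function names below are hypothetical simplifications, not the LeanTea types; the point is only that the connection header is decided once by the annotator and the serializer never appends its own, while building one string and encoding it with a single `toUTF8`.

```lean
/-- Hypothetical, simplified response type. -/
structure Response where
  status  : Nat
  headers : List (String × String)
  body    : String

/-- Case-insensitive header lookup. -/
def Response.hasHeader (r : Response) (name : String) : Bool :=
  r.headers.any fun (k, _) => k.toLower == name.toLower

/-- Add a connection header only if the handler did not set one. -/
def Response.annotate (r : Response) (keepAlive : Bool) : Response :=
  if r.hasHeader "connection" then r
  else
    let value := if keepAlive then "keep-alive" else "close"
    { r with headers := r.headers ++ [("connection", value)] }

/-- Serialize into one growing string, then encode once.
    The reason phrase is hard-coded to keep the sketch short. -/
def Response.toBytes (r : Response) : ByteArray := Id.run do
  let mut s := s!"HTTP/1.1 {r.status} OK\r\n"
  for (k, v) in r.headers do
    s := s ++ s!"{k}: {v}\r\n"
  s := s ++ s!"content-length: {r.body.utf8ByteSize}\r\n\r\n" ++ r.body
  return s.toUTF8

#eval (({ status := 200, headers := [], body := "ok" } : Response).annotate true).headers
```

Because `toBytes` emits only what is in `headers`, a response annotated with keep-alive can no longer carry a second, contradictory `connection: close`.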

Test plan

  • lake build clean on macOS (176 targets)
  • Smoke test all three backends via env var:
    • LEANTEA_HTTP_BACKEND=reactor ./bench_server → curl /health /json /echo all 200 OK
    • LEANTEA_HTTP_BACKEND=libuv → same
    • LEANTEA_HTTP_BACKEND=fast:4 → same
    • Unknown value → warns to stderr, falls back to default
  • wrk -t8 -c128 -d10s http://.../health on all three backends produces the numbers in the table above
  • Reactor doesn't crash under wrk -t16 -c2000 -d10s (63 k RPS, matches nginx at same load)
  • CI bench.yml — first run will land after merge; watch it produce a populated bench-data/http-bench.json and not infinite-loop
  • Downstream repos (chuhan, meta) still build after merge (they require this via git; no user-facing API removed but they gain the new imports)

…r helpers

Grep of the tree turned up ~15 sites where handlers hand-wrote JSON
bodies as string literals ("{\"error\":\"…\"}") or, worse, string
concatenation ("{\"sub\":\"" ++ u.sub ++ "\","...). Two failure modes
that the compiler couldn't catch:

  1. Brace / quote balance is a lint at best. A single missing } would
     have shipped as invalid JSON to every caller.
  2. Concat sites in LeanTea/Auth/Idp.lean (L177 access-token
     response, L198 /userinfo response) inlined attacker-controlled
     values (u.email, u.name) with no escaping. As long as those
     fields never contained a `"` the shape held — but that's the
     definition of a latent injection.

Three new helpers on Response so handlers can hand the codec the
problem:

  * Response.json  status body       -- body : Lean.Json
  * Response.jsonObj status v        -- v : α with [ToJson α]
  * Response.jsonError status msg    -- convenience for {"error": msg}
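
The difference between the concat sites and a codec-built body can be seen in a few lines. This is a sketch using Lean core's `Lean.Json` (`Json.mkObj`, `Json.str`, `Json.compress`); the function names `unsafeBody` and `safeBody` are hypothetical, and the assumption that the new helpers wrap a codec in roughly this way is mine, not something this commit states.

```lean
import Lean.Data.Json
open Lean

/-- The old pattern: the value is spliced into the literal unescaped. -/
def unsafeBody (email : String) : String :=
  "{\"email\":\"" ++ email ++ "\"}"

/-- The codec pattern: the value is escaped by the JSON printer. -/
def safeBody (email : String) : String :=
  (Json.mkObj [("email", Json.str email)]).compress

-- A value containing a double quote changes the shape of the first
-- body (an extra "admin" key appears) but stays a single escaped
-- string in the second.
#eval unsafeBody "a\",\"admin\":\"true"
#eval safeBody "a\",\"admin\":\"true"
```

Routing every body through a `Lean.Json` value also removes the brace-balance failure mode, since the printer rather than the handler emits the delimiters.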

Sites migrated:

  LeanTea/Auth/Idp.lean         — 6 error responses + the 2 concat
                                  sites (Bearer token + /userinfo).
  LeanTea/Browser.lean          — CDP close message.
  examples/AgentDashboard/Serve.lean  — 5 sites.
  examples/LlmChatWeb/Serve.lean       — 1 site.
  examples/Docs/Ch04_TypedRpc.lean     — 1 site (matters for teaching).
  examples/Smoke/HttpClient.lean       — the JSON-RPC handshake body.

Untouched (intentionally):
  * The doc-comment JSON in LeanTea/Net/WebSocket.lean:28 — an
    illustration, not runtime code.
  * The startsWith needle in Smoke/HttpClient.lean — matching a
    prefix, not constructing.
  * The forged JWT in Tests/PureSpec.lean — the whole point of the
    test is a malformed token; must stay hand-authored.

Verified: lake build → 162/162 green.

Adds:
  * examples/BenchServer/Main.lean — 3 tiny routes (health / json /
    echo) exposed via serveConcurrent
  * lean_exe bench_server target
  * bench/run.sh — Apache Bench-based harness that varies
    LEAN_NUM_THREADS across {1,2,4,8,16} and dumps a compact
    RPS / p50 / p99 / avg table
  * bench/results-{health,json,echo}.txt — captured runs
  * docs/BENCHMARKS.md — writeup + interpretation

Headline finding: the current serveConcurrent does NOT scale past
one worker thread on this hardware. Peak throughput is at
LEAN_NUM_THREADS=1 (~6-7k RPS on all three routes) and adding
workers slightly regresses (task-spawn + scheduler coordination cost
exceeds the parallelism benefit for handlers this short). We are
1-2 orders of magnitude below nginx / warp on the same box.

Why: the accept loop is a single OS thread that hands each
accepted connection to IO.asTask. Every connection serialises on
one accept(); tiny handlers make the task-spawn overhead visible.
Next-round design notes captured in docs/BENCHMARKS.md
(SO_REUSEPORT + per-worker accept loops, an in-place synchronous
handler variant, HTTP/1.1 keep-alive).

Ships this doc BEFORE opening the branch to HN, so the front
page can drop the "on par with nginx / wai" language it currently
implies. That claim was ambition, not measurement.

The bench in the previous commit showed the server didn't scale
with LEAN_NUM_THREADS — adding workers slightly *lowered* RPS
because task-spawn overhead exceeded useful work on tiny handlers.
Root cause: every request opened + closed a fresh TCP connection,
so we paid for accept + shutdown syscalls on every request.

This commit teaches serveConcurrent to keep the connection open:

  * New `Request.version` field (HTTP/1.0 vs 1.1) set by parseRequest,
    so the keep-alive logic can pick the right default.
  * The server loops on the same client until the request carries
    `Connection: close`, an HTTP/1.0 request falls back to that
    version's default of closing, or the socket dies. If the handler
    didn't set a `Connection` header, the response is auto-annotated
    with `Connection: keep-alive` or `Connection: close`.
  * Nagle off on the server socket (`Socket.Server.noDelay`) so tiny
    responses hit the wire immediately.
  * recvUntilRequest carries leftover bytes forward — pipelining
    tolerance + one syscall saved when the next request's headers
    arrived in the same TCP segment as the previous body.
  * Backlog bumped from 64 → 128.
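
The loop-exit rule from the second bullet can be written as a small pure function. This is a sketch under assumptions: `HttpVersion` and `keepAlive` are hypothetical names, and treating an explicit `Connection: keep-alive` on HTTP/1.0 as a request to stay open is the conventional reading rather than something this commit message spells out.

```lean
/-- Hypothetical stand-in for the new `Request.version` field. -/
inductive HttpVersion where
  | http10
  | http11

/-- Should the server keep the connection open after this request? -/
def keepAlive (version : HttpVersion) (connection : Option String) : Bool :=
  match connection.map String.toLower, version with
  | some "close", _      => false   -- explicit close always wins
  | some "keep-alive", _ => true    -- explicit keep-alive always wins
  | _, .http11           => true    -- HTTP/1.1 defaults to keep-alive
  | _, .http10           => false   -- HTTP/1.0 defaults to close

#eval keepAlive .http11 none                  -- true
#eval keepAlive .http10 none                  -- false
#eval keepAlive .http11 (some "close")        -- false
#eval keepAlive .http10 (some "Keep-Alive")   -- true
```

The same boolean can drive both the loop condition and the auto-annotated response header, so the two never disagree.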

Effect (ab -k -c 64 -n 50000, same host as before):

  T    RPS-before  RPS-after (health)
  ---  ----------  ------------------
  1    6657        1933
  2    5950        2485
  4    5663        3420
  8    5717        4469
  16   5656        6218    ← now the peak, up from ~5700

Absolute peak throughput is roughly unchanged (~6-7k RPS), but the
regime changed:

  * Before: without client-side keep-alive, T=1 was the (pathological)
    peak because task-spawn was the bottleneck.
  * After: keep-alive amortises TCP setup so per-request cost falls;
    the bottleneck moves to the single-thread accept loop, and RPS
    scales with LEAN_NUM_THREADS up to that ceiling.

Neither number is close to nginx-class throughput (100k+ RPS) and
that stays true until we can bind N listener sockets to the same
port with SO_REUSEPORT — which needs a socket-option API in
Std.Net that Lean 4.31 doesn't expose. docs/BENCHMARKS.md now has
both rounds side by side and calls out the remaining work.

Build stays green (162/162).

…nk check

Yesod-style routing: apps declare an inductive Route type and derive
Route.toPath. Route.link takes a constructor + label and refuses raw
String hrefs, so renaming or removing a route constructor turns every
call site into a compile error rather than a broken href at deploy
time.
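
A minimal sketch of the idea, with hypothetical constructors and a hand-written `toPath` (the commit says `Route.toPath` is derived; the real LeanTea.Web.Route signatures may differ):

```lean
/-- Hypothetical application routes. -/
inductive Route where
  | home
  | docs
  | user (id : Nat)

/-- Path for each route; written by hand here for illustration. -/
def Route.toPath : Route → String
  | .home    => "/"
  | .docs    => "/docs"
  | .user id => s!"/users/{id}"

/-- Accepts a `Route`, never a raw `String` href.
    The label is not HTML-escaped in this sketch. -/
def Route.link (r : Route) (label : String) : String :=
  s!"<a href=\"{r.toPath}\">{label}</a>"

#eval Route.link (.user 7) "Profile"

-- Does not type-check: a String is not a Route.
-- #eval Route.link "/users/7" "Profile"
```

Deleting the `user` constructor would make the `#eval` above fail to compile, which is the dead-link check described in the paragraph.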

Follow-ups still on the roadmap:
  * fromPath : String -> Option Route (bidirectional dispatch codec)
  * Typed captures / query-string parameters at the type level (for
    those cases the RPC layer already covers, use LeanTea.Rpc).

Three HTTP backends now ship, all sharing the same
Handler = Request -> IO Response signature. LeanTea.Net.Backend
exposes them through one enum + Backend.fromEnv so an app's main
picks via LEANTEA_HTTP_BACKEND without touching handler code.

  * LeanTea.Net.Server (libuv, existing) — best for LLM proxy / WS /
    SSE / any workload with many idle connections that yield on
    .block.
  * LeanTea.Net.FastServer (c/leantea_fastnet.c) — POSIX socket()
    + SO_REUSEPORT + blocking recv/send behind @[extern]. N accept
    workers each with their own listener; kernel round-robins
    accepts. Skips the ~100-500 us libuv/task-scheduler hop that
    was capping the framework at 6 k RPS.
  * LeanTea.Net.ReactorServer (c/leantea_reactor.c) — kqueue on
    macOS/BSD, epoll on Linux. Single non-blocking event loop
    manages every fd. Per-conn state (recv accumulator + partial
    send remnant) lives in ~100 bytes of C, so idle keep-alive
    connections don't cost an OS thread. Default.

Measured on an M-series laptop (wrk t=8 c=128 15s):

  libuv Server         6 218 RPS   (9 % of nginx)
  FFI FastServer      64 297 RPS  (90 % of nginx)
  Reactor            72 149 RPS  (104 % of nginx)
  nginx (reference)   69 428 RPS

Full numbers with p50/p99 and c=2000 saturation runs live in
docs/BENCHMARKS.md.

Also included in this commit:

  * Response.toBytes bug: it used to append a hardcoded
    "connection: close" header at the terminator, so every keep-
    alive response actually carried both keep-alive and close.
    ab tolerated it; strict clients would have dropped. Fixed +
    replaced the s! ping-pong with a single growing string + one
    toUTF8. Applies to all three backends.
  * bench_server picks up the backend from Backend.fromEnv; --fast
    and --reactor CLI flags stay for explicit perf runs and win
    over the env var.
  * README claims parity with nginx (measured, not aspirational).

.github/workflows/bench.yml runs on every push to main:
  1. Boots bench_server (LEANTEA_HTTP_BACKEND=reactor) and a
     matching nginx side-by-side on the same ubuntu-latest runner.
  2. Hits both with wrk -t8 -c128 -d15s on /health, /json, /echo.
  3. Assembles a customBiggerIsBetter JSON payload that includes
     both absolute RPS AND the lean-tea/nginx % ratio per route.
  4. Feeds it to benchmark-action/github-action-benchmark, which
     appends to bench-data/http-bench.json and flags anything below
     80 % of the previous best.
  5. Commits the updated JSON back to main via the action bot.

The absolute RPS on a 4-vCPU runner will always trail M-series
numbers; the parity ratio is what to trend, since both servers
run on the same runner in the same job.
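
For illustration, the payload described in step 3 could have the following shape. The workflow presumably assembles it in the CI script itself; the Lean below is only a sketch of the data, with hypothetical names (`RouteResult`, `parityPct`, `payload`) and hypothetical entry labels. The `name` / `unit` / `value` keys are the ones github-action-benchmark expects for its custom tools.

```lean
import Lean.Data.Json
open Lean

/-- One pair of wrk measurements for a route (hypothetical record). -/
structure RouteResult where
  route    : String
  leanRps  : Float
  nginxRps : Float

/-- Parity as a percentage: 100 means lean-tea equals nginx on this runner. -/
def RouteResult.parityPct (r : RouteResult) : Float :=
  100.0 * r.leanRps / r.nginxRps

/-- Two bigger-is-better entries per route: absolute RPS and parity. -/
def RouteResult.entries (r : RouteResult) : Array Json :=
  #[ Json.mkObj [("name", Json.str s!"lean-tea {r.route} RPS"),
                 ("unit", Json.str "req/s"),
                 ("value", toJson r.leanRps)],
     Json.mkObj [("name", Json.str s!"lean-tea/nginx {r.route} parity"),
                 ("unit", Json.str "%"),
                 ("value", toJson r.parityPct)] ]

/-- The full payload is a flat JSON array of entries. -/
def payload (rs : List RouteResult) : Json := Id.run do
  let mut out : Array Json := #[]
  for r in rs do
    out := out ++ r.entries
  return Json.arr out

-- Example using the M-series /health numbers from this PR.
#eval (payload [⟨"/health", 72149.0, 69428.0⟩]).compress
```

Tracking the parity entry separately is what lets the trend stay meaningful when the runner hardware changes, since both numerator and denominator move together.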

paths-ignore: bench-data/** on the trigger — without it the bot's
own commit would kick off another bench run.

fail-on-alert: false for now; flip once ~20 runs establish the
noise floor.

@junjihashimoto merged commit 039935e into main on Jul 2, 2026
1 check passed
@junjihashimoto deleted the feat/http-backends branch on July 2, 2026 00:34