HTTP backends: reactor matches nginx (10x libuv), CI throughput tracker#3
Merged
Conversation
…r helpers
Grep of the tree turned up ~15 sites where handlers hand-wrote JSON
bodies as string literals ("{\"error\":\"…\"}") or, worse, string
concatenation ("{\"sub\":\"" ++ u.sub ++ "\","...). Two failure modes
that the compiler couldn't catch:
1. Brace / quote balance is a lint at best. A stray missing } would
have shipped as invalid JSON to any caller.
2. Concat sites in LeanTea/Auth/Idp.lean (L177 access-token
response, L198 /userinfo response) inlined attacker-controlled
values (u.email, u.name) with no escaping. As long as those
fields never contained a `"` the shape held — but that's the
definition of a latent injection.
Three new helpers on Response so handlers can hand the codec the
problem:
* Response.json status body -- body : Lean.Json
* Response.jsonObj status v -- v : α with [ToJson α]
* Response.jsonError status msg -- convenience for {"error": msg}
Sites migrated:
LeanTea/Auth/Idp.lean — 6 error responses + the 2 concat
sites (Bearer token + /userinfo).
LeanTea/Browser.lean — CDP close message.
examples/AgentDashboard/Serve.lean — 5 sites.
examples/LlmChatWeb/Serve.lean — 1 site.
examples/Docs/Ch04_TypedRpc.lean — 1 site (matters for teaching).
examples/Smoke/HttpClient.lean — the JSON-RPC handshake body.
Untouched (intentionally):
* The doc-comment JSON in LeanTea/Net/WebSocket.lean:28 — an
illustration, not runtime code.
* The startsWith needle in Smoke/HttpClient.lean — matching a
prefix, not constructing.
* The forged JWT in Tests/PureSpec.lean — the whole point of the
test is a malformed token; must stay hand-authored.
Verified: lake build → 162/162 green.
Adds:
* examples/BenchServer/Main.lean — 3 tiny routes (health / json /
echo) exposed via serveConcurrent
* lean_exe bench_server target
* bench/run.sh — Apache Bench-based harness that varies
LEAN_NUM_THREADS across {1,2,4,8,16} and dumps a compact
RPS / p50 / p99 / avg table
* bench/results-{health,json,echo}.txt — captured runs
* docs/BENCHMARKS.md — writeup + interpretation
Headline finding: the current serveConcurrent does NOT scale past
one worker thread on this hardware. Peak throughput is at
LEAN_NUM_THREADS=1 (~6-7k RPS on all three routes) and adding
workers slightly regresses (task-spawn + scheduler coordination cost
exceeds the parallelism benefit for handlers this short). We are
1-2 orders of magnitude below nginx / warp on the same box.
Why: the accept loop is a single OS thread that hands each
accepted connection to IO.asTask. Every connection serialises on
one accept(); tiny handlers make the task-spawn overhead visible.
Next-round design notes captured in docs/BENCHMARKS.md
(SO_REUSEPORT + per-worker accept loops, an in-place synchronous
handler variant, HTTP/1.1 keep-alive).
Ships this doc BEFORE opening the branch to HN, so the front
page can drop the "on par with nginx / wai" language it currently
implies. That claim was ambition, not measurement.
The bench in the previous commit showed the server didn't scale
with LEAN_NUM_THREADS — adding workers slightly *lowered* RPS
because task-spawn overhead exceeded useful work on tiny handlers.
Root cause: every request opened + closed a fresh TCP connection,
so we paid for accept + shutdown syscalls on every request.
This commit teaches serveConcurrent to keep the connection open:
* New `Request.version` field (HTTP/1.0 vs 1.1) set by parseRequest,
so the keep-alive logic can pick the right default.
* Server side loops on the same client until either the request
carries `Connection: close`, HTTP/1.0 default, or the socket
dies. Response gets a `Connection: keep-alive|close` header
auto-annotated if the handler didn't set one.
* Nagle off on the server socket (`Socket.Server.noDelay`) so tiny
responses hit the wire immediately.
* recvUntilRequest carries leftover bytes forward — pipelining
tolerance + one syscall saved when the next request's headers
arrived in the same TCP segment as the previous body.
* Backlog bumped from 64 → 128.
Effect (ab -k -c 64 -n 50000, same host as before):
T RPS-before RPS-after (health)
----- --------- ----------
1 6657 1933
2 5950 2485
4 5663 3420
8 5717 4469
16 5656 6218 ← now the peak, up from ~5700
Absolute peak throughput is roughly unchanged (~6-7k RPS), but the
regime changed:
* Before: without client-side keep-alive, T=1 was pathological
peak because task-spawn was the bottleneck.
* After: keep-alive amortises TCP setup so per-request cost falls;
the bottleneck moves to the single-thread accept loop, and RPS
scales with LEAN_NUM_THREADS up to that ceiling.
Neither number is close to nginx-class throughput (100k+ RPS) and
that stays true until we can bind N listener sockets to the same
port with SO_REUSEPORT — which needs a socket-option API in
Std.Net that Lean 4.31 doesn't expose. docs/BENCHMARKS.md now has
both rounds side by side and calls out the remaining work.
Build stays green (162/162).
…nk check
Yesod-style routing: apps declare an inductive Route type and derive
Route.toPath. Route.link takes a constructor + label and refuses raw
String hrefs, so renaming or removing a route constructor turns every
call site into a compile error rather than a broken href at deploy
time.
Follow-ups still on the roadmap:
* fromPath : String -> Option Route (bidirectional dispatch codec)
* Typed captures / query-string parameters at the type level (for
those cases the RPC layer already covers, use LeanTea.Rpc).
Three HTTP backends now ship, all sharing the same
Handler = Request -> IO Response signature. LeanTea.Net.Backend
exposes them through one enum + Backend.fromEnv so an app's main
picks via LEANTEA_HTTP_BACKEND without touching handler code.
* LeanTea.Net.Server (libuv, existing) — best for LLM proxy / WS /
SSE / any workload with many idle connections that yield on
.block.
* LeanTea.Net.FastServer (c/leantea_fastnet.c) — POSIX socket()
+ SO_REUSEPORT + blocking recv/send behind @[extern]. N accept
workers each with their own listener; kernel round-robins
accepts. Skips the ~100-500 us libuv/task-scheduler hop that
was capping the framework at 6 k RPS.
* LeanTea.Net.ReactorServer (c/leantea_reactor.c) — kqueue on
macOS/BSD, epoll on Linux. Single non-blocking event loop
manages every fd. Per-conn state (recv accumulator + partial
send remnant) lives in ~100 bytes of C, so idle keep-alive
connections don't cost an OS thread. Default.
Measured on an M-series laptop (wrk t=8 c=128 15s):
libuv Server 6 218 RPS (9 % of nginx)
FFI FastServer 64 297 RPS (90 % of nginx)
Reactor 72 149 RPS (104 % of nginx)
nginx (reference) 69 428 RPS
Full numbers with p50/p99 and c=2000 saturation runs live in
docs/BENCHMARKS.md.
Also included in this commit:
* Response.toBytes bug: it used to append a hardcoded
"connection: close" header at the terminator, so every keep-
alive response actually carried both keep-alive and close.
ab tolerated it; strict clients would have dropped. Fixed +
replaced the s! ping-pong with a single growing string + one
toUTF8. Applies to all three backends.
* bench_server picks up the backend from Backend.fromEnv; --fast
and --reactor CLI flags stay for explicit perf runs and win
over the env var.
* README claims parity with nginx (measured, not aspirational).
.github/workflows/bench.yml runs on every push to main:
1. Boots bench_server (LEANTEA_HTTP_BACKEND=reactor) and a
matching nginx side-by-side on the same ubuntu-latest runner.
2. Hits both with wrk -t8 -c128 -d15s on /health, /json, /echo.
3. Assembles a customBiggerIsBetter JSON payload that includes
both absolute RPS AND the lean-tea/nginx % ratio per route.
4. Feeds it to benchmark-action/github-action-benchmark, which
appends to bench-data/http-bench.json and flags anything below
80 % of the previous best.
5. Commits the updated JSON back to main via the action bot.
The absolute RPS on a 4-vCPU runner will always trail M-series
numbers; the parity ratio is what to trend, since both servers
run on the same runner in the same job.
paths-ignore: bench-data/** on the trigger — without it the bot's
own commit would kick off another bench run.
fail-on-alert: false for now; flip once ~20 runs establish the
noise floor.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
LeanTea.Net.FastServer(POSIX FFI, SO_REUSEPORT, thread-per-conn) andLeanTea.Net.ReactorServer(kqueue/epoll non-blocking event loop). Unified viaLeanTea.Net.Backend+LEANTEA_HTTP_BACKENDenv-var picker..github/workflows/bench.ymlruns on every push tomain, boots lean-tea + nginx side-by-side on the same runner, records both absolute RPS and the parity ratio throughbenchmark-action/github-action-benchmark, persists tobench-data/http-bench.json.LeanTea.Web.Route— Yesod-style inductive routes soRoute.linkrefuses rawStringhrefs; dead links become compile errors.The measured story
ServerFastServer(SO_REUSEPORT)Full breakdown (three routes × three concurrency levels + saturation runs) in
docs/BENCHMARKS.md.Why the libuv version was slow
Every recv/send hopped through: Lean IO → AsyncTask alloc →
.blocksubmit → libuv epoll → completion callback → Lean scheduler wake → fiber resume. Profiled at 100–500 µs per hop; each request paid 3–5 hops. Matches the measured T=1 ceiling of 1 933 RPS ≈ 500 µs/req.The FFI variants skip that entire path —
recv()andsend()are direct syscalls. The reactor additionally keeps every idle connection at ~100 bytes of C state (no OS thread per fd) so it scales to 10 k+ idle connections without theLEAN_NUM_THREADS >= concurrencysharp edge theFastServerhas.Which backend to use
LeanTea.Net.Backend.fromEnvpicks one fromLEANTEA_HTTP_BACKEND:reactorfast/fast:16FastServerN workerslibuvServer.serveConcurrentApps typically write:
Also in this PR
Response.toBytesbug — used to append a hardcodedconnection: closeeven when the annotator addedconnection: keep-alive, so responses carried both.abwas lenient; strict clients would have dropped. The serializer is now a single growing string + onetoUTF8.bench_serverpicks up the backend fromBackend.fromEnv; the--fast N/--reactorCLI flags stay for explicit perf runs and win over the env var.LeanTea.Web.Route(typed inductive route +Route.linkcompile-time dead-link check). Standalone commit at the bottom of the branch — not on the critical perf path but was blocked in the same session.Test plan
lake buildclean on macOS (176 targets)LEANTEA_HTTP_BACKEND=reactor ./bench_server→ curl /health /json /echo all 200 OKLEANTEA_HTTP_BACKEND=libuv→ sameLEANTEA_HTTP_BACKEND=fast:4→ samewrk -t8 -c128 -d10s http://.../healthon all three backends produces the numbers in the table abovewrk -t16 -c2000 -d10s(63 k RPS, matches nginx at same load)bench.yml— first run will land after merge; watch it produce a populatedbench-data/http-bench.jsonand not infinite-looprequirethis via git; no user-facing API removed but they gain the new imports)