diff --git a/docs/design/IOURING_DATAPATH.md b/docs/design/IOURING_DATAPATH.md
new file mode 100644
index 0000000..f751a98
--- /dev/null
+++ b/docs/design/IOURING_DATAPATH.md
@@ -0,0 +1,107 @@
+# Design: io_uring net fast path with registered buffers and multishot ops
+
+Issue: #28. Decisions: ADR-0002 (shared-nothing thread-per-core), ADR-0003
+(determinism / Env seam). Related: #25 (core runtime, RUNTIME.md), #27 (runtime
+abstraction seam), #26 (runtime bake-off), #67 (persistence write path).
+
+## Goal and scope
+
+IronCache's hot path is network I/O: accept, read a request, write a reply,
+repeated across millions of ops per second per node. This spec designs the Linux
+io_uring fast path so the steady state pays zero per-request heap allocation and
+one `io_uring_enter` amortized across a batch. Scope: the per-shard ring
+topology, the registered fixed-buffer slab and buffer groups, multishot accept
+and recv with a one-shot fallback chosen by a startup probe, and how the path
+degrades on older kernels. The cross-platform seam (the epoll/kqueue fallback
+backend) lives in #27; this spec is the Linux datapath behind that seam. The
+low-level binding is the `io-uring` crate [io-uring-crate-version].
+
+## Design
+
+### Ring topology
+
+- One io_uring per shard, pinned to the same core as the shard's reactor
+  (ADR-0002). A per-shard ring avoids cross-core completion routing, cache-line
+  bouncing, and any locking on submission or completion; locks are unnecessary
+  by construction [glommio-locks-never-necessary] [seastar-shared-nothing]. This
+  matches Dragonfly's helio over io_uring with an epoll fallback
+  [dragonfly-iouring-helio]. No buffer or completion ever crosses a shard.
+
+### Registered fixed-buffer slab and buffer groups
+
+- At shard init the shard allocates one contiguous slab, registers it with the
+  ring once, and splits it into a kernel buffer group indexed by buffer-group id.
+  Registration removes per-request pin/unpin and malloc. On recv completion the
+  kernel returns the buffer id it filled; the shard parses the request in place
+  and returns the buffer to the group on reply completion. No buffer leaves the
+  shard, so there is no synchronization on the buffer pool. The pool is the
+  per-shard owned-buffer pool the runtime seam exposes (#27).
+
+### Multishot ops with one-shot fallback, chosen by startup probe
+
+- Accept uses multishot accept where the kernel supports it: one SQE posts a CQE
+  per new connection [io-uring-multishot-accept-kernel] (kernel ~5.19+), cutting
+  SQE churn on the listener. Connection reads use multishot recv with the
+  ring-provided buffer group [io-uring-multishot-recv-kernel] (kernel 6.0+),
+  which removes per-read submission and per-read buffer handoff.
+- A startup feature probe selects the path per ring rather than a compile-time
+  cfg, so one Linux artifact runs across kernel versions: where multishot or
+  provided buffers are absent, the ring falls back to re-armed one-shot accept
+  and one-shot `Read`/`Recv` over owned fixed buffers [io-uring-read-opcode-kernel]
+  (kernel 5.6+). Below 5.6 the io_uring path is not used at all; the #27
+  epoll/kqueue backend serves that host [monoio-min-kernel-fallback].
+
+### Pipelining and the shared persistence write path
+
+- Multishot recv batches naturally: one completion can carry several pipelined
+  commands, which the parser drains before the buffer returns to the group, and a
+  batch of replies coalesces into one submission (#25). Persistence (#67) reuses
+  this substrate rather than forking it: the snapshot/tiering writer gets its own
+  registered write buffers from the same per-shard slab and submits fixed-buffer
+  writes on the same per-shard ring, so durability shares the fast path.
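
Reviewer sketch (not part of the patch): the tier choice described under "Multishot ops with one-shot fallback" can be pictured as a pure function from probe results to a datapath tier. Everything named here (`ProbeResult`, `DatapathTier`, `select_tier`) is hypothetical and does not come from the spec or from the `io-uring` crate. The sketch also assumes the fallback is all-or-nothing per ring, which is one reading of the text rather than a confirmed design.

```rust
/// Capabilities a startup probe might report for one ring.
/// Field names are illustrative assumptions, not the project's API.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ProbeResult {
    /// One-shot Read/Recv opcodes usable (kernel 5.6+ per the spec).
    pub io_uring_available: bool,
    /// Kernel-provided buffer groups usable.
    pub provided_buffers: bool,
    /// Multishot accept usable (kernel ~5.19+ per the spec).
    pub multishot_accept: bool,
    /// Multishot recv usable (kernel 6.0+ per the spec).
    pub multishot_recv: bool,
}

/// The three tiers the spec describes, from fastest to most portable.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DatapathTier {
    /// Multishot accept and recv over the provided buffer group.
    Multishot,
    /// Re-armed one-shot accept and one-shot Read/Recv over owned fixed buffers.
    OneShot,
    /// io_uring is not used; the #27 epoll/kqueue backend serves the host.
    PortableBackend,
}

/// Picks a tier for one ring from its probe result.
/// Assumption: any missing multishot prerequisite drops the whole ring
/// to the one-shot tier instead of mixing tiers per opcode.
pub fn select_tier(p: ProbeResult) -> DatapathTier {
    if !p.io_uring_available {
        return DatapathTier::PortableBackend;
    }
    if p.provided_buffers && p.multishot_accept && p.multishot_recv {
        DatapathTier::Multishot
    } else {
        DatapathTier::OneShot
    }
}
```

Because the choice is made from runtime data and not a `cfg`, the same Linux artifact can land in different tiers on different hosts, which is the property the spec asks for.
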
+
+### Resolved open decisions
+
+- Buffer-group sizing under memory pressure: the per-shard slab is a fixed budget
+  set at startup and counted against the shard's maxmemory share, not grown on
+  demand; when the group is drained the shard applies read back-pressure (defers
+  re-arming recv) instead of allocating, so a burst cannot blow the memory bound.
+- SQPOLL vs explicit enter: default to explicit `io_uring_enter` with batched
+  submission, because SQPOLL burns a kernel poller core at idle; SQPOLL is an
+  opt-in for dedicated-core deployments where that idle cost is acceptable.
+- Completion-queue overflow when a shard stalls: rely on the kernel CQ overflow
+  list (no completions are lost), drain it on the next enter, and pair it with
+  the read back-pressure above so a stalled shard sheds new reads rather than
+  overrunning its CQ; a sustained-overflow counter feeds observability.
+- Minimum kernel for the fast path: the multishot tier where the probe finds it,
+  one-shot io_uring on 5.6+ otherwise, and the #27 epoll/kqueue backend below
+  that. The fast path is always an optimization behind the #27 stable interface.
+
+## Open questions
+
+- The exact per-shard slab budget as a fraction of the shard memory share, and
+  how it interacts with eviction (#48) under sustained read pressure.
+- Whether reply-side writes should also use registered fixed buffers in the
+  one-shot fallback tier, or only in the multishot tier.
+
+## Acceptance and test hooks
+
+- The steady-state read path performs zero heap allocations per request.
+- The startup probe selects multishot accept/recv [io-uring-multishot-accept-kernel]
+  [io-uring-multishot-recv-kernel] when present and falls back to one-shot ops
+  [io-uring-read-opcode-kernel] otherwise, verified on two kernel versions.
+- One ring per shard, pinned to the shard core, with no cross-core completion
+  routing and no buffer crossing a shard.
+- The read and persistence (#67) write paths share the same registered slab.
+- A benchmark shows reduced syscalls/op versus the epoll baseline (#26 host
+  sensitivity, runtime-bakeoff.md).
+
+## References
+
+- ADR-0002, ADR-0003; issues #25, #27, #26, #67; docs/design/RUNTIME.md,
+  docs/design/RUNTIME_ABSTRACTION.md, docs/experiments/runtime-bakeoff.md,
+  docs/research/concurrency-runtime-rust.md, docs/research/dragonfly.md.
+- Claims: [io-uring-crate-version], [io-uring-multishot-accept-kernel],
+  [io-uring-multishot-recv-kernel], [io-uring-read-opcode-kernel],
+  [monoio-min-kernel-fallback], [dragonfly-iouring-helio],
+  [glommio-locks-never-necessary], [seastar-shared-nothing].
diff --git a/docs/design/README.md b/docs/design/README.md
index 574b040..62f85be 100644
--- a/docs/design/README.md
+++ b/docs/design/README.md
@@ -114,3 +114,8 @@ Specs added as the M1 milestone progresses.
 - [ADMIN_COMMANDS.md](ADMIN_COMMANDS.md): the admin/introspection command family
   (CLIENT LIST/INFO/KILL/PAUSE/NO-EVICT/NO-TOUCH with byte-faithful-vs-synthesized
   fields, COMMAND DOCS/INFO/COUNT/GETKEYS, RESET semantics) (#150).
+- [RUNTIME_ABSTRACTION.md](RUNTIME_ABSTRACTION.md): the Runtime/IO trait seam
+  (owned buffers, monomorphization, Cargo-feature backend select) keeping
+  monoio/glommio/tokio swappable (#27).
+- [IOURING_DATAPATH.md](IOURING_DATAPATH.md): the Linux io_uring net fast path
+  (per-shard ring, registered fixed buffers, multishot + one-shot fallback) (#28).
diff --git a/docs/design/RUNTIME_ABSTRACTION.md b/docs/design/RUNTIME_ABSTRACTION.md
new file mode 100644
index 0000000..20c1dec
--- /dev/null
+++ b/docs/design/RUNTIME_ABSTRACTION.md
@@ -0,0 +1,114 @@
+# Design: Runtime/IO abstraction keeping monoio/glommio/tokio swappable
+
+Issue: #27. Decisions: ADR-0002 (shared-nothing thread-per-core), ADR-0003
+(determinism / Env seam). Related: #25 (core runtime, RUNTIME.md), #26 (runtime
+bake-off), #28 (io_uring fast path), #29 (cross-shard coordinator), #34 (storage
+API).
+
+## Goal and scope
+
+The bake-off (#26) can only decide monoio vs glommio vs tokio+epoll cheaply if
+the cache core is written once against a trait surface and swapping the concrete
+runtime costs a Cargo feature, not a rewrite. This spec fixes that seam: a thin
+`Runtime` trait the command core compiles against, with the concrete runtime
+chosen at the binary's edge and zero indirection on the hot path. Scope: the
+trait surface, the owned-buffer model, the monomorphization and feature-selection
+strategy, and `spawn_on_shard`. This spec does NOT pick the default runtime; the
+#26 bake-off owns that behind this seam (runtime-bakeoff.md). Out of scope: the
+shard-per-core topology itself (#25) and the io_uring opcode/buffer design (#28).
+
+## Design
+
+### The Runtime trait surface
+
+- A minimal trait: `accept`, `recv`, `send`, `timer`, and `spawn_on_shard`, with
+  associated types fixing the listener, stream, and buffer concretely per
+  backend. The surface is deliberately small because the thread-per-core
+  backends produce `!Send` futures (a completion-model property of monoio and
+  glommio); a fat ecosystem trait cannot be satisfied by a thread-per-core
+  runtime, while a minimal set is implementable by all three including tokio.
+  There is no global `spawn`: work pins to its owning core through
+  `spawn_on_shard`, matching shared-nothing (ADR-0002) where locks are
+  unnecessary by construction [glommio-locks-never-necessary]
+  [seastar-shared-nothing].
+
+### Owned-buffer model
+
+- All I/O is owned-buffer (`IoBuf` / `IoBufMut`), never borrowed `&mut [u8]`.
+  io_uring's completion model requires the buffer to stay valid until the
+  kernel completes the operation, so owned buffers are the only model the
+  io_uring backends can satisfy; tokio's readiness model
+  [tokio-workstealing-readiness-model] adapts by copying into an owned buffer.
+  That copy is paid only on the portable fallback, never on the io_uring fast
+  path (#28), so the seam costs the fast path nothing. The pool itself is
+  per-shard and exposed through the trait (resolved below) so the io_uring
+  datapath owns registration (#28).
+
+### Monomorphization, not dyn dispatch
+
+- The core is generic over the `Runtime` trait, so the request loop monomorphizes
+  with no vtable: there is no `dyn Runtime` on the hot path. Trait objects are
+  rejected because they add dynamic dispatch to every `recv`/`send`. The request
+  loop carries no `cfg`; backend differences live entirely behind the associated
+  types.
+
+### Backend selection by Cargo feature
+
+- Exactly one backend is active per build, chosen by `--features monoio | glommio
+  | tokio`. Per-build selection keeps each binary monomorphized and lean; a
+  runtime flag would link all three and reintroduce dispatch. This makes #26 a
+  feature-flag swap with no core source change, and lets the win, which is
+  conditional and workload-dependent [monoio-vs-tokio-scaling], be measured per
+  workload rather than assumed.
+
+### Resolved open decisions
+
+- Single-binary story: ship per-backend builds, a Linux-io_uring build and a
+  portable epoll/kqueue build, rather than one fat binary. Compile-time backend
+  selection is incompatible with a single static artifact. The split is by
+  runtime backend, not by architecture or kernel version: each build is still
+  one binary per architecture (preserving CLI_BINARY.md's promise), and the
+  io_uring build stays a single Linux artifact that spans kernel tiers via the
+  #28 startup probe rather than forking per kernel. This scopes RUNTIME.md's
+  "single binary runs everywhere" to per-backend, while keeping the fast path an
+  optimization behind the stable interface.
+- Timer abstraction: the trait's `timer` is backed by a per-shard timer wheel as
+  the canonical timer (it is what TTL #51 and connection reaping already need),
+  and the backend's native timer is used only to arm the wheel's next deadline.
+  This keeps timer semantics identical across backends and keeps the Env seam
+  (ADR-0003) the single source of time for deterministic replay.
+- tokio+epoll is a first-class release target, not dev/portability only, because
+  it is the only backend that runs on kernels without io_uring (the Compatible
+  tenet) and on macOS/BSD via kqueue [monoio-min-kernel-fallback].
+- Owned-buffer pool ownership: per-shard and exposed through the trait, not
+  backend-internal, so the io_uring backend registers it once (#28) and the
+  tokio adapter allocates an equivalent per-shard pool; the core sees one model.
+
+## Open questions
+
+- Whether `spawn_on_shard` needs a bounded mailbox depth surfaced to the core, or
+  whether the cross-shard coordinator (#29) owns all back-pressure.
+- The workspace MSRV is the max across active backends (glommio 1.70
+  [glommio-version-msrv], tokio 1.71 [tokio-version-msrv]; monoio's MSRV is
+  unpinned [monoio-version]); confirm monoio's floor and whether it raises the
+  portable build's MSRV above tokio's.
+
+## Acceptance and test hooks
+
+- The `Runtime` trait defines `accept`, `recv`, `send`, `timer`, and
+  `spawn_on_shard` with associated listener/stream/buffer types.
+- The command core compiles against monoio, glommio, and tokio with no `cfg` in
+  the request loop and no `dyn Runtime` in generated code (inspected).
+- The tokio+epoll backend runs on a pre-5.6 Linux kernel and the kqueue backend
+  on macOS [monoio-min-kernel-fallback].
+- Exactly one backend is active per build, selected by a Cargo feature; #26 swaps
+  backends by feature flag only.
+
+## References
+
+- ADR-0002, ADR-0003; issues #25, #26, #28, #29, #34; docs/design/RUNTIME.md,
+  docs/design/CLI_BINARY.md, docs/experiments/runtime-bakeoff.md,
+  docs/research/concurrency-runtime-rust.md.
+- Claims: [monoio-version], [glommio-version-msrv], [tokio-version-msrv],
+  [tokio-workstealing-readiness-model], [monoio-vs-tokio-scaling],
+  [glommio-locks-never-necessary], [seastar-shared-nothing],
+  [monoio-min-kernel-fallback], [io-uring-read-opcode-kernel].
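
Reviewer sketch (not part of the patch): a cut-down, synchronous illustration of the owned-buffer model and the monomorphized core described in RUNTIME_ABSTRACTION.md. It is not the proposed trait. The spec's surface also has `accept`, `timer`, and `spawn_on_shard`, and would presumably be async; the `len` parameter on `send` is likewise an assumption. `MemRuntime`, `MemStream`, and `echo_once` are hypothetical names used only to make the example self-contained.

```rust
/// Illustrative subset of a `Runtime`-style trait. Buffers are passed by
/// value and handed back with the result, so the backend can hold them
/// for as long as an operation is in flight.
pub trait Runtime {
    type Stream;
    type Buf;

    /// Reads into `buf`; returns the byte count and gives the buffer back.
    fn recv(&mut self, stream: &mut Self::Stream, buf: Self::Buf) -> (usize, Self::Buf);

    /// Writes the first `len` bytes of `buf`; returns the byte count and the buffer.
    fn send(&mut self, stream: &mut Self::Stream, buf: Self::Buf, len: usize) -> (usize, Self::Buf);
}

/// Hypothetical in-memory stream standing in for a socket.
pub struct MemStream {
    pub inbound: Vec<u8>,
    pub outbound: Vec<u8>,
}

/// Hypothetical in-memory backend, used only to exercise the trait.
pub struct MemRuntime;

impl Runtime for MemRuntime {
    type Stream = MemStream;
    type Buf = Vec<u8>;

    fn recv(&mut self, stream: &mut MemStream, mut buf: Vec<u8>) -> (usize, Vec<u8>) {
        // Copy as many pending bytes as fit, then consume them from the stream.
        let n = stream.inbound.len().min(buf.len());
        buf[..n].copy_from_slice(&stream.inbound[..n]);
        stream.inbound.drain(..n);
        (n, buf)
    }

    fn send(&mut self, stream: &mut MemStream, buf: Vec<u8>, len: usize) -> (usize, Vec<u8>) {
        let n = len.min(buf.len());
        stream.outbound.extend_from_slice(&buf[..n]);
        (n, buf)
    }
}

/// Core logic written once against the trait. The compiler generates one
/// copy per concrete `R`, so no `dyn Runtime` or vtable is involved.
pub fn echo_once<R: Runtime>(rt: &mut R, stream: &mut R::Stream, buf: R::Buf) -> (usize, R::Buf) {
    let (n, buf) = rt.recv(stream, buf);
    rt.send(stream, buf, n)
}
```

Calling `echo_once(&mut MemRuntime, &mut stream, vec![0u8; 16])` moves the buffer into the call and gets it back afterwards, which is the ownership shape a completion-based backend needs. A readiness-based adapter could satisfy the same signature by copying into the owned buffer.
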