ELares · Jun 14, 2026
@@ -0,0 +1,107 @@
+# Design: io_uring net fast path with registered buffers and multishot ops
+
+Issue: #28. Decisions: ADR-0002 (shared-nothing thread-per-core), ADR-0003
+(determinism / Env seam). Related: #25 (core runtime, RUNTIME.md), #27 (runtime
+abstraction seam), #26 (runtime bake-off), #67 (persistence write path).
+
+## Goal and scope
+
+IronCache's hot path is network I/O: accept, read a request, write a reply,
+repeated across millions of ops per second per node. This spec designs the Linux
+io_uring fast path so that, in steady state, a request costs no heap allocation
+and one `io_uring_enter` call is amortized across a batch. Scope: the per-shard
+ring topology, the registered fixed-buffer slab and buffer groups, multishot
+accept and recv with a one-shot fallback chosen by a startup probe, and how the
+path degrades on older kernels. The cross-platform seam (the epoll/kqueue
+fallback backend) lives in #27; this spec is the Linux datapath behind that
+seam. The low-level binding is the `io-uring` crate [io-uring-crate-version].
+
+## Design
+
+### Ring topology
+
+- One io_uring per shard, pinned to the same core as the shard's reactor
+  (ADR-0002). A per-shard ring avoids cross-core completion routing, cache-line
+  bouncing, and any locking on submission or completion; locks are unnecessary
+  by construction [glommio-locks-never-necessary] [seastar-shared-nothing]. The
+  layout mirrors Dragonfly's helio, which runs over io_uring with an epoll
+  fallback [dragonfly-iouring-helio]. No buffer or completion ever crosses a
+  shard boundary.
+
+### Registered fixed-buffer slab and buffer groups
+
+- At shard init the shard allocates one contiguous slab, registers it with the
+  ring once, and splits it into buffers that form a kernel buffer group
+  identified by a buffer-group id. Registering once removes per-request page
+  pinning/unpinning and malloc from the hot path. On recv completion the
+  kernel returns the buffer id it filled; the shard parses the request in place
+  and returns the buffer to the group on reply completion. No buffer leaves the
+  shard, so there is no synchronization on the buffer pool. The pool is the
+  per-shard owned-buffer pool the runtime seam exposes (#27).
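+
+The buffer lifecycle described above can be sketched in plain Rust with no
+kernel involved. This is a minimal model, not the datapath: `ShardSlab` and its
+methods are hypothetical names, the equal-size split is an assumption, and
+`take` merely stands in for the kernel picking a buffer from the group.
+
```rust
/// Hypothetical per-shard slab: one allocation made at shard init, carved into
/// equal buffers that are addressed by a u16 buffer id.
pub struct ShardSlab {
    mem: Vec<u8>,
    buf_len: usize,
    free: Vec<u16>, // ids currently available for the kernel to fill
}

impl ShardSlab {
    pub fn new(count: u16, buf_len: usize) -> Self {
        ShardSlab {
            mem: vec![0u8; count as usize * buf_len],
            buf_len,
            // Reversed so that ids are handed out in ascending order.
            free: (0..count).rev().collect(),
        }
    }

    /// Stand-in for the kernel selecting a buffer from the group.
    pub fn take(&mut self) -> Option<u16> {
        self.free.pop()
    }

    /// Writable view of one buffer (where received bytes would land).
    pub fn buf_mut(&mut self, id: u16) -> &mut [u8] {
        let start = id as usize * self.buf_len;
        &mut self.mem[start..start + self.buf_len]
    }

    /// Read-only view of the first `len` bytes, for parsing in place.
    pub fn filled(&self, id: u16, len: usize) -> &[u8] {
        let start = id as usize * self.buf_len;
        &self.mem[start..start + len.min(self.buf_len)]
    }

    /// Return a buffer to the group once the reply has completed.
    pub fn give_back(&mut self, id: u16) {
        self.free.push(id);
    }

    /// Number of buffers currently in the group.
    pub fn available(&self) -> usize {
        self.free.len()
    }
}
```
+
+Because the free list is sized once in `new`, taking and returning ids in this
+model does not grow it afterwards, which is the property the slab is meant to
+give the read path.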
+
+### Multishot ops with one-shot fallback, chosen by startup probe
+
+- Accept uses multishot accept where the kernel supports it: one SQE posts a CQE
+  per new connection [io-uring-multishot-accept-kernel] (kernel ~5.19+), cutting
+  SQE churn on the listener. Connection reads use multishot recv with the
+  ring-provided buffer group [io-uring-multishot-recv-kernel] (kernel 6.0+),
+  which removes per-read submission and per-read buffer handoff.
+- A startup feature probe selects the path per ring rather than a compile-time
+  cfg, so one Linux artifact runs across kernel versions: where multishot or
+  provided buffers are absent, the ring falls back to re-armed one-shot accept
+  and one-shot `Read`/`Recv` over owned fixed buffers [io-uring-read-opcode-kernel]
+  (kernel 5.6+). Below 5.6 the io_uring path is not used at all; the #27
+  epoll/kqueue backend serves that host [monoio-min-kernel-fallback].
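+
+A sketch of the tier decision the probe feeds, under stated assumptions: the
+`Probed` flags and `Tier` names are hypothetical, how the flags get populated
+from the ring is left out, and the choice is treated as all-or-nothing per ring
+(the spec could equally select accept and recv independently).
+
```rust
/// Hypothetical outcome of the startup decision for one ring.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Tier {
    /// Multishot accept/recv over a provided buffer group.
    Multishot,
    /// Re-armed one-shot accept and one-shot Read/Recv.
    OneShot,
    /// io_uring unusable on this host: the #27 portable backend serves it.
    PortableFallback,
}

/// Hypothetical capability flags gathered once at startup.
#[derive(Clone, Copy)]
pub struct Probed {
    pub io_uring_usable: bool,
    pub multishot_accept: bool,
    pub multishot_recv: bool,
    pub provided_buffers: bool,
}

/// Pick the richest tier the probed kernel supports.
pub fn select_tier(p: Probed) -> Tier {
    if !p.io_uring_usable {
        return Tier::PortableFallback;
    }
    if p.multishot_accept && p.multishot_recv && p.provided_buffers {
        Tier::Multishot
    } else {
        Tier::OneShot
    }
}
```
+
+Keeping the decision a pure function of probed flags, rather than of a parsed
+kernel version string, makes it easy to unit-test every tier on one host.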
+
+### Pipelining and the shared persistence write path
+
+- Multishot recv batches naturally: one completion can carry several pipelined
+  commands, which the parser drains before the buffer returns to the group, and a
+  batch of replies coalesces into one submission (#25). Persistence (#67) reuses
+  this substrate rather than forking it: the snapshot/tiering writer gets its own
+  registered write buffers from the same per-shard slab and submits fixed-buffer
+  writes on the same per-shard ring, so durability shares the fast path.
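+
+To illustrate draining several pipelined commands from one completion, the
+sketch below walks a filled buffer in place. It assumes CRLF-terminated frames
+purely for brevity; the real wire framing is not specified here, and
+`drain_pipelined` is a hypothetical helper.
+
```rust
/// Calls `on_cmd` for every complete CRLF-terminated frame in `buf` and
/// returns how many bytes were consumed. A trailing partial frame is left
/// unconsumed so the caller can keep it until more bytes arrive.
pub fn drain_pipelined<'a>(buf: &'a [u8], mut on_cmd: impl FnMut(&'a [u8])) -> usize {
    let mut consumed = 0;
    while let Some(pos) = buf[consumed..].windows(2).position(|w| w == b"\r\n") {
        // The frame borrows from the buffer: no copy, no allocation.
        on_cmd(&buf[consumed..consumed + pos]);
        consumed += pos + 2;
    }
    consumed
}
```
+
+The frames borrow from the buffer, so in this model the buffer can only return
+to the group after the callback has finished with every frame.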
+
+### Resolved open decisions
+
+- Buffer-group sizing under memory pressure: the per-shard slab is a fixed budget
+  set at startup and counted against the shard's maxmemory share, not grown on
+  demand; when the group is drained the shard applies read back-pressure (defers
+  re-arming recv) instead of allocating, so a burst cannot blow the memory bound.
+- SQPOLL vs explicit enter: default to explicit `io_uring_enter` with batched
+  submission, because SQPOLL's kernel poller thread busy-polls a core until its
+  idle timeout expires, even when the shard has nothing to submit; SQPOLL is an
+  opt-in for dedicated-core deployments where that cost is acceptable.
+- Completion-queue overflow when a shard stalls: rely on the kernel CQ overflow
+  list (with `IORING_FEAT_NODROP` the kernel queues overflowed completions
+  instead of dropping them), drain it on the next enter, and pair it with the
+  read back-pressure above so a stalled shard sheds new reads rather than
+  overrunning its CQ; a sustained-overflow counter feeds observability.
+- Minimum kernel for the fast path: the multishot tier where the probe finds it,
+  one-shot io_uring on 5.6+ otherwise, and the #27 epoll/kqueue backend below
+  that. The fast path is always an optimization behind the #27 stable interface.
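+
+The read back-pressure decision above (defer re-arming recv instead of
+allocating) can be modelled as a small gate. `ReadGate`, `RecvAction`, and the
+connection ids are hypothetical, and the counters stand in for real buffer-group
+state; this is a sketch of the policy, not of the ring calls.
+
```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Eq)]
pub enum RecvAction {
    /// Buffers are available: submit recv again.
    Rearm,
    /// Group is drained: park the connection instead of allocating.
    Defer,
}

/// Hypothetical per-shard gate over a fixed buffer budget.
pub struct ReadGate {
    budget: usize,
    free: usize,
    parked: VecDeque<u32>, // connections waiting for a buffer
}

impl ReadGate {
    pub fn new(budget: usize, max_conns: usize) -> Self {
        ReadGate { budget, free: budget, parked: VecDeque::with_capacity(max_conns) }
    }

    /// Models a buffer being consumed by a completed recv.
    pub fn take_buffer(&mut self) -> bool {
        if self.free == 0 {
            return false;
        }
        self.free -= 1;
        true
    }

    /// Called when a connection's recv needs re-arming.
    pub fn on_recv_ended(&mut self, conn: u32) -> RecvAction {
        if self.free == 0 {
            self.parked.push_back(conn);
            RecvAction::Defer
        } else {
            RecvAction::Rearm
        }
    }

    /// Called when a reply completes and its buffer returns to the group.
    /// Yields a parked connection to re-arm, if any. The budget never grows.
    pub fn on_buffer_returned(&mut self) -> Option<u32> {
        self.free = (self.free + 1).min(self.budget);
        self.parked.pop_front()
    }
}
```
+
+Parked connections resume in arrival order here; whether the real shard wants
+FIFO or some fairness policy is not decided by this spec.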
+
+## Open questions
+
+- The exact per-shard slab budget as a fraction of the shard memory share, and
+  how it interacts with eviction (#48) under sustained read pressure.
+- Whether reply-side writes should also use registered fixed buffers in the
+  one-shot fallback tier, or only in the multishot tier.
+
+## Acceptance and test hooks
+
+- The steady-state read path performs zero heap allocations per request.
+- The startup probe selects multishot accept/recv [io-uring-multishot-accept-kernel]
+  [io-uring-multishot-recv-kernel] when present and falls back to one-shot ops
+  [io-uring-read-opcode-kernel] otherwise, verified on two kernel versions.
+- One ring per shard, pinned to the shard core, with no cross-core completion
+  routing and no buffer crossing a shard.
+- The read and persistence (#67) write paths share the same registered slab.
+- A benchmark shows reduced syscalls/op versus the epoll baseline (#26 host
+  sensitivity, runtime-bakeoff.md).
+
+## References
+
+- ADR-0002, ADR-0003; issues #25, #27, #26, #67; docs/design/RUNTIME.md,
+  docs/design/RUNTIME_ABSTRACTION.md, docs/experiments/runtime-bakeoff.md,
+  docs/research/concurrency-runtime-rust.md, docs/research/dragonfly.md.
+- Claims: [io-uring-crate-version], [io-uring-multishot-accept-kernel],
+  [io-uring-multishot-recv-kernel], [io-uring-read-opcode-kernel],
+  [monoio-min-kernel-fallback], [dragonfly-iouring-helio],
+  [glommio-locks-never-necessary], [seastar-shared-nothing].
@@ -114,3 +114,8 @@ Specs added as the M1 milestone progresses.
 - [ADMIN_COMMANDS.md](ADMIN_COMMANDS.md): the admin/introspection command family
   (CLIENT LIST/INFO/KILL/PAUSE/NO-EVICT/NO-TOUCH with byte-faithful-vs-synthesized
   fields, COMMAND DOCS/INFO/COUNT/GETKEYS, RESET semantics) (#150).
+- [RUNTIME_ABSTRACTION.md](RUNTIME_ABSTRACTION.md): the Runtime/IO trait seam
+  (owned buffers, monomorphization, Cargo-feature backend select) keeping
+  monoio/glommio/tokio swappable (#27).
+- [IOURING_DATAPATH.md](IOURING_DATAPATH.md): the Linux io_uring net fast path
+  (per-shard ring, registered fixed buffers, multishot + one-shot fallback) (#28).
@@ -0,0 +1,114 @@
+# Design: Runtime/IO abstraction keeping monoio/glommio/tokio swappable
+
+Issue: #27. Decisions: ADR-0002 (shared-nothing thread-per-core), ADR-0003
+(determinism / Env seam). Related: #25 (core runtime, RUNTIME.md), #26 (runtime
+bake-off), #28 (io_uring fast path), #29 (cross-shard coordinator), #34 (storage
+API).
+
+## Goal and scope
+
+The bake-off (#26) can only decide monoio vs glommio vs tokio+epoll cheaply if
+the cache core is written once against a trait surface and swapping the concrete
+runtime costs a Cargo feature, not a rewrite. This spec fixes that seam: a thin
+`Runtime` trait the command core compiles against, with the concrete runtime
+chosen at the binary's edge and zero indirection on the hot path. Scope: the
+trait surface, the owned-buffer model, the monomorphization and feature-selection
+strategy, and `spawn_on_shard`. This spec does NOT pick the default runtime; the
+#26 bake-off owns that behind this seam (runtime-bakeoff.md). Out of scope: the
+shard-per-core topology itself (#25) and the io_uring opcode/buffer design (#28).
+
+## Design
+
+### The Runtime trait surface
+
+- A minimal trait: `accept`, `recv`, `send`, `timer`, and `spawn_on_shard`, with
+  associated types fixing the listener, stream, and buffer concretely per
+  backend. The surface is deliberately small because the thread-per-core
+  backends produce `!Send` futures (monoio and glommio run thread-local
+  executors); a broad ecosystem-style trait, for example one that assumes
+  `Send` futures, cannot be satisfied by a thread-per-core runtime, while a
+  minimal set is implementable by all three, including tokio.
+  There is no global `spawn`: work pins to its owning core through
+  `spawn_on_shard`, matching shared-nothing (ADR-0002) where locks are
+  unnecessary by construction [glommio-locks-never-necessary]
+  [seastar-shared-nothing].
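+
+One possible shape for such a trait, shown for `recv` only. Everything here is
+an assumption made for illustration: the associated-type names, the
+`(result, buffer)` return convention, the toy `Loopback` backend, and the
+`poll_once` helper that drives an already-ready future without an executor.
+`accept`, `send`, `timer`, and `spawn_on_shard` are omitted.
+
```rust
use std::future::{ready, Future, Ready};
use std::io;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical slice of the seam: associated types fix the concrete
/// stream, buffer, and future per backend. No `Send` bound is required.
pub trait Runtime {
    type Listener;
    type Stream;
    type Buf: AsRef<[u8]>;
    type RecvFut: Future<Output = (io::Result<usize>, Self::Buf)>;

    /// Owned-buffer recv: the buffer moves in and is handed back on completion.
    fn recv(stream: &mut Self::Stream, buf: Self::Buf) -> Self::RecvFut;
}

/// Toy backend used only to exercise the trait.
pub struct Loopback;
pub struct LoopStream {
    pub pending: Vec<u8>,
}

impl Runtime for Loopback {
    type Listener = ();
    type Stream = LoopStream;
    type Buf = Vec<u8>;
    type RecvFut = Ready<(io::Result<usize>, Vec<u8>)>;

    fn recv(stream: &mut LoopStream, mut buf: Vec<u8>) -> Self::RecvFut {
        let n = stream.pending.len().min(buf.len());
        buf[..n].copy_from_slice(&stream.pending[..n]);
        stream.pending.drain(..n);
        ready((Ok(n), buf))
    }
}

/// Core logic written once, generic over the backend (static dispatch).
pub async fn read_once<R: Runtime>(
    stream: &mut R::Stream,
    buf: R::Buf,
) -> io::Result<(usize, R::Buf)> {
    let (res, buf) = R::recv(stream, buf).await;
    res.map(|n| (n, buf))
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future exactly once; returns `None` if it was not ready.
pub fn poll_once<F: Future>(fut: F) -> Option<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(v) => Some(v),
        Poll::Pending => None,
    }
}
```
+
+`read_once::<Loopback>` is resolved at compile time, so the call to `recv`
+involves no trait object; a real backend would swap in its own stream and
+buffer types behind the same signature.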
+
+### Owned-buffer model
+
+- All I/O is owned-buffer (`IoBuf` / `IoBufMut`), never borrowed `&mut [u8]`.
+  io_uring's completion model requires the buffer to stay valid until the
+  kernel completes the operation, so owned buffers are the only model the
+  io_uring backends can satisfy; tokio's
+  readiness model [tokio-workstealing-readiness-model] adapts by copying into an
+  owned buffer. That copy is paid only on the portable fallback, never on the
+  io_uring fast path (#28), so the seam costs the fast path nothing. The pool
+  itself is per-shard and exposed through the trait (resolved below) so the
+  io_uring datapath owns registration (#28).
+
+### Monomorphization, not dyn dispatch
+
+- The core is generic over the `Runtime` trait, so the request loop monomorphizes
+  with no vtable: there is no `dyn Runtime` on the hot path. Trait objects are
+  rejected because they add dynamic dispatch to every `recv`/`send`. The request
+  loop carries no `cfg`; backend differences live entirely behind the associated
+  types.
+
+### Backend selection by Cargo feature
+
+- Exactly one backend is active per build, chosen by `--features` with one of
+  `monoio`, `glommio`, or `tokio`. Per-build selection keeps each binary
+  monomorphized and lean; a runtime flag would link all three and reintroduce
+  dispatch. This makes #26 a
+  feature-flag swap with no core source change, and lets the win, which is
+  conditional and workload-dependent [monoio-vs-tokio-scaling], be measured per
+  workload rather than assumed.
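+
+A sketch of how compile-time selection might look at the binary's edge. The
+`Backend` trait, the marker structs, and the `Active` alias are hypothetical.
+The fallback to tokio when neither io_uring feature is set is an assumption
+made only so the snippet builds standalone, and a real build would also need
+to reject other invalid feature combinations.
+
```rust
/// Hypothetical marker trait implemented once per backend.
pub trait Backend {
    const NAME: &'static str;
}

pub struct MonoioBackend;
pub struct GlommioBackend;
pub struct TokioBackend;

impl Backend for MonoioBackend {
    const NAME: &'static str = "monoio";
}
impl Backend for GlommioBackend {
    const NAME: &'static str = "glommio";
}
impl Backend for TokioBackend {
    const NAME: &'static str = "tokio";
}

// Exactly one alias survives cfg evaluation, so the core only ever sees
// one concrete type.
#[cfg(feature = "monoio")]
pub type Active = MonoioBackend;
#[cfg(feature = "glommio")]
pub type Active = GlommioBackend;
// Assumption for this sketch only: default to tokio when no io_uring
// backend feature is enabled.
#[cfg(not(any(feature = "monoio", feature = "glommio")))]
pub type Active = TokioBackend;

#[cfg(all(feature = "monoio", feature = "glommio"))]
compile_error!("enable exactly one backend feature");

/// The request loop would be instantiated with `Active` here, at the edge.
pub fn active_backend_name() -> &'static str {
    <Active as Backend>::NAME
}
```
+
+All `cfg` lives next to the alias, which is what lets the request loop itself
+stay free of `cfg`.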
+
+### Resolved open decisions
+
+- Single-binary story: ship per-backend builds, a Linux-io_uring build and a
+  portable epoll/kqueue build, rather than one fat binary. Compile-time backend
+  selection is incompatible with a single static artifact. The split is by
+  runtime backend, not by architecture or kernel version: each build is still
+  one binary per architecture (preserving CLI_BINARY.md's promise), and the
+  io_uring build stays a single Linux artifact that spans kernel tiers via the
+  #28 startup probe rather than forking per kernel. This scopes RUNTIME.md's
+  "single binary runs everywhere" to per-backend, while keeping the fast path an
+  optimization behind the stable interface.
+- Timer abstraction: the trait's `timer` is backed by a per-shard timer wheel as
+  the canonical timer (it is what TTL #51 and connection reaping already need),
+  and the backend's native timer is used only to arm the wheel's next deadline.
+  This keeps timer semantics identical across backends and keeps the Env seam
+  (ADR-0003) the single source of time for deterministic replay.
+- tokio+epoll is a first-class release target, not dev/portability only, because
+  it is the only backend that runs on kernels without io_uring (the Compatible
+  tenet) and on macOS/BSD via kqueue [monoio-min-kernel-fallback].
+- Owned-buffer pool ownership: per-shard and exposed through the trait, not
+  backend-internal, so the io_uring backend registers it once (#28) and the
+  tokio adapter allocates an equivalent per-shard pool; the core sees one model.
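+
+The timer decision above (a canonical per-shard timer, with the backend's
+native timer only arming the next deadline) can be sketched as follows. For
+brevity this uses a binary min-heap rather than an actual timer wheel; the
+`ShardTimers` name, the `u64` tick type, and the `u32` timer ids are
+assumptions. Time is always passed in, never read from a clock, in keeping
+with the Env seam.
+
```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical per-shard timer store keyed by (deadline tick, timer id).
/// A min-heap stands in for the timer wheel in this sketch.
pub struct ShardTimers {
    heap: BinaryHeap<Reverse<(u64, u32)>>,
}

impl ShardTimers {
    pub fn new() -> Self {
        ShardTimers { heap: BinaryHeap::new() }
    }

    pub fn schedule(&mut self, deadline: u64, id: u32) {
        self.heap.push(Reverse((deadline, id)));
    }

    /// The only value the backend's native timer needs: when to wake next.
    pub fn next_deadline(&self) -> Option<u64> {
        self.heap.peek().map(|r| (r.0).0)
    }

    /// Pops every timer due at or before `now`, earliest first.
    pub fn expire(&mut self, now: u64) -> Vec<u32> {
        let mut fired = Vec::new();
        while let Some(&Reverse((deadline, id))) = self.heap.peek() {
            if deadline > now {
                break;
            }
            self.heap.pop();
            fired.push(id);
        }
        fired
    }
}
```
+
+A backend would arm its native timer with `next_deadline()` after each
+`schedule` or `expire`, so expiry order depends only on the ticks supplied.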
+
+## Open questions
+
+- Whether `spawn_on_shard` needs a bounded mailbox depth surfaced to the core, or
+  whether the cross-shard coordinator (#29) owns all back-pressure.
+- The workspace MSRV is the max across active backends (glommio 1.70
+  [glommio-version-msrv], tokio 1.71 [tokio-version-msrv]; monoio's MSRV is
+  unpinned [monoio-version]); confirm monoio's floor and whether it raises the
+  portable build's MSRV above tokio's.
+
+## Acceptance and test hooks
+
+- The `Runtime` trait defines `accept`, `recv`, `send`, `timer`, and
+  `spawn_on_shard` with associated listener/stream/buffer types.
+- The command core compiles against monoio, glommio, and tokio with no `cfg` in
+  the request loop and no `dyn Runtime` in generated code (inspected).
+- The tokio backend runs over epoll on a pre-5.6 Linux kernel and over kqueue
+  on macOS [monoio-min-kernel-fallback].
+- Exactly one backend is active per build, selected by a Cargo feature; #26 swaps
+  backends by feature flag only.
+
+## References
+
+- ADR-0002, ADR-0003; issues #25, #26, #28, #29, #34; docs/design/RUNTIME.md,
+  docs/design/CLI_BINARY.md, docs/experiments/runtime-bakeoff.md,
+  docs/research/concurrency-runtime-rust.md.
+- Claims: [monoio-version], [glommio-version-msrv], [tokio-version-msrv],
+  [tokio-workstealing-readiness-model], [monoio-vs-tokio-scaling],
+  [glommio-locks-never-necessary], [seastar-shared-nothing],
+  [monoio-min-kernel-fallback], [io-uring-read-opcode-kernel].