Merged
107 changes: 107 additions & 0 deletions docs/design/IOURING_DATAPATH.md
@@ -0,0 +1,107 @@
# Design: io_uring net fast path with registered buffers and multishot ops

Issue: #28. Decisions: ADR-0002 (shared-nothing thread-per-core), ADR-0003
(determinism / Env seam). Related: #25 (core runtime, RUNTIME.md), #27 (runtime
abstraction seam), #26 (runtime bake-off), #67 (persistence write path).

## Goal and scope

IronCache's hot path is network I/O: accept, read a request, write a reply,
repeated across millions of ops per second per node. This spec designs the Linux
io_uring fast path so the steady state pays zero per-request heap allocation and
one `io_uring_enter` amortized across a batch. Scope: the per-shard ring
topology, the registered fixed-buffer slab and buffer groups, multishot accept
and recv with a one-shot fallback chosen by a startup probe, and how the path
degrades on older kernels. The cross-platform seam (the epoll/kqueue fallback
backend) lives in #27; this spec is the Linux datapath behind that seam. The
low-level binding is the `io-uring` crate [io-uring-crate-version].

## Design

### Ring topology

- One io_uring per shard, pinned to the same core as the shard's reactor
(ADR-0002). A per-shard ring avoids cross-core completion routing, cache-line
bouncing, and any locking on submission or completion; locks are unnecessary
by construction [glommio-locks-never-necessary] [seastar-shared-nothing]. This
matches Dragonfly's helio over io_uring with an epoll fallback
[dragonfly-iouring-helio]. No buffer or completion ever crosses a shard.

### Registered fixed-buffer slab and buffer groups

- At shard init the shard allocates one contiguous slab, registers it with the
ring once, and splits it into buffers that form a kernel buffer group
identified by a buffer-group id.
Registration removes per-request pin/unpin and malloc. On recv completion the
kernel returns the buffer id it filled; the shard parses the request in place
and returns the buffer to the group on reply completion. No buffer leaves the
shard, so there is no synchronization on the buffer pool. The pool is the
per-shard owned-buffer pool the runtime seam exposes (#27).
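
The ownership cycle described above (take a buffer, parse in place, return it on reply completion) can be sketched without any kernel involvement. This is a minimal illustration only: `BufferSlab`, its methods, and the equal-size split are hypothetical, and `acquire` stands in for the kernel picking a buffer from the group, which the real path does on its own.

```rust
/// Hypothetical per-shard slab: one contiguous allocation split into
/// equal-size buffers addressed by a small integer id (an assumption;
/// the spec does not fix the buffer size or the id width).
struct BufferSlab {
    slab: Box<[u8]>,
    buf_len: usize,
    free: Vec<u16>, // ids currently available to the group
}

impl BufferSlab {
    /// Allocate the whole budget once, at shard init.
    fn new(buf_count: u16, buf_len: usize) -> Self {
        BufferSlab {
            slab: vec![0u8; buf_count as usize * buf_len].into_boxed_slice(),
            buf_len,
            free: (0..buf_count).rev().collect(),
        }
    }

    /// Stand-in for the kernel selecting a buffer; None when the group is drained.
    fn acquire(&mut self) -> Option<u16> {
        self.free.pop()
    }

    /// Return a buffer to the group once the reply has completed.
    fn release(&mut self, id: u16) {
        debug_assert!(!self.free.contains(&id), "double release");
        self.free.push(id);
    }

    /// The first `len` bytes of buffer `id`, for in-place parsing.
    fn bytes(&self, id: u16, len: usize) -> &[u8] {
        let start = id as usize * self.buf_len;
        &self.slab[start..start + len.min(self.buf_len)]
    }

    /// The whole of buffer `id`, writable (what a recv would fill).
    fn bytes_mut(&mut self, id: u16) -> &mut [u8] {
        let start = id as usize * self.buf_len;
        &mut self.slab[start..start + self.buf_len]
    }
}
```

Because the slab is owned by a single shard, none of these methods need atomics or locks; the free list is a plain `Vec`.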

### Multishot ops with one-shot fallback, chosen by startup probe

- Accept uses multishot accept where the kernel supports it: one SQE posts a CQE
per new connection [io-uring-multishot-accept-kernel] (kernel 5.19+), cutting
SQE churn on the listener. Connection reads use multishot recv with the
ring-provided buffer group [io-uring-multishot-recv-kernel] (kernel 6.0+),
which removes per-read submission and per-read buffer handoff.
- A startup feature probe selects the path per ring rather than a compile-time
cfg, so one Linux artifact runs across kernel versions: where multishot or
provided buffers are absent, the ring falls back to re-armed one-shot accept
and one-shot `Read`/`Recv` over owned fixed buffers [io-uring-read-opcode-kernel]
(kernel 5.6+). Below 5.6 the io_uring path is not used at all; the #27
epoll/kqueue backend serves that host [monoio-min-kernel-fallback].
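
The tier choice above reduces to a small pure function over whatever the probe reports. The sketch below is illustrative only: `ProbeResult`, its fields, and `DatapathTier` are hypothetical names, and it treats accept and recv as one combined tier, whereas the real probe might select them independently.

```rust
/// What the startup probe found on this host (hypothetical shape).
#[derive(Clone, Copy, Debug)]
struct ProbeResult {
    io_uring_available: bool, // ring setup succeeded at all
    one_shot_ops: bool,       // one-shot Read/Recv opcodes supported
    multishot_accept: bool,
    multishot_recv: bool,
    provided_buffers: bool,
}

/// The three tiers the spec describes, highest first.
#[derive(Debug, PartialEq, Eq)]
enum DatapathTier {
    Multishot,
    OneShot,
    PortableFallback, // the #27 epoll/kqueue backend
}

/// Pick the best tier the host supports. Decided at startup, not by cfg.
fn select_tier(p: ProbeResult) -> DatapathTier {
    if !p.io_uring_available || !p.one_shot_ops {
        return DatapathTier::PortableFallback;
    }
    if p.multishot_accept && p.multishot_recv && p.provided_buffers {
        DatapathTier::Multishot
    } else {
        DatapathTier::OneShot
    }
}
```

Keeping the decision in a pure function makes it testable on any host by feeding it synthetic probe results, independent of the kernel the test runs on.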

### Pipelining and the shared persistence write path

- Multishot recv batches naturally: one completion can carry several pipelined
commands, which the parser drains before the buffer returns to the group, and a
batch of replies coalesces into one submission (#25). Persistence (#67) reuses
this substrate rather than forking it: the snapshot/tiering writer gets its own
registered write buffers from the same per-shard slab and submits fixed-buffer
writes on the same per-shard ring, so durability shares the fast path.
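
Draining several pipelined commands from one completion might look like the sketch below. It assumes a simple `\r\n`-terminated framing as a stand-in for the real wire protocol, which this spec does not define; `drain_pipelined` is a hypothetical helper.

```rust
/// Drain every complete `\r\n`-terminated frame from `buf`.
/// Returns the frames (borrowed from `buf`, so parsed in place) and the
/// number of bytes consumed; a trailing partial frame is left unconsumed.
fn drain_pipelined(buf: &[u8]) -> (Vec<&[u8]>, usize) {
    let mut frames = Vec::new();
    let mut consumed = 0;
    // Look for the next terminator in the unconsumed tail.
    while let Some(pos) = buf[consumed..].windows(2).position(|w| w == b"\r\n") {
        frames.push(&buf[consumed..consumed + pos]);
        consumed += pos + 2; // skip the frame and its terminator
    }
    (frames, consumed)
}
```

The returned frames borrow from the receive buffer, so the buffer can only go back to the group after the parser is done with them, which matches the ordering the text requires.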

### Resolved open decisions

- Buffer-group sizing under memory pressure: the per-shard slab is a fixed budget
set at startup and counted against the shard's maxmemory share, not grown on
demand; when the group is drained the shard applies read back-pressure (defers
re-arming recv) instead of allocating, so a burst cannot blow the memory bound.
- SQPOLL vs explicit enter: default to explicit `io_uring_enter` with batched
submission, because SQPOLL burns a kernel poller core at idle; SQPOLL is an
opt-in for dedicated-core deployments where that idle cost is acceptable.
- Completion-queue overflow when a shard stalls: rely on the kernel CQ overflow
list (no completions are lost), drain it on the next enter, and pair it with
the read back-pressure above so a stalled shard sheds new reads rather than
overrunning its CQ; a sustained-overflow counter feeds observability.
- Minimum kernel for the fast path: the multishot tier where the probe finds it,
one-shot io_uring on 5.6+ otherwise, and the #27 epoll/kqueue backend below
that. The fast path is always an optimization behind the #27 stable interface.
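
The read back-pressure rule (defer re-arming recv instead of allocating) can be sketched as simple accounting. This is a deliberately simplified model: `ReadGate` is hypothetical, it reserves one buffer per armed read, and the `armed` vector stands in for SQEs that would actually be submitted.

```rust
use std::collections::VecDeque;

/// Hypothetical gate in front of recv re-arming for one shard.
struct ReadGate {
    free_buffers: usize,     // remaining fixed budget, never grown
    deferred: VecDeque<u32>, // connection ids waiting for a buffer
    armed: Vec<u32>,         // stand-in for submitted recv SQEs
}

impl ReadGate {
    fn new(budget: usize) -> Self {
        ReadGate { free_buffers: budget, deferred: VecDeque::new(), armed: Vec::new() }
    }

    /// A connection wants its recv re-armed.
    fn request_rearm(&mut self, conn: u32) {
        if self.free_buffers > 0 {
            self.free_buffers -= 1;
            self.armed.push(conn);
        } else {
            // Group drained: apply back-pressure rather than allocate.
            self.deferred.push_back(conn);
        }
    }

    /// A reply completed and its buffer came back to the group.
    fn buffer_returned(&mut self) {
        match self.deferred.pop_front() {
            // Hand the buffer straight to the longest-waiting connection.
            Some(conn) => self.armed.push(conn),
            None => self.free_buffers += 1,
        }
    }
}
```

Under this model a burst larger than the budget queues in `deferred` and memory use stays at the startup bound.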

## Open questions

- The exact per-shard slab budget as a fraction of the shard memory share, and
how it interacts with eviction (#48) under sustained read pressure.
- Whether reply-side writes should also use registered fixed buffers in the
one-shot fallback tier, or only in the multishot tier.

## Acceptance and test hooks

- The steady-state read path performs zero heap allocations per request.
- The startup probe selects multishot accept/recv [io-uring-multishot-accept-kernel]
[io-uring-multishot-recv-kernel] when present and falls back to one-shot ops
[io-uring-read-opcode-kernel] otherwise, verified on two kernel versions.
- One ring per shard, pinned to the shard core, with no cross-core completion
routing and no buffer crossing a shard.
- The read and persistence (#67) write paths share the same registered slab.
- A benchmark shows reduced syscalls/op versus the epoll baseline (#26 host
sensitivity, runtime-bakeoff.md).

## References

- ADR-0002, ADR-0003; issues #25, #27, #26, #67; docs/design/RUNTIME.md,
docs/design/RUNTIME_ABSTRACTION.md, docs/experiments/runtime-bakeoff.md,
docs/research/concurrency-runtime-rust.md, docs/research/dragonfly.md.
- Claims: [io-uring-crate-version], [io-uring-multishot-accept-kernel],
[io-uring-multishot-recv-kernel], [io-uring-read-opcode-kernel],
[monoio-min-kernel-fallback], [dragonfly-iouring-helio],
[glommio-locks-never-necessary], [seastar-shared-nothing].
5 changes: 5 additions & 0 deletions docs/design/README.md
@@ -114,3 +114,8 @@ Specs added as the M1 milestone progresses.
- [ADMIN_COMMANDS.md](ADMIN_COMMANDS.md): the admin/introspection command family
(CLIENT LIST/INFO/KILL/PAUSE/NO-EVICT/NO-TOUCH with byte-faithful-vs-synthesized
fields, COMMAND DOCS/INFO/COUNT/GETKEYS, RESET semantics) (#150).
- [RUNTIME_ABSTRACTION.md](RUNTIME_ABSTRACTION.md): the Runtime/IO trait seam
(owned buffers, monomorphization, Cargo-feature backend select) keeping
monoio/glommio/tokio swappable (#27).
- [IOURING_DATAPATH.md](IOURING_DATAPATH.md): the Linux io_uring net fast path
(per-shard ring, registered fixed buffers, multishot + one-shot fallback) (#28).
114 changes: 114 additions & 0 deletions docs/design/RUNTIME_ABSTRACTION.md
@@ -0,0 +1,114 @@
# Design: Runtime/IO abstraction keeping monoio/glommio/tokio swappable

Issue: #27. Decisions: ADR-0002 (shared-nothing thread-per-core), ADR-0003
(determinism / Env seam). Related: #25 (core runtime, RUNTIME.md), #26 (runtime
bake-off), #28 (io_uring fast path), #29 (cross-shard coordinator), #34 (storage
API).

## Goal and scope

The bake-off (#26) can only decide monoio vs glommio vs tokio+epoll cheaply if
the cache core is written once against a trait surface and swapping the concrete
runtime costs a Cargo feature, not a rewrite. This spec fixes that seam: a thin
`Runtime` trait the command core compiles against, with the concrete runtime
chosen at the binary's edge and zero indirection on the hot path. Scope: the
trait surface, the owned-buffer model, the monomorphization and feature-selection
strategy, and `spawn_on_shard`. This spec does NOT pick the default runtime; the
#26 bake-off owns that behind this seam (runtime-bakeoff.md). Out of scope: the
shard-per-core topology itself (#25) and the io_uring opcode/buffer design (#28).

## Design

### The Runtime trait surface

- A minimal trait: `accept`, `recv`, `send`, `timer`, and `spawn_on_shard`, with
associated types fixing the listener, stream, and buffer concretely per
backend. The surface is deliberately small because the thread-per-core
backends produce `!Send` futures (a property of the thread-local executors in
monoio and glommio); a fat ecosystem trait that demands `Send` futures cannot
be satisfied by a thread-per-core
runtime, while a minimal set is implementable by all three including tokio.
There is no global `spawn`: work pins to its owning core through
`spawn_on_shard`, matching shared-nothing (ADR-0002) where locks are
unnecessary by construction [glommio-locks-never-necessary]
[seastar-shared-nothing].
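
One possible shape for this surface is sketched below, together with an in-memory mock and a generic function that compiles against the trait alone. Everything beyond the five method names and the three associated types is an assumption: the signatures, the `Loopback` mock, `echo_once`, and the tiny `block_on` helper are hypothetical. The sketch also uses `async fn` in traits, which needs Rust 1.75 or later and may sit above the workspace MSRV.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::io;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::Duration;

/// Hypothetical seam: five operations, three backend-chosen types.
trait Runtime {
    type Listener;
    type Stream;
    type Buf;

    async fn accept(&self, l: &mut Self::Listener) -> io::Result<Self::Stream>;
    /// Owned-buffer I/O: the buffer moves in and comes back with the result.
    async fn recv(&self, s: &mut Self::Stream, buf: Self::Buf) -> (io::Result<usize>, Self::Buf);
    async fn send(&self, s: &mut Self::Stream, buf: Self::Buf) -> (io::Result<usize>, Self::Buf);
    async fn timer(&self, after: Duration);
    /// The closure crosses to the shard; the future it builds need not be Send.
    fn spawn_on_shard<F, Fut>(&self, shard: usize, make: F)
    where
        F: FnOnce() -> Fut + Send + 'static,
        Fut: Future<Output = ()> + 'static;
}

/// Minimal executor for the mock: polls until ready with a no-op waker.
fn block_on<F: Future>(fut: F) -> F::Output {
    struct Noop;
    impl Wake for Noop {
        fn wake(self: Arc<Self>) {}
    }
    let waker = Waker::from(Arc::new(Noop));
    let mut cx = Context::from_waker(&waker);
    let mut fut = std::pin::pin!(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

/// In-memory mock backend: a "stream" is just a byte queue.
struct Loopback;

impl Runtime for Loopback {
    type Listener = VecDeque<VecDeque<u8>>;
    type Stream = VecDeque<u8>;
    type Buf = Vec<u8>;

    async fn accept(&self, l: &mut Self::Listener) -> io::Result<Self::Stream> {
        l.pop_front().ok_or_else(|| io::Error::from(io::ErrorKind::WouldBlock))
    }
    async fn recv(&self, s: &mut Self::Stream, mut buf: Self::Buf) -> (io::Result<usize>, Self::Buf) {
        buf.clear();
        buf.extend(s.drain(..));
        (Ok(buf.len()), buf)
    }
    async fn send(&self, s: &mut Self::Stream, buf: Self::Buf) -> (io::Result<usize>, Self::Buf) {
        s.extend(buf.iter().copied());
        (Ok(buf.len()), buf)
    }
    async fn timer(&self, _after: Duration) {}
    fn spawn_on_shard<F, Fut>(&self, _shard: usize, make: F)
    where
        F: FnOnce() -> Fut + Send + 'static,
        Fut: Future<Output = ()> + 'static,
    {
        block_on(make()); // single-shard mock: run inline
    }
}

/// Core logic written once against the trait; no backend named here.
async fn echo_once<R: Runtime>(rt: &R, s: &mut R::Stream, buf: R::Buf) -> io::Result<usize> {
    let (n, buf) = rt.recv(s, buf).await;
    n?;
    let (sent, _buf) = rt.send(s, buf).await;
    sent
}
```

`echo_once` is generic over `R`, so each backend gets its own compiled copy with direct calls.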

### Owned-buffer model

- All I/O is owned-buffer (`IoBuf` / `IoBufMut`), never borrowed `&mut [u8]`.
io_uring's completion model requires the buffer to stay valid until the kernel
completes the operation, so owned buffers are the only model the io_uring
backends can satisfy; tokio's
readiness model [tokio-workstealing-readiness-model] adapts by copying into an
owned buffer. That copy is paid only on the portable fallback, never on the
io_uring fast path (#28), so the seam costs the fast path nothing. The pool
itself is per-shard and exposed through the trait (resolved below) so the
io_uring datapath owns registration (#28).
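
The copy paid on the portable path can be illustrated with the standard library alone. In this sketch `std::io::Read` stands in for a readiness-style socket read into a borrowed slice; `recv_owned` and the 4096-byte scratch size are hypothetical.

```rust
use std::io::{self, Read};

/// Owned-buffer recv layered over a borrowed-buffer reader.
/// The buffer moves in and is handed back alongside the result, so the
/// caller sees the same shape a completion-based backend would give it.
fn recv_owned<R: Read>(src: &mut R, mut buf: Vec<u8>) -> (io::Result<usize>, Vec<u8>) {
    // Borrowed scratch space, as a readiness API would fill it.
    let mut scratch = [0u8; 4096];
    let res = src.read(&mut scratch).map(|n| {
        buf.clear();
        // The extra copy: paid only by the portable adapter.
        buf.extend_from_slice(&scratch[..n]);
        n
    });
    (res, buf)
}
```

Returning the buffer even on error lets the caller put it back in the per-shard pool instead of dropping it.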

### Monomorphization, not dyn dispatch

- The core is generic over the `Runtime` trait, so the request loop monomorphizes
with no vtable: there is no `dyn Runtime` on the hot path. Trait objects are
rejected because they add dynamic dispatch to every `recv`/`send`. The request
loop carries no `cfg`; backend differences live entirely behind the associated
types.

### Backend selection by Cargo feature

- Exactly one backend is active per build, chosen by `--features monoio | glommio
| tokio`. Per-build selection keeps each binary monomorphized and lean; a
runtime flag would link all three and reintroduce dispatch. This makes #26 a
feature-flag swap with no core source change, and lets the win, which is
conditional and workload-dependent [monoio-vs-tokio-scaling], be measured per
workload rather than assumed.

### Resolved open decisions

- Single-binary story: ship per-backend builds, a Linux-io_uring build and a
portable epoll/kqueue build, rather than one fat binary. Compile-time backend
selection is incompatible with a single static artifact. The split is by
runtime backend, not by architecture or kernel version: each build is still
one binary per architecture (preserving CLI_BINARY.md's promise), and the
io_uring build stays a single Linux artifact that spans kernel tiers via the
#28 startup probe rather than forking per kernel. This scopes RUNTIME.md's
"single binary runs everywhere" to per-backend, while keeping the fast path an
optimization behind the stable interface.
- Timer abstraction: the trait's `timer` is backed by a per-shard timer wheel as
the canonical timer (it is what TTL #51 and connection reaping already need),
and the backend's native timer is used only to arm the wheel's next deadline.
This keeps timer semantics identical across backends and keeps the Env seam
(ADR-0003) the single source of time for deterministic replay.
- tokio+epoll is a first-class release target, not dev/portability only, because
it is the only backend that runs on kernels without io_uring (the Compatible
tenet) and on macOS/BSD via kqueue [monoio-min-kernel-fallback].
- Owned-buffer pool ownership: per-shard and exposed through the trait, not
backend-internal, so the io_uring backend registers it once (#28) and the
tokio adapter allocates an equivalent per-shard pool; the core sees one model.
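
The split between the canonical timer and the backend's native timer can be sketched as follows. This is a simplified stand-in: an ordered map replaces a real hashed wheel to keep the example short, and `ShardTimers`, its methods, and the millisecond tick are hypothetical. Time is passed in as a plain number, as it would be if it came from the Env seam.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-shard timer store keyed by deadline.
struct ShardTimers {
    // deadline in ms (from the Env clock) -> timer ids due at that instant
    by_deadline: BTreeMap<u64, Vec<u32>>,
}

impl ShardTimers {
    fn new() -> Self {
        ShardTimers { by_deadline: BTreeMap::new() }
    }

    fn insert(&mut self, deadline_ms: u64, id: u32) {
        self.by_deadline.entry(deadline_ms).or_default().push(id);
    }

    /// The only thing the backend's native timer needs: when to wake next.
    fn next_deadline(&self) -> Option<u64> {
        self.by_deadline.keys().next().copied()
    }

    /// Remove and return every timer due at or before `now_ms`, in deadline order.
    fn expire(&mut self, now_ms: u64) -> Vec<u32> {
        // split_off keeps keys < bound in self and returns keys >= bound.
        let later = self.by_deadline.split_off(&now_ms.saturating_add(1));
        let due = std::mem::replace(&mut self.by_deadline, later);
        due.into_values().flatten().collect()
    }
}
```

Because `expire` takes the current time as an argument, a deterministic replay can drive it with recorded timestamps and get the same firing order on every backend.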

## Open questions

- Whether `spawn_on_shard` needs a bounded mailbox depth surfaced to the core, or
whether the cross-shard coordinator (#29) owns all back-pressure.
- The workspace MSRV is the max across active backends (glommio 1.70
[glommio-version-msrv], tokio 1.71 [tokio-version-msrv]; monoio's MSRV is
unpinned [monoio-version]); confirm monoio's floor and whether it raises the
portable build's MSRV above tokio's.

## Acceptance and test hooks

- The `Runtime` trait defines `accept`, `recv`, `send`, `timer`, and
`spawn_on_shard` with associated listener/stream/buffer types.
- The command core compiles against monoio, glommio, and tokio with no `cfg` in
the request loop and no `dyn Runtime` in generated code (inspected).
- The tokio+epoll backend runs on a pre-5.6 Linux kernel and the kqueue backend
on macOS [monoio-min-kernel-fallback].
- Exactly one backend is active per build, selected by a Cargo feature; #26 swaps
backends by feature flag only.

## References

- ADR-0002, ADR-0003; issues #25, #26, #28, #29, #34; docs/design/RUNTIME.md,
docs/design/CLI_BINARY.md, docs/experiments/runtime-bakeoff.md,
docs/research/concurrency-runtime-rust.md.
- Claims: [monoio-version], [glommio-version-msrv], [tokio-version-msrv],
[tokio-workstealing-readiness-model], [monoio-vs-tokio-scaling],
[glommio-locks-never-necessary], [seastar-shared-nothing],
[monoio-min-kernel-fallback], [io-uring-read-opcode-kernel].