krish-reddy · krish-reddy · Apr 24, 2026
diff --git a/docs/design/v1/distributed/l2_adapters/nvmeof_l2_adapter.md b/docs/design/v1/distributed/l2_adapters/nvmeof_l2_adapter.md
@@ -0,0 +1,138 @@
+# NVMeOF L2 Adapter
+
+**Source**: `lmcache/v1/distributed/l2_adapters/nvmeof_l2_adapter.py`
+**Storage plugin counterpart**: `lmcache/v1/storage_backend/plugins/nvmeof_backend.py`
+
+## Overview
+
+`NVMeOFL2Adapter` is an `L2AdapterInterface` implementation for multi-process
+(MP) mode that stores KV cache objects on an NVMeOF-attached block device
+formatted and mounted as a filesystem.  It follows the same event-fd protocol
+as `FSL2Adapter` (three distinct fds: store, lookup, load) and uses
+*aiofiles* for non-blocking filesystem I/O on a background asyncio event loop.
+
+The adapter self-registers as type `"nvmeof"` via
+`register_l2_adapter_type` / `register_l2_adapter_factory` at import time.
+The package `__init__.py` auto-discovers it through `pkgutil`, so no other
+changes are required to make it available.
+
+## Architecture
+
+```
+StoreController / PrefetchController
+        │
+        │  submit_{store,lookup,load}_task()
+        ▼
+NVMeOFL2Adapter
+  ├── _store_efd   ─── signals completed stores
+  ├── _lookup_efd  ─── signals completed lookups
+  ├── _load_efd    ─── signals completed loads
+  │
+  └── background asyncio loop (daemon thread "nvmeof-l2-loop")
+          │
+          ├── _execute_store()   ── aiofiles.open / O_DIRECT write
+          ├── _execute_lookup()  ── aiofiles.os.path.exists
+          └── _execute_load()    ── aiofiles.open / O_DIRECT read
+                      │
+                      ▼
+            /mnt/nvmeof/<encoded_filename>.nvmeof
+```
+
+## Configuration
+
+Passed as a JSON object to `--l2-adapter`:
+
+| Field | Type | Default | Description |
+|---|---|---|---|
+| `type` | str | _(required)_ | Must be `"nvmeof"` |
+| `mount_path` | str | _(required)_ | Filesystem mount point |
+| `transport` | str | `"tcp"` | NVMeOF transport: `rdma`, `tcp`, `fc` |
+| `target_addr` | str | `""` | Target IP/hostname (auto-connect only) |
+| `target_port` | str | `"4420"` | Target port (auto-connect only) |
+| `target_nqn` | str | `""` | Target NQN (auto-connect / auto-disconnect) |
+| `auto_connect` | bool | `false` | Run `nvme connect` at init |
+| `auto_disconnect` | bool | `false` | Run `nvme disconnect` at close |
+| `use_odirect` | bool | `false` | Bypass page cache via `O_DIRECT` |
+
+### Example
+
+```bash
+--l2-adapter '{
+  "type": "nvmeof",
+  "mount_path": "/mnt/nvmeof",
+  "transport": "tcp",
+  "target_addr": "192.168.1.100",
+  "target_port": "4420",
+  "target_nqn": "nqn.2023-01.io.example:nvme-sub",
+  "auto_connect": false,
+  "auto_disconnect": false,
+  "use_odirect": false
+}'
+```
+
+## File Naming
+
+```
+<mount_path>/<safe_model>@0x<kv_rank_hex>@<chunk_hash_hex>[@<cache_salt>].nvmeof
+```
+
+- `/` in `model_name` is replaced with `-SEP-`.
+- `kv_rank` is hex-encoded with `0x` prefix.
+- `chunk_hash` is hex-encoded.
+- `cache_salt` is appended only when non-empty.
+- Extension `.nvmeof` distinguishes files from other adapters (e.g. `.data`
+  from `FSL2Adapter`) allowing coexistence on the same filesystem.
+
+## Event-fd Protocol
+
+Each of the three event fds is distinct (enforced by three separate
+`os.eventfd()` calls):
+
+| Event fd | Signalled by | Read by |
+|---|---|---|
+| `_store_efd` | `_execute_store` on completion | StoreController |
+| `_lookup_efd` | `_execute_lookup` on completion | PrefetchController |
+| `_load_efd` | `_execute_load` on completion | PrefetchController |
+
+Each coroutine records its result in the matching completed-tasks dict while
+holding `_lock`, then writes `1` to the event fd.  The controller drains the
+dict via `pop_completed_store_tasks()` or `query_{lookup,load}_result()`.
+
+## Lifecycle
+
+1. **Init**: create mount dir; optionally `nvme connect`; open three event
+   fds; start background event loop thread.
+2. **Store**: `submit_store_task` → `asyncio.run_coroutine_threadsafe` →
+   `_execute_store` writes each file atomically (tmp → rename) → signal
+   `_store_efd`.
+3. **Lookup**: `submit_lookup_and_lock_task` → `_execute_lookup` checks
+   `aiofiles.os.path.exists` for each key → populate Bitmap → signal
+   `_lookup_efd`.
+4. **Unlock**: no-op (filesystem storage has no eviction race between lookup
+   and load).
+5. **Load**: `submit_load_task` → `_execute_load` reads each file into the
+   caller-provided buffer → populate Bitmap → signal `_load_efd`.
+6. **Delete**: synchronous `Path.unlink(missing_ok=True)` for each key;
+   fires `on_l2_keys_deleted` listener notifications.
+7. **Close**: cancel pending tasks; stop event loop; join thread; close event
+   fds; optionally `nvme disconnect`.
+
+## Thread Safety
+
+- `_lock` protects all three completed-task dicts and `_next_task_id`.
+- All I/O runs on the dedicated event loop thread; task submission is
+  thread-safe via `asyncio.run_coroutine_threadsafe`.
+
+## O_DIRECT
+
+When `use_odirect=true`, buffer sizes must be a multiple of
+`os.statvfs(mount_path).f_bsize`.  Unaligned writes fall back to standard
+buffered I/O and log a warning.  O_DIRECT reads and writes run in the default
+executor (thread pool) via `loop.run_in_executor` so they don't block the
+event loop.
+
+## Persistence
+
+Files are never deleted on normal shutdown — data persists across restarts.
+Use the `delete()` method (called by the L2 eviction controller when
+`"eviction"` is configured in the adapter JSON) to remove stale entries.
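The File Naming and Store steps of the L2 adapter doc above can be sketched as follows. This is a minimal illustration, not the adapter's actual code: `encode_filename` and `atomic_write` are hypothetical helper names, and the sketch assumes `kv_rank` is an `int` and `chunk_hash` is a `bytes` value. It uses plain blocking I/O instead of *aiofiles* to stay self-contained.

```python
import os
import tempfile
from pathlib import Path


def encode_filename(mount_path, model_name, kv_rank, chunk_hash, cache_salt=""):
    """Build <mount_path>/<safe_model>@0x<kv_rank_hex>@<chunk_hash_hex>[@<cache_salt>].nvmeof.

    Assumption: kv_rank is an int and chunk_hash is bytes.
    """
    safe_model = model_name.replace("/", "-SEP-")  # keep the name a single path component
    parts = [safe_model, f"0x{kv_rank:x}", chunk_hash.hex()]
    if cache_salt:  # the salt is appended only when non-empty
        parts.append(cache_salt)
    return Path(mount_path) / ("@".join(parts) + ".nvmeof")


def atomic_write(path, data):
    """Write data to a temp file in the same directory, then rename it into place.

    Readers see either the old file or the complete new one, never a partial write.
    """
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to the device before renaming
        os.replace(tmp_name, path)  # atomic when both paths are on one filesystem
    except BaseException:
        # Best-effort cleanup so a failed store leaves no temp file behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Creating the temp file inside `mount_path` rather than in `/tmp` matters here: a rename is atomic only within a single filesystem, so the temp file has to live on the NVMeOF mount itself.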
diff --git a/docs/design/v1/storage_backend/plugins/nvmeof_backend.md b/docs/design/v1/storage_backend/plugins/nvmeof_backend.md
@@ -0,0 +1,127 @@
+# NVMeOF Storage Plugin Backend
+
+**Source**: `lmcache/v1/storage_backend/plugins/nvmeof_backend.py`
+**L2 counterpart**: `lmcache/v1/distributed/l2_adapters/nvmeof_l2_adapter.py`
+
+## Overview
+
+`NVMeOFBackend` is a `StoragePluginInterface` implementation that stores KV
+cache chunks on an NVMe-over-Fabrics (NVMeOF) attached block device formatted
+and mounted as a filesystem (ext4, XFS, etc.).  Each KV chunk is a separate
+file under `nvmeof.mount_path`.
+
+The backend fits into the single-process storage stack via
+`storage_plugin_launcher()` in `lmcache/v1/storage_backend/__init__.py`.  For
+multi-process (MP) mode, see the companion L2 adapter.
+
+## Architecture
+
+```
+LMCache engine
+    │
+    ▼
+StorageBackendInterface
+    │
+    ▼
+NVMeOFBackend  ──── _NVMeOFWorker (AsyncPQThreadPoolExecutor)
+    │                       │
+    │            ┌──────────┴───────────┐
+    │         put (p=2)  delete (p=1)  prefetch (p=0)
+    │
+    ▼
+NVMeOF mount point  (e.g. /mnt/nvmeof, formatted as ext4/XFS)
+    │
+    ▼
+NVMeOF block device  (connected via nvme-cli or pre-mounted)
+```
+
+### Key design decisions
+
+| Decision | Choice | Rationale |
+|---|---|---|
+| Access mode | Filesystem (not raw block) | Simpler and more portable; works with any FS driver |
+| Auto-connect | Off by default | Device is typically pre-mounted by the operator |
+| Eviction | Pluggable policy (LRU/LFU/FIFO/MRU) | Matches LocalDiskBackend behavior |
+| O_DIRECT | Optional | Avoids double-buffering; requires aligned buffers |
+| Per-file storage | One file per KV chunk | Simple; enables parallel reads/writes |
+| Async I/O | `AsyncPQThreadPoolExecutor` | Priority ordering: prefetch > delete > put |
+
+## Configuration
+
+All keys are prefixed `nvmeof.` in `extra_config`:
+
+| Key | Type | Default | Description |
+|---|---|---|---|
+| `nvmeof.mount_path` | str | _(required)_ | Filesystem mount point |
+| `nvmeof.transport` | str | `"tcp"` | NVMeOF transport: `rdma`, `tcp`, `fc` |
+| `nvmeof.target_addr` | str | `""` | Target IP/hostname (auto-connect only) |
+| `nvmeof.target_port` | str | `"4420"` | Target port (auto-connect only) |
+| `nvmeof.target_nqn` | str | `""` | Target NQN (auto-connect / auto-disconnect) |
+| `nvmeof.auto_connect` | bool | `false` | Run `nvme connect` at init |
+| `nvmeof.auto_disconnect` | bool | `false` | Run `nvme disconnect` at close |
+| `nvmeof.use_odirect` | bool | `false` | Use `O_DIRECT` for I/O |
+| `nvmeof.max_capacity_gb` | float | `0` (unlimited) | Max filesystem usage in GiB |
+| `nvmeof.cache_policy` | str | `"lru"` | Eviction policy: `lru`, `lfu`, `fifo`, `mru` |
+
+### Example `extra_config`
+
+```yaml
+storage_plugins: ["nvmeof"]
+extra_config:
+  storage_plugin.nvmeof.module_path: lmcache.v1.storage_backend.plugins.nvmeof_backend
+  storage_plugin.nvmeof.class_name: NVMeOFBackend
+  nvmeof.mount_path: /mnt/nvmeof
+  nvmeof.transport: tcp
+  nvmeof.target_addr: 192.168.1.100
+  nvmeof.target_port: "4420"
+  nvmeof.target_nqn: nqn.2023-01.io.example:nvme-sub
+  nvmeof.auto_connect: false
+  nvmeof.auto_disconnect: false
+  nvmeof.use_odirect: false
+  nvmeof.max_capacity_gb: 100.0
+  nvmeof.cache_policy: lru
+```
+
+## File Naming
+
+```
+<mount_path>/<key.to_string().replace("/", "-")>.nvmeof
+```
+
+`CacheEngineKey.to_string()` can contain slashes (for example, in the model
+name); they are replaced with `-` to give a flat filename with no subdirectories.
+
+## Lifecycle
+
+1. **Init**: `os.makedirs(mount_path, exist_ok=True)`.  If `auto_connect=true`,
+   calls `nvme connect` via subprocess and raises `RuntimeError` on failure.
+2. **Put**: `batched_submit_put_task` → evict if at capacity → schedule
+   `_write_chunk` on the priority executor (priority 2, the lowest).
+3. **Get (blocking)**: `get_blocking` reads the file synchronously into a CPU
+   staging buffer provided by `local_cpu_backend`.
+4. **Get (async/prefetch)**: `batched_get_non_blocking` schedules
+   `_batched_read_files` on the executor at priority 0 (highest).
+5. **Eviction**: `_cache_policy.get_evict_candidates()` selects keys;
+   `remove()` deletes the file and updates the in-memory index.
+6. **Close**: Drains the executor, then calls `nvme disconnect` if
+   `auto_disconnect=true`.
+
+## Thread Safety
+
+- `_disk_lock` protects the in-memory index (`_dict`) and `_current_cache_size`.
+- `_NVMeOFWorker._put_lock` protects the in-flight put-task list.
+- Async tasks run on the caller-provided `loop` via `asyncio.run_coroutine_threadsafe`.
+
+## O_DIRECT
+
+When `use_odirect=true`, buffer sizes must be a multiple of the filesystem block
+size (`os.statvfs(mount_path).f_bsize`).  Unaligned writes fall back to
+standard buffered I/O and log a warning.  NVMeOF devices typically have a
+512 B or 4 KiB block size.
+
+## Extension Points
+
+- **Custom eviction**: implement a new policy and register it in
+  `lmcache/v1/storage_backend/cache_policy.py`.
+- **Raw block access**: replace `_write_file` / `_read_file` with pread/pwrite
+  calls (see `rust_raw_block_backend.py` for the pattern).
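The priority ordering and the O_DIRECT alignment rule described in the storage plugin doc above can be illustrated with a small sketch. `PriorityTaskQueue`, `fs_block_size`, and `can_use_odirect` are hypothetical names invented for this example. The queue only mimics the ordering that the doc attributes to `AsyncPQThreadPoolExecutor` (lower number runs first) and is not that class's API.

```python
import heapq
import itertools
import os

# Priorities as listed in the doc: a lower number is served first.
PRIORITY_PREFETCH, PRIORITY_DELETE, PRIORITY_PUT = 0, 1, 2


class PriorityTaskQueue:
    """Hypothetical stand-in that reproduces the prefetch > delete > put ordering."""

    def __init__(self):
        self._heap = []
        # Monotonic counter: ties within one priority are served in FIFO order.
        self._seq = itertools.count()

    def push(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._seq), name))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def fs_block_size(mount_path):
    """Filesystem block size, as referenced in the O_DIRECT section (Unix only)."""
    return os.statvfs(mount_path).f_bsize


def can_use_odirect(nbytes, block_size):
    """True when a buffer of nbytes is a whole number of blocks.

    Assumption: a False result means the caller falls back to buffered I/O.
    """
    return nbytes > 0 and nbytes % block_size == 0
```

A caller would typically compute `fs_block_size(mount_path)` once at init and reuse it for every chunk, since the block size of a mounted filesystem does not change while it is mounted.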