138 changes: 138 additions & 0 deletions docs/design/v1/distributed/l2_adapters/nvmeof_l2_adapter.md
@@ -0,0 +1,138 @@
# NVMeOF L2 Adapter

**Source**: `lmcache/v1/distributed/l2_adapters/nvmeof_l2_adapter.py`
**Storage plugin counterpart**: `lmcache/v1/storage_backend/plugins/nvmeof_backend.py`

## Overview

`NVMeOFL2Adapter` is an `L2AdapterInterface` implementation for multi-process
(MP) mode that stores KV cache objects on an NVMeOF-attached block device
formatted and mounted as a filesystem. It follows the same event-fd protocol
as `FSL2Adapter` (three distinct fds: store, lookup, load) and uses
*aiofiles* for non-blocking filesystem I/O on a background asyncio event loop.

The adapter self-registers as type `"nvmeof"` via
`register_l2_adapter_type` / `register_l2_adapter_factory` at import time.
The `__init__.py` auto-discovers it through `pkgutil` — no other changes are
required to make it available.

## Architecture

```
StoreController / PrefetchController
        │  submit_{store,lookup,load}_task()
        ▼
NVMeOFL2Adapter
  ├── _store_efd  ─── signals completed stores
  ├── _lookup_efd ─── signals completed lookups
  ├── _load_efd   ─── signals completed loads
  └── background asyncio loop (daemon thread "nvmeof-l2-loop")
        ├── _execute_store()  ── aiofiles.open / O_DIRECT write
        ├── _execute_lookup() ── aiofiles.os.path.exists
        └── _execute_load()   ── aiofiles.open / O_DIRECT read
                  │
                  ▼
        /mnt/nvmeof/<encoded_filename>.nvmeof
```

## Configuration

Passed as a JSON object to `--l2-adapter`:

| Field | Type | Default | Description |
|---|---|---|---|
| `type` | str | _(required)_ | Must be `"nvmeof"` |
| `mount_path` | str | _(required)_ | Filesystem mount point |
| `transport` | str | `"tcp"` | NVMeOF transport: `rdma`, `tcp`, `fc` |
| `target_addr` | str | `""` | Target IP/hostname (auto-connect only) |
| `target_port` | str | `"4420"` | Target port (auto-connect only) |
| `target_nqn` | str | `""` | Target NQN (auto-connect / auto-disconnect) |
| `auto_connect` | bool | `false` | Run `nvme connect` at init |
| `auto_disconnect` | bool | `false` | Run `nvme disconnect` at close |
| `use_odirect` | bool | `false` | Bypass page cache via `O_DIRECT` |

### Example

```bash
--l2-adapter '{
  "type": "nvmeof",
  "mount_path": "/mnt/nvmeof",
  "transport": "tcp",
  "target_addr": "192.168.1.100",
  "target_port": "4420",
  "target_nqn": "nqn.2023-01.io.example:nvme-sub",
  "auto_connect": false,
  "auto_disconnect": false,
  "use_odirect": false
}'
```
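
As a rough illustration of how this JSON could be consumed, the sketch below parses it into a dataclass that applies the defaults from the table, then builds an `nvme connect` command line. The names `NVMeOFAdapterConfig`, `parse_adapter_config`, and `nvme_connect_argv` are hypothetical and not taken from the adapter source. The `-t`, `-a`, `-s`, and `-n` flags are standard `nvme-cli` options, but whether the adapter passes exactly these arguments is an assumption.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class NVMeOFAdapterConfig:
    """Hypothetical container mirroring the fields in the table above."""

    mount_path: str
    type: str = "nvmeof"
    transport: str = "tcp"
    target_addr: str = ""
    target_port: str = "4420"
    target_nqn: str = ""
    auto_connect: bool = False
    auto_disconnect: bool = False
    use_odirect: bool = False


def parse_adapter_config(raw: str) -> NVMeOFAdapterConfig:
    """Parse the --l2-adapter JSON string and apply the documented defaults."""
    data = json.loads(raw)
    if data.get("type") != "nvmeof":
        raise ValueError('"type" must be "nvmeof"')
    if not data.get("mount_path"):
        raise ValueError('"mount_path" is required')
    # Keys this sketch does not model (for example "eviction") are ignored.
    known = {f.name for f in fields(NVMeOFAdapterConfig)}
    return NVMeOFAdapterConfig(**{k: v for k, v in data.items() if k in known})


def nvme_connect_argv(cfg: NVMeOFAdapterConfig) -> list:
    """Build an argv list for `nvme connect` (assumed flag set, see lead-in)."""
    return [
        "nvme", "connect",
        "-t", cfg.transport,    # transport: rdma, tcp, or fc
        "-a", cfg.target_addr,  # target address
        "-s", cfg.target_port,  # transport service id (port)
        "-n", cfg.target_nqn,   # subsystem NQN
    ]
```

Passing the result of `nvme_connect_argv` to `subprocess.run(..., check=True)` would be one way to surface a failed connect as an exception.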

## File Naming

```
<mount_path>/<safe_model>@0x<kv_rank_hex>@<chunk_hash_hex>[@<cache_salt>].nvmeof
```

- `/` in `model_name` is replaced with `-SEP-`.
- `kv_rank` is hex-encoded with `0x` prefix.
- `chunk_hash` is hex-encoded.
- `cache_salt` is appended only when non-empty.
- Extension `.nvmeof` distinguishes files from other adapters (e.g. `.data`
from `FSL2Adapter`) allowing coexistence on the same filesystem.
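
The naming rules above can be sketched as a small helper. This is an illustration only: `encode_filename` is a hypothetical name, and it assumes `kv_rank` is an integer, `chunk_hash` is a `bytes` value, the hex digits are lowercase and unpadded, and the salt is joined with the same `@` separator as the other parts.

```python
import posixpath


def encode_filename(mount_path, model_name, kv_rank, chunk_hash, cache_salt=""):
    """Build the on-disk path for one KV object (hypothetical helper)."""
    # "/" cannot appear inside a file name, so it is replaced with "-SEP-".
    safe_model = model_name.replace("/", "-SEP-")
    parts = [
        safe_model,
        f"0x{kv_rank:x}",   # assumed: unpadded lowercase hex with 0x prefix
        chunk_hash.hex(),   # assumed: chunk_hash is a bytes object
    ]
    if cache_salt:          # the salt is appended only when non-empty
        parts.append(cache_salt)
    return posixpath.join(mount_path, "@".join(parts) + ".nvmeof")
```

For example, `encode_filename("/mnt/nvmeof", "meta/llama", 255, b"\xab\xcd")` yields `/mnt/nvmeof/meta-SEP-llama@0xff@abcd.nvmeof` under these assumptions.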

## Event-fd Protocol

Each of the three event fds is distinct (enforced by three separate
`os.eventfd()` calls):

| Event fd | Signalled by | Read by |
|---|---|---|
| `_store_efd` | `_execute_store` on completion | StoreController |
| `_lookup_efd` | `_execute_lookup` on completion | PrefetchController |
| `_load_efd` | `_execute_load` on completion | PrefetchController |

Each async coroutine records its result in the matching completed-tasks dict
while holding `_lock`, then writes `1` to the event fd. The controller drains
the dict via `pop_completed_store_tasks()` or `query_{lookup,load}_result()`.

## Lifecycle

1. **Init**: create mount dir; optionally `nvme connect`; open three event
fds; start background event loop thread.
2. **Store**: `submit_store_task` → `asyncio.run_coroutine_threadsafe` →
`_execute_store` writes each file atomically (tmp → rename) → signal
`_store_efd`.
3. **Lookup**: `submit_lookup_and_lock_task` → `_execute_lookup` checks
`aiofiles.os.path.exists` for each key → populate Bitmap → signal
`_lookup_efd`.
4. **Unlock**: no-op (filesystem storage has no eviction race between lookup
and load).
5. **Load**: `submit_load_task` → `_execute_load` reads each file into the
caller-provided buffer → populate Bitmap → signal `_load_efd`.
6. **Delete**: synchronous `Path.unlink(missing_ok=True)` for each key;
fires `on_l2_keys_deleted` listener notifications.
7. **Close**: cancel pending tasks; stop event loop; join thread; close event
fds; optionally `nvme disconnect`.
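
The atomic write in the Store step (tmp → rename) can be sketched as follows. This version uses synchronous standard-library calls for brevity, whereas the adapter is described as using *aiofiles*; `atomic_write` is a hypothetical helper name, and the `fsync` call is an assumption rather than something the source states.

```python
import os
import tempfile


def atomic_write(path, payload):
    """Write payload to a temp file in the same directory, then rename it."""
    directory = os.path.dirname(path) or "."
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())    # assumed durability step, not from the source
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        # Do not leave a partial temp file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this pattern a concurrent lookup sees either no file or a complete one, never a partially written chunk.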

## Thread Safety

- `_lock` protects all three completed-task dicts and `_next_task_id`.
- All I/O runs on the dedicated event loop thread; task submission is
thread-safe via `asyncio.run_coroutine_threadsafe`.
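
A minimal sketch of the threading model described above: one daemon thread owns an asyncio loop, and any other thread submits coroutines to it with `asyncio.run_coroutine_threadsafe`. `BackgroundLoop` and `fake_io` are hypothetical names; only the thread name comes from the architecture diagram.

```python
import asyncio
import threading


class BackgroundLoop:
    """Run an asyncio event loop on a dedicated daemon thread."""

    def __init__(self, name="nvmeof-l2-loop"):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, name=name, daemon=True)
        self._thread.start()

    def _run(self):
        asyncio.set_event_loop(self._loop)
        self._loop.run_forever()

    def submit(self, coro):
        """Thread-safe submission; returns a concurrent.futures.Future."""
        return asyncio.run_coroutine_threadsafe(coro, self._loop)

    def close(self):
        # stop() must be scheduled on the loop thread, not called directly.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_io(key):
    """Stand-in for a store, lookup, or load coroutine."""
    await asyncio.sleep(0)
    return f"done:{key}"
```

A caller would use it as `bg = BackgroundLoop(); bg.submit(fake_io("k1")).result(timeout=5)` and call `bg.close()` at shutdown.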

## O_DIRECT

When `use_odirect=true`, buffer sizes must be aligned to
`os.statvfs(mount_path).f_bsize`. Unaligned writes fall back to standard
buffered I/O with a warning. O_DIRECT reads and writes run in the default
executor (thread pool) via `loop.run_in_executor` so they don't block the
event loop.
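
The alignment check and buffered fallback can be sketched like this (Linux only, since `os.O_DIRECT` is not defined on every platform). The helper names are hypothetical. The sketch checks only the buffer length, as the text above describes; on Linux, `O_DIRECT` generally also constrains the buffer address and file offset, which this sketch does not cover.

```python
import os


def fs_block_size(mount_path):
    """Block size reported by the filesystem mounted at mount_path."""
    return os.statvfs(mount_path).f_bsize


def is_aligned(nbytes, block_size):
    """True when the length is a positive multiple of the block size."""
    return nbytes > 0 and nbytes % block_size == 0


def choose_open_flags(nbytes, block_size, use_odirect):
    """Pick write flags, falling back to buffered I/O when unaligned."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if use_odirect and is_aligned(nbytes, block_size):
        flags |= os.O_DIRECT
    # Otherwise: plain buffered write; a caller could log a warning here.
    return flags
```

With a 4096-byte block size, an 8192-byte buffer would take the `O_DIRECT` path and a 5000-byte buffer would fall back to buffered I/O.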

## Persistence

Files are never deleted on normal shutdown — data persists across restarts.
Use the `delete()` method (called by the L2 eviction controller when
`"eviction"` is configured in the adapter JSON) to remove stale entries.
127 changes: 127 additions & 0 deletions docs/design/v1/storage_backend/plugins/nvmeof_backend.md
@@ -0,0 +1,127 @@
# NVMeOF Storage Plugin Backend

**Source**: `lmcache/v1/storage_backend/plugins/nvmeof_backend.py`
**L2 counterpart**: `lmcache/v1/distributed/l2_adapters/nvmeof_l2_adapter.py`

## Overview

`NVMeOFBackend` is a `StoragePluginInterface` implementation that stores KV
cache chunks on an NVMe-over-Fabrics (NVMeOF) attached block device formatted
and mounted as a filesystem (ext4, XFS, etc.). Each KV chunk is a separate
file under `nvmeof.mount_path`.

The backend fits into the single-process storage stack via
`storage_plugin_launcher()` in `lmcache/v1/storage_backend/__init__.py`. For
multi-process (MP) mode, see the companion L2 adapter.

## Architecture

```
LMCache engine
      │
      ▼
StorageBackendInterface
      │
      ▼
NVMeOFBackend ──── _NVMeOFWorker (AsyncPQThreadPoolExecutor)
      │                          │
      │          ┌───────────────┼───────────────┐
      │      put (p=2)      delete (p=1)    prefetch (p=0)
      ▼
NVMeOF mount point (e.g. /mnt/nvmeof, formatted as ext4/XFS)
      │
      ▼
NVMeOF block device (connected via nvme-cli or pre-mounted)
```

### Key design decisions

| Decision | Choice | Rationale |
|---|---|---|
| Access mode | Filesystem (not raw block) | Simpler and more portable; works with any FS driver |
| Auto-connect | Off by default | Device is typically pre-mounted by the operator |
| Eviction | Pluggable policy (LRU/LFU/FIFO/MRU) | Matches LocalDiskBackend behavior |
| O_DIRECT | Optional | Avoids double-buffering; requires aligned buffers |
| Per-file storage | One file per KV chunk | Simple; enables parallel reads/writes |
| Async I/O | `AsyncPQThreadPoolExecutor` | Priority ordering: prefetch > delete > put |

## Configuration

All keys are prefixed `nvmeof.` in `extra_config`:

| Key | Type | Default | Description |
|---|---|---|---|
| `nvmeof.mount_path` | str | _(required)_ | Filesystem mount point |
| `nvmeof.transport` | str | `"tcp"` | NVMeOF transport: `rdma`, `tcp`, `fc` |
| `nvmeof.target_addr` | str | `""` | Target IP/hostname (auto-connect only) |
| `nvmeof.target_port` | str | `"4420"` | Target port (auto-connect only) |
| `nvmeof.target_nqn` | str | `""` | Target NQN (auto-connect / auto-disconnect) |
| `nvmeof.auto_connect` | bool | `false` | Run `nvme connect` at init |
| `nvmeof.auto_disconnect` | bool | `false` | Run `nvme disconnect` at close |
| `nvmeof.use_odirect` | bool | `false` | Use `O_DIRECT` for I/O |
| `nvmeof.max_capacity_gb` | float | `0` (unlimited) | Max filesystem usage in GiB |
| `nvmeof.cache_policy` | str | `"lru"` | Eviction policy: `lru`, `lfu`, `fifo`, `mru` |

### Example `extra_config`

```yaml
storage_plugins: ["nvmeof"]
extra_config:
  storage_plugin.nvmeof.module_path: lmcache.v1.storage_backend.plugins.nvmeof_backend
  storage_plugin.nvmeof.class_name: NVMeOFBackend
  nvmeof.mount_path: /mnt/nvmeof
  nvmeof.transport: tcp
  nvmeof.target_addr: 192.168.1.100
  nvmeof.target_port: "4420"
  nvmeof.target_nqn: nqn.2023-01.io.example:nvme-sub
  nvmeof.auto_connect: false
  nvmeof.auto_disconnect: false
  nvmeof.use_odirect: false
  nvmeof.max_capacity_gb: 100.0
  nvmeof.cache_policy: lru
```

## File Naming

```
<mount_path>/<key.to_string().replace("/", "-")>.nvmeof
```

`CacheEngineKey.to_string()` produces a slash-separated string; slashes are
replaced with `-` to produce a flat filename with no subdirectory structure.
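
The mapping is a one-line transformation, sketched below with a hypothetical helper `key_to_path` that takes the already-stringified key. The key strings in the usage note are placeholders, not real `CacheEngineKey` output.

```python
import posixpath


def key_to_path(mount_path, key_string):
    """Flatten a slash-separated key string into a single file name."""
    return posixpath.join(mount_path, key_string.replace("/", "-") + ".nvmeof")
```

For instance, `key_to_path("/mnt/nvmeof", "a/b/c")` gives `/mnt/nvmeof/a-b-c.nvmeof`. In this sketch the replacement is not reversible: the placeholder keys `a/b-c` and `a-b/c` would map to the same file name.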

## Lifecycle

1. **Init**: `os.makedirs(mount_path, exist_ok=True)`. If `auto_connect=true`,
calls `nvme connect` via subprocess and raises `RuntimeError` on failure.
2. **Put**: `batched_submit_put_task` → evict if at capacity → schedule
`_write_chunk` on the priority executor (priority 2).
3. **Get (blocking)**: `get_blocking` reads the file synchronously into a CPU
staging buffer provided by `local_cpu_backend`.
4. **Get (async/prefetch)**: `batched_get_non_blocking` schedules
`_batched_read_files` on the executor at priority 0 (highest).
5. **Eviction**: `_cache_policy.get_evict_candidates()` selects keys;
`remove()` deletes the file and updates the in-memory index.
6. **Close**: Drains the executor, then calls `nvme disconnect` if
`auto_disconnect=true`.
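
The priority ordering used in the Put and Get steps (lower number runs first) can be illustrated with a standard-library priority queue. This is a single-threaded sketch, not the `AsyncPQThreadPoolExecutor` implementation. `PriorityTaskQueue` and the constant names are hypothetical; only the numeric priorities come from the text above.

```python
import itertools
import queue

PREFETCH, DELETE, PUT = 0, 1, 2  # lower value means higher priority


class PriorityTaskQueue:
    """Drain submitted callables in priority order, FIFO within a priority."""

    def __init__(self):
        self._q = queue.PriorityQueue()
        # The sequence number breaks ties so callables are never compared.
        self._seq = itertools.count()

    def submit(self, priority, fn, *args):
        self._q.put((priority, next(self._seq), fn, args))

    def run_all(self):
        """Run every queued task on the calling thread and collect results."""
        results = []
        while not self._q.empty():
            _, _, fn, args = self._q.get()
            results.append(fn(*args))
        return results
```

Submitting a put, then a delete, then a prefetch and calling `run_all()` would execute them as prefetch, delete, put, so latency-sensitive reads are served ahead of background writes.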

## Thread Safety

- `_disk_lock` protects the in-memory index (`_dict`) and `_current_cache_size`.
- `_NVMeOFWorker._put_lock` protects the in-flight put-task list.
- Async tasks run on the caller-provided `loop` via `asyncio.run_coroutine_threadsafe`.

## O_DIRECT

When `use_odirect=true`, buffer sizes must be aligned to the filesystem block
size (`os.statvfs(mount_path).f_bsize`). Unaligned writes fall back to
standard buffered I/O with a warning log. NVMeOF devices typically have a
512 B or 4 KiB block size.

## Extension Points

- **Custom eviction**: implement a new policy and register it in
`lmcache/v1/storage_backend/cache_policy.py`.
- **Raw block access**: replace `_write_file` / `_read_file` with pread/pwrite
calls (see `rust_raw_block_backend.py` for the pattern).