From 032860cda9a47f25d61eeaee0094c830ca87082c Mon Sep 17 00:00:00 2001 From: Pushkinist <4850452+Pushkinist@users.noreply.github.com> Date: Thu, 18 Jun 2026 12:56:15 +0700 Subject: [PATCH 1/2] fix(serve): bound registry eager-preload to --max-loaded-models (#133) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Registry serve eagerly preloaded EVERY model at startup, serially, even at --max-loaded-models 1. A 13-model registry = ~5-min startup loading all 13 once, contradicting the documented load-on-demand + idle-unload model and ignoring the resident cap. ensure_loaded() LRU-evicts when slots.len() >= cap, so preloading N models at cap=K loaded N and kept only the last K — the first N-K loads were pure waste plus transient memory pressure. Bound the eager-preload loop to the first min(cap, N) registry entries (BTreeMap order, deterministic). This warms exactly the resident-set size; the rest stay lazy and load on first request via the existing on-demand path. Best-effort semantics, spawn_blocking off-runtime execution, and per-model idle timers via ensure_loaded are preserved. Update the multi-model lifecycle e2e comments (runner.rs/manifest.toml) to describe the new bounded behavior; leg (c)'s defensive explicit load-B already forces the LRU swap, so the test stays green and meaningful. Document the cap-bounded preload on --registry and --max-loaded-models in docs/CLI.md. --- crates/rmlx-cli/src/commands/serve.rs | 19 ++++++++++++++++--- crates/rmlx-cli/tests/e2e/manifest.toml | 5 +++-- crates/rmlx-cli/tests/e2e/runner.rs | 18 +++++++++++------- docs/CLI.md | 4 ++-- 4 files changed, 32 insertions(+), 14 deletions(-) diff --git a/crates/rmlx-cli/src/commands/serve.rs b/crates/rmlx-cli/src/commands/serve.rs index e22c7f0..3808438 100644 --- a/crates/rmlx-cli/src/commands/serve.rs +++ b/crates/rmlx-cli/src/commands/serve.rs @@ -1132,15 +1132,28 @@ pub(crate) fn run_serve( tts_model: Arc::new(parking_lot::RwLock::new(None)), }; - // Eager model preload — load every registry entry before - // serving requests so cold TTFT does not include model-load overhead. + // Eager model preload — warm the resident set before serving requests + // so cold TTFT does not include model-load overhead. Bounded to AT MOST + // `max_loaded_models` entries (the first `cap` in registry order): + // anything beyond the resident cap would be evicted by the next + // `ensure_loaded` (see `AppState::ensure_loaded` LRU swap), so preloading + // it is pure load-cost + transient memory pressure with nothing kept. + // The rest stay lazy — the documented load-on-demand / idle-unload path + // handles them on first request. // `ensure_loaded` is synchronous (CPU-bound disk + dequant); run it in // the blocking-thread pool so we do not stall the async runtime. // Best-effort: a load failure logs a warning but does not abort startup // (the first real request will attempt the load again via the normal // on-demand path and surface a 503 if it still fails). { - let ids: Vec = state.registry.list().iter().map(|e| e.id.clone()).collect(); + let cap = max_loaded_models.max(1); + let ids: Vec = state + .registry + .list() + .iter() + .take(cap) + .map(|e| e.id.clone()) + .collect(); let state_ref = state.clone(); tokio::task::spawn_blocking(move || { for id in &ids { diff --git a/crates/rmlx-cli/tests/e2e/manifest.toml b/crates/rmlx-cli/tests/e2e/manifest.toml index 6ccf3a4..a34a5b6 100644 --- a/crates/rmlx-cli/tests/e2e/manifest.toml +++ b/crates/rmlx-cli/tests/e2e/manifest.toml @@ -715,8 +715,9 @@ tags = ["phase2"] # Multi-model lifecycle: the runner owns a registry-mode serve (Bonsai + a 2nd # model resolved from GEMMA4_E2B) under single-MLX discipline. Proves: -# (a) load A → loaded; (c) cap=1 eager preload of [A,B] → B resident, A LRU- -# evicted (status flips); (d) explicit unload B → loaded:false, 2nd unload → +# (a) load A → loaded; (c) cap=1 eager preload of [A,B] warms only A (bounded +# to cap), then explicit load B forces the LRU swap → B resident, A evicted +# (status flips); (d) explicit unload B → loaded:false, 2nd unload → # 404; (e) claim enforcement — a 2nd `rmlx serve` on the HELD port is rejected # (exit 11, no competing Metal context). When GEMMA4_E2B is absent the runner # runs the single-model subset (leg a + claim leg) and marks the 2-model legs diff --git a/crates/rmlx-cli/tests/e2e/runner.rs b/crates/rmlx-cli/tests/e2e/runner.rs index 16deaea..cb59bdd 100644 --- a/crates/rmlx-cli/tests/e2e/runner.rs +++ b/crates/rmlx-cli/tests/e2e/runner.rs @@ -1957,9 +1957,10 @@ fn assert_cache_hit_equivalence( /// registry path (multiple model entries) instead of a single `--model`, and /// leaves `RUST_LOG=warn` (the lifecycle proof reads the HTTP API, not logs). /// -/// Registry mode eagerly pre-loads every entry at startup, bounded by the slot -/// LRU at `cap` — so on a green `/health` the resident set is already the -/// `cap`-survivor of the eager preload (see serve.rs "Eager model preload"). +/// Registry mode eagerly pre-loads AT MOST `cap` entries at startup (the first +/// `cap` in registry order; see serve.rs "Eager model preload") — so on a green +/// `/health` the first `min(cap, N)` registry entries are already resident and +/// the rest stay lazy until their first request. fn spawn_serve_registry( registry_json: &std::path::Path, port: u16, @@ -2131,9 +2132,11 @@ fn assert_model_lifecycle( } // ── Legs (a)+(c)+(d): cap=1 registry with A (+B when present). ─────────── - // With cap=1 + eager preload, the LAST registry entry survives the preload. - // Order [A, B] → B survives → A evicted. That proves leg (c) LRU eviction - // directly out of the eager preload. With only A, A is resident (leg a). + // With cap=1, eager preload warms only the FIRST registry entry (bounded to + // `min(cap, N)`). Order [A, B] → A resident, B lazy. Leg (c) then forces the + // LRU swap by explicitly loading B (see the defensive load below): the + // resident set flips to B, A evicted — proving the cap=1 LRU eviction. With + // only A, A is resident (leg a). let entries_1: Vec<(&str, &std::path::Path)> = match (&model_b, &id_b) { (Some(pb), Some(idb)) => vec![(id_a, model_a), (idb.as_str(), pb.as_path())], _ => vec![(id_a, model_a)], @@ -2158,7 +2161,8 @@ fn assert_model_lifecycle( // Two-model legs (a)/(c)/(d) when B is present; single-model leg (a) only // otherwise. if let Some(idb) = id_b.clone() { - // Eager preload of [A,B] at cap=1 → B resident, A evicted: leg (c). + // Eager preload of [A,B] at cap=1 warms only A (first entry); B stays + // lazy. The defensive load-B below forces the LRU swap: leg (c). let a_loaded = match model_loaded(port, id_a) { Ok(v) => v, Err(e) => return fail_lc(&cap1_guard, &lc_home, mk, format!("status A (cap1): {e}")), diff --git a/docs/CLI.md b/docs/CLI.md index 018a2af..b8fee33 100644 --- a/docs/CLI.md +++ b/docs/CLI.md @@ -58,7 +58,7 @@ mutually exclusive. | Flag | Type | Default | Description | |---|---|---|---| | `--model` | path | — | Path to a model snapshot directory. Mutually exclusive with `--registry`. | -| `--registry` | path | — | Path to a JSON registry file. Format: `{"models":[{"id":"name","path":"/abs/path"},…]}`. Mutually exclusive with `--model`. | +| `--registry` | path | — | Path to a JSON registry file. Format: `{"models":[{"id":"name","path":"/abs/path"},…]}`. Mutually exclusive with `--model`. At startup the server eagerly warms **at most `--max-loaded-models`** entries (the first `cap` in registry order); the rest stay lazy and load on first request (load-on-demand + idle-unload). A large registry therefore does not pull every model through GPU memory at boot. | | `--profile` | string | — | Named launch profile from `/profiles.toml`. CLI flags override profile values. See `rmlx profile list`. | | `--port` | u16 | 8080 | TCP port to listen on. | | `--host` | string | `127.0.0.1` | Host or IP to bind. | @@ -83,7 +83,7 @@ mutually exclusive. | `--turbo-flash-lock` | bool flag | off | Enable TurboFlash lock variant. Has no effect unless `--turbo-flash` or `RMLX_TURBO_FLASH=1` is also active. | | `--planar-flash-decode` | `on` \| `off` \| `auto` | `auto` | PlanarK single-pass flash-decode MSL kernel. `auto` (default): resolves OFF on every host — validation confirmed the kernel is bit-for-bit identical to the fused chain (`update_and_sdpa_planar_k_fused` dispatch_delta>0 with output matching OFF byte-for-byte on Bonsai) but did not deliver a measurable decode-TPS gain (-0.19% mean at 4k canary; well below the ≥10% Auto-flip gate). A pre-existing PlanarK-on-Bonsai long-prompt chunked-prefill bug (`docs/KV_QUANT.md` §"Correctness gap") also prevented the NIAH correctness anchor from passing on the only reachable arch. `on` forces `RMLX_PLANAR_FLASH_DECODE=1` (opt-in ablation). `off` **hard-overrides** — removes any pre-existing `RMLX_PLANAR_FLASH_DECODE` from the env so a stale `=1` cannot latch the OnceLock. | | `--require-smoke-probe` | bool flag | off | Run 8-token smoke probe on every model load; reject `BrokenPunctLoop` / `BrokenNan` results with HTTP 503. | -| `--max-loaded-models` | usize | 1 | Maximum models held resident in GPU memory. LRU eviction when exceeded. | +| `--max-loaded-models` | usize | 1 | Maximum models held resident in GPU memory. LRU eviction when exceeded. Also bounds registry eager-preload: only the first `min(cap, N)` registry entries are warmed at boot (anything beyond the cap would be evicted by the next load, so preloading it is pure waste). | | `--max-queue-depth` | usize | 64 | FIFO admission queue depth. Requests beyond this limit receive HTTP 429. `0` = unlimited. | | `--adaptive-admission` | bool flag | off | Enable the in-process adaptive admission controller. When set, the controller adjusts `max_queue_depth` dynamically based on SLA telemetry and rejects requests with HTTP 503 + `Retry-After: 5` when the end-to-end step estimate exceeds `2 × step-target-ms`. When absent, the static `--max-queue-depth` is used unchanged. | | `--step-target-ms` | u64 | 500 | End-to-end step SLA target in milliseconds for the adaptive controller. Anticipatory 503 fires when `est_step > 2 × this`. Requires `--adaptive-admission`. `--ttft-target-ms` is accepted as a hidden alias for backward compatibility. | From d35cc88d66357fd1015210dc9ef664298155d7f8 Mon Sep 17 00:00:00 2001 From: Pushkinist <4850452+Pushkinist@users.noreply.github.com> Date: Thu, 18 Jun 2026 13:01:09 +0700 Subject: [PATCH 2/2] docs(serve): fix preload-order description: alphabetical-by-id not JSON order MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Follow-up to the bounded eager-preload fix. Comments and docs said "the first cap in registry order" / "first cap registry entries", implying JSON array order. The registry iterates a BTreeMap sorted by id, so the actual selection is the alphabetically-first cap model ids — independent of JSON array order. Update serve.rs comment, docs/CLI.md (--registry and --max-loaded-models rows), and the e2e runner/manifest comments to say "alphabetically-first cap model ids". The runner.rs lifecycle test comments also drop the over-specific "Order [A,B] → A resident" claim (which of A/B sorts first depends on the real id basenames); the test's defensive load-B path is robust to either ordering regardless. --- crates/rmlx-cli/src/commands/serve.rs | 3 ++- crates/rmlx-cli/tests/e2e/manifest.toml | 4 ++-- crates/rmlx-cli/tests/e2e/runner.rs | 24 +++++++++++++----------- docs/CLI.md | 4 ++-- 4 files changed, 19 insertions(+), 16 deletions(-) diff --git a/crates/rmlx-cli/src/commands/serve.rs b/crates/rmlx-cli/src/commands/serve.rs index 3808438..0ede8bf 100644 --- a/crates/rmlx-cli/src/commands/serve.rs +++ b/crates/rmlx-cli/src/commands/serve.rs @@ -1134,7 +1134,8 @@ pub(crate) fn run_serve( // Eager model preload — warm the resident set before serving requests // so cold TTFT does not include model-load overhead. Bounded to AT MOST - // `max_loaded_models` entries (the first `cap` in registry order): + // `max_loaded_models` entries (the `cap` alphabetically-first ids, since + // `registry.list()` iterates a BTreeMap sorted by id — not JSON order): // anything beyond the resident cap would be evicted by the next // `ensure_loaded` (see `AppState::ensure_loaded` LRU swap), so preloading // it is pure load-cost + transient memory pressure with nothing kept. diff --git a/crates/rmlx-cli/tests/e2e/manifest.toml b/crates/rmlx-cli/tests/e2e/manifest.toml index a34a5b6..fb2a689 100644 --- a/crates/rmlx-cli/tests/e2e/manifest.toml +++ b/crates/rmlx-cli/tests/e2e/manifest.toml @@ -715,8 +715,8 @@ tags = ["phase2"] # Multi-model lifecycle: the runner owns a registry-mode serve (Bonsai + a 2nd # model resolved from GEMMA4_E2B) under single-MLX discipline. Proves: -# (a) load A → loaded; (c) cap=1 eager preload of [A,B] warms only A (bounded -# to cap), then explicit load B forces the LRU swap → B resident, A evicted +# (a) load A → loaded; (c) cap=1 warms the alphabetically-first id (one entry, +# bounded to cap); explicit load B forces the LRU swap → B resident, A evicted # (status flips); (d) explicit unload B → loaded:false, 2nd unload → # 404; (e) claim enforcement — a 2nd `rmlx serve` on the HELD port is rejected # (exit 11, no competing Metal context). When GEMMA4_E2B is absent the runner diff --git a/crates/rmlx-cli/tests/e2e/runner.rs b/crates/rmlx-cli/tests/e2e/runner.rs index cb59bdd..528ad2b 100644 --- a/crates/rmlx-cli/tests/e2e/runner.rs +++ b/crates/rmlx-cli/tests/e2e/runner.rs @@ -1957,10 +1957,11 @@ fn assert_cache_hit_equivalence( /// registry path (multiple model entries) instead of a single `--model`, and /// leaves `RUST_LOG=warn` (the lifecycle proof reads the HTTP API, not logs). /// -/// Registry mode eagerly pre-loads AT MOST `cap` entries at startup (the first -/// `cap` in registry order; see serve.rs "Eager model preload") — so on a green -/// `/health` the first `min(cap, N)` registry entries are already resident and -/// the rest stay lazy until their first request. +/// Registry mode eagerly pre-loads AT MOST `cap` entries at startup (the +/// alphabetically-first `min(cap, N)` model ids, since registry entries are +/// BTreeMap-sorted by id; see serve.rs "Eager model preload") — so on a green +/// `/health` those ids are already resident and the rest stay lazy until their +/// first request. fn spawn_serve_registry( registry_json: &std::path::Path, port: u16, @@ -2132,11 +2133,11 @@ fn assert_model_lifecycle( } // ── Legs (a)+(c)+(d): cap=1 registry with A (+B when present). ─────────── - // With cap=1, eager preload warms only the FIRST registry entry (bounded to - // `min(cap, N)`). Order [A, B] → A resident, B lazy. Leg (c) then forces the - // LRU swap by explicitly loading B (see the defensive load below): the - // resident set flips to B, A evicted — proving the cap=1 LRU eviction. With - // only A, A is resident (leg a). + // With cap=1, eager preload warms whichever id sorts alphabetically first + // (registry entries are BTreeMap-sorted by id, not by JSON order). The + // defensive load-B below forces the LRU swap regardless of which was + // preloaded: the resident set ends up B-resident, A-evicted — proving + // cap=1 LRU eviction (leg c). With only A, A is resident (leg a). let entries_1: Vec<(&str, &std::path::Path)> = match (&model_b, &id_b) { (Some(pb), Some(idb)) => vec![(id_a, model_a), (idb.as_str(), pb.as_path())], _ => vec![(id_a, model_a)], @@ -2161,8 +2162,9 @@ fn assert_model_lifecycle( // Two-model legs (a)/(c)/(d) when B is present; single-model leg (a) only // otherwise. if let Some(idb) = id_b.clone() { - // Eager preload of [A,B] at cap=1 warms only A (first entry); B stays - // lazy. The defensive load-B below forces the LRU swap: leg (c). + // Eager preload at cap=1 warms the alphabetically-first id (one entry); + // the other stays lazy. The defensive load-B below forces the LRU + // swap regardless of which was preloaded: leg (c). let a_loaded = match model_loaded(port, id_a) { Ok(v) => v, Err(e) => return fail_lc(&cap1_guard, &lc_home, mk, format!("status A (cap1): {e}")), diff --git a/docs/CLI.md b/docs/CLI.md index b8fee33..a2f5e39 100644 --- a/docs/CLI.md +++ b/docs/CLI.md @@ -58,7 +58,7 @@ mutually exclusive. | Flag | Type | Default | Description | |---|---|---|---| | `--model` | path | — | Path to a model snapshot directory. Mutually exclusive with `--registry`. | -| `--registry` | path | — | Path to a JSON registry file. Format: `{"models":[{"id":"name","path":"/abs/path"},…]}`. Mutually exclusive with `--model`. At startup the server eagerly warms **at most `--max-loaded-models`** entries (the first `cap` in registry order); the rest stay lazy and load on first request (load-on-demand + idle-unload). A large registry therefore does not pull every model through GPU memory at boot. | +| `--registry` | path | — | Path to a JSON registry file. Format: `{"models":[{"id":"name","path":"/abs/path"},…]}`. Mutually exclusive with `--model`. At startup the server eagerly warms **at most `--max-loaded-models`** entries (the alphabetically-first `cap` model ids, since the registry iterates entries sorted by id — not JSON array order); the rest stay lazy and load on first request (load-on-demand + idle-unload). A large registry therefore does not pull every model through GPU memory at boot. | | `--profile` | string | — | Named launch profile from `/profiles.toml`. CLI flags override profile values. See `rmlx profile list`. | | `--port` | u16 | 8080 | TCP port to listen on. | | `--host` | string | `127.0.0.1` | Host or IP to bind. | @@ -83,7 +83,7 @@ mutually exclusive. | `--turbo-flash-lock` | bool flag | off | Enable TurboFlash lock variant. Has no effect unless `--turbo-flash` or `RMLX_TURBO_FLASH=1` is also active. | | `--planar-flash-decode` | `on` \| `off` \| `auto` | `auto` | PlanarK single-pass flash-decode MSL kernel. `auto` (default): resolves OFF on every host — validation confirmed the kernel is bit-for-bit identical to the fused chain (`update_and_sdpa_planar_k_fused` dispatch_delta>0 with output matching OFF byte-for-byte on Bonsai) but did not deliver a measurable decode-TPS gain (-0.19% mean at 4k canary; well below the ≥10% Auto-flip gate). A pre-existing PlanarK-on-Bonsai long-prompt chunked-prefill bug (`docs/KV_QUANT.md` §"Correctness gap") also prevented the NIAH correctness anchor from passing on the only reachable arch. `on` forces `RMLX_PLANAR_FLASH_DECODE=1` (opt-in ablation). `off` **hard-overrides** — removes any pre-existing `RMLX_PLANAR_FLASH_DECODE` from the env so a stale `=1` cannot latch the OnceLock. | | `--require-smoke-probe` | bool flag | off | Run 8-token smoke probe on every model load; reject `BrokenPunctLoop` / `BrokenNan` results with HTTP 503. | -| `--max-loaded-models` | usize | 1 | Maximum models held resident in GPU memory. LRU eviction when exceeded. Also bounds registry eager-preload: only the first `min(cap, N)` registry entries are warmed at boot (anything beyond the cap would be evicted by the next load, so preloading it is pure waste). | +| `--max-loaded-models` | usize | 1 | Maximum models held resident in GPU memory. LRU eviction when exceeded. Also bounds registry eager-preload: only the alphabetically-first `min(cap, N)` model ids are warmed at boot (anything beyond the cap would be evicted by the next load, so preloading it is pure waste). | | `--max-queue-depth` | usize | 64 | FIFO admission queue depth. Requests beyond this limit receive HTTP 429. `0` = unlimited. | | `--adaptive-admission` | bool flag | off | Enable the in-process adaptive admission controller. When set, the controller adjusts `max_queue_depth` dynamically based on SLA telemetry and rejects requests with HTTP 503 + `Retry-After: 5` when the end-to-end step estimate exceeds `2 × step-target-ms`. When absent, the static `--max-queue-depth` is used unchanged. | | `--step-target-ms` | u64 | 500 | End-to-end step SLA target in milliseconds for the adaptive controller. Anticipatory 503 fires when `est_step > 2 × this`. Requires `--adaptive-admission`. `--ttft-target-ms` is accepted as a hidden alias for backward compatibility. |