Skip to content

fix(cache): boundary snapshots dropped during prefill because uid mapping not yet populated#1471

Closed
cfbraun wants to merge 30 commits into
jundot:mainfrom
cfbraun:pr/prefill-snapshot-uid-fix
Closed

fix(cache): boundary snapshots dropped during prefill because uid mapping not yet populated#1471
cfbraun wants to merge 30 commits into
jundot:mainfrom
cfbraun:pr/prefill-snapshot-uid-fix

Conversation

@cfbraun
Copy link
Copy Markdown
Contributor

@cfbraun cfbraun commented May 27, 2026

Summary

Prefill-side boundary snapshots were being silently dropped for ArraysCache / GDN / hybrid models, causing every non-last cached block to store a placeholder and every identical-prefix re-upload to re-prefill from scratch.

_emit_prefill_boundary_snapshot (scheduler.py:2549) routed through:

self._on_prefill_boundary_snapshot(
    self.request_id_to_uid.get(request.request_id, -1),  # → -1
    snapshot_cache,
    total_tokens,
)

But request_id_to_uid is only populated when the request is inserted into BatchGenerator via _insert_prefilled_request, which runs after prefill completes. During prefill the lookup defaults to -1. _on_prefill_boundary_snapshot then does:

request_id = self.uid_to_request_id.get(uid)  # uid=-1 → None
if request_id is None:
    return  # silent

→ every boundary-snapshot call during prefill was a no-op.

For sliceable cache types (KVCache, BatchKVCache, etc.) this is invisible — those layers don't need snapshots. For ArraysCache / GDN layers, every non-last block in prefix_cache.store_cache stores a (mx.zeros((1,)), mx.zeros((1,))) placeholder when no snapshot is available. The next identical-prefix request matches blocks by hash, finds a placeholder shape (1,) in the last matched block, and the rejection at prefix_cache.py:1986 correctly bails — but the resulting "Request will reprocess from scratch" branch re-prefills the entire prompt.

User-visible symptom

Hybrid-architecture models (e.g. Qwen3-Next / Qwen3.6-A3B) re-prefilling the full prompt on every chat continuation, even when only one or two new turns were appended. Log signature:

ArraysCache layer 0: partial prefix match detected (placeholder in last matched block).
Rejecting cache to prevent stale GDN state. Request will reprocess from scratch.

Repeating every ~few minutes under normal chat use.

The fix

Pass request_id through directly. _on_prefill_boundary_snapshot already keyed everything by request_id internally — the uid round-trip was pointless.

self._on_prefill_boundary_snapshot(
    request.request_id,
    snapshot_cache,
    total_tokens,
)

And the helper's signature changes from uid: int to request_id: str, dropping the now-unused uid_to_request_id lookup.

The other prefill-side caller, _maybe_capture_boundary_snapshot (called during generation from _step after the request has been inserted, where uid IS valid), writes snapshots inline without going through this helper, so it was unaffected. That asymmetry explains the bimodal pattern in production logs: short-prompt + long-generation requests captured 20+ intermediate snapshots cleanly (boundaries crossed during generation), while long-prompt + short-generation requests captured 0 (all boundaries in prefill, where the uid lookup failed).

Diagnosis evidence

End-to-end repro on a Qwen3.6-35B-A3B (GDN layer 0) instance, 90k-token prompt:

Before fix (every block boundary):

SNAP_EMIT request=d6655b20-... tokens=2048 non_none_layers=30/40   ← emit fires
SNAP_SKIP tokens=2048 uid=-1 reason=no_request_id_for_uid           ← dropped
SNAP_EMIT request=... tokens=4096 non_none_layers=30/40
SNAP_SKIP tokens=4096 uid=-1 reason=no_request_id_for_uid
...
Using boundary cache snapshot for ...: storing 88064/90919 tokens (skipping trailing partial block, 0 intermediate snapshots)

After fix:

SNAP_EMIT request=1a58e4f5-... tokens=2048 non_none_layers=30/40
SNAP_OK request=1a58e4f5-... tokens=2048 saved_to_ssd=True
...
Using boundary cache snapshot for 1a58e4f5-...: storing 90112/90919 tokens (skipping trailing partial block, 43 intermediate snapshots)

Subsequent identical-prefix request started prefill at token 51200 (previous cached boundary) instead of 0 — no rejection line.

Test plan

  • Updated test_prefill_boundary_snapshot_records_rotating_cache to NOT pre-populate request_id_to_uid / uid_to_request_id — matches actual prefill state.
  • Updated test_prefill_boundary_snapshot_ignores_non_boundary_token_count similarly.
  • Added test_emit_prefill_boundary_snapshot_persists_before_uid_assignment — pins the regression: calling _emit_prefill_boundary_snapshot while the request has not been inserted into BatchGenerator must still record the snapshot.
  • pytest tests/test_scheduler.py -k "prefill_boundary or emit_prefill_boundary" — 3 passed.
  • Confirmed end-to-end via instrumentation on a real 90k-token chat continuation against Qwen3.6-35B-A3B-UD-MLX-4bit.

cfbraun added 30 commits May 26, 2026 09:40
When the SSD cache hits its size limit under sustained save load, every
save_block triggers eviction. The previous design routed each evicted
block's unlink as ("unlink", path) onto the same _write_queue that
carries pending writes — so eviction could never free queue capacity, it
could only add more work to the queue. Combined with the pre-eviction
_write_queue.full() short-circuit, this kept the cache permanently full
once the queue saturated.

Inline the unlinks on the eviction-calling thread instead. Eviction
typically removes a single block per save (loop in evict_until_size
stops as soon as size drops below target), so this is one syscall per
save, not N. The deferred-unlink justification ("avoid blocking the
inference thread with N file delete syscalls") doesn't actually
materialise in steady state — and inlining removes the bounded-queue
contention entirely.

With unlinks off _write_queue the pre-eviction full() short-circuit is
no longer needed (eviction never made the queue worse). The eviction
call at the top of save_block — which ran before tensors were built and
so could only pass a 1 MiB placeholder size — is also dropped; the
single eviction call at save_block:1339 (post-tensor-build) with the
real estimated_size is enough.

Bounded inline-eviction burst. The ENOSPC recovery path invalidates
the disk-usage snapshot, which shrinks _get_effective_max_size sharply
on the next save — evict_until_size can then return hundreds of
entries at once and the inline-unlink loop would stall the inference
thread on a syscall storm. Cap the burst at
_MAX_INLINE_UNLINKS_PER_SAVE = 32 and reinsert the deferred metadata
into the index so subsequent saves drain the remainder; bounds
per-call latency at the cost of taking multiple saves to fully
reconverge.

Genuine writer saturation (the writer can't keep up with sustained
inserts) remains possible. save_block and _enqueue_ssd_write now call
_write_queue.put(item, timeout=0.25) instead of put_nowait so a
transient burst rides over short writer-backlog windows; only sustained
saturation drops the block. The 250 ms budget is well below human-
perceptible latency and typically covers one or two writer iterations
on a healthy SSD. The blocking shutdown-flush path keeps the 500 ms
timeout.

Estimate computes the actual safetensors metadata-JSON length plus a
256 B JSON-envelope margin and 128 B/tensor for the per-tensor header,
replacing the previous fixed 1 KiB metadata constant. Deep-layer models
with large layer_meta_states JSON were already capable of exceeding the
1 KiB budget; the new estimate tracks the real value so the index can't
briefly overshoot _effective_max after a write.

ENOSPC failures invalidate the 30 s disk_usage snapshot under
self._lock so a writer-thread invalidation cannot interleave with an
inference-thread read-check-write and produce a fresh-value-stale-
timestamp pair. The invalidation only protects the NEXT round of
save_block calls — in-flight saves that already passed the quota gate
are still queued and may re-ENOSPC. The error message reflects that.
Logged at ERROR because save_block has already returned True /
incremented _stats["saves"] by the time the writer fails, so the slot
is silently lost; surfacing this at ERROR makes the drift visible to
operators.

Enqueue-vs-persist counters split. _stats["saves"] continues to count
blocks that passed the quota gate (incremented in save_block on the
inference thread), and a new _stats["saves_persisted"] counter is
incremented after the writer's atomic rename succeeds — surfaced as
PagedSSDCacheStats.saves_persisted. saves - saves_persisted is the
steady-state in-flight depth; dashboards that need true durability
should switch to the new field. save_rate's existing formula
(saves / (saves + errors)) keeps backwards compatibility.

Tests (tests/test_paged_ssd_cache.py::TestEvictionAndQueueSaturation):
- test_eviction_does_not_enqueue_unlink_tasks: forces eviction and
  asserts no ("unlink", ...) entries leak into _write_queue.
- test_save_uses_timeout_not_put_nowait: verifies save_block calls
  _write_queue.put with a positive timeout (regression for the prior
  put_nowait that dropped on first burst).
- test_enospc_invalidates_disk_usage_snapshot: an ENOSPC writer
  failure clears _disk_usage_cache so the next effective-max call
  recomputes against true free-space.
- test_saves_persisted_increments_only_after_rename: pins the
  documented enqueue-vs-persist semantic of the two counters.
- test_inline_eviction_burst_is_capped: a forced large-burst eviction
  is bounded at _MAX_INLINE_UNLINKS_PER_SAVE and reconverges across
  subsequent calls.

82/82 paged_ssd_cache tests pass.
Ships xgrammar==0.2.0 + apache-tvm-ffi==0.1.11 in the DMG (~+37 MB)
instead of skipping structured outputs. xgrammar's torch>=1.10.0
runtime dep is load-bearing only at import time; omlx/_torch_stub.py
satisfies those references (dtypes, Tensor aliases, cuda/version/nn
submodules) without the ~500 MB torch wheel. oMLX's hot path is
unchanged — bitmasks are numpy int32, applied via the existing MLX
kernel.

Wiring:
- omlx/_torch_stub.py: install() is a no-op when real torch is found.
  The real-torch path now also leaves TVM_FFI_DISABLE_TORCH_C_DLPACK
  untouched — that env var is only set when the stub itself is being
  installed, so a user with real torch can still use tvm-ffi's
  torch-C-DLPack fast path. The version-drift check is hoisted out
  of the install lock so a slow cold-disk xgrammar import doesn't
  block concurrent callers behind the lock.
- omlx/{api/grammar,server,admin/routes}.py: install before xgrammar
- tests/conftest.py: install at collection so @patch decorators resolve
- packaging/build.py: idempotent _install_xgrammar() with a versioned
  sentinel file. The sentinel write is now atomic (tmp + os.replace)
  so a SIGKILL/ENOSPC mid-Path.write_text can't leave a zero-length
  sentinel that the next run accepts. Before writing, the package
  roots (xgrammar/__init__.py, tvm_ffi/__init__.py) are checked for
  presence and non-zero size — a truncated zipfile.extractall pass
  would otherwise also be accepted by the bare-existence check.
  Hoisted into main() so --skip-venv runs it too.
- omlx/_torch_stub.py: the top-level torch.__getattr__ now logs the
  missing attribute name at WARNING (rate-limited to once per name
  per process) before raising AttributeError, so a future xgrammar
  release reaching for an unstubbed torch surface is diagnosable
  from logs even when the AttributeError surfaces in a request
  handler that doesn't separately log it.
- server.py: DMG error path no longer claims unavailability

Tests (tests/test_torch_stub.py, 16 cases — 4 new on top of the
12 pre-existing): install / idempotency / real-torch-present
detection / dtype dict-keying / isinstance(Tensor) / unsupported
helpers raising / env-var conditional on no-real-torch path /
top-level __getattr__ AttributeError + WARNING log contract /
__spec__ is a real ModuleSpec for every stubbed submodule /
torch.utils.dlpack.to_dlpack raises. Guards against a future xgrammar
or tvm-ffi version reaching for a new torch attribute at import time.

Verified on bundled app Python: import xgrammar succeeds without
torch on disk; create_grammar_compiler + compile_json_schema +
GrammarConstraintProcessor mask logits correctly with a real HF
tokenizer; tests/test_grammar.py = 57 passed / 3 pre-existing skips.
…th the writer thread

test_cleanup_all_drains_queue failed ~20% of the time, in isolation
and in suites. The race:

  T0 save() puts item in queue
  T1 writer queue.get() pulls item (already off the queue)
  T2 cleanup_all() drains queue — finds it empty
  T3 cleanup_all() rmtree(snapshot_dir) + mkdir(snapshot_dir)
  T4 writer mkdir(req_dir) + temp write + rename(final)

Result: ``req-X/N.safetensors`` survives the supposed cleanup, the
caller (scheduler / shutdown) believes the snapshot dir is empty, and
the leftover file is the disk shadow of a request that should have
been forgotten.

Add ``_writer_busy`` (threading.Lock). The writer holds it for the
entire duration of each ``_process_write_item`` call. ``cleanup_all()``
drains the queue first (no new items can start) and then acquires
``_writer_busy``, so any item the writer had already pulled finishes
before rmtree runs.

Same class of race exists for ``cleanup_request()`` — writer pulls a
request's item before cleanup_request sets the cancelled-counter, then
mkdir's the req_dir back into existence after the rmtree, and the
late-rename rescue at the tail of ``_process_write_item`` catches it
in most timings but a small window remains. Apply the same
``_writer_busy`` barrier (bounded, see below) so the two cleanup paths
are symmetric.

Per-path timeouts so cleanup_request can yield faster than
cleanup_all:

- ``_CLEANUP_ALL_TIMEOUT_S = 5.0`` — cleanup_all runs at reset /
  startup where blocking longer is tolerable in exchange for a
  stronger orphan-avoidance guarantee.
- ``_CLEANUP_REQUEST_TIMEOUT_S = 2.0`` — cleanup_request is called
  from the scheduler's abort hot path (~3 sites) where bounding
  latency matters more than chasing one in-flight write.

The bounded acquires have a logged-warning fallback. Without that
bound a slow disk inside ``_write_safetensors_no_mx`` could deadlock
the scheduler's hot path — cleanup_all() is called from request abort
/ reset (scheduler.py:5400 / 5596 / 5737), not just shutdown. The
worst-case fallback behaviour is an orphaned file under the recreated
snapshot dir until next startup's constructor cleanup, which is the
exact same state the pre-fix code produced on every cleanup_all.

Counter pop on cleanup_request now only runs when the
``_writer_busy`` acquire succeeded. On the timeout fallback the
writer is still mid-item and may not yet have consulted
``_is_cancelled``; popping the counter there would defeat the late-
rename rescue that the docstring advertises as the timeout-
fallback safety net. The previous code popped unconditionally.

Counter bump itself is now additive (``get + count``) and skipped
entirely when ``count == 0``. Two distinct bugs that closes:

- Skip-on-zero: cleanup_request("X") for an rid with NO pending
  items previously wrote ``cancelled[X] = 0`` then popped on the
  acquired path. On the timeout fallback the pop never ran and the
  ``X: 0`` entry lingered for the process lifetime — every later
  ``save()`` under that rid (or any reuse of the string) was
  silently discarded by the writer's ``_is_cancelled`` gates,
  which check key membership not value > 0. Counter must only
  exist when there is at least one in-flight item to drain it.
  Regression: test_cleanup_request_no_pending_does_not_pin_counter
  _on_timeout (new).
- Additive: a re-entrant cleanup_request("X") for an rid that
  already has an in-flight cancellation must NOT overwrite the
  previous count. The writer's ``cleared_by_cleanup`` branch +
  ``_writer_busy`` lock together close the file-write race today
  (so no orphan file slips out), but the per-item dec_cancelled
  bookkeeping has to balance — overwriting drops remaining decs on
  the floor and on the next ``save()`` under the same rid the
  writer would see a non-zero counter from the earlier batch and
  silently discard the new item.

shutdown() now accepts ``cleanup=True`` so callers that want both
operations can express the cleanup-before-shutdown ordering in one
call instead of sequencing them manually. The cleanup_all() warning
at the top of the function catches misordered callers that still
call cleanup_all() AFTER shutdown — the writer no longer reacquires
``_writer_busy`` past the sentinel, so a post-shutdown cleanup
degrades to an in-memory-only clear.

Tests:
- ``test_cleanup_all_drains_queue`` — no longer sleeps; relies on the
  lock to guarantee ordering, runs deterministically.
- ``test_cleanup_all_blocks_until_writer_finishes_pinned_item`` —
  monkey-patches ``_write_safetensors_no_mx`` to block on an Event,
  pins the writer mid-item, asserts cleanup_all does NOT return until
  the writer releases. Without the lock this test fails
  deterministically rather than the original ~20% flake rate.
- ``test_cleanup_request_blocks_until_writer_finishes_pinned_item`` —
  symmetric for cleanup_request.
- ``test_cleanup_request_keeps_counter_on_timeout`` — regression for
  the bug where the timeout-fallback pop dropped the late-rename
  rescue's safety net.
- ``test_cleanup_request_timeout_drains_counter_on_writer_early_return``
  — the writer's cleared_by_cleanup early-return must still
  dec_cancelled or the counter pins the rid forever.
- ``test_cleanup_request_no_pending_does_not_pin_counter_on_timeout``
  (new) — regression for the skip-on-zero bug above. Pins the writer
  with an unrelated save, calls cleanup_request("never-saved-rid")
  past the 0.1s timeout, asserts the rid does NOT appear in
  ``_cancelled_requests`` AND that a subsequent save under the same
  rid is not silently discarded.
- ``test_shutdown_cleanup_true_runs_cleanup_before_setting_flag`` —
  pins the cleanup-before-shutdown ordering of the new
  ``shutdown(cleanup=True)`` path.

24/24 boundary-store tests pass.
Scheduler._preflight_memory_check, MemoryMonitor.estimate_prefill_peak_bytes,
and ProcessMemoryEnforcer._propagate_memory_limit existed for a while but
never actually fired in practice — five distinct gaps along the chain
short-circuited the guard on every request:

1. Scheduler.__init__ left self.memory_monitor=None. The monitor is built
   by cache/factory.create_full_cache_stack, but tiered_manager passes
   memory_monitor=None for paged-SSD-only mode (the only mode oMLX uses),
   so the scheduler's reference stayed None and _preflight_memory_check
   returned at the "if self.memory_monitor is None" gate.

2. _set_model_info_for_monitor expected num_hidden_layers /
   num_attention_heads / head_dim on the top-level model.config. VLM
   configs (e.g. Qwen3.6-VL, Gemma-4 family) nest those under
   config.text_config and put architectures / vision_config at the top
   level. The "if num_layers and num_kv_heads and head_dim" gate failed
   for every VLM and set_model_info() was never called.

3. ProcessMemoryEnforcer._get_hard_limit_bytes returned
   max(system_ram - 4 GiB, max_bytes). On a 48 GiB host with
   max_process_memory=40 GiB that's a 44 GiB threshold — large enough
   that a 110k-token Qwen3.6-VL prefill (28 GiB baseline + ~14 GiB
   SDPA-fallback peak) slipped past the rejection check. Metal then
   aborted at kIOGPUCommandBufferCallbackErrorOutOfMemory or, worse,
   the underlying Apple IOGPUFamily kernel bug (FB22091885,
   ml-explore/mlx#3186) fired and the host panicked.

4. _propagate_memory_limit was only called on memory-pressure-level
   transitions (ok→soft, soft→hard, etc.). EnginePool lazy-loads engines
   on first request, well after enforcer.start() has propagated to its
   then-empty entry map. The freshly-loaded scheduler kept
   _prefill_memory_guard at its False default until pressure first
   crossed soft — by which point the heavy request was already in MLX.

5. _propagate_memory_limit reached for getattr(entry.engine, "scheduler",
   None). Neither BatchedEngine nor VLMBatchedEngine exposes .scheduler
   at the wrapper level — both keep it at self._engine.engine.scheduler.
   getattr returned None for every real engine; the loop body did
   nothing. Existing tests masked this because they assigned .scheduler
   directly onto a MagicMock — a contract the real engines never satisfy.

Fix:

- Scheduler.__init__: build a real MemoryMonitor unconditionally and
  immediately call _set_model_info_for_monitor() so the SDPA-peak math
  is available from the first request. The 4 GiB max_kv_cache_memory
  ctor arg is required positive but unused on the prefill-peak path
  (which reads only head_dim / num_attention_heads /
  num_kv_cache_layers); _check_memory_pressure is a hardcoded ``pass``
  in paged-SSD-only mode and the eviction helpers
  (_evict_blocks_permanently / _evict_blocks_to_cold) now short-
  circuit on ``memory_monitor.eviction_enabled`` before they would
  call into the disabled estimate_blocks_to_free path. A future
  re-wiring for paged-SSD eviction would need to thread the real KV
  budget through.

- MemoryMonitor.eviction_enabled is a new public predicate so callers
  can branch on the eviction wiring without reaching into a private
  attribute and without provoking the RuntimeError that
  estimate_blocks_to_free emits when eviction is disabled.

- _set_model_info_for_monitor: ALWAYS prefer the LM sub-config when one
  of text_config / language_config / llm_config exposes
  num_hidden_layers OR the legacy n_layer, even if the top-level config
  also has one. Some VLM packs (older Gemma-3, Llava variants, HF auto-
  wrappers) surface num_hidden_layers at the top level referring to the
  *vision encoder* not the LM; accepting that would silently miscalibrate
  the SDPA-peak estimate by a constant factor. Probing the legacy
  ``n_layer`` alias keeps GPT-style nested configs from being shadowed
  by the same vision-encoder leakage. Falls back to top-level only
  when no sub-config has either field.

- _get_hard_limit_bytes: return self._max_bytes directly so the guard's
  threshold tracks the user-configured ceiling. Two downstream sites
  consume this value — _preflight_memory_check (admission rejection)
  AND the mid-prefill runtime abort in _do_external_prefill /
  _step_prefill_chunk. Both intentionally tighten in lockstep: a
  request whose estimated peak slips past preflight (e.g. cache state
  changes between scheduling and execution) is also aborted in-flight
  rather than being allowed to consume up to the system ceiling and
  potentially trigger the Apple kernel bug. The enforcer's separate
  soft/hard watermarks (_soft_bytes / _hard_bytes) still drive LRU
  eviction independently.

- _check_and_enforce: call _propagate_memory_limit() every poll
  iteration, not only on level changes. Newly-loaded engines now pick
  up the limits + guard flag within one poll interval of load. Iteration
  uses ``list(self._engine_pool._entries.values())`` to snapshot the
  view so a future EnginePool mutator moved to a worker thread cannot
  trigger ``RuntimeError: dictionary changed size during iteration``
  or — worse — silently miss an engine and leave it without the
  propagated guard, regressing the dead-guard bug this commit fixes.
  The cost is one cheap copy of value references.

- _propagate_memory_limit + new _resolve_scheduler(engine) helper:
  walk engine._engine.engine.scheduler when the direct attribute is
  absent so real BatchedEngine / VLMBatchedEngine wrappers receive
  the propagation. Test mocks that assign .scheduler directly still
  work. When the wrapper-chain traversal returns None, the
  propagation logs a WARNING naming the engine type (rate-limited
  to once per type per process) so the silent no-op failure mode
  that originally hid the dead guard is loud in CI / oncall.

- _MemoryLimitState frozen dataclass + Scheduler._memory_state: the
  four scheduler fields the enforcer publishes
  (memory_limit_bytes, memory_hard_limit_bytes,
  prefill_memory_guard, admission_paused) form a logical bundle —
  guard True implies hard_limit > 0, soft matches the configured
  ceiling, admission_paused tracks the enforcer's pressure level.
  The API hot-path reader (_preflight_memory_check) must see all
  four consistently; observing a partially-published combination
  (e.g. guard=True, hard_limit=0 mid-write) would let a too-large
  request slip past rejection and back the kernel-panic bug this
  commit closes.

  Bundling them into an immutable dataclass and publishing a single
  reference store makes the publication atomic regardless of Python
  memory model — including PEP 703 free-threading where per-attribute
  STORE_ATTRs are no longer GIL-serialized into a coherent order
  from another thread's perspective. The reader does ``state =
  scheduler._memory_state`` once and reads fields off the local
  snapshot.

  The four legacy attribute names (_memory_limit_bytes,
  _memory_hard_limit_bytes, _prefill_memory_guard,
  _admission_paused) are kept as @Property accessors backed by
  ``_memory_state``, so the dozens of test setattrs and the
  secondary readers in _do_external_prefill / _step_prefill_chunk /
  _schedule_waiting work unchanged. Each setter rebuilds the bundle
  via ``replace`` (frozen dataclass) so an individual assignment is
  still a single atomic ref store of the new bundle. The setters
  fall back to a fresh _MemoryLimitState if invoked before
  __init__ finishes — covers tests that build Schedulers via
  __new__ and set attrs one at a time.

Tests (108 pass across the touched suites):

- tests/test_scheduler_prefill_memory_guard.py — covers plain LLM
  config, VLM nested config, top-level-num_hidden_layers referring
  to vision encoder, language_config / llm_config aliases, legacy
  n_layer / n_head / n_embd fallback, VLM nested-via-legacy-n_layer
  (new regression for the widened sub-config probe), broken-config
  exception swallow, guard on/off, cached-vs-uncached.
- tests/test_process_memory_enforcer.py — includes regressions for
  the load-after-start race (test_check_and_enforce_propagates_every_poll),
  the wrapper-traversal path (test_propagates_through_batched_engine_wrapper),
  and the rate-limited unreachable-scheduler warning
  (test_unresolvable_scheduler_logs_warning_once). Two new tests pin
  the bundle contract:
    test_propagate_publishes_atomic_memory_state_bundle — asserts the
    writer publishes a frozen _MemoryLimitState with all four fields
    coherent, so the reader's single-snapshot read can never observe
    a mixed (guard=True, hard_limit=0) combination.
    test_memory_state_bundle_matches_guard_off_path — symmetric for
    the OFF path.
- tests/test_memory_monitor.py — adds two cases pinning the public
  ``eviction_enabled`` predicate so Scheduler's dormant-eviction
  short-circuits don't silently flip behavior on a future refactor.

Verified end-to-end against Qwen3.6-35B-A3B-oQ6 on a 48 GiB Mac16,8
with max_process_memory=40 GiB: a 110k-token prefill that previously
either OOM-aborted the Python process or triggered the Apple
IOGPUFamily kernel panic is now rejected by _preflight_memory_check
before reaching MLX. Broader concurrent / cache-replay torture cycled
memory pressure soft↔ok throughout but never crossed hard.
When the prefill memory guard rejects a request, surface it as a proper
HTTP 413 with the diagnostic message in the body, instead of an
unhandled 500 + truncated chunked read.

Two pieces are needed because the chat/completions path wraps the
response body in a StreamingResponse to send SSE keepalive bytes during
long prefill, and starlette emits http.response.start (status 200)
before iterating the body generator. A typed exception thrown from
within that body lands as "Caught handled exception, but response
already started" — the registered 413 handler is correctly invoked but
can no longer change the status code, so the client sees an incomplete
chunked read.

1. Typed-exception path.

   Scheduler.add_request now runs _preflight_memory_check synchronously
   and raises PrefillMemoryExceededError if rejection is warranted —
   before the request is appended to self.waiting. Register an
   @app.exception_handler(PrefillMemoryExceededError) that returns
   HTTP 413 with the OpenAI-style {"error": {"message": ...}} body on
   /v1/* routes and a plain {"detail": ...} elsewhere. The async
   re-check in _schedule_waiting stays as a race-safety net. The
   docstring on add_request now spells out that admission preprocessing
   (tokenisation, prefix-cache lookup, block-table acquisition,
   SpecPrefill scoring) HAS run by the time the typed exception fires
   — preprocessing state is rolled back on the raise path so the
   rejection doesn't leak resources.

2. Server-side preflight before StreamingResponse.

   Add Scheduler.preflight_or_raise(num_prompt_tokens, cached_tokens=0)
   that runs the token-count form of _preflight_memory_check and
   raises PrefillMemoryExceededError on rejection (with a
   logger.warning so observability matches the async path). Split out
   _preflight_memory_check_tokens(num_prompt_tokens, cached_tokens) so
   the existing Request-shaped check still works.

   Add engine.preflight_chat(messages, tools=None, **kwargs) and
   engine.preflight_completion(prompt, **kwargs) on BatchedEngine and
   VLMBatchedEngine. Both apply the chat template, tokenize, traverse
   the wrapper chain to reach the scheduler, and call
   preflight_or_raise. If the wrapper chain resolves to None (partial
   init failure) the preflight no-ops with a logged warning rather
   than silently skipping the safety check. The warning is routed
   through ``_warn_scheduler_unreachable_once`` in engine.base —
   rate-limited per (engine_class, method) pair per process so a
   misconfigured engine doesn't spam logs at request rate, only once
   at first occurrence (enough to alert CI / oncall).

   The VLM ``preflight_chat`` strips image content-parts via
   ``extract_images_from_messages`` BEFORE applying the chat template,
   mirroring the real ``_process_chat_messages`` flow. Without this,
   modern HF chat templates (Qwen2.5-VL, Gemma-Vision, Llama-3.2-Vision)
   render image content-parts as literal placeholder strings inline
   with the text — the tokenized prompt then already contains image-
   placeholder tokens AND the per-image ``_IMAGE_TOKEN_UPPER_BOUND``
   budget is added on top, double-counting and producing spurious
   413s on borderline image-bearing requests that the real chat path
   would have admitted. After the strip, the per-image budget is the
   only source of image-token accounting and the upper bound is sharp.

   The VLM preflight_chat keeps the 1280-tokens-per-image budget
   (the high end of what Qwen-VL / Gemma-Vision packs expand each
   image to). Over-estimates somewhat for small images (false-positive
   413s for borderline-and-image cases) but never under-counts, which
   is the property the guard needs to stay safe against the Apple
   IOGPUFamily panic path. VLM ``preflight_chat`` also routes tools
   through ``convert_tools_for_template`` to match the real chat path
   — without conversion, Pydantic ``ToolDefinition`` callers would
   hit the template's TypeError-retry fallback and silently drop
   tools, miscalibrating the token count.

   Wire the preflights into every prompt-bearing route handler BEFORE
   the StreamingResponse return:

     POST /v1/chat/completions  (OpenAI standard)
     POST /v1/completions       (OpenAI legacy)
     POST /v1/messages          (Anthropic)
     POST /v1/responses         (OpenAI Responses)

   ``_is_api_route`` keeps its path-prefix-only implementation. The
   server is currently mounted at root (no sub-app prefix); a comment
   on the helper records the assumption so a future ``app.mount(
   "/api", ...)`` deployment can move it to scope-based matching.

Conservative-by-design choices documented for follow-ups:

- The server preflight passes cached_tokens=0 (does not thread the
  prefix-cache lookup forward). Under a hot prefix cache an admission
  that would have fit may be rejected with 413; the synchronous
  in-add_request recheck inside the scheduler runs after the prefix-
  cache resolution and can still accept it, but the early rejection
  costs the caller a retry. Tracked for a future PR that pre-resolves
  the cache key.
- The estimator is prompt-only; decode growth from ``max_tokens`` is
  not modelled. The tightened in-flight abort threshold (now equal to
  max_bytes — see the memory-guard commit) catches the runtime
  overflow if it happens, but the request will see a stream-shape
  error rather than upfront 413. Acceptable given decode growth on
  borderline prompts is unusual.

Tests:

- tests/test_server_prefill_memory_handler.py — 4 cases:
  - handler shape on /v1/* (OpenAI ``{"error": ...}`` wrapper) and
    non-/v1/ (plain ``{"detail": ...}``);
  - end-to-end ``/v1/responses`` regression hitting the real route
    with a force-rejecting preflight, asserting the production
    handler returns 413 with the OpenAI body shape (catches any
    future route that constructs a StreamingResponse without
    awaiting preflight first).
- tests/test_engine_preflight.py — 20 cases: scheduler raise path
  (within budget / over budget / guard disabled / fully cached),
  BatchedEngine + VLMBatchedEngine preflight_chat with a stubbed
  wrapper chain, preflight_completion, the VLM image-token budget,
  the new VLM strip-images-before-template behaviour (regression for
  the double-count bug), Pydantic-tools conversion through
  convert_tools_for_template, and the wrapper-chain-unreachable
  warning log.

Tokenizer hardening: preflight_chat and preflight_completion
on both engines now catch exceptions from tokenizer.encode (BPE
malformed input, HF Rust "Already borrowed", UnicodeDecodeError on
exotic content) and log + skip the memory check rather than letting
them surface as a new 500. The real chat / completion path will hit
the same error and route it through the existing handler chain, so
the user-visible response shape stays consistent with the
pre-preflight behaviour for these cases. A future PR can map raw
tokenizer errors to 400 for both paths.

Verified end-to-end against Qwen3.6-35B-A3B-oQ6 on a 48 GiB host: a
110k-token POST /v1/chat/completions now returns HTTP 413 in 0.13 s
with the full diagnostic body ("Prefill would require ~43.47 GB peak
(current 27.91 GB + KV+SDPA 15.56 GB) but limit is 40.00 GB."), and
the request never enters MLX prefill so the underlying Apple
IOGPUFamily kernel bug stays cold.
Closes a long-standing gap noticed during the upstream merge: the class
ships full to_dict/from_dict but had no tests, so field drift would have
gone unnoticed (cf. upstream jundot#1286 which had to retrofit the same
coverage for SchedulerSettings/MemorySettings).
…stream

Adopts the keyword-only `cached_tokens` parameter that upstream PR
jundot#1326 settled on (commit efcc858). Same behavior — same correctness
fix on the SDPA span — just a different API shape:

  before: (new_tokens, cached_tokens, chunk_size)         # required pos
  after:  (new_tokens, chunk_size, *, cached_tokens=0)    # kw-only

Done now so the next upstream merge auto-merges the function
definition line instead of conflicting on it. The body still keeps
the local correctness extras (early return on new_tokens<=0, negative
cached_tokens clamp, eff_chunk floor) that upstream lacks.
Closes the four ModelSettings fields that had no references anywhere
in tests/: turboquant_kv_bits, turboquant_skip_last, vlm_mtp_draft_model,
vlm_mtp_draft_block_size. All four are wired into admin/routes.py and
classified in model_profiles.py, so silent drift between defaults and
the routes layer would have been invisible.

Pattern matches the existing per-field convention in this file:
defaults, to_dict/from_dict roundtrip, and (for Optional fields)
excluded-when-none.
Verified unused: zero instantiations across the entire git history
(`git log -S 'TieredCacheManager('` returns only the initial-commit
declaration), zero imports outside the module's own __init__ re-export,
zero test references. The class was a planned coordinator between
PagedCacheManager / BlockAwarePrefixCache / PagedSSDCacheManager /
MemoryMonitor — Scheduler.__init__ does that wiring directly, making
the abstraction redundant.

Removes 353 lines plus the re-export from omlx.cache.__all__.
omlx/api/mcp_routes.py was wired into server.py at module init but had
zero direct test coverage — test_mcp_*.py files exercised the manager
and executor layers but not the FastAPI handlers, leaving the request
parsing, no-manager fallbacks, and the tool/tool_name alias
(upstream PR jundot#1285) unguarded.

15 tests via TestClient covering all four entry points:
- set_mcp_manager_getter wiring + None propagation
- GET /v1/mcp/tools: empty-list short-circuit, namespaced full_name
  serialization
- GET /v1/mcp/servers: enum-to-string flattening, error-field passthrough
- POST /v1/mcp/execute: 503 when unconfigured, success / handled-error
  paths, tool-vs-tool_name alias regression guard, default-arguments
  fallback, 422 on missing tool name

The autouse fixture resets the module-global `_get_mcp_manager` callback
between tests so state can't leak across cases.
omlx/api/rerank_models.py defines the request/response contract for
the /v1/rerank endpoint (Cohere/Jina-compatible) but had no schema
tests — only the engine layer was exercised by test_reranker_*.py.
Field renames or default changes would have silently broken clients.

24 tests covering all four classes:
- RerankRequest: text + multimodal (dict) query/documents shapes,
  defaults (return_documents=True, top_n=None), validation rejection
  on missing required fields, JSON round-trip with return_documents=False
- RerankResult: document=None path, multimodal document passthrough
  (preserves image data-URIs), required field enforcement
- RerankUsage: required total_tokens
- RerankResponse: auto-id 'rerank-' prefix + 8 hex chars (Cohere
  clients filter on this), per-instance uniqueness, explicit-id
  override, optional usage, JSON round-trip with multiple results
omlx/admin/oq_manager.py is instantiated into server_state at startup
and drives all admin-panel quantization workflows, but had no direct
tests — test_oq.py covers the omlx/oq.py library underneath, not the
async task manager on top.

51 tests across the synchronous and validation surface:
- QuantStatus enum + str-inheritance
- QuantTask.to_dict shape, progress rounding, status enum.value
- _dir_size and _format_size helpers (parameterized)
- _phase_label including the quantizing_eta|c|t|eta parser edge cases
  (zero total, non-numeric parts)
- OQManager.__init__ defaults and the first-dir-wins output_dir rule
- update_model_dirs refresh + empty-update no-op invariant
- get_tasks / is_quantizing across active vs terminal statuses
- remove_task: unknown / refuses-active / clears _cancelled set
- cancel_quantization rejection paths (unknown, non-active)
- start_quantization validation: invalid oq_level, invalid dtype,
  missing model dir, missing config.json, output collision, duplicate
  active task, and that completed tasks don't block resubmits
- list_quantizable_models: empty / skip-malformed / MTP detection /
  is_quantized flag / dedup across dirs
- shutdown: no-op when idle, cancels every active task

The full _run_quantization / _estimate_progress lifecycle and the
cooperative-cancellation flow are left for a future pass — they need
end-to-end mocking of quantize_oq_streaming + Metal cleanup, which is
brittle and not the highest-value gap right now.
Extends test_oq_manager.py with the lifecycle paths that needed
quantize_oq_streaming mocking — left as a follow-up in the previous
commit because they require threading-coordinated stubs.

13 new tests across four classes:
- TestRunQuantizationHappyPath: completion fields (status, progress,
  output_size, started_at/completed_at ordering), sync and async
  on_complete callbacks, on_complete exception isolation (a buggy
  callback must not flip status to FAILED), progress-callback wiring
- TestRunQuantizationFailure: exception → FAILED with error captured,
  partial output cleaned from disk
- TestRunQuantizationPreCancel: _cancelled set before semaphore entry
  → quantize_oq_streaming never invoked (shutdown-race guard)
- TestCancelCooperativeExit: full cancel via the progress-callback
  path (the design upstream chose over hard-cancelling the asyncio
  wrapper — see the comment block in cancel_quantization), plus the
  cancel-during-LOADING edge case where no progress_cb has fired yet
- TestEstimateProgress: unknown-task no-op, parent task's finally
  clause cancels the progress estimator on both success and failure

Uses threading.Event coordination in fake_quantize so the cooperative
cancel path exits as soon as the flag is set, not after fixed delays —
the full test class runs in <0.3s.
omlx/mcp/config.py is consulted on every server startup but had no
direct tests — test_mcp_types.py covered the MCPConfig / MCPServerConfig
dataclasses underneath, not the file discovery, JSON/YAML loading, or
the validate_config schema checks layered on top.

38 tests across eight classes:
- validate_config input: non-dict / empty / servers-must-be-dict /
  the falsy-non-dict-servers quirk (`[]` silently falls through to
  mcpServers) / entry-must-be-dict with server name in the message
- validate_config server loading: stdio happy path, key-as-name auto
  injection, key-overrides-explicit-name, unknown-kwarg surfaces a
  ValueError with the offending server name, the Claude Desktop
  `mcpServers` alias for drop-in compatibility, `servers` precedence
  over `mcpServers` (or-chain semantics), empty mcpServers fall-through
- validate_config globals: max_tool_calls valid/zero/negative/non-int
  plus the bool-is-int quirk (documented), default_timeout int/float/
  zero/negative/string rejection
- Path discovery: explicit existing path, missing explicit path →
  FileNotFoundError, tilde expansion, OMLX_MCP_CONFIG honored,
  env-var-missing-file falls through with a warning rather than
  crashing the server startup, search-path order, all-paths-missing
  returns an empty MCPConfig (not None, not error)
- File formats: JSON load + malformed-JSON propagation, YAML load
  via .yaml and .yml (gated on PyYAML availability)
- create_example_config: valid JSON, round-trips through validate,
  showcases multiple transports

All discovery tests use an `isolated_env` fixture that monkeypatches
CONFIG_SEARCH_PATHS to [], clears the env var, and chdirs to tmp_path
so real ~/.config/omlx/mcp.json or stray ./mcp.json files can't leak
into the test.
omlx/optimizations.py is a thin hardware/MLX status helper consumed by
the admin dashboard. The re-exported HardwareInfo / detect_hardware /
get_total_memory_gb were already covered by test_utils_hardware.py;
this commit pins the dict shape produced by get_optimization_status
and the flash-attention probe.

10 tests:
- Re-exports are identity-equal to the canonical hardware symbols
  (otherwise external scripts importing from omlx.optimizations would
  silently get stale copies); __all__ matches the documented surface
- get_optimization_status returns three top-level keys (hardware,
  mlx_memory, mlx_lm_features) with the expected sub-key sets
- mlx_memory counters are non-negative ints sourced from mx.get_*_memory
- mlx_lm_features static strings pinned verbatim — the dashboard shows
  these verbatim, so a typo here would surface in production
- flash_attention reports 'built-in' on real MLX, 'not available' when
  mx.fast lacks scaled_dot_product_attention (simulated via
  patch.object(mx, 'fast', MagicMock(spec=[])))
- active_bytes actually reflects MLX state — allocating a 4 MiB array
  must register in the delta, proving the value isn't hardcoded
Closes two remaining gaps with real production impact:

omlx/models/base_model.py — mean_pooling and normalize_embeddings are
the pooling/L2-norm primitives used by omlx/models/xlm_roberta.py (the
reranker model). A silent change to either would corrupt embedding
output. 12 tests:
- BaseModelArgs/BaseModelOutput dataclass shape (defaults + full fields)
- mean_pooling: uniform mask = simple mean, partial mask excludes
  padded positions (the load-bearing invariant — pre-mask sums would
  let padding tokens taint embeddings), all-zero-mask guard prevents
  NaN/Inf (the clip(..., a_min=1e-9) defense), batch-dim passthrough,
  fp16 dtype propagation via the mask.astype(hidden_states.dtype) cast
- normalize_embeddings: unit-norm output, axis=-1 only (catches
  wrong-axis bugs that would silently destroy similarity comparisons),
  higher-rank shape preservation, idempotence

omlx/eval/mbpp.py — _extract_code is the regex/heuristic parser that
pulls Python source out of MBPP-task LLM responses before the
subprocess executor runs them. The branch coverage matters: 50% of
runs use unspecified-language fences. 15 tests:
- Fenced blocks: ```python wins over ```, multiline, whitespace strip
- Heuristic line scan: def/class/import/from/# all trigger code-region
  start; everything after is included (no end detection by design)
- Fallback: no markers → whole response stripped, empty/whitespace
  inputs, lone print() statement documents the known limitation
…_block

Upstream 1b666af added the ssd_write_drops counter at three drop sites,
including a pre-tensor-extraction guard in save_block that returns False
when the write queue is already full. The earlier inline-LRU-unlinks
change (b9d7161) argued the short-circuit was redundant because
eviction no longer contended for the queue, and the rebase resolution
dropped it. But eviction was never the only path to saturation: when
the writer itself can't keep up with sustained saves, the short-circuit
avoids running _enforce_size_limit_for_new_block and the GPU tensor
extraction only to drop the block at the put step a few hundred lines
later.

Gated on `not self._hot_cache_enabled`. In hot-cache mode writes are
deferred to LRU spill via _enqueue_ssd_write, which has its own
timeout-put + drop accounting and shouldn't double-count here.

Tests:
- restore test_ssd_write_drops_increments_on_cold_store_preflight which
  patches _write_queue.full() to True
- reword test_ssd_write_drops_increments_on_cold_store_late_exception
  to clarify it now covers the put-after-preflight race
- test_ssd_write_drops_increments_on_hot_eviction_queue_full now
  patches put (the actual call in _enqueue_ssd_write) instead of
  put_nowait, matching the local timeout-based put semantics
The 0e6cbc5 commit (now part of the rebased local history) tightened
_get_hard_limit_bytes to always return max_bytes, arguing that auto
mode's system_ram - 4GB headroom let Qwen3.6-VL head_dim=256 prefills
burst past the cap before Metal aborted. Upstream acd0533 has since
landed an adaptive prefill throttle (_prefill_safe_zone_ratio /
_prefill_min_chunk_tokens) that shrinks the per-chunk transient as
usage approaches the cap — the original failure mode is now bounded
by the throttle before reaching Metal.

Restore upstream's two-branch behavior:
- user_explicit_max=True -> max_bytes (user value IS the ceiling)
- auto                   -> max(system_ram - 4GB, max_bytes)

Tests:
- remove the now-obsolete test_hard_limit_equals_max_bytes added by
  0e6cbc5 (coverage is preserved by upstream's
  test_hard_limit_honors_user_explicit_max and
  test_hard_limit_auto_mode_still_uses_system_minus_4gb)
- enforcer fixture now sets user_explicit_max=True so propagation
  tests can assert the exact hard_limit value without entanglement
  with the host's system memory
upstream acd0533 added max_process_memory_is_explicit,
prefill_safe_zone_ratio, and prefill_min_chunk_tokens to
MemorySettings.to_dict() but the to_dict test was not updated (last
touched in jundot#1286 on 2026-05-19, predating acd0533 by a week). The
test was already failing on upstream/main — this just closes the gap.
…butes

The bundle (introduced in b6a69c4) wrapped the four memory-guard fields
(_memory_limit_bytes, _memory_hard_limit_bytes, _prefill_memory_guard,
_admission_paused) in a frozen dataclass so ProcessMemoryEnforcer could
publish them via a single reference store. The motivation was PEP 703
free-threading, where per-attribute STORE_ATTRs are no longer
GIL-serialized into a consistent order from another thread's view.

Under current CPython the GIL already serializes the per-attribute
writes within a single enforcer poll, so the hot-path reader
(_preflight_memory_check_tokens) sees a coherent (guard, hard_limit)
pair without the bundle. Keeping the bundle traded non-trivial
complexity (frozen dataclass + four property/setter pairs +
``dataclasses.replace`` rebuilds on every test setattr) for PEP 703
prep we don't need today. Drop it; revisit if/when we ship nogil.

- omlx/scheduler.py: remove _MemoryLimitState dataclass, remove the
  four properties and _current_memory_state helper, init the four
  fields as plain attributes (matches upstream).
- omlx/process_memory_enforcer.py: _propagate_memory_limit writes the
  four fields directly instead of building a new bundle.
- tests/test_process_memory_enforcer.py: drop _MemoryLimitState
  imports and bundle setup; collapse the two redundant
  bundle-atomicity tests into one guard-off test.

Net: -250 / +51 lines.
The post-rebase add/add merge with upstream 1010fd3 left two classes
covering update_model_dirs:

- TestUpdateModelDirs (local) — bare tmp_path setup; covered the
  reorder-changes-head case and the empty-list defensive case.
- TestOQManagerUpdateModelDirs (upstream) — fp_model_dir fixture
  setup; covered async list_quantizable_models() integration and the
  head-tracking case.

They overlapped on the head-tracking assertion. Fold them into a
single TestUpdateModelDirs with three tests:

- test_picks_up_added_dir (async integration, from upstream)
- test_output_dir_tracks_primary_dir (extended with the local
  test's _model_dirs == [head, ...] order assertion)
- test_update_to_empty_leaves_old_output_dir (from local)

Net: -7 tests removed, -2 new in the consolidated class, +1 stronger
assertion on the head-tracking test.
…y_limit

The docstring's "four guard fields are written as direct attributes /
on current CPython the GIL serializes per-attribute STORE_ATTRs"
paragraph was contextual to the now-removed _MemoryLimitState bundle
(see 675cae1) — when the bundle existed, the paragraph explained why
the bundled single-store was equivalent under today's GIL. With the
bundle gone the paragraph is just a CPython explainer that adds noise
without informing the reader.
…ream

Upstream ``tests/test_integrations.py::TestIntegrationSettings`` covers
defaults, basic to_dict, full-dict and empty-dict from_dict. The local
class had near-exact duplicates of three of those (``test_defaults``,
``test_from_dict_full``, ``test_from_dict_empty``) — drop them.

Keep the genuinely additive coverage:
- ``test_to_dict_defaults`` — pins the exact to_dict shape, catching
  the kind of field-added-but-test-not-updated regression that bit
  MemorySettings.to_dict (see 81dc2d5).
- ``test_to_dict_custom`` — stricter than upstream's (sets all seven
  fields, asserts the full dict).
- ``test_from_dict_partial`` / ``test_from_dict_explicit_null_overrides_default``
  / ``test_round_trip`` — partial-dict fallback, null-override
  semantics, identity round-trip; not covered upstream.

Net: -28 lines, three fewer merge-conflict surfaces with upstream's
copy of the same tests.
# Conflicts:
#	omlx/engine/base.py
#	omlx/process_memory_enforcer.py
#	omlx/scheduler.py
#	tests/test_process_memory_enforcer.py
#	tests/test_settings.py
# Conflicts:
#	packaging/build.py
#	pyproject.toml
…ping not yet populated

_emit_prefill_boundary_snapshot routed through a uid that doesn't
exist yet. At prefill time the request has not been inserted into
BatchGenerator (that happens in _insert_prefilled_request *after*
prefill completes), so request_id_to_uid has no entry for it and the
lookup defaults to -1. _on_prefill_boundary_snapshot then does
uid_to_request_id.get(-1) -> None -> silent return, dropping every
boundary snapshot.

For ArraysCache / GDN / hybrid models that meant every non-last
cached block stored a placeholder (mx.zeros((1,)), mx.zeros((1,)))
instead of the real recurrent state. The next identical-prefix
request matched the cache by hash, found a placeholder in the last
matched block, and the lookup-side guard at prefix_cache.py:1986
correctly rejected the cache as unusable — but the resulting
"Request will reprocess from scratch" branch then re-prefilled the
entire prompt. Visible symptom: re-uploading the same large prompt
(e.g. a chat continuation) re-prefilled all 90k tokens instead of
the few new ones.

Pass request_id through directly. _on_prefill_boundary_snapshot was
already keyed by request_id internally — the uid round-trip was
pointless. The other prefill-side caller, _maybe_capture_boundary_
snapshot (called during generation from a path where uid IS valid),
writes snapshots inline without going through this helper, so it
was unaffected. That asymmetry explains the bimodal pattern in
production logs: short-prompt + long-generation requests captured
20+ intermediate snapshots cleanly, while long-prompt + short-
generation requests captured 0 (since all block boundaries were in
prefill, where the uid lookup failed).

Verified end-to-end via observability instrumentation on a 90k-token
prompt: every SNAP_EMIT was followed by uid=-1 SNAP_SKIP before the
fix; every SNAP_EMIT is followed by SNAP_OK after. The decisive
proof was a subsequent identical-prefix request that started prefill
at the cached boundary (token 51200) instead of token 0.

Updated existing tests to reflect that no uid mapping is required.
Added a regression test that mirrors production state at prefill
time (request_id_to_uid deliberately unset).
@jundot
Copy link
Copy Markdown
Owner

jundot commented May 28, 2026

Thanks for the diagnosis. The request_id_to_uid timing explanation is exactly right, and the short-prompt+long-gen vs long-prompt+short-gen bimodal pattern matches what I've been seeing on Qwen3.6-A3B sessions. Cherry-picked the boundary-snapshot fix onto main as 9407468 with your authorship preserved.

I left the other 23 commits on this branch alone. They overlap with your narrower PRs (#1447, #1450, #1451, #1452, #1453, etc.), so I'd rather review and merge those independently to keep the main history readable. Closing this PR; the narrow ones stay open. Apologies if that's not what you intended. Happy to revisit if you'd prefer the bundle path.

@jundot jundot closed this May 28, 2026
@cfbraun
Copy link
Copy Markdown
Contributor Author

cfbraun commented May 28, 2026

Thanks for the merge of 9407468 — and for the careful read on the bimodal pattern.

You're right about the fat-PR shape — that's my mistake. I branched pr/prefill-snapshot-uid-fix off my local main, which is ~24 commits ahead of upstream because it carries the other in-flight work. GitHub then computed the cross-fork PR diff as all 24 commits, not just my one new commit. From your side it correctly looked like a bundle. I'll branch future single-fix PRs off upstream/main directly to keep the diff narrow.

Happy to leave the narrower PRs (#1446#1453, #1458) on their independent tracks per your suggestion. No need to revisit. Thanks again.

@cfbraun cfbraun deleted the pr/prefill-snapshot-uid-fix branch May 28, 2026 06:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants