fix(cache): boundary snapshots dropped during prefill because uid mapping not yet populated by cfbraun · Pull Request #1471 · jundot/omlx

cfbraun · 2026-05-27T16:22:11Z

Summary

Prefill-side boundary snapshots were being silently dropped for ArraysCache / GDN / hybrid models, causing every non-last cached block to store a placeholder and every identical-prefix re-upload to re-prefill from scratch.

_emit_prefill_boundary_snapshot (scheduler.py:2549) routed through:

self._on_prefill_boundary_snapshot(
    self.request_id_to_uid.get(request.request_id, -1),  # → -1
    snapshot_cache,
    total_tokens,
)

But request_id_to_uid is only populated when the request is inserted into BatchGenerator via _insert_prefilled_request, which runs after prefill completes. During prefill the lookup defaults to -1. _on_prefill_boundary_snapshot then does:

request_id = self.uid_to_request_id.get(uid)  # uid=-1 → None
if request_id is None:
    return  # silent

→ every boundary-snapshot call during prefill was a no-op.

For sliceable cache types (KVCache, BatchKVCache, etc.) this is invisible — those layers don't need snapshots. For ArraysCache / GDN layers, every non-last block in prefix_cache.store_cache stores a (mx.zeros((1,)), mx.zeros((1,))) placeholder when no snapshot is available. The next identical-prefix request matches blocks by hash, finds a placeholder shape (1,) in the last matched block, and the rejection at prefix_cache.py:1986 correctly bails — but the resulting "Request will reprocess from scratch" branch re-prefills the entire prompt.

User-visible symptom

Hybrid-architecture models (e.g. Qwen3-Next / Qwen3.6-A3B) re-prefilling the full prompt on every chat continuation, even when only one or two new turns were appended. Log signature:

ArraysCache layer 0: partial prefix match detected (placeholder in last matched block).
Rejecting cache to prevent stale GDN state. Request will reprocess from scratch.

Repeating every ~few minutes under normal chat use.

The fix

Pass request_id through directly. _on_prefill_boundary_snapshot already keyed everything by request_id internally — the uid round-trip was pointless.

self._on_prefill_boundary_snapshot(
    request.request_id,
    snapshot_cache,
    total_tokens,
)

And the helper's signature changes from uid: int to request_id: str, dropping the now-unused uid_to_request_id lookup.

The other prefill-side caller, _maybe_capture_boundary_snapshot (called during generation from _step after the request has been inserted, where uid IS valid), writes snapshots inline without going through this helper, so it was unaffected. That asymmetry explains the bimodal pattern in production logs: short-prompt + long-generation requests captured 20+ intermediate snapshots cleanly (boundaries crossed during generation), while long-prompt + short-generation requests captured 0 (all boundaries in prefill, where the uid lookup failed).

Diagnosis evidence

End-to-end repro on a Qwen3.6-35B-A3B (GDN layer 0) instance, 90k-token prompt:

Before fix (every block boundary):

SNAP_EMIT request=d6655b20-... tokens=2048 non_none_layers=30/40   ← emit fires
SNAP_SKIP tokens=2048 uid=-1 reason=no_request_id_for_uid           ← dropped
SNAP_EMIT request=... tokens=4096 non_none_layers=30/40
SNAP_SKIP tokens=4096 uid=-1 reason=no_request_id_for_uid
...
Using boundary cache snapshot for ...: storing 88064/90919 tokens (skipping trailing partial block, 0 intermediate snapshots)

After fix:

SNAP_EMIT request=1a58e4f5-... tokens=2048 non_none_layers=30/40
SNAP_OK request=1a58e4f5-... tokens=2048 saved_to_ssd=True
...
Using boundary cache snapshot for 1a58e4f5-...: storing 90112/90919 tokens (skipping trailing partial block, 43 intermediate snapshots)

Subsequent identical-prefix request started prefill at token 51200 (previous cached boundary) instead of 0 — no rejection line.

Test plan

Updated test_prefill_boundary_snapshot_records_rotating_cache to NOT pre-populate request_id_to_uid / uid_to_request_id — matches actual prefill state.
Updated test_prefill_boundary_snapshot_ignores_non_boundary_token_count similarly.
Added test_emit_prefill_boundary_snapshot_persists_before_uid_assignment — pins the regression: calling _emit_prefill_boundary_snapshot while the request has not been inserted into BatchGenerator must still record the snapshot.
pytest tests/test_scheduler.py -k "prefill_boundary or emit_prefill_boundary" — 3 passed.
Confirmed end-to-end via instrumentation on a real 90k-token chat continuation against Qwen3.6-35B-A3B-UD-MLX-4bit.

When the SSD cache hits its size limit under sustained save load, every save_block triggers eviction. The previous design routed each evicted block's unlink as ("unlink", path) onto the same _write_queue that carries pending writes — so eviction could never free queue capacity, it could only add more work to the queue. Combined with the pre-eviction _write_queue.full() short-circuit, this kept the cache permanently full once the queue saturated. Inline the unlinks on the eviction-calling thread instead. Eviction typically removes a single block per save (loop in evict_until_size stops as soon as size drops below target), so this is one syscall per save, not N. The deferred-unlink justification ("avoid blocking the inference thread with N file delete syscalls") doesn't actually materialise in steady state — and inlining removes the bounded-queue contention entirely. With unlinks off _write_queue the pre-eviction full() short-circuit is no longer needed (eviction never made the queue worse). The eviction call at the top of save_block — which ran before tensors were built and so could only pass a 1 MiB placeholder size — is also dropped; the single eviction call at save_block:1339 (post-tensor-build) with the real estimated_size is enough. Bounded inline-eviction burst. The ENOSPC recovery path invalidates the disk-usage snapshot, which shrinks _get_effective_max_size sharply on the next save — evict_until_size can then return hundreds of entries at once and the inline-unlink loop would stall the inference thread on a syscall storm. Cap the burst at _MAX_INLINE_UNLINKS_PER_SAVE = 32 and reinsert the deferred metadata into the index so subsequent saves drain the remainder; bounds per-call latency at the cost of taking multiple saves to fully reconverge. Genuine writer saturation (the writer can't keep up with sustained inserts) remains possible. save_block and _enqueue_ssd_write now call _write_queue.put(item, timeout=0.25) instead of put_nowait so a transient burst rides over short writer-backlog windows; only sustained saturation drops the block. The 250 ms budget is well below human- perceptible latency and typically covers one or two writer iterations on a healthy SSD. The blocking shutdown-flush path keeps the 500 ms timeout. Estimate computes the actual safetensors metadata-JSON length plus a 256 B JSON-envelope margin and 128 B/tensor for the per-tensor header, replacing the previous fixed 1 KiB metadata constant. Deep-layer models with large layer_meta_states JSON were already capable of exceeding the 1 KiB budget; the new estimate tracks the real value so the index can't briefly overshoot _effective_max after a write. ENOSPC failures invalidate the 30 s disk_usage snapshot under self._lock so a writer-thread invalidation cannot interleave with an inference-thread read-check-write and produce a fresh-value-stale- timestamp pair. The invalidation only protects the NEXT round of save_block calls — in-flight saves that already passed the quota gate are still queued and may re-ENOSPC. The error message reflects that. Logged at ERROR because save_block has already returned True / incremented _stats["saves"] by the time the writer fails, so the slot is silently lost; surfacing this at ERROR makes the drift visible to operators. Enqueue-vs-persist counters split. _stats["saves"] continues to count blocks that passed the quota gate (incremented in save_block on the inference thread), and a new _stats["saves_persisted"] counter is incremented after the writer's atomic rename succeeds — surfaced as PagedSSDCacheStats.saves_persisted. saves - saves_persisted is the steady-state in-flight depth; dashboards that need true durability should switch to the new field. save_rate's existing formula (saves / (saves + errors)) keeps backwards compatibility. Tests (tests/test_paged_ssd_cache.py::TestEvictionAndQueueSaturation): - test_eviction_does_not_enqueue_unlink_tasks: forces eviction and asserts no ("unlink", ...) entries leak into _write_queue. - test_save_uses_timeout_not_put_nowait: verifies save_block calls _write_queue.put with a positive timeout (regression for the prior put_nowait that dropped on first burst). - test_enospc_invalidates_disk_usage_snapshot: an ENOSPC writer failure clears _disk_usage_cache so the next effective-max call recomputes against true free-space. - test_saves_persisted_increments_only_after_rename: pins the documented enqueue-vs-persist semantic of the two counters. - test_inline_eviction_burst_is_capped: a forced large-burst eviction is bounded at _MAX_INLINE_UNLINKS_PER_SAVE and reconverges across subsequent calls. 82/82 paged_ssd_cache tests pass.

@patch

Ships xgrammar==0.2.0 + apache-tvm-ffi==0.1.11 in the DMG (~+37 MB) instead of skipping structured outputs. xgrammar's torch>=1.10.0 runtime dep is load-bearing only at import time; omlx/_torch_stub.py satisfies those references (dtypes, Tensor aliases, cuda/version/nn submodules) without the ~500 MB torch wheel. oMLX's hot path is unchanged — bitmasks are numpy int32, applied via the existing MLX kernel. Wiring: - omlx/_torch_stub.py: install() is a no-op when real torch is found. The real-torch path now also leaves TVM_FFI_DISABLE_TORCH_C_DLPACK untouched — that env var is only set when the stub itself is being installed, so a user with real torch can still use tvm-ffi's torch-C-DLPack fast path. The version-drift check is hoisted out of the install lock so a slow cold-disk xgrammar import doesn't block concurrent callers behind the lock. - omlx/{api/grammar,server,admin/routes}.py: install before xgrammar - tests/conftest.py: install at collection so @patch decorators resolve - packaging/build.py: idempotent _install_xgrammar() with a versioned sentinel file. The sentinel write is now atomic (tmp + os.replace) so a SIGKILL/ENOSPC mid-Path.write_text can't leave a zero-length sentinel that the next run accepts. Before writing, the package roots (xgrammar/__init__.py, tvm_ffi/__init__.py) are checked for presence and non-zero size — a truncated zipfile.extractall pass would otherwise also be accepted by the bare-existence check. Hoisted into main() so --skip-venv runs it too. - omlx/_torch_stub.py: the top-level torch.__getattr__ now logs the missing attribute name at WARNING (rate-limited to once per name per process) before raising AttributeError, so a future xgrammar release reaching for an unstubbed torch surface is diagnosable from logs even when the AttributeError surfaces in a request handler that doesn't separately log it. - server.py: DMG error path no longer claims unavailability Tests (tests/test_torch_stub.py, 16 cases — 4 new on top of the 12 pre-existing): install / idempotency / real-torch-present detection / dtype dict-keying / isinstance(Tensor) / unsupported helpers raising / env-var conditional on no-real-torch path / top-level __getattr__ AttributeError + WARNING log contract / __spec__ is a real ModuleSpec for every stubbed submodule / torch.utils.dlpack.to_dlpack raises. Guards against a future xgrammar or tvm-ffi version reaching for a new torch attribute at import time. Verified on bundled app Python: import xgrammar succeeds without torch on disk; create_grammar_compiler + compile_json_schema + GrammarConstraintProcessor mask logits correctly with a real HF tokenizer; tests/test_grammar.py = 57 passed / 3 pre-existing skips.

…th the writer thread test_cleanup_all_drains_queue failed ~20% of the time, in isolation and in suites. The race: T0 save() puts item in queue T1 writer queue.get() pulls item (already off the queue) T2 cleanup_all() drains queue — finds it empty T3 cleanup_all() rmtree(snapshot_dir) + mkdir(snapshot_dir) T4 writer mkdir(req_dir) + temp write + rename(final) Result: ``req-X/N.safetensors`` survives the supposed cleanup, the caller (scheduler / shutdown) believes the snapshot dir is empty, and the leftover file is the disk shadow of a request that should have been forgotten. Add ``_writer_busy`` (threading.Lock). The writer holds it for the entire duration of each ``_process_write_item`` call. ``cleanup_all()`` drains the queue first (no new items can start) and then acquires ``_writer_busy``, so any item the writer had already pulled finishes before rmtree runs. Same class of race exists for ``cleanup_request()`` — writer pulls a request's item before cleanup_request sets the cancelled-counter, then mkdir's the req_dir back into existence after the rmtree, and the late-rename rescue at the tail of ``_process_write_item`` catches it in most timings but a small window remains. Apply the same ``_writer_busy`` barrier (bounded, see below) so the two cleanup paths are symmetric. Per-path timeouts so cleanup_request can yield faster than cleanup_all: - ``_CLEANUP_ALL_TIMEOUT_S = 5.0`` — cleanup_all runs at reset / startup where blocking longer is tolerable in exchange for a stronger orphan-avoidance guarantee. - ``_CLEANUP_REQUEST_TIMEOUT_S = 2.0`` — cleanup_request is called from the scheduler's abort hot path (~3 sites) where bounding latency matters more than chasing one in-flight write. The bounded acquires have a logged-warning fallback. Without that bound a slow disk inside ``_write_safetensors_no_mx`` could deadlock the scheduler's hot path — cleanup_all() is called from request abort / reset (scheduler.py:5400 / 5596 / 5737), not just shutdown. The worst-case fallback behaviour is an orphaned file under the recreated snapshot dir until next startup's constructor cleanup, which is the exact same state the pre-fix code produced on every cleanup_all. Counter pop on cleanup_request now only runs when the ``_writer_busy`` acquire succeeded. On the timeout fallback the writer is still mid-item and may not yet have consulted ``_is_cancelled``; popping the counter there would defeat the late- rename rescue that the docstring advertises as the timeout- fallback safety net. The previous code popped unconditionally. Counter bump itself is now additive (``get + count``) and skipped entirely when ``count == 0``. Two distinct bugs that closes: - Skip-on-zero: cleanup_request("X") for an rid with NO pending items previously wrote ``cancelled[X] = 0`` then popped on the acquired path. On the timeout fallback the pop never ran and the ``X: 0`` entry lingered for the process lifetime — every later ``save()`` under that rid (or any reuse of the string) was silently discarded by the writer's ``_is_cancelled`` gates, which check key membership not value > 0. Counter must only exist when there is at least one in-flight item to drain it. Regression: test_cleanup_request_no_pending_does_not_pin_counter _on_timeout (new). - Additive: a re-entrant cleanup_request("X") for an rid that already has an in-flight cancellation must NOT overwrite the previous count. The writer's ``cleared_by_cleanup`` branch + ``_writer_busy`` lock together close the file-write race today (so no orphan file slips out), but the per-item dec_cancelled bookkeeping has to balance — overwriting drops remaining decs on the floor and on the next ``save()`` under the same rid the writer would see a non-zero counter from the earlier batch and silently discard the new item. shutdown() now accepts ``cleanup=True`` so callers that want both operations can express the cleanup-before-shutdown ordering in one call instead of sequencing them manually. The cleanup_all() warning at the top of the function catches misordered callers that still call cleanup_all() AFTER shutdown — the writer no longer reacquires ``_writer_busy`` past the sentinel, so a post-shutdown cleanup degrades to an in-memory-only clear. Tests: - ``test_cleanup_all_drains_queue`` — no longer sleeps; relies on the lock to guarantee ordering, runs deterministically. - ``test_cleanup_all_blocks_until_writer_finishes_pinned_item`` — monkey-patches ``_write_safetensors_no_mx`` to block on an Event, pins the writer mid-item, asserts cleanup_all does NOT return until the writer releases. Without the lock this test fails deterministically rather than the original ~20% flake rate. - ``test_cleanup_request_blocks_until_writer_finishes_pinned_item`` — symmetric for cleanup_request. - ``test_cleanup_request_keeps_counter_on_timeout`` — regression for the bug where the timeout-fallback pop dropped the late-rename rescue's safety net. - ``test_cleanup_request_timeout_drains_counter_on_writer_early_return`` — the writer's cleared_by_cleanup early-return must still dec_cancelled or the counter pins the rid forever. - ``test_cleanup_request_no_pending_does_not_pin_counter_on_timeout`` (new) — regression for the skip-on-zero bug above. Pins the writer with an unrelated save, calls cleanup_request("never-saved-rid") past the 0.1s timeout, asserts the rid does NOT appear in ``_cancelled_requests`` AND that a subsequent save under the same rid is not silently discarded. - ``test_shutdown_cleanup_true_runs_cleanup_before_setting_flag`` — pins the cleanup-before-shutdown ordering of the new ``shutdown(cleanup=True)`` path. 24/24 boundary-store tests pass.

Scheduler._preflight_memory_check, MemoryMonitor.estimate_prefill_peak_bytes, and ProcessMemoryEnforcer._propagate_memory_limit existed for a while but never actually fired in practice — five distinct gaps along the chain short-circuited the guard on every request: 1. Scheduler.__init__ left self.memory_monitor=None. The monitor is built by cache/factory.create_full_cache_stack, but tiered_manager passes memory_monitor=None for paged-SSD-only mode (the only mode oMLX uses), so the scheduler's reference stayed None and _preflight_memory_check returned at the "if self.memory_monitor is None" gate. 2. _set_model_info_for_monitor expected num_hidden_layers / num_attention_heads / head_dim on the top-level model.config. VLM configs (e.g. Qwen3.6-VL, Gemma-4 family) nest those under config.text_config and put architectures / vision_config at the top level. The "if num_layers and num_kv_heads and head_dim" gate failed for every VLM and set_model_info() was never called. 3. ProcessMemoryEnforcer._get_hard_limit_bytes returned max(system_ram - 4 GiB, max_bytes). On a 48 GiB host with max_process_memory=40 GiB that's a 44 GiB threshold — large enough that a 110k-token Qwen3.6-VL prefill (28 GiB baseline + ~14 GiB SDPA-fallback peak) slipped past the rejection check. Metal then aborted at kIOGPUCommandBufferCallbackErrorOutOfMemory or, worse, the underlying Apple IOGPUFamily kernel bug (FB22091885, ml-explore/mlx#3186) fired and the host panicked. 4. _propagate_memory_limit was only called on memory-pressure-level transitions (ok→soft, soft→hard, etc.). EnginePool lazy-loads engines on first request, well after enforcer.start() has propagated to its then-empty entry map. The freshly-loaded scheduler kept _prefill_memory_guard at its False default until pressure first crossed soft — by which point the heavy request was already in MLX. 5. _propagate_memory_limit reached for getattr(entry.engine, "scheduler", None). Neither BatchedEngine nor VLMBatchedEngine exposes .scheduler at the wrapper level — both keep it at self._engine.engine.scheduler. getattr returned None for every real engine; the loop body did nothing. Existing tests masked this because they assigned .scheduler directly onto a MagicMock — a contract the real engines never satisfy. Fix: - Scheduler.__init__: build a real MemoryMonitor unconditionally and immediately call _set_model_info_for_monitor() so the SDPA-peak math is available from the first request. The 4 GiB max_kv_cache_memory ctor arg is required positive but unused on the prefill-peak path (which reads only head_dim / num_attention_heads / num_kv_cache_layers); _check_memory_pressure is a hardcoded ``pass`` in paged-SSD-only mode and the eviction helpers (_evict_blocks_permanently / _evict_blocks_to_cold) now short- circuit on ``memory_monitor.eviction_enabled`` before they would call into the disabled estimate_blocks_to_free path. A future re-wiring for paged-SSD eviction would need to thread the real KV budget through. - MemoryMonitor.eviction_enabled is a new public predicate so callers can branch on the eviction wiring without reaching into a private attribute and without provoking the RuntimeError that estimate_blocks_to_free emits when eviction is disabled. - _set_model_info_for_monitor: ALWAYS prefer the LM sub-config when one of text_config / language_config / llm_config exposes num_hidden_layers OR the legacy n_layer, even if the top-level config also has one. Some VLM packs (older Gemma-3, Llava variants, HF auto- wrappers) surface num_hidden_layers at the top level referring to the *vision encoder* not the LM; accepting that would silently miscalibrate the SDPA-peak estimate by a constant factor. Probing the legacy ``n_layer`` alias keeps GPT-style nested configs from being shadowed by the same vision-encoder leakage. Falls back to top-level only when no sub-config has either field. - _get_hard_limit_bytes: return self._max_bytes directly so the guard's threshold tracks the user-configured ceiling. Two downstream sites consume this value — _preflight_memory_check (admission rejection) AND the mid-prefill runtime abort in _do_external_prefill / _step_prefill_chunk. Both intentionally tighten in lockstep: a request whose estimated peak slips past preflight (e.g. cache state changes between scheduling and execution) is also aborted in-flight rather than being allowed to consume up to the system ceiling and potentially trigger the Apple kernel bug. The enforcer's separate soft/hard watermarks (_soft_bytes / _hard_bytes) still drive LRU eviction independently. - _check_and_enforce: call _propagate_memory_limit() every poll iteration, not only on level changes. Newly-loaded engines now pick up the limits + guard flag within one poll interval of load. Iteration uses ``list(self._engine_pool._entries.values())`` to snapshot the view so a future EnginePool mutator moved to a worker thread cannot trigger ``RuntimeError: dictionary changed size during iteration`` or — worse — silently miss an engine and leave it without the propagated guard, regressing the dead-guard bug this commit fixes. The cost is one cheap copy of value references. - _propagate_memory_limit + new _resolve_scheduler(engine) helper: walk engine._engine.engine.scheduler when the direct attribute is absent so real BatchedEngine / VLMBatchedEngine wrappers receive the propagation. Test mocks that assign .scheduler directly still work. When the wrapper-chain traversal returns None, the propagation logs a WARNING naming the engine type (rate-limited to once per type per process) so the silent no-op failure mode that originally hid the dead guard is loud in CI / oncall. - _MemoryLimitState frozen dataclass + Scheduler._memory_state: the four scheduler fields the enforcer publishes (memory_limit_bytes, memory_hard_limit_bytes, prefill_memory_guard, admission_paused) form a logical bundle — guard True implies hard_limit > 0, soft matches the configured ceiling, admission_paused tracks the enforcer's pressure level. The API hot-path reader (_preflight_memory_check) must see all four consistently; observing a partially-published combination (e.g. guard=True, hard_limit=0 mid-write) would let a too-large request slip past rejection and back the kernel-panic bug this commit closes. Bundling them into an immutable dataclass and publishing a single reference store makes the publication atomic regardless of Python memory model — including PEP 703 free-threading where per-attribute STORE_ATTRs are no longer GIL-serialized into a coherent order from another thread's perspective. The reader does ``state = scheduler._memory_state`` once and reads fields off the local snapshot. The four legacy attribute names (_memory_limit_bytes, _memory_hard_limit_bytes, _prefill_memory_guard, _admission_paused) are kept as @Property accessors backed by ``_memory_state``, so the dozens of test setattrs and the secondary readers in _do_external_prefill / _step_prefill_chunk / _schedule_waiting work unchanged. Each setter rebuilds the bundle via ``replace`` (frozen dataclass) so an individual assignment is still a single atomic ref store of the new bundle. The setters fall back to a fresh _MemoryLimitState if invoked before __init__ finishes — covers tests that build Schedulers via __new__ and set attrs one at a time. Tests (108 pass across the touched suites): - tests/test_scheduler_prefill_memory_guard.py — covers plain LLM config, VLM nested config, top-level-num_hidden_layers referring to vision encoder, language_config / llm_config aliases, legacy n_layer / n_head / n_embd fallback, VLM nested-via-legacy-n_layer (new regression for the widened sub-config probe), broken-config exception swallow, guard on/off, cached-vs-uncached. - tests/test_process_memory_enforcer.py — includes regressions for the load-after-start race (test_check_and_enforce_propagates_every_poll), the wrapper-traversal path (test_propagates_through_batched_engine_wrapper), and the rate-limited unreachable-scheduler warning (test_unresolvable_scheduler_logs_warning_once). Two new tests pin the bundle contract: test_propagate_publishes_atomic_memory_state_bundle — asserts the writer publishes a frozen _MemoryLimitState with all four fields coherent, so the reader's single-snapshot read can never observe a mixed (guard=True, hard_limit=0) combination. test_memory_state_bundle_matches_guard_off_path — symmetric for the OFF path. - tests/test_memory_monitor.py — adds two cases pinning the public ``eviction_enabled`` predicate so Scheduler's dormant-eviction short-circuits don't silently flip behavior on a future refactor. Verified end-to-end against Qwen3.6-35B-A3B-oQ6 on a 48 GiB Mac16,8 with max_process_memory=40 GiB: a 110k-token prefill that previously either OOM-aborted the Python process or triggered the Apple IOGPUFamily kernel panic is now rejected by _preflight_memory_check before reaching MLX. Broader concurrent / cache-replay torture cycled memory pressure soft↔ok throughout but never crossed hard.

When the prefill memory guard rejects a request, surface it as a proper HTTP 413 with the diagnostic message in the body, instead of an unhandled 500 + truncated chunked read. Two pieces are needed because the chat/completions path wraps the response body in a StreamingResponse to send SSE keepalive bytes during long prefill, and starlette emits http.response.start (status 200) before iterating the body generator. A typed exception thrown from within that body lands as "Caught handled exception, but response already started" — the registered 413 handler is correctly invoked but can no longer change the status code, so the client sees an incomplete chunked read. 1. Typed-exception path. Scheduler.add_request now runs _preflight_memory_check synchronously and raises PrefillMemoryExceededError if rejection is warranted — before the request is appended to self.waiting. Register an @app.exception_handler(PrefillMemoryExceededError) that returns HTTP 413 with the OpenAI-style {"error": {"message": ...}} body on /v1/* routes and a plain {"detail": ...} elsewhere. The async re-check in _schedule_waiting stays as a race-safety net. The docstring on add_request now spells out that admission preprocessing (tokenisation, prefix-cache lookup, block-table acquisition, SpecPrefill scoring) HAS run by the time the typed exception fires — preprocessing state is rolled back on the raise path so the rejection doesn't leak resources. 2. Server-side preflight before StreamingResponse. Add Scheduler.preflight_or_raise(num_prompt_tokens, cached_tokens=0) that runs the token-count form of _preflight_memory_check and raises PrefillMemoryExceededError on rejection (with a logger.warning so observability matches the async path). Split out _preflight_memory_check_tokens(num_prompt_tokens, cached_tokens) so the existing Request-shaped check still works. Add engine.preflight_chat(messages, tools=None, **kwargs) and engine.preflight_completion(prompt, **kwargs) on BatchedEngine and VLMBatchedEngine. Both apply the chat template, tokenize, traverse the wrapper chain to reach the scheduler, and call preflight_or_raise. If the wrapper chain resolves to None (partial init failure) the preflight no-ops with a logged warning rather than silently skipping the safety check. The warning is routed through ``_warn_scheduler_unreachable_once`` in engine.base — rate-limited per (engine_class, method) pair per process so a misconfigured engine doesn't spam logs at request rate, only once at first occurrence (enough to alert CI / oncall). The VLM ``preflight_chat`` strips image content-parts via ``extract_images_from_messages`` BEFORE applying the chat template, mirroring the real ``_process_chat_messages`` flow. Without this, modern HF chat templates (Qwen2.5-VL, Gemma-Vision, Llama-3.2-Vision) render image content-parts as literal placeholder strings inline with the text — the tokenized prompt then already contains image- placeholder tokens AND the per-image ``_IMAGE_TOKEN_UPPER_BOUND`` budget is added on top, double-counting and producing spurious 413s on borderline image-bearing requests that the real chat path would have admitted. After the strip, the per-image budget is the only source of image-token accounting and the upper bound is sharp. The VLM preflight_chat keeps the 1280-tokens-per-image budget (the high end of what Qwen-VL / Gemma-Vision packs expand each image to). Over-estimates somewhat for small images (false-positive 413s for borderline-and-image cases) but never under-counts, which is the property the guard needs to stay safe against the Apple IOGPUFamily panic path. VLM ``preflight_chat`` also routes tools through ``convert_tools_for_template`` to match the real chat path — without conversion, Pydantic ``ToolDefinition`` callers would hit the template's TypeError-retry fallback and silently drop tools, miscalibrating the token count. Wire the preflights into every prompt-bearing route handler BEFORE the StreamingResponse return: POST /v1/chat/completions (OpenAI standard) POST /v1/completions (OpenAI legacy) POST /v1/messages (Anthropic) POST /v1/responses (OpenAI Responses) ``_is_api_route`` keeps its path-prefix-only implementation. The server is currently mounted at root (no sub-app prefix); a comment on the helper records the assumption so a future ``app.mount( "/api", ...)`` deployment can move it to scope-based matching. Conservative-by-design choices documented for follow-ups: - The server preflight passes cached_tokens=0 (does not thread the prefix-cache lookup forward). Under a hot prefix cache an admission that would have fit may be rejected with 413; the synchronous in-add_request recheck inside the scheduler runs after the prefix- cache resolution and can still accept it, but the early rejection costs the caller a retry. Tracked for a future PR that pre-resolves the cache key. - The estimator is prompt-only; decode growth from ``max_tokens`` is not modelled. The tightened in-flight abort threshold (now equal to max_bytes — see the memory-guard commit) catches the runtime overflow if it happens, but the request will see a stream-shape error rather than upfront 413. Acceptable given decode growth on borderline prompts is unusual. Tests: - tests/test_server_prefill_memory_handler.py — 4 cases: - handler shape on /v1/* (OpenAI ``{"error": ...}`` wrapper) and non-/v1/ (plain ``{"detail": ...}``); - end-to-end ``/v1/responses`` regression hitting the real route with a force-rejecting preflight, asserting the production handler returns 413 with the OpenAI body shape (catches any future route that constructs a StreamingResponse without awaiting preflight first). - tests/test_engine_preflight.py — 20 cases: scheduler raise path (within budget / over budget / guard disabled / fully cached), BatchedEngine + VLMBatchedEngine preflight_chat with a stubbed wrapper chain, preflight_completion, the VLM image-token budget, the new VLM strip-images-before-template behaviour (regression for the double-count bug), Pydantic-tools conversion through convert_tools_for_template, and the wrapper-chain-unreachable warning log. Tokenizer hardening: preflight_chat and preflight_completion on both engines now catch exceptions from tokenizer.encode (BPE malformed input, HF Rust "Already borrowed", UnicodeDecodeError on exotic content) and log + skip the memory check rather than letting them surface as a new 500. The real chat / completion path will hit the same error and route it through the existing handler chain, so the user-visible response shape stays consistent with the pre-preflight behaviour for these cases. A future PR can map raw tokenizer errors to 400 for both paths. Verified end-to-end against Qwen3.6-35B-A3B-oQ6 on a 48 GiB host: a 110k-token POST /v1/chat/completions now returns HTTP 413 in 0.13 s with the full diagnostic body ("Prefill would require ~43.47 GB peak (current 27.91 GB + KV+SDPA 15.56 GB) but limit is 40.00 GB."), and the request never enters MLX prefill so the underlying Apple IOGPUFamily kernel bug stays cold.

Closes a long-standing gap noticed during the upstream merge: the class ships full to_dict/from_dict but had no tests, so field drift would have gone unnoticed (cf. upstream jundot#1286 which had to retrofit the same coverage for SchedulerSettings/MemorySettings).

…stream Adopts the keyword-only `cached_tokens` parameter that upstream PR jundot#1326 settled on (commit efcc858). Same behavior — same correctness fix on the SDPA span — just a different API shape: before: (new_tokens, cached_tokens, chunk_size) # required pos after: (new_tokens, chunk_size, *, cached_tokens=0) # kw-only Done now so the next upstream merge auto-merges the function definition line instead of conflicting on it. The body still keeps the local correctness extras (early return on new_tokens<=0, negative cached_tokens clamp, eff_chunk floor) that upstream lacks.

Closes the four ModelSettings fields that had no references anywhere in tests/: turboquant_kv_bits, turboquant_skip_last, vlm_mtp_draft_model, vlm_mtp_draft_block_size. All four are wired into admin/routes.py and classified in model_profiles.py, so silent drift between defaults and the routes layer would have been invisible. Pattern matches the existing per-field convention in this file: defaults, to_dict/from_dict roundtrip, and (for Optional fields) excluded-when-none.

Verified unused: zero instantiations across the entire git history (`git log -S 'TieredCacheManager('` returns only the initial-commit declaration), zero imports outside the module's own __init__ re-export, zero test references. The class was a planned coordinator between PagedCacheManager / BlockAwarePrefixCache / PagedSSDCacheManager / MemoryMonitor — Scheduler.__init__ does that wiring directly, making the abstraction redundant. Removes 353 lines plus the re-export from omlx.cache.__all__.

omlx/api/mcp_routes.py was wired into server.py at module init but had zero direct test coverage — test_mcp_*.py files exercised the manager and executor layers but not the FastAPI handlers, leaving the request parsing, no-manager fallbacks, and the tool/tool_name alias (upstream PR jundot#1285) unguarded. 15 tests via TestClient covering all four entry points: - set_mcp_manager_getter wiring + None propagation - GET /v1/mcp/tools: empty-list short-circuit, namespaced full_name serialization - GET /v1/mcp/servers: enum-to-string flattening, error-field passthrough - POST /v1/mcp/execute: 503 when unconfigured, success / handled-error paths, tool-vs-tool_name alias regression guard, default-arguments fallback, 422 on missing tool name The autouse fixture resets the module-global `_get_mcp_manager` callback between tests so state can't leak across cases.

omlx/api/rerank_models.py defines the request/response contract for the /v1/rerank endpoint (Cohere/Jina-compatible) but had no schema tests — only the engine layer was exercised by test_reranker_*.py. Field renames or default changes would have silently broken clients. 24 tests covering all four classes: - RerankRequest: text + multimodal (dict) query/documents shapes, defaults (return_documents=True, top_n=None), validation rejection on missing required fields, JSON round-trip with return_documents=False - RerankResult: document=None path, multimodal document passthrough (preserves image data-URIs), required field enforcement - RerankUsage: required total_tokens - RerankResponse: auto-id 'rerank-' prefix + 8 hex chars (Cohere clients filter on this), per-instance uniqueness, explicit-id override, optional usage, JSON round-trip with multiple results

omlx/admin/oq_manager.py is instantiated into server_state at startup and drives all admin-panel quantization workflows, but had no direct tests — test_oq.py covers the omlx/oq.py library underneath, not the async task manager on top. 51 tests across the synchronous and validation surface: - QuantStatus enum + str-inheritance - QuantTask.to_dict shape, progress rounding, status enum.value - _dir_size and _format_size helpers (parameterized) - _phase_label including the quantizing_eta|c|t|eta parser edge cases (zero total, non-numeric parts) - OQManager.__init__ defaults and the first-dir-wins output_dir rule - update_model_dirs refresh + empty-update no-op invariant - get_tasks / is_quantizing across active vs terminal statuses - remove_task: unknown / refuses-active / clears _cancelled set - cancel_quantization rejection paths (unknown, non-active) - start_quantization validation: invalid oq_level, invalid dtype, missing model dir, missing config.json, output collision, duplicate active task, and that completed tasks don't block resubmits - list_quantizable_models: empty / skip-malformed / MTP detection / is_quantized flag / dedup across dirs - shutdown: no-op when idle, cancels every active task The full _run_quantization / _estimate_progress lifecycle and the cooperative-cancellation flow are left for a future pass — they need end-to-end mocking of quantize_oq_streaming + Metal cleanup, which is brittle and not the highest-value gap right now.

Extends test_oq_manager.py with the lifecycle paths that needed quantize_oq_streaming mocking — left as a follow-up in the previous commit because they require threading-coordinated stubs. 13 new tests across four classes: - TestRunQuantizationHappyPath: completion fields (status, progress, output_size, started_at/completed_at ordering), sync and async on_complete callbacks, on_complete exception isolation (a buggy callback must not flip status to FAILED), progress-callback wiring - TestRunQuantizationFailure: exception → FAILED with error captured, partial output cleaned from disk - TestRunQuantizationPreCancel: _cancelled set before semaphore entry → quantize_oq_streaming never invoked (shutdown-race guard) - TestCancelCooperativeExit: full cancel via the progress-callback path (the design upstream chose over hard-cancelling the asyncio wrapper — see the comment block in cancel_quantization), plus the cancel-during-LOADING edge case where no progress_cb has fired yet - TestEstimateProgress: unknown-task no-op, parent task's finally clause cancels the progress estimator on both success and failure Uses threading.Event coordination in fake_quantize so the cooperative cancel path exits as soon as the flag is set, not after fixed delays — the full test class runs in <0.3s.

omlx/mcp/config.py is consulted on every server startup but had no direct tests — test_mcp_types.py covered the MCPConfig / MCPServerConfig dataclasses underneath, not the file discovery, JSON/YAML loading, or the validate_config schema checks layered on top. 38 tests across eight classes: - validate_config input: non-dict / empty / servers-must-be-dict / the falsy-non-dict-servers quirk (`[]` silently falls through to mcpServers) / entry-must-be-dict with server name in the message - validate_config server loading: stdio happy path, key-as-name auto injection, key-overrides-explicit-name, unknown-kwarg surfaces a ValueError with the offending server name, the Claude Desktop `mcpServers` alias for drop-in compatibility, `servers` precedence over `mcpServers` (or-chain semantics), empty mcpServers fall-through - validate_config globals: max_tool_calls valid/zero/negative/non-int plus the bool-is-int quirk (documented), default_timeout int/float/ zero/negative/string rejection - Path discovery: explicit existing path, missing explicit path → FileNotFoundError, tilde expansion, OMLX_MCP_CONFIG honored, env-var-missing-file falls through with a warning rather than crashing the server startup, search-path order, all-paths-missing returns an empty MCPConfig (not None, not error) - File formats: JSON load + malformed-JSON propagation, YAML load via .yaml and .yml (gated on PyYAML availability) - create_example_config: valid JSON, round-trips through validate, showcases multiple transports All discovery tests use an `isolated_env` fixture that monkeypatches CONFIG_SEARCH_PATHS to [], clears the env var, and chdirs to tmp_path so real ~/.config/omlx/mcp.json or stray ./mcp.json files can't leak into the test.

omlx/optimizations.py is a thin hardware/MLX status helper consumed by the admin dashboard. The re-exported HardwareInfo / detect_hardware / get_total_memory_gb were already covered by test_utils_hardware.py; this commit pins the dict shape produced by get_optimization_status and the flash-attention probe. 10 tests: - Re-exports are identity-equal to the canonical hardware symbols (otherwise external scripts importing from omlx.optimizations would silently get stale copies); __all__ matches the documented surface - get_optimization_status returns three top-level keys (hardware, mlx_memory, mlx_lm_features) with the expected sub-key sets - mlx_memory counters are non-negative ints sourced from mx.get_*_memory - mlx_lm_features static strings pinned verbatim — the dashboard shows these verbatim, so a typo here would surface in production - flash_attention reports 'built-in' on real MLX, 'not available' when mx.fast lacks scaled_dot_product_attention (simulated via patch.object(mx, 'fast', MagicMock(spec=[]))) - active_bytes actually reflects MLX state — allocating a 4 MiB array must register in the delta, proving the value isn't hardcoded

Closes two remaining gaps with real production impact: omlx/models/base_model.py — mean_pooling and normalize_embeddings are the pooling/L2-norm primitives used by omlx/models/xlm_roberta.py (the reranker model). A silent change to either would corrupt embedding output. 12 tests: - BaseModelArgs/BaseModelOutput dataclass shape (defaults + full fields) - mean_pooling: uniform mask = simple mean, partial mask excludes padded positions (the load-bearing invariant — pre-mask sums would let padding tokens taint embeddings), all-zero-mask guard prevents NaN/Inf (the clip(..., a_min=1e-9) defense), batch-dim passthrough, fp16 dtype propagation via the mask.astype(hidden_states.dtype) cast - normalize_embeddings: unit-norm output, axis=-1 only (catches wrong-axis bugs that would silently destroy similarity comparisons), higher-rank shape preservation, idempotence omlx/eval/mbpp.py — _extract_code is the regex/heuristic parser that pulls Python source out of MBPP-task LLM responses before the subprocess executor runs them. The branch coverage matters: 50% of runs use unspecified-language fences. 15 tests: - Fenced blocks: ```python wins over ```, multiline, whitespace strip - Heuristic line scan: def/class/import/from/# all trigger code-region start; everything after is included (no end detection by design) - Fallback: no markers → whole response stripped, empty/whitespace inputs, lone print() statement documents the known limitation

…_block Upstream 1b666af added the ssd_write_drops counter at three drop sites, including a pre-tensor-extraction guard in save_block that returns False when the write queue is already full. The earlier inline-LRU-unlinks change (b9d7161) argued the short-circuit was redundant because eviction no longer contended for the queue, and the rebase resolution dropped it. But eviction was never the only path to saturation: when the writer itself can't keep up with sustained saves, the short-circuit avoids running _enforce_size_limit_for_new_block and the GPU tensor extraction only to drop the block at the put step a few hundred lines later. Gated on `not self._hot_cache_enabled`. In hot-cache mode writes are deferred to LRU spill via _enqueue_ssd_write, which has its own timeout-put + drop accounting and shouldn't double-count here. Tests: - restore test_ssd_write_drops_increments_on_cold_store_preflight which patches _write_queue.full() to True - reword test_ssd_write_drops_increments_on_cold_store_late_exception to clarify it now covers the put-after-preflight race - test_ssd_write_drops_increments_on_hot_eviction_queue_full now patches put (the actual call in _enqueue_ssd_write) instead of put_nowait, matching the local timeout-based put semantics

The 0e6cbc5 commit (now part of the rebased local history) tightened _get_hard_limit_bytes to always return max_bytes, arguing that auto mode's system_ram - 4GB headroom let Qwen3.6-VL head_dim=256 prefills burst past the cap before Metal aborted. Upstream acd0533 has since landed an adaptive prefill throttle (_prefill_safe_zone_ratio / _prefill_min_chunk_tokens) that shrinks the per-chunk transient as usage approaches the cap — the original failure mode is now bounded by the throttle before reaching Metal. Restore upstream's two-branch behavior: - user_explicit_max=True -> max_bytes (user value IS the ceiling) - auto -> max(system_ram - 4GB, max_bytes) Tests: - remove the now-obsolete test_hard_limit_equals_max_bytes added by 0e6cbc5 (coverage is preserved by upstream's test_hard_limit_honors_user_explicit_max and test_hard_limit_auto_mode_still_uses_system_minus_4gb) - enforcer fixture now sets user_explicit_max=True so propagation tests can assert the exact hard_limit value without entanglement with the host's system memory

upstream acd0533 added max_process_memory_is_explicit, prefill_safe_zone_ratio, and prefill_min_chunk_tokens to MemorySettings.to_dict() but the to_dict test was not updated (last touched in jundot#1286 on 2026-05-19, predating acd0533 by a week). The test was already failing on upstream/main — this just closes the gap.

…butes The bundle (introduced in b6a69c4) wrapped the four memory-guard fields (_memory_limit_bytes, _memory_hard_limit_bytes, _prefill_memory_guard, _admission_paused) in a frozen dataclass so ProcessMemoryEnforcer could publish them via a single reference store. The motivation was PEP 703 free-threading, where per-attribute STORE_ATTRs are no longer GIL-serialized into a consistent order from another thread's view. Under current CPython the GIL already serializes the per-attribute writes within a single enforcer poll, so the hot-path reader (_preflight_memory_check_tokens) sees a coherent (guard, hard_limit) pair without the bundle. Keeping the bundle traded non-trivial complexity (frozen dataclass + four property/setter pairs + ``dataclasses.replace`` rebuilds on every test setattr) for PEP 703 prep we don't need today. Drop it; revisit if/when we ship nogil. - omlx/scheduler.py: remove _MemoryLimitState dataclass, remove the four properties and _current_memory_state helper, init the four fields as plain attributes (matches upstream). - omlx/process_memory_enforcer.py: _propagate_memory_limit writes the four fields directly instead of building a new bundle. - tests/test_process_memory_enforcer.py: drop _MemoryLimitState imports and bundle setup; collapse the two redundant bundle-atomicity tests into one guard-off test. Net: -250 / +51 lines.

The post-rebase add/add merge with upstream 1010fd3 left two classes covering update_model_dirs: - TestUpdateModelDirs (local) — bare tmp_path setup; covered the reorder-changes-head case and the empty-list defensive case. - TestOQManagerUpdateModelDirs (upstream) — fp_model_dir fixture setup; covered async list_quantizable_models() integration and the head-tracking case. They overlapped on the head-tracking assertion. Fold them into a single TestUpdateModelDirs with three tests: - test_picks_up_added_dir (async integration, from upstream) - test_output_dir_tracks_primary_dir (extended with the local test's _model_dirs == [head, ...] order assertion) - test_update_to_empty_leaves_old_output_dir (from local) Net: -7 tests removed, -2 new in the consolidated class, +1 stronger assertion on the head-tracking test.

…y_limit The docstring's "four guard fields are written as direct attributes / on current CPython the GIL serializes per-attribute STORE_ATTRs" paragraph was contextual to the now-removed _MemoryLimitState bundle (see 675cae1) — when the bundle existed, the paragraph explained why the bundled single-store was equivalent under today's GIL. With the bundle gone the paragraph is just a CPython explainer that adds noise without informing the reader.

…ream Upstream ``tests/test_integrations.py::TestIntegrationSettings`` covers defaults, basic to_dict, full-dict and empty-dict from_dict. The local class had near-exact duplicates of three of those (``test_defaults``, ``test_from_dict_full``, ``test_from_dict_empty``) — drop them. Keep the genuinely additive coverage: - ``test_to_dict_defaults`` — pins the exact to_dict shape, catching the kind of field-added-but-test-not-updated regression that bit MemorySettings.to_dict (see 81dc2d5). - ``test_to_dict_custom`` — stricter than upstream's (sets all seven fields, asserts the full dict). - ``test_from_dict_partial`` / ``test_from_dict_explicit_null_overrides_default`` / ``test_round_trip`` — partial-dict fallback, null-override semantics, identity round-trip; not covered upstream. Net: -28 lines, three fewer merge-conflict surfaces with upstream's copy of the same tests.

# Conflicts: # omlx/engine/base.py # omlx/process_memory_enforcer.py # omlx/scheduler.py # tests/test_process_memory_enforcer.py # tests/test_settings.py

# Conflicts: # packaging/build.py # pyproject.toml

…ping not yet populated _emit_prefill_boundary_snapshot routed through a uid that doesn't exist yet. At prefill time the request has not been inserted into BatchGenerator (that happens in _insert_prefilled_request *after* prefill completes), so request_id_to_uid has no entry for it and the lookup defaults to -1. _on_prefill_boundary_snapshot then does uid_to_request_id.get(-1) -> None -> silent return, dropping every boundary snapshot. For ArraysCache / GDN / hybrid models that meant every non-last cached block stored a placeholder (mx.zeros((1,)), mx.zeros((1,))) instead of the real recurrent state. The next identical-prefix request matched the cache by hash, found a placeholder in the last matched block, and the lookup-side guard at prefix_cache.py:1986 correctly rejected the cache as unusable — but the resulting "Request will reprocess from scratch" branch then re-prefilled the entire prompt. Visible symptom: re-uploading the same large prompt (e.g. a chat continuation) re-prefilled all 90k tokens instead of the few new ones. Pass request_id through directly. _on_prefill_boundary_snapshot was already keyed by request_id internally — the uid round-trip was pointless. The other prefill-side caller, _maybe_capture_boundary_ snapshot (called during generation from a path where uid IS valid), writes snapshots inline without going through this helper, so it was unaffected. That asymmetry explains the bimodal pattern in production logs: short-prompt + long-generation requests captured 20+ intermediate snapshots cleanly, while long-prompt + short- generation requests captured 0 (since all block boundaries were in prefill, where the uid lookup failed). Verified end-to-end via observability instrumentation on a 90k-token prompt: every SNAP_EMIT was followed by uid=-1 SNAP_SKIP before the fix; every SNAP_EMIT is followed by SNAP_OK after. The decisive proof was a subsequent identical-prefix request that started prefill at the cached boundary (token 51200) instead of token 0. Updated existing tests to reflect that no uid mapping is required. Added a regression test that mirrors production state at prefill time (request_id_to_uid deliberately unset).

jundot · 2026-05-28T04:19:32Z

Thanks for the diagnosis. The request_id_to_uid timing explanation is exactly right, and the short-prompt+long-gen vs long-prompt+short-gen bimodal pattern matches what I've been seeing on Qwen3.6-A3B sessions. Cherry-picked the boundary-snapshot fix onto main as 9407468 with your authorship preserved.

I left the other 23 commits on this branch alone. They overlap with your narrower PRs (#1447, #1450, #1451, #1452, #1453, etc.), so I'd rather review and merge those independently to keep the main history readable. Closing this PR; the narrow ones stay open. Apologies if that's not what you intended. Happy to revisit if you'd prefer the bundle path.

cfbraun · 2026-05-28T05:43:21Z

Thanks for the merge of 9407468 — and for the careful read on the bimodal pattern.

You're right about the fat-PR shape — that's my mistake. I branched pr/prefill-snapshot-uid-fix off my local main, which is ~24 commits ahead of upstream because it carries the other in-flight work. GitHub then computed the cross-fork PR diff as all 24 commits, not just my one new commit. From your side it correctly looked like a bundle. I'll branch future single-fix PRs off upstream/main directly to keep the diff narrow.

Happy to leave the narrower PRs (#1446–#1453, #1458) on their independent tracks per your suggestion. No need to revisit. Thanks again.

cfbraun added 30 commits May 26, 2026 09:40

Merge upstream/main (v0.3.12 + memory_guard_tier restructure)

fcf3170

# Conflicts: # omlx/engine/base.py # omlx/process_memory_enforcer.py # omlx/scheduler.py # tests/test_process_memory_enforcer.py # tests/test_settings.py

Merge upstream/main (per-engine threads + VLM MTP guard)

a813a37

Merge upstream/main (PR jundot#1445 + VLM lazy-state fix)

5eed46e

Merge upstream/main (PR jundot#1422 merge)

f23f50a

Merge upstream/main (PR jundot#1423 merge)

bf799cf

Merge upstream/main into main

48b3eb2

# Conflicts: # packaging/build.py # pyproject.toml

jundot closed this May 28, 2026

cfbraun mentioned this pull request May 28, 2026

fix(specprefill): materialize draft model lazy state on loader thread #1485

Merged

4 tasks

cfbraun deleted the pr/prefill-snapshot-uid-fix branch May 28, 2026 06:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cache): boundary snapshots dropped during prefill because uid mapping not yet populated#1471

fix(cache): boundary snapshots dropped during prefill because uid mapping not yet populated#1471
cfbraun wants to merge 30 commits into
jundot:mainfrom
cfbraun:pr/prefill-snapshot-uid-fix

cfbraun commented May 27, 2026

Uh oh!

jundot commented May 28, 2026

Uh oh!

cfbraun commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cfbraun commented May 27, 2026

Summary

User-visible symptom

The fix

Diagnosis evidence

Test plan

Uh oh!

jundot commented May 28, 2026

Uh oh!

cfbraun commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants