fix(cache): boundary snapshots dropped during prefill because uid mapping not yet populated#1471
fix(cache): boundary snapshots dropped during prefill because uid mapping not yet populated#1471cfbraun wants to merge 30 commits into
Conversation
When the SSD cache hits its size limit under sustained save load, every
save_block triggers eviction. The previous design routed each evicted
block's unlink as ("unlink", path) onto the same _write_queue that
carries pending writes — so eviction could never free queue capacity, it
could only add more work to the queue. Combined with the pre-eviction
_write_queue.full() short-circuit, this kept the cache permanently full
once the queue saturated.
Inline the unlinks on the eviction-calling thread instead. Eviction
typically removes a single block per save (loop in evict_until_size
stops as soon as size drops below target), so this is one syscall per
save, not N. The deferred-unlink justification ("avoid blocking the
inference thread with N file delete syscalls") doesn't actually
materialise in steady state — and inlining removes the bounded-queue
contention entirely.
With unlinks off _write_queue the pre-eviction full() short-circuit is
no longer needed (eviction never made the queue worse). The eviction
call at the top of save_block — which ran before tensors were built and
so could only pass a 1 MiB placeholder size — is also dropped; the
single eviction call at save_block:1339 (post-tensor-build) with the
real estimated_size is enough.
Bounded inline-eviction burst. The ENOSPC recovery path invalidates
the disk-usage snapshot, which shrinks _get_effective_max_size sharply
on the next save — evict_until_size can then return hundreds of
entries at once and the inline-unlink loop would stall the inference
thread on a syscall storm. Cap the burst at
_MAX_INLINE_UNLINKS_PER_SAVE = 32 and reinsert the deferred metadata
into the index so subsequent saves drain the remainder; bounds
per-call latency at the cost of taking multiple saves to fully
reconverge.
Genuine writer saturation (the writer can't keep up with sustained
inserts) remains possible. save_block and _enqueue_ssd_write now call
_write_queue.put(item, timeout=0.25) instead of put_nowait so a
transient burst rides over short writer-backlog windows; only sustained
saturation drops the block. The 250 ms budget is well below human-
perceptible latency and typically covers one or two writer iterations
on a healthy SSD. The blocking shutdown-flush path keeps the 500 ms
timeout.
Estimate computes the actual safetensors metadata-JSON length plus a
256 B JSON-envelope margin and 128 B/tensor for the per-tensor header,
replacing the previous fixed 1 KiB metadata constant. Deep-layer models
with large layer_meta_states JSON were already capable of exceeding the
1 KiB budget; the new estimate tracks the real value so the index can't
briefly overshoot _effective_max after a write.
ENOSPC failures invalidate the 30 s disk_usage snapshot under
self._lock so a writer-thread invalidation cannot interleave with an
inference-thread read-check-write and produce a fresh-value-stale-
timestamp pair. The invalidation only protects the NEXT round of
save_block calls — in-flight saves that already passed the quota gate
are still queued and may re-ENOSPC. The error message reflects that.
Logged at ERROR because save_block has already returned True /
incremented _stats["saves"] by the time the writer fails, so the slot
is silently lost; surfacing this at ERROR makes the drift visible to
operators.
Enqueue-vs-persist counters split. _stats["saves"] continues to count
blocks that passed the quota gate (incremented in save_block on the
inference thread), and a new _stats["saves_persisted"] counter is
incremented after the writer's atomic rename succeeds — surfaced as
PagedSSDCacheStats.saves_persisted. saves - saves_persisted is the
steady-state in-flight depth; dashboards that need true durability
should switch to the new field. save_rate's existing formula
(saves / (saves + errors)) keeps backwards compatibility.
Tests (tests/test_paged_ssd_cache.py::TestEvictionAndQueueSaturation):
- test_eviction_does_not_enqueue_unlink_tasks: forces eviction and
asserts no ("unlink", ...) entries leak into _write_queue.
- test_save_uses_timeout_not_put_nowait: verifies save_block calls
_write_queue.put with a positive timeout (regression for the prior
put_nowait that dropped on first burst).
- test_enospc_invalidates_disk_usage_snapshot: an ENOSPC writer
failure clears _disk_usage_cache so the next effective-max call
recomputes against true free-space.
- test_saves_persisted_increments_only_after_rename: pins the
documented enqueue-vs-persist semantic of the two counters.
- test_inline_eviction_burst_is_capped: a forced large-burst eviction
is bounded at _MAX_INLINE_UNLINKS_PER_SAVE and reconverges across
subsequent calls.
82/82 paged_ssd_cache tests pass.
Ships xgrammar==0.2.0 + apache-tvm-ffi==0.1.11 in the DMG (~+37 MB)
instead of skipping structured outputs. xgrammar's torch>=1.10.0
runtime dep is load-bearing only at import time; omlx/_torch_stub.py
satisfies those references (dtypes, Tensor aliases, cuda/version/nn
submodules) without the ~500 MB torch wheel. oMLX's hot path is
unchanged — bitmasks are numpy int32, applied via the existing MLX
kernel.
Wiring:
- omlx/_torch_stub.py: install() is a no-op when real torch is found.
The real-torch path now also leaves TVM_FFI_DISABLE_TORCH_C_DLPACK
untouched — that env var is only set when the stub itself is being
installed, so a user with real torch can still use tvm-ffi's
torch-C-DLPack fast path. The version-drift check is hoisted out
of the install lock so a slow cold-disk xgrammar import doesn't
block concurrent callers behind the lock.
- omlx/{api/grammar,server,admin/routes}.py: install before xgrammar
- tests/conftest.py: install at collection so @patch decorators resolve
- packaging/build.py: idempotent _install_xgrammar() with a versioned
sentinel file. The sentinel write is now atomic (tmp + os.replace)
so a SIGKILL/ENOSPC mid-Path.write_text can't leave a zero-length
sentinel that the next run accepts. Before writing, the package
roots (xgrammar/__init__.py, tvm_ffi/__init__.py) are checked for
presence and non-zero size — a truncated zipfile.extractall pass
would otherwise also be accepted by the bare-existence check.
Hoisted into main() so --skip-venv runs it too.
- omlx/_torch_stub.py: the top-level torch.__getattr__ now logs the
missing attribute name at WARNING (rate-limited to once per name
per process) before raising AttributeError, so a future xgrammar
release reaching for an unstubbed torch surface is diagnosable
from logs even when the AttributeError surfaces in a request
handler that doesn't separately log it.
- server.py: DMG error path no longer claims unavailability
Tests (tests/test_torch_stub.py, 16 cases — 4 new on top of the
12 pre-existing): install / idempotency / real-torch-present
detection / dtype dict-keying / isinstance(Tensor) / unsupported
helpers raising / env-var conditional on no-real-torch path /
top-level __getattr__ AttributeError + WARNING log contract /
__spec__ is a real ModuleSpec for every stubbed submodule /
torch.utils.dlpack.to_dlpack raises. Guards against a future xgrammar
or tvm-ffi version reaching for a new torch attribute at import time.
Verified on bundled app Python: import xgrammar succeeds without
torch on disk; create_grammar_compiler + compile_json_schema +
GrammarConstraintProcessor mask logits correctly with a real HF
tokenizer; tests/test_grammar.py = 57 passed / 3 pre-existing skips.
…th the writer thread
test_cleanup_all_drains_queue failed ~20% of the time, in isolation
and in suites. The race:
T0 save() puts item in queue
T1 writer queue.get() pulls item (already off the queue)
T2 cleanup_all() drains queue — finds it empty
T3 cleanup_all() rmtree(snapshot_dir) + mkdir(snapshot_dir)
T4 writer mkdir(req_dir) + temp write + rename(final)
Result: ``req-X/N.safetensors`` survives the supposed cleanup, the
caller (scheduler / shutdown) believes the snapshot dir is empty, and
the leftover file is the disk shadow of a request that should have
been forgotten.
Add ``_writer_busy`` (threading.Lock). The writer holds it for the
entire duration of each ``_process_write_item`` call. ``cleanup_all()``
drains the queue first (no new items can start) and then acquires
``_writer_busy``, so any item the writer had already pulled finishes
before rmtree runs.
Same class of race exists for ``cleanup_request()`` — writer pulls a
request's item before cleanup_request sets the cancelled-counter, then
mkdir's the req_dir back into existence after the rmtree, and the
late-rename rescue at the tail of ``_process_write_item`` catches it
in most timings but a small window remains. Apply the same
``_writer_busy`` barrier (bounded, see below) so the two cleanup paths
are symmetric.
Per-path timeouts so cleanup_request can yield faster than
cleanup_all:
- ``_CLEANUP_ALL_TIMEOUT_S = 5.0`` — cleanup_all runs at reset /
startup where blocking longer is tolerable in exchange for a
stronger orphan-avoidance guarantee.
- ``_CLEANUP_REQUEST_TIMEOUT_S = 2.0`` — cleanup_request is called
from the scheduler's abort hot path (~3 sites) where bounding
latency matters more than chasing one in-flight write.
The bounded acquires have a logged-warning fallback. Without that
bound a slow disk inside ``_write_safetensors_no_mx`` could deadlock
the scheduler's hot path — cleanup_all() is called from request abort
/ reset (scheduler.py:5400 / 5596 / 5737), not just shutdown. The
worst-case fallback behaviour is an orphaned file under the recreated
snapshot dir until next startup's constructor cleanup, which is the
exact same state the pre-fix code produced on every cleanup_all.
Counter pop on cleanup_request now only runs when the
``_writer_busy`` acquire succeeded. On the timeout fallback the
writer is still mid-item and may not yet have consulted
``_is_cancelled``; popping the counter there would defeat the late-
rename rescue that the docstring advertises as the timeout-
fallback safety net. The previous code popped unconditionally.
Counter bump itself is now additive (``get + count``) and skipped
entirely when ``count == 0``. Two distinct bugs that closes:
- Skip-on-zero: cleanup_request("X") for an rid with NO pending
items previously wrote ``cancelled[X] = 0`` then popped on the
acquired path. On the timeout fallback the pop never ran and the
``X: 0`` entry lingered for the process lifetime — every later
``save()`` under that rid (or any reuse of the string) was
silently discarded by the writer's ``_is_cancelled`` gates,
which check key membership not value > 0. Counter must only
exist when there is at least one in-flight item to drain it.
Regression: test_cleanup_request_no_pending_does_not_pin_counter
_on_timeout (new).
- Additive: a re-entrant cleanup_request("X") for an rid that
already has an in-flight cancellation must NOT overwrite the
previous count. The writer's ``cleared_by_cleanup`` branch +
``_writer_busy`` lock together close the file-write race today
(so no orphan file slips out), but the per-item dec_cancelled
bookkeeping has to balance — overwriting drops remaining decs on
the floor and on the next ``save()`` under the same rid the
writer would see a non-zero counter from the earlier batch and
silently discard the new item.
shutdown() now accepts ``cleanup=True`` so callers that want both
operations can express the cleanup-before-shutdown ordering in one
call instead of sequencing them manually. The cleanup_all() warning
at the top of the function catches misordered callers that still
call cleanup_all() AFTER shutdown — the writer no longer reacquires
``_writer_busy`` past the sentinel, so a post-shutdown cleanup
degrades to an in-memory-only clear.
Tests:
- ``test_cleanup_all_drains_queue`` — no longer sleeps; relies on the
lock to guarantee ordering, runs deterministically.
- ``test_cleanup_all_blocks_until_writer_finishes_pinned_item`` —
monkey-patches ``_write_safetensors_no_mx`` to block on an Event,
pins the writer mid-item, asserts cleanup_all does NOT return until
the writer releases. Without the lock this test fails
deterministically rather than the original ~20% flake rate.
- ``test_cleanup_request_blocks_until_writer_finishes_pinned_item`` —
symmetric for cleanup_request.
- ``test_cleanup_request_keeps_counter_on_timeout`` — regression for
the bug where the timeout-fallback pop dropped the late-rename
rescue's safety net.
- ``test_cleanup_request_timeout_drains_counter_on_writer_early_return``
— the writer's cleared_by_cleanup early-return must still
dec_cancelled or the counter pins the rid forever.
- ``test_cleanup_request_no_pending_does_not_pin_counter_on_timeout``
(new) — regression for the skip-on-zero bug above. Pins the writer
with an unrelated save, calls cleanup_request("never-saved-rid")
past the 0.1s timeout, asserts the rid does NOT appear in
``_cancelled_requests`` AND that a subsequent save under the same
rid is not silently discarded.
- ``test_shutdown_cleanup_true_runs_cleanup_before_setting_flag`` —
pins the cleanup-before-shutdown ordering of the new
``shutdown(cleanup=True)`` path.
24/24 boundary-store tests pass.
Scheduler._preflight_memory_check, MemoryMonitor.estimate_prefill_peak_bytes, and ProcessMemoryEnforcer._propagate_memory_limit existed for a while but never actually fired in practice — five distinct gaps along the chain short-circuited the guard on every request: 1. Scheduler.__init__ left self.memory_monitor=None. The monitor is built by cache/factory.create_full_cache_stack, but tiered_manager passes memory_monitor=None for paged-SSD-only mode (the only mode oMLX uses), so the scheduler's reference stayed None and _preflight_memory_check returned at the "if self.memory_monitor is None" gate. 2. _set_model_info_for_monitor expected num_hidden_layers / num_attention_heads / head_dim on the top-level model.config. VLM configs (e.g. Qwen3.6-VL, Gemma-4 family) nest those under config.text_config and put architectures / vision_config at the top level. The "if num_layers and num_kv_heads and head_dim" gate failed for every VLM and set_model_info() was never called. 3. ProcessMemoryEnforcer._get_hard_limit_bytes returned max(system_ram - 4 GiB, max_bytes). On a 48 GiB host with max_process_memory=40 GiB that's a 44 GiB threshold — large enough that a 110k-token Qwen3.6-VL prefill (28 GiB baseline + ~14 GiB SDPA-fallback peak) slipped past the rejection check. Metal then aborted at kIOGPUCommandBufferCallbackErrorOutOfMemory or, worse, the underlying Apple IOGPUFamily kernel bug (FB22091885, ml-explore/mlx#3186) fired and the host panicked. 4. _propagate_memory_limit was only called on memory-pressure-level transitions (ok→soft, soft→hard, etc.). EnginePool lazy-loads engines on first request, well after enforcer.start() has propagated to its then-empty entry map. The freshly-loaded scheduler kept _prefill_memory_guard at its False default until pressure first crossed soft — by which point the heavy request was already in MLX. 5. _propagate_memory_limit reached for getattr(entry.engine, "scheduler", None). Neither BatchedEngine nor VLMBatchedEngine exposes .scheduler at the wrapper level — both keep it at self._engine.engine.scheduler. getattr returned None for every real engine; the loop body did nothing. Existing tests masked this because they assigned .scheduler directly onto a MagicMock — a contract the real engines never satisfy. Fix: - Scheduler.__init__: build a real MemoryMonitor unconditionally and immediately call _set_model_info_for_monitor() so the SDPA-peak math is available from the first request. The 4 GiB max_kv_cache_memory ctor arg is required positive but unused on the prefill-peak path (which reads only head_dim / num_attention_heads / num_kv_cache_layers); _check_memory_pressure is a hardcoded ``pass`` in paged-SSD-only mode and the eviction helpers (_evict_blocks_permanently / _evict_blocks_to_cold) now short- circuit on ``memory_monitor.eviction_enabled`` before they would call into the disabled estimate_blocks_to_free path. A future re-wiring for paged-SSD eviction would need to thread the real KV budget through. - MemoryMonitor.eviction_enabled is a new public predicate so callers can branch on the eviction wiring without reaching into a private attribute and without provoking the RuntimeError that estimate_blocks_to_free emits when eviction is disabled. - _set_model_info_for_monitor: ALWAYS prefer the LM sub-config when one of text_config / language_config / llm_config exposes num_hidden_layers OR the legacy n_layer, even if the top-level config also has one. Some VLM packs (older Gemma-3, Llava variants, HF auto- wrappers) surface num_hidden_layers at the top level referring to the *vision encoder* not the LM; accepting that would silently miscalibrate the SDPA-peak estimate by a constant factor. Probing the legacy ``n_layer`` alias keeps GPT-style nested configs from being shadowed by the same vision-encoder leakage. Falls back to top-level only when no sub-config has either field. - _get_hard_limit_bytes: return self._max_bytes directly so the guard's threshold tracks the user-configured ceiling. Two downstream sites consume this value — _preflight_memory_check (admission rejection) AND the mid-prefill runtime abort in _do_external_prefill / _step_prefill_chunk. Both intentionally tighten in lockstep: a request whose estimated peak slips past preflight (e.g. cache state changes between scheduling and execution) is also aborted in-flight rather than being allowed to consume up to the system ceiling and potentially trigger the Apple kernel bug. The enforcer's separate soft/hard watermarks (_soft_bytes / _hard_bytes) still drive LRU eviction independently. - _check_and_enforce: call _propagate_memory_limit() every poll iteration, not only on level changes. Newly-loaded engines now pick up the limits + guard flag within one poll interval of load. Iteration uses ``list(self._engine_pool._entries.values())`` to snapshot the view so a future EnginePool mutator moved to a worker thread cannot trigger ``RuntimeError: dictionary changed size during iteration`` or — worse — silently miss an engine and leave it without the propagated guard, regressing the dead-guard bug this commit fixes. The cost is one cheap copy of value references. - _propagate_memory_limit + new _resolve_scheduler(engine) helper: walk engine._engine.engine.scheduler when the direct attribute is absent so real BatchedEngine / VLMBatchedEngine wrappers receive the propagation. Test mocks that assign .scheduler directly still work. When the wrapper-chain traversal returns None, the propagation logs a WARNING naming the engine type (rate-limited to once per type per process) so the silent no-op failure mode that originally hid the dead guard is loud in CI / oncall. - _MemoryLimitState frozen dataclass + Scheduler._memory_state: the four scheduler fields the enforcer publishes (memory_limit_bytes, memory_hard_limit_bytes, prefill_memory_guard, admission_paused) form a logical bundle — guard True implies hard_limit > 0, soft matches the configured ceiling, admission_paused tracks the enforcer's pressure level. The API hot-path reader (_preflight_memory_check) must see all four consistently; observing a partially-published combination (e.g. guard=True, hard_limit=0 mid-write) would let a too-large request slip past rejection and back the kernel-panic bug this commit closes. Bundling them into an immutable dataclass and publishing a single reference store makes the publication atomic regardless of Python memory model — including PEP 703 free-threading where per-attribute STORE_ATTRs are no longer GIL-serialized into a coherent order from another thread's perspective. The reader does ``state = scheduler._memory_state`` once and reads fields off the local snapshot. The four legacy attribute names (_memory_limit_bytes, _memory_hard_limit_bytes, _prefill_memory_guard, _admission_paused) are kept as @Property accessors backed by ``_memory_state``, so the dozens of test setattrs and the secondary readers in _do_external_prefill / _step_prefill_chunk / _schedule_waiting work unchanged. Each setter rebuilds the bundle via ``replace`` (frozen dataclass) so an individual assignment is still a single atomic ref store of the new bundle. The setters fall back to a fresh _MemoryLimitState if invoked before __init__ finishes — covers tests that build Schedulers via __new__ and set attrs one at a time. Tests (108 pass across the touched suites): - tests/test_scheduler_prefill_memory_guard.py — covers plain LLM config, VLM nested config, top-level-num_hidden_layers referring to vision encoder, language_config / llm_config aliases, legacy n_layer / n_head / n_embd fallback, VLM nested-via-legacy-n_layer (new regression for the widened sub-config probe), broken-config exception swallow, guard on/off, cached-vs-uncached. - tests/test_process_memory_enforcer.py — includes regressions for the load-after-start race (test_check_and_enforce_propagates_every_poll), the wrapper-traversal path (test_propagates_through_batched_engine_wrapper), and the rate-limited unreachable-scheduler warning (test_unresolvable_scheduler_logs_warning_once). Two new tests pin the bundle contract: test_propagate_publishes_atomic_memory_state_bundle — asserts the writer publishes a frozen _MemoryLimitState with all four fields coherent, so the reader's single-snapshot read can never observe a mixed (guard=True, hard_limit=0) combination. test_memory_state_bundle_matches_guard_off_path — symmetric for the OFF path. - tests/test_memory_monitor.py — adds two cases pinning the public ``eviction_enabled`` predicate so Scheduler's dormant-eviction short-circuits don't silently flip behavior on a future refactor. Verified end-to-end against Qwen3.6-35B-A3B-oQ6 on a 48 GiB Mac16,8 with max_process_memory=40 GiB: a 110k-token prefill that previously either OOM-aborted the Python process or triggered the Apple IOGPUFamily kernel panic is now rejected by _preflight_memory_check before reaching MLX. Broader concurrent / cache-replay torture cycled memory pressure soft↔ok throughout but never crossed hard.
When the prefill memory guard rejects a request, surface it as a proper
HTTP 413 with the diagnostic message in the body, instead of an
unhandled 500 + truncated chunked read.
Two pieces are needed because the chat/completions path wraps the
response body in a StreamingResponse to send SSE keepalive bytes during
long prefill, and starlette emits http.response.start (status 200)
before iterating the body generator. A typed exception thrown from
within that body lands as "Caught handled exception, but response
already started" — the registered 413 handler is correctly invoked but
can no longer change the status code, so the client sees an incomplete
chunked read.
1. Typed-exception path.
Scheduler.add_request now runs _preflight_memory_check synchronously
and raises PrefillMemoryExceededError if rejection is warranted —
before the request is appended to self.waiting. Register an
@app.exception_handler(PrefillMemoryExceededError) that returns
HTTP 413 with the OpenAI-style {"error": {"message": ...}} body on
/v1/* routes and a plain {"detail": ...} elsewhere. The async
re-check in _schedule_waiting stays as a race-safety net. The
docstring on add_request now spells out that admission preprocessing
(tokenisation, prefix-cache lookup, block-table acquisition,
SpecPrefill scoring) HAS run by the time the typed exception fires
— preprocessing state is rolled back on the raise path so the
rejection doesn't leak resources.
2. Server-side preflight before StreamingResponse.
Add Scheduler.preflight_or_raise(num_prompt_tokens, cached_tokens=0)
that runs the token-count form of _preflight_memory_check and
raises PrefillMemoryExceededError on rejection (with a
logger.warning so observability matches the async path). Split out
_preflight_memory_check_tokens(num_prompt_tokens, cached_tokens) so
the existing Request-shaped check still works.
Add engine.preflight_chat(messages, tools=None, **kwargs) and
engine.preflight_completion(prompt, **kwargs) on BatchedEngine and
VLMBatchedEngine. Both apply the chat template, tokenize, traverse
the wrapper chain to reach the scheduler, and call
preflight_or_raise. If the wrapper chain resolves to None (partial
init failure) the preflight no-ops with a logged warning rather
than silently skipping the safety check. The warning is routed
through ``_warn_scheduler_unreachable_once`` in engine.base —
rate-limited per (engine_class, method) pair per process so a
misconfigured engine doesn't spam logs at request rate, only once
at first occurrence (enough to alert CI / oncall).
The VLM ``preflight_chat`` strips image content-parts via
``extract_images_from_messages`` BEFORE applying the chat template,
mirroring the real ``_process_chat_messages`` flow. Without this,
modern HF chat templates (Qwen2.5-VL, Gemma-Vision, Llama-3.2-Vision)
render image content-parts as literal placeholder strings inline
with the text — the tokenized prompt then already contains image-
placeholder tokens AND the per-image ``_IMAGE_TOKEN_UPPER_BOUND``
budget is added on top, double-counting and producing spurious
413s on borderline image-bearing requests that the real chat path
would have admitted. After the strip, the per-image budget is the
only source of image-token accounting and the upper bound is sharp.
The VLM preflight_chat keeps the 1280-tokens-per-image budget
(the high end of what Qwen-VL / Gemma-Vision packs expand each
image to). Over-estimates somewhat for small images (false-positive
413s for borderline-and-image cases) but never under-counts, which
is the property the guard needs to stay safe against the Apple
IOGPUFamily panic path. VLM ``preflight_chat`` also routes tools
through ``convert_tools_for_template`` to match the real chat path
— without conversion, Pydantic ``ToolDefinition`` callers would
hit the template's TypeError-retry fallback and silently drop
tools, miscalibrating the token count.
Wire the preflights into every prompt-bearing route handler BEFORE
the StreamingResponse return:
POST /v1/chat/completions (OpenAI standard)
POST /v1/completions (OpenAI legacy)
POST /v1/messages (Anthropic)
POST /v1/responses (OpenAI Responses)
``_is_api_route`` keeps its path-prefix-only implementation. The
server is currently mounted at root (no sub-app prefix); a comment
on the helper records the assumption so a future ``app.mount(
"/api", ...)`` deployment can move it to scope-based matching.
Conservative-by-design choices documented for follow-ups:
- The server preflight passes cached_tokens=0 (does not thread the
prefix-cache lookup forward). Under a hot prefix cache an admission
that would have fit may be rejected with 413; the synchronous
in-add_request recheck inside the scheduler runs after the prefix-
cache resolution and can still accept it, but the early rejection
costs the caller a retry. Tracked for a future PR that pre-resolves
the cache key.
- The estimator is prompt-only; decode growth from ``max_tokens`` is
not modelled. The tightened in-flight abort threshold (now equal to
max_bytes — see the memory-guard commit) catches the runtime
overflow if it happens, but the request will see a stream-shape
error rather than upfront 413. Acceptable given decode growth on
borderline prompts is unusual.
Tests:
- tests/test_server_prefill_memory_handler.py — 4 cases:
- handler shape on /v1/* (OpenAI ``{"error": ...}`` wrapper) and
non-/v1/ (plain ``{"detail": ...}``);
- end-to-end ``/v1/responses`` regression hitting the real route
with a force-rejecting preflight, asserting the production
handler returns 413 with the OpenAI body shape (catches any
future route that constructs a StreamingResponse without
awaiting preflight first).
- tests/test_engine_preflight.py — 20 cases: scheduler raise path
(within budget / over budget / guard disabled / fully cached),
BatchedEngine + VLMBatchedEngine preflight_chat with a stubbed
wrapper chain, preflight_completion, the VLM image-token budget,
the new VLM strip-images-before-template behaviour (regression for
the double-count bug), Pydantic-tools conversion through
convert_tools_for_template, and the wrapper-chain-unreachable
warning log.
Tokenizer hardening: preflight_chat and preflight_completion
on both engines now catch exceptions from tokenizer.encode (BPE
malformed input, HF Rust "Already borrowed", UnicodeDecodeError on
exotic content) and log + skip the memory check rather than letting
them surface as a new 500. The real chat / completion path will hit
the same error and route it through the existing handler chain, so
the user-visible response shape stays consistent with the
pre-preflight behaviour for these cases. A future PR can map raw
tokenizer errors to 400 for both paths.
Verified end-to-end against Qwen3.6-35B-A3B-oQ6 on a 48 GiB host: a
110k-token POST /v1/chat/completions now returns HTTP 413 in 0.13 s
with the full diagnostic body ("Prefill would require ~43.47 GB peak
(current 27.91 GB + KV+SDPA 15.56 GB) but limit is 40.00 GB."), and
the request never enters MLX prefill so the underlying Apple
IOGPUFamily kernel bug stays cold.
Closes a long-standing gap noticed during the upstream merge: the class ships full to_dict/from_dict but had no tests, so field drift would have gone unnoticed (cf. upstream jundot#1286 which had to retrofit the same coverage for SchedulerSettings/MemorySettings).
…stream Adopts the keyword-only `cached_tokens` parameter that upstream PR jundot#1326 settled on (commit efcc858). Same behavior — same correctness fix on the SDPA span — just a different API shape: before: (new_tokens, cached_tokens, chunk_size) # required pos after: (new_tokens, chunk_size, *, cached_tokens=0) # kw-only Done now so the next upstream merge auto-merges the function definition line instead of conflicting on it. The body still keeps the local correctness extras (early return on new_tokens<=0, negative cached_tokens clamp, eff_chunk floor) that upstream lacks.
Closes the four ModelSettings fields that had no references anywhere in tests/: turboquant_kv_bits, turboquant_skip_last, vlm_mtp_draft_model, vlm_mtp_draft_block_size. All four are wired into admin/routes.py and classified in model_profiles.py, so silent drift between defaults and the routes layer would have been invisible. Pattern matches the existing per-field convention in this file: defaults, to_dict/from_dict roundtrip, and (for Optional fields) excluded-when-none.
Verified unused: zero instantiations across the entire git history
(`git log -S 'TieredCacheManager('` returns only the initial-commit
declaration), zero imports outside the module's own __init__ re-export,
zero test references. The class was a planned coordinator between
PagedCacheManager / BlockAwarePrefixCache / PagedSSDCacheManager /
MemoryMonitor — Scheduler.__init__ does that wiring directly, making
the abstraction redundant.
Removes 353 lines plus the re-export from omlx.cache.__all__.
omlx/api/mcp_routes.py was wired into server.py at module init but had zero direct test coverage — test_mcp_*.py files exercised the manager and executor layers but not the FastAPI handlers, leaving the request parsing, no-manager fallbacks, and the tool/tool_name alias (upstream PR jundot#1285) unguarded. 15 tests via TestClient covering all four entry points: - set_mcp_manager_getter wiring + None propagation - GET /v1/mcp/tools: empty-list short-circuit, namespaced full_name serialization - GET /v1/mcp/servers: enum-to-string flattening, error-field passthrough - POST /v1/mcp/execute: 503 when unconfigured, success / handled-error paths, tool-vs-tool_name alias regression guard, default-arguments fallback, 422 on missing tool name The autouse fixture resets the module-global `_get_mcp_manager` callback between tests so state can't leak across cases.
omlx/api/rerank_models.py defines the request/response contract for the /v1/rerank endpoint (Cohere/Jina-compatible) but had no schema tests — only the engine layer was exercised by test_reranker_*.py. Field renames or default changes would have silently broken clients. 24 tests covering all four classes: - RerankRequest: text + multimodal (dict) query/documents shapes, defaults (return_documents=True, top_n=None), validation rejection on missing required fields, JSON round-trip with return_documents=False - RerankResult: document=None path, multimodal document passthrough (preserves image data-URIs), required field enforcement - RerankUsage: required total_tokens - RerankResponse: auto-id 'rerank-' prefix + 8 hex chars (Cohere clients filter on this), per-instance uniqueness, explicit-id override, optional usage, JSON round-trip with multiple results
omlx/admin/oq_manager.py is instantiated into server_state at startup and drives all admin-panel quantization workflows, but had no direct tests — test_oq.py covers the omlx/oq.py library underneath, not the async task manager on top. 51 tests across the synchronous and validation surface: - QuantStatus enum + str-inheritance - QuantTask.to_dict shape, progress rounding, status enum.value - _dir_size and _format_size helpers (parameterized) - _phase_label including the quantizing_eta|c|t|eta parser edge cases (zero total, non-numeric parts) - OQManager.__init__ defaults and the first-dir-wins output_dir rule - update_model_dirs refresh + empty-update no-op invariant - get_tasks / is_quantizing across active vs terminal statuses - remove_task: unknown / refuses-active / clears _cancelled set - cancel_quantization rejection paths (unknown, non-active) - start_quantization validation: invalid oq_level, invalid dtype, missing model dir, missing config.json, output collision, duplicate active task, and that completed tasks don't block resubmits - list_quantizable_models: empty / skip-malformed / MTP detection / is_quantized flag / dedup across dirs - shutdown: no-op when idle, cancels every active task The full _run_quantization / _estimate_progress lifecycle and the cooperative-cancellation flow are left for a future pass — they need end-to-end mocking of quantize_oq_streaming + Metal cleanup, which is brittle and not the highest-value gap right now.
Extends test_oq_manager.py with the lifecycle paths that needed quantize_oq_streaming mocking — left as a follow-up in the previous commit because they require threading-coordinated stubs. 13 new tests across four classes: - TestRunQuantizationHappyPath: completion fields (status, progress, output_size, started_at/completed_at ordering), sync and async on_complete callbacks, on_complete exception isolation (a buggy callback must not flip status to FAILED), progress-callback wiring - TestRunQuantizationFailure: exception → FAILED with error captured, partial output cleaned from disk - TestRunQuantizationPreCancel: _cancelled set before semaphore entry → quantize_oq_streaming never invoked (shutdown-race guard) - TestCancelCooperativeExit: full cancel via the progress-callback path (the design upstream chose over hard-cancelling the asyncio wrapper — see the comment block in cancel_quantization), plus the cancel-during-LOADING edge case where no progress_cb has fired yet - TestEstimateProgress: unknown-task no-op, parent task's finally clause cancels the progress estimator on both success and failure Uses threading.Event coordination in fake_quantize so the cooperative cancel path exits as soon as the flag is set, not after fixed delays — the full test class runs in <0.3s.
omlx/mcp/config.py is consulted on every server startup but had no direct tests — test_mcp_types.py covered the MCPConfig / MCPServerConfig dataclasses underneath, not the file discovery, JSON/YAML loading, or the validate_config schema checks layered on top. 38 tests across eight classes: - validate_config input: non-dict / empty / servers-must-be-dict / the falsy-non-dict-servers quirk (`[]` silently falls through to mcpServers) / entry-must-be-dict with server name in the message - validate_config server loading: stdio happy path, key-as-name auto injection, key-overrides-explicit-name, unknown-kwarg surfaces a ValueError with the offending server name, the Claude Desktop `mcpServers` alias for drop-in compatibility, `servers` precedence over `mcpServers` (or-chain semantics), empty mcpServers fall-through - validate_config globals: max_tool_calls valid/zero/negative/non-int plus the bool-is-int quirk (documented), default_timeout int/float/ zero/negative/string rejection - Path discovery: explicit existing path, missing explicit path → FileNotFoundError, tilde expansion, OMLX_MCP_CONFIG honored, env-var-missing-file falls through with a warning rather than crashing the server startup, search-path order, all-paths-missing returns an empty MCPConfig (not None, not error) - File formats: JSON load + malformed-JSON propagation, YAML load via .yaml and .yml (gated on PyYAML availability) - create_example_config: valid JSON, round-trips through validate, showcases multiple transports All discovery tests use an `isolated_env` fixture that monkeypatches CONFIG_SEARCH_PATHS to [], clears the env var, and chdirs to tmp_path so real ~/.config/omlx/mcp.json or stray ./mcp.json files can't leak into the test.
omlx/optimizations.py is a thin hardware/MLX status helper consumed by the admin dashboard. The re-exported HardwareInfo / detect_hardware / get_total_memory_gb were already covered by test_utils_hardware.py; this commit pins the dict shape produced by get_optimization_status and the flash-attention probe. 10 tests: - Re-exports are identity-equal to the canonical hardware symbols (otherwise external scripts importing from omlx.optimizations would silently get stale copies); __all__ matches the documented surface - get_optimization_status returns three top-level keys (hardware, mlx_memory, mlx_lm_features) with the expected sub-key sets - mlx_memory counters are non-negative ints sourced from mx.get_*_memory - mlx_lm_features static strings pinned verbatim — the dashboard shows these verbatim, so a typo here would surface in production - flash_attention reports 'built-in' on real MLX, 'not available' when mx.fast lacks scaled_dot_product_attention (simulated via patch.object(mx, 'fast', MagicMock(spec=[]))) - active_bytes actually reflects MLX state — allocating a 4 MiB array must register in the delta, proving the value isn't hardcoded
Closes two remaining gaps with real production impact: omlx/models/base_model.py — mean_pooling and normalize_embeddings are the pooling/L2-norm primitives used by omlx/models/xlm_roberta.py (the reranker model). A silent change to either would corrupt embedding output. 12 tests: - BaseModelArgs/BaseModelOutput dataclass shape (defaults + full fields) - mean_pooling: uniform mask = simple mean, partial mask excludes padded positions (the load-bearing invariant — pre-mask sums would let padding tokens taint embeddings), all-zero-mask guard prevents NaN/Inf (the clip(..., a_min=1e-9) defense), batch-dim passthrough, fp16 dtype propagation via the mask.astype(hidden_states.dtype) cast - normalize_embeddings: unit-norm output, axis=-1 only (catches wrong-axis bugs that would silently destroy similarity comparisons), higher-rank shape preservation, idempotence omlx/eval/mbpp.py — _extract_code is the regex/heuristic parser that pulls Python source out of MBPP-task LLM responses before the subprocess executor runs them. The branch coverage matters: 50% of runs use unspecified-language fences. 15 tests: - Fenced blocks: ```python wins over ```, multiline, whitespace strip - Heuristic line scan: def/class/import/from/# all trigger code-region start; everything after is included (no end detection by design) - Fallback: no markers → whole response stripped, empty/whitespace inputs, lone print() statement documents the known limitation
…_block Upstream 1b666af added the ssd_write_drops counter at three drop sites, including a pre-tensor-extraction guard in save_block that returns False when the write queue is already full. The earlier inline-LRU-unlinks change (b9d7161) argued the short-circuit was redundant because eviction no longer contended for the queue, and the rebase resolution dropped it. But eviction was never the only path to saturation: when the writer itself can't keep up with sustained saves, the short-circuit avoids running _enforce_size_limit_for_new_block and the GPU tensor extraction only to drop the block at the put step a few hundred lines later. Gated on `not self._hot_cache_enabled`. In hot-cache mode writes are deferred to LRU spill via _enqueue_ssd_write, which has its own timeout-put + drop accounting and shouldn't double-count here. Tests: - restore test_ssd_write_drops_increments_on_cold_store_preflight which patches _write_queue.full() to True - reword test_ssd_write_drops_increments_on_cold_store_late_exception to clarify it now covers the put-after-preflight race - test_ssd_write_drops_increments_on_hot_eviction_queue_full now patches put (the actual call in _enqueue_ssd_write) instead of put_nowait, matching the local timeout-based put semantics
The 0e6cbc5 commit (now part of the rebased local history) tightened _get_hard_limit_bytes to always return max_bytes, arguing that auto mode's system_ram - 4GB headroom let Qwen3.6-VL head_dim=256 prefills burst past the cap before Metal aborted. Upstream acd0533 has since landed an adaptive prefill throttle (_prefill_safe_zone_ratio / _prefill_min_chunk_tokens) that shrinks the per-chunk transient as usage approaches the cap — the original failure mode is now bounded by the throttle before reaching Metal. Restore upstream's two-branch behavior: - user_explicit_max=True -> max_bytes (user value IS the ceiling) - auto -> max(system_ram - 4GB, max_bytes) Tests: - remove the now-obsolete test_hard_limit_equals_max_bytes added by 0e6cbc5 (coverage is preserved by upstream's test_hard_limit_honors_user_explicit_max and test_hard_limit_auto_mode_still_uses_system_minus_4gb) - enforcer fixture now sets user_explicit_max=True so propagation tests can assert the exact hard_limit value without entanglement with the host's system memory
upstream acd0533 added max_process_memory_is_explicit, prefill_safe_zone_ratio, and prefill_min_chunk_tokens to MemorySettings.to_dict() but the to_dict test was not updated (last touched in jundot#1286 on 2026-05-19, predating acd0533 by a week). The test was already failing on upstream/main — this just closes the gap.
…butes The bundle (introduced in b6a69c4) wrapped the four memory-guard fields (_memory_limit_bytes, _memory_hard_limit_bytes, _prefill_memory_guard, _admission_paused) in a frozen dataclass so ProcessMemoryEnforcer could publish them via a single reference store. The motivation was PEP 703 free-threading, where per-attribute STORE_ATTRs are no longer GIL-serialized into a consistent order from another thread's view. Under current CPython the GIL already serializes the per-attribute writes within a single enforcer poll, so the hot-path reader (_preflight_memory_check_tokens) sees a coherent (guard, hard_limit) pair without the bundle. Keeping the bundle traded non-trivial complexity (frozen dataclass + four property/setter pairs + ``dataclasses.replace`` rebuilds on every test setattr) for PEP 703 prep we don't need today. Drop it; revisit if/when we ship nogil. - omlx/scheduler.py: remove _MemoryLimitState dataclass, remove the four properties and _current_memory_state helper, init the four fields as plain attributes (matches upstream). - omlx/process_memory_enforcer.py: _propagate_memory_limit writes the four fields directly instead of building a new bundle. - tests/test_process_memory_enforcer.py: drop _MemoryLimitState imports and bundle setup; collapse the two redundant bundle-atomicity tests into one guard-off test. Net: -250 / +51 lines.
The post-rebase add/add merge with upstream 1010fd3 left two classes covering update_model_dirs: - TestUpdateModelDirs (local) — bare tmp_path setup; covered the reorder-changes-head case and the empty-list defensive case. - TestOQManagerUpdateModelDirs (upstream) — fp_model_dir fixture setup; covered async list_quantizable_models() integration and the head-tracking case. They overlapped on the head-tracking assertion. Fold them into a single TestUpdateModelDirs with three tests: - test_picks_up_added_dir (async integration, from upstream) - test_output_dir_tracks_primary_dir (extended with the local test's _model_dirs == [head, ...] order assertion) - test_update_to_empty_leaves_old_output_dir (from local) Net: -7 tests removed, -2 new in the consolidated class, +1 stronger assertion on the head-tracking test.
…y_limit The docstring's "four guard fields are written as direct attributes / on current CPython the GIL serializes per-attribute STORE_ATTRs" paragraph was contextual to the now-removed _MemoryLimitState bundle (see 675cae1) — when the bundle existed, the paragraph explained why the bundled single-store was equivalent under today's GIL. With the bundle gone the paragraph is just a CPython explainer that adds noise without informing the reader.
…ream Upstream ``tests/test_integrations.py::TestIntegrationSettings`` covers defaults, basic to_dict, full-dict and empty-dict from_dict. The local class had near-exact duplicates of three of those (``test_defaults``, ``test_from_dict_full``, ``test_from_dict_empty``) — drop them. Keep the genuinely additive coverage: - ``test_to_dict_defaults`` — pins the exact to_dict shape, catching the kind of field-added-but-test-not-updated regression that bit MemorySettings.to_dict (see 81dc2d5). - ``test_to_dict_custom`` — stricter than upstream's (sets all seven fields, asserts the full dict). - ``test_from_dict_partial`` / ``test_from_dict_explicit_null_overrides_default`` / ``test_round_trip`` — partial-dict fallback, null-override semantics, identity round-trip; not covered upstream. Net: -28 lines, three fewer merge-conflict surfaces with upstream's copy of the same tests.
# Conflicts: # omlx/engine/base.py # omlx/process_memory_enforcer.py # omlx/scheduler.py # tests/test_process_memory_enforcer.py # tests/test_settings.py
# Conflicts: # packaging/build.py # pyproject.toml
…ping not yet populated _emit_prefill_boundary_snapshot routed through a uid that doesn't exist yet. At prefill time the request has not been inserted into BatchGenerator (that happens in _insert_prefilled_request *after* prefill completes), so request_id_to_uid has no entry for it and the lookup defaults to -1. _on_prefill_boundary_snapshot then does uid_to_request_id.get(-1) -> None -> silent return, dropping every boundary snapshot. For ArraysCache / GDN / hybrid models that meant every non-last cached block stored a placeholder (mx.zeros((1,)), mx.zeros((1,))) instead of the real recurrent state. The next identical-prefix request matched the cache by hash, found a placeholder in the last matched block, and the lookup-side guard at prefix_cache.py:1986 correctly rejected the cache as unusable — but the resulting "Request will reprocess from scratch" branch then re-prefilled the entire prompt. Visible symptom: re-uploading the same large prompt (e.g. a chat continuation) re-prefilled all 90k tokens instead of the few new ones. Pass request_id through directly. _on_prefill_boundary_snapshot was already keyed by request_id internally — the uid round-trip was pointless. The other prefill-side caller, _maybe_capture_boundary_ snapshot (called during generation from a path where uid IS valid), writes snapshots inline without going through this helper, so it was unaffected. That asymmetry explains the bimodal pattern in production logs: short-prompt + long-generation requests captured 20+ intermediate snapshots cleanly, while long-prompt + short- generation requests captured 0 (since all block boundaries were in prefill, where the uid lookup failed). Verified end-to-end via observability instrumentation on a 90k-token prompt: every SNAP_EMIT was followed by uid=-1 SNAP_SKIP before the fix; every SNAP_EMIT is followed by SNAP_OK after. The decisive proof was a subsequent identical-prefix request that started prefill at the cached boundary (token 51200) instead of token 0. Updated existing tests to reflect that no uid mapping is required. Added a regression test that mirrors production state at prefill time (request_id_to_uid deliberately unset).
|
Thanks for the diagnosis. The request_id_to_uid timing explanation is exactly right, and the short-prompt+long-gen vs long-prompt+short-gen bimodal pattern matches what I've been seeing on Qwen3.6-A3B sessions. Cherry-picked the boundary-snapshot fix onto main as 9407468 with your authorship preserved. I left the other 23 commits on this branch alone. They overlap with your narrower PRs (#1447, #1450, #1451, #1452, #1453, etc.), so I'd rather review and merge those independently to keep the main history readable. Closing this PR; the narrow ones stay open. Apologies if that's not what you intended. Happy to revisit if you'd prefer the bundle path. |
|
Thanks for the merge of 9407468 — and for the careful read on the bimodal pattern. You're right about the fat-PR shape — that's my mistake. I branched Happy to leave the narrower PRs (#1446–#1453, #1458) on their independent tracks per your suggestion. No need to revisit. Thanks again. |
Summary
Prefill-side boundary snapshots were being silently dropped for ArraysCache / GDN / hybrid models, causing every non-last cached block to store a placeholder and every identical-prefix re-upload to re-prefill from scratch.
_emit_prefill_boundary_snapshot(scheduler.py:2549) routed through:But
request_id_to_uidis only populated when the request is inserted intoBatchGeneratorvia_insert_prefilled_request, which runs after prefill completes. During prefill the lookup defaults to-1._on_prefill_boundary_snapshotthen does:→ every boundary-snapshot call during prefill was a no-op.
For sliceable cache types (KVCache, BatchKVCache, etc.) this is invisible — those layers don't need snapshots. For ArraysCache / GDN layers, every non-last block in
prefix_cache.store_cachestores a(mx.zeros((1,)), mx.zeros((1,)))placeholder when no snapshot is available. The next identical-prefix request matches blocks by hash, finds a placeholder shape(1,)in the last matched block, and the rejection atprefix_cache.py:1986correctly bails — but the resulting "Request will reprocess from scratch" branch re-prefills the entire prompt.User-visible symptom
Hybrid-architecture models (e.g. Qwen3-Next / Qwen3.6-A3B) re-prefilling the full prompt on every chat continuation, even when only one or two new turns were appended. Log signature:
Repeating every ~few minutes under normal chat use.
The fix
Pass
request_idthrough directly._on_prefill_boundary_snapshotalready keyed everything byrequest_idinternally — the uid round-trip was pointless.And the helper's signature changes from
uid: inttorequest_id: str, dropping the now-unuseduid_to_request_idlookup.The other prefill-side caller,
_maybe_capture_boundary_snapshot(called during generation from_stepafter the request has been inserted, where uid IS valid), writes snapshots inline without going through this helper, so it was unaffected. That asymmetry explains the bimodal pattern in production logs: short-prompt + long-generation requests captured 20+ intermediate snapshots cleanly (boundaries crossed during generation), while long-prompt + short-generation requests captured 0 (all boundaries in prefill, where the uid lookup failed).Diagnosis evidence
End-to-end repro on a Qwen3.6-35B-A3B (GDN layer 0) instance, 90k-token prompt:
Before fix (every block boundary):
After fix:
Subsequent identical-prefix request started prefill at token 51200 (previous cached boundary) instead of 0 — no rejection line.
Test plan
test_prefill_boundary_snapshot_records_rotating_cacheto NOT pre-populaterequest_id_to_uid/uid_to_request_id— matches actual prefill state.test_prefill_boundary_snapshot_ignores_non_boundary_token_countsimilarly.test_emit_prefill_boundary_snapshot_persists_before_uid_assignment— pins the regression: calling_emit_prefill_boundary_snapshotwhile the request has not been inserted into BatchGenerator must still record the snapshot.pytest tests/test_scheduler.py -k "prefill_boundary or emit_prefill_boundary"— 3 passed.