feat(server): typed pre-StreamingResponse preflight + HTTP 413#1452
Open
cfbraun wants to merge 4 commits into
Open
feat(server): typed pre-StreamingResponse preflight + HTTP 413#1452cfbraun wants to merge 4 commits into
cfbraun wants to merge 4 commits into
Conversation
a1a723b to
72f2389
Compare
Closed
5 tasks
3d4e6aa to
00cd892
Compare
Adds a pre-admission prefill-memory check that runs BEFORE the
FastAPI route wraps the response in a ``StreamingResponse`` and
raises a typed ``PrefillMemoryExceededError`` that the registered
exception handler maps to HTTP 413 with a structured JSON body
carrying ``estimated_bytes`` / ``limit_bytes``.
The existing ``_preflight_memory_check`` runs *inside*
``_schedule_waiting`` — after Starlette has already emitted
``http.response.start`` (status 200) for the streaming response.
Rejection there surfaces as an in-stream ``RequestOutput(
finish_reason="error")`` SSE event, so OpenAI-SDK and Anthropic-SDK
clients see HTTP 200 and parse an error mid-stream. That's confusing
for clients that don't read SSE bodies on error, and there's no clean
way to bubble it to the caller's exception handlers.
This PR adds an *earlier* gate:
1. Engine grows ``preflight_chat`` / ``preflight_completion``
async methods. ``BatchedEngine`` and ``VLMBatchedEngine``
override the base no-op: tokenize the templated prompt
(text-only template for VLMs, plus a conservative per-image
token-budget so we never under-count), then call the
scheduler.
2. Scheduler grows ``preflight_or_raise`` (raises
``PrefillMemoryExceededError`` on overshoot) and
``_preflight_memory_check_tokens`` (token-count form returning
a structured ``_PreflightRejection`` so the typed exception can
carry the numeric estimated / limit bytes).
3. Server grows ``prefill_memory_exceeded_handler`` (maps the
typed exception to HTTP 413 + OpenAI-shaped error body with
``code="prefill_memory_exceeded"``) and a 4× ``await
engine.preflight_*`` call site before each ``StreamingResponse``
return (chat, completion, anthropic messages, responses API).
Complementary to ``acd0533``'s adaptive prefill throttle: the
throttle is mid-prefill and bounds slow drift; this is pre-admission
and rejects "obviously too big from the start" before any compute
runs.
Also wakes up the existing dormant prefill-peak estimator by
instantiating ``self.memory_monitor`` in scheduler ``__init__`` —
``Scheduler`` previously only assigned ``self.memory_monitor =
None``, so ``_preflight_memory_check`` always returned None at the
``if self.memory_monitor is None: return`` guard. ``MemoryMonitor``
gains an ``eviction_enabled=False`` mode for use in paged-SSD-only
schedulers where ``estimate_blocks_to_free`` / ``_check_memory_pressure``
are dormant; the constructor's ``max_kv_cache_memory`` becomes
``int | None`` to match.
The existing in-stream ``_preflight_memory_check`` is kept as
defense-in-depth for callers that bypass the server preflight
(direct engine API, future endpoints). Its return type changes from
``str | None`` to ``_PreflightRejection | None``; the one in-stream
call site is updated to use ``.message``.
VLM helper additions:
- ``_IMAGE_TOKEN_UPPER_BOUND_FALLBACK`` constant.
- ``_derive_image_token_upper_bound(processor)``: derive the
per-image token bound from the processor's ``max_pixels`` /
``patch_size`` / ``merge_size``; fall back to the constant when
the processor isn't loaded or doesn't expose those.
- ``_count_image_tokens(messages, per_image_upper_bound)``: walk
OpenAI-style content parts and return the conservative budget.
Tests:
- ``tests/test_engine_preflight.py`` (~500 LOC, 19 tests): scheduler
``preflight_or_raise`` happy / reject paths; engine preflight
chat / completion oversize → raise; VLM image-token budget
inclusion; image-stripping before template application; Pydantic
tool conversion; tokenizer-error swallow on all three preflight
paths; request_id threading; scheduler-unreachable rate-limited
warning.
- ``tests/test_server_prefill_memory_handler.py`` (~195 LOC, 5
tests): handler returns 413 with OpenAI-shaped body for chat;
surfaces estimated_bytes / limit_bytes; non-API routes get plain
detail; ``/v1/responses`` endpoint reaches 413.
22 new tests + 208 existing scheduler / enforcer / memory tests pass.
User report: an HTTP 413 from this PR said "ceiling is 36.70 GB ... Reduce context length or lower memory_guard_tier" — but they had memory_guard_tier set to ``balanced`` on a 48 GB Mac with static_ceiling = 40 GB. The 36.7 GB was the *dynamic* ceiling (reclaimable memory was low right now), not the static cap they saw in the dashboard. The advice steered them toward the wrong knob: they needed to close other apps, not lower the tier. Root cause: ``_get_hard_limit_bytes()`` returns ``min(static, dynamic, metal_cap)`` but the rejection message had no way to know which of the three was binding. Every rejection collapsed to the same generic "raise tier or reduce context" advice. Fix: propagate the three component ceilings alongside the hard limit so the rejection path can name the binding one and give matching advice. Refactor ``_get_hard_limit_bytes`` into a thin wrapper around a new ``_get_ceiling_breakdown()`` that returns the four values in a single computation — important because the metal_cap probe shells out to ``sysctl``, and we don't want to double that on every poll. The enforcer reads the breakdown once and writes ``_memory_static_ceiling_bytes`` / ``_memory_dynamic_ceiling_bytes`` / ``_memory_metal_cap_bytes`` on each scheduler. The scheduler's ``_preflight_memory_check_tokens`` then identifies the binding constraint by comparing each component against ``_memory_hard_limit_bytes`` and emits a message+advice tailored to the cause: - **dynamic binding** (static cap has room but reclaimable is low): "close other apps (static cap is X but only Y is reclaimable right now), raise memory_guard_tier, or reduce context length" - **metal_cap binding** (kernel sysctl is the floor): "raise kernel iogpu.wired_limit_mb in Terminal, or reduce context length" - **static binding** (the configured reserve is the constraint): "reduce context length or raise memory_guard_tier" - **components not propagated** (test scaffolding wired hard_limit directly): falls back to the generic message; no crash. Tests (TestRejectionMessageNamesBindingCeiling, 4 new): - dynamic-binding case mirrors the user-reported scenario and checks for "close other apps" + the static-cap mention. - metal_cap binding mentions ``iogpu.wired_limit_mb``. - static binding falls through to ``memory_guard_tier``. - fallback path when the breakdown is unset still produces a usable message rather than a KeyError or empty string. ``_make_enforcer`` fixture updated to override ``_get_ceiling_breakdown`` alongside the existing ``_get_hard_limit_bytes`` override so tests with fixed ceilings don't accidentally pull the host's live ceiling math into the propagated values.
When ``memory_guard_tier=custom`` the "dynamic" ceiling that ``_get_dynamic_ceiling`` returns is the user-pinned ``custom_ceiling_bytes`` rather than computed reclaimable memory. The existing advice ladder treated dynamic-binding as "close other apps to free RAM" — wrong for custom-tier users: closing apps doesn't move their hard-pinned ceiling, raising ``custom_ceiling_bytes`` does. Propagate ``memory_guard_tier`` alongside the breakdown components and branch the advice on it. New test pins the custom-tier scenario.
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492) surfaced three failures introduced by the rebase against upstream's cb0904d (predictive prefill throttle) and our own change to ``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only mode so preflight checks work without prior set_model_info): 1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the ``if num_layers and num_kv_heads and head_dim:`` truthiness gate doesn't reject auto-attribute mocks. The estimator path (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then blew up the upstream throttle tests with ``TypeError: '>' not supported between instances of 'MagicMock' and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so the estimator stays disabled until set_model_info is called with genuine model dims. 2. ``test_schedule_waiting_preflight_rejection_releases`` patched ``_preflight_memory_check`` to return a plain ``str`` (matching the pre-PR contract). Our PR types the return as ``_PreflightRejection`` and the rejection-path code in ``_schedule_waiting`` reads ``.message`` on it. Update the mock to return an actual ``_PreflightRejection`` so the test exercises the real shape. 3. ``test_picks_text_config_over_top_level_vision_dims`` (added by our merged jundot#1448) asserted ``sched.memory_monitor is None`` at init. Our PR now constructs the monitor in estimator-only mode at init time; replace the assertion with the actual setup it needs (overwrite ``memory_monitor`` with a MagicMock to inspect ``set_model_info`` calls).
f9b9d60 to
31397a8
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
For requests whose estimated peak exceeds the configured ceiling — i.e. requests that genuinely can't fit at any chunk size — run a typed memory check before the route handler wraps the response in a
StreamingResponse, and map the structured rejection to HTTP 413 with a diagnostic message that names the binding ceiling and suggests the right remedy.Strictly complementary to #1523 (
cb0904d), not a competing solution:Without this PR, an unfittable request either (a) tries to start streaming and the route handler raises after
http.response.starthas already gone out, locking the client to a malformed HTTP 200 with truncated SSE, or (b) gets requeued repeatedly by #1523 and eventually fails with a genericfinish_reason="error"— neither of which tells the operator which knob to turn.The problem with raising after the stream opens
FastAPI emits
http.response.startthe moment the route handler returns aStreamingResponse. Anything that raises inside the body iterator runs after the status code is on the wire, so the registeredPrefillMemoryExceededError→ HTTP 413 handler is bypassed. The client sees HTTP 200, an empty/truncated SSE stream, and no actionable error.What's in this PR
Typed
_PreflightRejection(scheduler.py) carryingmessage,estimated_bytes, andlimit_bytes. Replaces the prior pattern of returning a bare string that callers had to parse.preflight_or_raise(num_prompt_tokens, cached_tokens=0)on the scheduler: runs the memory check from token counts (noRequestneeded), raisesPrefillMemoryExceededErroron rejection. Callable from the API server layer immediately after tokenization.preflight_chat/preflight_completiononBatchedEngineandVLMBatchedEngine: tokenize the inputs the same way the streaming path does, then callpreflight_or_raiseon the underlying scheduler. The VLM helper covers both the standard and turboquant entry points.HTTP 413 mapping (
server.py): chat/completion/responses endpoints invoke the engine's preflight before constructing theStreamingResponse.PrefillMemoryExceededErroris registered with FastAPI to return 413 with the diagnostic body. The existing in-stream re-check inside_schedule_waitingis kept as defense-in-depth for callers that bypass the server preflight (direct engine API, future endpoints).Component-ceiling breakdown + advice ladder (
scheduler.py:_preflight_memory_check_tokens,process_memory_enforcer.py:_propagate_memory_limit): the enforcer now propagatesstatic,dynamic,metal_cap, andmemory_guard_tieralongside the existing hard limit. The rejection message identifies which of the three is binding and tailors the remedy:dynamicbinding, non-custom tier → "close other apps to free RAM"dynamicbinding, custom tier → "raisecustom_ceiling_bytes"metal_capbinding → "raise kerneliogpu.wired_limit_mb"staticbinding → "reduce context length or raisememory_guard_tier"Without this, the same generic "ceiling exceeded" message hides which knob actually helps.
MemoryMonitor
eviction_enabled=Falsemode (memory_monitor.py):Scheduler.__init__constructs the monitor in estimator-only mode so the preflight estimator works from the moment the engine is loaded, without callers needing to pre-callset_model_info.Test plan
tests/test_engine_preflight.py(651 lines, 22 tests) — pins the preflight contract onScheduler.preflight_or_raise,BatchedEngine.preflight_chat/completion,VLMBatchedEngine.preflight_chat, plus the four advice-ladder branches (dynamic,dynamic+custom,metal_cap,static) and the fallback when the breakdown isn't propagated.tests/test_server_prefill_memory_handler.py(195 lines) — verifies the FastAPI handler returns HTTP 413 with the diagnostic body for chat / completion / responses.tests/test_process_memory_enforcer.py— extended for the breakdown propagation (static / dynamic / metal_cap / tier).f9b9d60(defensiveset_model_info+ updated test mocks for the newMemoryMonitorinit shape).Why this is still useful after #1523
#1523 covers the recoverable case: a request that could fit if we sized chunks better, and the predictive throttle + requeue keeps it working. This PR covers the unrecoverable case: a request that simply can't fit, and the operator deserves a clean HTTP 413 telling them why and which knob to turn, instead of either a malformed 200 stream or a generic "error" finish reason. Both fixes together give a coherent memory-pressure story: throttle when you can, reject cleanly when you can't.