feat(server): typed pre-StreamingResponse preflight + HTTP 413 by cfbraun · Pull Request #1452 · jundot/omlx

cfbraun · 2026-05-27T07:32:59Z

Summary

For requests whose estimated peak exceeds the configured ceiling — i.e. requests that genuinely can't fit at any chunk size — run a typed memory check before the route handler wraps the response in a StreamingResponse, and map the structured rejection to HTTP 413 with a diagnostic message that names the binding ceiling and suggests the right remedy.

Strictly complementary to #1523 (cb0904d), not a competing solution:

Failure mode	Owner
Request could fit but transient SDPA/MoE spikes blow the cap	#1523 — predictive throttle + reclaim + requeue
Request can't possibly fit (e.g. 200k context against a 40 GB ceiling)	this PR — preflight rejection mapped to HTTP 413
Telling the user which of static / dynamic / metal_cap / custom is the binding ceiling	this PR — breakdown propagation + tailored advice ladder

Without this PR, an unfittable request either (a) tries to start streaming and the route handler raises after http.response.start has already gone out, locking the client to a malformed HTTP 200 with truncated SSE, or (b) gets requeued repeatedly by #1523 and eventually fails with a generic finish_reason="error" — neither of which tells the operator which knob to turn.

The problem with raising after the stream opens

FastAPI emits http.response.start the moment the route handler returns a StreamingResponse. Anything that raises inside the body iterator runs after the status code is on the wire, so the registered PrefillMemoryExceededError → HTTP 413 handler is bypassed. The client sees HTTP 200, an empty/truncated SSE stream, and no actionable error.

What's in this PR

Typed _PreflightRejection (scheduler.py) carrying message, estimated_bytes, and limit_bytes. Replaces the prior pattern of returning a bare string that callers had to parse.
preflight_or_raise(num_prompt_tokens, cached_tokens=0) on the scheduler: runs the memory check from token counts (no Request needed), raises PrefillMemoryExceededError on rejection. Callable from the API server layer immediately after tokenization.
preflight_chat / preflight_completion on BatchedEngine and VLMBatchedEngine: tokenize the inputs the same way the streaming path does, then call preflight_or_raise on the underlying scheduler. The VLM helper covers both the standard and turboquant entry points.
HTTP 413 mapping (server.py): chat/completion/responses endpoints invoke the engine's preflight before constructing the StreamingResponse. PrefillMemoryExceededError is registered with FastAPI to return 413 with the diagnostic body. The existing in-stream re-check inside _schedule_waiting is kept as defense-in-depth for callers that bypass the server preflight (direct engine API, future endpoints).
Component-ceiling breakdown + advice ladder (scheduler.py:_preflight_memory_check_tokens, process_memory_enforcer.py:_propagate_memory_limit): the enforcer now propagates static, dynamic, metal_cap, and memory_guard_tier alongside the existing hard limit. The rejection message identifies which of the three is binding and tailors the remedy:
- dynamic binding, non-custom tier → "close other apps to free RAM"
- dynamic binding, custom tier → "raise custom_ceiling_bytes"
- metal_cap binding → "raise kernel iogpu.wired_limit_mb"
- static binding → "reduce context length or raise memory_guard_tier"
Without this, the same generic "ceiling exceeded" message hides which knob actually helps.
MemoryMonitor eviction_enabled=False mode (memory_monitor.py): Scheduler.__init__ constructs the monitor in estimator-only mode so the preflight estimator works from the moment the engine is loaded, without callers needing to pre-call set_model_info.

Test plan

tests/test_engine_preflight.py (651 lines, 22 tests) — pins the preflight contract on Scheduler.preflight_or_raise, BatchedEngine.preflight_chat/completion, VLMBatchedEngine.preflight_chat, plus the four advice-ladder branches (dynamic, dynamic+custom, metal_cap, static) and the fallback when the breakdown isn't propagated.
tests/test_server_prefill_memory_handler.py (195 lines) — verifies the FastAPI handler returns HTTP 413 with the diagnostic body for chat / completion / responses.
tests/test_process_memory_enforcer.py — extended for the breakdown propagation (static / dynamic / metal_cap / tier).
CI — green on Python 3.11 / 3.12 / 3.13 after the fixup in f9b9d60 (defensive set_model_info + updated test mocks for the new MemoryMonitor init shape).
Manual — confirmed on a real bundle that requests >40 GB peak return HTTP 413 with a message identifying which of static / dynamic / metal_cap / custom is binding.

Why this is still useful after #1523

#1523 covers the recoverable case: a request that could fit if we sized chunks better, and the predictive throttle + requeue keeps it working. This PR covers the unrecoverable case: a request that simply can't fit, and the operator deserves a clean HTTP 413 telling them why and which knob to turn, instead of either a malformed 200 stream or a generic "error" finish reason. Both fixes together give a coherent memory-pressure story: throttle when you can, reject cleanly when you can't.

Adds a pre-admission prefill-memory check that runs BEFORE the FastAPI route wraps the response in a ``StreamingResponse`` and raises a typed ``PrefillMemoryExceededError`` that the registered exception handler maps to HTTP 413 with a structured JSON body carrying ``estimated_bytes`` / ``limit_bytes``. The existing ``_preflight_memory_check`` runs *inside* ``_schedule_waiting`` — after Starlette has already emitted ``http.response.start`` (status 200) for the streaming response. Rejection there surfaces as an in-stream ``RequestOutput( finish_reason="error")`` SSE event, so OpenAI-SDK and Anthropic-SDK clients see HTTP 200 and parse an error mid-stream. That's confusing for clients that don't read SSE bodies on error, and there's no clean way to bubble it to the caller's exception handlers. This PR adds an *earlier* gate: 1. Engine grows ``preflight_chat`` / ``preflight_completion`` async methods. ``BatchedEngine`` and ``VLMBatchedEngine`` override the base no-op: tokenize the templated prompt (text-only template for VLMs, plus a conservative per-image token-budget so we never under-count), then call the scheduler. 2. Scheduler grows ``preflight_or_raise`` (raises ``PrefillMemoryExceededError`` on overshoot) and ``_preflight_memory_check_tokens`` (token-count form returning a structured ``_PreflightRejection`` so the typed exception can carry the numeric estimated / limit bytes). 3. Server grows ``prefill_memory_exceeded_handler`` (maps the typed exception to HTTP 413 + OpenAI-shaped error body with ``code="prefill_memory_exceeded"``) and a 4× ``await engine.preflight_*`` call site before each ``StreamingResponse`` return (chat, completion, anthropic messages, responses API). Complementary to ``acd0533``'s adaptive prefill throttle: the throttle is mid-prefill and bounds slow drift; this is pre-admission and rejects "obviously too big from the start" before any compute runs. Also wakes up the existing dormant prefill-peak estimator by instantiating ``self.memory_monitor`` in scheduler ``__init__`` — ``Scheduler`` previously only assigned ``self.memory_monitor = None``, so ``_preflight_memory_check`` always returned None at the ``if self.memory_monitor is None: return`` guard. ``MemoryMonitor`` gains an ``eviction_enabled=False`` mode for use in paged-SSD-only schedulers where ``estimate_blocks_to_free`` / ``_check_memory_pressure`` are dormant; the constructor's ``max_kv_cache_memory`` becomes ``int | None`` to match. The existing in-stream ``_preflight_memory_check`` is kept as defense-in-depth for callers that bypass the server preflight (direct engine API, future endpoints). Its return type changes from ``str | None`` to ``_PreflightRejection | None``; the one in-stream call site is updated to use ``.message``. VLM helper additions: - ``_IMAGE_TOKEN_UPPER_BOUND_FALLBACK`` constant. - ``_derive_image_token_upper_bound(processor)``: derive the per-image token bound from the processor's ``max_pixels`` / ``patch_size`` / ``merge_size``; fall back to the constant when the processor isn't loaded or doesn't expose those. - ``_count_image_tokens(messages, per_image_upper_bound)``: walk OpenAI-style content parts and return the conservative budget. Tests: - ``tests/test_engine_preflight.py`` (~500 LOC, 19 tests): scheduler ``preflight_or_raise`` happy / reject paths; engine preflight chat / completion oversize → raise; VLM image-token budget inclusion; image-stripping before template application; Pydantic tool conversion; tokenizer-error swallow on all three preflight paths; request_id threading; scheduler-unreachable rate-limited warning. - ``tests/test_server_prefill_memory_handler.py`` (~195 LOC, 5 tests): handler returns 413 with OpenAI-shaped body for chat; surfaces estimated_bytes / limit_bytes; non-API routes get plain detail; ``/v1/responses`` endpoint reaches 413. 22 new tests + 208 existing scheduler / enforcer / memory tests pass.

User report: an HTTP 413 from this PR said "ceiling is 36.70 GB ... Reduce context length or lower memory_guard_tier" — but they had memory_guard_tier set to ``balanced`` on a 48 GB Mac with static_ceiling = 40 GB. The 36.7 GB was the *dynamic* ceiling (reclaimable memory was low right now), not the static cap they saw in the dashboard. The advice steered them toward the wrong knob: they needed to close other apps, not lower the tier. Root cause: ``_get_hard_limit_bytes()`` returns ``min(static, dynamic, metal_cap)`` but the rejection message had no way to know which of the three was binding. Every rejection collapsed to the same generic "raise tier or reduce context" advice. Fix: propagate the three component ceilings alongside the hard limit so the rejection path can name the binding one and give matching advice. Refactor ``_get_hard_limit_bytes`` into a thin wrapper around a new ``_get_ceiling_breakdown()`` that returns the four values in a single computation — important because the metal_cap probe shells out to ``sysctl``, and we don't want to double that on every poll. The enforcer reads the breakdown once and writes ``_memory_static_ceiling_bytes`` / ``_memory_dynamic_ceiling_bytes`` / ``_memory_metal_cap_bytes`` on each scheduler. The scheduler's ``_preflight_memory_check_tokens`` then identifies the binding constraint by comparing each component against ``_memory_hard_limit_bytes`` and emits a message+advice tailored to the cause: - **dynamic binding** (static cap has room but reclaimable is low): "close other apps (static cap is X but only Y is reclaimable right now), raise memory_guard_tier, or reduce context length" - **metal_cap binding** (kernel sysctl is the floor): "raise kernel iogpu.wired_limit_mb in Terminal, or reduce context length" - **static binding** (the configured reserve is the constraint): "reduce context length or raise memory_guard_tier" - **components not propagated** (test scaffolding wired hard_limit directly): falls back to the generic message; no crash. Tests (TestRejectionMessageNamesBindingCeiling, 4 new): - dynamic-binding case mirrors the user-reported scenario and checks for "close other apps" + the static-cap mention. - metal_cap binding mentions ``iogpu.wired_limit_mb``. - static binding falls through to ``memory_guard_tier``. - fallback path when the breakdown is unset still produces a usable message rather than a KeyError or empty string. ``_make_enforcer`` fixture updated to override ``_get_ceiling_breakdown`` alongside the existing ``_get_hard_limit_bytes`` override so tests with fixed ceilings don't accidentally pull the host's live ceiling math into the propagated values.

When ``memory_guard_tier=custom`` the "dynamic" ceiling that ``_get_dynamic_ceiling`` returns is the user-pinned ``custom_ceiling_bytes`` rather than computed reclaimable memory. The existing advice ladder treated dynamic-binding as "close other apps to free RAM" — wrong for custom-tier users: closing apps doesn't move their hard-pinned ceiling, raising ``custom_ceiling_bytes`` does. Propagate ``memory_guard_tier`` alongside the breakdown components and branch the advice on it. New test pins the custom-tier scenario.

The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492) surfaced three failures introduced by the rebase against upstream's cb0904d (predictive prefill throttle) and our own change to ``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only mode so preflight checks work without prior set_model_info): 1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the ``if num_layers and num_kv_heads and head_dim:`` truthiness gate doesn't reject auto-attribute mocks. The estimator path (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then blew up the upstream throttle tests with ``TypeError: '>' not supported between instances of 'MagicMock' and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so the estimator stays disabled until set_model_info is called with genuine model dims. 2. ``test_schedule_waiting_preflight_rejection_releases`` patched ``_preflight_memory_check`` to return a plain ``str`` (matching the pre-PR contract). Our PR types the return as ``_PreflightRejection`` and the rejection-path code in ``_schedule_waiting`` reads ``.message`` on it. Update the mock to return an actual ``_PreflightRejection`` so the test exercises the real shape. 3. ``test_picks_text_config_over_top_level_vision_dims`` (added by our merged jundot#1448) asserted ``sched.memory_monitor is None`` at init. Our PR now constructs the monitor in estimator-only mode at init time; replace the assertion with the actual setup it needs (overwrite ``memory_monitor`` with a MagicMock to inspect ``set_model_info`` calls).

cfbraun force-pushed the pr/http-413-preflight branch 2 times, most recently from a1a723b to 72f2389 Compare May 27, 2026 09:30

jcalvert mentioned this pull request May 27, 2026

fix(server): chunk-form SSE keepalive shares the stream's completion id #1479

Merged

jundot mentioned this pull request May 28, 2026

fix(cache): boundary snapshots dropped during prefill because uid mapping not yet populated #1471

Closed

5 tasks

cfbraun force-pushed the pr/http-413-preflight branch 7 times, most recently from 3d4e6aa to 00cd892 Compare June 1, 2026 06:12

cfbraun added 4 commits June 2, 2026 08:08

cfbraun force-pushed the pr/http-413-preflight branch from f9b9d60 to 31397a8 Compare June 2, 2026 06:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(server): typed pre-StreamingResponse preflight + HTTP 413#1452

feat(server): typed pre-StreamingResponse preflight + HTTP 413#1452
cfbraun wants to merge 4 commits into
jundot:mainfrom
cfbraun:pr/http-413-preflight

cfbraun commented May 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cfbraun commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The problem with raising after the stream opens

What's in this PR

Test plan

Why this is still useful after #1523

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cfbraun commented May 27, 2026 •

edited

Loading