Skip to content

feat(server): typed pre-StreamingResponse preflight + HTTP 413#1452

Open
cfbraun wants to merge 4 commits into
jundot:mainfrom
cfbraun:pr/http-413-preflight
Open

feat(server): typed pre-StreamingResponse preflight + HTTP 413#1452
cfbraun wants to merge 4 commits into
jundot:mainfrom
cfbraun:pr/http-413-preflight

Conversation

@cfbraun
Copy link
Copy Markdown
Contributor

@cfbraun cfbraun commented May 27, 2026

Summary

For requests whose estimated peak exceeds the configured ceiling — i.e. requests that genuinely can't fit at any chunk size — run a typed memory check before the route handler wraps the response in a StreamingResponse, and map the structured rejection to HTTP 413 with a diagnostic message that names the binding ceiling and suggests the right remedy.

Strictly complementary to #1523 (cb0904d), not a competing solution:

Failure mode Owner
Request could fit but transient SDPA/MoE spikes blow the cap #1523 — predictive throttle + reclaim + requeue
Request can't possibly fit (e.g. 200k context against a 40 GB ceiling) this PR — preflight rejection mapped to HTTP 413
Telling the user which of static / dynamic / metal_cap / custom is the binding ceiling this PR — breakdown propagation + tailored advice ladder

Without this PR, an unfittable request either (a) tries to start streaming and the route handler raises after http.response.start has already gone out, locking the client to a malformed HTTP 200 with truncated SSE, or (b) gets requeued repeatedly by #1523 and eventually fails with a generic finish_reason="error" — neither of which tells the operator which knob to turn.

The problem with raising after the stream opens

FastAPI emits http.response.start the moment the route handler returns a StreamingResponse. Anything that raises inside the body iterator runs after the status code is on the wire, so the registered PrefillMemoryExceededError → HTTP 413 handler is bypassed. The client sees HTTP 200, an empty/truncated SSE stream, and no actionable error.

What's in this PR

  1. Typed _PreflightRejection (scheduler.py) carrying message, estimated_bytes, and limit_bytes. Replaces the prior pattern of returning a bare string that callers had to parse.

  2. preflight_or_raise(num_prompt_tokens, cached_tokens=0) on the scheduler: runs the memory check from token counts (no Request needed), raises PrefillMemoryExceededError on rejection. Callable from the API server layer immediately after tokenization.

  3. preflight_chat / preflight_completion on BatchedEngine and VLMBatchedEngine: tokenize the inputs the same way the streaming path does, then call preflight_or_raise on the underlying scheduler. The VLM helper covers both the standard and turboquant entry points.

  4. HTTP 413 mapping (server.py): chat/completion/responses endpoints invoke the engine's preflight before constructing the StreamingResponse. PrefillMemoryExceededError is registered with FastAPI to return 413 with the diagnostic body. The existing in-stream re-check inside _schedule_waiting is kept as defense-in-depth for callers that bypass the server preflight (direct engine API, future endpoints).

  5. Component-ceiling breakdown + advice ladder (scheduler.py:_preflight_memory_check_tokens, process_memory_enforcer.py:_propagate_memory_limit): the enforcer now propagates static, dynamic, metal_cap, and memory_guard_tier alongside the existing hard limit. The rejection message identifies which of the three is binding and tailors the remedy:

    • dynamic binding, non-custom tier → "close other apps to free RAM"
    • dynamic binding, custom tier → "raise custom_ceiling_bytes"
    • metal_cap binding → "raise kernel iogpu.wired_limit_mb"
    • static binding → "reduce context length or raise memory_guard_tier"

    Without this, the same generic "ceiling exceeded" message hides which knob actually helps.

  6. MemoryMonitor eviction_enabled=False mode (memory_monitor.py): Scheduler.__init__ constructs the monitor in estimator-only mode so the preflight estimator works from the moment the engine is loaded, without callers needing to pre-call set_model_info.

Test plan

  • tests/test_engine_preflight.py (651 lines, 22 tests) — pins the preflight contract on Scheduler.preflight_or_raise, BatchedEngine.preflight_chat/completion, VLMBatchedEngine.preflight_chat, plus the four advice-ladder branches (dynamic, dynamic+custom, metal_cap, static) and the fallback when the breakdown isn't propagated.
  • tests/test_server_prefill_memory_handler.py (195 lines) — verifies the FastAPI handler returns HTTP 413 with the diagnostic body for chat / completion / responses.
  • tests/test_process_memory_enforcer.py — extended for the breakdown propagation (static / dynamic / metal_cap / tier).
  • CI — green on Python 3.11 / 3.12 / 3.13 after the fixup in f9b9d60 (defensive set_model_info + updated test mocks for the new MemoryMonitor init shape).
  • Manual — confirmed on a real bundle that requests >40 GB peak return HTTP 413 with a message identifying which of static / dynamic / metal_cap / custom is binding.

Why this is still useful after #1523

#1523 covers the recoverable case: a request that could fit if we sized chunks better, and the predictive throttle + requeue keeps it working. This PR covers the unrecoverable case: a request that simply can't fit, and the operator deserves a clean HTTP 413 telling them why and which knob to turn, instead of either a malformed 200 stream or a generic "error" finish reason. Both fixes together give a coherent memory-pressure story: throttle when you can, reject cleanly when you can't.

cfbraun added 4 commits June 2, 2026 08:08
Adds a pre-admission prefill-memory check that runs BEFORE the
FastAPI route wraps the response in a ``StreamingResponse`` and
raises a typed ``PrefillMemoryExceededError`` that the registered
exception handler maps to HTTP 413 with a structured JSON body
carrying ``estimated_bytes`` / ``limit_bytes``.

The existing ``_preflight_memory_check`` runs *inside*
``_schedule_waiting`` — after Starlette has already emitted
``http.response.start`` (status 200) for the streaming response.
Rejection there surfaces as an in-stream ``RequestOutput(
finish_reason="error")`` SSE event, so OpenAI-SDK and Anthropic-SDK
clients see HTTP 200 and parse an error mid-stream. That's confusing
for clients that don't read SSE bodies on error, and there's no clean
way to bubble it to the caller's exception handlers.

This PR adds an *earlier* gate:

  1. Engine grows ``preflight_chat`` / ``preflight_completion``
     async methods. ``BatchedEngine`` and ``VLMBatchedEngine``
     override the base no-op: tokenize the templated prompt
     (text-only template for VLMs, plus a conservative per-image
     token-budget so we never under-count), then call the
     scheduler.
  2. Scheduler grows ``preflight_or_raise`` (raises
     ``PrefillMemoryExceededError`` on overshoot) and
     ``_preflight_memory_check_tokens`` (token-count form returning
     a structured ``_PreflightRejection`` so the typed exception can
     carry the numeric estimated / limit bytes).
  3. Server grows ``prefill_memory_exceeded_handler`` (maps the
     typed exception to HTTP 413 + OpenAI-shaped error body with
     ``code="prefill_memory_exceeded"``) and a 4× ``await
     engine.preflight_*`` call site before each ``StreamingResponse``
     return (chat, completion, anthropic messages, responses API).

Complementary to ``acd0533``'s adaptive prefill throttle: the
throttle is mid-prefill and bounds slow drift; this is pre-admission
and rejects "obviously too big from the start" before any compute
runs.

Also wakes up the existing dormant prefill-peak estimator by
instantiating ``self.memory_monitor`` in scheduler ``__init__`` —
``Scheduler`` previously only assigned ``self.memory_monitor =
None``, so ``_preflight_memory_check`` always returned None at the
``if self.memory_monitor is None: return`` guard. ``MemoryMonitor``
gains an ``eviction_enabled=False`` mode for use in paged-SSD-only
schedulers where ``estimate_blocks_to_free`` / ``_check_memory_pressure``
are dormant; the constructor's ``max_kv_cache_memory`` becomes
``int | None`` to match.

The existing in-stream ``_preflight_memory_check`` is kept as
defense-in-depth for callers that bypass the server preflight
(direct engine API, future endpoints). Its return type changes from
``str | None`` to ``_PreflightRejection | None``; the one in-stream
call site is updated to use ``.message``.

VLM helper additions:

- ``_IMAGE_TOKEN_UPPER_BOUND_FALLBACK`` constant.
- ``_derive_image_token_upper_bound(processor)``: derive the
  per-image token bound from the processor's ``max_pixels`` /
  ``patch_size`` / ``merge_size``; fall back to the constant when
  the processor isn't loaded or doesn't expose those.
- ``_count_image_tokens(messages, per_image_upper_bound)``: walk
  OpenAI-style content parts and return the conservative budget.

Tests:

- ``tests/test_engine_preflight.py`` (~500 LOC, 19 tests): scheduler
  ``preflight_or_raise`` happy / reject paths; engine preflight
  chat / completion oversize → raise; VLM image-token budget
  inclusion; image-stripping before template application; Pydantic
  tool conversion; tokenizer-error swallow on all three preflight
  paths; request_id threading; scheduler-unreachable rate-limited
  warning.
- ``tests/test_server_prefill_memory_handler.py`` (~195 LOC, 5
  tests): handler returns 413 with OpenAI-shaped body for chat;
  surfaces estimated_bytes / limit_bytes; non-API routes get plain
  detail; ``/v1/responses`` endpoint reaches 413.

22 new tests + 208 existing scheduler / enforcer / memory tests pass.
User report: an HTTP 413 from this PR said "ceiling is 36.70 GB ...
Reduce context length or lower memory_guard_tier" — but they had
memory_guard_tier set to ``balanced`` on a 48 GB Mac with
static_ceiling = 40 GB. The 36.7 GB was the *dynamic* ceiling
(reclaimable memory was low right now), not the static cap they saw
in the dashboard. The advice steered them toward the wrong knob: they
needed to close other apps, not lower the tier.

Root cause: ``_get_hard_limit_bytes()`` returns ``min(static, dynamic,
metal_cap)`` but the rejection message had no way to know which of
the three was binding. Every rejection collapsed to the same generic
"raise tier or reduce context" advice.

Fix: propagate the three component ceilings alongside the hard limit
so the rejection path can name the binding one and give matching
advice. Refactor ``_get_hard_limit_bytes`` into a thin wrapper around
a new ``_get_ceiling_breakdown()`` that returns the four values in a
single computation — important because the metal_cap probe shells out
to ``sysctl``, and we don't want to double that on every poll. The
enforcer reads the breakdown once and writes ``_memory_static_ceiling_bytes``
/ ``_memory_dynamic_ceiling_bytes`` / ``_memory_metal_cap_bytes`` on
each scheduler.

The scheduler's ``_preflight_memory_check_tokens`` then identifies
the binding constraint by comparing each component against
``_memory_hard_limit_bytes`` and emits a message+advice tailored to
the cause:

- **dynamic binding** (static cap has room but reclaimable is low):
  "close other apps (static cap is X but only Y is reclaimable right
  now), raise memory_guard_tier, or reduce context length"
- **metal_cap binding** (kernel sysctl is the floor):
  "raise kernel iogpu.wired_limit_mb in Terminal, or reduce context
  length"
- **static binding** (the configured reserve is the constraint):
  "reduce context length or raise memory_guard_tier"
- **components not propagated** (test scaffolding wired hard_limit
  directly): falls back to the generic message; no crash.

Tests (TestRejectionMessageNamesBindingCeiling, 4 new):
- dynamic-binding case mirrors the user-reported scenario and checks
  for "close other apps" + the static-cap mention.
- metal_cap binding mentions ``iogpu.wired_limit_mb``.
- static binding falls through to ``memory_guard_tier``.
- fallback path when the breakdown is unset still produces a usable
  message rather than a KeyError or empty string.

``_make_enforcer`` fixture updated to override ``_get_ceiling_breakdown``
alongside the existing ``_get_hard_limit_bytes`` override so tests
with fixed ceilings don't accidentally pull the host's live ceiling
math into the propagated values.
When ``memory_guard_tier=custom`` the "dynamic" ceiling that
``_get_dynamic_ceiling`` returns is the user-pinned
``custom_ceiling_bytes`` rather than computed reclaimable memory. The
existing advice ladder treated dynamic-binding as "close other apps to
free RAM" — wrong for custom-tier users: closing apps doesn't move
their hard-pinned ceiling, raising ``custom_ceiling_bytes`` does.

Propagate ``memory_guard_tier`` alongside the breakdown components and
branch the advice on it. New test pins the custom-tier scenario.
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492)
surfaced three failures introduced by the rebase against upstream's
cb0904d (predictive prefill throttle) and our own change to
``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only
mode so preflight checks work without prior set_model_info):

1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies
   as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the
   ``if num_layers and num_kv_heads and head_dim:`` truthiness gate
   doesn't reject auto-attribute mocks. The estimator path
   (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then
   blew up the upstream throttle tests with
   ``TypeError: '>' not supported between instances of 'MagicMock'
   and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so
   the estimator stays disabled until set_model_info is called with
   genuine model dims.

2. ``test_schedule_waiting_preflight_rejection_releases`` patched
   ``_preflight_memory_check`` to return a plain ``str`` (matching
   the pre-PR contract). Our PR types the return as
   ``_PreflightRejection`` and the rejection-path code in
   ``_schedule_waiting`` reads ``.message`` on it. Update the mock to
   return an actual ``_PreflightRejection`` so the test exercises the
   real shape.

3. ``test_picks_text_config_over_top_level_vision_dims`` (added by
   our merged jundot#1448) asserted ``sched.memory_monitor is None`` at
   init. Our PR now constructs the monitor in estimator-only mode at
   init time; replace the assertion with the actual setup it needs
   (overwrite ``memory_monitor`` with a MagicMock to inspect
   ``set_model_info`` calls).
@cfbraun cfbraun force-pushed the pr/http-413-preflight branch from f9b9d60 to 31397a8 Compare June 2, 2026 06:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant