feat: add --max-context-tokens for OOM prevention#19
Conversation
Track prompt token count and reject requests exceeding the configured context window before they reach the GPU, preventing OOM crashes. New flag: --max-context-tokens N (default: 0 = unlimited) Env: MAX_CONTEXT_TOKENS 5 tests: within-limit, exceeds-limit, zero-unlimited, default, env var. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds an optional prompt context-length guard to the FastAPI server to proactively reject overly-long requests (HTTP 400) instead of risking an OOM crash. It introduces a --max-context-tokens CLI flag (backed by MAX_CONTEXT_TOKENS) and basic unit tests for the new helpers.
Changes:
- Add
get_max_context_tokens()andcheck_context_length()utilities to compute tokenized prompt length and reject requests beyond a configured limit. - Wire the context-length check into both
/responsesand/chat/completionsrequest flows after chat-template formatting. - Add tests covering default/env configuration and helper behavior for within/over-limit cases.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
mlx_vlm/server.py |
Adds max-context configuration + prompt token counting guard; integrates guard into request handlers; adds CLI flag/env wiring. |
mlx_vlm/tests/test_server.py |
Adds unit tests for max-context env parsing and context-length validation helper. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| parser.add_argument( | ||
| "--max-context-tokens", | ||
| type=int, |
There was a problem hiding this comment.
The new --max-context-tokens flag accepts negative values; those end up setting MAX_CONTEXT_TOKENS to a negative number, and check_context_length then treats it as “no limit” (max_context <= 0). Consider validating/clamping this argument to be non-negative (e.g., argparse custom type that raises on negatives, or enforce >= 0 after parsing) so misconfiguration doesn’t silently disable the protection.
| parser.add_argument( | |
| "--max-context-tokens", | |
| type=int, | |
| def non_negative_int(value): | |
| try: | |
| parsed_value = int(value) | |
| except ValueError as exc: | |
| raise argparse.ArgumentTypeError( | |
| f"invalid int value: {value!r}" | |
| ) from exc | |
| if parsed_value < 0: | |
| raise argparse.ArgumentTypeError( | |
| "--max-context-tokens must be greater than or equal to 0" | |
| ) | |
| return parsed_value | |
| parser.add_argument( | |
| "--max-context-tokens", | |
| type=non_negative_int, |
| def test_get_max_context_tokens_default(): | ||
| import os | ||
| os.environ.pop("MAX_CONTEXT_TOKENS", None) | ||
| assert server.get_max_context_tokens() == 0 | ||
|
|
||
|
|
||
| def test_get_max_context_tokens_from_env(): | ||
| import os | ||
| os.environ["MAX_CONTEXT_TOKENS"] = "16384" | ||
| assert server.get_max_context_tokens() == 16384 | ||
| os.environ.pop("MAX_CONTEXT_TOKENS") |
There was a problem hiding this comment.
These tests mutate os.environ (pop/set MAX_CONTEXT_TOKENS) without restoring any pre-existing value. This can make the suite order-/environment-dependent when MAX_CONTEXT_TOKENS is set in the runner. Use pytest’s monkeypatch fixture (or save/restore the prior value) so environment changes are isolated to the test.
| def test_get_max_context_tokens_default(): | |
| import os | |
| os.environ.pop("MAX_CONTEXT_TOKENS", None) | |
| assert server.get_max_context_tokens() == 0 | |
| def test_get_max_context_tokens_from_env(): | |
| import os | |
| os.environ["MAX_CONTEXT_TOKENS"] = "16384" | |
| assert server.get_max_context_tokens() == 16384 | |
| os.environ.pop("MAX_CONTEXT_TOKENS") | |
| def test_get_max_context_tokens_default(monkeypatch): | |
| monkeypatch.delenv("MAX_CONTEXT_TOKENS", raising=False) | |
| assert server.get_max_context_tokens() == 0 | |
| def test_get_max_context_tokens_from_env(monkeypatch): | |
| monkeypatch.setenv("MAX_CONTEXT_TOKENS", "16384") | |
| assert server.get_max_context_tokens() == 16384 |
- Reject negative --max-context-tokens (validate >= 0) - Use pytest monkeypatch for env var tests instead of os.environ Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary\nReject requests exceeding context window with 400 instead of OOM crash. --max-context-tokens flag. 5 tests.