Skip to content

feat: add --max-context-tokens for OOM prevention#19

Open
eloe wants to merge 2 commits into
mainfrom
feature/context-tracking
Open

feat: add --max-context-tokens for OOM prevention#19
eloe wants to merge 2 commits into
mainfrom
feature/context-tracking

Conversation

@eloe
Copy link
Copy Markdown
Owner

@eloe eloe commented Apr 6, 2026

Summary\nReject requests exceeding context window with 400 instead of OOM crash. --max-context-tokens flag. 5 tests.

Track prompt token count and reject requests exceeding the configured
context window before they reach the GPU, preventing OOM crashes.

New flag: --max-context-tokens N (default: 0 = unlimited)
Env: MAX_CONTEXT_TOKENS

5 tests: within-limit, exceeds-limit, zero-unlimited, default, env var.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an optional prompt context-length guard to the FastAPI server to proactively reject overly-long requests (HTTP 400) instead of risking an OOM crash. It introduces a --max-context-tokens CLI flag (backed by MAX_CONTEXT_TOKENS) and basic unit tests for the new helpers.

Changes:

  • Add get_max_context_tokens() and check_context_length() utilities to compute tokenized prompt length and reject requests beyond a configured limit.
  • Wire the context-length check into both /responses and /chat/completions request flows after chat-template formatting.
  • Add tests covering default/env configuration and helper behavior for within/over-limit cases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
mlx_vlm/server.py Adds max-context configuration + prompt token counting guard; integrates guard into request handlers; adds CLI flag/env wiring.
mlx_vlm/tests/test_server.py Adds unit tests for max-context env parsing and context-length validation helper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mlx_vlm/server.py
Comment on lines +1457 to +1459
parser.add_argument(
"--max-context-tokens",
type=int,
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new --max-context-tokens flag accepts negative values; those end up setting MAX_CONTEXT_TOKENS to a negative number, and check_context_length then treats it as “no limit” (max_context <= 0). Consider validating/clamping this argument to be non-negative (e.g., argparse custom type that raises on negatives, or enforce >= 0 after parsing) so misconfiguration doesn’t silently disable the protection.

Suggested change
parser.add_argument(
"--max-context-tokens",
type=int,
def non_negative_int(value):
try:
parsed_value = int(value)
except ValueError as exc:
raise argparse.ArgumentTypeError(
f"invalid int value: {value!r}"
) from exc
if parsed_value < 0:
raise argparse.ArgumentTypeError(
"--max-context-tokens must be greater than or equal to 0"
)
return parsed_value
parser.add_argument(
"--max-context-tokens",
type=non_negative_int,

Copilot uses AI. Check for mistakes.
Comment thread mlx_vlm/tests/test_server.py Outdated
Comment on lines +161 to +171
def test_get_max_context_tokens_default():
import os
os.environ.pop("MAX_CONTEXT_TOKENS", None)
assert server.get_max_context_tokens() == 0


def test_get_max_context_tokens_from_env():
import os
os.environ["MAX_CONTEXT_TOKENS"] = "16384"
assert server.get_max_context_tokens() == 16384
os.environ.pop("MAX_CONTEXT_TOKENS")
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests mutate os.environ (pop/set MAX_CONTEXT_TOKENS) without restoring any pre-existing value. This can make the suite order-/environment-dependent when MAX_CONTEXT_TOKENS is set in the runner. Use pytest’s monkeypatch fixture (or save/restore the prior value) so environment changes are isolated to the test.

Suggested change
def test_get_max_context_tokens_default():
import os
os.environ.pop("MAX_CONTEXT_TOKENS", None)
assert server.get_max_context_tokens() == 0
def test_get_max_context_tokens_from_env():
import os
os.environ["MAX_CONTEXT_TOKENS"] = "16384"
assert server.get_max_context_tokens() == 16384
os.environ.pop("MAX_CONTEXT_TOKENS")
def test_get_max_context_tokens_default(monkeypatch):
monkeypatch.delenv("MAX_CONTEXT_TOKENS", raising=False)
assert server.get_max_context_tokens() == 0
def test_get_max_context_tokens_from_env(monkeypatch):
monkeypatch.setenv("MAX_CONTEXT_TOKENS", "16384")
assert server.get_max_context_tokens() == 16384

Copilot uses AI. Check for mistakes.
- Reject negative --max-context-tokens (validate >= 0)
- Use pytest monkeypatch for env var tests instead of os.environ

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants