feat: add request cancellation and timeout support by eloe · Pull Request #20 · eloe/mlx-vlm

eloe · 2026-04-06T16:16:35Z

Summary

Handle client disconnects during streaming. --request-timeout flag (default 300s). 5 tests.

…oints Detect client disconnects during SSE streaming via request.is_disconnected() and asyncio.CancelledError, closing the token iterator to stop generation. For non-streaming requests, wrap generation in asyncio.wait_for with a configurable --request-timeout (default 300s) returning HTTP 504 on timeout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds request cancellation handling for streaming responses and introduces a configurable timeout for non-streaming generation requests (plus a couple of new request/CLI behaviors around response formatting and prompt length limits).

Changes:

Detects client disconnects during SSE streaming for /responses and /chat/completions and aborts token iteration.
Adds non-streaming request timeout support (--request-timeout, default 300s) and tests for 504 responses on timeout.
Introduces response_format handling (JSON-only instruction injection) and a --max-context-tokens prompt-length enforcement.
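
To illustrate the first bullet, here is a minimal sketch of per-token disconnect polling. It is not the code from `mlx_vlm/server.py`: `FakeRequest`, `stream_tokens`, and `collect` are hypothetical names, and the fake request only imitates the `is_disconnected()` coroutine that the PR description mentions. A real token iterator would block, so a server would likely pull tokens off the event loop thread; the sketch ignores that.

```python
import asyncio
from typing import AsyncIterator, Iterator, List


class FakeRequest:
    """Hypothetical stand-in for a request object; disconnects after N polls."""

    def __init__(self, disconnect_after: int) -> None:
        self._polls_left = disconnect_after

    async def is_disconnected(self) -> bool:
        self._polls_left -= 1
        return self._polls_left < 0


async def stream_tokens(request, token_iter: Iterator[str]) -> AsyncIterator[str]:
    """Yield SSE frames until tokens run out or the client goes away."""
    try:
        for token in token_iter:
            if await request.is_disconnected():
                break  # client is gone: stop pulling tokens
            yield f"data: {token}\n\n"
    finally:
        # Runs on normal exit, on break, and on asyncio.CancelledError.
        # Closing a generator-based iterator triggers its cleanup code.
        close = getattr(token_iter, "close", None)
        if close is not None:
            close()


async def collect(request, token_iter: Iterator[str]) -> List[str]:
    """Drain the stream into a list (helper for trying the sketch out)."""
    return [frame async for frame in stream_tokens(request, token_iter)]
```

Putting the `close()` call in `finally` means the same cleanup path covers both detection mechanisms named in the description: an explicit disconnect check and task cancellation.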

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 11 comments.

File	Description
`mlx_vlm/server.py`	Adds disconnect checks for streaming, timeout logic for non-streaming via executor + `wait_for`, and introduces `response_format` + `--max-context-tokens` enforcement.
`mlx_vlm/tests/test_server.py`	Adds tests for timeout/env-default behavior (but includes a couple of test-isolation/lint issues).


Copilot · 2026-04-06T16:32:15Z

+
+
+def get_request_timeout():
+    return int(os.environ.get("REQUEST_TIMEOUT", DEFAULT_REQUEST_TIMEOUT))


get_request_timeout() directly casts the REQUEST_TIMEOUT env var to int; if the env var is set to a non-integer value (e.g. "300s"), this will raise ValueError and crash the request handler. Consider parsing defensively (try/except) and falling back to DEFAULT_REQUEST_TIMEOUT or raising a clear HTTPException/config error.

Suggested change

-    return int(os.environ.get("REQUEST_TIMEOUT", DEFAULT_REQUEST_TIMEOUT))
+    request_timeout = os.environ.get("REQUEST_TIMEOUT")
+    if request_timeout is None:
+        return DEFAULT_REQUEST_TIMEOUT
+    try:
+        return int(request_timeout)
+    except (TypeError, ValueError):
+        return DEFAULT_REQUEST_TIMEOUT
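
A possible variant of this suggestion also treats zero and negative values as invalid, which a timeout passed to `asyncio.wait_for` arguably should be. This is a sketch only: falling back to the default (rather than raising a configuration error) is an assumption, and the constant is redefined here just to keep the snippet self-contained.

```python
import os

DEFAULT_REQUEST_TIMEOUT = 300  # seconds; the default stated in the PR description


def get_request_timeout() -> int:
    """Read REQUEST_TIMEOUT, falling back to the default on bad input."""
    raw = os.environ.get("REQUEST_TIMEOUT")
    if raw is None:
        return DEFAULT_REQUEST_TIMEOUT
    try:
        value = int(raw)
    except ValueError:
        # e.g. "300s" or an empty string
        return DEFAULT_REQUEST_TIMEOUT
    # Assumption: non-positive timeouts are treated as misconfiguration.
    return value if value > 0 else DEFAULT_REQUEST_TIMEOUT
```

Silently falling back keeps the server running, but it can hide a typo in deployment config; logging a warning or failing at startup are reasonable alternatives.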

Copilot · 2026-04-06T16:32:15Z

+                # Use generate from generate.py, with request timeout
+                timeout = get_request_timeout()
+                loop = asyncio.get_event_loop()
+                try:
+                    result = await asyncio.wait_for(
+                        loop.run_in_executor(
+                            None,
+                            lambda: generate(
+                                model=model,
+                                processor=processor,
+                                prompt=formatted_prompt,
+                                image=images,
+                                verbose=False,
+                                **generation_kwargs,
+                            ),
+                        ),
+                        timeout=timeout,
+                    )
+                except asyncio.TimeoutError:
+                    print(f"[cancellation] /responses generation timed out after {timeout}s.")
+                    mx.clear_cache()
+                    gc.collect()
+                    raise HTTPException(
+                        status_code=504,
+                        detail=f"Generation timed out after {timeout} seconds.",
+                    )


Using asyncio.wait_for() around loop.run_in_executor() will time out the awaiting task, but it will not stop the underlying generate() call running in the thread. That means timed-out requests may continue consuming compute, and mx.clear_cache()/gc.collect() here can run concurrently with generation, risking undefined behavior or crashes. Consider implementing cooperative cancellation/timeout inside generate(), or running generation in a separate process that can be terminated, and avoid clearing MLX caches until the generation task has actually stopped.
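
The behavior described here can be reproduced with a small standalone demo. It uses only the standard library; `slow_generate` is a hypothetical stand-in for the blocking `generate()` call, and the delays are arbitrary.

```python
import asyncio
import threading
import time
from typing import Dict


def slow_generate(done: threading.Event) -> str:
    """Hypothetical stand-in for a blocking generate() call."""
    time.sleep(0.3)
    done.set()  # reached even if the awaiting coroutine already gave up
    return "finished"


async def run_with_timeout(timeout: float) -> Dict[str, bool]:
    done = threading.Event()
    loop = asyncio.get_running_loop()
    timed_out = False
    try:
        await asyncio.wait_for(
            loop.run_in_executor(None, slow_generate, done), timeout=timeout
        )
    except asyncio.TimeoutError:
        timed_out = True
    # At this point the coroutine has moved on, but the thread has not.
    still_running_at_timeout = not done.is_set()
    # Wait for the worker in another executor thread so the loop is not blocked.
    finished_later = await loop.run_in_executor(None, done.wait, 2.0)
    return {
        "timed_out": timed_out,
        "still_running_at_timeout": still_running_at_timeout,
        "finished_later": finished_later,
    }
```

Calling `asyncio.run(run_with_timeout(0.05))` reports a timeout while the worker thread is still running, and the worker then completes on its own. Any cleanup executed in the `except` branch therefore overlaps with the work it is meant to clean up after.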

Copilot · 2026-04-06T16:32:15Z

-                )
+                # Use generate from generate.py, with request timeout
+                timeout = get_request_timeout()
+                loop = asyncio.get_event_loop()


asyncio.get_event_loop() inside an async endpoint is deprecated-style usage on modern Python; prefer asyncio.get_running_loop() to reliably retrieve the active loop and avoid warnings/behavior changes across Python versions.

Suggested change

-                loop = asyncio.get_event_loop()
+                loop = asyncio.get_running_loop()
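
A short illustration of why `get_running_loop()` is the safer call: it either returns the loop that is currently executing the coroutine or raises `RuntimeError`, so it never creates or guesses a loop. The helper names `inside` and `outside` are made up for this sketch.

```python
import asyncio


async def inside() -> bool:
    """Within a coroutine, get_running_loop() returns the active loop."""
    loop = asyncio.get_running_loop()
    return loop.is_running()


def outside() -> bool:
    """Outside any running loop, get_running_loop() raises RuntimeError."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True
```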

Copilot · 2026-04-06T16:32:16Z

+                # Use generate from generate.py, with request timeout
+                timeout = get_request_timeout()
+                loop = asyncio.get_event_loop()
+                try:
+                    gen_result = await asyncio.wait_for(
+                        loop.run_in_executor(
+                            None,
+                            lambda: generate(
+                                model=model,
+                                processor=processor,
+                                prompt=formatted_prompt,
+                                image=images,
+                                audio=audio,
+                                verbose=False,
+                                vision_cache=model_cache.get("vision_cache"),
+                                **generation_kwargs,
+                            ),
+                        ),
+                        timeout=timeout,
+                    )
+                except asyncio.TimeoutError:
+                    print(f"[cancellation] /chat/completions generation timed out after {timeout}s.")
+                    mx.clear_cache()
+                    gc.collect()
+                    raise HTTPException(
+                        status_code=504,
+                        detail=f"Generation timed out after {timeout} seconds.",
+                    )


Same issue as /responses: wait_for() around run_in_executor() won't stop the background generate() thread on timeout, and clearing MX caches while that thread may still be running is unsafe. Consider cooperative cancellation in generate() or isolating generation in a killable worker (e.g., separate process), and only clear caches after the generation has actually ended.
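
One way the cooperative-cancellation idea could look is sketched below. It assumes generation can be driven token by token and can check a flag between tokens; whether `generate()` in this repository can be restructured that way is not established here. `slow_tokens`, `generate_cooperatively`, and `generate_with_timeout` are hypothetical names.

```python
import asyncio
import threading
import time
from typing import Iterable, Iterator, List, Tuple


def slow_tokens(count: int, delay: float) -> Iterator[str]:
    """Hypothetical stand-in for a model's token iterator."""
    for i in range(count):
        time.sleep(delay)
        yield f"t{i}"


def generate_cooperatively(tokens: Iterable[str], stop: threading.Event) -> List[str]:
    """Blocking loop that checks a stop flag between tokens."""
    out: List[str] = []
    for token in tokens:
        if stop.is_set():
            break
        out.append(token)
    return out


async def generate_with_timeout(
    tokens: Iterable[str], timeout: float
) -> Tuple[List[str], bool]:
    """Return (tokens, timed_out); on timeout, wait for the worker to stop."""
    stop = threading.Event()
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(None, generate_cooperatively, tokens, stop)
    try:
        # shield() keeps the executor future usable after wait_for gives up.
        result = await asyncio.wait_for(asyncio.shield(future), timeout=timeout)
        return result, False
    except asyncio.TimeoutError:
        stop.set()              # ask the worker to stop at the next token
        partial = await future  # wait until it has actually stopped
        # Only here would cache cleanup and the HTTP 504 be safe to issue.
        return partial, True
```

The trade-off is latency: the 504 is delayed until the worker notices the flag, which is bounded by the time needed to produce one token.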

Copilot · 2026-04-06T16:32:16Z

-                )
+                # Use generate from generate.py, with request timeout
+                timeout = get_request_timeout()
+                loop = asyncio.get_event_loop()


asyncio.get_event_loop() inside this async endpoint should be replaced with asyncio.get_running_loop() to avoid deprecated behavior and ensure the currently running loop is used.

Suggested change

-                loop = asyncio.get_event_loop()
+                loop = asyncio.get_running_loop()

Copilot · 2026-04-06T16:32:16Z

+def test_get_request_timeout_default():
+    """Default timeout should be 300 seconds when env var is unset."""
+    os.environ.pop("REQUEST_TIMEOUT", None)
+    assert server.get_request_timeout() == 300
+
+
+def test_get_request_timeout_from_env():
+    """REQUEST_TIMEOUT env var should override the default."""
+    os.environ["REQUEST_TIMEOUT"] = "60"
+    try:
+        assert server.get_request_timeout() == 60
+    finally:
+        os.environ.pop("REQUEST_TIMEOUT", None)


This test forces REQUEST_TIMEOUT to "60" and then unsets it, which does not restore a pre-existing value if one was set before the test run. Use monkeypatch (auto-restores) or capture/restore the previous env var value to keep tests isolated.

Suggested change

-def test_get_request_timeout_default():
-    """Default timeout should be 300 seconds when env var is unset."""
-    os.environ.pop("REQUEST_TIMEOUT", None)
-    assert server.get_request_timeout() == 300
-
-
-def test_get_request_timeout_from_env():
-    """REQUEST_TIMEOUT env var should override the default."""
-    os.environ["REQUEST_TIMEOUT"] = "60"
-    try:
-        assert server.get_request_timeout() == 60
-    finally:
-        os.environ.pop("REQUEST_TIMEOUT", None)
+def test_get_request_timeout_default(monkeypatch):
+    """Default timeout should be 300 seconds when env var is unset."""
+    monkeypatch.delenv("REQUEST_TIMEOUT", raising=False)
+    assert server.get_request_timeout() == 300
+
+
+def test_get_request_timeout_from_env(monkeypatch):
+    """REQUEST_TIMEOUT env var should override the default."""
+    monkeypatch.setenv("REQUEST_TIMEOUT", "60")
+    assert server.get_request_timeout() == 60
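
The comment also mentions capturing and restoring the previous value as an alternative to `monkeypatch`. A minimal sketch of that approach, using only the standard library, could look like this; `temporary_env` is a hypothetical helper, not something in the repository.

```python
import os
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def temporary_env(name: str, value: Optional[str]) -> Iterator[None]:
    """Set (or unset, if value is None) an env var and restore it afterwards."""
    previous = os.environ.get(name)
    try:
        if value is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = value
        yield
    finally:
        # Restore whatever was there before, including "not set".
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

In a pytest suite `monkeypatch` is usually the simpler choice; a helper like this is mainly useful where the fixture is unavailable.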

Copilot · 2026-04-06T16:32:17Z

+    import asyncio
+


Unused import: asyncio is imported here but never used. This will be flagged by the repo's pre-commit autoflake hook (remove-all-unused-imports). Remove the import to keep the test file clean.

Suggested change

-    import asyncio

Copilot · 2026-04-06T16:32:17Z

+    parser.add_argument(
+        "--max-context-tokens",
+        type=int,
+        default=0,
+        help="Maximum context window in tokens. Requests exceeding this are rejected. "
+        "0 means no limit. (default: %(default)s)",
+    )
+    parser.add_argument(
+        "--request-timeout",
+        type=int,
+        default=DEFAULT_REQUEST_TIMEOUT,
+        help="Maximum seconds for a single non-streaming generation request. "
+        "Streaming requests use client disconnect detection instead. "
+        "(default: %(default)s)",
+    )


PR description mentions request cancellation + --request-timeout, but this PR also introduces response_format handling and a new --max-context-tokens flag/context-length enforcement. Please update the PR description to include these additional user-facing features (or split them into a separate PR if they’re out of scope).

Copilot · 2026-04-06T16:32:17Z

+def check_context_length(prompt: str, processor, max_context: int) -> None:
+    """Raise HTTP 400 if the tokenized prompt exceeds *max_context* tokens."""
+    if max_context <= 0:
+        return
+    tokenizer = processor.tokenizer if hasattr(processor, "tokenizer") else processor
+    token_count = len(tokenizer.encode(prompt, add_special_tokens=False))
+    if token_count > max_context:
+        raise HTTPException(
+            status_code=400,
+            detail=f"Prompt length ({token_count} tokens) exceeds maximum context "
+            f"window ({max_context} tokens). Reduce your prompt or increase --max-context-tokens.",
+        )


check_context_length() introduces new request-rejection behavior (HTTP 400 when prompt exceeds MAX_CONTEXT_TOKENS), but there are no tests covering this path. Add tests that set MAX_CONTEXT_TOKENS/--max-context-tokens and verify both /responses and /chat/completions reject over-limit prompts with the expected error detail.
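
As a starting point for such tests, the function can be exercised directly with a fake tokenizer before wiring up endpoint-level tests. The sketch below copies `check_context_length` from the diff above so it runs on its own; `HTTPException` here is a local stand-in for the framework's exception, and `WhitespaceTokenizer` and `rejection_status` are hypothetical test helpers.

```python
from typing import List, Optional


class HTTPException(Exception):
    """Local stand-in for the web framework's HTTPException."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


class WhitespaceTokenizer:
    """Fake tokenizer: one token per whitespace-separated word."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[str]:
        return text.split()


def check_context_length(prompt: str, processor, max_context: int) -> None:
    """Copied from the diff above so the sketch is self-contained."""
    if max_context <= 0:
        return
    tokenizer = processor.tokenizer if hasattr(processor, "tokenizer") else processor
    token_count = len(tokenizer.encode(prompt, add_special_tokens=False))
    if token_count > max_context:
        raise HTTPException(
            status_code=400,
            detail=f"Prompt length ({token_count} tokens) exceeds maximum context "
            f"window ({max_context} tokens). Reduce your prompt or increase --max-context-tokens.",
        )


def rejection_status(prompt: str, max_context: int) -> Optional[int]:
    """Return the HTTP status raised for this prompt, or None if accepted."""
    try:
        check_context_length(prompt, WhitespaceTokenizer(), max_context)
    except HTTPException as exc:
        return exc.status_code
    return None
```

This covers the over-limit, at-limit, and unlimited (`0`) cases at the unit level; the endpoint tests requested in the comment would still need the server's test client.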

Copilot · 2026-04-06T16:32:17Z

+            "You must respond with valid JSON only. "
+            "Do not include any text outside the JSON object."
+        )
+        messages.insert(0, {"role": "system", "content": json_instruction})


resolve_response_format() changes request semantics by injecting a system JSON-only instruction when response_format.type == "json_object", but there are no tests asserting the template receives the injected system message (and that it’s only injected once). Add unit/integration tests for both endpoints to lock in this behavior.

Suggested change

-        messages.insert(0, {"role": "system", "content": json_instruction})
+        has_json_instruction = any(
+            message.get("role") == "system"
+            and message.get("content") == json_instruction
+            for message in messages
+        )
+        if not has_json_instruction:
+            messages.insert(0, {"role": "system", "content": json_instruction})
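
The "injected only once" property can be checked in isolation with a small helper. The signature of `resolve_response_format()` is not shown in this review, so `inject_json_instruction` below is a hypothetical function that wraps just the guarded insert from the suggestion.

```python
from typing import Dict, List

JSON_INSTRUCTION = (
    "You must respond with valid JSON only. "
    "Do not include any text outside the JSON object."
)


def inject_json_instruction(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Prepend the JSON-only system message unless it is already present."""
    has_json_instruction = any(
        message.get("role") == "system"
        and message.get("content") == JSON_INSTRUCTION
        for message in messages
    )
    if not has_json_instruction:
        messages.insert(0, {"role": "system", "content": JSON_INSTRUCTION})
    return messages


def count_json_instructions(messages: List[Dict[str, str]]) -> int:
    """Count how many copies of the instruction a conversation contains."""
    return sum(1 for m in messages if m.get("content") == JSON_INSTRUCTION)
```

Calling the helper twice on the same list should leave exactly one system instruction in front of the original messages.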

- Validate REQUEST_TIMEOUT env var (catch ValueError, reject <= 0)
- Validate MAX_CONTEXT_TOKENS >= 0
- Use asyncio.get_running_loop() instead of get_event_loop()
- Add comments noting wait_for cannot cancel sync thread
- Use monkeypatch for env var tests
- Remove unused os import from tests
- Add context length rejection tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eloe requested a review from Copilot April 6, 2026 16:27

Copilot started reviewing on behalf of eloe April 6, 2026 16:28

Copilot AI reviewed Apr 6, 2026


eloe mentioned this pull request Apr 8, 2026

Combined server enhancements: OpenAI API compliance, prompt caching, concurrency #21

Merged

6 tasks

eloe wants to merge 2 commits into main from feature/request-cancellation


2 participants


