feat: add JSON mode via response_format parameter by eloe · Pull Request #18 · eloe/mlx-vlm

eloe · 2026-04-06T16:16:31Z

Summary\nAccept response_format: {type: json_object}. Injects JSON system instruction. 5 tests.

Accept OpenAI's response_format: {"type": "json_object"} on both /chat/completions and /responses endpoints. When json_object is requested, a system instruction is prepended telling the model to respond with valid JSON only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds OpenAI-compatible JSON mode support via a response_format request field, implemented by injecting a system instruction to produce JSON-only output. The PR also introduces prompt context-length rejection and new timeout/cancellation logic around generation.

Changes:

Add response_format support ({"type":"json_object"}) and inject a JSON-only system instruction.
Add max-context token limiting (MAX_CONTEXT_TOKENS + --max-context-tokens) with request rejection when exceeded.
Add non-streaming generation timeout handling and streaming client-disconnect handling; add tests covering JSON mode and context limiting.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File	Description
`mlx_vlm/server.py`	Implements `response_format` handling, context-length checks, timeout/cancellation logic, and CLI/env wiring.
`mlx_vlm/tests/test_server.py`	Adds unit/integration tests for context-length utilities and JSON mode behavior on both endpoints.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-06T16:32:31Z

+            "You must respond with valid JSON only. "
+            "Do not include any text outside the JSON object."
+        )
+        messages.insert(0, {"role": "system", "content": json_instruction})


resolve_response_format mutates the input messages list in-place via insert(0, ...), which can create surprising side effects for callers that reuse the list. Consider returning a new list (e.g., prepend to a copy) instead of modifying the argument.

Suggested change

messages.insert(0, {"role": "system", "content": json_instruction})

return [{"role": "system", "content": json_instruction}, *messages]

Copilot · 2026-04-06T16:32:31Z

+    if not response_format:
+        return messages
+    fmt_type = response_format.get("type", "text")
+    if fmt_type == "json_object":
+        json_instruction = (


Unsupported response_format values are silently treated as text (no-op). For an OpenAI-compatible API, it would be safer to validate response_format['type'] and return an HTTP 400 (or 422) for unknown types to avoid clients thinking JSON mode is enabled when it isn’t.

Copilot · 2026-04-06T16:32:31Z

+                    result = await asyncio.wait_for(
+                        loop.run_in_executor(
+                            None,
+                            lambda: generate(
+                                model=model,
+                                processor=processor,
+                                prompt=formatted_prompt,
+                                image=images,
+                                verbose=False,
+                                **generation_kwargs,
+                            ),
+                        ),
+                        timeout=timeout,
+                    )
+                except asyncio.TimeoutError:
+                    print(f"[cancellation] /responses generation timed out after {timeout}s.")
+                    mx.clear_cache()
+                    gc.collect()
+                    raise HTTPException(


Timing out run_in_executor(...generate...) with asyncio.wait_for does not cancel the underlying thread; generation will continue running in the background after returning 504. Clearing MLX cache/GC immediately on timeout can also race with that still-running generation thread. Consider a cooperative cancellation mechanism inside generate, or isolate generation in a process that can be terminated, or avoid cache-clearing until the worker completes.

Copilot · 2026-04-06T16:32:32Z

+                try:
+                    gen_result = await asyncio.wait_for(
+                        loop.run_in_executor(
+                            None,
+                            lambda: generate(
+                                model=model,
+                                processor=processor,
+                                prompt=formatted_prompt,
+                                image=images,
+                                audio=audio,
+                                verbose=False,
+                                vision_cache=model_cache.get("vision_cache"),
+                                **generation_kwargs,
+                            ),
+                        ),
+                        timeout=timeout,
+                    )
+                except asyncio.TimeoutError:
+                    print(f"[cancellation] /chat/completions generation timed out after {timeout}s.")
+                    mx.clear_cache()
+                    gc.collect()


Timing out run_in_executor(...generate...) with asyncio.wait_for does not cancel the underlying thread; generation will continue running in the background after returning 504. Clearing MLX cache/GC immediately on timeout can also race with that still-running generation thread. Consider a cooperative cancellation mechanism inside generate, or isolate generation in a process that can be terminated, or avoid cache-clearing until the worker completes.

Suggested change

try:

gen_result = await asyncio.wait_for(

loop.run_in_executor(

None,

lambda: generate(

model=model,

processor=processor,

prompt=formatted_prompt,

image=images,

audio=audio,

verbose=False,

vision_cache=model_cache.get("vision_cache"),

**generation_kwargs,

),

),

timeout=timeout,

)

except asyncio.TimeoutError:

print(f"[cancellation] /chat/completions generation timed out after {timeout}s.")

mx.clear_cache()

gc.collect()

generation_future = loop.run_in_executor(

None,

lambda: generate(

model=model,

processor=processor,

prompt=formatted_prompt,

image=images,

audio=audio,

verbose=False,

vision_cache=model_cache.get("vision_cache"),

**generation_kwargs,

),

)

try:

gen_result = await asyncio.wait_for(

asyncio.shield(generation_future),

timeout=timeout,

)

except asyncio.TimeoutError:

print(f"[cancellation] /chat/completions generation timed out after {timeout}s.")

def _cleanup_after_generation(_future):

try:

mx.clear_cache()

gc.collect()

print("Background generation finished after timeout, cleared cache.")

except Exception as cleanup_exc:

print(

"Error while clearing cache after timed out generation: "

f"{cleanup_exc}"

)

generation_future.add_done_callback(_cleanup_after_generation)

Copilot · 2026-04-06T16:32:33Z

 @app.post("/v1/chat/completions", response_model=None, include_in_schema=False)
-async def chat_completions_endpoint(request: ChatRequest):
+async def chat_completions_endpoint(request: ChatRequest, raw_request: Request):
    """


The endpoint signature now annotates raw_request: Request, but Request is not imported in this module. This will raise a NameError when evaluating the function signature. Import Request from FastAPI/Starlette (e.g., from fastapi import Request).

Copilot · 2026-04-06T16:32:33Z

@@ -845,6 +890,7 @@ def run_openai(prompt, img_url,system, stream=False, max_output_tokens=512, mode
            num_images=len(images),
            **template_kwargs,
        )
+        check_context_length(formatted_prompt, processor, get_max_context_tokens())


PR description focuses on response_format/JSON mode, but this change set also adds request timeouts, client-disconnect handling, and a new MAX_CONTEXT_TOKENS limit/CLI flag. Consider updating the PR description/title to reflect the additional behavior, or splitting these concerns into separate PRs for easier review/revert.

Copilot · 2026-04-06T16:32:33Z

+                # Use generate from generate.py, with request timeout
+                timeout = get_request_timeout()
+                loop = asyncio.get_event_loop()
+                try:


This code uses get_request_timeout() and asyncio, but neither is defined/imported in server.py (repo-wide search also finds no get_request_timeout definition). As written, non-streaming /responses requests will crash with NameError. Add the missing imports/implementation, and prefer asyncio.get_running_loop() over get_event_loop() inside async endpoints.

Copilot · 2026-04-06T16:32:33Z

+                # Use generate from generate.py, with request timeout
+                timeout = get_request_timeout()
+                loop = asyncio.get_event_loop()
+                try:


This code uses get_request_timeout() and asyncio, but neither is defined/imported in server.py (repo-wide search also finds no get_request_timeout definition). As written, non-streaming /chat/completions requests will crash with NameError. Add the missing imports/implementation, and prefer asyncio.get_running_loop() over get_event_loop() inside async endpoints.

- Don't mutate messages list in-place (return new list) - Validate unsupported response_format types → 400 - Add missing asyncio, Request imports - Add missing get_request_timeout/DEFAULT_REQUEST_TIMEOUT - Use asyncio.get_running_loop() instead of get_event_loop() - Add thread-cancel caveat comments on wait_for - Use monkeypatch for env var tests - Add test for unsupported response_format type Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eloe requested a review from Copilot April 6, 2026 16:28

Copilot started reviewing on behalf of eloe April 6, 2026 16:28 View session

Copilot AI reviewed Apr 6, 2026

View reviewed changes

eloe mentioned this pull request Apr 8, 2026

Combined server enhancements: OpenAI API compliance, prompt caching, concurrency #21

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add JSON mode via response_format parameter#18

feat: add JSON mode via response_format parameter#18
eloe wants to merge 2 commits into
mainfrom
feature/json-mode

eloe commented Apr 6, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	messages.insert(0, {"role": "system", "content": json_instruction})
	return [{"role": "system", "content": json_instruction}, *messages]

Conversation

eloe commented Apr 6, 2026

Summary\nAccept response_format: {type: json_object}. Injects JSON system instruction. 5 tests.

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants