feat: add JSON mode via response_format parameter #18

Open
eloe wants to merge 2 commits into main from feature/json-mode

Conversation

eloe (Owner) commented Apr 6, 2026

Summary

Accept response_format: {type: json_object}. Injects a JSON system instruction. 5 tests.

Accept OpenAI's response_format: {"type": "json_object"} on both
/chat/completions and /responses endpoints. When json_object is
requested, a system instruction is prepended telling the model to
respond with valid JSON only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
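
For illustration, a request that uses this field might be built as in the sketch below. Only the `response_format` shape comes from the description above; the model name, the prompt, and the `parse_json_reply` helper are placeholders invented for this example. Because JSON mode here is enforced by a prompt instruction rather than constrained decoding, the client should still be prepared for a reply that fails to parse.

```python
import json

# Hypothetical request body for /chat/completions. Only `response_format`
# is taken from the PR description; everything else is placeholder data.
payload = {
    "model": "example-model",  # placeholder, not a real model id
    "messages": [
        {"role": "user", "content": "List three primary colors as JSON."}
    ],
    "response_format": {"type": "json_object"},
}

# Serialized body as it would be sent over HTTP.
body = json.dumps(payload)


def parse_json_reply(text: str) -> dict:
    """Parse a model reply that is expected to be a single JSON object.

    json.loads raises json.JSONDecodeError (a ValueError subclass) when the
    reply is not valid JSON, which can still happen with prompt-based JSON mode.
    """
    parsed = json.loads(text)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed
```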

Copilot AI left a comment

Pull request overview

Adds OpenAI-compatible JSON mode support via a response_format request field, implemented by injecting a system instruction to produce JSON-only output. The PR also introduces prompt context-length rejection and new timeout/cancellation logic around generation.

Changes:

  • Add response_format support ({"type":"json_object"}) and inject a JSON-only system instruction.
  • Add max-context token limiting (MAX_CONTEXT_TOKENS + --max-context-tokens) with request rejection when exceeded.
  • Add non-streaming generation timeout handling and streaming client-disconnect handling; add tests covering JSON mode and context limiting.
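
As a rough picture of the context-length rejection listed above, the following sketch counts prompt tokens and raises when a limit is exceeded. It is not the PR's implementation: the diff calls `check_context_length(formatted_prompt, processor, get_max_context_tokens())`, whereas this simplified version takes a plain `tokenize` callable, and the `ContextLengthExceeded` exception is a name made up for the example.

```python
from typing import Callable, List, Optional


class ContextLengthExceeded(ValueError):
    """Hypothetical error for a prompt that exceeds the configured limit."""


def check_context_length(
    prompt: str,
    tokenize: Callable[[str], List[str]],
    max_tokens: Optional[int],
) -> int:
    """Return the prompt's token count, raising if it exceeds max_tokens.

    A max_tokens of None means no limit is configured (an assumption here).
    """
    n_tokens = len(tokenize(prompt))
    if max_tokens is not None and n_tokens > max_tokens:
        raise ContextLengthExceeded(
            f"prompt has {n_tokens} tokens, limit is {max_tokens}"
        )
    return n_tokens


# Whitespace splitting stands in for a real tokenizer.
count = check_context_length("a short prompt", str.split, max_tokens=8)
```

In a server, the exception would typically be translated into an HTTP error response before any generation work starts.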

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File | Description
mlx_vlm/server.py | Implements response_format handling, context-length checks, timeout/cancellation logic, and CLI/env wiring.
mlx_vlm/tests/test_server.py | Adds unit/integration tests for context-length utilities and JSON mode behavior on both endpoints.

Comment thread mlx_vlm/server.py Outdated
    "You must respond with valid JSON only. "
    "Do not include any text outside the JSON object."
)
messages.insert(0, {"role": "system", "content": json_instruction})

Copilot AI Apr 6, 2026

resolve_response_format mutates the input messages list in-place via insert(0, ...), which can create surprising side effects for callers that reuse the list. Consider returning a new list (e.g., prepend to a copy) instead of modifying the argument.

Suggested change
- messages.insert(0, {"role": "system", "content": json_instruction})
+ return [{"role": "system", "content": json_instruction}, *messages]
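
The difference between the two lines can be checked with a small sketch. The instruction text is the one shown in the diff; the function name `with_json_instruction` is made up for this example and is not the PR's `resolve_response_format`.

```python
from typing import Dict, List

# Instruction text as it appears in the diff above.
JSON_INSTRUCTION = (
    "You must respond with valid JSON only. "
    "Do not include any text outside the JSON object."
)


def with_json_instruction(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return a new list with the JSON system message first.

    The caller's list is left untouched; the message dicts themselves are
    shared (shallow copy), which is enough to avoid the insert(0, ...) side effect.
    """
    return [{"role": "system", "content": JSON_INSTRUCTION}, *messages]


original = [{"role": "user", "content": "hi"}]
updated = with_json_instruction(original)
```

After this runs, `original` still has one message, while `updated` has two and starts with the system instruction.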

Copilot uses AI. Check for mistakes.
Comment thread mlx_vlm/server.py
Comment on lines +662 to +666
if not response_format:
    return messages
fmt_type = response_format.get("type", "text")
if fmt_type == "json_object":
    json_instruction = (

Copilot AI Apr 6, 2026

Unsupported response_format values are silently treated as text (no-op). For an OpenAI-compatible API, it would be safer to validate response_format['type'] and return an HTTP 400 (or 422) for unknown types to avoid clients thinking JSON mode is enabled when it isn’t.
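
One way to do the validation suggested here is sketched below. The set of supported types and the `UnsupportedResponseFormat` exception are assumptions for illustration; the PR may structure this differently.

```python
from typing import Optional

# Assumed set of accepted types; the PR description only mentions json_object,
# and the diff treats a missing type as "text".
SUPPORTED_RESPONSE_FORMATS = {"text", "json_object"}


class UnsupportedResponseFormat(ValueError):
    """Hypothetical error that a server would map to an HTTP 400 response."""


def validate_response_format(response_format: Optional[dict]) -> str:
    """Return the effective format type, rejecting unknown types."""
    if not response_format:
        return "text"
    fmt_type = response_format.get("type", "text")
    if fmt_type not in SUPPORTED_RESPONSE_FORMATS:
        raise UnsupportedResponseFormat(
            f"unsupported response_format type: {fmt_type!r}"
        )
    return fmt_type
```

In a FastAPI handler the exception could be caught and re-raised as `HTTPException(status_code=400, detail=...)`, so a client never assumes JSON mode is active when it is not.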

Comment thread mlx_vlm/server.py
Comment on lines +1035 to +1053
    result = await asyncio.wait_for(
        loop.run_in_executor(
            None,
            lambda: generate(
                model=model,
                processor=processor,
                prompt=formatted_prompt,
                image=images,
                verbose=False,
                **generation_kwargs,
            ),
        ),
        timeout=timeout,
    )
except asyncio.TimeoutError:
    print(f"[cancellation] /responses generation timed out after {timeout}s.")
    mx.clear_cache()
    gc.collect()
    raise HTTPException(

Copilot AI Apr 6, 2026

Timing out run_in_executor(...generate...) with asyncio.wait_for does not cancel the underlying thread; generation will continue running in the background after returning 504. Clearing MLX cache/GC immediately on timeout can also race with that still-running generation thread. Consider a cooperative cancellation mechanism inside generate, or isolate generation in a process that can be terminated, or avoid cache-clearing until the worker completes.
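
The behavior described above can be reproduced without MLX. In this sketch `slow_generate` is a hypothetical stand-in for the real `generate` call; it sets an event when it finishes, so we can observe that the worker thread runs to completion even though `asyncio.wait_for` has already raised a timeout.

```python
import asyncio
import threading
import time

# Set by the worker when it finishes, so completion is observable.
finished = threading.Event()


def slow_generate() -> str:
    """Stand-in for a blocking generation call (not the real generate)."""
    time.sleep(0.3)
    finished.set()
    return "done"


async def run_with_timeout(timeout: float) -> bool:
    """Return True if the await timed out; the worker keeps running regardless."""
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(None, slow_generate)
    try:
        await asyncio.wait_for(future, timeout=timeout)
        timed_out = False
    except asyncio.TimeoutError:
        # A request handler would return 504 here, but the thread is still busy.
        timed_out = True
    # Wait (without blocking the event loop) until the worker really finishes.
    await loop.run_in_executor(None, finished.wait, 2.0)
    return timed_out


timed_out = asyncio.run(run_with_timeout(0.05))
```

Any cleanup that touches state shared with the worker is therefore only safe after the worker has signalled completion, not at the moment the timeout fires.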

Comment thread mlx_vlm/server.py
Comment on lines +1309 to +1329
try:
    gen_result = await asyncio.wait_for(
        loop.run_in_executor(
            None,
            lambda: generate(
                model=model,
                processor=processor,
                prompt=formatted_prompt,
                image=images,
                audio=audio,
                verbose=False,
                vision_cache=model_cache.get("vision_cache"),
                **generation_kwargs,
            ),
        ),
        timeout=timeout,
    )
except asyncio.TimeoutError:
    print(f"[cancellation] /chat/completions generation timed out after {timeout}s.")
    mx.clear_cache()
    gc.collect()

Copilot AI Apr 6, 2026

Timing out run_in_executor(...generate...) with asyncio.wait_for does not cancel the underlying thread; generation will continue running in the background after returning 504. Clearing MLX cache/GC immediately on timeout can also race with that still-running generation thread. Consider a cooperative cancellation mechanism inside generate, or isolate generation in a process that can be terminated, or avoid cache-clearing until the worker completes.

Suggested change
- try:
-     gen_result = await asyncio.wait_for(
-         loop.run_in_executor(
-             None,
-             lambda: generate(
-                 model=model,
-                 processor=processor,
-                 prompt=formatted_prompt,
-                 image=images,
-                 audio=audio,
-                 verbose=False,
-                 vision_cache=model_cache.get("vision_cache"),
-                 **generation_kwargs,
-             ),
-         ),
-         timeout=timeout,
-     )
- except asyncio.TimeoutError:
-     print(f"[cancellation] /chat/completions generation timed out after {timeout}s.")
-     mx.clear_cache()
-     gc.collect()
+ generation_future = loop.run_in_executor(
+     None,
+     lambda: generate(
+         model=model,
+         processor=processor,
+         prompt=formatted_prompt,
+         image=images,
+         audio=audio,
+         verbose=False,
+         vision_cache=model_cache.get("vision_cache"),
+         **generation_kwargs,
+     ),
+ )
+ try:
+     gen_result = await asyncio.wait_for(
+         asyncio.shield(generation_future),
+         timeout=timeout,
+     )
+ except asyncio.TimeoutError:
+     print(f"[cancellation] /chat/completions generation timed out after {timeout}s.")
+     def _cleanup_after_generation(_future):
+         try:
+             mx.clear_cache()
+             gc.collect()
+             print("Background generation finished after timeout, cleared cache.")
+         except Exception as cleanup_exc:
+             print(
+                 "Error while clearing cache after timed out generation: "
+                 f"{cleanup_exc}"
+             )
+     generation_future.add_done_callback(_cleanup_after_generation)
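
The suggestion above defers cleanup but still lets the timed-out generation run to the end. The "cooperative cancellation" alternative mentioned in the comment could look roughly like the sketch below. It assumes a generation loop that can check a stop flag between tokens; `generate_tokens` is a hypothetical stand-in, and nothing here implies the real `generate` accepts such a flag.

```python
import asyncio
import threading
import time
from typing import List


def generate_tokens(n_tokens: int, stop: threading.Event) -> List[int]:
    """Stand-in token loop that checks a stop flag before each step."""
    produced: List[int] = []
    for i in range(n_tokens):
        if stop.is_set():
            break  # give up early once cancellation was requested
        time.sleep(0.01)  # pretend to compute one token
        produced.append(i)
    return produced


async def generate_with_deadline(n_tokens: int, timeout: float) -> List[int]:
    """Run the loop in a worker thread and ask it to stop on timeout."""
    loop = asyncio.get_running_loop()
    stop = threading.Event()
    future = loop.run_in_executor(None, generate_tokens, n_tokens, stop)
    try:
        # shield() keeps the worker's future alive when wait_for times out.
        return await asyncio.wait_for(asyncio.shield(future), timeout=timeout)
    except asyncio.TimeoutError:
        stop.set()  # request cancellation at the next token boundary
        # The worker exits promptly; cleanup is safe once this await returns.
        return await future


partial = asyncio.run(generate_with_deadline(1000, 0.05))
complete = asyncio.run(generate_with_deadline(3, 1.0))
```

With this shape, cache clearing can happen right after the final `await future`, because the worker is known to have stopped by then.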

Comment thread mlx_vlm/server.py
Comment on lines 1116 to 1118
  @app.post("/v1/chat/completions", response_model=None, include_in_schema=False)
- async def chat_completions_endpoint(request: ChatRequest):
+ async def chat_completions_endpoint(request: ChatRequest, raw_request: Request):
      """

Copilot AI Apr 6, 2026

The endpoint signature now annotates raw_request: Request, but Request is not imported in this module. This will raise a NameError when evaluating the function signature. Import Request from FastAPI/Starlette (e.g., from fastapi import Request).
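
The failure mode is easy to reproduce in isolation. The sketch below defines a handler whose annotation refers to a name that was never imported; no web framework is involved, and `try_define_endpoint` is a helper made up for the demonstration. On Python versions that evaluate annotations eagerly the `def` itself fails, while on versions with lazy annotations the error appears once the annotations are inspected, which a framework does when it reads the signature.

```python
def try_define_endpoint() -> str:
    """Define a handler that annotates a parameter with an unimported name."""
    try:
        # `Request` is deliberately not imported anywhere in this module.
        def endpoint(request: dict, raw_request: Request):  # noqa: F821
            return None

        # Newer Python versions evaluate annotations lazily, so force the
        # evaluation here, as signature inspection would.
        endpoint.__annotations__
    except NameError as exc:
        return f"NameError: {exc}"
    return "defined without error"


outcome = try_define_endpoint()
```

Adding `from fastapi import Request` at module level, as the comment suggests, removes the error.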

Comment thread mlx_vlm/server.py
Comment on lines 881 to +893
@@ -845,6 +890,7 @@ def run_openai(prompt, img_url,system, stream=False, max_output_tokens=512, mode
    num_images=len(images),
    **template_kwargs,
)
check_context_length(formatted_prompt, processor, get_max_context_tokens())

Copilot AI Apr 6, 2026

PR description focuses on response_format/JSON mode, but this change set also adds request timeouts, client-disconnect handling, and a new MAX_CONTEXT_TOKENS limit/CLI flag. Consider updating the PR description/title to reflect the additional behavior, or splitting these concerns into separate PRs for easier review/revert.

Comment thread mlx_vlm/server.py Outdated
Comment on lines +1031 to +1034
# Use generate from generate.py, with request timeout
timeout = get_request_timeout()
loop = asyncio.get_event_loop()
try:

Copilot AI Apr 6, 2026

This code uses get_request_timeout() and asyncio, but neither is defined/imported in server.py (repo-wide search also finds no get_request_timeout definition). As written, non-streaming /responses requests will crash with NameError. Add the missing imports/implementation, and prefer asyncio.get_running_loop() over get_event_loop() inside async endpoints.
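
A minimal sketch of the missing pieces might look like this. The names `get_request_timeout` and `DEFAULT_REQUEST_TIMEOUT` come from this PR's discussion, but the environment variable name, the default value, and the fallback rules are assumptions made for the example, as is the `run_blocking` helper.

```python
import asyncio
import os

DEFAULT_REQUEST_TIMEOUT = 300.0  # seconds; assumed value, not from the PR
REQUEST_TIMEOUT_ENV = "REQUEST_TIMEOUT"  # hypothetical environment variable name


def get_request_timeout() -> float:
    """Read the timeout from the environment, falling back to the default."""
    raw = os.environ.get(REQUEST_TIMEOUT_ENV)
    if raw is None:
        return DEFAULT_REQUEST_TIMEOUT
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_REQUEST_TIMEOUT
    # Treat zero or negative values as "use the default" (an assumption).
    return value if value > 0 else DEFAULT_REQUEST_TIMEOUT


async def run_blocking(func, *args):
    """Run a blocking callable in the default executor with the timeout."""
    # get_running_loop() is the recommended call inside a coroutine.
    loop = asyncio.get_running_loop()
    return await asyncio.wait_for(
        loop.run_in_executor(None, func, *args),
        timeout=get_request_timeout(),
    )
```

Note that `asyncio.wait_for` here only bounds how long the coroutine waits; it does not stop the worker thread.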

Comment thread mlx_vlm/server.py Outdated
Comment on lines +1306 to +1309
# Use generate from generate.py, with request timeout
timeout = get_request_timeout()
loop = asyncio.get_event_loop()
try:

Copilot AI Apr 6, 2026

This code uses get_request_timeout() and asyncio, but neither is defined/imported in server.py (repo-wide search also finds no get_request_timeout definition). As written, non-streaming /chat/completions requests will crash with NameError. Add the missing imports/implementation, and prefer asyncio.get_running_loop() over get_event_loop() inside async endpoints.

eloe added a commit that referenced this pull request Apr 9, 2026
- Don't mutate messages list in-place (return new list)
- Validate unsupported response_format types → 400
- Add missing asyncio, Request imports
- Add missing get_request_timeout/DEFAULT_REQUEST_TIMEOUT
- Use asyncio.get_running_loop() instead of get_event_loop()
- Add thread-cancel caveat comments on wait_for
- Use monkeypatch for env var tests
- Add test for unsupported response_format type

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 participants