feat: add JSON mode via response_format parameter#18
Conversation
Accept OpenAI's response_format: {"type": "json_object"} on both
/chat/completions and /responses endpoints. When json_object is
requested, a system instruction is prepended telling the model to
respond with valid JSON only.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds OpenAI-compatible JSON mode support via a response_format request field, implemented by injecting a system instruction to produce JSON-only output. The PR also introduces prompt context-length rejection and new timeout/cancellation logic around generation.
Changes:
- Add
response_formatsupport ({"type":"json_object"}) and inject a JSON-only system instruction. - Add max-context token limiting (
MAX_CONTEXT_TOKENS+--max-context-tokens) with request rejection when exceeded. - Add non-streaming generation timeout handling and streaming client-disconnect handling; add tests covering JSON mode and context limiting.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
mlx_vlm/server.py |
Implements response_format handling, context-length checks, timeout/cancellation logic, and CLI/env wiring. |
mlx_vlm/tests/test_server.py |
Adds unit/integration tests for context-length utilities and JSON mode behavior on both endpoints. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "You must respond with valid JSON only. " | ||
| "Do not include any text outside the JSON object." | ||
| ) | ||
| messages.insert(0, {"role": "system", "content": json_instruction}) |
There was a problem hiding this comment.
resolve_response_format mutates the input messages list in-place via insert(0, ...), which can create surprising side effects for callers that reuse the list. Consider returning a new list (e.g., prepend to a copy) instead of modifying the argument.
| messages.insert(0, {"role": "system", "content": json_instruction}) | |
| return [{"role": "system", "content": json_instruction}, *messages] |
| if not response_format: | ||
| return messages | ||
| fmt_type = response_format.get("type", "text") | ||
| if fmt_type == "json_object": | ||
| json_instruction = ( |
There was a problem hiding this comment.
Unsupported response_format values are silently treated as text (no-op). For an OpenAI-compatible API, it would be safer to validate response_format['type'] and return an HTTP 400 (or 422) for unknown types to avoid clients thinking JSON mode is enabled when it isn’t.
| result = await asyncio.wait_for( | ||
| loop.run_in_executor( | ||
| None, | ||
| lambda: generate( | ||
| model=model, | ||
| processor=processor, | ||
| prompt=formatted_prompt, | ||
| image=images, | ||
| verbose=False, | ||
| **generation_kwargs, | ||
| ), | ||
| ), | ||
| timeout=timeout, | ||
| ) | ||
| except asyncio.TimeoutError: | ||
| print(f"[cancellation] /responses generation timed out after {timeout}s.") | ||
| mx.clear_cache() | ||
| gc.collect() | ||
| raise HTTPException( |
There was a problem hiding this comment.
Timing out run_in_executor(...generate...) with asyncio.wait_for does not cancel the underlying thread; generation will continue running in the background after returning 504. Clearing MLX cache/GC immediately on timeout can also race with that still-running generation thread. Consider a cooperative cancellation mechanism inside generate, or isolate generation in a process that can be terminated, or avoid cache-clearing until the worker completes.
| try: | ||
| gen_result = await asyncio.wait_for( | ||
| loop.run_in_executor( | ||
| None, | ||
| lambda: generate( | ||
| model=model, | ||
| processor=processor, | ||
| prompt=formatted_prompt, | ||
| image=images, | ||
| audio=audio, | ||
| verbose=False, | ||
| vision_cache=model_cache.get("vision_cache"), | ||
| **generation_kwargs, | ||
| ), | ||
| ), | ||
| timeout=timeout, | ||
| ) | ||
| except asyncio.TimeoutError: | ||
| print(f"[cancellation] /chat/completions generation timed out after {timeout}s.") | ||
| mx.clear_cache() | ||
| gc.collect() |
There was a problem hiding this comment.
Timing out run_in_executor(...generate...) with asyncio.wait_for does not cancel the underlying thread; generation will continue running in the background after returning 504. Clearing MLX cache/GC immediately on timeout can also race with that still-running generation thread. Consider a cooperative cancellation mechanism inside generate, or isolate generation in a process that can be terminated, or avoid cache-clearing until the worker completes.
| try: | |
| gen_result = await asyncio.wait_for( | |
| loop.run_in_executor( | |
| None, | |
| lambda: generate( | |
| model=model, | |
| processor=processor, | |
| prompt=formatted_prompt, | |
| image=images, | |
| audio=audio, | |
| verbose=False, | |
| vision_cache=model_cache.get("vision_cache"), | |
| **generation_kwargs, | |
| ), | |
| ), | |
| timeout=timeout, | |
| ) | |
| except asyncio.TimeoutError: | |
| print(f"[cancellation] /chat/completions generation timed out after {timeout}s.") | |
| mx.clear_cache() | |
| gc.collect() | |
| generation_future = loop.run_in_executor( | |
| None, | |
| lambda: generate( | |
| model=model, | |
| processor=processor, | |
| prompt=formatted_prompt, | |
| image=images, | |
| audio=audio, | |
| verbose=False, | |
| vision_cache=model_cache.get("vision_cache"), | |
| **generation_kwargs, | |
| ), | |
| ) | |
| try: | |
| gen_result = await asyncio.wait_for( | |
| asyncio.shield(generation_future), | |
| timeout=timeout, | |
| ) | |
| except asyncio.TimeoutError: | |
| print(f"[cancellation] /chat/completions generation timed out after {timeout}s.") | |
| def _cleanup_after_generation(_future): | |
| try: | |
| mx.clear_cache() | |
| gc.collect() | |
| print("Background generation finished after timeout, cleared cache.") | |
| except Exception as cleanup_exc: | |
| print( | |
| "Error while clearing cache after timed out generation: " | |
| f"{cleanup_exc}" | |
| ) | |
| generation_future.add_done_callback(_cleanup_after_generation) |
| @app.post("/v1/chat/completions", response_model=None, include_in_schema=False) | ||
| async def chat_completions_endpoint(request: ChatRequest): | ||
| async def chat_completions_endpoint(request: ChatRequest, raw_request: Request): | ||
| """ |
There was a problem hiding this comment.
The endpoint signature now annotates raw_request: Request, but Request is not imported in this module. This will raise a NameError when evaluating the function signature. Import Request from FastAPI/Starlette (e.g., from fastapi import Request).
| @@ -845,6 +890,7 @@ def run_openai(prompt, img_url,system, stream=False, max_output_tokens=512, mode | |||
| num_images=len(images), | |||
| **template_kwargs, | |||
| ) | |||
| check_context_length(formatted_prompt, processor, get_max_context_tokens()) | |||
There was a problem hiding this comment.
PR description focuses on response_format/JSON mode, but this change set also adds request timeouts, client-disconnect handling, and a new MAX_CONTEXT_TOKENS limit/CLI flag. Consider updating the PR description/title to reflect the additional behavior, or splitting these concerns into separate PRs for easier review/revert.
| # Use generate from generate.py, with request timeout | ||
| timeout = get_request_timeout() | ||
| loop = asyncio.get_event_loop() | ||
| try: |
There was a problem hiding this comment.
This code uses get_request_timeout() and asyncio, but neither is defined/imported in server.py (repo-wide search also finds no get_request_timeout definition). As written, non-streaming /responses requests will crash with NameError. Add the missing imports/implementation, and prefer asyncio.get_running_loop() over get_event_loop() inside async endpoints.
| # Use generate from generate.py, with request timeout | ||
| timeout = get_request_timeout() | ||
| loop = asyncio.get_event_loop() | ||
| try: |
There was a problem hiding this comment.
This code uses get_request_timeout() and asyncio, but neither is defined/imported in server.py (repo-wide search also finds no get_request_timeout definition). As written, non-streaming /chat/completions requests will crash with NameError. Add the missing imports/implementation, and prefer asyncio.get_running_loop() over get_event_loop() inside async endpoints.
- Don't mutate messages list in-place (return new list) - Validate unsupported response_format types → 400 - Add missing asyncio, Request imports - Add missing get_request_timeout/DEFAULT_REQUEST_TIMEOUT - Use asyncio.get_running_loop() instead of get_event_loop() - Add thread-cancel caveat comments on wait_for - Use monkeypatch for env var tests - Add test for unsupported response_format type Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Don't mutate messages list in-place (return new list) - Validate unsupported response_format types → 400 - Add missing asyncio, Request imports - Add missing get_request_timeout/DEFAULT_REQUEST_TIMEOUT - Use asyncio.get_running_loop() instead of get_event_loop() - Add thread-cancel caveat comments on wait_for - Use monkeypatch for env var tests - Add test for unsupported response_format type Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary\nAccept response_format: {type: json_object}. Injects JSON system instruction. 5 tests.