feat: OpenAI Responses API compliance with structured tool calling#11
feat: OpenAI Responses API compliance with structured tool calling#11eloe wants to merge 3 commits into
Conversation
Add full OpenAI Responses API (/v1/responses) compliance including: - Structured function_call output items (parsed from model text) - function_call_output input items for multi-turn tool use - previous_response_id with LRU response store (256 entries) - instructions field with developer-to-system role normalization - "text" type alias accepted alongside "input_text" - tools/tool_choice passthrough to chat template and response echo - Streaming SSE with sequence_number and [DONE] sentinel - incomplete_details for length-truncated responses - parallel_tool_calls, metadata field support New files: - responses_models.py: Self-contained Pydantic models for Responses API - responses_store.py: Thread-safe LRU store for response replay - tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming) Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the model generates <tool_call>...</tool_call> markup during streaming, detect the tag and suppress those tokens from being sent as response.output_text.delta events. This prevents raw tool call XML from being displayed to users (e.g., in Telegram via OpenClaw). The tool call is still parsed and emitted as structured function_call events after generation completes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR rewrites the /responses (/v1/responses) endpoint to align more closely with the OpenAI Responses API, including structured tool calling, previous_response_id replay, and SSE streaming event shapes.
Changes:
- Added Pydantic request/response + streaming event models for the Responses API.
- Introduced an in-memory LRU
ResponseStoreto supportprevious_response_idconversation replay. - Reimplemented
/responsesendpoint with tool-call parsing, streaming SSE events with sequence numbers, and added a new Responses API test suite.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
mlx_vlm/responses_models.py |
Adds Responses API request/response + SSE event models. |
mlx_vlm/responses_store.py |
Implements an LRU response store for previous_response_id replay. |
mlx_vlm/server.py |
Rewrites /responses endpoint (streaming + non-streaming) using new models/store and tool parsing. |
mlx_vlm/tests/test_responses_api.py |
Adds unit + functional + streaming tests for Responses API behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ResponseOutputMessageContentList, | ||
| ] | ||
| ] = Field(None, description="Content of the message.") | ||
| tool_calls: List = [] |
There was a problem hiding this comment.
Several Pydantic models use mutable list literals as defaults (e.g., ChatMessage.tool_calls). This can lead to shared state across instances and surprising mutations. Use Field(default_factory=list) (or default_factory for other containers) for these fields.
| tool_calls: List = [] | |
| tool_calls: List = Field(default_factory=list) |
| class ContentPartOutputText(BaseModel): | ||
| """A text content part in an output message.""" | ||
|
|
||
| type: Literal["output_text"] = "output_text" | ||
| text: str = "" | ||
| annotations: List[str] = [] | ||
|
|
||
|
|
||
| class ResponseMessageItem(BaseModel): | ||
| """An assistant message output item.""" | ||
|
|
||
| id: str = Field(default_factory=lambda: f"msg_{uuid.uuid4().hex[:24]}") | ||
| type: Literal["message"] = "message" | ||
| role: Literal["assistant"] = "assistant" | ||
| status: Literal["in_progress", "completed"] = "completed" | ||
| content: List[ContentPartOutputText] = [] | ||
|
|
There was a problem hiding this comment.
Mutable list literals are used as defaults for annotations/content fields. Even if most instances won’t mutate them, this is unsafe and can create cross-instance leakage. Prefer Field(default_factory=list) for annotations and content.
| for part in content: | ||
| if ( | ||
| isinstance(part, dict) | ||
| and part.get("type") == "output_text" | ||
| ): | ||
| items.append( | ||
| { | ||
| "role": "assistant", | ||
| "content": [ | ||
| { | ||
| "type": "output_text", | ||
| "text": part.get("text", ""), | ||
| } | ||
| ], | ||
| } | ||
| ) |
There was a problem hiding this comment.
replay_input() emits a separate assistant message for each output_text part in a stored message. This can change conversation structure when a single assistant message contains multiple content parts. Consider preserving the original message and including all output_text parts in one assistant message item.
| for part in content: | |
| if ( | |
| isinstance(part, dict) | |
| and part.get("type") == "output_text" | |
| ): | |
| items.append( | |
| { | |
| "role": "assistant", | |
| "content": [ | |
| { | |
| "type": "output_text", | |
| "text": part.get("text", ""), | |
| } | |
| ], | |
| } | |
| ) | |
| message_content = [] | |
| for part in content: | |
| if ( | |
| isinstance(part, dict) | |
| and part.get("type") == "output_text" | |
| ): | |
| message_content.append( | |
| { | |
| "type": "output_text", | |
| "text": part.get("text", ""), | |
| } | |
| ) | |
| if message_content: | |
| items.append( | |
| { | |
| "role": "assistant", | |
| "content": message_content, | |
| } | |
| ) |
| # Suppress tool call tokens from being streamed as text | ||
| if not in_tool_call and tool_call_start_tag in full_text: | ||
| in_tool_call = True | ||
| if in_tool_call: | ||
| continue | ||
|
|
There was a problem hiding this comment.
Streaming tool-call suppression sets in_tool_call=True once the start tag appears, but never resets when the end tag is reached. This will suppress all subsequent text deltas, including legitimate text after a tool call. Track tool_module.tool_call_end (and possibly nested/multiple calls) so streaming resumes after the tool call closes.
| images.append(ci.get("image_url", "")) | ||
| elif ci_type == "image_url": | ||
| img = ci.get("image_url", {}) | ||
| if isinstance(img, dict): | ||
| images.append(img.get("url", "")) | ||
| elif isinstance(img, str): | ||
| images.append(img) |
There was a problem hiding this comment.
Image URL extraction appends empty strings when image_url/url is missing. This will later cause load_image('') failures and return 500s for malformed input. Validate required fields and raise a 400 (or skip the image part) instead of appending "".
| images.append(ci.get("image_url", "")) | |
| elif ci_type == "image_url": | |
| img = ci.get("image_url", {}) | |
| if isinstance(img, dict): | |
| images.append(img.get("url", "")) | |
| elif isinstance(img, str): | |
| images.append(img) | |
| image_url = ci.get("image_url") | |
| if not isinstance(image_url, str) or not image_url: | |
| raise HTTPException( | |
| status_code=400, | |
| detail="Invalid input_image content part: 'image_url' must be a non-empty string.", | |
| ) | |
| images.append(image_url) | |
| elif ci_type == "image_url": | |
| img = ci.get("image_url") | |
| if isinstance(img, dict): | |
| image_url = img.get("url") | |
| if not isinstance(image_url, str) or not image_url: | |
| raise HTTPException( | |
| status_code=400, | |
| detail="Invalid image_url content part: 'image_url.url' must be a non-empty string.", | |
| ) | |
| images.append(image_url) | |
| elif isinstance(img, str): | |
| if not img: | |
| raise HTTPException( | |
| status_code=400, | |
| detail="Invalid image_url content part: 'image_url' must be a non-empty string.", | |
| ) | |
| images.append(img) | |
| else: | |
| raise HTTPException( | |
| status_code=400, | |
| detail="Invalid image_url content part: 'image_url' must be a string or an object containing a non-empty 'url' field.", | |
| ) |
Critical fixes:
- Add recursion depth limit (50) and cycle detection for
previous_response_id chains to prevent stack overflow/DoS
- Fix tool call suppression: in_tool_call now resets on end tag,
partial-match only triggers on actual trailing prefix (not any <)
- Sanitize error messages in streaming — no longer leaks internals
Security fixes:
- Validate image URLs are non-empty before appending (prevents
load_image('') → 500 errors)
Spec compliance:
- Remove data: [DONE] sentinel (not part of Responses API spec,
response.completed is the terminal event)
Code quality:
- Fix mutable default lists in Pydantic models (use default_factory)
- Log tool parsing failures instead of silently swallowing
- Collect multi-part assistant messages into single message in replay
- Shorten import aliases to avoid unnecessary verbosity
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Critical fixes:
- Add recursion depth limit (50) and cycle detection for
previous_response_id chains to prevent stack overflow/DoS
- Fix tool call suppression: in_tool_call now resets on end tag,
partial-match only triggers on actual trailing prefix (not any <)
- Sanitize error messages in streaming — no longer leaks internals
Security fixes:
- Validate image URLs are non-empty before appending (prevents
load_image('') → 500 errors)
Spec compliance:
- Remove data: [DONE] sentinel (not part of Responses API spec,
response.completed is the terminal event)
Code quality:
- Fix mutable default lists in Pydantic models (use default_factory)
- Log tool parsing failures instead of silently swallowing
- Collect multi-part assistant messages into single message in replay
- Shorten import aliases to avoid unnecessary verbosity
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary\nFull OpenAI Responses API (
/v1/responses) compliance with structured tool calling,previous_response_id, streaming events with sequence numbers, tool call suppression in streaming, and 31 tests.\n\n## Files\n-responses_models.py— Pydantic models\n-responses_store.py— LRU store\n-server.py— Rewritten endpoint\n-tests/test_responses_api.py— 31 tests