Skip to content

feat: OpenAI Responses API compliance with structured tool calling#11

Open
eloe wants to merge 3 commits into
mainfrom
feature/openai-responses-api
Open

feat: OpenAI Responses API compliance with structured tool calling#11
eloe wants to merge 3 commits into
mainfrom
feature/openai-responses-api

Conversation

@eloe
Copy link
Copy Markdown
Owner

@eloe eloe commented Apr 6, 2026

Summary\nFull OpenAI Responses API (/v1/responses) compliance with structured tool calling, previous_response_id, streaming events with sequence numbers, tool call suppression in streaming, and 31 tests.\n\n## Files\n- responses_models.py — Pydantic models\n- responses_store.py — LRU store\n- server.py — Rewritten endpoint\n- tests/test_responses_api.py — 31 tests

eloe and others added 2 commits April 5, 2026 11:58
Add full OpenAI Responses API (/v1/responses) compliance including:

- Structured function_call output items (parsed from model text)
- function_call_output input items for multi-turn tool use
- previous_response_id with LRU response store (256 entries)
- instructions field with developer-to-system role normalization
- "text" type alias accepted alongside "input_text"
- tools/tool_choice passthrough to chat template and response echo
- Streaming SSE with sequence_number and [DONE] sentinel
- incomplete_details for length-truncated responses
- parallel_tool_calls, metadata field support

New files:
- responses_models.py: Self-contained Pydantic models for Responses API
- responses_store.py: Thread-safe LRU store for response replay
- tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming)

Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the model generates <tool_call>...</tool_call> markup during
streaming, detect the tag and suppress those tokens from being sent
as response.output_text.delta events. This prevents raw tool call
XML from being displayed to users (e.g., in Telegram via OpenClaw).

The tool call is still parsed and emitted as structured function_call
events after generation completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR rewrites the /responses (/v1/responses) endpoint to align more closely with the OpenAI Responses API, including structured tool calling, previous_response_id replay, and SSE streaming event shapes.

Changes:

  • Added Pydantic request/response + streaming event models for the Responses API.
  • Introduced an in-memory LRU ResponseStore to support previous_response_id conversation replay.
  • Reimplemented /responses endpoint with tool-call parsing, streaming SSE events with sequence numbers, and added a new Responses API test suite.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
mlx_vlm/responses_models.py Adds Responses API request/response + SSE event models.
mlx_vlm/responses_store.py Implements an LRU response store for previous_response_id replay.
mlx_vlm/server.py Rewrites /responses endpoint (streaming + non-streaming) using new models/store and tool parsing.
mlx_vlm/tests/test_responses_api.py Adds unit + functional + streaming tests for Responses API behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mlx_vlm/responses_models.py Outdated
ResponseOutputMessageContentList,
]
] = Field(None, description="Content of the message.")
tool_calls: List = []
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several Pydantic models use mutable list literals as defaults (e.g., ChatMessage.tool_calls). This can lead to shared state across instances and surprising mutations. Use Field(default_factory=list) (or default_factory for other containers) for these fields.

Suggested change
tool_calls: List = []
tool_calls: List = Field(default_factory=list)

Copilot uses AI. Check for mistakes.
Comment on lines +285 to +301
class ContentPartOutputText(BaseModel):
"""A text content part in an output message."""

type: Literal["output_text"] = "output_text"
text: str = ""
annotations: List[str] = []


class ResponseMessageItem(BaseModel):
"""An assistant message output item."""

id: str = Field(default_factory=lambda: f"msg_{uuid.uuid4().hex[:24]}")
type: Literal["message"] = "message"
role: Literal["assistant"] = "assistant"
status: Literal["in_progress", "completed"] = "completed"
content: List[ContentPartOutputText] = []

Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutable list literals are used as defaults for annotations/content fields. Even if most instances won’t mutate them, this is unsafe and can create cross-instance leakage. Prefer Field(default_factory=list) for annotations and content.

Copilot uses AI. Check for mistakes.
Comment thread mlx_vlm/responses_store.py Outdated
Comment on lines +95 to +110
for part in content:
if (
isinstance(part, dict)
and part.get("type") == "output_text"
):
items.append(
{
"role": "assistant",
"content": [
{
"type": "output_text",
"text": part.get("text", ""),
}
],
}
)
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replay_input() emits a separate assistant message for each output_text part in a stored message. This can change conversation structure when a single assistant message contains multiple content parts. Consider preserving the original message and including all output_text parts in one assistant message item.

Suggested change
for part in content:
if (
isinstance(part, dict)
and part.get("type") == "output_text"
):
items.append(
{
"role": "assistant",
"content": [
{
"type": "output_text",
"text": part.get("text", ""),
}
],
}
)
message_content = []
for part in content:
if (
isinstance(part, dict)
and part.get("type") == "output_text"
):
message_content.append(
{
"type": "output_text",
"text": part.get("text", ""),
}
)
if message_content:
items.append(
{
"role": "assistant",
"content": message_content,
}
)

Copilot uses AI. Check for mistakes.
Comment thread mlx_vlm/server.py Outdated
Comment on lines +1085 to +1090
# Suppress tool call tokens from being streamed as text
if not in_tool_call and tool_call_start_tag in full_text:
in_tool_call = True
if in_tool_call:
continue

Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Streaming tool-call suppression sets in_tool_call=True once the start tag appears, but never resets when the end tag is reached. This will suppress all subsequent text deltas, including legitimate text after a tool call. Track tool_module.tool_call_end (and possibly nested/multiple calls) so streaming resumes after the tool call closes.

Copilot uses AI. Check for mistakes.
Comment thread mlx_vlm/server.py Outdated
Comment on lines +827 to +833
images.append(ci.get("image_url", ""))
elif ci_type == "image_url":
img = ci.get("image_url", {})
if isinstance(img, dict):
images.append(img.get("url", ""))
elif isinstance(img, str):
images.append(img)
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image URL extraction appends empty strings when image_url/url is missing. This will later cause load_image('') failures and return 500s for malformed input. Validate required fields and raise a 400 (or skip the image part) instead of appending "".

Suggested change
images.append(ci.get("image_url", ""))
elif ci_type == "image_url":
img = ci.get("image_url", {})
if isinstance(img, dict):
images.append(img.get("url", ""))
elif isinstance(img, str):
images.append(img)
image_url = ci.get("image_url")
if not isinstance(image_url, str) or not image_url:
raise HTTPException(
status_code=400,
detail="Invalid input_image content part: 'image_url' must be a non-empty string.",
)
images.append(image_url)
elif ci_type == "image_url":
img = ci.get("image_url")
if isinstance(img, dict):
image_url = img.get("url")
if not isinstance(image_url, str) or not image_url:
raise HTTPException(
status_code=400,
detail="Invalid image_url content part: 'image_url.url' must be a non-empty string.",
)
images.append(image_url)
elif isinstance(img, str):
if not img:
raise HTTPException(
status_code=400,
detail="Invalid image_url content part: 'image_url' must be a non-empty string.",
)
images.append(img)
else:
raise HTTPException(
status_code=400,
detail="Invalid image_url content part: 'image_url' must be a string or an object containing a non-empty 'url' field.",
)

Copilot uses AI. Check for mistakes.
Critical fixes:
- Add recursion depth limit (50) and cycle detection for
  previous_response_id chains to prevent stack overflow/DoS
- Fix tool call suppression: in_tool_call now resets on end tag,
  partial-match only triggers on actual trailing prefix (not any <)
- Sanitize error messages in streaming — no longer leaks internals

Security fixes:
- Validate image URLs are non-empty before appending (prevents
  load_image('') → 500 errors)

Spec compliance:
- Remove data: [DONE] sentinel (not part of Responses API spec,
  response.completed is the terminal event)

Code quality:
- Fix mutable default lists in Pydantic models (use default_factory)
- Log tool parsing failures instead of silently swallowing
- Collect multi-part assistant messages into single message in replay
- Shorten import aliases to avoid unnecessary verbosity

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eloe added a commit that referenced this pull request Apr 9, 2026
Critical fixes:
- Add recursion depth limit (50) and cycle detection for
  previous_response_id chains to prevent stack overflow/DoS
- Fix tool call suppression: in_tool_call now resets on end tag,
  partial-match only triggers on actual trailing prefix (not any <)
- Sanitize error messages in streaming — no longer leaks internals

Security fixes:
- Validate image URLs are non-empty before appending (prevents
  load_image('') → 500 errors)

Spec compliance:
- Remove data: [DONE] sentinel (not part of Responses API spec,
  response.completed is the terminal event)

Code quality:
- Fix mutable default lists in Pydantic models (use default_factory)
- Log tool parsing failures instead of silently swallowing
- Collect multi-part assistant messages into single message in replay
- Shorten import aliases to avoid unnecessary verbosity

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants