Skip to content

feat: OpenAI Responses API compliance with structured tool calling#952

Closed
eloe wants to merge 2 commits into
Blaizzy:mainfrom
eloe:feature/openai-responses-api
Closed

feat: OpenAI Responses API compliance with structured tool calling#952
eloe wants to merge 2 commits into
Blaizzy:mainfrom
eloe:feature/openai-responses-api

Conversation

@eloe
Copy link
Copy Markdown
Contributor

@eloe eloe commented Apr 6, 2026

Summary

Full OpenAI Responses API (/v1/responses) compliance for agentic coding workflows.

  • Structured function_call output items parsed from model text via existing tool parsers
  • function_call_output input items for multi-turn tool use loops
  • previous_response_id with LRU response store (256 entries) for conversation replay
  • instructions field with developer→system role normalization
  • "text" type alias accepted alongside "input_text"
  • Tools/tool_choice passthrough to chat template and response echo
  • Streaming SSE with sequence_number and [DONE] sentinel
  • Tool call token suppression in streaming (no raw XML leaked to clients)
  • incomplete_details for length-truncated responses
  • parallel_tool_calls, metadata field support

Files

  • mlx_vlm/responses_models.py — Self-contained Pydantic models for Responses API
  • mlx_vlm/responses_store.py — Thread-safe LRU store for response replay
  • mlx_vlm/server.py — Rewritten /v1/responses endpoint with input conversion + tool call output building
  • mlx_vlm/tests/test_responses_api.py — 31 tests (models, store, endpoint, streaming)

Motivation

Agent frameworks like OpenClaw and Hermes use the Responses API for tool-calling loops on local models. The existing /v1/responses endpoint returned raw text without structured tool call items, breaking agentic workflows.

Testing

python -m pytest mlx_vlm/tests/test_responses_api.py -v
# 31 passed

Reference

eloe and others added 2 commits April 5, 2026 11:58
Add full OpenAI Responses API (/v1/responses) compliance including:

- Structured function_call output items (parsed from model text)
- function_call_output input items for multi-turn tool use
- previous_response_id with LRU response store (256 entries)
- instructions field with developer-to-system role normalization
- "text" type alias accepted alongside "input_text"
- tools/tool_choice passthrough to chat template and response echo
- Streaming SSE with sequence_number and [DONE] sentinel
- incomplete_details for length-truncated responses
- parallel_tool_calls, metadata field support

New files:
- responses_models.py: Self-contained Pydantic models for Responses API
- responses_store.py: Thread-safe LRU store for response replay
- tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming)

Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the model generates <tool_call>...</tool_call> markup during
streaming, detect the tag and suppress those tokens from being sent
as response.output_text.delta events. This prevents raw tool call
XML from being displayed to users (e.g., in Telegram via OpenClaw).

The tool call is still parsed and emitted as structured function_call
events after generation completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant