feat: OpenAI Responses API compliance with structured tool calling by eloe · Pull Request #952 · Blaizzy/mlx-vlm

eloe · 2026-04-06T15:58:13Z

Summary

Full OpenAI Responses API (/v1/responses) compliance for agentic coding workflows.

Structured function_call output items parsed from model text via existing tool parsers
function_call_output input items for multi-turn tool use loops
previous_response_id with LRU response store (256 entries) for conversation replay
instructions field with developer→system role normalization
"text" type alias accepted alongside "input_text"
Tools/tool_choice passthrough to chat template and response echo
Streaming SSE with sequence_number and [DONE] sentinel
Tool call token suppression in streaming (no raw XML leaked to clients)
incomplete_details for length-truncated responses
parallel_tool_calls, metadata field support

Files

mlx_vlm/responses_models.py — Self-contained Pydantic models for Responses API
mlx_vlm/responses_store.py — Thread-safe LRU store for response replay
mlx_vlm/server.py — Rewritten /v1/responses endpoint with input conversion + tool call output building
mlx_vlm/tests/test_responses_api.py — 31 tests (models, store, endpoint, streaming)

Motivation

Agent frameworks like OpenClaw and Hermes use the Responses API for tool-calling loops on local models. The existing /v1/responses endpoint returned raw text without structured tool call items, breaking agentic workflows.

Testing

python -m pytest mlx_vlm/tests/test_responses_api.py -v
# 31 passed

Reference

OpenAI Responses API spec
waybarrios/vllm-mlx#214 — reference implementation for vllm-mlx

Add full OpenAI Responses API (/v1/responses) compliance including: - Structured function_call output items (parsed from model text) - function_call_output input items for multi-turn tool use - previous_response_id with LRU response store (256 entries) - instructions field with developer-to-system role normalization - "text" type alias accepted alongside "input_text" - tools/tool_choice passthrough to chat template and response echo - Streaming SSE with sequence_number and [DONE] sentinel - incomplete_details for length-truncated responses - parallel_tool_calls, metadata field support New files: - responses_models.py: Self-contained Pydantic models for Responses API - responses_store.py: Thread-safe LRU store for response replay - tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming) Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When the model generates <tool_call>...</tool_call> markup during streaming, detect the tag and suppress those tokens from being sent as response.output_text.delta events. This prevents raw tool call XML from being displayed to users (e.g., in Telegram via OpenClaw). The tool call is still parsed and emitted as structured function_call events after generation completes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eloe and others added 2 commits April 5, 2026 11:58

eloe closed this Apr 6, 2026

eloe mentioned this pull request Apr 8, 2026

Combined server enhancements: OpenAI API compliance, prompt caching, concurrency eloe/mlx-vlm#21

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: OpenAI Responses API compliance with structured tool calling#952

feat: OpenAI Responses API compliance with structured tool calling#952
eloe wants to merge 2 commits into
Blaizzy:mainfrom
eloe:feature/openai-responses-api

eloe commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

eloe commented Apr 6, 2026

Summary

Files

Motivation

Testing

Reference

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant