server: add OpenAI-compatible /v1/responses endpoint by krystophny · Pull Request #214 · waybarrios/vllm-mlx

krystophny · 2026-03-24T12:15:50Z

Summary

Rebuilt on current upstream/main (b4fa030) as a narrower /v1/responses diff for local coding-agent workflows.

Retained Scope

OpenAI-compatible /v1/responses endpoint with streaming and non-streaming responses
Pydantic models for Responses request, response, and SSE event payloads
Responses-to-chat conversion for text messages, function tools, function_call, and function_call_output
previous_response_id replay backed by a stored replayable message history
reasoning input items converted to assistant context instead of crashing the stream
reasoning config logged and ignored instead of raising after the response has started
LRU-bounded _responses_store to avoid unbounded response accumulation

Intentionally Dropped From This PR

request-level chat_template_kwargs forwarding and related simple-engine changes; that overlap stays in #218
unrelated cross-endpoint normalization or chat-path refactors

Review Guide

vllm_mlx/server.py: request conversion, response assembly, SSE streaming, persistence
vllm_mlx/api/responses_models.py: request/response/event models
tests/test_responses_api.py: endpoint coverage for replay, persistence, reasoning, and SSE lifecycle

Behavior Checklist

previous_response_id replays prior replayable items, but does not replay prior instructions
store=False skips persistence, and missing or evicted response ids return 404
unsupported non-function tools are skipped while supported function tools are forwarded
streaming emits response.created, response.in_progress, text deltas, and response.completed with monotonic sequence_number
text.format.type="json_object" is rejected; reasoning config is ignored; reasoning input items are accepted

Validation

python -m pytest tests/test_responses_api.py -q
# 19 passed

python -m pytest tests/test_server.py tests/test_responses_api.py -q
# 53 passed, 3 deselected

Add full OpenAI Responses API (/v1/responses) compliance including: - Structured function_call output items (parsed from model text) - function_call_output input items for multi-turn tool use - previous_response_id with LRU response store (256 entries) - instructions field with developer-to-system role normalization - "text" type alias accepted alongside "input_text" - tools/tool_choice passthrough to chat template and response echo - Streaming SSE with sequence_number and [DONE] sentinel - incomplete_details for length-truncated responses - parallel_tool_calls, metadata field support New files: - responses_models.py: Self-contained Pydantic models for Responses API - responses_store.py: Thread-safe LRU store for response replay - tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming) Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Thump604 · 2026-04-08T00:26:32Z

@waybarrios, @krystophny: independent review and coordination note on this PR.

Scope

Adds an OpenAI-compatible /v1/responses endpoint with text messages, function tools, function call outputs, streaming + non-streaming, previous_response_id replay for persisted replayable input items, developer/instructions normalization onto one leading system prompt, request-level chat_template_kwargs forwarding, LRU-bounded response store (max 1000 entries), and reasoning input items converted to assistant messages for model context. 1992+/5- across 8 files.

Coordination with other open PRs

This PR has substantial overlap with two other currently-open PRs that are smaller in scope and target the same gaps incidentally:

PR fix: normalize messages before chat template application #240 (Thump604) "fix: normalize messages before chat template application" — adds _normalize_messages() in server.py that maps non-standard roles (developer to system) and merges consecutive same-role messages, applied in all four request paths. This overlaps with the "developer/instructions normalization onto one leading system prompt" item in server: add OpenAI-compatible /v1/responses endpoint #214`s scope. 236+/0- across 2 files. MERGEABLE on main.
PR chat: forward chat_template_kwargs on simple-engine paths #218 (krystophny, also yours) "chat: forward chat_template_kwargs on simple-engine paths" — plumbs chat_template_kwargs through every simple-engine path that previously dropped it (SimpleEngine multimodal chat, multimodal stream_chat, MTP _stream_generate_text), plus the BatchedEngine paths and the LLM MLXLanguageModel.chat(). This overlaps with the "request-level chat_template_kwargs forwarding" item in server: add OpenAI-compatible /v1/responses endpoint #214`s scope. 237+/13- across 7 files. MERGEABLE on main with 7 tests and a CI workflow update.

Recommended landing order

Land the smaller focused PRs first, then rebase #214 to a smaller scope:

fix: normalize messages before chat template application #240 + chat: forward chat_template_kwargs on simple-engine paths #218 (both small, focused, mergeable, and address gaps that server: add OpenAI-compatible /v1/responses endpoint #214 includes incidentally)
server: add OpenAI-compatible /v1/responses endpoint #214 rebased with the now-redundant normalization and chat_template_kwargs forwarding dropped, leaving just the /v1/responses endpoint, the LRU response store, the previous_response_id replay, and the reasoning-input-item conversion

That gives the maintainer three smaller reviews instead of one 1992-line review, and the failure mode of any one PR being wrong is bounded to that PR rather than to a single big merge.

If waybarrios prefers to land #214 as-is for speed, that also works. The duplicate work between #214 and #240/#218 is not destructive (the same normalization logic just lives in the new endpoint instead of the old). The risk is that the duplication will rot apart over time as one path is updated and the other is not.

Status

PR is MERGEABLE on current main per the PR JSON. Last activity Mar 27 (~10 days ago). I have not done a line-by-line review of all 1992 lines, but the architectural shape is reasonable for a /v1/responses endpoint and the description is detailed enough to support a maintainer review.

Add full OpenAI Responses API (/v1/responses) compliance including: - Structured function_call output items (parsed from model text) - function_call_output input items for multi-turn tool use - previous_response_id with LRU response store (256 entries) - instructions field with developer-to-system role normalization - "text" type alias accepted alongside "input_text" - tools/tool_choice passthrough to chat template and response echo - Streaming SSE with sequence_number and [DONE] sentinel - incomplete_details for length-truncated responses - parallel_tool_calls, metadata field support New files: - responses_models.py: Self-contained Pydantic models for Responses API - responses_store.py: Thread-safe LRU store for response replay - tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming) Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

krystophny · 2026-04-09T07:27:06Z

Rebuilt this branch on current upstream/main (b4fa030) as a narrower Responses-only diff.

Dropped the incidental chat_template_kwargs / simple-engine overlap so this PR now stays focused on /v1/responses, request/response models, previous_response_id replay, reasoning-item handling, and the bounded response store.

I also tightened the dedicated coverage around the retained scope:

missing previous_response_id -> 404
store=False skips persistence
LRU eviction of _responses_store
reasoning config is ignored without crashing
reasoning input items are converted into assistant context
streaming SSE lifecycle keeps monotonic sequence_number

Validation:

python -m pytest tests/test_responses_api.py -q -> 19 passed
python -m pytest tests/test_server.py tests/test_responses_api.py -q -> 53 passed, 3 deselected

Thump604

Rebuild confirmed on head 23d68e5 against upstream main b4fa030. The narrowing to a Responses-only diff resolves the overlap with #218 cleanly: the file surface is now tests/test_responses_api.py (+654 new), vllm_mlx/api/responses_models.py (+316 new), vllm_mlx/api/init.py (+34), vllm_mlx/server.py (+930/-1). No chat_template_kwargs or simple-engine hunks remain, so #218 and #214 can now merge in either order without conflict.

The retained scope covers:

/v1/responses endpoint with streaming + non-streaming, Pydantic request / response / event models
previous_response_id replay backed by a stored replayable message history with LRU cap (1000 entries)
reasoning input items converted to assistant context instead of crashing the stream
reasoning config ignored rather than raised mid-stream
store=False path, 404 on missing response id, LRU eviction

The reasoning-items-as-assistant-context conversion is the key behavioral fix on top of the original #28-era rejection path: raising HTTPException inside an already-started SSE stream caused "response already started" crashes, and the new path converts reasoning content to assistant messages so the model sees its prior chain-of-thought instead.

CI green across lint, type-check, test-matrix 3.10-3.12, test-apple-silicon, tests. tests/test_responses_api.py -> 19 passed on head, combined test_server.py + test_responses_api.py -> 53 passed. Approved.

Thump604 · 2026-04-13T19:18:13Z

Hey @krystophny - nice work on this. The spec conformance is thorough and it integrates cleanly with the existing server. Currently conflicting with main - can you rebase? Should be good to merge after that.

krystophny · 2026-04-14T09:31:42Z

Rebased on current upstream/main and force-pushed. Re-ran python -m pytest tests/test_responses_api.py tests/test_server.py -q locally (56 passed, 3 deselected).

Thump604

Thorough line-by-line review of the rebased diff on head e081695.

Approve. The spec conformance is solid for the local coding-agent subset, the integration with existing server.py patterns is clean, the bounded store prevents memory leaks, and the 19 tests cover the key behavioral paths.

Observations (none blocking):

Streaming + persistence round-trip: the non-streaming path has explicit test coverage for previous_response_id chaining. The streaming path has the same persistence logic but no dedicated replay-after-stream test. Low risk since the code paths are structurally identical, but worth adding if you're doing a follow-up pass.
Concurrency on _responses_store: the module-level OrderedDict is fine for single-worker uvicorn, but not thread/process-safe. If multi-worker deployment ever becomes relevant, this would need an asyncio.Lock or external store. Not a concern for the current deployment model.
function_call input to text fallback: when preserve_native_tool_format is false, function_call input items are converted to [Calling tool: ...] text in assistant messages rather than native tool_calls format. This works for the current text-based tool calling path but is worth noting as an assumption that would change if native tool format becomes the default.
ResponsesRequest.input union type: the | dict fallback in the input union means Pydantic won't reject malformed items at validation time — they'll be silently passed through to the dict-handling branch in _responses_input_to_chat_messages. This is probably intentional (forward compatibility with new item types from clients), but it does mean validation errors surface as runtime behavior (skipped items with a log message) rather than 422s.

All four are "nice to know" observations, not merge blockers. Good work on the narrowing and the dedicated test coverage for the edge cases (LRU eviction, store=false, reasoning items, incomplete status).

Thump604

Looks good — approving. Clean integration with the existing server patterns and solid test coverage across the key paths.

Four non-blocking observations for future iterations:

Streaming persistence gap: no test verifies that a streamed response is retrievable via previous_response_id after the stream completes. Worth adding if replay-after-stream is a supported path.
Module-level OrderedDict concurrency: the _responses_store is a module-level OrderedDict without locking. Fine for single-worker uvicorn, but will need a lock if multi-worker ever lands.
function_call text fallback: _convert_function_call_to_text assumes the model will produce parseable tool-call text. If the model hallucinates malformed JSON, the response will contain the raw text as arguments. Not a bug — just worth documenting the contract.
str | dict union on input items: Pydantic will silently coerce some invalid inputs rather than rejecting them. Not a merge blocker but something to watch for in conformance testing.

CI all green, merge state clean. Nice work on the rebase.

janhilgard

Code Review — /v1/responses endpoint

Thorough implementation of the OpenAI Responses API subset needed for local coding-agent workflows. The scope is well-chosen: text messages, function tools, function call outputs, streaming, previous_response_id replay, reasoning items, and an LRU-bounded response store.

Strengths

Clean separation of concerns: responses_models.py (316 lines) defines all Pydantic models independently from the endpoint logic in server.py. The conversion pipeline is clear: ResponsesRequest -> _responses_input_to_chat_messages -> ChatCompletionRequest -> engine.chat -> _build_responses_output_items -> ResponseObject.
Correct previous_response_id chaining: The persistence layer stores messages in chat-completions form, and replay correctly reconstructs the full conversation history without leaking instructions across chains. The test coverage for multi-hop chaining (test_previous_response_id_chains_across_multiple_follow_ups) and instruction isolation is good.
System message merging: Multiple system messages (from instructions, developer role items, and unsupported-tool warnings) are correctly merged into a single leading system message. This avoids template issues with models that only support one system message.
Streaming event conformance: The SSE event sequence follows the expected pattern: response.created -> response.in_progress -> deltas -> response.output_item.done -> response.completed. Sequence numbers are monotonically increasing. Reasoning and text items are properly separated in the streaming path.
LRU store with store=false support: The OrderedDict-based store with popitem(last=False) eviction is simple and correct for the single-process case.

Observations (non-blocking, as Thump604 already noted most of these)

Streaming persistence gap: No test verifies that a streamed response is retrievable via previous_response_id. The code does persist in the streaming path (_stream_responses_request stores at the end), but a test would lock this in.
Concurrency on _responses_store: The OrderedDict is module-level without locking. Under concurrent requests, popitem / __setitem__ interleaving could cause issues. For the typical single-user local use case this is fine, but worth documenting the assumption.
Function call output mapping uses text fallback: function_call input items are mapped to assistant messages with "[Calling tool: ...]" text content rather than proper tool_calls structure when the engine doesn't support native tool format. This is pragmatic for local models but means the round-trip isn't lossless for tool-calling conversations. The test test_function_call_output_input_is_mapped_cleanly verifies this behavior correctly.
_tool_parser_instance shared across endpoints: The streaming path in _stream_responses_request reuses and resets the global _tool_parser_instance. This is the same pattern as the chat completions endpoint, so it's consistent, but worth noting that concurrent streaming requests to both /v1/responses and /v1/chat/completions could interfere.

Test coverage

19 tests covering: basic response, previous_response_id chaining (single and multi-hop), instruction isolation, developer role normalization, instruction+developer merge, function call mapping, unsupported tools, function call response items, store=false, LRU eviction, streaming SSE events, streaming metadata monotonicity, json_object rejection, reasoning configuration, reasoning input items, and incomplete (length) responses.

LGTM. Solid spec conformance for the targeted subset, clean integration with existing server patterns.

Thump604 · 2026-04-17T12:26:04Z

This has merge conflicts with current main (several recent merges changed server.py). Needs a rebase before it can merge.

Implements the Responses API subset needed for local coding-agent workflows: text messages, function tools, function call outputs, streaming and non-streaming, previous_response_id replay backed by a bounded LRU response store (max 1000 entries), reasoning input items converted to assistant messages for model context, and reasoning config silently ignored instead of raised mid-stream. developer and instructions are merged into a single leading system prompt. New vllm_mlx/api/responses_models.py defines the request, response, item, and streaming-event Pydantic models. vllm_mlx/api/__init__.py re-exports them. vllm_mlx/server.py wires the endpoint, the input and output conversion pipeline (ResponsesRequest -> chat messages -> engine.chat -> ResponseObject), and the streaming SSE event sequence (response.created -> in_progress -> deltas -> output_item.done -> completed) with monotonic sequence numbers. tests/test_responses_api.py covers 19 cases including single and multi-hop previous_response_id chaining, instruction isolation, developer role normalization, unsupported tools, store=False, LRU eviction, SSE lifecycle, reasoning config and items, and incomplete (length) responses.

krystophny · 2026-04-17T13:19:33Z

Rebased onto current upstream/main (b0a79f5) and force-pushed. Two conflict regions in vllm_mlx/server.py:

Top-of-file imports: kept both from pydantic import BaseModel (needed for _responses_sse_event's BaseModel | dict payload) and from starlette.routing import Match (added to main for the 404 route-matcher).
Block after _normalize_messages: kept both _get_engine_tokenizer (added to main for constrained decoding) and the /v1/responses endpoint. The endpoint still uses _disconnect_guard and _normalize_messages which both exist on current main.

Net diff is +1934/-1 across the same 3 files as before (vllm_mlx/server.py, vllm_mlx/api/responses_models.py, vllm_mlx/api/__init__.py) plus tests/test_responses_api.py. Squashed the branch into a single commit for a clean rebase.

Local verification: python -m py_compile vllm_mlx/server.py passes. Waiting on CI.

Implements the Responses API subset needed for local coding-agent workflows: text messages, function tools, function call outputs, streaming and non-streaming, previous_response_id replay backed by a bounded LRU response store (max 1000 entries), reasoning input items converted to assistant messages for model context, and reasoning config silently ignored instead of raised mid-stream. developer and instructions are merged into a single leading system prompt. New vllm_mlx/api/responses_models.py defines the request, response, item, and streaming-event Pydantic models. vllm_mlx/api/__init__.py re-exports them. vllm_mlx/server.py wires the endpoint, the input and output conversion pipeline (ResponsesRequest -> chat messages -> engine.chat -> ResponseObject), and the streaming SSE event sequence (response.created -> in_progress -> deltas -> output_item.done -> completed) with monotonic sequence numbers. tests/test_responses_api.py covers 19 cases including single and multi-hop previous_response_id chaining, instruction isolation, developer role normalization, unsupported tools, store=False, LRU eviction, SSE lifecycle, reasoning config and items, and incomplete (length) responses.

krystophny mentioned this pull request Mar 24, 2026

responses: normalize developer and instructions for Codex #219

Closed

krystophny changed the title ~~Add OpenAI Responses API core~~ server: add OpenAI-compatible /v1/responses endpoint Mar 24, 2026

krystophny force-pushed the feature/openai-responses-api branch from c7f7364 to ad483cc Compare March 24, 2026 12:26

krystophny changed the title ~~server: add OpenAI-compatible /v1/responses endpoint~~ server: add non-streaming OpenAI-compatible /v1/responses endpoint Mar 24, 2026

krystophny changed the title ~~server: add non-streaming OpenAI-compatible /v1/responses endpoint~~ server: add OpenAI-compatible /v1/responses endpoint Mar 24, 2026

This was referenced Mar 25, 2026

[Tracking] Upstream backlog and merge plan computor-org/vllm-mlx#12

Open

server: close out the upstream /v1/responses merge plan computor-org/vllm-mlx#21

Closed

krystophny force-pushed the feature/openai-responses-api branch from df4f9af to 05838da Compare March 25, 2026 22:52

eloe mentioned this pull request Apr 6, 2026

feat: OpenAI Responses API compliance with structured tool calling Blaizzy/mlx-vlm#952

Closed

krystophny force-pushed the feature/openai-responses-api branch from c9f6bdc to fe9fb90 Compare April 9, 2026 07:25

Thump604 approved these changes Apr 9, 2026

View reviewed changes

krystophny force-pushed the feature/openai-responses-api branch from 23d68e5 to 05372b7 Compare April 14, 2026 09:29

Thump604 approved these changes Apr 14, 2026

View reviewed changes

janhilgard approved these changes Apr 15, 2026

View reviewed changes

krystophny force-pushed the feature/openai-responses-api branch from 05372b7 to 422258a Compare April 17, 2026 13:18

janhilgard merged commit e9fd921 into waybarrios:main Apr 17, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: add OpenAI-compatible /v1/responses endpoint#214

server: add OpenAI-compatible /v1/responses endpoint#214
janhilgard merged 1 commit into
waybarrios:mainfrom
computor-org:feature/openai-responses-api

krystophny commented Mar 24, 2026 •

edited

Loading

Uh oh!

Thump604 commented Apr 8, 2026

Uh oh!

krystophny commented Apr 9, 2026

Uh oh!

Thump604 left a comment

Uh oh!

Thump604 commented Apr 13, 2026

Uh oh!

krystophny commented Apr 14, 2026

Uh oh!

Thump604 left a comment

Uh oh!

Thump604 left a comment

Uh oh!

janhilgard left a comment

Uh oh!

Thump604 commented Apr 17, 2026

Uh oh!

krystophny commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

krystophny commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Retained Scope

Intentionally Dropped From This PR

Review Guide

Behavior Checklist

Validation

Uh oh!

Thump604 commented Apr 8, 2026

Scope

Coordination with other open PRs

Recommended landing order

Status

Uh oh!

krystophny commented Apr 9, 2026

Uh oh!

Thump604 left a comment

Choose a reason for hiding this comment

Uh oh!

Thump604 commented Apr 13, 2026

Uh oh!

krystophny commented Apr 14, 2026

Uh oh!

Thump604 left a comment

Choose a reason for hiding this comment

Uh oh!

Thump604 left a comment

Choose a reason for hiding this comment

Uh oh!

janhilgard left a comment

Choose a reason for hiding this comment

Code Review — /v1/responses endpoint

Strengths

Observations (non-blocking, as Thump604 already noted most of these)

Test coverage

Uh oh!

Thump604 commented Apr 17, 2026

Uh oh!

krystophny commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

krystophny commented Mar 24, 2026 •

edited

Loading