Skip to content

server: add OpenAI-compatible /v1/responses endpoint#214

Merged
janhilgard merged 1 commit into
waybarrios:mainfrom
computor-org:feature/openai-responses-api
Apr 17, 2026
Merged

server: add OpenAI-compatible /v1/responses endpoint#214
janhilgard merged 1 commit into
waybarrios:mainfrom
computor-org:feature/openai-responses-api

Conversation

@krystophny
Copy link
Copy Markdown
Contributor

@krystophny krystophny commented Mar 24, 2026

Summary

Rebuilt on current upstream/main (b4fa030) as a narrower /v1/responses diff for local coding-agent workflows.

Retained Scope

  • OpenAI-compatible /v1/responses endpoint with streaming and non-streaming responses
  • Pydantic models for Responses request, response, and SSE event payloads
  • Responses-to-chat conversion for text messages, function tools, function_call, and function_call_output
  • previous_response_id replay backed by a stored replayable message history
  • reasoning input items converted to assistant context instead of crashing the stream
  • reasoning config logged and ignored instead of raising after the response has started
  • LRU-bounded _responses_store to avoid unbounded response accumulation

Intentionally Dropped From This PR

  • request-level chat_template_kwargs forwarding and related simple-engine changes; that overlap stays in #218
  • unrelated cross-endpoint normalization or chat-path refactors

Review Guide

  • vllm_mlx/server.py: request conversion, response assembly, SSE streaming, persistence
  • vllm_mlx/api/responses_models.py: request/response/event models
  • tests/test_responses_api.py: endpoint coverage for replay, persistence, reasoning, and SSE lifecycle

Behavior Checklist

  • previous_response_id replays prior replayable items, but does not replay prior instructions
  • store=False skips persistence, and missing or evicted response ids return 404
  • unsupported non-function tools are skipped while supported function tools are forwarded
  • streaming emits response.created, response.in_progress, text deltas, and response.completed with monotonic sequence_number
  • text.format.type="json_object" is rejected; reasoning config is ignored; reasoning input items are accepted

Validation

python -m pytest tests/test_responses_api.py -q
# 19 passed

python -m pytest tests/test_server.py tests/test_responses_api.py -q
# 53 passed, 3 deselected

@krystophny krystophny changed the title Add OpenAI Responses API core server: add OpenAI-compatible /v1/responses endpoint Mar 24, 2026
@krystophny krystophny force-pushed the feature/openai-responses-api branch from c7f7364 to ad483cc Compare March 24, 2026 12:26
@krystophny krystophny changed the title server: add OpenAI-compatible /v1/responses endpoint server: add non-streaming OpenAI-compatible /v1/responses endpoint Mar 24, 2026
@krystophny krystophny changed the title server: add non-streaming OpenAI-compatible /v1/responses endpoint server: add OpenAI-compatible /v1/responses endpoint Mar 24, 2026
@krystophny krystophny force-pushed the feature/openai-responses-api branch from df4f9af to 05838da Compare March 25, 2026 22:52
eloe added a commit to eloe/mlx-vlm that referenced this pull request Apr 5, 2026
Add full OpenAI Responses API (/v1/responses) compliance including:

- Structured function_call output items (parsed from model text)
- function_call_output input items for multi-turn tool use
- previous_response_id with LRU response store (256 entries)
- instructions field with developer-to-system role normalization
- "text" type alias accepted alongside "input_text"
- tools/tool_choice passthrough to chat template and response echo
- Streaming SSE with sequence_number and [DONE] sentinel
- incomplete_details for length-truncated responses
- parallel_tool_calls, metadata field support

New files:
- responses_models.py: Self-contained Pydantic models for Responses API
- responses_store.py: Thread-safe LRU store for response replay
- tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming)

Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Thump604
Copy link
Copy Markdown
Collaborator

Thump604 commented Apr 8, 2026

@waybarrios, @krystophny: independent review and coordination note on this PR.

Scope

Adds an OpenAI-compatible /v1/responses endpoint with text messages, function tools, function call outputs, streaming + non-streaming, previous_response_id replay for persisted replayable input items, developer/instructions normalization onto one leading system prompt, request-level chat_template_kwargs forwarding, LRU-bounded response store (max 1000 entries), and reasoning input items converted to assistant messages for model context. 1992+/5- across 8 files.

Coordination with other open PRs

This PR has substantial overlap with two other currently-open PRs that are smaller in scope and target the same gaps incidentally:

  1. PR fix: normalize messages before chat template application #240 (Thump604) "fix: normalize messages before chat template application" — adds _normalize_messages() in server.py that maps non-standard roles (developer to system) and merges consecutive same-role messages, applied in all four request paths. This overlaps with the "developer/instructions normalization onto one leading system prompt" item in server: add OpenAI-compatible /v1/responses endpoint #214`s scope. 236+/0- across 2 files. MERGEABLE on main.

  2. PR chat: forward chat_template_kwargs on simple-engine paths #218 (krystophny, also yours) "chat: forward chat_template_kwargs on simple-engine paths" — plumbs chat_template_kwargs through every simple-engine path that previously dropped it (SimpleEngine multimodal chat, multimodal stream_chat, MTP _stream_generate_text), plus the BatchedEngine paths and the LLM MLXLanguageModel.chat(). This overlaps with the "request-level chat_template_kwargs forwarding" item in server: add OpenAI-compatible /v1/responses endpoint #214`s scope. 237+/13- across 7 files. MERGEABLE on main with 7 tests and a CI workflow update.

Recommended landing order

Land the smaller focused PRs first, then rebase #214 to a smaller scope:

  1. fix: normalize messages before chat template application #240 + chat: forward chat_template_kwargs on simple-engine paths #218 (both small, focused, mergeable, and address gaps that server: add OpenAI-compatible /v1/responses endpoint #214 includes incidentally)
  2. server: add OpenAI-compatible /v1/responses endpoint #214 rebased with the now-redundant normalization and chat_template_kwargs forwarding dropped, leaving just the /v1/responses endpoint, the LRU response store, the previous_response_id replay, and the reasoning-input-item conversion

That gives the maintainer three smaller reviews instead of one 1992-line review, and the failure mode of any one PR being wrong is bounded to that PR rather than to a single big merge.

If waybarrios prefers to land #214 as-is for speed, that also works. The duplicate work between #214 and #240/#218 is not destructive (the same normalization logic just lives in the new endpoint instead of the old). The risk is that the duplication will rot apart over time as one path is updated and the other is not.

Status

PR is MERGEABLE on current main per the PR JSON. Last activity Mar 27 (~10 days ago). I have not done a line-by-line review of all 1992 lines, but the architectural shape is reasonable for a /v1/responses endpoint and the description is detailed enough to support a maintainer review.

eloe added a commit to eloe/mlx-vlm that referenced this pull request Apr 9, 2026
Add full OpenAI Responses API (/v1/responses) compliance including:

- Structured function_call output items (parsed from model text)
- function_call_output input items for multi-turn tool use
- previous_response_id with LRU response store (256 entries)
- instructions field with developer-to-system role normalization
- "text" type alias accepted alongside "input_text"
- tools/tool_choice passthrough to chat template and response echo
- Streaming SSE with sequence_number and [DONE] sentinel
- incomplete_details for length-truncated responses
- parallel_tool_calls, metadata field support

New files:
- responses_models.py: Self-contained Pydantic models for Responses API
- responses_store.py: Thread-safe LRU store for response replay
- tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming)

Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@krystophny krystophny force-pushed the feature/openai-responses-api branch from c9f6bdc to fe9fb90 Compare April 9, 2026 07:25
@krystophny
Copy link
Copy Markdown
Contributor Author

Rebuilt this branch on current upstream/main (b4fa030) as a narrower Responses-only diff.

Dropped the incidental chat_template_kwargs / simple-engine overlap so this PR now stays focused on /v1/responses, request/response models, previous_response_id replay, reasoning-item handling, and the bounded response store.

I also tightened the dedicated coverage around the retained scope:

  • missing previous_response_id -> 404
  • store=False skips persistence
  • LRU eviction of _responses_store
  • reasoning config is ignored without crashing
  • reasoning input items are converted into assistant context
  • streaming SSE lifecycle keeps monotonic sequence_number

Validation:

  • python -m pytest tests/test_responses_api.py -q -> 19 passed
  • python -m pytest tests/test_server.py tests/test_responses_api.py -q -> 53 passed, 3 deselected

Copy link
Copy Markdown
Collaborator

@Thump604 Thump604 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebuild confirmed on head 23d68e5 against upstream main b4fa030. The narrowing to a Responses-only diff resolves the overlap with #218 cleanly: the file surface is now tests/test_responses_api.py (+654 new), vllm_mlx/api/responses_models.py (+316 new), vllm_mlx/api/init.py (+34), vllm_mlx/server.py (+930/-1). No chat_template_kwargs or simple-engine hunks remain, so #218 and #214 can now merge in either order without conflict.

The retained scope covers:

  • /v1/responses endpoint with streaming + non-streaming, Pydantic request / response / event models
  • previous_response_id replay backed by a stored replayable message history with LRU cap (1000 entries)
  • reasoning input items converted to assistant context instead of crashing the stream
  • reasoning config ignored rather than raised mid-stream
  • store=False path, 404 on missing response id, LRU eviction

The reasoning-items-as-assistant-context conversion is the key behavioral fix on top of the original #28-era rejection path: raising HTTPException inside an already-started SSE stream caused "response already started" crashes, and the new path converts reasoning content to assistant messages so the model sees its prior chain-of-thought instead.

CI green across lint, type-check, test-matrix 3.10-3.12, test-apple-silicon, tests. tests/test_responses_api.py -> 19 passed on head, combined test_server.py + test_responses_api.py -> 53 passed. Approved.

@Thump604
Copy link
Copy Markdown
Collaborator

Hey @krystophny - nice work on this. The spec conformance is thorough and it integrates cleanly with the existing server. Currently conflicting with main - can you rebase? Should be good to merge after that.

@krystophny krystophny force-pushed the feature/openai-responses-api branch from 23d68e5 to 05372b7 Compare April 14, 2026 09:29
@krystophny
Copy link
Copy Markdown
Contributor Author

Rebased on current upstream/main and force-pushed. Re-ran python -m pytest tests/test_responses_api.py tests/test_server.py -q locally (56 passed, 3 deselected).

Copy link
Copy Markdown
Collaborator

@Thump604 Thump604 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thorough line-by-line review of the rebased diff on head e081695.

Approve. The spec conformance is solid for the local coding-agent subset, the integration with existing server.py patterns is clean, the bounded store prevents memory leaks, and the 19 tests cover the key behavioral paths.

Observations (none blocking):

  1. Streaming + persistence round-trip: the non-streaming path has explicit test coverage for previous_response_id chaining. The streaming path has the same persistence logic but no dedicated replay-after-stream test. Low risk since the code paths are structurally identical, but worth adding if you're doing a follow-up pass.

  2. Concurrency on _responses_store: the module-level OrderedDict is fine for single-worker uvicorn, but not thread/process-safe. If multi-worker deployment ever becomes relevant, this would need an asyncio.Lock or external store. Not a concern for the current deployment model.

  3. function_call input to text fallback: when preserve_native_tool_format is false, function_call input items are converted to [Calling tool: ...] text in assistant messages rather than native tool_calls format. This works for the current text-based tool calling path but is worth noting as an assumption that would change if native tool format becomes the default.

  4. ResponsesRequest.input union type: the | dict fallback in the input union means Pydantic won't reject malformed items at validation time — they'll be silently passed through to the dict-handling branch in _responses_input_to_chat_messages. This is probably intentional (forward compatibility with new item types from clients), but it does mean validation errors surface as runtime behavior (skipped items with a log message) rather than 422s.

All four are "nice to know" observations, not merge blockers. Good work on the narrowing and the dedicated test coverage for the edge cases (LRU eviction, store=false, reasoning items, incomplete status).

Copy link
Copy Markdown
Collaborator

@Thump604 Thump604 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good — approving. Clean integration with the existing server patterns and solid test coverage across the key paths.

Four non-blocking observations for future iterations:

  1. Streaming persistence gap: no test verifies that a streamed response is retrievable via previous_response_id after the stream completes. Worth adding if replay-after-stream is a supported path.

  2. Module-level OrderedDict concurrency: the _responses_store is a module-level OrderedDict without locking. Fine for single-worker uvicorn, but will need a lock if multi-worker ever lands.

  3. function_call text fallback: _convert_function_call_to_text assumes the model will produce parseable tool-call text. If the model hallucinates malformed JSON, the response will contain the raw text as arguments. Not a bug — just worth documenting the contract.

  4. str | dict union on input items: Pydantic will silently coerce some invalid inputs rather than rejecting them. Not a merge blocker but something to watch for in conformance testing.

CI all green, merge state clean. Nice work on the rebase.

Copy link
Copy Markdown
Collaborator

@janhilgard janhilgard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — /v1/responses endpoint

Thorough implementation of the OpenAI Responses API subset needed for local coding-agent workflows. The scope is well-chosen: text messages, function tools, function call outputs, streaming, previous_response_id replay, reasoning items, and an LRU-bounded response store.

Strengths

  1. Clean separation of concerns: responses_models.py (316 lines) defines all Pydantic models independently from the endpoint logic in server.py. The conversion pipeline is clear: ResponsesRequest -> _responses_input_to_chat_messages -> ChatCompletionRequest -> engine.chat -> _build_responses_output_items -> ResponseObject.

  2. Correct previous_response_id chaining: The persistence layer stores messages in chat-completions form, and replay correctly reconstructs the full conversation history without leaking instructions across chains. The test coverage for multi-hop chaining (test_previous_response_id_chains_across_multiple_follow_ups) and instruction isolation is good.

  3. System message merging: Multiple system messages (from instructions, developer role items, and unsupported-tool warnings) are correctly merged into a single leading system message. This avoids template issues with models that only support one system message.

  4. Streaming event conformance: The SSE event sequence follows the expected pattern: response.created -> response.in_progress -> deltas -> response.output_item.done -> response.completed. Sequence numbers are monotonically increasing. Reasoning and text items are properly separated in the streaming path.

  5. LRU store with store=false support: The OrderedDict-based store with popitem(last=False) eviction is simple and correct for the single-process case.

Observations (non-blocking, as Thump604 already noted most of these)

  1. Streaming persistence gap: No test verifies that a streamed response is retrievable via previous_response_id. The code does persist in the streaming path (_stream_responses_request stores at the end), but a test would lock this in.

  2. Concurrency on _responses_store: The OrderedDict is module-level without locking. Under concurrent requests, popitem / __setitem__ interleaving could cause issues. For the typical single-user local use case this is fine, but worth documenting the assumption.

  3. Function call output mapping uses text fallback: function_call input items are mapped to assistant messages with "[Calling tool: ...]" text content rather than proper tool_calls structure when the engine doesn't support native tool format. This is pragmatic for local models but means the round-trip isn't lossless for tool-calling conversations. The test test_function_call_output_input_is_mapped_cleanly verifies this behavior correctly.

  4. _tool_parser_instance shared across endpoints: The streaming path in _stream_responses_request reuses and resets the global _tool_parser_instance. This is the same pattern as the chat completions endpoint, so it's consistent, but worth noting that concurrent streaming requests to both /v1/responses and /v1/chat/completions could interfere.

Test coverage

19 tests covering: basic response, previous_response_id chaining (single and multi-hop), instruction isolation, developer role normalization, instruction+developer merge, function call mapping, unsupported tools, function call response items, store=false, LRU eviction, streaming SSE events, streaming metadata monotonicity, json_object rejection, reasoning configuration, reasoning input items, and incomplete (length) responses.

LGTM. Solid spec conformance for the targeted subset, clean integration with existing server patterns.

@Thump604
Copy link
Copy Markdown
Collaborator

This has merge conflicts with current main (several recent merges changed server.py). Needs a rebase before it can merge.

Implements the Responses API subset needed for local coding-agent
workflows: text messages, function tools, function call outputs,
streaming and non-streaming, previous_response_id replay backed by a
bounded LRU response store (max 1000 entries), reasoning input items
converted to assistant messages for model context, and reasoning config
silently ignored instead of raised mid-stream. developer and
instructions are merged into a single leading system prompt.

New vllm_mlx/api/responses_models.py defines the request, response,
item, and streaming-event Pydantic models. vllm_mlx/api/__init__.py
re-exports them. vllm_mlx/server.py wires the endpoint, the input and
output conversion pipeline (ResponsesRequest -> chat messages ->
engine.chat -> ResponseObject), and the streaming SSE event sequence
(response.created -> in_progress -> deltas -> output_item.done ->
completed) with monotonic sequence numbers.

tests/test_responses_api.py covers 19 cases including single and
multi-hop previous_response_id chaining, instruction isolation,
developer role normalization, unsupported tools, store=False, LRU
eviction, SSE lifecycle, reasoning config and items, and incomplete
(length) responses.
@krystophny krystophny force-pushed the feature/openai-responses-api branch from 05372b7 to 422258a Compare April 17, 2026 13:18
@krystophny
Copy link
Copy Markdown
Contributor Author

Rebased onto current upstream/main (b0a79f5) and force-pushed. Two conflict regions in vllm_mlx/server.py:

  1. Top-of-file imports: kept both from pydantic import BaseModel (needed for _responses_sse_event's BaseModel | dict payload) and from starlette.routing import Match (added to main for the 404 route-matcher).
  2. Block after _normalize_messages: kept both _get_engine_tokenizer (added to main for constrained decoding) and the /v1/responses endpoint. The endpoint still uses _disconnect_guard and _normalize_messages which both exist on current main.

Net diff is +1934/-1 across the same 3 files as before (vllm_mlx/server.py, vllm_mlx/api/responses_models.py, vllm_mlx/api/__init__.py) plus tests/test_responses_api.py. Squashed the branch into a single commit for a clean rebase.

Local verification: python -m py_compile vllm_mlx/server.py passes. Waiting on CI.

@janhilgard janhilgard merged commit e9fd921 into waybarrios:main Apr 17, 2026
9 checks passed
arozanov pushed a commit to arozanov/vllm-mlx that referenced this pull request Apr 30, 2026
Implements the Responses API subset needed for local coding-agent
workflows: text messages, function tools, function call outputs,
streaming and non-streaming, previous_response_id replay backed by a
bounded LRU response store (max 1000 entries), reasoning input items
converted to assistant messages for model context, and reasoning config
silently ignored instead of raised mid-stream. developer and
instructions are merged into a single leading system prompt.

New vllm_mlx/api/responses_models.py defines the request, response,
item, and streaming-event Pydantic models. vllm_mlx/api/__init__.py
re-exports them. vllm_mlx/server.py wires the endpoint, the input and
output conversion pipeline (ResponsesRequest -> chat messages ->
engine.chat -> ResponseObject), and the streaming SSE event sequence
(response.created -> in_progress -> deltas -> output_item.done ->
completed) with monotonic sequence numbers.

tests/test_responses_api.py covers 19 cases including single and
multi-hop previous_response_id chaining, instruction isolation,
developer role normalization, unsupported tools, store=False, LRU
eviction, SSE lifecycle, reasoning config and items, and incomplete
(length) responses.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants