server: add OpenAI-compatible /v1/responses endpoint#214
Conversation
c7f7364 to
ad483cc
Compare
df4f9af to
05838da
Compare
Add full OpenAI Responses API (/v1/responses) compliance including: - Structured function_call output items (parsed from model text) - function_call_output input items for multi-turn tool use - previous_response_id with LRU response store (256 entries) - instructions field with developer-to-system role normalization - "text" type alias accepted alongside "input_text" - tools/tool_choice passthrough to chat template and response echo - Streaming SSE with sequence_number and [DONE] sentinel - incomplete_details for length-truncated responses - parallel_tool_calls, metadata field support New files: - responses_models.py: Self-contained Pydantic models for Responses API - responses_store.py: Thread-safe LRU store for response replay - tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming) Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@waybarrios, @krystophny: independent review and coordination note on this PR. ScopeAdds an OpenAI-compatible Coordination with other open PRsThis PR has substantial overlap with two other currently-open PRs that are smaller in scope and target the same gaps incidentally:
Recommended landing orderLand the smaller focused PRs first, then rebase #214 to a smaller scope:
That gives the maintainer three smaller reviews instead of one 1992-line review, and the failure mode of any one PR being wrong is bounded to that PR rather than to a single big merge. If waybarrios prefers to land #214 as-is for speed, that also works. The duplicate work between #214 and #240/#218 is not destructive (the same normalization logic just lives in the new endpoint instead of the old). The risk is that the duplication will rot apart over time as one path is updated and the other is not. StatusPR is MERGEABLE on current main per the PR JSON. Last activity Mar 27 (~10 days ago). I have not done a line-by-line review of all 1992 lines, but the architectural shape is reasonable for a |
Add full OpenAI Responses API (/v1/responses) compliance including: - Structured function_call output items (parsed from model text) - function_call_output input items for multi-turn tool use - previous_response_id with LRU response store (256 entries) - instructions field with developer-to-system role normalization - "text" type alias accepted alongside "input_text" - tools/tool_choice passthrough to chat template and response echo - Streaming SSE with sequence_number and [DONE] sentinel - incomplete_details for length-truncated responses - parallel_tool_calls, metadata field support New files: - responses_models.py: Self-contained Pydantic models for Responses API - responses_store.py: Thread-safe LRU store for response replay - tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming) Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
c9f6bdc to
fe9fb90
Compare
|
Rebuilt this branch on current upstream/main ( Dropped the incidental I also tightened the dedicated coverage around the retained scope:
Validation:
|
Thump604
left a comment
There was a problem hiding this comment.
Rebuild confirmed on head 23d68e5 against upstream main b4fa030. The narrowing to a Responses-only diff resolves the overlap with #218 cleanly: the file surface is now tests/test_responses_api.py (+654 new), vllm_mlx/api/responses_models.py (+316 new), vllm_mlx/api/init.py (+34), vllm_mlx/server.py (+930/-1). No chat_template_kwargs or simple-engine hunks remain, so #218 and #214 can now merge in either order without conflict.
The retained scope covers:
- /v1/responses endpoint with streaming + non-streaming, Pydantic request / response / event models
- previous_response_id replay backed by a stored replayable message history with LRU cap (1000 entries)
- reasoning input items converted to assistant context instead of crashing the stream
- reasoning config ignored rather than raised mid-stream
- store=False path, 404 on missing response id, LRU eviction
The reasoning-items-as-assistant-context conversion is the key behavioral fix on top of the original #28-era rejection path: raising HTTPException inside an already-started SSE stream caused "response already started" crashes, and the new path converts reasoning content to assistant messages so the model sees its prior chain-of-thought instead.
CI green across lint, type-check, test-matrix 3.10-3.12, test-apple-silicon, tests. tests/test_responses_api.py -> 19 passed on head, combined test_server.py + test_responses_api.py -> 53 passed. Approved.
|
Hey @krystophny - nice work on this. The spec conformance is thorough and it integrates cleanly with the existing server. Currently conflicting with main - can you rebase? Should be good to merge after that. |
23d68e5 to
05372b7
Compare
|
Rebased on current upstream/main and force-pushed. Re-ran python -m pytest tests/test_responses_api.py tests/test_server.py -q locally (56 passed, 3 deselected). |
Thump604
left a comment
There was a problem hiding this comment.
Thorough line-by-line review of the rebased diff on head e081695.
Approve. The spec conformance is solid for the local coding-agent subset, the integration with existing server.py patterns is clean, the bounded store prevents memory leaks, and the 19 tests cover the key behavioral paths.
Observations (none blocking):
-
Streaming + persistence round-trip: the non-streaming path has explicit test coverage for
previous_response_idchaining. The streaming path has the same persistence logic but no dedicated replay-after-stream test. Low risk since the code paths are structurally identical, but worth adding if you're doing a follow-up pass. -
Concurrency on
_responses_store: the module-levelOrderedDictis fine for single-worker uvicorn, but not thread/process-safe. If multi-worker deployment ever becomes relevant, this would need anasyncio.Lockor external store. Not a concern for the current deployment model. -
function_callinput to text fallback: whenpreserve_native_tool_formatis false,function_callinput items are converted to[Calling tool: ...]text in assistant messages rather than nativetool_callsformat. This works for the current text-based tool calling path but is worth noting as an assumption that would change if native tool format becomes the default. -
ResponsesRequest.inputunion type: the| dictfallback in the input union means Pydantic won't reject malformed items at validation time — they'll be silently passed through to the dict-handling branch in_responses_input_to_chat_messages. This is probably intentional (forward compatibility with new item types from clients), but it does mean validation errors surface as runtime behavior (skipped items with a log message) rather than 422s.
All four are "nice to know" observations, not merge blockers. Good work on the narrowing and the dedicated test coverage for the edge cases (LRU eviction, store=false, reasoning items, incomplete status).
Thump604
left a comment
There was a problem hiding this comment.
Looks good — approving. Clean integration with the existing server patterns and solid test coverage across the key paths.
Four non-blocking observations for future iterations:
-
Streaming persistence gap: no test verifies that a streamed response is retrievable via
previous_response_idafter the stream completes. Worth adding if replay-after-stream is a supported path. -
Module-level OrderedDict concurrency: the
_responses_storeis a module-levelOrderedDictwithout locking. Fine for single-worker uvicorn, but will need a lock if multi-worker ever lands. -
function_call text fallback:
_convert_function_call_to_textassumes the model will produce parseable tool-call text. If the model hallucinates malformed JSON, the response will contain the raw text asarguments. Not a bug — just worth documenting the contract. -
str | dictunion on input items: Pydantic will silently coerce some invalid inputs rather than rejecting them. Not a merge blocker but something to watch for in conformance testing.
CI all green, merge state clean. Nice work on the rebase.
janhilgard
left a comment
There was a problem hiding this comment.
Code Review — /v1/responses endpoint
Thorough implementation of the OpenAI Responses API subset needed for local coding-agent workflows. The scope is well-chosen: text messages, function tools, function call outputs, streaming, previous_response_id replay, reasoning items, and an LRU-bounded response store.
Strengths
-
Clean separation of concerns:
responses_models.py(316 lines) defines all Pydantic models independently from the endpoint logic inserver.py. The conversion pipeline is clear:ResponsesRequest->_responses_input_to_chat_messages->ChatCompletionRequest-> engine.chat ->_build_responses_output_items->ResponseObject. -
Correct
previous_response_idchaining: The persistence layer stores messages in chat-completions form, and replay correctly reconstructs the full conversation history without leakinginstructionsacross chains. The test coverage for multi-hop chaining (test_previous_response_id_chains_across_multiple_follow_ups) and instruction isolation is good. -
System message merging: Multiple system messages (from
instructions,developerrole items, and unsupported-tool warnings) are correctly merged into a single leading system message. This avoids template issues with models that only support one system message. -
Streaming event conformance: The SSE event sequence follows the expected pattern:
response.created->response.in_progress-> deltas ->response.output_item.done->response.completed. Sequence numbers are monotonically increasing. Reasoning and text items are properly separated in the streaming path. -
LRU store with
store=falsesupport: TheOrderedDict-based store withpopitem(last=False)eviction is simple and correct for the single-process case.
Observations (non-blocking, as Thump604 already noted most of these)
-
Streaming persistence gap: No test verifies that a streamed response is retrievable via
previous_response_id. The code does persist in the streaming path (_stream_responses_requeststores at the end), but a test would lock this in. -
Concurrency on
_responses_store: TheOrderedDictis module-level without locking. Under concurrent requests,popitem/__setitem__interleaving could cause issues. For the typical single-user local use case this is fine, but worth documenting the assumption. -
Function call output mapping uses text fallback:
function_callinput items are mapped to assistant messages with"[Calling tool: ...]"text content rather than propertool_callsstructure when the engine doesn't support native tool format. This is pragmatic for local models but means the round-trip isn't lossless for tool-calling conversations. The testtest_function_call_output_input_is_mapped_cleanlyverifies this behavior correctly. -
_tool_parser_instanceshared across endpoints: The streaming path in_stream_responses_requestreuses and resets the global_tool_parser_instance. This is the same pattern as the chat completions endpoint, so it's consistent, but worth noting that concurrent streaming requests to both/v1/responsesand/v1/chat/completionscould interfere.
Test coverage
19 tests covering: basic response, previous_response_id chaining (single and multi-hop), instruction isolation, developer role normalization, instruction+developer merge, function call mapping, unsupported tools, function call response items, store=false, LRU eviction, streaming SSE events, streaming metadata monotonicity, json_object rejection, reasoning configuration, reasoning input items, and incomplete (length) responses.
LGTM. Solid spec conformance for the targeted subset, clean integration with existing server patterns.
|
This has merge conflicts with current main (several recent merges changed server.py). Needs a rebase before it can merge. |
Implements the Responses API subset needed for local coding-agent workflows: text messages, function tools, function call outputs, streaming and non-streaming, previous_response_id replay backed by a bounded LRU response store (max 1000 entries), reasoning input items converted to assistant messages for model context, and reasoning config silently ignored instead of raised mid-stream. developer and instructions are merged into a single leading system prompt. New vllm_mlx/api/responses_models.py defines the request, response, item, and streaming-event Pydantic models. vllm_mlx/api/__init__.py re-exports them. vllm_mlx/server.py wires the endpoint, the input and output conversion pipeline (ResponsesRequest -> chat messages -> engine.chat -> ResponseObject), and the streaming SSE event sequence (response.created -> in_progress -> deltas -> output_item.done -> completed) with monotonic sequence numbers. tests/test_responses_api.py covers 19 cases including single and multi-hop previous_response_id chaining, instruction isolation, developer role normalization, unsupported tools, store=False, LRU eviction, SSE lifecycle, reasoning config and items, and incomplete (length) responses.
05372b7 to
422258a
Compare
|
Rebased onto current upstream/main (b0a79f5) and force-pushed. Two conflict regions in vllm_mlx/server.py:
Net diff is +1934/-1 across the same 3 files as before ( Local verification: |
Implements the Responses API subset needed for local coding-agent workflows: text messages, function tools, function call outputs, streaming and non-streaming, previous_response_id replay backed by a bounded LRU response store (max 1000 entries), reasoning input items converted to assistant messages for model context, and reasoning config silently ignored instead of raised mid-stream. developer and instructions are merged into a single leading system prompt. New vllm_mlx/api/responses_models.py defines the request, response, item, and streaming-event Pydantic models. vllm_mlx/api/__init__.py re-exports them. vllm_mlx/server.py wires the endpoint, the input and output conversion pipeline (ResponsesRequest -> chat messages -> engine.chat -> ResponseObject), and the streaming SSE event sequence (response.created -> in_progress -> deltas -> output_item.done -> completed) with monotonic sequence numbers. tests/test_responses_api.py covers 19 cases including single and multi-hop previous_response_id chaining, instruction isolation, developer role normalization, unsupported tools, store=False, LRU eviction, SSE lifecycle, reasoning config and items, and incomplete (length) responses.
Summary
Rebuilt on current
upstream/main(b4fa030) as a narrower/v1/responsesdiff for local coding-agent workflows.Retained Scope
/v1/responsesendpoint with streaming and non-streaming responsesfunction_call, andfunction_call_outputprevious_response_idreplay backed by a stored replayable message history_responses_storeto avoid unbounded response accumulationIntentionally Dropped From This PR
chat_template_kwargsforwarding and related simple-engine changes; that overlap stays in#218Review Guide
vllm_mlx/server.py: request conversion, response assembly, SSE streaming, persistencevllm_mlx/api/responses_models.py: request/response/event modelstests/test_responses_api.py: endpoint coverage for replay, persistence, reasoning, and SSE lifecycleBehavior Checklist
previous_response_idreplays prior replayable items, but does not replay priorinstructionsstore=Falseskips persistence, and missing or evicted response ids return404response.created,response.in_progress, text deltas, andresponse.completedwith monotonicsequence_numbertext.format.type="json_object"is rejected; reasoning config is ignored; reasoning input items are acceptedValidation