Support previous_response_id replay for /responses#23
Conversation
There was a problem hiding this comment.
Pull request overview
Adds support for replaying prior /responses context via previous_response_id, backed by a new in-memory response store and helper utilities to expand/parse Responses-style input items.
Changes:
- Add
ResponseStore(thread-safe LRU) to persist completed response input/output snapshots for replay. - Add
responses_replayhelpers to expandprevious_response_idand convert Responses input items into server chat/messages/images. - Wire replay expansion + store persistence into the
/responsesendpoint and add pure unit tests for replay/store behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
mlx_vlm/server.py |
Adds previous_response_id handling, expands inputs via replay helpers, and saves completed snapshots into a global store. |
mlx_vlm/responses_store.py |
Implements an in-memory LRU store and a replay_input() method to rebuild prior context. |
mlx_vlm/responses_replay.py |
Implements replay expansion and conversion from Responses input items to server chat/messages/images. |
mlx_vlm/tests/test_responses_store.py |
Unit tests for LRU store behavior and replay input reconstruction. |
mlx_vlm/tests/test_responses_replay.py |
Unit tests for replay expansion and Responses input parsing helpers. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _responses_store = ResponseStore() | ||
|
|
||
|
|
There was a problem hiding this comment.
A module-level _responses_store = ResponseStore() will retain recent requests/responses in memory for the life of the process. Because you’re saving expanded_input (which can include base64 input_image payloads and can grow with replay chaining), this can become a significant memory footprint even with LRU eviction. Consider making the store size configurable (env/flag), adding an optional TTL / byte-size cap per entry, or avoiding storing large image payloads when replay is enabled.
| _responses_store = ResponseStore() | |
| def _get_bool_env(name: str, default: bool) -> bool: | |
| value = os.environ.get(name) | |
| if value is None: | |
| return default | |
| return value.strip().lower() in {"1", "true", "yes", "on"} | |
| class _LazyResponseStore: | |
| def __init__(self): | |
| self._enabled = _get_bool_env("RESPONSES_STORE_ENABLED", True) | |
| self._store = None | |
| def _get_store(self): | |
| if not self._enabled: | |
| return None | |
| if self._store is None: | |
| self._store = ResponseStore() | |
| return self._store | |
| def __getattr__(self, name: str): | |
| store = self._get_store() | |
| if store is None: | |
| def _noop(*args, **kwargs): | |
| return None | |
| return _noop | |
| return getattr(store, name) | |
| _responses_store = _LazyResponseStore() |
| try: | ||
| expanded_input = resolve_responses_input_items( | ||
| openai_request.input, | ||
| previous_response_id=openai_request.previous_response_id, | ||
| response_store=_responses_store, | ||
| ) | ||
| chat_messages, images, instructions = responses_input_to_messages( | ||
| expanded_input | ||
| ) | ||
| except LookupError as exc: | ||
| raise HTTPException( | ||
| status_code=404, | ||
| detail=f"Previous response not found: {exc.args[0]}", | ||
| ) from exc | ||
| except ValueError as exc: | ||
| raise HTTPException(status_code=400, detail=str(exc)) from exc |
There was a problem hiding this comment.
previous_response_id handling and error mapping (404 for missing response, 400 for invalid input items) is newly introduced here, but there’s no endpoint-level test coverage exercising these paths (existing /responses tests only assert sampling args forwarding). Adding a lightweight FastAPI TestClient test that seeds the store and verifies replay expansion (and the 404 path) would help prevent regressions.
| item_type = output_dict.get("type", "") | ||
| if item_type == "message": | ||
| content = output_dict.get("content", []) | ||
| output_text_parts = [] | ||
| for part in content: |
There was a problem hiding this comment.
replay_input() only rehydrates assistant output when stored output items have type == "message". However, the /responses non-streaming path stores response.output items shaped like ChatMessage (e.g., {role, content}) without a type field, so those assistant outputs will be silently dropped during replay and chaining will lose assistant context. Consider also accepting ChatMessage-shaped outputs (role/content) here, or normalizing stored outputs to the Responses API message item schema before replaying.
| "total_tokens": prompt_tokens + output_tokens, | ||
| }, | ||
| ) | ||
| _responses_store.save(response_id, expanded_input, response.output) | ||
|
|
There was a problem hiding this comment.
The non-streaming /responses code saves response.output into the replay store, but the output item constructed above does not include a Responses-API type: "message" (it’s currently a {role, content, reasoning} dict). As a result, ResponseStore.replay_input() will not include this assistant output in replay context, breaking previous_response_id for the most common (non-streaming) case. Please align the non-streaming output shape with the streaming MessageItem/type:"message" format (or normalize the output before saving).
Summary
/responsesrequests that referenceprevious_response_idNotes
/responsesacceptsprevious_response_idbut does not replay previous response context Blaizzy/mlx-vlm#1046 is still under discussion