Skip to content

Support previous_response_id replay for /responses#23

Draft
eloe wants to merge 2 commits into
mainfrom
codex/issue-1046-previous-response-id
Draft

Support previous_response_id replay for /responses#23
eloe wants to merge 2 commits into
mainfrom
codex/issue-1046-previous-response-id

Conversation

@eloe
Copy link
Copy Markdown
Owner

@eloe eloe commented Apr 28, 2026

Summary

  • add replay expansion for /responses requests that reference previous_response_id
  • persist completed response input/output snapshots in the in-memory response store
  • add pure unit tests for replay chaining without importing the MLX server stack

Notes

Copilot AI review requested due to automatic review settings April 28, 2026 04:47
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds support for replaying prior /responses context via previous_response_id, backed by a new in-memory response store and helper utilities to expand/parse Responses-style input items.

Changes:

  • Add ResponseStore (thread-safe LRU) to persist completed response input/output snapshots for replay.
  • Add responses_replay helpers to expand previous_response_id and convert Responses input items into server chat/messages/images.
  • Wire replay expansion + store persistence into the /responses endpoint and add pure unit tests for replay/store behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
mlx_vlm/server.py Adds previous_response_id handling, expands inputs via replay helpers, and saves completed snapshots into a global store.
mlx_vlm/responses_store.py Implements an in-memory LRU store and a replay_input() method to rebuild prior context.
mlx_vlm/responses_replay.py Implements replay expansion and conversion from Responses input items to server chat/messages/images.
mlx_vlm/tests/test_responses_store.py Unit tests for LRU store behavior and replay input reconstruction.
mlx_vlm/tests/test_responses_replay.py Unit tests for replay expansion and Responses input parsing helpers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mlx_vlm/server.py
Comment on lines +61 to 63
_responses_store = ResponseStore()


Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A module-level _responses_store = ResponseStore() will retain recent requests/responses in memory for the life of the process. Because you’re saving expanded_input (which can include base64 input_image payloads and can grow with replay chaining), this can become a significant memory footprint even with LRU eviction. Consider making the store size configurable (env/flag), adding an optional TTL / byte-size cap per entry, or avoiding storing large image payloads when replay is enabled.

Suggested change
_responses_store = ResponseStore()
def _get_bool_env(name: str, default: bool) -> bool:
value = os.environ.get(name)
if value is None:
return default
return value.strip().lower() in {"1", "true", "yes", "on"}
class _LazyResponseStore:
def __init__(self):
self._enabled = _get_bool_env("RESPONSES_STORE_ENABLED", True)
self._store = None
def _get_store(self):
if not self._enabled:
return None
if self._store is None:
self._store = ResponseStore()
return self._store
def __getattr__(self, name: str):
store = self._get_store()
if store is None:
def _noop(*args, **kwargs):
return None
return _noop
return getattr(store, name)
_responses_store = _LazyResponseStore()

Copilot uses AI. Check for mistakes.
Comment thread mlx_vlm/server.py
Comment on lines +1584 to +1599
try:
expanded_input = resolve_responses_input_items(
openai_request.input,
previous_response_id=openai_request.previous_response_id,
response_store=_responses_store,
)
chat_messages, images, instructions = responses_input_to_messages(
expanded_input
)
except LookupError as exc:
raise HTTPException(
status_code=404,
detail=f"Previous response not found: {exc.args[0]}",
) from exc
except ValueError as exc:
raise HTTPException(status_code=400, detail=str(exc)) from exc
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous_response_id handling and error mapping (404 for missing response, 400 for invalid input items) is newly introduced here, but there’s no endpoint-level test coverage exercising these paths (existing /responses tests only assert sampling args forwarding). Adding a lightweight FastAPI TestClient test that seeds the store and verifies replay expansion (and the 404 path) would help prevent regressions.

Copilot uses AI. Check for mistakes.
Comment on lines +78 to +82
item_type = output_dict.get("type", "")
if item_type == "message":
content = output_dict.get("content", [])
output_text_parts = []
for part in content:
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replay_input() only rehydrates assistant output when stored output items have type == "message". However, the /responses non-streaming path stores response.output items shaped like ChatMessage (e.g., {role, content}) without a type field, so those assistant outputs will be silently dropped during replay and chaining will lose assistant context. Consider also accepting ChatMessage-shaped outputs (role/content) here, or normalizing stored outputs to the Responses API message item schema before replaying.

Copilot uses AI. Check for mistakes.
Comment thread mlx_vlm/server.py
Comment on lines 1861 to 1865
"total_tokens": prompt_tokens + output_tokens,
},
)
_responses_store.save(response_id, expanded_input, response.output)

Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-streaming /responses code saves response.output into the replay store, but the output item constructed above does not include a Responses-API type: "message" (it’s currently a {role, content, reasoning} dict). As a result, ResponseStore.replay_input() will not include this assistant output in replay context, breaking previous_response_id for the most common (non-streaming) case. Please align the non-streaming output shape with the streaming MessageItem/type:"message" format (or normalize the output before saving).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants