feat: add stop sequences support for both endpoints#15
Conversation
Accept `stop` parameter (string or list of up to 4 strings) in both /v1/responses and /v1/chat/completions. Stop strings are tokenized and passed as `eos_tokens` to the generation pipeline. New helper: resolve_stop_tokens() converts stop strings to token IDs using the model's tokenizer. Adds 7 tests: endpoint integration (responses + chat/completions), unit tests for resolve_stop_tokens (single, list, None, limit to 4). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
add_eos_token_ids() accepts strings and tokenizes internally. Renamed resolve_stop_tokens → resolve_stop_sequences, returns string list directly. Fixes "can only concatenate str (not int)" error when stop sequences were used. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds OpenAI-compatible stop sequence support to both /responses and /chat/completions by normalizing the request’s stop field (string or list) and forwarding it into generation as eos_tokens.
Changes:
- Added
stopfield to both request models (OpenAIRequest,VLMRequest/ChatRequest). - Introduced
resolve_stop_sequences()to normalize/limit stop sequences. - Added tests asserting
stopis forwarded aseos_tokensin non-streaming requests.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
mlx_vlm/server.py |
Adds request schema support for stop, normalizes it, and forwards it as eos_tokens to generation kwargs. |
mlx_vlm/tests/test_server.py |
Adds unit tests for stop forwarding behavior and stop normalization behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| stop_seqs = resolve_stop_sequences(getattr(openai_request, "stop", None)) | ||
| if stop_seqs: | ||
| generation_kwargs["eos_tokens"] = stop_seqs | ||
|
|
There was a problem hiding this comment.
eos_tokens is added to generation_kwargs, but the streaming path calls stream_generate(...) directly. stream_generate does not currently read eos_tokens (the eos_tokens -> tokenizer.stopping_criteria.add_eos_token_ids(...) logic exists in generate()), so stop sequences won’t take effect when stream=true on /responses. Consider either (a) adding eos_tokens handling to stream_generate, or (b) resetting/augmenting tokenizer.stopping_criteria in the endpoint before starting streaming, and add a test that covers the streamed case.
| # `generate()` applies `eos_tokens` to the tokenizer's stopping | |
| # criteria internally, but the streaming path calls | |
| # `stream_generate(...)` directly. Apply the same stopping criteria | |
| # here so streamed `/responses` requests honor caller-provided stop | |
| # sequences as well. | |
| if openai_request.stream: | |
| stopping_criteria = getattr(tokenizer, "stopping_criteria", None) | |
| add_eos_token_ids = getattr( | |
| stopping_criteria, "add_eos_token_ids", None | |
| ) | |
| if add_eos_token_ids is None: | |
| raise HTTPException( | |
| status_code=500, | |
| detail=( | |
| "Streaming stop sequences are not supported by the " | |
| "configured tokenizer." | |
| ), | |
| ) | |
| add_eos_token_ids(stop_seqs) |
| generation_kwargs = build_generation_kwargs(request, template_kwargs) | ||
|
|
||
| # Resolve stop sequences to token IDs | ||
| stop_seqs = resolve_stop_sequences(getattr(request, "stop", None)) | ||
| if stop_seqs: | ||
| generation_kwargs["eos_tokens"] = stop_seqs | ||
|
|
There was a problem hiding this comment.
Same issue as /responses: eos_tokens is appended to generation_kwargs, but the streaming implementation uses stream_generate(...) which does not apply eos_tokens. This means stop won’t work for streaming chat completions. Fix by applying eos_tokens in stream_generate or by setting/resetting tokenizer.stopping_criteria before calling it, and add a streaming test to prevent regressions.
| ) | ||
| generation_kwargs = build_generation_kwargs(openai_request, template_kwargs) | ||
|
|
||
| # Resolve stop sequences to token IDs |
There was a problem hiding this comment.
The comment says “Resolve stop sequences to token IDs”, but resolve_stop_sequences returns strings and relies on StoppingCriteria.add_eos_token_ids to tokenize internally. Please update the comment to avoid implying token IDs are being computed here.
| # Resolve stop sequences to token IDs | |
| # Resolve stop sequences for downstream EOS handling |
| def resolve_stop_sequences( | ||
| stop: Optional[Union[str, list]], | ||
| ) -> Optional[list]: | ||
| """Normalize stop sequences for the generation stopping criteria. |
There was a problem hiding this comment.
resolve_stop_sequences uses very broad types (Optional[Union[str, list]] -> Optional[list]). Since this is a public helper used by both endpoints, tightening to Optional[Union[str, List[str]]] -> Optional[List[str]] (and ideally Sequence[str]) will improve type checking and generated docs, and make it clearer that non-string list entries are not supported.
| fake_processor = SimpleNamespace( | ||
| tokenizer=SimpleNamespace( | ||
| encode=lambda s, add_special_tokens=False: [10, 20], | ||
| ), | ||
| ) | ||
| result = server.resolve_stop_sequences("hello") | ||
| assert result == ["hello"] | ||
|
|
||
|
|
||
| def test_resolve_stop_sequences_list(): | ||
| """resolve_stop_sequences should handle a list of strings.""" | ||
| call_count = [0] | ||
| token_map = {0: [10], 1: [20, 30]} | ||
|
|
||
| def fake_encode(s, add_special_tokens=False): | ||
| idx = call_count[0] | ||
| call_count[0] += 1 | ||
| return token_map.get(idx, []) | ||
|
|
||
| fake_processor = SimpleNamespace( | ||
| tokenizer=SimpleNamespace(encode=fake_encode), | ||
| ) | ||
| result = server.resolve_stop_sequences(["a", "b"]) | ||
| assert result == ["a", "b"] | ||
|
|
||
|
|
||
| def test_resolve_stop_sequences_none(): | ||
| """resolve_stop_sequences should return None for None input.""" | ||
| assert server.resolve_stop_sequences(None) is None | ||
|
|
||
|
|
||
| def test_resolve_stop_sequences_limits_to_four(): | ||
| """resolve_stop_sequences should process at most 4 sequences.""" | ||
| call_count = [0] | ||
|
|
||
| def fake_encode(s, add_special_tokens=False): | ||
| call_count[0] += 1 | ||
| return [call_count[0]] | ||
|
|
||
| fake_processor = SimpleNamespace( | ||
| tokenizer=SimpleNamespace(encode=fake_encode), | ||
| ) | ||
| result = server.resolve_stop_sequences(["a", "b", "c", "d", "e", "f"]) | ||
| assert len(result) == 4 |
There was a problem hiding this comment.
These tests create fake_processor / fake_encode fixtures but they are never used (the helper no longer tokenizes). This makes the tests misleading and harder to maintain; either remove the unused setup or add assertions that match the current behavior (e.g., filtering empty / non-string entries and limiting to 4).
| fake_processor = SimpleNamespace( | |
| tokenizer=SimpleNamespace( | |
| encode=lambda s, add_special_tokens=False: [10, 20], | |
| ), | |
| ) | |
| result = server.resolve_stop_sequences("hello") | |
| assert result == ["hello"] | |
| def test_resolve_stop_sequences_list(): | |
| """resolve_stop_sequences should handle a list of strings.""" | |
| call_count = [0] | |
| token_map = {0: [10], 1: [20, 30]} | |
| def fake_encode(s, add_special_tokens=False): | |
| idx = call_count[0] | |
| call_count[0] += 1 | |
| return token_map.get(idx, []) | |
| fake_processor = SimpleNamespace( | |
| tokenizer=SimpleNamespace(encode=fake_encode), | |
| ) | |
| result = server.resolve_stop_sequences(["a", "b"]) | |
| assert result == ["a", "b"] | |
| def test_resolve_stop_sequences_none(): | |
| """resolve_stop_sequences should return None for None input.""" | |
| assert server.resolve_stop_sequences(None) is None | |
| def test_resolve_stop_sequences_limits_to_four(): | |
| """resolve_stop_sequences should process at most 4 sequences.""" | |
| call_count = [0] | |
| def fake_encode(s, add_special_tokens=False): | |
| call_count[0] += 1 | |
| return [call_count[0]] | |
| fake_processor = SimpleNamespace( | |
| tokenizer=SimpleNamespace(encode=fake_encode), | |
| ) | |
| result = server.resolve_stop_sequences(["a", "b", "c", "d", "e", "f"]) | |
| assert len(result) == 4 | |
| result = server.resolve_stop_sequences("hello") | |
| assert result == ["hello"] | |
| def test_resolve_stop_sequences_list_filters_invalid_entries(): | |
| """resolve_stop_sequences should keep only non-empty string entries.""" | |
| result = server.resolve_stop_sequences(["a", "", None, "b", 3, "c"]) | |
| assert result == ["a", "b", "c"] | |
| def test_resolve_stop_sequences_none(): | |
| """resolve_stop_sequences should return None for None input.""" | |
| assert server.resolve_stop_sequences(None) is None | |
| def test_resolve_stop_sequences_limits_to_four(): | |
| """resolve_stop_sequences should return at most 4 valid sequences.""" | |
| result = server.resolve_stop_sequences(["a", "", "b", None, "c", "d", "e", "f"]) | |
| assert result == ["a", "b", "c", "d"] |
| def test_chat_completions_stop_string_passed_as_eos_tokens(client): | ||
| """stop parameter should be resolved to eos_tokens in generate kwargs.""" | ||
| model = SimpleNamespace() | ||
| processor = SimpleNamespace( | ||
| tokenizer=SimpleNamespace( | ||
| chat_template="", | ||
| encode=lambda s, add_special_tokens=False: [42], | ||
| ), | ||
| ) | ||
| config = SimpleNamespace(model_type="test") | ||
| result = SimpleNamespace( | ||
| text="Hello", | ||
| prompt_tokens=5, | ||
| generation_tokens=1, | ||
| total_tokens=6, | ||
| prompt_tps=100.0, | ||
| generation_tps=50.0, | ||
| peak_memory=1.0, | ||
| ) | ||
|
|
||
| with ( | ||
| patch.object(server, "get_cached_model", return_value=(model, processor, config)), | ||
| patch.object(server, "apply_chat_template", return_value="prompt"), | ||
| patch.object(server, "generate", return_value=result) as mock_gen, | ||
| ): | ||
| resp = client.post( | ||
| "/chat/completions", | ||
| json={ | ||
| "model": "demo", | ||
| "messages": [{"role": "user", "content": "hello"}], | ||
| "stop": ["\n\n", "</s>"], | ||
| }, |
There was a problem hiding this comment.
Test name says “stop_string” but the request sends a list of stop sequences. Rename to reflect list input (or change the payload to a single string) to keep intent clear.
- Fix comment "token IDs" → "strings" (eos_tokens takes strings) - Tighten resolve_stop_sequences type hints to List[str] - Remove unused fake_processor/fake_encode from tests - Rename misleading test names to drop "string" from name Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix comment "token IDs" → "strings" (eos_tokens takes strings) - Tighten resolve_stop_sequences type hints to List[str] - Remove unused fake_processor/fake_encode from tests - Rename misleading test names to drop "string" from name Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary\nAccept stop parameter (string or list of up to 4 strings) in both endpoints. Passed as eos_tokens strings. 7 tests.