feat: add stop sequences support for both endpoints by eloe · Pull Request #15 · eloe/mlx-vlm

eloe · 2026-04-06T16:16:26Z

Summary\nAccept stop parameter (string or list of up to 4 strings) in both endpoints. Passed as eos_tokens strings. 7 tests.

Accept `stop` parameter (string or list of up to 4 strings) in both /v1/responses and /v1/chat/completions. Stop strings are tokenized and passed as `eos_tokens` to the generation pipeline. New helper: resolve_stop_tokens() converts stop strings to token IDs using the model's tokenizer. Adds 7 tests: endpoint integration (responses + chat/completions), unit tests for resolve_stop_tokens (single, list, None, limit to 4). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

add_eos_token_ids() accepts strings and tokenizes internally. Renamed resolve_stop_tokens → resolve_stop_sequences, returns string list directly. Fixes "can only concatenate str (not int)" error when stop sequences were used. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR adds OpenAI-compatible stop sequence support to both /responses and /chat/completions by normalizing the request’s stop field (string or list) and forwarding it into generation as eos_tokens.

Changes:

Added stop field to both request models (OpenAIRequest, VLMRequest/ChatRequest).
Introduced resolve_stop_sequences() to normalize/limit stop sequences.
Added tests asserting stop is forwarded as eos_tokens in non-streaming requests.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
`mlx_vlm/server.py`	Adds request schema support for `stop`, normalizes it, and forwards it as `eos_tokens` to generation kwargs.
`mlx_vlm/tests/test_server.py`	Adds unit tests for stop forwarding behavior and stop normalization behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-06T16:31:17Z

+        stop_seqs = resolve_stop_sequences(getattr(openai_request, "stop", None))
+        if stop_seqs:
+            generation_kwargs["eos_tokens"] = stop_seqs
+


eos_tokens is added to generation_kwargs, but the streaming path calls stream_generate(...) directly. stream_generate does not currently read eos_tokens (the eos_tokens -> tokenizer.stopping_criteria.add_eos_token_ids(...) logic exists in generate()), so stop sequences won’t take effect when stream=true on /responses. Consider either (a) adding eos_tokens handling to stream_generate, or (b) resetting/augmenting tokenizer.stopping_criteria in the endpoint before starting streaming, and add a test that covers the streamed case.

Suggested change

# `generate()` applies `eos_tokens` to the tokenizer's stopping

# criteria internally, but the streaming path calls

# `stream_generate(...)` directly. Apply the same stopping criteria

# here so streamed `/responses` requests honor caller-provided stop

# sequences as well.

if openai_request.stream:

stopping_criteria = getattr(tokenizer, "stopping_criteria", None)

add_eos_token_ids = getattr(

stopping_criteria, "add_eos_token_ids", None

)

if add_eos_token_ids is None:

raise HTTPException(

status_code=500,

detail=(

"Streaming stop sequences are not supported by the "

"configured tokenizer."

),

)

add_eos_token_ids(stop_seqs)

Copilot · 2026-04-06T16:31:17Z

        generation_kwargs = build_generation_kwargs(request, template_kwargs)

+        # Resolve stop sequences to token IDs
+        stop_seqs = resolve_stop_sequences(getattr(request, "stop", None))
+        if stop_seqs:
+            generation_kwargs["eos_tokens"] = stop_seqs
+


Same issue as /responses: eos_tokens is appended to generation_kwargs, but the streaming implementation uses stream_generate(...) which does not apply eos_tokens. This means stop won’t work for streaming chat completions. Fix by applying eos_tokens in stream_generate or by setting/resetting tokenizer.stopping_criteria before calling it, and add a streaming test to prevent regressions.

Copilot · 2026-04-06T16:31:17Z

        )
        generation_kwargs = build_generation_kwargs(openai_request, template_kwargs)

+        # Resolve stop sequences to token IDs


The comment says “Resolve stop sequences to token IDs”, but resolve_stop_sequences returns strings and relies on StoppingCriteria.add_eos_token_ids to tokenize internally. Please update the comment to avoid implying token IDs are being computed here.

Suggested change

# Resolve stop sequences to token IDs

# Resolve stop sequences for downstream EOS handling

Copilot · 2026-04-06T16:31:18Z

+def resolve_stop_sequences(
+    stop: Optional[Union[str, list]],
+) -> Optional[list]:
+    """Normalize stop sequences for the generation stopping criteria.


resolve_stop_sequences uses very broad types (Optional[Union[str, list]] -> Optional[list]). Since this is a public helper used by both endpoints, tightening to Optional[Union[str, List[str]]] -> Optional[List[str]] (and ideally Sequence[str]) will improve type checking and generated docs, and make it clearer that non-string list entries are not supported.

Copilot · 2026-04-06T16:31:18Z

+    fake_processor = SimpleNamespace(
+        tokenizer=SimpleNamespace(
+            encode=lambda s, add_special_tokens=False: [10, 20],
+        ),
+    )
+    result = server.resolve_stop_sequences("hello")
+    assert result == ["hello"]
+
+
+def test_resolve_stop_sequences_list():
+    """resolve_stop_sequences should handle a list of strings."""
+    call_count = [0]
+    token_map = {0: [10], 1: [20, 30]}
+
+    def fake_encode(s, add_special_tokens=False):
+        idx = call_count[0]
+        call_count[0] += 1
+        return token_map.get(idx, [])
+
+    fake_processor = SimpleNamespace(
+        tokenizer=SimpleNamespace(encode=fake_encode),
+    )
+    result = server.resolve_stop_sequences(["a", "b"])
+    assert result == ["a", "b"]
+
+
+def test_resolve_stop_sequences_none():
+    """resolve_stop_sequences should return None for None input."""
+    assert server.resolve_stop_sequences(None) is None
+
+
+def test_resolve_stop_sequences_limits_to_four():
+    """resolve_stop_sequences should process at most 4 sequences."""
+    call_count = [0]
+
+    def fake_encode(s, add_special_tokens=False):
+        call_count[0] += 1
+        return [call_count[0]]
+
+    fake_processor = SimpleNamespace(
+        tokenizer=SimpleNamespace(encode=fake_encode),
+    )
+    result = server.resolve_stop_sequences(["a", "b", "c", "d", "e", "f"])
+    assert len(result) == 4


These tests create fake_processor / fake_encode fixtures but they are never used (the helper no longer tokenizes). This makes the tests misleading and harder to maintain; either remove the unused setup or add assertions that match the current behavior (e.g., filtering empty / non-string entries and limiting to 4).

Suggested change

fake_processor = SimpleNamespace(

tokenizer=SimpleNamespace(

encode=lambda s, add_special_tokens=False: [10, 20],

),

)

result = server.resolve_stop_sequences("hello")

assert result == ["hello"]

def test_resolve_stop_sequences_list():

"""resolve_stop_sequences should handle a list of strings."""

call_count = [0]

token_map = {0: [10], 1: [20, 30]}

def fake_encode(s, add_special_tokens=False):

idx = call_count[0]

call_count[0] += 1

return token_map.get(idx, [])

fake_processor = SimpleNamespace(

tokenizer=SimpleNamespace(encode=fake_encode),

)

result = server.resolve_stop_sequences(["a", "b"])

assert result == ["a", "b"]

def test_resolve_stop_sequences_none():

"""resolve_stop_sequences should return None for None input."""

assert server.resolve_stop_sequences(None) is None

def test_resolve_stop_sequences_limits_to_four():

"""resolve_stop_sequences should process at most 4 sequences."""

call_count = [0]

def fake_encode(s, add_special_tokens=False):

call_count[0] += 1

return [call_count[0]]

fake_processor = SimpleNamespace(

tokenizer=SimpleNamespace(encode=fake_encode),

)

result = server.resolve_stop_sequences(["a", "b", "c", "d", "e", "f"])

assert len(result) == 4

result = server.resolve_stop_sequences("hello")

assert result == ["hello"]

def test_resolve_stop_sequences_list_filters_invalid_entries():

"""resolve_stop_sequences should keep only non-empty string entries."""

result = server.resolve_stop_sequences(["a", "", None, "b", 3, "c"])

assert result == ["a", "b", "c"]

def test_resolve_stop_sequences_none():

"""resolve_stop_sequences should return None for None input."""

assert server.resolve_stop_sequences(None) is None

def test_resolve_stop_sequences_limits_to_four():

"""resolve_stop_sequences should return at most 4 valid sequences."""

result = server.resolve_stop_sequences(["a", "", "b", None, "c", "d", "e", "f"])

assert result == ["a", "b", "c", "d"]

Copilot · 2026-04-06T16:31:18Z

+def test_chat_completions_stop_string_passed_as_eos_tokens(client):
+    """stop parameter should be resolved to eos_tokens in generate kwargs."""
+    model = SimpleNamespace()
+    processor = SimpleNamespace(
+        tokenizer=SimpleNamespace(
+            chat_template="",
+            encode=lambda s, add_special_tokens=False: [42],
+        ),
+    )
+    config = SimpleNamespace(model_type="test")
+    result = SimpleNamespace(
+        text="Hello",
+        prompt_tokens=5,
+        generation_tokens=1,
+        total_tokens=6,
+        prompt_tps=100.0,
+        generation_tps=50.0,
+        peak_memory=1.0,
+    )
+
+    with (
+        patch.object(server, "get_cached_model", return_value=(model, processor, config)),
+        patch.object(server, "apply_chat_template", return_value="prompt"),
+        patch.object(server, "generate", return_value=result) as mock_gen,
+    ):
+        resp = client.post(
+            "/chat/completions",
+            json={
+                "model": "demo",
+                "messages": [{"role": "user", "content": "hello"}],
+                "stop": ["\n\n", "</s>"],
+            },


Test name says “stop_string” but the request sends a list of stop sequences. Rename to reflect list input (or change the payload to a single string) to keep intent clear.

- Fix comment "token IDs" → "strings" (eos_tokens takes strings) - Tighten resolve_stop_sequences type hints to List[str] - Remove unused fake_processor/fake_encode from tests - Rename misleading test names to drop "string" from name Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eloe and others added 2 commits April 5, 2026 14:31

eloe requested a review from Copilot April 6, 2026 16:28

Copilot started reviewing on behalf of eloe April 6, 2026 16:28 View session

Copilot AI reviewed Apr 6, 2026

View reviewed changes

eloe mentioned this pull request Apr 8, 2026

Combined server enhancements: OpenAI API compliance, prompt caching, concurrency #21

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add stop sequences support for both endpoints#15

feat: add stop sequences support for both endpoints#15
eloe wants to merge 3 commits into
mainfrom
feature/stop-sequences

eloe commented Apr 6, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

+            # `generate()` applies `eos_tokens` to the tokenizer's stopping
+            # criteria internally, but the streaming path calls
+            # `stream_generate(...)` directly. Apply the same stopping criteria
+            # here so streamed `/responses` requests honor caller-provided stop
+            # sequences as well.
+            if openai_request.stream:
+                stopping_criteria = getattr(tokenizer, "stopping_criteria", None)
+                add_eos_token_ids = getattr(
+                    stopping_criteria, "add_eos_token_ids", None
+                )
+                if add_eos_token_ids is None:
+                    raise HTTPException(
+                        status_code=500,
+                        detail=(
+                            "Streaming stop sequences are not supported by the "
+                            "configured tokenizer."
+                        ),
+                    )
+                add_eos_token_ids(stop_seqs)

	# Resolve stop sequences to token IDs
	# Resolve stop sequences for downstream EOS handling

Conversation

eloe commented Apr 6, 2026

Summary\nAccept stop parameter (string or list of up to 4 strings) in both endpoints. Passed as eos_tokens strings. 7 tests.

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants