Skip to content

feat: add stop sequences support for both endpoints#15

Open
eloe wants to merge 3 commits into
mainfrom
feature/stop-sequences
Open

feat: add stop sequences support for both endpoints#15
eloe wants to merge 3 commits into
mainfrom
feature/stop-sequences

Conversation

@eloe
Copy link
Copy Markdown
Owner

@eloe eloe commented Apr 6, 2026

Summary\nAccept stop parameter (string or list of up to 4 strings) in both endpoints. Passed as eos_tokens strings. 7 tests.

eloe and others added 2 commits April 5, 2026 14:31
Accept `stop` parameter (string or list of up to 4 strings) in both
/v1/responses and /v1/chat/completions. Stop strings are tokenized
and passed as `eos_tokens` to the generation pipeline.

New helper: resolve_stop_tokens() converts stop strings to token IDs
using the model's tokenizer.

Adds 7 tests: endpoint integration (responses + chat/completions),
unit tests for resolve_stop_tokens (single, list, None, limit to 4).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
add_eos_token_ids() accepts strings and tokenizes internally.
Renamed resolve_stop_tokens → resolve_stop_sequences, returns
string list directly. Fixes "can only concatenate str (not int)"
error when stop sequences were used.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds OpenAI-compatible stop sequence support to both /responses and /chat/completions by normalizing the request’s stop field (string or list) and forwarding it into generation as eos_tokens.

Changes:

  • Added stop field to both request models (OpenAIRequest, VLMRequest/ChatRequest).
  • Introduced resolve_stop_sequences() to normalize/limit stop sequences.
  • Added tests asserting stop is forwarded as eos_tokens in non-streaming requests.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
mlx_vlm/server.py Adds request schema support for stop, normalizes it, and forwards it as eos_tokens to generation kwargs.
mlx_vlm/tests/test_server.py Adds unit tests for stop forwarding behavior and stop normalization behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mlx_vlm/server.py
stop_seqs = resolve_stop_sequences(getattr(openai_request, "stop", None))
if stop_seqs:
generation_kwargs["eos_tokens"] = stop_seqs

Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eos_tokens is added to generation_kwargs, but the streaming path calls stream_generate(...) directly. stream_generate does not currently read eos_tokens (the eos_tokens -> tokenizer.stopping_criteria.add_eos_token_ids(...) logic exists in generate()), so stop sequences won’t take effect when stream=true on /responses. Consider either (a) adding eos_tokens handling to stream_generate, or (b) resetting/augmenting tokenizer.stopping_criteria in the endpoint before starting streaming, and add a test that covers the streamed case.

Suggested change
# `generate()` applies `eos_tokens` to the tokenizer's stopping
# criteria internally, but the streaming path calls
# `stream_generate(...)` directly. Apply the same stopping criteria
# here so streamed `/responses` requests honor caller-provided stop
# sequences as well.
if openai_request.stream:
stopping_criteria = getattr(tokenizer, "stopping_criteria", None)
add_eos_token_ids = getattr(
stopping_criteria, "add_eos_token_ids", None
)
if add_eos_token_ids is None:
raise HTTPException(
status_code=500,
detail=(
"Streaming stop sequences are not supported by the "
"configured tokenizer."
),
)
add_eos_token_ids(stop_seqs)

Copilot uses AI. Check for mistakes.
Comment thread mlx_vlm/server.py
Comment on lines 1151 to +1157
generation_kwargs = build_generation_kwargs(request, template_kwargs)

# Resolve stop sequences to token IDs
stop_seqs = resolve_stop_sequences(getattr(request, "stop", None))
if stop_seqs:
generation_kwargs["eos_tokens"] = stop_seqs

Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as /responses: eos_tokens is appended to generation_kwargs, but the streaming implementation uses stream_generate(...) which does not apply eos_tokens. This means stop won’t work for streaming chat completions. Fix by applying eos_tokens in stream_generate or by setting/resetting tokenizer.stopping_criteria before calling it, and add a streaming test to prevent regressions.

Copilot uses AI. Check for mistakes.
Comment thread mlx_vlm/server.py Outdated
)
generation_kwargs = build_generation_kwargs(openai_request, template_kwargs)

# Resolve stop sequences to token IDs
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says “Resolve stop sequences to token IDs”, but resolve_stop_sequences returns strings and relies on StoppingCriteria.add_eos_token_ids to tokenize internally. Please update the comment to avoid implying token IDs are being computed here.

Suggested change
# Resolve stop sequences to token IDs
# Resolve stop sequences for downstream EOS handling

Copilot uses AI. Check for mistakes.
Comment thread mlx_vlm/server.py
Comment on lines +641 to +644
def resolve_stop_sequences(
stop: Optional[Union[str, list]],
) -> Optional[list]:
"""Normalize stop sequences for the generation stopping criteria.
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolve_stop_sequences uses very broad types (Optional[Union[str, list]] -> Optional[list]). Since this is a public helper used by both endpoints, tightening to Optional[Union[str, List[str]]] -> Optional[List[str]] (and ideally Sequence[str]) will improve type checking and generated docs, and make it clearer that non-string list entries are not supported.

Copilot uses AI. Check for mistakes.
Comment thread mlx_vlm/tests/test_server.py Outdated
Comment on lines +242 to +285
fake_processor = SimpleNamespace(
tokenizer=SimpleNamespace(
encode=lambda s, add_special_tokens=False: [10, 20],
),
)
result = server.resolve_stop_sequences("hello")
assert result == ["hello"]


def test_resolve_stop_sequences_list():
"""resolve_stop_sequences should handle a list of strings."""
call_count = [0]
token_map = {0: [10], 1: [20, 30]}

def fake_encode(s, add_special_tokens=False):
idx = call_count[0]
call_count[0] += 1
return token_map.get(idx, [])

fake_processor = SimpleNamespace(
tokenizer=SimpleNamespace(encode=fake_encode),
)
result = server.resolve_stop_sequences(["a", "b"])
assert result == ["a", "b"]


def test_resolve_stop_sequences_none():
"""resolve_stop_sequences should return None for None input."""
assert server.resolve_stop_sequences(None) is None


def test_resolve_stop_sequences_limits_to_four():
"""resolve_stop_sequences should process at most 4 sequences."""
call_count = [0]

def fake_encode(s, add_special_tokens=False):
call_count[0] += 1
return [call_count[0]]

fake_processor = SimpleNamespace(
tokenizer=SimpleNamespace(encode=fake_encode),
)
result = server.resolve_stop_sequences(["a", "b", "c", "d", "e", "f"])
assert len(result) == 4
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests create fake_processor / fake_encode fixtures but they are never used (the helper no longer tokenizes). This makes the tests misleading and harder to maintain; either remove the unused setup or add assertions that match the current behavior (e.g., filtering empty / non-string entries and limiting to 4).

Suggested change
fake_processor = SimpleNamespace(
tokenizer=SimpleNamespace(
encode=lambda s, add_special_tokens=False: [10, 20],
),
)
result = server.resolve_stop_sequences("hello")
assert result == ["hello"]
def test_resolve_stop_sequences_list():
"""resolve_stop_sequences should handle a list of strings."""
call_count = [0]
token_map = {0: [10], 1: [20, 30]}
def fake_encode(s, add_special_tokens=False):
idx = call_count[0]
call_count[0] += 1
return token_map.get(idx, [])
fake_processor = SimpleNamespace(
tokenizer=SimpleNamespace(encode=fake_encode),
)
result = server.resolve_stop_sequences(["a", "b"])
assert result == ["a", "b"]
def test_resolve_stop_sequences_none():
"""resolve_stop_sequences should return None for None input."""
assert server.resolve_stop_sequences(None) is None
def test_resolve_stop_sequences_limits_to_four():
"""resolve_stop_sequences should process at most 4 sequences."""
call_count = [0]
def fake_encode(s, add_special_tokens=False):
call_count[0] += 1
return [call_count[0]]
fake_processor = SimpleNamespace(
tokenizer=SimpleNamespace(encode=fake_encode),
)
result = server.resolve_stop_sequences(["a", "b", "c", "d", "e", "f"])
assert len(result) == 4
result = server.resolve_stop_sequences("hello")
assert result == ["hello"]
def test_resolve_stop_sequences_list_filters_invalid_entries():
"""resolve_stop_sequences should keep only non-empty string entries."""
result = server.resolve_stop_sequences(["a", "", None, "b", 3, "c"])
assert result == ["a", "b", "c"]
def test_resolve_stop_sequences_none():
"""resolve_stop_sequences should return None for None input."""
assert server.resolve_stop_sequences(None) is None
def test_resolve_stop_sequences_limits_to_four():
"""resolve_stop_sequences should return at most 4 valid sequences."""
result = server.resolve_stop_sequences(["a", "", "b", None, "c", "d", "e", "f"])
assert result == ["a", "b", "c", "d"]

Copilot uses AI. Check for mistakes.
Comment thread mlx_vlm/tests/test_server.py Outdated
Comment on lines +140 to +171
def test_chat_completions_stop_string_passed_as_eos_tokens(client):
"""stop parameter should be resolved to eos_tokens in generate kwargs."""
model = SimpleNamespace()
processor = SimpleNamespace(
tokenizer=SimpleNamespace(
chat_template="",
encode=lambda s, add_special_tokens=False: [42],
),
)
config = SimpleNamespace(model_type="test")
result = SimpleNamespace(
text="Hello",
prompt_tokens=5,
generation_tokens=1,
total_tokens=6,
prompt_tps=100.0,
generation_tps=50.0,
peak_memory=1.0,
)

with (
patch.object(server, "get_cached_model", return_value=(model, processor, config)),
patch.object(server, "apply_chat_template", return_value="prompt"),
patch.object(server, "generate", return_value=result) as mock_gen,
):
resp = client.post(
"/chat/completions",
json={
"model": "demo",
"messages": [{"role": "user", "content": "hello"}],
"stop": ["\n\n", "</s>"],
},
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test name says “stop_string” but the request sends a list of stop sequences. Rename to reflect list input (or change the payload to a single string) to keep intent clear.

Copilot uses AI. Check for mistakes.
- Fix comment "token IDs" → "strings" (eos_tokens takes strings)
- Tighten resolve_stop_sequences type hints to List[str]
- Remove unused fake_processor/fake_encode from tests
- Rename misleading test names to drop "string" from name

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eloe added a commit that referenced this pull request Apr 9, 2026
- Fix comment "token IDs" → "strings" (eos_tokens takes strings)
- Tighten resolve_stop_sequences type hints to List[str]
- Remove unused fake_processor/fake_encode from tests
- Rename misleading test names to drop "string" from name

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants