feat: add logprobs support to /chat/completions #17

Open
eloe wants to merge 1 commit into main from feature/logprobs

@eloe eloe commented Apr 6, 2026

Summary

Per-token log probabilities when logprobs=True, in OpenAI format. 4 tests.

Return per-token log probabilities when logprobs=True is set in the
request. Each token includes the decoded text, its log probability,
and UTF-8 byte representation matching the OpenAI format.

When logprobs is requested in non-streaming mode, the handler uses
stream_generate internally to collect per-token probabilities.

Adds: logprobs/top_logprobs fields on ChatRequest, TokenLogprob and
ChoiceLogprobs models, logprobs on ChatChoice/ChatStreamChoice.
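
As a rough illustration of the shape described above, here is a minimal sketch using stdlib dataclasses. The field names follow the review excerpts in this PR (token, logprob, bytes, content); the use of dataclasses instead of the server's actual model classes and the helper `make_token_logprob` are assumptions made only to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TokenLogprob:
    token: str        # decoded token text
    logprob: float    # log probability of the chosen token
    bytes: List[int]  # UTF-8 bytes of the decoded token text


@dataclass
class ChoiceLogprobs:
    content: List[TokenLogprob] = field(default_factory=list)


def make_token_logprob(token_text: str, logprob: float) -> TokenLogprob:
    """Hypothetical helper: build one entry from decoded text and its logprob."""
    return TokenLogprob(
        token=token_text,
        logprob=logprob,
        bytes=list(token_text.encode("utf-8")),
    )


# One entry per generated token, collected under `content`.
example = ChoiceLogprobs(content=[make_token_logprob("Hi", -0.25)])
```

Storing the UTF-8 bytes alongside the text lets a client reassemble multi-byte characters that are split across tokens.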

4 new tests: present when requested, absent by default, format
validation, streaming logprobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Adds OpenAI-style per-token log probability support to the /chat/completions endpoint when logprobs=True, including streaming behavior and accompanying tests.

Changes:

  • Extend /chat/completions request/response schemas to include logprobs (and top_logprobs in the request).
  • Emit per-token logprobs.content entries in both streaming and non-streaming responses.
  • Add new server tests covering presence/absence, format, and streaming logprobs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| mlx_vlm/server.py | Adds request/response models for logprobs and emits per-token logprob objects in streaming and non-streaming /chat/completions. |
| mlx_vlm/tests/test_server.py | Adds tests validating logprobs are returned when requested, omitted by default, and present in streaming chunks. |

Comment thread mlx_vlm/server.py
Comment on lines +608 to +613
top_logprobs: Optional[int] = Field(
    None,
    ge=0,
    le=20,
    description="Number of most likely tokens to return at each position.",
)

Copilot AI Apr 6, 2026

top_logprobs is added to ChatRequest but is never read in the handler, so the API will accept it and silently ignore it. Either implement top_logprobs in the /chat/completions response (include the top-N alternatives per token) or remove the field until it’s supported; also consider validating that top_logprobs can only be set when logprobs=True to match the OpenAI request contract.
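
One way the two options in this comment could look is sketched below. It is only an illustration: `validate_logprob_options` and `top_n_alternatives` are hypothetical names, and the per-token log probabilities are assumed to be a plain Python list indexed by token id (in the server they come from `chunk.logprobs`, which may be an array type that needs converting first).

```python
from typing import List, Optional, Tuple


def validate_logprob_options(
    logprobs: Optional[bool], top_logprobs: Optional[int]
) -> None:
    """Reject top_logprobs unless logprobs=True, and keep it within 0..20."""
    if top_logprobs is None:
        return
    if not logprobs:
        raise ValueError("top_logprobs requires logprobs=True")
    if not 0 <= top_logprobs <= 20:
        raise ValueError("top_logprobs must be between 0 and 20")


def top_n_alternatives(logprobs: List[float], n: int) -> List[Tuple[int, float]]:
    """Return (token_id, logprob) pairs for the n most likely tokens."""
    ranked = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)
    return [(i, logprobs[i]) for i in ranked[:n]]


validate_logprob_options(True, 2)  # accepted
alternatives = top_n_alternatives([-2.0, -0.1, -1.0], 2)
```

Each returned token id would still need to be decoded and wrapped in the same token/logprob/bytes shape as the chosen token before being attached to the response.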

Suggested change
top_logprobs: Optional[int] = Field(
    None,
    ge=0,
    le=20,
    description="Number of most likely tokens to return at each position.",
)

Comment thread mlx_vlm/server.py
Comment on lines +1176 to +1187
chunk_logprobs = None
if want_logprobs and chunk.token is not None and chunk.logprobs is not None:
    token_text = tokenizer.decode([chunk.token])
    chosen_logprob = float(chunk.logprobs[chunk.token])
    chunk_logprobs = ChoiceLogprobs(
        content=[
            TokenLogprob(
                token=token_text,
                logprob=chosen_logprob,
                bytes=list(token_text.encode("utf-8")),
            )
        ]

Copilot AI Apr 6, 2026

stream_generate() always yields a final “flush” chunk after detokenizer.finalize() that carries the last token/logprobs even when chunk.text is empty. The current logic will attach logprobs for that final chunk as well, which can leak EOS/special-token logprobs into the streamed response. Consider skipping logprob emission for special tokens (e.g., tokens in tokenizer.all_special_ids / eos_token_id) so the logprobs.content aligns with actual assistant content tokens.
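
A possible guard for the streaming path is sketched here. `should_emit_logprob` is a hypothetical helper, and the set of special ids is assumed to be built once per request, for example from `tokenizer.all_special_ids` plus the EOS id; the ids used below are made up for the example.

```python
from typing import Optional, Set


def should_emit_logprob(
    token_id: Optional[int], has_logprobs: bool, special_ids: Set[int]
) -> bool:
    """True only for chunks whose token is regular assistant content."""
    if token_id is None or not has_logprobs:
        return False
    # Skip EOS and other special tokens, including the one carried by
    # the final flush chunk.
    return token_id not in special_ids


special_ids = {0, 2}  # assumed ids, e.g. padding and EOS
emitted = [t for t in [15, 27, 2] if should_emit_logprob(t, True, special_ids)]
```

In the handler, the existing `want_logprobs and chunk.token is not None and chunk.logprobs is not None` condition could be extended with this check so the final chunk still streams but carries no logprobs entry.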

Comment thread mlx_vlm/server.py
Comment on lines +1281 to +1291
full_text += chunk.text
if chunk.token is not None and chunk.logprobs is not None:
    token_text = tokenizer.decode([chunk.token])
    chosen_logprob = float(chunk.logprobs[chunk.token])
    token_logprobs.append(
        TokenLogprob(
            token=token_text,
            logprob=chosen_logprob,
            bytes=list(token_text.encode("utf-8")),
        )
    )

Copilot AI Apr 6, 2026

When collecting non-streaming token_logprobs, this will also include the final “flush” chunk from stream_generate() (post-finalize()), which may correspond to EOS/special tokens rather than user-visible text. Filter out special tokens (or otherwise exclude the terminal flush chunk) so returned logprobs.content matches the tokens that form message.content.
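
A regression check for this alignment might look like the sketch below. It assumes each logprobs.content entry is a dict with a `bytes` list, as in the serialized response; `logprobs_text` and `logprobs_match_message` are hypothetical helper names. Strict equality may be too strong for tokenizers whose per-token decoding differs from incremental detokenization, so treat it as a starting point rather than a guaranteed invariant.

```python
from typing import Dict, List


def logprobs_text(content: List[Dict]) -> str:
    """Concatenate the per-token bytes and decode them as UTF-8."""
    raw = bytes(b for entry in content for b in entry["bytes"])
    return raw.decode("utf-8", errors="replace")


def logprobs_match_message(content: List[Dict], message_text: str) -> bool:
    """True when the logprob tokens reassemble exactly into the message text."""
    return logprobs_text(content) == message_text


aligned = [
    {"token": "Hi", "logprob": -0.2, "bytes": [72, 105]},
    {"token": "!", "logprob": -0.9, "bytes": [33]},
]
# The same entries plus a leaked EOS token from the final flush chunk.
leaked = aligned + [{"token": "</s>", "logprob": -0.01, "bytes": [60, 47, 115, 62]}]
```

With these inputs, `logprobs_match_message(aligned, "Hi!")` holds while the `leaked` list does not match, which is the failure mode described in this comment.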
