feat: add logprobs support to /chat/completions #17
Return per-token log probabilities when logprobs=True is set in the request. Each token includes the decoded text, its log probability, and its UTF-8 byte representation, matching the OpenAI format.

When logprobs is requested in non-streaming mode, the server uses stream_generate internally to collect per-token probabilities.

Adds:
- logprobs/top_logprobs fields on ChatRequest
- TokenLogprob and ChoiceLogprobs models
- logprobs on ChatChoice/ChatStreamChoice

4 new tests: present when requested, absent by default, format validation, streaming logprobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
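As a rough illustration of how a client might consume the logprobs.content entries described above, the sketch below sums the per-token log probabilities and rebuilds the text from the bytes fields. The payload is made up for illustration (the values are not output from this server), and the helper name summarize_logprobs is hypothetical.

```python
import math
from typing import Any, Dict, Tuple


def summarize_logprobs(choice_logprobs: Dict[str, Any]) -> Tuple[str, float, float]:
    """Return (text rebuilt from bytes, total logprob, sequence probability).

    Assumes each entry in choice_logprobs["content"] carries "token",
    "logprob", and "bytes" keys, as described in the PR text.
    """
    raw = bytearray()
    total = 0.0
    for entry in choice_logprobs["content"]:
        raw.extend(entry["bytes"])  # UTF-8 bytes of the decoded token
        total += entry["logprob"]   # log probabilities add across tokens
    return raw.decode("utf-8"), total, math.exp(total)


# Illustrative payload only; the numbers are invented.
sample = {
    "content": [
        {"token": "H", "logprob": -0.5, "bytes": [72]},
        {"token": "é", "logprob": -0.25, "bytes": [195, 169]},
    ]
}

text, total_logprob, probability = summarize_logprobs(sample)
```

Rebuilding the text from bytes rather than from token is the safer choice when a multi-byte character may be split across tokens, which is presumably why the format carries both fields.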
Pull request overview
Adds OpenAI-style per-token log probability support to the /chat/completions endpoint when logprobs=True, including streaming behavior and accompanying tests.
Changes:
- Extend /chat/completions request/response schemas to include logprobs (and top_logprobs in the request).
- Emit per-token logprobs.content entries in both streaming and non-streaming responses.
- Add new server tests covering presence/absence, format, and streaming logprobs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| mlx_vlm/server.py | Adds request/response models for logprobs and emits per-token logprob objects in streaming and non-streaming /chat/completions. |
| mlx_vlm/tests/test_server.py | Adds tests validating logprobs are returned when requested, omitted by default, and present in streaming chunks. |
    top_logprobs: Optional[int] = Field(
        None,
        ge=0,
        le=20,
        description="Number of most likely tokens to return at each position.",
    )
top_logprobs is added to ChatRequest but is never read in the handler, so the API will accept it and silently ignore it. Either implement top_logprobs in the /chat/completions response (include the top-N alternatives per token) or remove the field until it’s supported; also consider validating that top_logprobs can only be set when logprobs=True to match the OpenAI request contract.
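One way the request-contract check suggested here could look is sketched below as a plain function. The name validate_logprobs_options is hypothetical and not part of the PR; in the server it would presumably be called from a validator on ChatRequest. The 0 to 20 bounds mirror the ge/le values in the field definition.

```python
from typing import Optional


def validate_logprobs_options(
    logprobs: Optional[bool], top_logprobs: Optional[int]
) -> None:
    """Raise ValueError if top_logprobs is out of range or set without logprobs=True.

    Hypothetical helper for illustration; not taken from the PR.
    """
    if top_logprobs is None:
        return  # nothing to check when the field is omitted
    if not 0 <= top_logprobs <= 20:
        raise ValueError("top_logprobs must be between 0 and 20")
    if not logprobs:
        raise ValueError("top_logprobs requires logprobs=True")


def is_valid(logprobs: Optional[bool], top_logprobs: Optional[int]) -> bool:
    """Convenience wrapper returning True/False instead of raising."""
    try:
        validate_logprobs_options(logprobs, top_logprobs)
    except ValueError:
        return False
    return True
```

Rejecting the combination up front avoids the silent-ignore behavior described in the comment, at the cost of returning an error to clients that already send top_logprobs on its own.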
    chunk_logprobs = None
    if want_logprobs and chunk.token is not None and chunk.logprobs is not None:
        token_text = tokenizer.decode([chunk.token])
        chosen_logprob = float(chunk.logprobs[chunk.token])
        chunk_logprobs = ChoiceLogprobs(
            content=[
                TokenLogprob(
                    token=token_text,
                    logprob=chosen_logprob,
                    bytes=list(token_text.encode("utf-8")),
                )
            ]
stream_generate() always yields a final “flush” chunk after detokenizer.finalize() that carries the last token/logprobs even when chunk.text is empty. The current logic will attach logprobs for that final chunk as well, which can leak EOS/special-token logprobs into the streamed response. Consider skipping logprob emission for special tokens (e.g., tokens in tokenizer.all_special_ids / eos_token_id) so the logprobs.content aligns with actual assistant content tokens.
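A minimal sketch of the suggested filter follows, assuming the set of special token IDs is available up front (for example, built from tokenizer.all_special_ids plus the EOS ID). The helper name logprob_entry is hypothetical, and a plain dict stands in for the TokenLogprob model so the example runs without the server code.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set, Union

Entry = Dict[str, Union[str, float, List[int]]]


def logprob_entry(
    token_id: Optional[int],
    logprobs: Optional[Sequence[float]],
    decode: Callable[[List[int]], str],
    special_ids: Set[int],
) -> Optional[Entry]:
    """Build one logprob entry, or return None when nothing should be emitted.

    Hypothetical helper: returns None for missing data and for special
    tokens (such as EOS), so the final flush chunk adds no entry.
    """
    if token_id is None or logprobs is None:
        return None
    if token_id in special_ids:
        return None  # skip EOS and other special tokens
    token_text = decode([token_id])
    return {
        "token": token_text,
        "logprob": float(logprobs[token_id]),
        "bytes": list(token_text.encode("utf-8")),
    }


# Toy vocabulary standing in for a real tokenizer (assumption for the demo).
VOCAB = {0: "Hi", 1: "!", 2: "</s>"}


def toy_decode(ids: List[int]) -> str:
    return "".join(VOCAB[i] for i in ids)
```

In the streaming handler, the returned value would replace the inline construction of chunk_logprobs, with the special-ID set computed once per request rather than per chunk.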
    full_text += chunk.text
    if chunk.token is not None and chunk.logprobs is not None:
        token_text = tokenizer.decode([chunk.token])
        chosen_logprob = float(chunk.logprobs[chunk.token])
        token_logprobs.append(
            TokenLogprob(
                token=token_text,
                logprob=chosen_logprob,
                bytes=list(token_text.encode("utf-8")),
            )
        )
When collecting non-streaming token_logprobs, this will also include the final “flush” chunk from stream_generate() (post-finalize()), which may correspond to EOS/special tokens rather than user-visible text. Filter out special tokens (or otherwise exclude the terminal flush chunk) so returned logprobs.content matches the tokens that form message.content.
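For the non-streaming path, the collection loop might be sketched as below. FakeChunk is a hypothetical stand-in for the objects yielded by stream_generate(), with only the three attributes the excerpt uses, and collect_visible_logprobs is an illustrative name rather than code from the PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Set, Tuple


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a stream_generate() chunk."""
    text: str
    token: Optional[int]
    logprobs: Optional[Sequence[float]]


def collect_visible_logprobs(
    chunks: Iterable[FakeChunk],
    decode: Callable[[List[int]], str],
    special_ids: Set[int],
) -> Tuple[str, List[Dict[str, object]]]:
    """Accumulate the full text and logprob entries, skipping special tokens."""
    full_text = ""
    entries: List[Dict[str, object]] = []
    for chunk in chunks:
        full_text += chunk.text
        if chunk.token is None or chunk.logprobs is None:
            continue
        if chunk.token in special_ids:
            continue  # e.g. the terminal flush chunk carrying EOS
        token_text = decode([chunk.token])
        entries.append(
            {
                "token": token_text,
                "logprob": float(chunk.logprobs[chunk.token]),
                "bytes": list(token_text.encode("utf-8")),
            }
        )
    return full_text, entries


# Toy data: two content chunks followed by an empty-text flush chunk with EOS.
VOCAB = {0: "Hello", 1: " world", 2: "</s>"}
chunks = [
    FakeChunk("Hello", 0, [-0.2, -1.5, -4.0]),
    FakeChunk(" world", 1, [-3.0, -0.1, -2.5]),
    FakeChunk("", 2, [-5.0, -5.0, -0.01]),
]
full_text, entries = collect_visible_logprobs(
    chunks, lambda ids: "".join(VOCAB[i] for i in ids), special_ids={2}
)
```

With this toy vocabulary the concatenated token strings equal full_text. That property may not hold exactly for real tokenizers whose tokens split multi-byte characters, so it is better treated as a sanity check in tests than as a guarantee.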
Summary
Per-token log probabilities when logprobs=True. OpenAI format. 4 tests.