feat(api): OpenAI-compatible logprobs for /v1/chat/completions (#1549) by popfido · Pull Request #1591 · jundot/omlx

popfido · 2026-06-02T01:39:01Z

Summary

I also think logprobs are necessary for using oMLX as a local RL evaluation API source, so I looked at how much of the codebase adding this would touch and found the impact to be fairly small.

Implements OpenAI-compatible per-token logprobs for /v1/chat/completions (streaming and non-streaming), closing #1549. Previously the server accepted logprobs: true / top_logprobs: N but always returned an empty choices[].logprobs.content. The engine already computed a per-token logprob distribution for sampling; this PR plumbs it through the request → engine → scheduler → API layers and serializes it in the OpenAI shape ({token, logprob, bytes, top_logprobs[]}). Ported from mlx-vlm's implementation (the VLM-capable reference).
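For readers unfamiliar with the OpenAI shape mentioned above, here is a minimal sketch of one entry of choices[].logprobs.content. The helper name make_logprob_entry is hypothetical and not part of this PR; it also derives bytes from the token text directly, whereas the PR decodes token text and bytes through the tokenizer.

```python
def make_logprob_entry(token, logprob, alternatives):
    """Build one OpenAI-style logprobs.content entry (illustrative only).

    `alternatives` is a list of (token_text, logprob) pairs, assumed to be
    already sorted by descending logprob.
    """
    def item(text, lp):
        # `bytes` is the UTF-8 encoding of the token text as a list of ints.
        return {"token": text, "logprob": lp, "bytes": list(text.encode("utf-8"))}

    entry = item(token, logprob)
    entry["top_logprobs"] = [item(text, lp) for text, lp in alternatives]
    return entry


# Hypothetical values, chosen only to show a multi-byte token.
entry = make_logprob_entry("é", -0.12, [("é", -0.12), ("e", -2.5)])
```

The bytes field matters for tokens that are not plain ASCII, which is why the verification section below checks for correct UTF-8 bytes.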

What changed

Request/response schema (api/openai_models.py): logprobs/top_logprobs request fields + validator (range 0–20, requires logprobs); TopLogprob / ChatCompletionTokenLogprob / ChoiceLogprobs; logprobs on the choice and streaming-chunk-choice models.
Config (settings.py, admin): SamplingSettings.top_logprobs_k server-side cap (default 20, clamped 0–20).
Engine path (logprobs.py, request.py, scheduler.py, engine/*): early top-K extraction from the full-vocab vector (frees the ~800 KB array), RequestOutput.logprobs, accumulation across collector merges, GenerationOutput.logprobs.
API serialization (server.py, api/utils.py): tokenizer-based token/bytes decoding; non-streaming attaches ChoiceLogprobs; streaming attaches aligned per-chunk logprobs.
Streaming alignment (api/thinking.py): logprob-aware thinking parser so per-chunk logprobs.content stays aligned to content tokens through tag-lookahead buffering (fixes a <-in-content drift). Text-only path is byte-identical.
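The early top-K extraction in the engine path can be pictured roughly as follows. This is a pure-Python sketch under stated assumptions: the function name extract_top_k, its signature, and the returned dict are hypothetical stand-ins, and the actual extract_token_logprob() in logprobs.py operates on the engine's array type rather than a Python list.

```python
import heapq
import math


def extract_top_k(logprobs, chosen_id, k):
    """Reduce a full-vocab logprob vector to the chosen token plus top-K.

    Returns only small Python values so the caller can drop the large
    vector right after this call.
    """
    k = max(0, min(k, len(logprobs)))
    # nlargest returns (token_id, logprob) pairs sorted by descending logprob.
    top = heapq.nlargest(k, enumerate(logprobs), key=lambda pair: pair[1])
    return {
        "token_id": chosen_id,
        "logprob": logprobs[chosen_id],
        "top_ids": [token_id for token_id, _ in top],
        "top_logprobs": [lp for _, lp in top],
    }


# Toy 5-token vocabulary; a real vocabulary has many thousands of entries.
vector = [math.log(p) for p in (0.1, 0.5, 0.05, 0.3, 0.05)]
result = extract_top_k(vector, chosen_id=3, k=2)
del vector  # the caller frees the full-vocab vector once the summary exists
```

Keeping only the chosen-token logprob and K alternatives per step is what lets the full vector be released early instead of being carried through the scheduler and collector.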

Design decisions

Token order under thinking/tools: raw generation order (content tokens).
top_logprobs cap: OpenAI's 0–20, plus a configurable server cap (top_logprobs_k, default 20); over-cap requests clamp.
Speculative decoding: MTP and native DFlash report the proposing head's distribution, not the target model's, so logprobs are suppressed (null) there for now (DFlash fallback to the standard engine returns them).
Performance: all logprobs work is gated behind the per-request flag; the disabled path is byte-for-byte unchanged.
Invalid params return 422 via the repo's existing OpenAI-formatted RequestValidationError handler (house convention).
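The validation and clamping rules above can be sketched together as one small function. This is an illustration, not the PR's code: resolve_top_logprobs is a hypothetical name, and the real implementation splits this between a request-model validator (which yields the 422) and a clamp in server.py against SamplingSettings.top_logprobs_k.

```python
from typing import Optional

OPENAI_MAX_TOP_LOGPROBS = 20  # OpenAI's documented upper bound


def resolve_top_logprobs(
    logprobs: bool,
    top_logprobs: Optional[int],
    server_cap: int = 20,
) -> Optional[int]:
    """Return the effective top-K, or None when logprobs are disabled."""
    if top_logprobs is not None:
        # Out-of-range values are rejected, not clamped.
        if not 0 <= top_logprobs <= OPENAI_MAX_TOP_LOGPROBS:
            raise ValueError("top_logprobs must be between 0 and 20")
        if not logprobs:
            raise ValueError("top_logprobs requires logprobs to be true")
    if not logprobs:
        return None  # disabled path: nothing is forwarded to the engine
    # The server-side cap is itself clamped to 0-20; valid requests above
    # the cap are clamped down rather than rejected.
    cap = max(0, min(server_cap, OPENAI_MAX_TOP_LOGPROBS))
    return min(top_logprobs or 0, cap)
```

For example, with a server cap of 8, a request for top_logprobs: 20 would resolve to 8, while top_logprobs: 21 would be rejected outright.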

Verification

Unit + integration: new tests in tests/test_logprobs_extraction.py and tests/integration/test_logprobs_api.py (extraction, formatter, thinking-parser alignment incl. the a<b case, non-streaming shape, streaming alignment, validation, server-cap clamp, disabled-path-forwards-no-kwargs). Full affected suite green (800+ tests).
Live, real model (Qwen3.6-27B-UD-MLX-4bit):
- Non-streaming logprobs:true, top_logprobs:2 → populated content with real top_logprobs and correct UTF-8 bytes.
- Alignment: len(logprobs.content) == completion_tokens (99 == 99).
- Streaming: per-chunk logprobs, aligned to content tokens.
- Perf (128 tokens, temp 0): logprobs off ≈ 12.5 tok/s, on/top5 ≈ 12.9, on/top20 ≈ 11.9 — negligible overhead; off-path unaffected.
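The alignment check above can be reproduced against any non-streaming response body. The sketch below assumes the standard OpenAI response layout (choices[0].logprobs.content and usage.completion_tokens); the helper name logprobs_aligned and the sample response are made up for illustration.

```python
def logprobs_aligned(response):
    """True when there is one logprobs entry per generated token."""
    content = response["choices"][0]["logprobs"]["content"]
    return len(content) == response["usage"]["completion_tokens"]


# Made-up response with two generated tokens.
sample = {
    "choices": [
        {
            "logprobs": {
                "content": [
                    {"token": "Hi", "logprob": -0.1, "bytes": [72, 105], "top_logprobs": []},
                    {"token": "!", "logprob": -0.3, "bytes": [33], "top_logprobs": []},
                ]
            }
        }
    ],
    "usage": {"completion_tokens": 2},
}
aligned = logprobs_aligned(sample)
```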

Limitations / follow-ups

Streaming + active tool-call filtering omits logprobs (the filter buffers independently); non-streaming with tools returns them.
True logprobs for MTP/DFlash speculative decoding.

Test plan

pytest tests/test_logprobs_extraction.py tests/integration/test_logprobs_api.py
Real-model curl smoke test (non-streaming + streaming)
Perf check (off vs on)
Reviewer: confirm 422 (vs OpenAI 400) is acceptable for invalid params


…ig (jundot#1549)
Phase 1 of OpenAI-style chat logprobs (spec: docs/specs/openai-logprobs-1549.md).
- ChatCompletionRequest: logprobs/top_logprobs + validator (range 0-20, requires logprobs)
- response models TopLogprob/ChatCompletionTokenLogprob/ChoiceLogprobs; logprobs field on ChatCompletionChoice and ChatCompletionChunkChoice
- SamplingSettings.top_logprobs_k launch cap (default 20, clamped 0-20) + admin get/set
- No behavior change yet; the logprobs-disabled path is untouched.

…#1549)
Phase 2a of OpenAI-style chat logprobs (spec: docs/specs/openai-logprobs-1549.md).
- request.py: TokenLogprob dataclass + RequestOutput.logprobs field
- logprobs.py: extract_token_logprob() reduces the full-vocab vector to the chosen-token logprob plus top-K (ids/logprobs sorted desc), then the caller frees the array
- scheduler.py: extract on the standard decode path only when requested; the logprobs-disabled path stays a plain discard with no added compute/transfer

…lash (jundot#1549)
Phase 2b of OpenAI-style chat logprobs (spec: docs/specs/openai-logprobs-1549.md).
- base.GenerationOutput gains a logprobs field
- output_collector: accumulate per-token logprobs across merged steps (the non-streaming generate() path relies on this single-slot merge)
- batched/vlm: thread logprobs/top_logprobs into SamplingParams and map output.logprobs onto GenerationOutput for both generate() and stream_generate()
- scheduler: suppress logprobs on speculative paths (VLM MTP shim or model.mtp head), since those report the proposing head's distribution, not the target model's
- DFlash native path leaves logprobs unwired (-> null); its fallback delegates to batched/vlm, which handle logprobs correctly

Phase 3 of OpenAI-style chat logprobs (spec: docs/specs/openai-logprobs-1549.md).
- api/utils.py: build_choice_logprobs() / token_logprob_to_openai(): decode token text + utf-8 bytes via the tokenizer (ported from mlx-vlm)
- server.py: forward logprobs/top_logprobs into chat_kwargs, clamping the requested top_logprobs to the server cap (SamplingSettings.top_logprobs_k)
- non-streaming: attach ChoiceLogprobs to the chat choice (D1 raw order)
- streaming: attach per-step logprobs to content-delta chunks (D4 content-only; exact for plain text, best-effort under thinking/tool buffering)
- disabled path untouched (fields only set when logprobs requested)

jundot#1549)
Phase 3b: guarantee OpenAI-style per-chunk logprob/content-token alignment.
- ThinkingParser: logprob-aware feed_with_logprob/finish_with_logprob carry per-char logprobs through tag-lookahead buffering and emit one entry per content-contributing token; text-only feed/finish stay byte-identical
- TokenLogprob.text (set by the scheduler) lets the stream feed the parser per token, so alignment holds even when the collector batches tokens
- stream_chat_completion: per-token feed + aligned content-chunk logprobs; omit logprobs when the tool-call filter is active (buffers independently) rather than risk misalignment
- fixes the '<'-in-content drift (e.g. code/HTML) that dropped a token entry

…t#1549)
Phase 4 of OpenAI-style chat logprobs.
- tests/integration/test_logprobs_api.py: TestClient + mock engine covering non-streaming shape, streaming content-token alignment (the '<' drift case), logprobs-off omission, 422 validation, server-cap clamp, and the perf invariant (no logprob kwargs forwarded when disabled)
- spec: record implementation status + deviations (422 vs 400, tools-streaming omit, perf-gate note)

Mechanical ruff-driven cleanup, no behavior change:
- openai_models.py: Optional[X]->X|None, List/Dict->list/dict, Union->|, drop now-unused typing imports
- output_collector.py: remove unused field/List imports
Both files are now ruff-clean against the project UP ruleset.

…ream (jundot#1549) CI failure: two streaming-abort tests mock the inner engine output with a minimal SimpleNamespace that lacks 'logprobs'. Use getattr(output, 'logprobs', None) at the GenerationOutput construction sites (generate + stream_generate, batched + vlm) so the engine tolerates duck-typed outputs. Production RequestOutput always carries the field — no behavior change.

…p wiring (jundot#1549)
- server.py streaming: rely on build_choice_logprobs() returning None on empty (content_lps is already gated by want_logprobs) instead of the repeated '... if content_lps else None'. Non-streaming keeps its request-intent gate (a test pins that logprobs are never serialized unless requested, regardless of engine output).
- admin/routes.py: remove the top_logprobs_k admin get/set wiring; the cap still round-trips via settings.json (default 20), which keeps the surface smaller.

popfido added 10 commits June 1, 2026 21:27

docs(logprobs): add dev spec for OpenAI-style chat logprobs (jundot#1549)

3f0da9a

Temporary tracking spec for the /v1/chat/completions logprobs feature. Lives on feat/openai-logprobs-1549 during development; to be removed before the PR is opened.

docs(logprobs): finalize top_logprobs_k launch default = 20

432343d

chore(logprobs): drop dev tracking spec ahead of PR (jundot#1549)

52a2653

popfido marked this pull request as ready for review June 2, 2026 01:42

popfido mentioned this pull request Jun 2, 2026

/v1/chat/completions accepts logprobs: true / top_logprobs: N but returns empty logprobs.content #1549

Open

popfido added 2 commits June 2, 2026 09:59

popfido wants to merge 12 commits into jundot:main from popfido:feat/openai-logprobs-1549
