feat(api): OpenAI-compatible logprobs for /v1/chat/completions (#1549) #1591
Open
popfido wants to merge 12 commits into
…ig (jundot#1549)
Phase 1 of OpenAI-style chat logprobs (spec: docs/specs/openai-logprobs-1549.md).
- ChatCompletionRequest: logprobs/top_logprobs + validator (range 0-20, requires logprobs)
- response models TopLogprob/ChatCompletionTokenLogprob/ChoiceLogprobs; logprobs field on ChatCompletionChoice and ChatCompletionChunkChoice
- SamplingSettings.top_logprobs_k launch cap (default 20, clamped 0-20) + admin get/set
- No behavior change yet; the logprobs-disabled path is untouched.
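The validation and clamping rules in this commit can be illustrated with a minimal, dependency-free sketch. `LogprobsRequest` and `clamp_to_server_cap` are hypothetical names for illustration, not the PR's actual models (which live in `ChatCompletionRequest` and `SamplingSettings`); only the 0-20 range, the "requires logprobs" rule, and the clamp-to-cap behavior come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

MAX_TOP_LOGPROBS = 20  # upper bound stated in the commit message (OpenAI's 0-20 range)


@dataclass
class LogprobsRequest:
    """Hypothetical stand-in for the logprobs-related fields of a chat request."""

    logprobs: bool = False
    top_logprobs: Optional[int] = None

    def __post_init__(self) -> None:
        if self.top_logprobs is None:
            return
        # Rule 1: top_logprobs must be within 0..20.
        if not 0 <= self.top_logprobs <= MAX_TOP_LOGPROBS:
            raise ValueError("top_logprobs must be between 0 and 20")
        # Rule 2: top_logprobs is only meaningful when logprobs is enabled.
        if not self.logprobs:
            raise ValueError("top_logprobs requires logprobs to be true")


def clamp_to_server_cap(requested: Optional[int], server_cap: int) -> int:
    """Clamp a requested top-K to the server cap, which is itself kept in 0..20."""
    cap = max(0, min(server_cap, MAX_TOP_LOGPROBS))
    return min(requested or 0, cap)
```

Over-cap requests are clamped rather than rejected in this sketch, so a client asking for more alternatives than the server allows still gets a valid, smaller response.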
…#1549)
Phase 2a of OpenAI-style chat logprobs (spec: docs/specs/openai-logprobs-1549.md).
- request.py: TokenLogprob dataclass + RequestOutput.logprobs field
- logprobs.py: extract_token_logprob() reduces the full-vocab vector to the chosen-token logprob plus top-K (ids/logprobs sorted desc), then the caller frees the array
- scheduler.py: extract on the standard decode path only when requested; the logprobs-disabled path stays a plain discard with no added compute/transfer
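The reduction described for extract_token_logprob() can be sketched in plain Python as follows. This is an assumption-laden illustration, not the PR's code: `TokenLogprobSketch` and `extract_top_k` are hypothetical names, and the vector is a Python list rather than the engine's array type.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class TokenLogprobSketch:
    """Hypothetical compact record kept after the full-vocab vector is dropped."""

    token_id: int
    logprob: float
    top_ids: list = field(default_factory=list)
    top_logprobs: list = field(default_factory=list)


def extract_top_k(logprobs, chosen_id, k):
    """Reduce a full-vocab logprob vector to the chosen token plus the top-K entries."""
    k = max(0, min(k, len(logprobs)))
    # nlargest returns indices ordered by descending logprob.
    top_ids = heapq.nlargest(k, range(len(logprobs)), key=logprobs.__getitem__)
    return TokenLogprobSketch(
        token_id=chosen_id,
        logprob=logprobs[chosen_id],
        top_ids=top_ids,
        top_logprobs=[logprobs[i] for i in top_ids],
    )


# Toy 4-token vocabulary; token 3 was the one sampled.
record = extract_top_k([-2.0, -0.1, -3.5, -1.0], chosen_id=3, k=2)
```

Only the small record needs to outlive the decode step; the caller can release the full vector as soon as the function returns, which is the point of extracting early.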
…lash (jundot#1549)
Phase 2b of OpenAI-style chat logprobs (spec: docs/specs/openai-logprobs-1549.md).
- base.GenerationOutput gains a logprobs field
- output_collector: accumulate per-token logprobs across merged steps (the non-streaming generate() path relies on this single-slot merge)
- batched/vlm: thread logprobs/top_logprobs into SamplingParams and map output.logprobs onto GenerationOutput for both generate() and stream_generate()
- scheduler: suppress logprobs on speculative paths (VLM MTP shim or model.mtp head); those report the proposing head's distribution, not the target model's
- DFlash native path leaves logprobs unwired (-> null); its fallback delegates to batched/vlm which handle logprobs correctly
Phase 3 of OpenAI-style chat logprobs (spec: docs/specs/openai-logprobs-1549.md).
- api/utils.py: build_choice_logprobs() / token_logprob_to_openai(): decode token text + utf-8 bytes via the tokenizer (ported from mlx-vlm)
- server.py: forward logprobs/top_logprobs into chat_kwargs, clamping the requested top_logprobs to the server cap (SamplingSettings.top_logprobs_k)
- non-streaming: attach ChoiceLogprobs to the chat choice (D1 raw order)
- streaming: attach per-step logprobs to content-delta chunks (D4 content-only; exact for plain text, best-effort under thinking/tool buffering)
- disabled path untouched (fields only set when logprobs requested)
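The OpenAI entry shape produced in this phase ({token, logprob, bytes, top_logprobs[]}) can be sketched as below. `to_openai_entry` is a hypothetical helper, not the PR's token_logprob_to_openai(); it assumes each token has already been decoded to a complete string, whereas a real tokenizer may emit tokens that are partial UTF-8 sequences.

```python
def _entry(text, logprob):
    """One token record: decoded text, its logprob, and its UTF-8 bytes as ints."""
    return {"token": text, "logprob": logprob, "bytes": list(text.encode("utf-8"))}


def to_openai_entry(token_text, logprob, top):
    """Build one logprobs.content item; `top` is a list of (text, logprob) pairs."""
    entry = _entry(token_text, logprob)
    entry["top_logprobs"] = [_entry(text, lp) for text, lp in top]
    return entry


# A non-ASCII token shows why `bytes` is carried separately from `token`.
example = to_openai_entry("é", -0.5, [("é", -0.5), ("e", -1.2)])
# example["bytes"] is [195, 169], the two-byte UTF-8 encoding of "é"
```

Carrying `bytes` alongside `token` lets clients reassemble text correctly when a character is split across tokens.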
jundot#1549)
Phase 3b: guarantee OpenAI-style per-chunk logprob/content-token alignment.
- ThinkingParser: logprob-aware feed_with_logprob/finish_with_logprob carry per-char logprobs through tag-lookahead buffering and emit one entry per content-contributing token; text-only feed/finish stay byte-identical
- TokenLogprob.text (set by the scheduler) lets the stream feed the parser per token, so alignment holds even when the collector batches tokens
- stream_chat_completion: per-token feed + aligned content-chunk logprobs; omit logprobs when the tool-call filter is active (buffers independently) rather than risk misalignment
- fixes the '<'-in-content drift (e.g. code/HTML) that dropped a token entry
…t#1549)
Phase 4 of OpenAI-style chat logprobs.
- tests/integration/test_logprobs_api.py: TestClient + mock engine covering non-streaming shape, streaming content-token alignment (the '<' drift case), logprobs-off omission, 422 validation, server-cap clamp, and the perf invariant (no logprob kwargs forwarded when disabled)
- spec: record implementation status + deviations (422 vs 400, tools-streaming omit, perf-gate note)
Mechanical ruff-driven cleanup, no behavior change:
- openai_models.py: Optional[X]->X|None, List/Dict->list/dict, Union->|, drop now-unused typing imports
- output_collector.py: remove unused field/List imports
Both files are now ruff-clean against the project UP ruleset.
…ream (jundot#1549)
CI failure: two streaming-abort tests mock the inner engine output with a minimal SimpleNamespace that lacks 'logprobs'. Use getattr(output, 'logprobs', None) at the GenerationOutput construction sites (generate + stream_generate, batched + vlm) so the engine tolerates duck-typed outputs. Production RequestOutput always carries the field, so there is no behavior change.
…p wiring (jundot#1549)
- server.py streaming: rely on build_choice_logprobs() returning None on empty (content_lps is already gated by want_logprobs) instead of the repeated '... if content_lps else None'. Non-streaming keeps its request-intent gate (a test pins that logprobs are never serialized unless requested, regardless of engine output).
- admin/routes.py: remove the top_logprobs_k admin get/set wiring; the cap still round-trips via settings.json (default 20), which keeps the surface smaller.
Summary
I also think logprobs are necessary for using oMLX as a local API source for RL evaluation, so I looked at how much of the codebase this feature would touch and found that the footprint is quite small.
Implements OpenAI-compatible per-token logprobs for `/v1/chat/completions` (streaming and non-streaming), closing #1549. Previously the server accepted `logprobs: true` / `top_logprobs: N` but always returned an empty `choices[].logprobs.content`. The engine already computed a per-token logprob distribution for sampling; this PR plumbs it through the request → engine → scheduler → API layers and serializes it in the OpenAI shape (`{token, logprob, bytes, top_logprobs[]}`). Ported from mlx-vlm's implementation (the VLM-capable reference).

What changed
- `api/openai_models.py`: `logprobs`/`top_logprobs` request fields + validator (range 0–20, requires `logprobs`); `TopLogprob`/`ChatCompletionTokenLogprob`/`ChoiceLogprobs`; `logprobs` on the choice and streaming-chunk-choice models.
- `settings.py`, admin: `SamplingSettings.top_logprobs_k` server-side cap (default 20, clamped 0–20).
- `logprobs.py`, `request.py`, `scheduler.py`, `engine/*`: early top-K extraction from the full-vocab vector (frees the ~800 KB array), `RequestOutput.logprobs`, accumulation across collector merges, `GenerationOutput.logprobs`.
- `server.py`, `api/utils.py`: tokenizer-based token/`bytes` decoding; non-streaming attaches `ChoiceLogprobs`; streaming attaches aligned per-chunk logprobs.
- `api/thinking.py`: logprob-aware thinking parser so per-chunk `logprobs.content` stays aligned to content tokens through tag-lookahead buffering (fixes a `<`-in-content drift). Text-only path is byte-identical.

Design decisions
- `top_logprobs` cap: OpenAI's 0–20, plus a configurable server cap (`top_logprobs_k`, default 20); over-cap requests clamp.
- Invalid parameters return `422` via the repo's existing OpenAI-formatted `RequestValidationError` handler (house convention).

Verification
- `tests/test_logprobs_extraction.py` and `tests/integration/test_logprobs_api.py` (extraction, formatter, thinking-parser alignment incl. the `a<b` case, non-streaming shape, streaming alignment, validation, server-cap clamp, disabled-path-forwards-no-kwargs). Full affected suite green (800+ tests).
- `logprobs:true, top_logprobs:2` → populated `content` with real `top_logprobs` and correct UTF-8 `bytes`.
- `len(logprobs.content) == completion_tokens` (99 == 99).

Limitations / follow-ups
- Streaming with tools omits logprobs (the tool-call filter buffers independently); non-streaming with tools returns them.
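For reviewers who want to repeat the `len(logprobs.content) == completion_tokens` check from the verification notes against their own responses, a client-side sketch could look like this. `logprobs_match_usage` is a hypothetical helper, and the sample payload is hand-written for illustration, not captured server output.

```python
def logprobs_match_usage(response):
    """Return True when the first choice has one logprob entry per completion token."""
    choice = response["choices"][0]
    # `logprobs` may be absent or null (e.g. when not requested), so default to empty.
    content = (choice.get("logprobs") or {}).get("content") or []
    return len(content) == response["usage"]["completion_tokens"]


# Hand-written, already-parsed non-streaming response with two completion tokens.
sample = {
    "choices": [
        {
            "logprobs": {
                "content": [
                    {"token": "Hi", "logprob": -0.1, "bytes": [72, 105], "top_logprobs": []},
                    {"token": "!", "logprob": -0.3, "bytes": [33], "top_logprobs": []},
                ]
            }
        }
    ],
    "usage": {"completion_tokens": 2},
}
ok = logprobs_match_usage(sample)
```

The check is only meaningful where logprobs are actually returned; per the limitation above, a streamed response with tools would carry no entries to compare.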
Test plan
- `pytest tests/test_logprobs_extraction.py tests/integration/test_logprobs_api.py`
- `422` (vs OpenAI `400`) is acceptable for invalid params