Skip to content

feat(api): OpenAI-compatible logprobs for /v1/chat/completions (#1549)#1591

Open
popfido wants to merge 12 commits into
jundot:mainfrom
popfido:feat/openai-logprobs-1549
Open

feat(api): OpenAI-compatible logprobs for /v1/chat/completions (#1549)#1591
popfido wants to merge 12 commits into
jundot:mainfrom
popfido:feat/openai-logprobs-1549

Conversation

@popfido
Copy link
Copy Markdown
Contributor

@popfido popfido commented Jun 2, 2026

Summary

Since I also think that logprobs is necessary for using oMLX as an local RL evaluation API source, I took a look at the influence scale of adding this functionality and found it can be quite a little.

Implements OpenAI-compatible per-token logprobs for /v1/chat/completions(streaming and non-streaming), closing #1549. Previously the server accepted logprobs: true / top_logprobs: N but always returned an empty
choices[].logprobs.content. The engine already computed a per-token logprob distribution for sampling; this PR plumbs it through the request → engine → scheduler → API layers and serializes it in the OpenAI shape ({token, logprob, bytes, top_logprobs[]}). Ported from mlx-vlm's implementation (the VLM-capable reference).

What changed

  • Request/response schema (api/openai_models.py): logprobs/top_logprobs request fields + validator (range 0–20, requires logprobs); TopLogprob / ChatCompletionTokenLogprob / ChoiceLogprobs; logprobs on the choice and streaming-chunk-choice models.
  • Config (settings.py, admin): SamplingSettings.top_logprobs_k server-side cap (default 20, clamped 0–20).
  • Engine path (logprobs.py, request.py, scheduler.py, engine/*): early top-K extraction from the full-vocab vector (frees the ~800 KB array), RequestOutput.logprobs, accumulation across collector merges, GenerationOutput.logprobs.
  • API serialization (server.py, api/utils.py): tokenizer-based token/bytes decoding; non-streaming attaches ChoiceLogprobs; streaming attaches aligned per-chunk logprobs.
  • Streaming alignment (api/thinking.py): logprob-aware thinking parser so per-chunk logprobs.content stays aligned to content tokens through tag-lookahead buffering (fixes a <-in-content drift). Text-only path is byte-identical.

Design decisions

  • Token order under thinking/tools: raw generation order (content tokens).
  • top_logprobs cap: OpenAI's 0–20, plus a configurable server cap (top_logprobs_k, default 20); over-cap requests clamp.
  • Speculative decoding: MTP and native DFlash report the proposing head's distribution, not the target model's, so logprobs are suppressed (null) there for now (DFlash fallback to the standard engine returns them).
  • Performance: all logprobs work is gated behind the per-request flag; the disabled path is byte-for-byte unchanged.
  • Invalid params return 422 via the repo's existing OpenAI-formatted RequestValidationError handler (house convention).

Verification

  • Unit + integration: new tests in tests/test_logprobs_extraction.py and tests/integration/test_logprobs_api.py (extraction, formatter, thinking-parser alignment incl. the a<b case, non-streaming shape, streaming alignment, validation, server-cap clamp, disabled-path-forwards-no-kwargs). Full affected suite green (800+ tests).
  • Live, real model (Qwen3.6-27B-UD-MLX-4bit):
    • Non-streaming logprobs:true, top_logprobs:2 → populated content with real top_logprobs and correct UTF-8 bytes.
    • Alignment: len(logprobs.content) == completion_tokens (99 == 99).
    • Streaming: per-chunk logprobs, aligned to content tokens.
    • Perf (128 tokens, temp 0): logprobs off ≈ 12.5 tok/s, on/top5 ≈ 12.9, on/top20 ≈ 11.9 — negligible overhead; off-path unaffected.

Limitations / follow-ups

  • Streaming + active tool-call filtering omits logprobs (the filter buffers
    independently); non-streaming with tools returns them.
  • True logprobs for MTP/DFlash speculative decoding.

Test plan

  • pytest tests/test_logprobs_extraction.py tests/integration/test_logprobs_api.py
  • Real-model curl smoke test (non-streaming + streaming)
  • Perf check (off vs on)
  • Reviewer: confirm 422 (vs OpenAI 400) is acceptable for invalid params

popfido added 10 commits June 1, 2026 21:27
)

Temporary tracking spec for the /v1/chat/completions logprobs feature.
Lives on feat/openai-logprobs-1549 during development; to be removed
before the PR is opened.
…ig (jundot#1549)

Phase 1 of OpenAI-style chat logprobs (spec: docs/specs/openai-logprobs-1549.md).
- ChatCompletionRequest: logprobs/top_logprobs + validator (range 0-20, requires logprobs)
- response models TopLogprob/ChatCompletionTokenLogprob/ChoiceLogprobs; logprobs
  field on ChatCompletionChoice and ChatCompletionChunkChoice
- SamplingSettings.top_logprobs_k launch cap (default 20, clamped 0-20) + admin get/set
- No behavior change yet; the logprobs-disabled path is untouched.
…#1549)

Phase 2a of OpenAI-style chat logprobs (spec: docs/specs/openai-logprobs-1549.md).
- request.py: TokenLogprob dataclass + RequestOutput.logprobs field
- logprobs.py: extract_token_logprob() reduces the full-vocab vector to the
  chosen-token logprob plus top-K (ids/logprobs sorted desc), then the caller
  frees the array
- scheduler.py: extract on the standard decode path only when requested; the
  logprobs-disabled path stays a plain discard with no added compute/transfer
…lash (jundot#1549)

Phase 2b of OpenAI-style chat logprobs (spec: docs/specs/openai-logprobs-1549.md).
- base.GenerationOutput gains a logprobs field
- output_collector: accumulate per-token logprobs across merged steps (the
  non-streaming generate() path relies on this single-slot merge)
- batched/vlm: thread logprobs/top_logprobs into SamplingParams and map
  output.logprobs onto GenerationOutput for both generate() and stream_generate()
- scheduler: suppress logprobs on speculative paths (VLM MTP shim or model.mtp
  head) — those report the proposing head's distribution, not the target model's
- DFlash native path leaves logprobs unwired (-> null); its fallback delegates to
  batched/vlm which handle logprobs correctly
Phase 3 of OpenAI-style chat logprobs (spec: docs/specs/openai-logprobs-1549.md).
- api/utils.py: build_choice_logprobs() / token_logprob_to_openai() — decode
  token text + utf-8 bytes via the tokenizer (ported from mlx-vlm)
- server.py: forward logprobs/top_logprobs into chat_kwargs, clamping the
  requested top_logprobs to the server cap (SamplingSettings.top_logprobs_k)
- non-streaming: attach ChoiceLogprobs to the chat choice (D1 raw order)
- streaming: attach per-step logprobs to content-delta chunks (D4 content-only;
  exact for plain text, best-effort under thinking/tool buffering)
- disabled path untouched (fields only set when logprobs requested)
jundot#1549)

Phase 3b: guarantee OpenAI-style per-chunk logprob/content-token alignment.
- ThinkingParser: logprob-aware feed_with_logprob/finish_with_logprob carry
  per-char logprobs through tag-lookahead buffering and emit one entry per
  content-contributing token; text-only feed/finish stay byte-identical
- TokenLogprob.text (set by the scheduler) lets the stream feed the parser
  per token, so alignment holds even when the collector batches tokens
- stream_chat_completion: per-token feed + aligned content-chunk logprobs;
  omit logprobs when the tool-call filter is active (buffers independently)
  rather than risk misalignment
- fixes the '<'-in-content drift (e.g. code/HTML) that dropped a token entry
…t#1549)

Phase 4 of OpenAI-style chat logprobs.
- tests/integration/test_logprobs_api.py: TestClient + mock engine covering
  non-streaming shape, streaming content-token alignment (the '<' drift case),
  logprobs-off omission, 422 validation, server-cap clamp, and the perf
  invariant (no logprob kwargs forwarded when disabled)
- spec: record implementation status + deviations (422 vs 400, tools-streaming
  omit, perf-gate note)
Mechanical ruff-driven cleanup, no behavior change:
- openai_models.py: Optional[X]->X|None, List/Dict->list/dict, Union->|,
  drop now-unused typing imports
- output_collector.py: remove unused field/List imports
Both files are now ruff-clean against the project UP ruleset.
popfido added 2 commits June 2, 2026 09:59
…ream (jundot#1549)

CI failure: two streaming-abort tests mock the inner engine output with a
minimal SimpleNamespace that lacks 'logprobs'. Use getattr(output, 'logprobs',
None) at the GenerationOutput construction sites (generate + stream_generate,
batched + vlm) so the engine tolerates duck-typed outputs. Production
RequestOutput always carries the field — no behavior change.
…p wiring (jundot#1549)

- server.py streaming: rely on build_choice_logprobs() returning None on empty
  (content_lps is already gated by want_logprobs) instead of the repeated
  '... if content_lps else None'. Non-streaming keeps its request-intent gate
  (a test pins that logprobs are never serialized unless requested, regardless
  of engine output).
- admin/routes.py: remove the top_logprobs_k admin get/set wiring; the cap
  still round-trips via settings.json (default 20) — smaller surface.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant