[Frontend] Add x-vllm-* response headers for per-request stats by vrdn-23 · Pull Request #42198 · vllm-project/vllm

vrdn-23 · 2026-05-10T03:35:49Z

Summary

Adds an opt-in --enable-request-stats-headers flag that attaches per-request timing and token-count headers to non-streaming OpenAI responses (chat, completion, responses):

x-vllm-total-time, x-vllm-queue-time, x-vllm-prefill-time,
x-vllm-decode-time, x-vllm-inference-time,
x-vllm-prompt-tokens, x-vllm-completion-tokens,
x-vllm-cached-tokens, x-vllm-time-per-output-token

Single source of truth. The timing intervals are computed in exactly one place — IterationStats.update_from_finished_request — which now returns the FinishedRequestStats it builds. Both the existing Prometheus path and the new headers middleware consume the same object; the middleware does no arithmetic, only formatting. This is a deliberate refactor on top of my earlier prototype: the previous version recomputed intervals inside the middleware, creating a third copy of the same math.
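As a rough illustration of "formatting only", the sketch below maps an already-computed stats object to header strings. The `FinishedStatsSketch` dataclass, its field names, the six-decimal formatting, and `format_stats_headers` are assumptions made for this example; the real `FinishedRequestStats` fields and the PR's formatting may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinishedStatsSketch:
    """Hypothetical stand-in for FinishedRequestStats (field names assumed)."""

    e2e_latency: float
    queued_time: float
    prefill_time: float
    decode_time: float
    inference_time: float
    num_prompt_tokens: int
    num_generation_tokens: int
    num_cached_tokens: int
    mean_time_per_output_token: float


def format_stats_headers(stats: FinishedStatsSketch) -> dict[str, str]:
    """Turn precomputed values into header strings; no interval math here."""
    return {
        "x-vllm-total-time": f"{stats.e2e_latency:.6f}",
        "x-vllm-queue-time": f"{stats.queued_time:.6f}",
        "x-vllm-prefill-time": f"{stats.prefill_time:.6f}",
        "x-vllm-decode-time": f"{stats.decode_time:.6f}",
        "x-vllm-inference-time": f"{stats.inference_time:.6f}",
        "x-vllm-prompt-tokens": str(stats.num_prompt_tokens),
        "x-vllm-completion-tokens": str(stats.num_generation_tokens),
        "x-vllm-cached-tokens": str(stats.num_cached_tokens),
        "x-vllm-time-per-output-token": f"{stats.mean_time_per_output_token:.6f}",
    }
```

Because the function only reads fields and formats them, any change to how intervals are computed stays in the single place that builds the stats object.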

Headers, not body. Putting metrics in HTTP headers (rather than extending the response body) keeps the OpenAI response schema unchanged, so strict client validators (OpenAI SDK in some configurations, Pydantic-based proxies) don't need to be updated.

Streaming and error responses unchanged. The middleware is a no-op when request_metadata.finished_stats is None, which covers streaming (no terminal RequestOutput), errors, and non-OpenAI routes.
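The no-op behaviour can be pictured with a pure-ASGI wrapper. This is a sketch under assumptions, not the PR's middleware: the class name and the convention of reading preformatted headers from `scope["state"]["finished_stats_headers"]` are invented here, whereas the PR describes checking `request_metadata.finished_stats`.

```python
class StatsHeadersSketch:
    """Hypothetical pure-ASGI wrapper that adds headers only when stats exist."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Non-HTTP traffic (lifespan, websocket) passes through untouched.
            await self.app(scope, receive, send)
            return

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                # Assumed convention: the handler stores a dict of header
                # strings here, or leaves it unset for streaming, errors,
                # and routes that produce no finished stats.
                extra = scope.get("state", {}).get("finished_stats_headers")
                if extra:
                    headers = list(message.get("headers", []))
                    headers += [
                        (name.encode("latin-1"), value.encode("latin-1"))
                        for name, value in extra.items()
                    ]
                    message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_wrapper)
```

When nothing is stored in the state, the response start message is forwarded exactly as the application produced it.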

Bonus refactor: extract `FinishReason` to a leaf module

The first commit needed RequestResponseMetadata to carry a FinishedRequestStats field. That ran into a Pydantic forward-reference resolution failure: FinishedRequestStats lives in vllm.v1.metrics.stats and annotates finish_reason: "FinishReason", but FinishReason was defined in vllm.v1.engine.__init__ — which already imports from vllm.v1.metrics.stats at runtime. stats.py hid the FinishReason import under TYPE_CHECKING to break that cycle, which left the forward-ref string unresolvable when Pydantic introspected the dataclass.

The first commit worked around this with a model_rebuild(_types_namespace={"FinishReason": _FinishReason}) call at the bottom of protocol.py. The second commit fixes the structural issue properly:

New leaf module vllm/v1/finish_reason.py holds FinishReason and FINISH_REASON_STRINGS. It imports nothing else from vllm.v1.*.
vllm/v1/engine/__init__.py re-exports the symbols, so every existing from vllm.v1.engine import FinishReason keeps working unchanged. Class identity is preserved (re-export, not redefinition).
stats.py imports FinishReason at module top level. The TYPE_CHECKING dodge is gone.
The model_rebuild block in protocol.py is deleted; the forward reference resolves naturally.
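The forward-reference failure can be reproduced in miniature with the standard library. In this sketch `typing.get_type_hints` stands in for Pydantic's annotation resolution (an analogy, not Pydantic's actual code path), and `StatsSketch`, `resolve`, and the enum members are illustrative names.

```python
import enum
import typing
from dataclasses import dataclass


class FinishReason(enum.IntEnum):
    """Illustrative enum; the member names are assumptions."""

    STOP = 0
    LENGTH = 1


@dataclass
class StatsSketch:
    # A string annotation, as when the real import is hidden under TYPE_CHECKING.
    finish_reason: "FinishReason"


def resolve(namespace: dict) -> dict:
    """Resolve StatsSketch's annotations against an explicit namespace only."""
    return typing.get_type_hints(StatsSketch, globalns=namespace, localns={})


try:
    resolve({})  # the name is not importable at runtime: resolution fails
except NameError as exc:
    print("unresolved:", exc)

# Supplying the name by hand mirrors the spirit of the model_rebuild workaround.
print(resolve({"FinishReason": FinishReason}))
```

Moving the enum to a module with no dependencies means the defining module can import it normally, so no namespace needs to be injected afterwards.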

Net: −1 line, 4 files touched, no behavior change. This makes future consumers of FinishedRequestStats (loggers, tracers, RPC schemas) "just work" without needing the rebuild trick.

Why this isn't a duplicate

Supersedes [Feature]: Per-Request Timing Headers (--enable-request-stats-headers) #38572 (also mine). That PR is the same feature without the single-source refactor; it has been closed in favor of this one. The old branch's build_request_stats_headers recomputed the five timing intervals inside the middleware; this version reuses FinishedRequestStats directly.
Different from feat(openai): add per-request timing metrics and completion_tokens_de… #36383 (DRAFT, last touched 2026-04-08). That PR adds a metrics field to the ChatCompletionResponse body and bundles a completion_tokens_details.reasoning_tokens change. This PR is headers-only and makes no body-schema changes. The two could coexist; this one targets the "don't change response schemas" use case.

Test Plan

# Unit + middleware tests (no GPU)
.venv/bin/python -m pytest \
    tests/entrypoints/openai/test_request_stats_headers.py \
    tests/v1/metrics/test_stats.py -v

# Live smoke test
vllm serve facebook/opt-125m --enable-request-stats-headers --port 8000
curl -is http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"facebook/opt-125m","prompt":"Hello","max_tokens":10}' \
    | grep -i '^x-vllm-'

Manual verification (Linux + CUDA):

All 9 x-vllm-* headers present on non-streaming responses; values internally consistent (prefill + decode ≈ inference, queue + inference + overhead = total).
Streaming requests ("stream": true) confirmed to NOT emit x-vllm-* headers.
Server started without the flag confirmed to NOT emit headers.
Prefix-cache hits (x-vllm-cached-tokens > 0) confirmed on repeated identical prompts past the block boundary.
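The consistency check described above could be automated on the client side roughly as follows. This is a sketch: `parse_vllm_stats`, `is_self_consistent`, and the 5% tolerance are assumptions, and the check is unit-agnostic because the header units are not stated here.

```python
import math


def parse_vllm_stats(headers: dict[str, str]) -> dict[str, float]:
    """Collect x-vllm-* headers (case-insensitive names) as floats."""
    prefix = "x-vllm-"
    return {
        name.lower()[len(prefix):]: float(value)
        for name, value in headers.items()
        if name.lower().startswith(prefix)
    }


def is_self_consistent(stats: dict[str, float], rel_tol: float = 0.05) -> bool:
    """Check prefill + decode ≈ inference and queue + inference <= total."""
    parts = stats["prefill-time"] + stats["decode-time"]
    inference_ok = math.isclose(parts, stats["inference-time"], rel_tol=rel_tol)
    # The total also includes unspecified overhead, so only a lower bound
    # is checked (with a tiny epsilon for float rounding).
    lower_bound = stats["queue-time"] + stats["inference-time"]
    total_ok = lower_bound <= stats["total-time"] + 1e-9
    return inference_ok and total_ok
```

Feeding it the header mapping from any HTTP client response gives a quick pass/fail signal for the manual checks listed above.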

Test Results

16 / 16 tests passed locally on Linux (5 new headers tests + 11 existing metrics tests, including a new assertion that update_from_finished_request returns the same object it appends).
pre-commit run clean on all touched files (ruff check, ruff format, mypy-3.10, SPDX headers, all other hooks).
Live smoke test on Linux + CUDA: facebook/opt-125m serving the /v1/completions endpoint with --enable-request-stats-headers returns all 9 headers with self-consistent values.

AI assistance

This change was developed with AI assistance (Claude Opus 4.7 via Claude Code). Every line was reviewed by me. I ran the test suite locally and verified the live behavior on my own hardware before opening this PR.

🤖 Generated with Claude Code

Adds an opt-in --enable-request-stats-headers flag that attaches per-request timing and token-count headers to non-streaming OpenAI responses (chat, completion, responses).

The timing intervals (queue, prefill, decode, inference, e2e) and mean time per output token are computed in exactly one place: IterationStats.update_from_finished_request, which now returns the FinishedRequestStats it builds. Both the Prometheus path and the new headers middleware consume the same object - the middleware performs no arithmetic, only formatting.

Headers added (all opt-in via --enable-request-stats-headers):

x-vllm-total-time, x-vllm-queue-time, x-vllm-prefill-time,
x-vllm-decode-time, x-vllm-inference-time,
x-vllm-prompt-tokens, x-vllm-completion-tokens,
x-vllm-cached-tokens, x-vllm-time-per-output-token

Streaming responses are unchanged. Error responses are unchanged.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request introduces per-request timing and compute statistics as x-vllm-* response headers for non-streaming OpenAI-compatible requests, controlled by a new --enable-request-stats-headers flag. The implementation includes a FastAPI middleware, updates to the V1 engine's output processing to capture FinishedRequestStats, and propagation of these stats through the OpenAI serving layer. Feedback indicates that finished_stats should also be integrated into PoolingRequestOutput to support embedding endpoints. Furthermore, it is recommended to calculate and attach these statistics before the RequestOutput is queued to avoid potential timing issues in asynchronous environments.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e725b27b42

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…d workaround

The previous commit added a `RequestResponseMetadata.model_rebuild()` call at the bottom of `protocol.py` to repair a Pydantic forward-reference resolution failure. The root cause was structural: `FinishedRequestStats` lives in `vllm.v1.metrics.stats` and annotates `finish_reason: "FinishReason"`, but `FinishReason` was defined in `vllm.v1.engine.__init__` - which already imports `vllm.v1.metrics.stats` at runtime. To break that cycle, `stats.py` hid the import under `TYPE_CHECKING`, which left the forward-reference string unresolvable when Pydantic tried to introspect the dataclass.

This change moves `FinishReason` and `FINISH_REASON_STRINGS` into a new leaf module `vllm/v1/finish_reason.py` that imports nothing else from `vllm.v1.*`. `vllm/v1/engine/__init__.py` re-exports the symbols so every existing `from vllm.v1.engine import FinishReason` keeps working unchanged. The class identity is preserved (re-export, not redefinition), so isinstance checks and enum value comparisons continue to work.

`stats.py` can now import `FinishReason` at module top level without circularity, and the `model_rebuild` block in `protocol.py` is deleted - the forward reference resolves naturally.

Net: -1 line, 4 files touched, no behavior change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>

…lidation, skip multi-prompt headers

Three changes addressing PR vllm-project#42198 review feedback:

1. Reorder _update_stats_from_finished before make_request_output in the output processor. The previous version attached finished_stats to the already-built RequestOutput after the fact; in AsyncLLM that meant the queued output was momentarily stale. The reorder is safe because update_from_finished_request only reads req_state.stats, which is already finalized by _update_stats_from_output earlier in the same iteration. Removes the post-attach hack and the misleading comment claiming a side-effect ordering constraint.

2. Reject --enable-request-stats-headers + --disable-log-stats at startup. The two flags are silently incompatible: when log_stats is off, req_state.stats is None and finished_stats is never produced, so the middleware becomes a permanent no-op. Fail loudly instead.

3. Skip emitting x-vllm-* headers for multi-prompt batched /v1/completions requests. The previous code reported only the last prompt's stats with a comment flagging the limitation; that's misleading. Per-prompt FinishedRequestStats can't be meaningfully aggregated (queue/prefill/decode intervals are per-prompt), so we skip headers entirely when len(final_res_batch) > 1. Single-prompt requests are unchanged.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
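Changes 2 and 3 from the commit message above can be sketched with the standard library alone. The two flag names come from the PR; the parser, the function names, and the error wording are hypothetical and are not vLLM's actual argument-handling code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Toy parser carrying only the two flags discussed in the commit."""
    parser = argparse.ArgumentParser(prog="serve-sketch")
    parser.add_argument("--enable-request-stats-headers", action="store_true")
    parser.add_argument("--disable-log-stats", action="store_true")
    return parser


def validate_stats_flags(args: argparse.Namespace) -> None:
    """Fail at startup instead of letting the headers silently never appear."""
    if args.enable_request_stats_headers and args.disable_log_stats:
        raise ValueError(
            "--enable-request-stats-headers cannot be combined with "
            "--disable-log-stats: no per-request stats would be produced"
        )


def should_emit_headers(final_res_batch: list) -> bool:
    """Skip headers for multi-prompt batches, as described in change 3."""
    return not len(final_res_batch) > 1
```

Validating at startup surfaces the misconfiguration once, rather than leaving operators to wonder why every response lacks the headers.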

vrdn-23 · 2026-05-10T04:18:00Z

@codex

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 72507ce31d

chatgpt-codex-connector · 2026-05-10T04:21:43Z

            num_cached_tokens=self.num_cached_tokens,
            metrics=self.stats,
            prompt_routed_experts=prompt_routed_experts,
+            finished_stats=self.finished_stats,


Aggregate finished stats for parallel sampling

For non-streaming requests with n > 1, ParentRequest.get_outputs() only emits a combined RequestOutput when the last child finishes, but this line copies finished_stats from just that one child request state. The OpenAI serving layers then use final_res.finished_stats for x-vllm-* headers, so headers like x-vllm-completion-tokens and timing fields reflect only the last branch instead of the whole API request (while usage is aggregated across choices). This produces misleading per-request stats whenever parallel sampling is used.
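To make the discrepancy concrete, here is a toy illustration with made-up numbers. `ChildSketch` and both helper functions are hypothetical and do not mirror vLLM's `ParentRequest` internals.

```python
from dataclasses import dataclass


@dataclass
class ChildSketch:
    """Hypothetical per-child record for a request sampled with n > 1."""

    completion_tokens: int


def last_child_tokens(children: list[ChildSketch]) -> int:
    """What a header would report if only the last child's stats are copied."""
    return children[-1].completion_tokens


def aggregated_tokens(children: list[ChildSketch]) -> int:
    """What a usage field aggregated across all choices would report."""
    return sum(child.completion_tokens for child in children)
```

With three children producing 10, 12, and 7 tokens, the first helper yields 7 while the second yields 29, which is the mismatch the comment describes.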

cjackal · 2026-05-10T05:11:07Z

Just a question: what are the benefits of receiving per-request stats via custom response headers over collecting them via distributed tracing? There are already so-called AI gateways (smg, envoy AI gateway, litellm proxy, to name a few) whose whole purpose is to collect per-request stats and transform them into business-standard trace protocols, and vllm integrates well with these. Is there a specific use case that distributed tracing does not cover well?

vrdn-23 · 2026-05-10T19:30:45Z

@cjackal Thanks for the interest in the PR!

So I am aware of a use-case where we have OTel data available as spans and they can be queried using Grafana. I think the value in adding headers is based on the fact that this gives us an easy in-line access path, which opens up much more configurability for proxies that are being run in front of vLLM deployments.

In our use-case, we have our own internal implementation of an AI Gateway, where we want to implement latency-aware load balancing and also have cost-based rate limiting based on the inference time taken by the request. Since e2e wall time bundles queue waiting into the same number as compute, it's the wrong basis for cost attribution when multiple teams share a model: they'd be charged for each other's queueing. Exposing the headers lets us calculate cost and make routing decisions inline, without going through an OTel pipeline lookup that carries span-export lag (typically several seconds before a finished request's span is queryable) and a separate per-request query to the trace backend. It also helps us return values like per-request-cost in real time without having to thread that through the OTel pipeline.

For cross-service spans, audit, post-hoc analysis, OTel is clearly the right tool. I would think this to be a complementary source of the same data, without it being a replacement. Hope this makes sense and happy to answer any questions further!
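A gateway-side sketch of that cost attribution follows. It assumes the timing headers carry seconds as decimal strings (the units are not stated in this thread) and uses a made-up `rate_per_second`; `inference_cost` and `wall_time_cost` are hypothetical helper names.

```python
def inference_cost(headers: dict[str, str], rate_per_second: float) -> float:
    """Bill on x-vllm-inference-time only, so queue wait is not charged."""
    return float(headers["x-vllm-inference-time"]) * rate_per_second


def wall_time_cost(headers: dict[str, str], rate_per_second: float) -> float:
    """Bill on x-vllm-total-time, which also includes time spent queued."""
    return float(headers["x-vllm-total-time"]) * rate_per_second
```

For a request that spent most of its wall time queued behind another team's traffic, the two functions diverge, and only the first reflects compute actually consumed.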

…-headers # Conflicts: # vllm/entrypoints/openai/completion/serving.py # vllm/outputs.py # vllm/v1/engine/output_processor.py

vrdn-23 · 2026-05-19T17:26:14Z

@chaunceyjiang @DarkLight1337 @russellb @robertgshaw2-redhat
Any chance I could get a preliminary review on this PR?

vrdn-23 requested review from DarkLight1337, NickLucche, aarnphm, chaunceyjiang, markmc, njhill, robertgshaw2-redhat and russellb as code owners May 10, 2026 03:35

claude Bot reviewed May 10, 2026

vrdn-23 mentioned this pull request May 10, 2026

[Feature]: Per-Request Timing Headers (--enable-request-stats-headers) #38572

Closed

5 tasks

mergify Bot added frontend v1 labels May 10, 2026

gemini-code-assist Bot reviewed May 10, 2026

Comment thread vllm/v1/engine/output_processor.py

Comment thread vllm/v1/engine/output_processor.py Outdated

chatgpt-codex-connector Bot reviewed May 10, 2026

Comment thread vllm/entrypoints/openai/api_server.py

Comment thread vllm/entrypoints/openai/completion/serving.py Outdated

vrdn-23 and others added 3 commits May 9, 2026 21:02

Merge branch 'main' into vrdn-23/request-stats-headers

3b446c3

chatgpt-codex-connector Bot reviewed May 10, 2026

Merge branch 'main' into vrdn-23/request-stats-headers

b413ac7

markmc added this to Metrics & Tracing May 11, 2026

github-project-automation Bot moved this to Backlog in Metrics & Tracing May 11, 2026

vrdn-23 added 5 commits May 11, 2026 11:33

Merge branch 'main' into vrdn-23/request-stats-headers

961da6c

Merge branch 'main' into vrdn-23/request-stats-headers

ccf26b4

Merge branch 'main' into vrdn-23/request-stats-headers

5856977

Merge branch 'main' into vrdn-23/request-stats-headers

4ad0637

Merge branch 'main' into vrdn-23/request-stats-headers

42cbfcb

vrdn-23 added 4 commits May 14, 2026 18:58

Merge remote-tracking branch 'origin/main' into vrdn-23/request-stats…

aa84093

Merge branch 'main' into vrdn-23/request-stats-headers

2e89c87

Merge branch 'main' into vrdn-23/request-stats-headers

f96c2f5

Merge branch 'main' into vrdn-23/request-stats-headers

5298a95

Merge branch 'main' into vrdn-23/request-stats-headers

2976f6f

vrdn-23 requested a review from AndreasKaratzas as a code owner May 28, 2026 17:48

Merge branch 'main' into vrdn-23/request-stats-headers

576d27c

vrdn-23 wants to merge 16 commits into vllm-project:main from vrdn-23:vrdn-23/request-stats-headers
