[Frontend] Add x-vllm-* response headers for per-request stats #42198

Open
vrdn-23 wants to merge 16 commits into vllm-project:main from vrdn-23:vrdn-23/request-stats-headers

Conversation

@vrdn-23
Contributor

@vrdn-23 vrdn-23 commented May 10, 2026

Summary

Adds an opt-in --enable-request-stats-headers flag that attaches per-request timing and token-count headers to non-streaming OpenAI responses (chat, completion, responses):

x-vllm-total-time, x-vllm-queue-time, x-vllm-prefill-time,
x-vllm-decode-time, x-vllm-inference-time,
x-vllm-prompt-tokens, x-vllm-completion-tokens,
x-vllm-cached-tokens, x-vllm-time-per-output-token

Single source of truth. The timing intervals are computed in exactly one place — IterationStats.update_from_finished_request — which now returns the FinishedRequestStats it builds. Both the existing Prometheus path and the new headers middleware consume the same object; the middleware does no arithmetic, only formatting. This is a deliberate refactor on top of my earlier prototype: the previous version recomputed intervals inside the middleware, creating a third copy of the same math.

Headers, not body. Putting metrics in HTTP headers (rather than extending the response body) keeps the OpenAI response schema unchanged, so strict client validators (OpenAI SDK in some configurations, Pydantic-based proxies) don't need to be updated.

Streaming and error responses unchanged. The middleware is a no-op when request_metadata.finished_stats is None, which covers streaming (no terminal RequestOutput), errors, and non-OpenAI routes.
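
As a rough illustration of the "formatting only, no arithmetic" contract described above, the sketch below maps an already-computed stats object to the nine header names and returns nothing when the stats are absent. FinishedStatsSketch, its field names, build_stats_headers, and the choice of seconds with six decimals are all assumptions made for illustration; none of them are taken from the PR's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FinishedStatsSketch:
    """Hypothetical stand-in for FinishedRequestStats; field names are assumed."""
    e2e_latency: float
    queued_time: float
    prefill_time: float
    decode_time: float
    inference_time: float
    num_prompt_tokens: int
    num_generation_tokens: int
    num_cached_tokens: int
    mean_time_per_output_token: float


def build_stats_headers(stats: Optional[FinishedStatsSketch]) -> dict:
    """Format precomputed stats as headers; return {} when there is nothing to emit."""
    if stats is None:
        # Mirrors the described no-op for streaming, errors, and other routes.
        return {}
    # Only string formatting happens here; every interval was computed upstream.
    return {
        "x-vllm-total-time": f"{stats.e2e_latency:.6f}",
        "x-vllm-queue-time": f"{stats.queued_time:.6f}",
        "x-vllm-prefill-time": f"{stats.prefill_time:.6f}",
        "x-vllm-decode-time": f"{stats.decode_time:.6f}",
        "x-vllm-inference-time": f"{stats.inference_time:.6f}",
        "x-vllm-prompt-tokens": str(stats.num_prompt_tokens),
        "x-vllm-completion-tokens": str(stats.num_generation_tokens),
        "x-vllm-cached-tokens": str(stats.num_cached_tokens),
        "x-vllm-time-per-output-token": f"{stats.mean_time_per_output_token:.6f}",
    }
```

Keeping the function free of subtraction or division is what lets the Prometheus path and the headers agree by construction.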

Bonus refactor: extract FinishReason to a leaf module

The first commit needed RequestResponseMetadata to carry a FinishedRequestStats field. That ran into a Pydantic forward-reference resolution failure: FinishedRequestStats lives in vllm.v1.metrics.stats and annotates finish_reason: "FinishReason", but FinishReason was defined in vllm.v1.engine.__init__ — which already imports from vllm.v1.metrics.stats at runtime. stats.py hid the FinishReason import under TYPE_CHECKING to break that cycle, which left the forward-ref string unresolvable when Pydantic introspected the dataclass.

The first commit worked around this with a model_rebuild(_types_namespace={"FinishReason": _FinishReason}) call at the bottom of protocol.py. The second commit fixes the structural issue properly:

  • New leaf module vllm/v1/finish_reason.py holds FinishReason and FINISH_REASON_STRINGS. It imports nothing else from vllm.v1.*.
  • vllm/v1/engine/__init__.py re-exports the symbols, so every existing from vllm.v1.engine import FinishReason keeps working unchanged. Class identity is preserved (re-export, not redefinition).
  • stats.py imports FinishReason at module top level. The TYPE_CHECKING dodge is gone.
  • The model_rebuild block in protocol.py is deleted; the forward reference resolves naturally.

Net: −1 line, 4 files touched, no behavior change. This makes future consumers of FinishedRequestStats (loggers, tracers, RPC schemas) "just work" without needing the rebuild trick.

Why this isn't a duplicate

Test Plan

# Unit + middleware tests (no GPU)
.venv/bin/python -m pytest \
    tests/entrypoints/openai/test_request_stats_headers.py \
    tests/v1/metrics/test_stats.py -v

# Live smoke test
vllm serve facebook/opt-125m --enable-request-stats-headers --port 8000
curl -is http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"facebook/opt-125m","prompt":"Hello","max_tokens":10}' \
    | grep -i '^x-vllm-'

Manual verification (Linux + CUDA):

  • All 9 x-vllm-* headers present on non-streaming responses; values internally consistent (prefill + decode ≈ inference, queue + inference + overhead = total).
  • Streaming requests ("stream": true) confirmed to NOT emit x-vllm-* headers.
  • Server started without the flag confirmed to NOT emit headers.
  • Prefix-cache hits (x-vllm-cached-tokens > 0) confirmed on repeated identical prompts past the block boundary.
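
The "internally consistent" check in the first bullet could be scripted along these lines. This is a minimal sketch: check_header_consistency and the 5% tolerance are assumptions, and it presumes all timing headers share one unit.

```python
import math


def check_header_consistency(headers: dict, rel_tol: float = 0.05) -> bool:
    """Check prefill + decode ~= inference and queue + inference <= total."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {k.lower(): v for k, v in headers.items()}

    def t(name: str) -> float:
        return float(h[f"x-vllm-{name}-time"])

    # prefill + decode should approximately equal inference.
    parts_ok = math.isclose(t("prefill") + t("decode"), t("inference"), rel_tol=rel_tol)
    # total also includes overhead, so queue + inference must not exceed it
    # (beyond the tolerance).
    total_ok = t("queue") + t("inference") <= t("total") * (1 + rel_tol)
    return parts_ok and total_ok
```

The header dictionary can come from any HTTP client; the function only needs name-to-string pairs.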

Test Results

  • 16 / 16 tests passed locally on Linux (5 new headers tests + 11 existing metrics tests, including a new assertion that update_from_finished_request returns the same object it appends).
  • pre-commit run clean on all touched files (ruff check, ruff format, mypy-3.10, SPDX headers, all other hooks).
  • Live smoke test on Linux + CUDA: facebook/opt-125m serving the /v1/completions endpoint with --enable-request-stats-headers returns all 9 headers with self-consistent values.

AI assistance

This change was developed with AI assistance (Claude Opus 4.7 via Claude Code). Every line was reviewed by me. I ran the test suite locally and verified the live behavior on my own hardware before opening this PR.

🤖 Generated with Claude Code

Adds an opt-in --enable-request-stats-headers flag that attaches
per-request timing and token-count headers to non-streaming OpenAI
responses (chat, completion, responses).

The timing intervals (queue, prefill, decode, inference, e2e) and
mean time per output token are computed in exactly one place:
IterationStats.update_from_finished_request, which now returns the
FinishedRequestStats it builds. Both the Prometheus path and the
new headers middleware consume the same object - the middleware
performs no arithmetic, only formatting.

Headers added (all opt-in via --enable-request-stats-headers):
  x-vllm-total-time, x-vllm-queue-time, x-vllm-prefill-time,
  x-vllm-decode-time, x-vllm-inference-time,
  x-vllm-prompt-tokens, x-vllm-completion-tokens,
  x-vllm-cached-tokens, x-vllm-time-per-output-token

Streaming responses are unchanged. Error responses are unchanged.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces per-request timing and compute statistics as x-vllm-* response headers for non-streaming OpenAI-compatible requests, controlled by a new --enable-request-stats-headers flag. The implementation includes a FastAPI middleware, updates to the V1 engine's output processing to capture FinishedRequestStats, and propagation of these stats through the OpenAI serving layer. Feedback indicates that finished_stats should also be integrated into PoolingRequestOutput to support embedding endpoints. Furthermore, it is recommended to calculate and attach these statistics before the RequestOutput is queued to avoid potential timing issues in asynchronous environments.

Comment thread vllm/v1/engine/output_processor.py
Comment thread vllm/v1/engine/output_processor.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e725b27b42

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread vllm/entrypoints/openai/api_server.py
Comment thread vllm/entrypoints/openai/completion/serving.py Outdated
vrdn-23 and others added 3 commits May 9, 2026 21:02
…d workaround

The previous commit added a `RequestResponseMetadata.model_rebuild()` call at
the bottom of `protocol.py` to repair a Pydantic forward-reference resolution
failure. The root cause was structural: `FinishedRequestStats` lives in
`vllm.v1.metrics.stats` and annotates `finish_reason: "FinishReason"`, but
`FinishReason` was defined in `vllm.v1.engine.__init__` - which already
imports `vllm.v1.metrics.stats` at runtime. To break that cycle, `stats.py`
hid the import under `TYPE_CHECKING`, which left the forward-reference string
unresolvable when Pydantic tried to introspect the dataclass.

This change moves `FinishReason` and `FINISH_REASON_STRINGS` into a new leaf
module `vllm/v1/finish_reason.py` that imports nothing else from `vllm.v1.*`.
`vllm/v1/engine/__init__.py` re-exports the symbols so every existing
`from vllm.v1.engine import FinishReason` keeps working unchanged. The class
identity is preserved (re-export, not redefinition), so isinstance checks and
enum value comparisons continue to work.

`stats.py` can now import `FinishReason` at module top level without
circularity, and the `model_rebuild` block in `protocol.py` is deleted - the
forward reference resolves naturally.

Net: -1 line, 4 files touched, no behavior change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
…lidation, skip multi-prompt headers

Three changes addressing PR vllm-project#42198 review feedback:

1. Reorder _update_stats_from_finished before make_request_output in the
   output processor. The previous version attached finished_stats to the
   already-built RequestOutput after the fact; in AsyncLLM that meant the
   queued output was momentarily stale. The reorder is safe because
   update_from_finished_request only reads req_state.stats, which is
   already finalized by _update_stats_from_output earlier in the same
   iteration. Removes the post-attach hack and the misleading comment
   claiming a side-effect ordering constraint.

2. Reject --enable-request-stats-headers + --disable-log-stats at startup.
   The two flags are silently incompatible: when log_stats is off,
   req_state.stats is None and finished_stats is never produced, so the
   middleware becomes a permanent no-op. Fail loudly instead.

3. Skip emitting x-vllm-* headers for multi-prompt batched /v1/completions
   requests. The previous code reported only the last prompt's stats with
   a comment flagging the limitation; that's misleading. Per-prompt
   FinishedRequestStats can't be meaningfully aggregated (queue/prefill/
   decode intervals are per-prompt), so we skip headers entirely when
   len(final_res_batch) > 1. Single-prompt requests are unchanged.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
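
Changes 2 and 3 in the commit message above can be sketched as two small guards. This is only an illustration: validate_stats_flags, stats_for_headers, and the error wording are hypothetical, and the real checks live wherever vLLM parses its arguments and builds the completion response.

```python
from typing import Optional, Sequence


def validate_stats_flags(enable_request_stats_headers: bool,
                         disable_log_stats: bool) -> None:
    """Fail at startup instead of silently never emitting headers."""
    if enable_request_stats_headers and disable_log_stats:
        # With stats logging off, no finished stats are ever produced,
        # so the headers middleware would be a permanent no-op.
        raise ValueError(
            "--enable-request-stats-headers cannot be combined with "
            "--disable-log-stats"
        )


def stats_for_headers(final_res_batch: Sequence[object]) -> Optional[object]:
    """Return stats only for single-prompt requests; otherwise skip headers."""
    # The commit skips when len > 1; an empty batch is also treated as
    # "nothing to report" in this sketch.
    if len(final_res_batch) != 1:
        return None
    return getattr(final_res_batch[0], "finished_stats", None)
```

Returning None for the multi-prompt case reuses the middleware's existing no-op path, so no separate "skip" signal is needed.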
@vrdn-23
Contributor Author

vrdn-23 commented May 10, 2026

@codex


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 72507ce31d

num_cached_tokens=self.num_cached_tokens,
metrics=self.stats,
prompt_routed_experts=prompt_routed_experts,
finished_stats=self.finished_stats,

P2 Badge Aggregate finished stats for parallel sampling

For non-streaming requests with n > 1, ParentRequest.get_outputs() only emits a combined RequestOutput when the last child finishes, but this line copies finished_stats from just that one child request state. The OpenAI serving layers then use final_res.finished_stats for x-vllm-* headers, so headers like x-vllm-completion-tokens and timing fields reflect only the last branch instead of the whole API request (while usage is aggregated across choices). This produces misleading per-request stats whenever parallel sampling is used.
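
To make the discrepancy concrete, the sketch below contrasts reading the token count from only the last child with summing over all children of a parallel-sampling request. ChildStatsSketch and both helper functions are hypothetical, and the sketch covers token counts only; how timing fields should be combined across branches is left open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChildStatsSketch:
    """Hypothetical per-child stats for one branch of an n > 1 request."""
    num_generation_tokens: int


def last_child_tokens(children: List[ChildStatsSketch]) -> int:
    # What the header would report if stats come from the last child only.
    return children[-1].num_generation_tokens


def aggregated_tokens(children: List[ChildStatsSketch]) -> int:
    # What a per-request count would be, matching aggregated usage.
    return sum(c.num_generation_tokens for c in children)


branches = [ChildStatsSketch(7), ChildStatsSketch(10), ChildStatsSketch(4)]
```

With these three branches the two helpers disagree, which is the mismatch between the header and the aggregated usage field that the comment points out.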

@cjackal
Contributor

cjackal commented May 10, 2026

Just a question: what are the benefits of receiving per-request stats via custom response headers rather than collecting them via distributed tracing? There are already so-called AI gateways (smg, Envoy AI Gateway, and LiteLLM proxy, to name a few) whose whole purpose is to collect per-request stats and transform them into business-standard trace protocols, and vLLM integrates well with these. Is there a specific use case that distributed tracing does not cover well?

@vrdn-23
Contributor Author

vrdn-23 commented May 10, 2026

@cjackal Thanks for the interest in the PR!

I am aware of the use case where OTel data is available as spans and can be queried using Grafana. The value of headers is that they give us an easy inline access path, which opens up much more configurability for proxies running in front of vLLM deployments.

In our case, we have our own internal implementation of an AI gateway, where we want to implement latency-aware load balancing and cost-based rate limiting based on the inference time taken by each request. Since e2e wall time bundles queue waiting into the same number as compute, it's the wrong basis for cost attribution when multiple teams share a model: they'd be charged for each other's queueing. Exposing the headers lets us calculate cost and make routing decisions inline, without an OTel pipeline lookup that carries span-export lag (typically several seconds before a finished request's span is queryable) and a separate per-request query to the trace backend. It also helps us return values like per-request cost in real time without having to thread them through the OTel pipeline.

For cross-service spans, audit, and post-hoc analysis, OTel is clearly the right tool. I see this as a complementary source of the same data, not a replacement. Hope this makes sense, and happy to answer any further questions!
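
The cost-attribution argument above might look like this on the gateway side. It is a minimal sketch: inference_cost and rate_per_second are hypothetical, and the header value is assumed to be in seconds.

```python
from typing import Mapping, Optional


def inference_cost(headers: Mapping[str, str],
                   rate_per_second: float) -> Optional[float]:
    """Charge for compute time only, ignoring time spent waiting in the queue."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    raw = lowered.get("x-vllm-inference-time")
    if raw is None:
        # Streaming, error, or flag-off responses carry no stats headers.
        return None
    return float(raw) * rate_per_second
```

Billing on x-vllm-inference-time rather than x-vllm-total-time is what keeps one team's queueing from showing up in another team's cost.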

@vrdn-23
Contributor Author

vrdn-23 commented May 19, 2026

@chaunceyjiang @DarkLight1337 @russellb @robertgshaw2-redhat
Any chance I could get a preliminary review on this PR?

@vrdn-23 vrdn-23 requested a review from AndreasKaratzas as a code owner May 28, 2026 17:48
Projects

Status: Backlog

3 participants