[Frontend] Add x-vllm-* response headers for per-request stats#42198
vrdn-23 wants to merge 16 commits into
Adds an opt-in --enable-request-stats-headers flag that attaches per-request timing and token-count headers to non-streaming OpenAI responses (chat, completion, responses).

The timing intervals (queue, prefill, decode, inference, e2e) and mean time per output token are computed in exactly one place: IterationStats.update_from_finished_request, which now returns the FinishedRequestStats it builds. Both the Prometheus path and the new headers middleware consume the same object; the middleware performs no arithmetic, only formatting.

Headers added (all opt-in via --enable-request-stats-headers):
- x-vllm-total-time
- x-vllm-queue-time
- x-vllm-prefill-time
- x-vllm-decode-time
- x-vllm-inference-time
- x-vllm-prompt-tokens
- x-vllm-completion-tokens
- x-vllm-cached-tokens
- x-vllm-time-per-output-token

Streaming responses are unchanged. Error responses are unchanged.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
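The "no arithmetic, only formatting" split described above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `StatsSnapshot`, its field names, and `format_stats_headers` are hypothetical stand-ins, and the units (seconds) and six-decimal formatting are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StatsSnapshot:
    """Hypothetical stand-in for the precomputed per-request stats object."""

    total_time: float  # assumed to be seconds
    queue_time: float
    prefill_time: float
    decode_time: float
    inference_time: float
    prompt_tokens: int
    completion_tokens: int
    cached_tokens: int
    time_per_output_token: float


def format_stats_headers(stats: Optional[StatsSnapshot]) -> dict[str, str]:
    """Turn precomputed stats into header strings; nothing is derived here."""
    if stats is None:
        # No stats attached (e.g. streaming or error responses): emit nothing.
        return {}
    return {
        "x-vllm-total-time": f"{stats.total_time:.6f}",
        "x-vllm-queue-time": f"{stats.queue_time:.6f}",
        "x-vllm-prefill-time": f"{stats.prefill_time:.6f}",
        "x-vllm-decode-time": f"{stats.decode_time:.6f}",
        "x-vllm-inference-time": f"{stats.inference_time:.6f}",
        "x-vllm-prompt-tokens": str(stats.prompt_tokens),
        "x-vllm-completion-tokens": str(stats.completion_tokens),
        "x-vllm-cached-tokens": str(stats.cached_tokens),
        "x-vllm-time-per-output-token": f"{stats.time_per_output_token:.6f}",
    }
```

Keeping the formatter free of subtraction or division means any disagreement between a header and the corresponding Prometheus metric can only come from formatting, not from a second copy of the interval math.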
Code Review
This pull request introduces per-request timing and compute statistics as x-vllm-* response headers for non-streaming OpenAI-compatible requests, controlled by a new --enable-request-stats-headers flag. The implementation includes a FastAPI middleware, updates to the V1 engine's output processing to capture FinishedRequestStats, and propagation of these stats through the OpenAI serving layer. Feedback indicates that finished_stats should also be integrated into PoolingRequestOutput to support embedding endpoints. Furthermore, it is recommended to calculate and attach these statistics before the RequestOutput is queued to avoid potential timing issues in asynchronous environments.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e725b27b42
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…d workaround

The previous commit added a `RequestResponseMetadata.model_rebuild()` call at the bottom of `protocol.py` to repair a Pydantic forward-reference resolution failure. The root cause was structural: `FinishedRequestStats` lives in `vllm.v1.metrics.stats` and annotates `finish_reason: "FinishReason"`, but `FinishReason` was defined in `vllm.v1.engine.__init__` - which already imports `vllm.v1.metrics.stats` at runtime. To break that cycle, `stats.py` hid the import under `TYPE_CHECKING`, which left the forward-reference string unresolvable when Pydantic tried to introspect the dataclass.

This change moves `FinishReason` and `FINISH_REASON_STRINGS` into a new leaf module `vllm/v1/finish_reason.py` that imports nothing else from `vllm.v1.*`. `vllm/v1/engine/__init__.py` re-exports the symbols so every existing `from vllm.v1.engine import FinishReason` keeps working unchanged. The class identity is preserved (re-export, not redefinition), so isinstance checks and enum value comparisons continue to work.

`stats.py` can now import `FinishReason` at module top level without circularity, and the `model_rebuild` block in `protocol.py` is deleted - the forward reference resolves naturally.

Net: -1 line, 4 files touched, no behavior change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
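The forward-reference failure described in this commit can be reproduced in miniature with the standard library alone. The sketch below uses `typing.get_type_hints` as a stand-in for Pydantic's introspection (an assumption: Pydantic's own resolution machinery differs in detail), and `Stats` and `resolve_hints` are hypothetical names. An empty namespace plays the role of a module where the import sits under `TYPE_CHECKING`; passing the name explicitly plays the role of the `model_rebuild` workaround.

```python
import dataclasses
import enum
import typing


class FinishReason(enum.Enum):
    """Stand-in enum; the real one is the class the commit moves to a leaf module."""

    STOP = 0
    LENGTH = 1


@dataclasses.dataclass
class Stats:
    """Hypothetical dataclass annotated with a string forward reference."""

    finish_reason: "FinishReason"


def resolve_hints(namespace: dict) -> dict:
    """Resolve the string annotation against the given module-like namespace."""
    return typing.get_type_hints(Stats, globalns=namespace)


# Name missing at runtime (import hidden under TYPE_CHECKING): resolution fails.
try:
    resolve_hints({})
    unresolved = False
except NameError:
    unresolved = True

# Name supplied explicitly, or imported at module top level: resolution succeeds.
resolved = resolve_hints({"FinishReason": FinishReason})
```

Once the enum lives in a module that can be imported without a cycle, the name is simply present in the defining module's globals and no explicit namespace has to be passed.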
…lidation, skip multi-prompt headers

Three changes addressing PR vllm-project#42198 review feedback:

1. Reorder _update_stats_from_finished before make_request_output in the output processor. The previous version attached finished_stats to the already-built RequestOutput after the fact; in AsyncLLM that meant the queued output was momentarily stale. The reorder is safe because update_from_finished_request only reads req_state.stats, which is already finalized by _update_stats_from_output earlier in the same iteration. Removes the post-attach hack and the misleading comment claiming a side-effect ordering constraint.

2. Reject --enable-request-stats-headers + --disable-log-stats at startup. The two flags are silently incompatible: when log_stats is off, req_state.stats is None and finished_stats is never produced, so the middleware becomes a permanent no-op. Fail loudly instead.

3. Skip emitting x-vllm-* headers for multi-prompt batched /v1/completions requests. The previous code reported only the last prompt's stats with a comment flagging the limitation; that's misleading. Per-prompt FinishedRequestStats can't be meaningfully aggregated (queue/prefill/decode intervals are per-prompt), so we skip headers entirely when len(final_res_batch) > 1. Single-prompt requests are unchanged.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
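Changes 2 and 3 amount to two small guards, sketched below under stated assumptions. The function names, parameter names, and error message are hypothetical and are not taken from the PR; only the two rules (reject the flag combination, skip headers for anything other than a single prompt) come from the commit message.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def validate_stats_flags(
    enable_request_stats_headers: bool, disable_log_stats: bool
) -> None:
    """Fail at startup instead of silently never emitting headers."""
    if enable_request_stats_headers and disable_log_stats:
        # Hypothetical message; with stats logging off, no stats are produced.
        raise ValueError(
            "--enable-request-stats-headers requires stats logging; "
            "remove --disable-log-stats"
        )


def stats_for_headers(per_prompt_stats: Sequence[T]) -> Optional[T]:
    """Return stats only for a single-prompt request, otherwise None."""
    if len(per_prompt_stats) != 1:
        # Batched prompts: per-prompt intervals are not aggregated, so skip.
        return None
    return per_prompt_stats[0]
```

Returning `None` for the batched case lets the same "no stats, no headers" path that covers streaming and errors handle multi-prompt requests as well.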
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 72507ce31d
    num_cached_tokens=self.num_cached_tokens,
    metrics=self.stats,
    prompt_routed_experts=prompt_routed_experts,
    finished_stats=self.finished_stats,
Aggregate finished stats for parallel sampling
For non-streaming requests with n > 1, ParentRequest.get_outputs() only emits a combined RequestOutput when the last child finishes, but this line copies finished_stats from just that one child request state. The OpenAI serving layers then use final_res.finished_stats for x-vllm-* headers, so headers like x-vllm-completion-tokens and timing fields reflect only the last branch instead of the whole API request (while usage is aggregated across choices). This produces misleading per-request stats whenever parallel sampling is used.
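One possible aggregation policy for the n > 1 case is sketched below. It is only an illustration of the reviewer's point, not something the PR implements: `ChildStats`, `RequestTotals`, and `aggregate_children` are hypothetical, and the choice of "sum the tokens, take the slowest branch's wall time" is an assumption. Per-phase intervals (queue, prefill, decode) are deliberately left out because they have no obvious request-level meaning across branches.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ChildStats:
    """Hypothetical per-branch stats for one child of an n > 1 request."""

    completion_tokens: int
    total_time: float  # assumed seconds, measured from the shared arrival time


@dataclass(frozen=True)
class RequestTotals:
    """Hypothetical request-level view across all sampled branches."""

    completion_tokens: int
    total_time: float


def aggregate_children(children: Sequence[ChildStats]) -> RequestTotals:
    """Sum generated tokens; report the wall time of the slowest branch."""
    if not children:
        raise ValueError("need at least one finished child")
    return RequestTotals(
        completion_tokens=sum(c.completion_tokens for c in children),
        total_time=max(c.total_time for c in children),
    )
```

Summing tokens keeps the header in line with the aggregated `usage` block, while taking the maximum reflects that the API request is not complete until the last branch finishes.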
Just a question: what are the benefits of receiving per-request stats via custom response headers over collecting them via distributed tracing? There already are so-called AI gateways (smg, envoy AI gateway, litellm proxy to name a few) whose whole purpose is to collect per-request stats and transform them into business-standard trace protocols, and vllm integrates well with these. Is there a specific use case that distributed tracing does not cover well?
@cjackal Thanks for the interest in the PR! I am aware of use cases where OTel data is available as spans and can be queried using Grafana. The value in adding headers is that they give us an easy inline access path, which opens up much more configurability for proxies running in front of vLLM deployments.

In our use case, we have our own internal implementation of an AI gateway, where we want to implement latency-aware load balancing and cost-based rate limiting based on the inference time taken by the request. Since e2e wall time bundles queue waiting into the same number as compute, it's the wrong basis for cost attribution when multiple teams share a model: they'd be charged for each other's queueing. Exposing the headers lets us calculate cost and make routing decisions inline, without going through an OTel pipeline lookup that carries span-export lag (typically several seconds before a finished request's span is queryable) and a separate per-request query to the trace backend. It also helps us return values like …

For cross-service spans, audit, and post-hoc analysis, OTel is clearly the right tool. I see this as a complementary source of the same data, not a replacement. Hope this makes sense, and happy to answer any further questions!
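The inline cost attribution described in this reply could look roughly like the sketch below on the gateway side. It is a hedged illustration: `billable_seconds`, `request_cost`, and the rate parameter are hypothetical, the header value is assumed to be seconds as a decimal string, and the mapping is assumed to use lowercase header names.

```python
import math
from typing import Mapping, Optional


def billable_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Read inference time only, so queue wait is never billed."""
    raw = headers.get("x-vllm-inference-time")
    if raw is None:
        # Header absent (e.g. streaming or the flag is off): nothing to bill on.
        return None
    try:
        value = float(raw)
    except ValueError:
        return None
    if not math.isfinite(value) or value < 0:
        return None
    return value


def request_cost(
    headers: Mapping[str, str], rate_per_second: float
) -> Optional[float]:
    """Cost attributed to the caller; None when no usable header is present."""
    seconds = billable_seconds(headers)
    if seconds is None:
        return None
    return seconds * rate_per_second
```

Because the function ignores `x-vllm-queue-time` and `x-vllm-total-time`, a tenant whose request waited behind another team's traffic is charged only for its own compute.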
…-headers

# Conflicts:
#	vllm/entrypoints/openai/completion/serving.py
#	vllm/outputs.py
#	vllm/v1/engine/output_processor.py
@chaunceyjiang @DarkLight1337 @russellb @robertgshaw2-redhat
Summary
Adds an opt-in `--enable-request-stats-headers` flag that attaches per-request timing and token-count headers to non-streaming OpenAI responses (chat, completion, responses): `x-vllm-total-time`, `x-vllm-queue-time`, `x-vllm-prefill-time`, `x-vllm-decode-time`, `x-vllm-inference-time`, `x-vllm-prompt-tokens`, `x-vllm-completion-tokens`, `x-vllm-cached-tokens`, `x-vllm-time-per-output-token`.

Single source of truth. The timing intervals are computed in exactly one place, `IterationStats.update_from_finished_request`, which now returns the `FinishedRequestStats` it builds. Both the existing Prometheus path and the new headers middleware consume the same object; the middleware does no arithmetic, only formatting. This is a deliberate refactor on top of my earlier prototype: the previous version recomputed intervals inside the middleware, creating a third copy of the same math.

Headers, not body. Putting metrics in HTTP headers (rather than extending the response body) keeps the OpenAI response schema unchanged, so strict client validators (OpenAI SDK in some configurations, Pydantic-based proxies) don't need to be updated.

Streaming and error responses unchanged. The middleware is a no-op when `request_metadata.finished_stats is None`, which covers streaming (no terminal `RequestOutput`), errors, and non-OpenAI routes.

Bonus refactor: extract `FinishReason` to a leaf module

The first commit needed `RequestResponseMetadata` to carry a `FinishedRequestStats` field. That ran into a Pydantic forward-reference resolution failure: `FinishedRequestStats` lives in `vllm.v1.metrics.stats` and annotates `finish_reason: "FinishReason"`, but `FinishReason` was defined in `vllm.v1.engine.__init__`, which already imports from `vllm.v1.metrics.stats` at runtime. `stats.py` hid the `FinishReason` import under `TYPE_CHECKING` to break that cycle, which left the forward-ref string unresolvable when Pydantic introspected the dataclass.

The first commit worked around this with a `model_rebuild(_types_namespace={"FinishReason": _FinishReason})` call at the bottom of `protocol.py`. The second commit fixes the structural issue properly:

- `vllm/v1/finish_reason.py` holds `FinishReason` and `FINISH_REASON_STRINGS`. It imports nothing else from `vllm.v1.*`.
- `vllm/v1/engine/__init__.py` re-exports the symbols, so every existing `from vllm.v1.engine import FinishReason` keeps working unchanged. Class identity is preserved (re-export, not redefinition).
- `stats.py` imports `FinishReason` at module top level. The `TYPE_CHECKING` dodge is gone.
- The `model_rebuild` block in `protocol.py` is deleted; the forward reference resolves naturally.

Net: −1 line, 4 files touched, no behavior change. This makes future consumers of `FinishedRequestStats` (loggers, tracers, RPC schemas) "just work" without needing the rebuild trick.

Why this isn't a duplicate

- … `build_request_stats_headers` recomputed the five timing intervals inside the middleware; this version reuses `FinishedRequestStats` directly.
- … `metrics` field to `ChatCompletionResponse` body and bundles a `completion_tokens_details.reasoning_tokens` change. This PR is headers-only and makes no body-schema changes. The two could coexist; this one is targeted at the "don't change response schemas" use case.

Test Plan

Manual verification (Linux + CUDA):

- `x-vllm-*` headers present on non-streaming responses; values internally consistent (`prefill + decode ≈ inference`, `queue + inference + overhead = total`).
- … `"stream": true`) confirmed to NOT emit `x-vllm-*` headers.
- … `x-vllm-cached-tokens > 0`) confirmed on repeated identical prompts past the block boundary.

Test Results

- … `update_from_finished_request` returns the same object it appends).
- `pre-commit run` clean on all touched files (ruff check, ruff format, mypy-3.10, SPDX headers, all other hooks).
- `facebook/opt-125m` serving the `/v1/completions` endpoint with `--enable-request-stats-headers` returns all 9 headers with self-consistent values.

AI assistance
This change was developed with AI assistance (Claude Opus 4.7 via Claude Code). Every line was reviewed by me. I ran the test suite locally and verified the live behavior on my own hardware before opening this PR.
🤖 Generated with Claude Code