[Feature]: Per-Request Timing Headers (--enable-request-stats-headers)#38572
[Feature]: Per-Request Timing Headers (--enable-request-stats-headers)#38572vrdn-23 wants to merge 50 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 04fe9a07dd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Code Review
This pull request introduces a feature to include per-request timing and compute statistics (such as queue time, inference time, and token counts) as HTTP response headers for non-streaming OpenAI-compatible endpoints. This functionality is toggled via a new CLI argument, enable_request_stats_headers. The review feedback identifies a critical missing null check in the header construction logic that could lead to an AttributeError and points out that the Responses API path currently fails to populate the usage metadata required for these headers to be generated.
04fe9a0 to
a618c95
Compare
|
I like this feature but don't have bandwidth to review at this time. Based on the description:
|
|
@benchislett Thank you for taking the time to chime in:
For streaming requests, my understanding is that because the HTTP headers are sent before the request starts streaming, we won't have access to send in the timing metrics until the final chunk is sent. Also since it's going to be continuous response sent back, we won't be able to re-send the headers with each chunk. I believe there is a concept of trailers (which are like HTTP headers sent after the response) which could work, but I would have to read up about it a bit more of how it would work and whether it's even widely supported in most cases.
That would be a nice addition but it might make sense for it to be a follow-up PR once we've landed on what this initial design would look like. I have toyed around with a couple approaches on how this should actually look like and one other approach I've considered is using the I'd be happy to contribute the spec-decoding PR once this is landed to ensure completeness of the feature! Let me know if there is anything else I can help answer! P.S- If possible, could you help put a |
|
marked as verified to unblock pre-commit. CI runs are expensive, let's wait for a second review from someone more experienced with our frontend |
|
Hi @vrdn-23, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
Hi @vrdn-23, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
Thanks for adding the pre-commit check @benchislett. I wanted to bump this to @noooop or @markmc based on the discussion I saw over at #39979 and wanted to know if either of you have a preference to this approach or think there should be a refactoring to have a single source of truth for the calculation of timings? Looking forward to hearing any thoughts/questions/feedback that you may have. |
|
I'm not an expert in observability, but I don't think we need to reinvent the OpenTelemetry tracker. |
I'm sorry. I'm not quite sure I understood. While the metrics themselves are available via OpenTelemetry, in order to make decisions based on load balancing and cost attribution we would need some kind of indication with the request response (that doesn't involve us scraping metrics). Or maybe did I misunderstand what you were trying to convey? |
|
The biggest issue with using Timing Headers is that they do not support streaming responses, which makes their use cases very limited. At the same time, I also think that adding metrics in the response body not a very good solution. I'm trying to find a better solution, but I haven't found one yet. Also, I'm not entirely sure whether the best practice for observability is to have the telemetry plane separated from the data plane or to have it together with the data plane. |
|
Just wanted to add to this issue a comment I made in another discussion that might be relevant and would be curious to hear your thoughts: |
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
|
Hi @vrdn-23, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
Closing in favor of #42198, which implements the same feature with a single-source-of-truth refactor: timing intervals are computed once in |
Purpose
Fixes #36189
Summary
Add opt-in
x-vllm-*HTTP response headers exposing per-request timing and compute stats on non-streaming completion responses. Controlled by--enable-request-stats-headers.Headers emitted:
x-vllm-total-timex-vllm-queue-timex-vllm-prefill-timex-vllm-decode-timex-vllm-inference-timex-vllm-tokens-per-secondx-vllm-prompt-tokensx-vllm-completion-tokensx-vllm-cached-tokensUse cases: operator observability dashboards, client-side latency optimization, load balancer routing decisions.
Design decisions
api_server.pyinjects headers after the serving layer completes. This avoids touching every router, automatically covers future endpoints, and keeps header logic in one place. The middleware is only registered when the flag is enabled (zero overhead when off).x-vllm-*prefix: Avoids collisions with proxies/CDNs that commonly use genericx-headers.RequestStateStatsruntime import inprotocol.py: Pydantic needs the type at runtime for model validation (arbitrary_types_allowed=True), so it cannot be behindTYPE_CHECKING.Changes
vllm/entrypoints/openai/request_stats_headers.py(new): Pure functionbuild_request_stats_headers+ sharedrequest_stats_headers_middleware. No FastAPI dependency in the header builder.vllm/entrypoints/openai/api_server.py: Register middleware conditionally when flag is set.vllm/entrypoints/openai/engine/protocol.py: Addrequest_statsandnum_cached_tokensfields toRequestResponseMetadata.vllm/entrypoints/openai/cli_args.py: Add--enable-request-stats-headersflag (default:False).vllm/entrypoints/openai/chat_completion/serving.py: Populaterequest_metadata.request_statsfromfinal_res.metrics.vllm/entrypoints/openai/completion/serving.py: Same, fromlast_final_res.metrics.vllm/entrypoints/openai/responses/serving.py: Populate stats fromcontext.last_output.metricsand mapResponseUsagefields intoUsageInfofor the middleware.Known limitations
x-vllm-total-time(wall-clock) is still correct.Both limitations are inherent in how
RequestStateStatsis scoped per-prompt/per-turn in the engine, not something that can be fixed at the serving layer without deeper changes.AI assistance was used (Claude). This is not duplicating any existing PR.
Test Plan
Starting the server
Testing a request
Test Result
Tested to see if there was any latency overhead with the addition of this middleware.
This PR
Main (baseline)
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)