[Feature]: Per-Request Timing Headers (--enable-request-stats-headers) #38572

Closed
vrdn-23 wants to merge 50 commits into vllm-project:main from vrdn-23:vrdn-23/add-reponse-headers

Conversation

@vrdn-23
Contributor

@vrdn-23 vrdn-23 commented Mar 30, 2026

Purpose

Fixes #36189

Summary

Add opt-in x-vllm-* HTTP response headers exposing per-request timing and compute stats on non-streaming completion responses. Controlled by --enable-request-stats-headers.

Headers emitted:

Header                     Description
x-vllm-total-time          Wall-clock ms, arrival to response
x-vllm-queue-time          Ms waiting to be scheduled
x-vllm-prefill-time        Ms from scheduled to first token
x-vllm-decode-time         Ms from first to last token
x-vllm-inference-time      Ms from scheduled to last token
x-vllm-tokens-per-second   Decode throughput
x-vllm-prompt-tokens       Input token count
x-vllm-completion-tokens   Output token count
x-vllm-cached-tokens       KV cache hits

Use cases: operator observability dashboards, client-side latency optimization, load balancer routing decisions.
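
As a rough illustration of the client-side use case, the sketch below collects the x-vllm-* headers from a response into numbers. Only the header names come from the list above; the helper name parse_vllm_stats and the choice to silently skip missing or malformed values are assumptions, not part of this PR.

```python
from __future__ import annotations

from typing import Mapping

# Header names are taken from the PR description; the rest is illustrative.
_FLOAT_HEADERS = (
    "x-vllm-total-time",
    "x-vllm-queue-time",
    "x-vllm-prefill-time",
    "x-vllm-decode-time",
    "x-vllm-inference-time",
    "x-vllm-tokens-per-second",
)
_INT_HEADERS = (
    "x-vllm-prompt-tokens",
    "x-vllm-completion-tokens",
    "x-vllm-cached-tokens",
)


def parse_vllm_stats(headers: Mapping[str, str]) -> dict[str, float | int]:
    """Hypothetical helper: collect x-vllm-* stats from response headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    casts = [(n, float) for n in _FLOAT_HEADERS] + [(n, int) for n in _INT_HEADERS]
    stats: dict[str, float | int] = {}
    for name, cast in casts:
        raw = lowered.get(name)
        if raw is None:
            continue  # header absent (e.g. flag disabled or streaming response)
        try:
            stats[name] = cast(raw)
        except ValueError:
            continue  # skip malformed values instead of failing the caller
    return stats
```

A load balancer or dashboard could call this on every response and ignore requests for which the returned dict is empty.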

Design decisions

  • Middleware, not per-router: A single HTTP middleware in api_server.py injects headers after the serving layer completes. This avoids touching every router, automatically covers future endpoints, and keeps header logic in one place. The middleware is only registered when the flag is enabled (zero overhead when off).
  • x-vllm-* prefix: Avoids collisions with proxies/CDNs that commonly use generic x- headers.
  • Non-streaming only: Streaming responses start sending before stats are available. Streaming support is explicitly out of scope.
  • RequestStateStats runtime import in protocol.py: Pydantic needs the type at runtime for model validation (arbitrary_types_allowed=True), so it cannot be behind TYPE_CHECKING.
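
The middleware decision can be pictured with a dependency-free sketch. This is not the PR's code: it is written against the bare ASGI interface, and the way stats reach the middleware (a dict stored under scope["state"]["stats_headers"]) as well as the names StatsHeadersMiddleware and build_app are assumptions made for illustration.

```python
import asyncio


class StatsHeadersMiddleware:
    """Hypothetical pure-ASGI middleware; not the PR's implementation."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_stats(message):
            if message["type"] == "http.response.start":
                # Assumption: the handler stored a dict of header names to
                # values in scope["state"] before sending its response.
                extra = scope.get("state", {}).get("stats_headers", {})
                headers = list(message.get("headers", []))
                headers += [(k.encode(), str(v).encode()) for k, v in extra.items()]
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_stats)


def build_app(inner_app, enable_request_stats_headers: bool):
    # Wrap only when the flag is set, so the disabled path is left untouched.
    if enable_request_stats_headers:
        return StatsHeadersMiddleware(inner_app)
    return inner_app


async def demo_app(scope, receive, send):
    # Stand-in for the serving layer: finish the work, record stats, respond.
    scope.setdefault("state", {})["stats_headers"] = {"x-vllm-queue-time": "0.02"}
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"{}"})


async def call(app):
    """Drive an ASGI app once and return the messages it sent."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "state": {}}, receive, send)
    return sent
```

Running asyncio.run(call(build_app(demo_app, True))) yields a response-start message carrying the extra header, while passing False returns the inner app unchanged, which is the "zero overhead when off" property described above.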

Changes

  • vllm/entrypoints/openai/request_stats_headers.py (new): Pure function build_request_stats_headers + shared request_stats_headers_middleware. No FastAPI dependency in the header builder.
  • vllm/entrypoints/openai/api_server.py: Register middleware conditionally when flag is set.
  • vllm/entrypoints/openai/engine/protocol.py: Add request_stats and num_cached_tokens fields to RequestResponseMetadata.
  • vllm/entrypoints/openai/cli_args.py: Add --enable-request-stats-headers flag (default: False).
  • vllm/entrypoints/openai/chat_completion/serving.py: Populate request_metadata.request_stats from final_res.metrics.
  • vllm/entrypoints/openai/completion/serving.py: Same, from last_final_res.metrics.
  • vllm/entrypoints/openai/responses/serving.py: Populate stats from context.last_output.metrics and map ResponseUsage fields into UsageInfo for the middleware.
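
For readers who want a feel for what a pure header builder might look like, here is a minimal sketch. It is not the PR's build_request_stats_headers: the Timestamps dataclass and its field names are hypothetical stand-ins for the engine's per-request stats, and the tokens-per-second formula (completion tokens divided by decode seconds) is an assumption, since the description only says "decode throughput".

```python
from dataclasses import dataclass


@dataclass
class Timestamps:
    """Hypothetical per-request timestamps, all in seconds."""
    arrival: float
    scheduled: float
    first_token: float
    last_token: float
    response: float


def build_headers_sketch(ts, prompt_tokens, completion_tokens, cached_tokens):
    """Illustrative pure function: timestamps and counts in, header dict out."""

    def ms(start, end):
        # Interval in milliseconds, formatted to two decimals.
        return f"{(end - start) * 1000.0:.2f}"

    headers = {
        "x-vllm-total-time": ms(ts.arrival, ts.response),
        "x-vllm-queue-time": ms(ts.arrival, ts.scheduled),
        "x-vllm-prefill-time": ms(ts.scheduled, ts.first_token),
        "x-vllm-decode-time": ms(ts.first_token, ts.last_token),
        "x-vllm-inference-time": ms(ts.scheduled, ts.last_token),
        "x-vllm-prompt-tokens": str(prompt_tokens),
        "x-vllm-completion-tokens": str(completion_tokens),
        "x-vllm-cached-tokens": str(cached_tokens),
    }
    decode_s = ts.last_token - ts.first_token
    if decode_s > 0:
        # Assumed definition of decode throughput; the PR may compute it differently.
        headers["x-vllm-tokens-per-second"] = f"{completion_tokens / decode_s:.2f}"
    return headers
```

Keeping the builder free of any web-framework types, as the PR does, means it can be unit-tested with plain values like these.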

Known limitations

  • Multi-prompt batch completions: Timing headers reflect the last prompt's metrics, not an aggregate across all prompts. Token counts are correctly summed.
  • Multi-turn tool-calling (Responses API): Timing breakdown (queue/prefill/decode) reflects only the final turn. x-vllm-total-time (wall-clock) is still correct.

Both limitations are inherent in how RequestStateStats is scoped per-prompt/per-turn in the engine, not something that can be fixed at the serving layer without deeper changes.
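
The multi-prompt limitation can be shown with a toy example. The PromptResult type and summarize_batch function are hypothetical and only mirror the behaviour described above: token counts are summed across prompts, while timing is taken from the last prompt.

```python
from dataclasses import dataclass


@dataclass
class PromptResult:
    """Hypothetical per-prompt result within one batch request."""
    prompt_tokens: int
    completion_tokens: int
    decode_ms: float


def summarize_batch(results):
    """Assumes a non-empty list; mirrors the limitation described in the PR."""
    return {
        # Token counts are aggregated over every prompt in the batch.
        "prompt_tokens": sum(r.prompt_tokens for r in results),
        "completion_tokens": sum(r.completion_tokens for r in results),
        # Timing reflects only the last prompt, not an aggregate.
        "decode_ms": results[-1].decode_ms,
    }
```

With two prompts whose decode times are 900 ms and 300 ms, the summary reports 300 ms, so a client reading the timing headers on a batch request should not treat them as a batch-wide figure.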

AI assistance was used (Claude). This is not duplicating any existing PR.

Test Plan

Starting the server
(vllm) ssm-user@ip-192-168-100-135:/scratch/vllm$ vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-request-stats-headers --max-model-len 8000 --max-num-seqs 128
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:299] 
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.1.dev15338+g82e8db1f2.d20260329
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:299]   █▄█▀ █     █     █     █  model   meta-llama/Llama-3.1-8B-Instruct
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:299] 
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:233] non-default args: {'model_tag': 'meta-llama/Llama-3.1-8B-Instruct', 'enable_request_stats_headers': True, 'model': 'meta-llama/Llama-3.1-8B-Instruct', 'max_model_len': 8000, 'max_num_seqs': 128}
...
(APIServer pid=362812) INFO 03-30 18:34:22 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8000
Testing a request
ssm-user@ip-192-168-100-135:/scratch$ curl -ik -XPOST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello!"}]}'
HTTP/1.1 200 OK
date: Mon, 30 Mar 2026 18:34:39 GMT
server: uvicorn
x-total-time: 1615.46
x-queue-time: 0.02
x-inference-time: 1583.64
x-prefill-time: 163.91
x-decode-time: 1419.72
x-prompt-tokens: 43
x-completion-tokens: 25
x-cached-tokens: 0
content-length: 682
content-type: application/json

{"id":"chatcmpl-b0ca3f3b0b282b5d","object":"chat.completion","created":1774895680,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":43,"total_tokens":68,"completion_tokens":25,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
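
The timing values in the sample response can be sanity-checked against the header definitions: prefill plus decode should match inference time, and total time should be at least queue plus inference time. The sketch below uses the numbers copied from the response; the 0.05 ms tolerance is an assumption to allow for rounding to two decimals.

```python
# Values copied from the sample response headers (milliseconds).
total, queue, inference = 1615.46, 0.02, 1583.64
prefill, decode = 163.91, 1419.72


def check_sample():
    # Inference time is defined as scheduled to last token, i.e. prefill + decode.
    consistent = abs((prefill + decode) - inference) < 0.05
    # Whatever remains is time spent outside queueing and inference.
    remainder_ms = round(total - queue - inference, 2)
    return consistent, remainder_ms


consistent, remainder_ms = check_sample()
print(consistent, remainder_ms)
```

The remainder is the part of the wall-clock time that the headers do not break down further.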

Test Result

Benchmarked this PR against main to check whether the middleware adds latency overhead. Mean TTFT, TPOT, and ITL differ by less than 0.3% between the two runs.

This PR
(vllm) ssm-user@ip-192-168-100-135:/scratch/vllm$ vllm bench serve --backend openai-chat --endpoint /v1/chat/completions --num-prompts 200 --input-len 512 --output-len 128 --request-rate 1 --save-result --result-filename headers.json
...
============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           1.00      
Benchmark duration (s):                  208.51    
Total input tokens:                      102200    
Total generated tokens:                  25600     
Request throughput (req/s):              0.96      
Output token throughput (tok/s):         122.77    
Peak output token throughput (tok/s):    242.00    
Peak concurrent requests:                22.00     
Total token throughput (tok/s):          612.91    
---------------Time to First Token----------------
Mean TTFT (ms):                          351.87    
Median TTFT (ms):                        328.36    
P99 TTFT (ms):                           649.73    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.67     
Median TPOT (ms):                        76.19     
P99 TPOT (ms):                           88.84     
---------------Inter-token Latency----------------
Mean ITL (ms):                           76.07     
Median ITL (ms):                         65.64     
P99 ITL (ms):                            217.71    
==================================================
Main (baseline)
(vllm) ssm-user@ip-192-168-100-135:/scratch/vllm$ vllm bench serve --backend openai-chat --endpoint /v1/chat/completions --num-prompts 200 --input-len 512 --output-len 128 --request-rate 1 --save-result --result-filename baseline.json
...
============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           1.00      
Benchmark duration (s):                  208.51    
Total input tokens:                      102200    
Total generated tokens:                  25600     
Request throughput (req/s):              0.96      
Output token throughput (tok/s):         122.78    
Peak output token throughput (tok/s):    252.00    
Peak concurrent requests:                22.00     
Total token throughput (tok/s):          612.92    
---------------Time to First Token----------------
Mean TTFT (ms):                          351.48    
Median TTFT (ms):                        322.23    
P99 TTFT (ms):                           641.76    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.47     
Median TPOT (ms):                        76.14     
P99 TPOT (ms):                           88.41     
---------------Inter-token Latency----------------
Mean ITL (ms):                           75.87     
Median ITL (ms):                         65.64     
P99 ITL (ms):                            216.63    
==================================================
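
To put the two runs side by side, the snippet below computes the relative change of this PR against the baseline for the latency metrics reported above. The numbers are copied from the two benchmark outputs; each configuration was run once, so the deltas are indicative and not a statistical comparison.

```python
# Latency metrics (ms) copied from the two benchmark runs above.
this_pr = {
    "mean_ttft": 351.87, "p99_ttft": 649.73,
    "mean_tpot": 76.67, "p99_tpot": 88.84,
    "mean_itl": 76.07, "p99_itl": 217.71,
}
baseline = {
    "mean_ttft": 351.48, "p99_ttft": 641.76,
    "mean_tpot": 76.47, "p99_tpot": 88.41,
    "mean_itl": 75.87, "p99_itl": 216.63,
}


def relative_change_pct(new, old):
    """Percentage change of new relative to old."""
    return (new - old) / old * 100.0


deltas = {
    name: round(relative_change_pct(this_pr[name], baseline[name]), 2)
    for name in this_pr
}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```

Every delta comes out positive but small, with P99 TTFT showing the largest change at a little over 1%.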



@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the frontend label Mar 30, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 04fe9a07dd

Comment thread vllm/entrypoints/openai/request_stats_headers.py Outdated
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a feature to include per-request timing and compute statistics (such as queue time, inference time, and token counts) as HTTP response headers for non-streaming OpenAI-compatible endpoints. This functionality is toggled via a new CLI argument, --enable-request-stats-headers. The review feedback identifies a critical missing null check in the header construction logic that could lead to an AttributeError, and points out that the Responses API path currently fails to populate the usage metadata required for these headers to be generated.

Comment thread vllm/entrypoints/openai/request_stats_headers.py Outdated
Comment thread vllm/entrypoints/openai/responses/serving.py Outdated
@benchislett
Member

I like this feature but don't have bandwidth to review at this time. Based on the description:

  • Why is it limited to non-streaming requests? There's already a pattern for returning "usage" in the last chunk of a streaming request, could we do the same thing here?
  • Could we also have speculative decoding acceptance metrics (per-position number of accepted tokens, number of drafts, and/or mean acceptance length)? This is a very frequently requested feature and seems like it would fit in nicely here. Would it be a simple add-on or require some major changes to the PR?

@vrdn-23
Contributor Author

vrdn-23 commented Apr 14, 2026

@benchislett Thank you for taking the time to chime in:

> Why is it limited to non-streaming requests? There's already a pattern for returning "usage" in the last chunk of a streaming request, could we do the same thing here?

For streaming requests, my understanding is that the HTTP headers are sent before the response starts streaming, so the timing metrics aren't available until the final chunk is sent. And since the response is sent back as one continuous stream, we can't re-send the headers with each chunk. I believe there is a concept of trailers (header fields sent after the response body) which could work, but I would have to read up on how they work and whether they are widely supported.
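
To make the constraint concrete: by the time the last token of a streamed response is produced, the header block has already gone out, so any stats would have to travel in the body or in trailers. The sketch below shows the final-chunk pattern raised in the question, applied to timing stats. It is hypothetical; the request_stats field name and the chunk shape are assumptions, and this PR does not implement it.

```python
import json
import time
from typing import Iterable, Iterator


def stream_with_final_stats(tokens: Iterable[str]) -> Iterator[str]:
    """Hypothetical SSE generator that appends stats as a last data chunk."""
    start = time.monotonic()
    count = 0
    for token in tokens:
        count += 1
        # Regular chunks go out as soon as each token is available.
        yield "data: " + json.dumps({"delta": token}) + "\n\n"
    # Stats only become known here, after every token has been sent.
    stats = {
        "completion_tokens": count,
        "total_time_ms": round((time.monotonic() - start) * 1000.0, 2),
    }
    yield "data: " + json.dumps({"request_stats": stats}) + "\n\n"
    yield "data: [DONE]\n\n"
```

A client would read the stats from the second-to-last event instead of from the response headers.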

> Could we also have speculative decoding acceptance metrics (per-position number of accepted tokens, number of drafts, and/or mean acceptance length)? This is a very frequently requested feature and seems like it would fit in nicely here. Would it be a simple add-on or require some major changes to the PR?

That would be a nice addition, but it might make sense as a follow-up PR once we've settled on what this initial design should look like. I have toyed with a couple of approaches. One alternative I've considered is using the FinishedRequestStats object instead of creating a new one, so that there is a single source of truth for the calculation in metrics and headers, but this does seem cleaner.

I'd be happy to contribute the spec-decoding PR once this has landed to ensure completeness of the feature! Let me know if there is anything else I can help answer!

P.S. If possible, could you help put a ready tag on the PR, so that I can make sure it doesn't break any existing behavior while I iterate?

@benchislett
Member

Marked as verified to unblock pre-commit. CI runs are expensive, so let's wait for a second review from someone more experienced with our frontend.

@mergify
Contributor

mergify Bot commented Apr 23, 2026

Hi @vrdn-23, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@vrdn-23
Contributor Author

vrdn-23 commented Apr 23, 2026

Thanks for adding the pre-commit check @benchislett.

I wanted to bump this to @noooop or @markmc based on the discussion I saw over at #39979. Do either of you have a preference for this approach, or do you think there should be a refactoring to have a single source of truth for the calculation of timings? Looking forward to hearing any thoughts, questions, or feedback that you may have.

@noooop
Collaborator

noooop commented Apr 23, 2026

I'm not an expert in observability, but I don't think we need to reinvent the OpenTelemetry tracker.

@vrdn-23
Contributor Author

vrdn-23 commented Apr 23, 2026

> I'm not an expert in observability, but I don't think we need to reinvent the OpenTelemetry tracker.

I'm sorry, I'm not quite sure I understood. While the metrics themselves are available via OpenTelemetry, to make load-balancing and cost-attribution decisions we would need some kind of indication in the response to each request (one that doesn't involve us scraping metrics). Or did I misunderstand what you were trying to convey?

@noooop
Collaborator

noooop commented Apr 23, 2026

The biggest issue with using Timing Headers is that they do not support streaming responses, which makes their use cases very limited.

At the same time, I also think that adding metrics in the response body is not a very good solution.

I'm trying to find a better solution, but I haven't found one yet.


Also, I'm not entirely sure whether the best practice for observability is to have the telemetry plane separated from the data plane or to have it together with the data plane.

@vrdn-23
Contributor Author

vrdn-23 commented Apr 23, 2026

Just wanted to add to this issue a comment I made in another discussion that might be relevant and would be curious to hear your thoughts:
#40076 (comment)

Signed-off-by: Vinay Damodaran <vrdn@hey.com>
@mergify
Contributor

mergify Bot commented May 10, 2026

Hi @vrdn-23, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@vrdn-23
Contributor Author

vrdn-23 commented May 10, 2026

Closing in favor of #42198, which implements the same feature with a single-source-of-truth refactor: timing intervals are computed once in IterationStats.update_from_finished_request and consumed by both the Prometheus path and the new headers middleware. The middleware in this branch was recomputing the same intervals; that duplication is gone in #42198.

Development

Successfully merging this pull request may close these issues.

[Feature]: Per-Request Timing Headers (--enable-request-stats-headers)

5 participants