[Feature]: Per-Request Timing Headers (--enable-request-stats-headers) by vrdn-23 · Pull Request #38572 · vllm-project/vllm

vrdn-23 · 2026-03-30T18:28:23Z

Purpose

Summary

Add opt-in x-vllm-* HTTP response headers exposing per-request timing and compute stats on non-streaming completion responses. Controlled by --enable-request-stats-headers.

Headers emitted:

| Header | Description |
| --- | --- |
| `x-vllm-total-time` | Wall-clock ms, arrival to response |
| `x-vllm-queue-time` | Ms waiting to be scheduled |
| `x-vllm-prefill-time` | Ms from scheduled to first token |
| `x-vllm-decode-time` | Ms from first to last token |
| `x-vllm-inference-time` | Ms from scheduled to last token |
| `x-vllm-tokens-per-second` | Decode throughput |
| `x-vllm-prompt-tokens` | Input token count |
| `x-vllm-completion-tokens` | Output token count |
| `x-vllm-cached-tokens` | KV cache hits |
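
To make the interval definitions in the table concrete, here is a minimal sketch of how the header values could be derived from five timestamps. It is not the PR's `build_request_stats_headers`: the `RequestTimestamps` dataclass, the `sketch_stats_headers` name, the two-decimal formatting, and the choice to compute tokens-per-second as completion tokens divided by decode time are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTimestamps:
    """Hypothetical per-request timestamps, in seconds on one monotonic clock."""
    arrival: float
    scheduled: float
    first_token: float
    last_token: float
    response_sent: float


def _ms(start: float, end: float) -> str:
    """Format an interval as milliseconds with two decimals, clamped at zero."""
    return f"{max(end - start, 0.0) * 1000.0:.2f}"


def sketch_stats_headers(ts: RequestTimestamps, prompt_tokens: int,
                         completion_tokens: int,
                         cached_tokens: int) -> dict[str, str]:
    """Map timestamps and token counts onto the header names from the table."""
    headers = {
        "x-vllm-total-time": _ms(ts.arrival, ts.response_sent),
        "x-vllm-queue-time": _ms(ts.arrival, ts.scheduled),
        "x-vllm-prefill-time": _ms(ts.scheduled, ts.first_token),
        "x-vllm-decode-time": _ms(ts.first_token, ts.last_token),
        "x-vllm-inference-time": _ms(ts.scheduled, ts.last_token),
        "x-vllm-prompt-tokens": str(prompt_tokens),
        "x-vllm-completion-tokens": str(completion_tokens),
        "x-vllm-cached-tokens": str(cached_tokens),
    }
    decode_s = ts.last_token - ts.first_token
    # Assumption: throughput is omitted when the decode interval is empty,
    # rather than reporting infinity or zero.
    if decode_s > 0:
        headers["x-vllm-tokens-per-second"] = f"{completion_tokens / decode_s:.2f}"
    return headers


example = sketch_stats_headers(
    RequestTimestamps(arrival=0.0, scheduled=0.5, first_token=1.0,
                      last_token=3.0, response_sent=3.25),
    prompt_tokens=43, completion_tokens=50, cached_tokens=0,
)
```

With these definitions, queue time plus inference time accounts for everything up to the last token, and whatever remains in total time is response handling after generation.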

Use cases: operator observability dashboards, client-side latency optimization, load balancer routing decisions.

Design decisions

Middleware, not per-router: A single HTTP middleware in api_server.py injects headers after the serving layer completes. This avoids touching every router, automatically covers future endpoints, and keeps header logic in one place. The middleware is only registered when the flag is enabled (zero overhead when off).
x-vllm-* prefix: Avoids collisions with proxies/CDNs that commonly use generic x- headers.
Non-streaming only: Streaming responses start sending before stats are available. Streaming support is explicitly out of scope.
RequestStateStats runtime import in protocol.py: Pydantic needs the type at runtime for model validation (arbitrary_types_allowed=True), so it cannot be behind TYPE_CHECKING.
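
The middleware and non-streaming decisions above can be illustrated with a dependency-free ASGI sketch. This is not the PR's middleware (which is registered in api_server.py); it uses raw ASGI callables so it runs without FastAPI, and the `STATS_KEY` scope key and both toy apps are hypothetical. It shows that headers can only be attached to the `http.response.start` message, which a streaming handler emits before any stats exist.

```python
import asyncio

STATS_KEY = "sketch.stats_headers"  # hypothetical scope key, not a vLLM name


class StatsHeadersMiddleware:
    """Pure-ASGI middleware that appends stats headers to the response start."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_stats(message):
            # Headers travel only in the first message of a response.
            if message["type"] == "http.response.start":
                extra = scope.get(STATS_KEY, {})
                headers = list(message.get("headers", []))
                headers += [(k.encode("latin-1"), v.encode("latin-1"))
                            for k, v in extra.items()]
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_stats)


async def non_streaming_app(scope, receive, send):
    # Generation has finished, so stats are known before the response starts.
    scope[STATS_KEY] = {"x-vllm-total-time": "12.50"}
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"application/json")]})
    await send({"type": "http.response.body", "body": b"{}"})


async def streaming_app(scope, receive, send):
    # The response starts before generation finishes.
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"data: tok\n\n",
                "more_body": True})
    scope[STATS_KEY] = {"x-vllm-total-time": "12.50"}  # too late for headers
    await send({"type": "http.response.body", "body": b"", "more_body": False})


def run(app):
    """Drive one request through the middleware and collect sent messages."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    asyncio.run(StatsHeadersMiddleware(app)({"type": "http"}, receive, send))
    return sent
```

Calling `run(non_streaming_app)` yields a start message that carries the stats header, while `run(streaming_app)` yields one without it, which is the limitation the PR scopes out.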

Changes

vllm/entrypoints/openai/request_stats_headers.py (new): Pure function build_request_stats_headers + shared request_stats_headers_middleware. No FastAPI dependency in the header builder.
vllm/entrypoints/openai/api_server.py: Register middleware conditionally when flag is set.
vllm/entrypoints/openai/engine/protocol.py: Add request_stats and num_cached_tokens fields to RequestResponseMetadata.
vllm/entrypoints/openai/cli_args.py: Add --enable-request-stats-headers flag (default: False).
vllm/entrypoints/openai/chat_completion/serving.py: Populate request_metadata.request_stats from final_res.metrics.
vllm/entrypoints/openai/completion/serving.py: Same, from last_final_res.metrics.
vllm/entrypoints/openai/responses/serving.py: Populate stats from context.last_output.metrics and map ResponseUsage fields into UsageInfo for the middleware.

Known limitations

Multi-prompt batch completions: Timing headers reflect the last prompt's metrics, not an aggregate across all prompts. Token counts are correctly summed.
Multi-turn tool-calling (Responses API): Timing breakdown (queue/prefill/decode) reflects only the final turn. x-vllm-total-time (wall-clock) is still correct.

Both limitations are inherent in how RequestStateStats is scoped per-prompt/per-turn in the engine, not something that can be fixed at the serving layer without deeper changes.
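
A small sketch of the multi-prompt behavior described above, assuming a hypothetical `PromptResult` record per prompt: token counts are summed across prompts, while the timing value is taken from the last prompt only. The names and the single timing field are illustrative, not the PR's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptResult:
    """Hypothetical per-prompt result within one batch completion request."""
    prompt_tokens: int
    completion_tokens: int
    decode_ms: float


def combine_for_headers(results: list[PromptResult]) -> dict[str, str]:
    """Sum token counts; report timing from the last prompt only."""
    if not results:
        raise ValueError("at least one prompt result is required")
    last = results[-1]
    return {
        "x-vllm-prompt-tokens": str(sum(r.prompt_tokens for r in results)),
        "x-vllm-completion-tokens": str(sum(r.completion_tokens for r in results)),
        # Not an aggregate: earlier prompts' decode times are not reflected.
        "x-vllm-decode-time": f"{last.decode_ms:.2f}",
    }


batch = [PromptResult(10, 20, 400.0), PromptResult(12, 8, 150.0)]
combined = combine_for_headers(batch)
```

Here the combined headers report 22 prompt tokens and 28 completion tokens, but a decode time of 150.00 ms, even though the first prompt decoded for longer.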

AI assistance was used (Claude). This is not duplicating any existing PR.

Test Plan

Starting the server

(vllm) ssm-user@ip-192-168-100-135:/scratch/vllm$ vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-request-stats-headers --max-model-len 8000 --max-num-seqs 128
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:299] 
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.1.dev15338+g82e8db1f2.d20260329
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:299]   █▄█▀ █     █     █     █  model   meta-llama/Llama-3.1-8B-Instruct
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:299] 
(APIServer pid=362812) INFO 03-30 18:31:18 [utils.py:233] non-default args: {'model_tag': 'meta-llama/Llama-3.1-8B-Instruct', 'enable_request_stats_headers': True, 'model': 'meta-llama/Llama-3.1-8B-Instruct', 'max_model_len': 8000, 'max_num_seqs': 128}
...
(APIServer pid=362812) INFO 03-30 18:34:22 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8000

Testing a request

ssm-user@ip-192-168-100-135:/scratch$ curl -ik -XPOST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello!"}]}'
HTTP/1.1 200 OK
date: Mon, 30 Mar 2026 18:34:39 GMT
server: uvicorn
x-total-time: 1615.46
x-queue-time: 0.02
x-inference-time: 1583.64
x-prefill-time: 163.91
x-decode-time: 1419.72
x-prompt-tokens: 43
x-completion-tokens: 25
x-cached-tokens: 0
content-length: 682
content-type: application/json

{"id":"chatcmpl-b0ca3f3b0b282b5d","object":"chat.completion","created":1774895680,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":43,"total_tokens":68,"completion_tokens":25,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
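
For the client-side use case, a minimal sketch of reading these headers from a response's header mapping. The helper name `parse_stats_headers` is hypothetical. The prefix is a parameter because the summary lists `x-vllm-*` names while the response pasted above shows names such as `x-total-time`; the sample values below are copied from that response.

```python
from typing import Mapping


def parse_stats_headers(headers: Mapping[str, str],
                        prefix: str = "x-vllm-") -> dict[str, float]:
    """Collect numeric headers that start with `prefix`, keyed without it."""
    stats: dict[str, float] = {}
    for name, value in headers.items():
        key = name.lower()  # HTTP header names are case-insensitive
        if not key.startswith(prefix):
            continue
        try:
            stats[key[len(prefix):]] = float(value)
        except ValueError:
            continue  # skip non-numeric headers that share the prefix
    return stats


sample = {
    "server": "uvicorn",
    "x-total-time": "1615.46",
    "x-queue-time": "0.02",
    "x-inference-time": "1583.64",
    "x-prefill-time": "163.91",
    "x-decode-time": "1419.72",
    "x-prompt-tokens": "43",
    "x-completion-tokens": "25",
    "x-cached-tokens": "0",
    "content-type": "application/json",
}
stats = parse_stats_headers(sample, prefix="x-")
```

In the sample, prefill plus decode time is within rounding of the reported inference time, which is a cheap consistency check a client can run.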

Test Result

Tested to see if there was any latency overhead with the addition of this middleware.

This PR

(vllm) ssm-user@ip-192-168-100-135:/scratch/vllm$ vllm bench serve --backend openai-chat --endpoint /v1/chat/completions --num-prompts 200 --input-len 512 --output-len 128 --request-rate 1 --save-result --result-filename headers.json
...
============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           1.00      
Benchmark duration (s):                  208.51    
Total input tokens:                      102200    
Total generated tokens:                  25600     
Request throughput (req/s):              0.96      
Output token throughput (tok/s):         122.77    
Peak output token throughput (tok/s):    242.00    
Peak concurrent requests:                22.00     
Total token throughput (tok/s):          612.91    
---------------Time to First Token----------------
Mean TTFT (ms):                          351.87    
Median TTFT (ms):                        328.36    
P99 TTFT (ms):                           649.73    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.67     
Median TPOT (ms):                        76.19     
P99 TPOT (ms):                           88.84     
---------------Inter-token Latency----------------
Mean ITL (ms):                           76.07     
Median ITL (ms):                         65.64     
P99 ITL (ms):                            217.71    
==================================================

Main (baseline)

(vllm) ssm-user@ip-192-168-100-135:/scratch/vllm$ vllm bench serve --backend openai-chat --endpoint /v1/chat/completions --num-prompts 200 --input-len 512 --output-len 128 --request-rate 1 --save-result --result-filename baseline.json
...
============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           1.00      
Benchmark duration (s):                  208.51    
Total input tokens:                      102200    
Total generated tokens:                  25600     
Request throughput (req/s):              0.96      
Output token throughput (tok/s):         122.78    
Peak output token throughput (tok/s):    252.00    
Peak concurrent requests:                22.00     
Total token throughput (tok/s):          612.92    
---------------Time to First Token----------------
Mean TTFT (ms):                          351.48    
Median TTFT (ms):                        322.23    
P99 TTFT (ms):                           641.76    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.47     
Median TPOT (ms):                        76.14     
P99 TPOT (ms):                           88.41     
---------------Inter-token Latency----------------
Mean ITL (ms):                           75.87     
Median ITL (ms):                         65.64     
P99 ITL (ms):                            216.63    
==================================================
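
To put the two runs side by side, the following sketch computes the relative difference for a handful of the reported metrics. The values are copied from the two result blocks above; the selection of metrics is arbitrary, and since each configuration was run once, these differences cannot be separated from run-to-run noise.

```python
def pct_delta(with_headers: float, baseline: float) -> float:
    """Relative difference of the PR run versus baseline, in percent."""
    return (with_headers - baseline) / baseline * 100.0


# (this PR, main baseline), copied from the benchmark output above
runs = {
    "Mean TTFT (ms)": (351.87, 351.48),
    "P99 TTFT (ms)": (649.73, 641.76),
    "Mean TPOT (ms)": (76.67, 76.47),
    "Mean ITL (ms)": (76.07, 75.87),
    "Output token throughput (tok/s)": (122.77, 122.78),
}

deltas = {name: pct_delta(pr, base) for name, (pr, base) in runs.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```

For the metrics listed here, every difference is below 1.5% in magnitude, and the mean latencies differ by well under 1%.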

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 04fe9a07dd

gemini-code-assist

Code Review

This pull request introduces a feature to include per-request timing and compute statistics (such as queue time, inference time, and token counts) as HTTP response headers for non-streaming OpenAI-compatible endpoints. This functionality is toggled via a new CLI flag, --enable-request-stats-headers. The review feedback identifies a critical missing null check in the header construction logic that could lead to an AttributeError and points out that the Responses API path currently fails to populate the usage metadata required for these headers to be generated.

benchislett · 2026-04-13T18:26:33Z

I like this feature but don't have bandwidth to review at this time. Based on the description:

Why is it limited to non-streaming requests? There's already a pattern for returning "usage" in the last chunk of a streaming request, could we do the same thing here?
Could we also have speculative decoding acceptance metrics (per-position number of accepted tokens, number of drafts, and/or mean acceptance length)? This is a very frequently requested feature and seems like it would fit in nicely here. Would it be a simple add-on or require some major changes to the PR?

vrdn-23 · 2026-04-14T15:31:17Z

@benchislett Thank you for taking the time to chime in:

> Why is it limited to non-streaming requests? There's already a pattern for returning "usage" in the last chunk of a streaming request, could we do the same thing here?

For streaming requests, my understanding is that the HTTP headers are sent before the body starts streaming, so the timing metrics aren't available until the final chunk is sent. Also, since the response is sent back as one continuous stream, we can't re-send headers with each chunk. I believe there is a concept of trailers (HTTP header fields sent after the response body) which could work, but I would have to read up on how they work and whether they are widely supported.

> Could we also have speculative decoding acceptance metrics (per-position number of accepted tokens, number of drafts, and/or mean acceptance length)? This is a very frequently requested feature and seems like it would fit in nicely here. Would it be a simple add-on or require some major changes to the PR?

That would be a nice addition, but it might make sense as a follow-up PR once we've landed on what this initial design should look like. I have toyed with a couple of approaches; one alternative I've considered is using the FinishedRequestStats object instead of creating a new one, so that there is a single source of truth for the calculation in metrics and headers, but this does seem cleaner.

I'd be happy to contribute the spec-decoding PR once this is landed to ensure completeness of the feature! Let me know if there is anything else I can help answer!

P.S. If possible, could you help put a ready tag on the PR, so that I can make sure it doesn't break any existing behavior while I iterate?

benchislett · 2026-04-23T00:02:39Z

marked as verified to unblock pre-commit. CI runs are expensive, let's wait for a second review from someone more experienced with our frontend

mergify · 2026-04-23T00:06:29Z

Hi @vrdn-23, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

vrdn-23 · 2026-04-23T01:39:13Z

Thanks for adding the pre-commit check @benchislett.

I wanted to bump this to @noooop or @markmc based on the discussion I saw over at #39979. Do either of you have a preference for this approach, or do you think it should be refactored so there is a single source of truth for the timing calculations? Looking forward to any thoughts, questions, or feedback you may have.

noooop · 2026-04-23T02:26:46Z

I'm not an expert in observability, but I don't think we need to reinvent the OpenTelemetry tracker.

vrdn-23 · 2026-04-23T02:44:35Z

> I'm not an expert in observability, but I don't think we need to reinvent the OpenTelemetry tracker.

I'm sorry, I'm not quite sure I understood. While the metrics themselves are available via OpenTelemetry, making load-balancing and cost-attribution decisions requires some indication delivered with the response itself (one that doesn't involve scraping metrics). Or did I misunderstand what you were trying to convey?

noooop · 2026-04-23T02:47:07Z

The biggest issue with using Timing Headers is that they do not support streaming responses, which makes their use cases very limited.

At the same time, I also think that adding metrics to the response body is not a very good solution.

I'm trying to find a better solution, but I haven't found one yet.

Also, I'm not entirely sure whether the best practice for observability is to have the telemetry plane separated from the data plane or to have it together with the data plane.

vrdn-23 · 2026-04-23T03:53:04Z

Just wanted to add to this issue a comment I made in another discussion that might be relevant and would be curious to hear your thoughts:
#40076 (comment)

Signed-off-by: Vinay Damodaran <vrdn@hey.com>

vrdn-23 · 2026-05-10T03:35:56Z

Closing in favor of #42198, which implements the same feature with a single-source-of-truth refactor: timing intervals are computed once in IterationStats.update_from_finished_request and consumed by both the Prometheus path and the new headers middleware. The middleware in this branch was recomputing the same intervals — that duplication is gone in #42198.

vrdn-23 requested review from DarkLight1337, NickLucche, aarnphm, chaunceyjiang, robertgshaw2-redhat and russellb as code owners March 30, 2026 18:28

claude Bot reviewed Mar 30, 2026


mergify Bot added the frontend label Mar 30, 2026

chatgpt-codex-connector Bot reviewed Mar 30, 2026


Comment thread vllm/entrypoints/openai/request_stats_headers.py Outdated

gemini-code-assist Bot reviewed Mar 30, 2026


Comment thread vllm/entrypoints/openai/request_stats_headers.py Outdated

Comment thread vllm/entrypoints/openai/responses/serving.py Outdated

vrdn-23 force-pushed the vrdn-23/add-reponse-headers branch from 04fe9a0 to a618c95 Compare March 30, 2026 18:37

vrdn-23 requested review from ApostaC, WoosukKwon, ZJY0516, bigPYJ1151, gshtras, hmellor, jeejeelee, jikunshang, markmc, mgoin, noooop, orozery, patrickvonplaten, sighingnow, tdoublep, tjtanaa, tomeras91, yewentao256 and youkaichao as code owners March 30, 2026 18:37

vrdn-23 added 14 commits April 14, 2026 08:31

Merge branch 'main' into vrdn-23/add-reponse-headers

db95e1f

Merge branch 'main' into vrdn-23/add-reponse-headers

956f619

Merge branch 'main' into vrdn-23/add-reponse-headers

75ccf8d

Merge branch 'main' into vrdn-23/add-reponse-headers

d07950e

Merge branch 'main' into vrdn-23/add-reponse-headers

465df57

Merge branch 'main' into vrdn-23/add-reponse-headers

72a579a

Merge branch 'main' into vrdn-23/add-reponse-headers

34299e6

Merge branch 'main' into vrdn-23/add-reponse-headers

42f24bf

Merge branch 'main' into vrdn-23/add-reponse-headers

a30bcf8

Merge branch 'main' into vrdn-23/add-reponse-headers

fbbf55c

Merge branch 'main' into vrdn-23/add-reponse-headers

4375250

Merge branch 'main' into vrdn-23/add-reponse-headers

c47e88b

Merge branch 'main' into vrdn-23/add-reponse-headers

fab0ad9

Merge branch 'main' into vrdn-23/add-reponse-headers

7e1324b

Merge branch 'main' into vrdn-23/add-reponse-headers

376437a

vrdn-23 mentioned this pull request Apr 23, 2026

[Feature]: Per-request timing metrics in response body #40076

Open

1 task

fix merge conficts

5bc7b86

Signed-off-by: Vinay Damodaran <vrdn@hey.com>

vrdn-23 mentioned this pull request May 10, 2026

[Frontend] Add x-vllm-* response headers for per-request stats #42198

Open


5 participants
