🚀 The feature, motivation and pitch
Summary
Add an opt-in capability for vLLM to return per-request timing and compute metrics in the response body of OpenAI-compatible completion endpoints. The feature would be gated by a server-level flag (e.g. --enable-per-request-metrics) plus a per-request parameter (e.g. include_metrics: true), and would expose a structured metrics object alongside the normal response payload.
This issue is intentionally filed as a companion / counter-proposal to #36189, which proposes exposing the same information via HTTP response headers. A draft implementation of the body-based approach already exists in #36383.
Motivation
(The motivation here largely overlaps with #36189; the only real difference is the delivery mechanism.)
vLLM already tracks detailed per-request timing internally (queue time, prefill time, decode time, inter-token latency, etc.) via RequestStateStats, and surfaces aggregated versions of this data through Prometheus metrics and OpenTelemetry traces. Those are backend-only, aggregate observability tools — they do not let an API consumer see where time was spent on their specific request.
Exposing this data to API consumers directly unlocks two use cases:
1. Per-user / per-tenant billing and cost attribution
Operators running multi-tenant deployments need to attribute GPU time and token counts back to individual requests for usage-based billing and chargeback. Prometheus gives aggregates per endpoint/model, not per request. Having generation_time_ms, queue_time_ms, prompt_tokens, and completion_tokens in the response body means the billing system that is already parsing the response JSON has everything it needs in one place, with no separate infrastructure.
2. Per-request SLA attribution and latency debugging
Application developers building on top of vLLM currently see only total latency. With a structured metrics field they can distinguish:
- Time waiting in the scheduler queue (capacity issue)
- Time in prefill / time-to-first-token (prompt cost)
- Time in decode / inter-token latency (generation cost)
This makes it trivial to add per-request SLA tracking to an application without running a Prometheus scraper or an OTEL collector.
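As a rough illustration of the attribution described above, a client could classify each request by its slowest phase. This is a minimal sketch that assumes the proposed field names are present on a metrics dict; `dominant_phase` is a hypothetical helper, not part of vLLM.

```python
def dominant_phase(metrics: dict) -> str:
    """Return which phase took the longest for one request.

    Assumes the field names proposed in this issue; missing fields count as 0.
    """
    phases = {
        "queue": metrics.get("queue_time_ms", 0.0),             # capacity issue
        "prefill": metrics.get("time_to_first_token_ms", 0.0),  # prompt cost
        "decode": metrics.get("generation_time_ms", 0.0),       # generation cost
    }
    # max() over the dict keys, ranked by their millisecond values.
    return max(phases, key=phases.get)


sample = {
    "queue_time_ms": 900.0,
    "time_to_first_token_ms": 120.0,
    "generation_time_ms": 400.0,
}
slowest = dominant_phase(sample)
```

A request dominated by `queue` points at capacity rather than at the prompt or the generation length, which is the distinction an SLA dashboard would want to surface.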
Why the response body (and not only headers, as in #36189)
The headers-only approach proposed in #36189 is attractive for proxies and load balancers, but it has a hard limitation that a body-based approach does not:
- Streaming support - Headers are flushed before the first token. Metrics that are only known at end-of-generation (generation_time_ms, mean_itl_ms) cannot be carried in headers without HTTP trailers, which have very limited client/proxy support. A final SSE event (or final chunk) carries the completed metrics naturally and is trivial for clients to consume.
The headers and body approaches are complementary, not mutually exclusive. Routers that want real-time hot-path signals benefit from headers; billing pipelines and SDK users need the body. The position of this RFE is that the body-based API covers cases that headers cannot.
Proposal
Opt-in flags (double gate)
vllm serve <model> --enable-per-request-metrics
Plus a per-request parameter:
{ "model": "...", "messages": [...], "include_metrics": true }
Both must be set for metrics to be computed and returned. Default: off (no behavior change for existing users, no CPU overhead for deployments that do not opt in).
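The double gate can be sketched as a single predicate. This is only an illustration of the intended semantics, not the code in #36383: `ServerConfig` and `metrics_enabled` are hypothetical names, and treating only a literal `true` as opt-in is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerConfig:
    # Hypothetical mirror of the proposed --enable-per-request-metrics flag.
    enable_per_request_metrics: bool = False


def metrics_enabled(server: ServerConfig, request_body: dict) -> bool:
    """Return True only when both the server flag and the request flag are set."""
    if not server.enable_per_request_metrics:
        # Server-level gate: the operator has not opted in, so nothing is computed.
        return False
    # Request-level gate: assumed to require a literal JSON `true`.
    return request_body.get("include_metrics") is True
```

Checking the server flag first means a deployment that never opts in pays no per-request cost beyond one boolean test.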
Response body additions
A new optional metrics field on ChatCompletionResponse, ChatCompletionStreamResponse, CompletionResponse, and CompletionStreamResponse:
| Field | Unit | Description |
| --- | --- | --- |
| time_to_first_token_ms | ms | Time from scheduling to first output token |
| queue_time_ms | ms | Time spent waiting in the scheduler queue |
| generation_time_ms | ms | Total decode time (excludes queue wait) |
| mean_itl_ms | ms | Mean inter-token latency during decode |
| tokens_per_second | count / s | Output throughput for this request |
(Prefill-time, cached-token, and per-phase GPU-time fields can be added incrementally as they become cleanly attributable to a single request.)
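To make the field definitions concrete, here is one way the five values could be derived from per-request timestamps. This is a sketch under stated assumptions: `RequestTimestamps` and `build_metrics` are hypothetical, the timestamps are plain monotonic seconds rather than the actual contents of RequestStateStats, decode time is taken as first-to-last output token, and throughput is taken as output tokens over decode time.

```python
from dataclasses import dataclass


@dataclass
class RequestTimestamps:
    # Hypothetical per-request timestamps, in seconds on a monotonic clock.
    arrival_s: float
    scheduled_s: float
    first_token_s: float
    last_token_s: float
    num_output_tokens: int


def build_metrics(ts: RequestTimestamps) -> dict:
    """Derive the proposed metrics fields from raw timestamps."""
    queue_ms = (ts.scheduled_s - ts.arrival_s) * 1000.0
    ttft_ms = (ts.first_token_s - ts.scheduled_s) * 1000.0
    # Assumption: decode time spans first to last output token.
    generation_ms = (ts.last_token_s - ts.first_token_s) * 1000.0
    # N tokens have N - 1 gaps between them.
    gaps = ts.num_output_tokens - 1
    mean_itl_ms = generation_ms / gaps if gaps > 0 else 0.0
    # Assumption: throughput is output tokens divided by decode time.
    tokens_per_second = (
        ts.num_output_tokens / (generation_ms / 1000.0) if generation_ms > 0 else 0.0
    )
    return {
        "time_to_first_token_ms": ttft_ms,
        "queue_time_ms": queue_ms,
        "generation_time_ms": generation_ms,
        "mean_itl_ms": mean_itl_ms,
        "tokens_per_second": tokens_per_second,
    }


example = build_metrics(RequestTimestamps(10.0, 10.25, 10.75, 12.75, 5))
```

The guards on `gaps` and `generation_ms` keep single-token responses from dividing by zero; whether such requests should report `0.0` or omit the field is left open here.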
Streaming behavior
- For non-streaming responses: metrics is populated on the final response object.
- For streaming responses: metrics is emitted on the final SSE chunk, consistent with how OpenAI already emits final usage when stream_options.include_usage=true.
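On the client side, picking the metrics out of a stream could look like the sketch below. It assumes the usual `data: <json>` SSE framing with a `data: [DONE]` terminator and a top-level `metrics` key on the final chunk; the exact chunk shape is an assumption, and `extract_stream_metrics` is a hypothetical helper.

```python
import json
from typing import Iterable, Optional


def extract_stream_metrics(sse_lines: Iterable[str]) -> Optional[dict]:
    """Return the metrics object from a streamed response, or None if absent."""
    metrics = None
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        # Only the final chunk is expected to carry metrics; keep the last one seen.
        if chunk.get("metrics") is not None:
            metrics = chunk["metrics"]
    return metrics


stream = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    'data: {"choices": [], "metrics": {"queue_time_ms": 12.5, "mean_itl_ms": 30.0}}',
    "",
    "data: [DONE]",
]
final_metrics = extract_stream_metrics(stream)
```

Clients that ignore unknown keys keep working unchanged, which mirrors how the final `usage` chunk is consumed today.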
Alternatives
| Alternative | Why Not |
| --- | --- |
| Headers only (#36189) | Cannot carry end-of-generation timing for streaming cases. |
| Prometheus only | Aggregate, not per-request; requires scraping infrastructure; invisible to the caller. |
| OpenTelemetry traces | Requires a tracing backend; not accessible over plain HTTP; high operational overhead. |
| Always-on body fields | Adds CPU cost and response size for deployments that don't care. Opt-in keeps zero overhead as the default. |
| Custom middleware | Can only observe wall-clock time; cannot reach engine-internal timings (queue / prefill / decode). |
Additional context
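Relationship to existing work
- #36383: draft implementation of the body-based approach (PerRequestTimingMetrics and the double-gate flag).
- #36189 (--enable-request-stats-headers): the headers-based proposal. That issue remains a valid proposal for router-facing, hot-path metrics; this RFE is scoped to the body-side API that headers cannot replace.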
Why opt-in?
- Zero overhead when disabled - no extra computation, no extra fields serialized.
- Response size - clients doing strict schema validation shouldn't see new fields unless they ask for them.
- Information disclosure - timing data can reveal server capacity characteristics; operators should choose to expose it.
Before submitting a new issue...