[Feature]: Per-request timing metrics in response body #40076

@nv-nedelman-1

🚀 The feature, motivation and pitch

Summary

Add an opt-in capability for vLLM to return per-request timing and compute metrics in the response body of OpenAI-compatible completion endpoints. The feature would be gated by a server-level flag (e.g. --enable-per-request-metrics) plus a per-request parameter (e.g. include_metrics: true), and would expose a structured metrics object alongside the normal response payload.

This issue is intentionally filed as a companion / counter-proposal to #36189, which proposes exposing the same information via HTTP response headers. A draft implementation of the body-based approach already exists in #36383.

Motivation

(The motivation here largely overlaps with #36189; the only real difference is the delivery mechanism.)

vLLM already tracks detailed per-request timing internally (queue time, prefill time, decode time, inter-token latency, etc.) via RequestStateStats, and surfaces aggregated versions of this data through Prometheus metrics and OpenTelemetry traces. Those are backend-only, aggregate observability tools — they do not let an API consumer see where time was spent on their specific request.

Exposing this data to API consumers directly unlocks two use cases:

1. Per-user / per-tenant billing and cost attribution

Operators running multi-tenant deployments need to attribute GPU time and token counts back to individual requests for usage-based billing and chargeback. Prometheus gives aggregates per endpoint/model, not per request. Having generation_time_ms, queue_time_ms, prompt_tokens, and completion_tokens in the response body means the billing system that is already parsing the response JSON has everything it needs in one place, with no separate infrastructure.

2. Per-request SLA attribution and latency debugging

Application developers building on top of vLLM currently see only total latency. With a structured metrics field they can distinguish:

  • Time waiting in the scheduler queue (capacity issue)
  • Time in prefill / time-to-first-token (prompt cost)
  • Time in decode / inter-token latency (generation cost)

This makes it trivial to add per-request SLA tracking to an application without running a Prometheus scraper or an OTEL collector.

Why the response body (and not only headers, as in #36189)

The headers-only approach proposed in #36189 is attractive for proxies and load balancers, but it has a hard limitation that a body-based approach does not:

  • Streaming support - Headers are flushed before the first token. Metrics that are only known at end-of-generation (generation_time_ms, mean_itl_ms) cannot be carried in headers without HTTP trailers, which have very limited client/proxy support. A final SSE event (or final chunk) carries the completed metrics naturally and is trivial for clients to consume.

The headers and body approaches are complementary, not mutually exclusive. Routers that want real-time hot-path signals benefit from headers; billing pipelines and SDK users need the body. The position of this RFE is that the body-based API covers cases that headers cannot.

Proposal

Opt-in flags (double gate)

vllm serve <model> --enable-per-request-metrics

Plus a per-request parameter:

{ "model": "...", "messages": [...], "include_metrics": true }

Both must be set for metrics to be computed and returned. Default: off (no behavior change for existing users, no CPU overhead for deployments that do not opt in).
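The double gate described above can be sketched as a single predicate. This is only an illustration of the intended semantics, not code from #36383: `ServerConfig` and `should_emit_metrics` are hypothetical names, and the strict "literal `true` only" check is an assumption about how the request field would be validated.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ServerConfig:
    # Hypothetical stand-in for the state set by --enable-per-request-metrics.
    enable_per_request_metrics: bool = False


def should_emit_metrics(config: ServerConfig, request: Mapping[str, Any]) -> bool:
    """Return True only when both the server flag and the request flag are set."""
    # Assumption: only a literal JSON `true` opts in; a missing field means off.
    request_opt_in = request.get("include_metrics", False) is True
    return config.enable_per_request_metrics and request_opt_in
```

With this shape, a request that sets include_metrics against a server started without the flag simply gets the normal response, so existing clients and deployments see no change.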

Response body additions

A new optional metrics field on ChatCompletionResponse, ChatCompletionStreamResponse, CompletionResponse, and CompletionStreamResponse:

| Field | Unit | Description |
| --- | --- | --- |
| time_to_first_token_ms | ms | Time from scheduling to first output token |
| queue_time_ms | ms | Time spent waiting in the scheduler queue |
| generation_time_ms | ms | Total decode time (excludes queue wait) |
| mean_itl_ms | ms | Mean inter-token latency during decode |
| tokens_per_second | tokens/s | Output throughput for this request |

(Prefill-time, cached-token, and per-phase GPU-time fields can be added incrementally as they become cleanly attributable to a single request.)
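As a rough sketch of how the proposed fields relate to each other, the following derives them from three kinds of timestamps. Everything here is an assumption for illustration: `RequestMetrics` and `build_metrics` are hypothetical names, the inputs are plain monotonic timestamps in seconds rather than vLLM's internal RequestStateStats, and generation_time_ms is taken to run from scheduling to the last token (one possible reading of "excludes queue wait").

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RequestMetrics:
    time_to_first_token_ms: Optional[float]
    queue_time_ms: float
    generation_time_ms: Optional[float]
    mean_itl_ms: Optional[float]
    tokens_per_second: Optional[float]


def build_metrics(
    arrival_s: float,
    scheduled_s: float,
    token_times_s: Sequence[float],
) -> RequestMetrics:
    """Derive the proposed fields from timestamps given in seconds."""
    queue_ms = (scheduled_s - arrival_s) * 1000.0
    if not token_times_s:
        # No output tokens: only the queue wait is known.
        return RequestMetrics(None, queue_ms, None, None, None)

    first, last = token_times_s[0], token_times_s[-1]
    n = len(token_times_s)
    elapsed_s = last - scheduled_s  # assumption: excludes queue wait only

    # Mean inter-token latency needs at least two tokens (n - 1 gaps).
    mean_itl_ms = (last - first) * 1000.0 / (n - 1) if n > 1 else None
    tokens_per_second = n / elapsed_s if elapsed_s > 0 else None

    return RequestMetrics(
        time_to_first_token_ms=(first - scheduled_s) * 1000.0,
        queue_time_ms=queue_ms,
        generation_time_ms=elapsed_s * 1000.0,
        mean_itl_ms=mean_itl_ms,
        tokens_per_second=tokens_per_second,
    )
```

Fields that cannot be computed (for example mean_itl_ms for a single-token response) are left as None in this sketch; whether the real schema would omit them or send null is an open design choice.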

Streaming behavior

  • For non-streaming responses: metrics is populated on the final response object.
  • For streaming responses: metrics is emitted on the final SSE chunk — consistent with how OpenAI already emits final usage when stream_options.include_usage=true.
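From the client side, reading metrics off the final chunk could look roughly like the sketch below. It assumes the usual OpenAI-style SSE framing (`data: <json>` lines terminated by `data: [DONE]`) and that the proposed `metrics` field appears as a top-level key on the last chunk; `consume_stream` is a hypothetical helper, and the sample payloads in the usage note are made up.

```python
import json
from typing import Any, Dict, Iterable, Optional, Tuple


def consume_stream(lines: Iterable[str]) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Collect streamed text and the metrics object, if any chunk carries one."""
    text_parts = []
    metrics: Optional[Dict[str, Any]] = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            text_parts.append(delta.get("content") or "")
        # Assumption: metrics is only present on the final chunk.
        if chunk.get("metrics") is not None:
            metrics = chunk["metrics"]
    return "".join(text_parts), metrics
```

For example, feeding it the lines `data: {"choices": [{"delta": {"content": "Hi"}}]}`, `data: {"choices": [], "metrics": {"queue_time_ms": 3.0}}` and `data: [DONE]` yields the text "Hi" together with the metrics dict. Clients that ignore unknown keys keep working unchanged, which mirrors how the final usage chunk is handled today.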

Alternatives

| Alternative | Why Not |
| --- | --- |
| Headers only (#36189) | Cannot carry end-of-generation timing for streaming cases. |
| Prometheus only | Aggregate, not per-request; requires scraping infrastructure; invisible to the caller. |
| OpenTelemetry traces | Requires a tracing backend; not accessible over plain HTTP; high operational overhead. |
| Always-on body fields | Adds CPU cost and response size for deployments that don't care. Opt-in keeps zero overhead as the default. |
| Custom middleware | Can only observe wall-clock time; cannot reach engine-internal timings (queue / prefill / decode). |

Additional context

Relationship to existing work

Why opt-in?

  • Zero overhead when disabled - no extra computation, no extra fields serialized.
  • Response size - clients doing strict schema validation shouldn't see new fields unless they ask for them.
  • Information disclosure - timing data can reveal server capacity characteristics; operators should choose to expose it.
