[Feature]: Per-request timing metrics in response body

### 🚀 The feature, motivation and pitch

## Summary

Add an opt-in capability for vLLM to return per-request timing and compute metrics **in the response body** of OpenAI-compatible completion endpoints. The feature would be gated by a server-level flag (e.g. `--enable-per-request-metrics`) plus a per-request parameter (e.g. `include_metrics: true`), and would expose a structured `metrics` object alongside the normal response payload.

This issue is intentionally filed as a companion / counter-proposal to #36189, which proposes exposing the same information via HTTP response headers. A draft implementation of the body-based approach already exists in #36383.

## Motivation

(The motivation here largely overlaps with #36189; the only real difference is the delivery mechanism.)

vLLM already tracks detailed per-request timing internally (queue time, prefill time, decode time, inter-token latency, etc.) via `RequestStateStats`, and surfaces aggregated versions of this data through Prometheus metrics and OpenTelemetry traces. Those are backend-only, aggregate observability tools — they do not let an API consumer see where time was spent on _their specific request_.

Exposing this data to API consumers directly unlocks two use cases:

### 1. Per-user / per-tenant billing and cost attribution

Operators running multi-tenant deployments need to attribute GPU time and token counts back to individual requests for usage-based billing and chargeback. Prometheus gives aggregates per endpoint/model, not per request. Having `generation_time_ms`, `queue_time_ms`, `prompt_tokens`, and `completion_tokens` in the response body means the billing system that is already parsing the response JSON has everything it needs in one place, with no separate infrastructure.
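
As a rough illustration of the billing case, a consumer could fold each response into a per-tenant ledger. This is only a sketch under stated assumptions: the `tenant_id` argument, the `ledger` layout, and the `record_usage` helper are hypothetical, `usage` is the standard OpenAI-style token-count object, and `metrics` is the field proposed in this issue.

```python
from collections import defaultdict

# Per-tenant running totals. The ledger layout is purely illustrative.
ledger = defaultdict(lambda: {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "generation_time_ms": 0.0,
    "queue_time_ms": 0.0,
})


def record_usage(tenant_id: str, response: dict) -> None:
    """Add one response's token counts and timings to the tenant's totals."""
    usage = response.get("usage") or {}
    # `metrics` is absent unless both opt-in gates are enabled.
    metrics = response.get("metrics") or {}
    entry = ledger[tenant_id]
    entry["prompt_tokens"] += usage.get("prompt_tokens", 0)
    entry["completion_tokens"] += usage.get("completion_tokens", 0)
    entry["generation_time_ms"] += metrics.get("generation_time_ms", 0.0)
    entry["queue_time_ms"] += metrics.get("queue_time_ms", 0.0)
```

Because the helper treats a missing `metrics` object as zeros, it keeps working against servers that have not enabled the feature.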

### 2. Per-request SLA attribution and latency debugging

Application developers building on top of vLLM currently see only total latency. With a structured `metrics` field they can distinguish:

- Time waiting in the scheduler queue (capacity issue)
- Time in prefill / time-to-first-token (prompt cost)
- Time in decode / inter-token latency (generation cost)

This makes it trivial to add per-request SLA tracking to an application without running a Prometheus scraper or an OTEL collector.
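
For example, an application could label each request with the phase that dominated its latency. The sketch below is hypothetical client-side code, not part of the proposal; it assumes the field names from the proposed `metrics` object and uses `time_to_first_token_ms` as a stand-in for prefill cost.

```python
def dominant_phase(metrics: dict) -> str:
    """Return which phase (queue, prefill, or decode) took the longest."""
    phases = {
        "queue": metrics.get("queue_time_ms", 0.0),
        # Assumption: time-to-first-token approximates prefill cost.
        "prefill": metrics.get("time_to_first_token_ms", 0.0),
        "decode": metrics.get("generation_time_ms", 0.0),
    }
    # max() over the dict keys, ranked by their measured duration.
    return max(phases, key=phases.get)
```

A request tagged `queue` points at a capacity problem, while `prefill` or `decode` points at the cost of the prompt or of generation itself.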

## Why the response body (and not only headers, as in #36189)

The headers-only approach proposed in #36189 is attractive for proxies and load balancers, but it has a hard limitation that a body-based approach does not:

- **Streaming support** - Headers are flushed before the first token. Metrics that are only known at end-of-generation (`generation_time_ms`, `mean_itl_ms`) cannot be carried in headers without HTTP trailers, which have very limited client/proxy support. A final SSE event (or final chunk) carries the completed metrics naturally and is trivial for clients to consume.

The headers and body approaches are complementary, not mutually exclusive. Routers that want real-time hot-path signals benefit from headers; billing pipelines and SDK users need the body. The position of this RFE is that the body-based API covers cases that headers cannot.


## Proposal

### Opt-in flags (double gate)

```
vllm serve <model> --enable-per-request-metrics
```

Plus a per-request parameter:

```json
{ "model": "...", "messages": [...], "include_metrics": true }
```

Both must be set for metrics to be computed and returned. Default: **off** (no behavior change for existing users, no CPU overhead for deployments that do not opt in).
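
The double gate amounts to a logical AND of the two switches. A minimal sketch of that check follows; the function and argument names are hypothetical and are not taken from the draft implementation.

```python
from typing import Optional


def should_return_metrics(server_flag_enabled: bool,
                          include_metrics: Optional[bool]) -> bool:
    """Metrics are returned only when the server AND the request opt in."""
    # A request that omits `include_metrics` (None) is treated as opted out.
    return bool(server_flag_enabled and include_metrics)
```

Checking the gate before any timing is collected is what would keep the disabled path free of extra work.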

### Response body additions

A new optional `metrics` field on `ChatCompletionResponse`, `ChatCompletionStreamResponse`, `CompletionResponse`, and `CompletionStreamResponse`:

| Field | Unit | Description |
|---|---|---|
| `time_to_first_token_ms` | ms | Time from scheduling to first output token |
| `queue_time_ms` | ms | Time spent waiting in the scheduler queue |
| `generation_time_ms` | ms | Total decode time (excludes queue wait) |
| `mean_itl_ms` | ms | Mean inter-token latency during decode |
| `tokens_per_second` | tokens/s | Output throughput for this request |

(Prefill-time, cached-token, and per-phase GPU-time fields can be added incrementally as they become cleanly attributable to a single request.)
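
To make the field definitions concrete, here is a sketch of how such an object could be derived from four timestamps. It is not the draft's `PerRequestTimingMetrics` class: the names `RequestMetricsSketch` and `build_metrics` are hypothetical, and the choices to measure decode time from first to last token and to compute throughput over that same window are assumptions that the proposal does not pin down.

```python
from dataclasses import asdict, dataclass


@dataclass
class RequestMetricsSketch:
    time_to_first_token_ms: float
    queue_time_ms: float
    generation_time_ms: float
    mean_itl_ms: float
    tokens_per_second: float


def build_metrics(arrival_s: float, scheduled_s: float, first_token_s: float,
                  last_token_s: float, num_output_tokens: int) -> RequestMetricsSketch:
    """Derive the proposed fields from timestamps given in seconds."""
    # Assumption: decode time runs from the first to the last output token.
    generation_s = last_token_s - first_token_s
    # N tokens have N - 1 gaps between them.
    intervals = max(num_output_tokens - 1, 0)
    return RequestMetricsSketch(
        time_to_first_token_ms=(first_token_s - scheduled_s) * 1000.0,
        queue_time_ms=(scheduled_s - arrival_s) * 1000.0,
        generation_time_ms=generation_s * 1000.0,
        mean_itl_ms=(generation_s * 1000.0 / intervals) if intervals else 0.0,
        # Assumption: throughput is output tokens over decode time.
        tokens_per_second=(num_output_tokens / generation_s) if generation_s > 0 else 0.0,
    )
```

Calling `asdict(build_metrics(...))` yields a plain dictionary that could be serialized as the `metrics` field.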

### Streaming behavior

- For non-streaming responses: `metrics` is populated on the final response object.
- For streaming responses: `metrics` is emitted on the final SSE chunk — consistent with how OpenAI already emits final `usage` when `stream_options.include_usage=true`.
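
On the client side, picking the metrics out of a stream could look like the sketch below. It assumes the usual OpenAI-style SSE framing (`data: {...}` lines ending with `data: [DONE]`) and that the proposed `metrics` object appears on one of the last chunks; `extract_stream_metrics` is a hypothetical helper, not an existing client API.

```python
import json
from typing import Iterable, Optional


def extract_stream_metrics(sse_lines: Iterable[str]) -> Optional[dict]:
    """Return the last non-null `metrics` object seen in an SSE stream."""
    metrics = None
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        if chunk.get("metrics") is not None:
            metrics = chunk["metrics"]
    return metrics
```

The function returns `None` when the server or the request did not opt in, so callers can treat metrics as strictly optional.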


### Alternatives

| Alternative | Why Not |
|---|---|
| **Headers only (#36189)** | Cannot carry end-of-generation timing for streaming cases. |
| **Prometheus only** | Aggregate, not per-request; requires scraping infrastructure; invisible to the caller. |
| **OpenTelemetry traces** | Requires a tracing backend; not accessible over plain HTTP; high operational overhead. |
| **Always-on body fields** | Adds CPU cost and response size for deployments that don't care. Opt-in keeps zero overhead as the default. |
| **Custom middleware** | Can only observe wall-clock time; cannot reach engine-internal timings (queue / prefill / decode). |


### Additional context

## Relationship to existing work

- Draft implementation: #36383 (already implements the core shape proposed here, including `PerRequestTimingMetrics` and the double-gate flag).
- Companion RFE for headers: #36189. That issue remains a valid proposal for router-facing, hot-path metrics; this RFE is scoped to the body-side API that headers cannot replace.

### Why opt-in?

- **Zero overhead when disabled** - no extra computation, no extra fields serialized.
- **Response size and schema compatibility** - responses do not grow, and clients doing strict schema validation do not see new fields unless they ask for them.
- **Information disclosure** - timing data can reveal server capacity characteristics; operators should choose to expose it.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.


[Feature]: Per-request timing metrics in response body #40076
