Extend drain-before-roll idle checks to vLLM, TGI, and multi-replica services #911

## Follow-up to #856

#856 adds controller-level drain-before-roll support for the llama.cpp-compatible `/slots` path, which covers the single-replica local-model case that motivated the original issue.

This follow-up tracks broadening the idle detector so `spec.rolloutPolicy.waitForIdle` can make runtime-specific decisions beyond that first implementation.

## Scope

- Add runtime-specific idle detectors for vLLM and TGI using their exposed metrics or health endpoints.
- Define a generic fallback/contract for custom runtimes, for example an annotation or spec field declaring an idle endpoint and parser.
- For multi-replica Services, do not rely on a single request through the load-balanced Service, which reaches only one arbitrary replica; inspect each serving pod or endpoint so the rollout waits until every current replica is idle.
- Document which runtimes support drain-before-roll and what signal each uses.
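
A minimal sketch of how the detector and per-replica check above could fit together, assuming each runtime serves Prometheus text metrics. The gauge names (`vllm:num_requests_running`, `vllm:num_requests_waiting`, `tgi_batch_current_size`, `tgi_queue_size`) are assumptions that need validating against the deployed runtime versions, and `IDLE_GAUGES`, `parse_gauges`, `is_idle`, and `all_replicas_idle` are hypothetical names, not part of #856.

```python
from typing import Callable, Dict, Iterable, Tuple

# ASSUMPTION: gauge names per runtime; verify against the versions you deploy.
IDLE_GAUGES: Dict[str, Tuple[str, ...]] = {
    "vllm": ("vllm:num_requests_running", "vllm:num_requests_waiting"),
    "tgi": ("tgi_batch_current_size", "tgi_queue_size"),
}


def parse_gauges(metrics_text: str) -> Dict[str, float]:
    """Sum samples per metric name from Prometheus text output, ignoring labels."""
    totals: Dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Sample with labels: name{labels} value [timestamp]
            head, _, rest = line.rpartition("}")
            name = head.split("{", 1)[0]
        else:
            # Sample without labels: name value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # unparseable sample; leave it out
        totals[name] = totals.get(name, 0.0) + value
    return totals


def is_idle(runtime: str, metrics_text: str) -> bool:
    """Idle only if every required gauge is present and equal to zero."""
    required = IDLE_GAUGES.get(runtime)
    if required is None:
        raise ValueError(f"no idle detector for runtime {runtime!r}")
    gauges = parse_gauges(metrics_text)
    # Fail closed: a missing gauge is treated as busy.
    return all(name in gauges and gauges[name] == 0 for name in required)


def all_replicas_idle(
    pod_urls: Iterable[str],
    fetch: Callable[[str], str],
    runtime: str,
) -> bool:
    """Check every pod directly; one busy or unreachable pod blocks the roll."""
    for url in pod_urls:
        try:
            text = fetch(url)
        except Exception:
            return False  # cannot prove the pod is idle, so keep waiting
        if not is_idle(runtime, text):
            return False
    return True  # note: also True when pod_urls is empty
```

`fetch` is injected so the caller decides how pod metrics are read (for example, an HTTP GET against each pod IP rather than the Service address). Treating missing gauges and fetch failures as busy is a deliberate choice in this sketch: it delays a rollout rather than risking dropped generations.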

## Why separate from #856

The original issue is primarily about avoiding dropped generations for single-replica llama.cpp workloads. Support for vLLM, TGI, and generic runtimes is valuable, but it needs runtime-specific signal validation and likely per-pod endpoint iteration, so it is better reviewed as a focused follow-up than folded into the initial PR.
