Follow-up to #856
#856 adds controller-level drain-before-roll support for the llama.cpp-compatible `/slots` path, which covers the single-replica local-model case that motivated the original issue.
This follow-up tracks broadening the idle detector so `spec.rolloutPolicy.waitForIdle` can make runtime-specific decisions beyond that first implementation.
Scope
- Add runtime-specific idle detectors for vLLM and TGI using their exposed metrics or health endpoints.
- Define a generic fallback contract for custom runtimes, for example an annotation or spec field that declares an idle endpoint and how to parse its response.
- For multi-replica Services, do not rely on a single request through the load-balanced Service, since it reaches only one replica; inspect each serving pod or endpoint so the rollout waits until every current replica is idle.
- Document which runtimes support drain-before-roll and what signal each uses.
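The scope items above can be sketched roughly as follows. This is an illustration in Python, not the controller's code, and it rests on assumptions that still need the validation this issue asks for: that vLLM's `/metrics` exposes the gauges `vllm:num_requests_running` and `vllm:num_requests_waiting`, that TGI's `/metrics` exposes `tgi_queue_size` and `tgi_batch_current_size`, and that zero on all of them means idle. The annotation keys `example.com/idle-path` and `example.com/idle-json-field`, and every function name, are hypothetical.

```python
import json
from typing import Callable, Dict, Iterable, Optional, Tuple

Parser = Callable[[str], bool]   # raw response body -> True when idle
Detector = Tuple[str, Parser]    # (HTTP path to query on each pod, parser)


def _sample(line: str) -> Tuple[str, float]:
    """Split one Prometheus text-format sample into (metric name, value)."""
    if "{" in line:
        name = line[: line.index("{")]
        rest = line[line.rindex("}") + 1 :]
    else:
        name, _, rest = line.partition(" ")
    return name, float(rest.split()[0])


def prometheus_gauges_zero(metric_names: Iterable[str]) -> Parser:
    """Idle only when every named gauge is present and sums to zero."""
    wanted = set(metric_names)

    def parse(body: str) -> bool:
        totals: Dict[str, float] = {}
        for line in body.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, value = _sample(line)
            if name in wanted:
                totals[name] = totals.get(name, 0.0) + value
        # A missing gauge counts as "not idle" (fail closed).
        return all(totals.get(name) == 0.0 for name in wanted)

    return parse


def json_field_equals(field: str, expected) -> Parser:
    """Generic parser for custom runtimes that report a JSON counter."""
    return lambda body: json.loads(body).get(field) == expected


# Assumed metric names; each needs validation against the real runtime.
DETECTORS: Dict[str, Detector] = {
    "vllm": ("/metrics", prometheus_gauges_zero(
        ["vllm:num_requests_running", "vllm:num_requests_waiting"])),
    "tgi": ("/metrics", prometheus_gauges_zero(
        ["tgi_queue_size", "tgi_batch_current_size"])),
}


def detector_for(runtime: str, annotations: Dict[str, str]) -> Optional[Detector]:
    """Built-in detector if known, else the hypothetical annotation contract."""
    if runtime in DETECTORS:
        return DETECTORS[runtime]
    path = annotations.get("example.com/idle-path")
    field = annotations.get("example.com/idle-json-field")
    if path and field:
        return path, json_field_equals(field, 0)
    return None


def all_replicas_idle(pod_addresses: Iterable[str],
                      detector: Optional[Detector],
                      fetch: Callable[[str], str]) -> bool:
    """True only if every pod, queried directly, reports idle."""
    if detector is None:
        return False
    path, parser = detector
    for address in pod_addresses:
        try:
            if not parser(fetch(f"http://{address}{path}")):
                return False
        except Exception:
            # Unreachable pod or unparsable body: do not roll.
            return False
    return True
```

Here `fetch` is injected so the per-pod loop can be exercised without a cluster; a real controller would take pod addresses from the Service's endpoints rather than call the Service itself. Failing closed on errors is a design choice in this sketch, not something #856 prescribes.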
Why separate from #856
The original issue is primarily about avoiding dropped generations for single-replica llama.cpp workloads. Support for vLLM, TGI, and generic runtimes is valuable, but it needs runtime-specific signal validation and likely per-pod endpoint iteration, so it is better reviewed as a focused follow-up than folded into the initial PR.