Extend drain-before-roll idle checks to vLLM, TGI, and multi-replica services #911

Description

@itsmiso-ai

Follow-up to #856

#856 adds controller-level drain-before-roll support for the llama.cpp-compatible /slots path, which covers the single-replica local-model case that motivated the original issue.

This follow-up tracks extending the idle detector so that spec.rolloutPolicy.waitForIdle can make runtime-specific idle decisions for runtimes other than the llama.cpp-compatible one covered by that first implementation.

Scope

  • Add runtime-specific idle detectors for vLLM and TGI using their exposed metrics or health endpoints.
  • Define a generic fallback/contract for custom runtimes, for example an annotation or spec field declaring an idle endpoint and parser.
  • For multi-replica Services, do not rely on a single request through the load-balanced Service, since it reaches only one arbitrary replica; inspect each serving pod or endpoint directly so the rollout waits until every current replica is idle.
  • Document which runtimes support drain-before-roll and what signal each uses.
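
To make the scope above concrete, here is a rough Python sketch of how a per-runtime idle check and the "every replica must be idle" rule could fit together. It is an illustration under assumptions, not the controller's actual code: the helper names (`parse_gauges`, `is_idle`, `all_replicas_idle`) and the injected `fetch` callable are hypothetical, and the gauge names (`vllm:num_requests_running`, `vllm:num_requests_waiting`, `tgi_batch_current_size`, `tgi_queue_size`) are assumed from the runtimes' Prometheus metrics and would need validating against the versions actually deployed.

```python
from typing import Callable, Dict, Iterable

# ASSUMPTION: gauge names per runtime; validate against the deployed versions.
IDLE_GAUGES = {
    "vllm": ("vllm:num_requests_running", "vllm:num_requests_waiting"),
    "tgi": ("tgi_batch_current_size", "tgi_queue_size"),
}


def parse_gauges(metrics_text: str) -> Dict[str, float]:
    """Sum sample values per metric name from Prometheus text output, ignoring labels."""
    totals: Dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line and "}" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :].split()
        else:
            parts = line.split()
            name, rest = parts[0], parts[1:]
        if not rest:
            continue  # malformed sample line without a value
        try:
            value = float(rest[0])
        except ValueError:
            continue
        totals[name] = totals.get(name, 0.0) + value
    return totals


def is_idle(runtime: str, metrics_text: str) -> bool:
    """Return True only if every expected gauge is present and equal to zero."""
    gauges = parse_gauges(metrics_text)
    for name in IDLE_GAUGES[runtime]:
        # Fail closed: a missing or non-zero (including NaN) signal counts as busy.
        if name not in gauges or gauges[name] != 0:
            return False
    return True


def all_replicas_idle(
    runtime: str,
    pod_addresses: Iterable[str],
    fetch: Callable[[str], str],
) -> bool:
    """Check each pod individually instead of one load-balanced Service request.

    `fetch` is a hypothetical hook that returns the metrics text for one pod
    address. An empty address list yields True, since there is nothing to drain.
    """
    for address in pod_addresses:
        try:
            text = fetch(address)
        except Exception:
            return False  # an unreachable pod is treated as busy
        if not is_idle(runtime, text):
            return False
    return True
```

Failing closed (treating a missing metric or an unreachable pod as busy) is a deliberate choice in this sketch: it can delay a rollout, but it avoids rolling a replica whose state is unknown. A generic contract for custom runtimes could plug in at the same point by supplying its own endpoint and parser in place of `IDLE_GAUGES`.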

Why separate from #856

The original issue is primarily about avoiding dropped generations for single-replica llama.cpp workloads. vLLM/TGI/generic support is valuable, but it needs runtime-specific signal validation and likely per-pod endpoint iteration, so it is better reviewed as a focused follow-up than as an expansion of the initial PR.
