[backend] Pipeline task stuck in Pending with no user-visible error when pod is Unschedulable (e.g., insufficient resources) #13401

@Mateusz-Switala

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    The issue appears to apply regardless of how KFP is deployed. In my case, I run it on Kubernetes, as the Kubeflow Pipelines instance in RHOAI.

  • KFP version:
    2.16.0

  • KFP SDK version:
    2.16.1

Steps to reproduce

When a pipeline task's pod cannot be scheduled by Kubernetes (for example, because the
requested CPU or memory exceeds what any node can satisfy), the run appears to hang
indefinitely in the KFP UI with no error or explanation visible to the user. The run
stays in RUNNING state, and only a cluster administrator who can inspect raw pod events
will discover the actual cause (Unschedulable: Insufficient cpu).

  1. Author a pipeline component that sets an explicit resource limit that exceeds cluster
    capacity, for example:

    from kfp import dsl
    
    @dsl.component
    def heavy_task() -> str:
        return "done"
    
    @dsl.pipeline
    def my_pipeline():
        task = heavy_task()
        task.set_cpu_limit("128")        # exceeds any available node
        task.set_memory_limit("512Gi")   # exceeds any available node

  2. Compile and submit the pipeline to a KFP deployment running on a cluster that cannot
    satisfy those resource requests.

  3. Observe the run in the Pipelines UI.

Actual Behavior

  • The run transitions to RUNNING and stays there indefinitely.
  • No error message or warning is shown in the UI.
  • The task node shows no failure reason.
  • The root cause (pod unschedulable due to insufficient resources) is only discoverable
    by running kubectl describe pod <pod-name> and reading the Kubernetes event:
    Warning FailedScheduling ... 0/N nodes are available: N Insufficient cpu.
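
To illustrate what information is already present in that event, here is a minimal sketch that extracts the "Insufficient <resource>" reasons from a FailedScheduling message. It is not KFP code: the function name insufficient_resources and the regular expression are hypothetical, and the message layout is assumed to match the example above (scheduler message text can vary between Kubernetes versions).

```python
import re

# Matches fragments such as "3 Insufficient cpu" or "2 Insufficient nvidia.com/gpu".
# Dots are only allowed inside a resource name, so a sentence-ending "." is not captured.
_REASON = re.compile(r"(\d+) (Insufficient [\w/-]+(?:\.[\w/-]+)*)")


def insufficient_resources(message: str) -> dict[str, int]:
    """Hypothetical helper: map each 'Insufficient <resource>' reason to its node count."""
    counts: dict[str, int] = {}
    for count, reason in _REASON.findall(message):
        counts[reason] = int(count)
    return counts


if __name__ == "__main__":
    msg = "0/3 nodes are available: 1 Insufficient memory, 3 Insufficient cpu."
    print(insufficient_resources(msg))
```

A summary like this could be shown next to the task node so that a non-admin user can see which resource is the bottleneck without access to pod events.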

Why This Matters

Users who set resource limits (via set_cpu_limit, set_memory_limit, or
kubernetes.set_resources_v2) have no indication that their pipeline will never make
progress until they (or an admin) inspect the cluster directly. This is especially
problematic because:

  • Non-admin users typically cannot access pod-level events.
  • The UI gives no hint to retry with smaller limits or to contact an administrator.
  • The run continues consuming a pipeline run slot indefinitely.

Expected result

If a pod remains in Pending state and its PodScheduled condition is False with
reason Unschedulable after several retries, KFP should surface this as a meaningful error to the user by:

  • Failing the task with an error message that includes the Kubernetes event reason
    (e.g., "Pod could not be scheduled: Insufficient cpu. Adjust resource limits or use
    a node with sufficient capacity.").
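
The check described above could look roughly like the following sketch. It is an illustration only, not how the KFP backend is implemented: it operates on a plain dict shaped like the JSON of a Kubernetes Pod object (status.phase and status.conditions), and the function name unschedulable_error is hypothetical. The "several retries" grace period is left out and would need to be added by whoever implements this.

```python
from typing import Optional


def unschedulable_error(pod: dict) -> Optional[str]:
    """Hypothetical helper: return a user-facing error if the pod is Pending
    and its PodScheduled condition is False with reason Unschedulable."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return None
    for cond in status.get("conditions", []):
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"  # condition status is a string in the Pod API
            and cond.get("reason") == "Unschedulable"
        ):
            detail = cond.get("message", "no scheduler message available.")
            return (
                f"Pod could not be scheduled: {detail} "
                "Adjust resource limits or use a node with sufficient capacity."
            )
    return None


if __name__ == "__main__":
    example_pod = {
        "status": {
            "phase": "Pending",
            "conditions": [
                {
                    "type": "PodScheduled",
                    "status": "False",
                    "reason": "Unschedulable",
                    "message": "0/3 nodes are available: 3 Insufficient cpu.",
                }
            ],
        }
    }
    print(unschedulable_error(example_pod))
```

Returning None for every other state keeps the check conservative: a pod that is merely waiting for an image pull or a volume would not be failed by this logic.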

Impacted by this bug? Give it a 👍.
