Steps to reproduce
When a pipeline task's pod cannot be scheduled by Kubernetes — for example because the
requested CPU or memory exceeds what any node can satisfy — the run appears to hang
indefinitely in the KFP UI with no error or explanation visible to the user. The run
stays in RUNNING state and only a cluster administrator who can inspect raw pod events
will discover the actual cause (Unschedulable: Insufficient cpu).
- Author a pipeline component that sets an explicit resource limit that exceeds cluster
capacity, for example:

    from kfp import dsl

    @dsl.component
    def heavy_task() -> str:
        return "done"

    @dsl.pipeline
    def my_pipeline():
        task = heavy_task()
        task.set_cpu_limit("128")  # exceeds any available node
        task.set_memory_limit("512Gi")  # exceeds any available node

- Compile and submit the pipeline to a KFP deployment running on a cluster that cannot
satisfy those resource requests.
- Observe the run in the Pipelines UI.
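To make the reproduction condition concrete, here is a minimal sketch of why such a pod cannot be placed: a request fits only if a single node can satisfy every requested resource at once. The node capacities, the helper names `parse_quantity` and `fits_on_any_node`, and the subset of quantity suffixes handled are assumptions for illustration. The real scheduler also subtracts what running pods have already requested and applies further filters.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes (assumption: enough for this sketch).
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "m": Decimal("0.001"), "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(quantity: str) -> Decimal:
    """Convert a quantity such as "128", "500m" or "512Gi" to a plain number."""
    # Check two-letter suffixes first so "Gi" is matched before any one-letter suffix.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(quantity)


def fits_on_any_node(requests: dict, nodes: list) -> bool:
    """True if at least one node can satisfy every requested resource."""
    return any(
        all(
            parse_quantity(amount) <= parse_quantity(node.get(name, "0"))
            for name, amount in requests.items()
        )
        for node in nodes
    )


# Hypothetical cluster: two nodes, neither close to 128 CPUs or 512Gi of memory.
nodes = [
    {"cpu": "16", "memory": "64Gi"},
    {"cpu": "32", "memory": "128Gi"},
]
print(fits_on_any_node({"cpu": "128", "memory": "512Gi"}, nodes))  # False
```

A pod for which this kind of check fails stays Pending, which is the state the run in the steps above ends up in.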
Actual Behavior
- The run transitions to RUNNING and stays there indefinitely.
- No error message or warning is shown in the UI.
- The task node shows no failure reason.
- The root cause (pod unschedulable due to insufficient resources) is only discoverable
by running kubectl describe pod <pod-name> and reading the Kubernetes event:
Warning FailedScheduling ... 0/N nodes are available: N Insufficient cpu.
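The same information is also available in the pod's status, which is what `kubectl get pod <pod-name> -o json` returns. The sketch below shows how the scheduling failure could be read from that JSON. The function name `unschedulable_message` and the sample pod document are hypothetical, and this is not how KFP currently inspects pods.

```python
import json
from typing import Optional


def unschedulable_message(pod: dict) -> Optional[str]:
    """Return the scheduler's message if the pod is Pending and unschedulable."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return None
    for condition in status.get("conditions", []):
        if (
            condition.get("type") == "PodScheduled"
            and condition.get("status") == "False"
            and condition.get("reason") == "Unschedulable"
        ):
            return condition.get("message", "")
    return None


# Hypothetical, trimmed-down pod status for illustration only.
sample_pod = json.loads("""
{
  "status": {
    "phase": "Pending",
    "conditions": [
      {
        "type": "PodScheduled",
        "status": "False",
        "reason": "Unschedulable",
        "message": "0/3 nodes are available: 3 Insufficient cpu."
      }
    ]
  }
}
""")

print(unschedulable_message(sample_pod))
```

Reading the condition from the pod status avoids parsing the event stream, although both carry the same scheduler message.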
Why This Matters
Users who set resource limits (via set_cpu_limit, set_memory_limit, or
kubernetes.set_resources_v2) have no indication that their pipeline will never make
progress until they (or an admin) inspect the cluster directly. This is especially
problematic because:
- Non-admin users typically cannot access pod-level events.
- The UI gives no hint to retry with smaller limits or to contact an administrator.
- The run continues consuming a pipeline run slot indefinitely.
Expected result
If a pod remains in Pending state and its PodScheduled condition is False with
reason Unschedulable after several retries, KFP should surface this as a meaningful
error to the user:
- Failing the task with an error message that includes the Kubernetes event reason
(e.g., "Pod could not be scheduled: Insufficient cpu. Adjust resource limits or use
a node with sufficient capacity.")
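One way the requested behavior could look is sketched below. It uses an elapsed-time threshold as a stand-in for "several retries", so that a pod waiting briefly for capacity is not failed too early. The threshold value, the function name `scheduling_failure`, and the exact message format are assumptions, not existing KFP behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical threshold; a real implementation would likely make it configurable.
GRACE_PERIOD = timedelta(minutes=10)


def scheduling_failure(
    condition: dict, now: datetime, grace: timedelta = GRACE_PERIOD
) -> Optional[str]:
    """Return an error message once a pod has been unschedulable for `grace`."""
    is_unschedulable = (
        condition.get("type") == "PodScheduled"
        and condition.get("status") == "False"
        and condition.get("reason") == "Unschedulable"
    )
    if not is_unschedulable:
        return None
    # Kubernetes timestamps are RFC 3339 with a trailing "Z" for UTC.
    since = datetime.fromisoformat(
        condition["lastTransitionTime"].replace("Z", "+00:00")
    )
    if now - since < grace:
        return None  # still within the grace period, keep waiting
    detail = condition.get("message", "").strip().rstrip(".") or "Unschedulable"
    return (
        f"Pod could not be scheduled: {detail}. "
        "Adjust resource limits or use a node with sufficient capacity."
    )


# Hypothetical PodScheduled condition for illustration only.
condition = {
    "type": "PodScheduled",
    "status": "False",
    "reason": "Unschedulable",
    "lastTransitionTime": "2024-01-01T00:00:00Z",
    "message": "0/3 nodes are available: 3 Insufficient cpu.",
}
print(scheduling_failure(condition, datetime(2024, 1, 1, 0, 15, tzinfo=timezone.utc)))
```

A time-based threshold is only one option. Counting FailedScheduling events would match the wording above more literally.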
Materials and Reference
Impacted by this bug? Give it a 👍.
Environment
How did you deploy Kubeflow Pipelines (KFP)?
The issue appears to apply regardless of how KFP is deployed. In my case, I use a Kubeflow Pipelines instance in RHOAI, running on Kubernetes.
KFP version:
2.16.0
KFP SDK version:
2.16.1