fix(sdk): remove max duration limit on retries by JerT33 · Pull Request #13452 · kubeflow/pipelines

JerT33 · 2026-05-30T02:21:45Z

Description of your changes:

Currently, retries are subject to a hard cap of 1 hour, regardless of user configuration. Importantly, this 1hr limit is cumulative across all retry attempts of the same component (Argo measures it from the first attempt's start). Because of how this interacts with Argo Workflows, once the limit is reached the active retry pod is failed, or, if the first pod ran for more than 1hr before failing, no retry pod is started at all.

This PR is intended to be a quick fix. The default behavior is unchanged, but users can now override the limit to work around this issue for long-running components.

Further discussion/changes can be proposed to address the fact that this window is counted against the component's total run time, but that will require much larger configuration changes.

Live Cluster Evidence

Before (2.16.1):

Compiled retryPolicy with `backoff_max_duration='4h'`

retryPolicy:
  backoffMaxDuration: 3600s   # ← reduced from '4h'
  maxRetryCount: 1

50 minute wait example:

Initial pod start time: 2026-05-30T02:51:40Z:

    state:
      terminated:
        containerID: containerd://62169eb1743a4d8e19fcc9663dfd5253396ff62714f698d64f476de6cfca1919
        exitCode: 1
        finishedAt: "2026-05-30T03:47:29Z"
        reason: Error
        startedAt: "2026-05-30T02:51:40Z"

Pod is retried, but fails prematurely.
Retry pod deadline: about 4 minutes after the initial pod failed (the initial pod failed about 56 mins in).

    state:
      terminated:
        containerID: containerd://75afccb4eeb1ff6b2e3752dfa88398743048c00ac56e92d0beec4e8df9912b12
        exitCode: 143
        finishedAt: "2026-05-30T03:51:12Z"
        reason: Error
        startedAt: "2026-05-30T03:47:45Z"

Failure forced due to argo deadline:

- name: ARGO_DEADLINE
  value: "2026-05-30T03:51:10Z"
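
As a sanity check on the timestamps above, the arithmetic can be reproduced in a few lines of Python. This is only a sketch: it assumes the retry window is exactly 3600s (the value in the compiled retryPolicy) and infers the window start by subtracting that from ARGO_DEADLINE. `parse_utc` is a helper introduced here for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(ts: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as 2026-05-30T02:51:40Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Values copied from the pod status and ARGO_DEADLINE shown above.
initial_start = parse_utc("2026-05-30T02:51:40Z")
initial_end = parse_utc("2026-05-30T03:47:29Z")
retry_start = parse_utc("2026-05-30T03:47:45Z")
retry_end = parse_utc("2026-05-30T03:51:12Z")
argo_deadline = parse_utc("2026-05-30T03:51:10Z")

# How long the first attempt ran before failing.
initial_runtime = initial_end - initial_start

# Time the retry pod had left when it started, and how long it actually ran.
retry_budget = argo_deadline - retry_start
retry_runtime = retry_end - retry_start

# Assumption: the window is exactly 3600s, so its start is deadline - 3600s.
window_start = argo_deadline - timedelta(seconds=3600)

print("initial runtime:", initial_runtime)        # 0:55:49
print("retry budget:   ", retry_budget)           # 0:03:25
print("retry runtime:  ", retry_runtime)          # 0:03:27
print("window start:   ", window_start.isoformat())
```

Under that assumption the retry pod started with roughly 205 seconds of budget and was terminated about 2 seconds after the deadline. The inferred window start (02:51:10Z) falls about 30 seconds before the first container's `startedAt`, which is consistent with the limit being measured from the first attempt rather than from the retry.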

>1hr wait example:

Pod fails after over 1hr of executing; no pods are retried and the workflow fails immediately.

After:

Compiled retryPolicy with `backoff_max_duration='4h'`

retryPolicy:
  backoffMaxDuration: 14400s   # ← '4h' preserved
  maxRetryCount: 1
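
The difference between the two compiled values can be illustrated with a small sketch. This is not the code in `structures.py`; `duration_to_seconds` and `compiled_max_duration` are hypothetical helpers, and the set of accepted units is an assumption made for the example.

```python
import re

# Assumed unit table for the example; the real SDK may accept other formats.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
OLD_CAP_SECONDS = 3600  # the 1-hour cap applied before this PR


def duration_to_seconds(duration: str) -> int:
    """Convert a string such as '4h' or '90m' into seconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([smhd])", duration.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def compiled_max_duration(duration: str, cap_at_one_hour: bool) -> str:
    """Return the value that would appear as backoffMaxDuration."""
    seconds = duration_to_seconds(duration)
    if cap_at_one_hour:
        # Old behavior: anything above one hour is silently reduced.
        seconds = min(seconds, OLD_CAP_SECONDS)
    return f"{seconds}s"


print(compiled_max_duration("4h", cap_at_one_hour=True))   # 3600s
print(compiled_max_duration("4h", cap_at_one_hour=False))  # 14400s
```

Values at or below one hour come out the same either way, which is why the default behavior is unaffected.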

50 minute wait example:

Initial pod fails after 1hr, retry pod is spun up as expected
argo deadline on retry pod (4hrs after initial execution time):

    - name: ARGO_DEADLINE
      value: "2026-05-31T17:20:36Z"

61 minute wait example:

(initial pod ran past 1hr threshold, failed, pod was still retried due to the 4hr limit increase)
argo deadline on retry pod (4hrs after initial execution time):

```yaml
- name: ARGO_DEADLINE
  value: "2026-05-31T17:20:59Z"
```

Checklist:

You have signed off your commits
The title for your pull request (PR) should follow our title convention. Learn more about the pull request title convention used in this repository.

argoWF when my pod attempts to retry after an hour:

google-oss-prow · 2026-05-30T02:21:48Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ntny · 2026-05-30T18:36:59Z

@JerT33 Good point, this is a real issue.
I think changing the current backoff_max_duration is tricky because the SDK already writes 3600s into the IR by default, so argocompiler cannot distinguish whether the user really set 1h or it is just the old default.
Maybe we should deprecate this parameter and add a new optional one. Then the Argo compiler can skip setting maxDuration when the new field is not set, and set it only if the user explicitly asks for it.

@zazulam @droctothorpe hi! could you also take a look? This may be controversial, so I’d like to get a second opinion
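
The optional-field idea suggested above could look roughly like the following. This is a sketch only: `RetrySettings`, `total_retry_window_seconds`, and `to_argo_retry_strategy` are hypothetical names, and the dict is a simplified stand-in for Argo's retry strategy rather than the actual argocompiler output.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class RetrySettings:
    max_retry_count: int
    # Hypothetical new field: None means "the user did not set it",
    # which a serialized default of 3600s cannot express.
    total_retry_window_seconds: Optional[int] = None


def to_argo_retry_strategy(settings: RetrySettings) -> Dict[str, Any]:
    """Emit maxDuration only when the user explicitly asked for one."""
    strategy: Dict[str, Any] = {"limit": settings.max_retry_count}
    if settings.total_retry_window_seconds is not None:
        strategy["backoff"] = {
            "maxDuration": f"{settings.total_retry_window_seconds}s"
        }
    return strategy


print(to_argo_retry_strategy(RetrySettings(max_retry_count=1)))
print(to_argo_retry_strategy(RetrySettings(1, total_retry_window_seconds=3600)))
```

With an unset field the compiler has nothing to cap, while an explicit 1h is still honored as a real user choice.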

JerT33 · 2026-05-30T20:43:46Z

> @JerT33 Good point, this is a real issue. I think changing current backoff_max_duration is tricky because SDK already writes 3600s into IR by default. So argocompiler cannot distinguish if user really set 1h or if it is just the old default. Maybe we should deprecate this parameter and add a new optional one. Then Argo compiler can just skip setting maxDuration when the new field is not set, and set only if user explicitly asks for it.
>
> @zazulam @droctothorpe hi! could you also take a look? This may be controversial, so I’d like to get a second opinion

@ntny thanks for the feedback!
Yeah this was an option I considered as well. The thought here is that this implementation won't change the current default behavior, but it would allow users to override the 1hr max duration if needed for a longer running component. It seems like this issue should have been present since 2022 when retries were first implemented (#7867), so I wasn't sure how much we would want to change the existing default behavior. Agreed, would love some extra opinions here!


Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR removes the 1-hour cap on backoff_max_duration when serializing retry policies, and updates related documentation and tests to reflect the new behavior.

Changes:

- Stop capping backoff_max_duration at 3600 seconds in retry policy proto generation.
- Update proto/doc comments to remove the “1 hour max” wording.
- Adjust unit tests to expect larger backoff_max_duration values.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file

| File | Description |
| --- | --- |
| sdk/python/kfp/dsl/structures.py | Removes the 3600s cap when converting `backoff_max_duration` to a protobuf `Duration`. |
| sdk/python/kfp/dsl/structures_test.py | Updates retry policy serialization test to expect uncapped `backoff_max_duration`. |
| sdk/python/kfp/dsl/pipeline_task.py | Updates `set_retry` docstring to remove the stated 1-hour maximum. |
| sdk/python/kfp/compiler/compiler_test.py | Updates compiler test expectations for `backoff_max_duration.seconds`. |
| api/v2alpha1/pipeline_spec.proto | Updates spec comments to remove the stated 1-hour max/capping behavior. |

Files not reviewed (1)

api/v2alpha1/go/pipelinespec/pipeline_spec.pb.go: Language not supported

ntny · 2026-05-31T18:50:11Z

> > @JerT33 Good point, this is a real issue. I think changing current backoff_max_duration is tricky because SDK already writes 3600s into IR by default. So argocompiler cannot distinguish if user really set 1h or if it is just the old default. Maybe we should deprecate this parameter and add a new optional one. Then Argo compiler can just skip setting maxDuration when the new field is not set, and set only if user explicitly asks for it.
> > @zazulam @droctothorpe hi! could you also take a look? This may be controversial, so I’d like to get a second opinion
>
> @ntny thanks for the feedback! Yeah this was an option I considered as well. The thought here is that this implementation won't change the current default behavior, but it would allow users to override the 1hr max duration if needed for a longer running component. It seems like this issue should have been present since 2022 when retries were first implemented (#7867), so I wasn't sure how much we would want to change the existing default behavior. Agreed, would love some extra opinions here!

@JerT33 Thanks, that makes sense. I agree that this is the most practical approach if we want to keep the change small and maintain backward compatibility.
The API semantics could probably be revisited separately in the future, but that's a much larger discussion than this PR. I'm happy with this direction.

For a possible follow-up, I could imagine something along these lines:

- Deprecate the current backoff_max_duration parameter. The current name is somewhat misleading because it defines the maximum duration of all retries for a component combined rather than the duration of an individual retry. In addition, because the SDK serializes a default, downstream components cannot distinguish between an explicit user choice and a historical default.
- Introduce a new optional parameter with a name that more clearly describes the total retry window semantics.
- Allow platform administrators to configure a global default retry window at the ML Pipeline API Server deployment level instead of having it effectively defined by the SDK. This would let each deployment choose a default that makes sense for its environment, or disable the default deadline entirely.
- Use the platform default whenever the component-level value is not explicitly specified.
- If a user explicitly configures a retry window for a component, that value should override the platform default.

The main motivation would be to separate explicit user intent from historical SDK defaults and make platform-wide policy easier to evolve over time.
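
The precedence described in this follow-up idea can be written down in a few lines. This is a sketch of the proposal only, not existing KFP behavior; `resolve_retry_window` and its parameters are hypothetical names, and `None` is used here to mean "not set" or "disabled".

```python
from typing import Optional


def resolve_retry_window(
    component_seconds: Optional[int],
    platform_default_seconds: Optional[int],
) -> Optional[int]:
    """Pick the effective total retry window, in seconds.

    An explicit component-level value always wins. Otherwise the platform
    default applies, and None means no retry deadline is set at all.
    """
    if component_seconds is not None:
        return component_seconds
    return platform_default_seconds


print(resolve_retry_window(14400, 3600))  # 14400: explicit user value wins
print(resolve_retry_window(None, 3600))   # 3600: platform default applies
print(resolve_retry_window(None, None))   # None: no deadline configured
```

Keeping "unset" distinct from any numeric default is what lets a platform-wide policy change later without reinterpreting values that users chose explicitly.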

ntny · 2026-05-31T18:50:46Z

@lgtm

droctothorpe · 2026-06-01T00:04:03Z

Thanks for tackling this, @JerT33! And thanks for the review, @ntny!

/approve
/lgtm

google-oss-prow · 2026-06-01T00:04:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: droctothorpe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~api/OWNERS~~ [droctothorpe]
~~sdk/OWNERS~~ [droctothorpe]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow Bot added the do-not-merge/work-in-progress label May 30, 2026

google-oss-prow Bot requested review from droctothorpe and zazulam May 30, 2026 02:21

google-oss-prow Bot added the size/M label May 30, 2026

JerT33 force-pushed the feat/remove_retry_duration_cap branch from fbe1371 to 9a1d8cf Compare May 30, 2026 02:29

google-oss-prow Bot added size/S and removed size/M labels May 30, 2026

JerT33 changed the title ~~remove retry max duration cap~~ fix(sdk): remove retry max duration cap May 30, 2026

JerT33 force-pushed the feat/remove_retry_duration_cap branch from 9a1d8cf to 417adaf Compare May 30, 2026 02:42

JerT33 changed the title ~~fix(sdk): remove retry max duration cap~~ fix(sdk): remove max duration limit on retires May 30, 2026

JerT33 changed the title ~~fix(sdk): remove max duration limit on retires~~ fix(sdk): remove max duration limit on retries May 30, 2026

JerT33 marked this pull request as ready for review May 31, 2026 14:46

Copilot AI review requested due to automatic review settings May 31, 2026 14:46

google-oss-prow Bot removed the do-not-merge/work-in-progress label May 31, 2026

google-oss-prow Bot requested review from VaniHaripriya and mprahl May 31, 2026 14:46

remove retry max duration cap

524ca55

Signed-off-by: JerT33 <trestjeremiah@gmail.com>

remove some verbose comments

Signed-off-by: JerT33 <trestjeremiah@gmail.com>

fix lint

JerT33 force-pushed the feat/remove_retry_duration_cap branch from 417adaf to 524ca55 Compare May 31, 2026 14:47

Copilot AI reviewed May 31, 2026

View reviewed changes

Comment thread sdk/python/kfp/dsl/structures_test.py

google-oss-prow Bot assigned droctothorpe Jun 1, 2026

google-oss-prow Bot added the lgtm label Jun 1, 2026

google-oss-prow Bot added the approved label Jun 1, 2026

google-oss-prow Bot merged commit c3c257d into kubeflow:master Jun 1, 2026
112 of 113 checks passed

JerT33 deleted the feat/remove_retry_duration_cap branch June 1, 2026 01:29
