
fix(sdk): remove max duration limit on retries#13452

Merged
google-oss-prow[bot] merged 1 commit into
kubeflow:masterfrom
JerT33:feat/remove_retry_duration_cap
Jun 1, 2026

Conversation

@JerT33
Contributor

@JerT33 JerT33 commented May 30, 2026

Description of your changes:

Currently there is a hard cap of 1 hour on retries, regardless of user configuration. Importantly, this 1hr limit is cumulative across all retry attempts of the same component (Argo measures it from the first attempt's start). Because of how this interacts with Argo Workflows, once the limit is reached the active pod fails, or, if the first pod ran for more than 1hr before failing, no retry pod is started at all.

This PR is intended as a quick fix. The default behavior is unchanged, but users can now override the 1hr limit to work around this issue for long-running components.

Further discussion/changes can be proposed to fix the issue where this time is taken from the total component time, but that will require much larger configuration changes.

Live Cluster Evidence

Before (2.16.1):

Compiled retryPolicy with backoff_max_duration='4h'

retryPolicy:
  backoffMaxDuration: 3600s   # ← reduced from '4h'
  maxRetryCount: 1

50 minute wait example:

Screenshot 2026-05-29 at 11 58 57 PM

Initial pod start time: 2026-05-30T02:51:40Z:

    state:
      terminated:
        containerID: containerd://62169eb1743a4d8e19fcc9663dfd5253396ff62714f698d64f476de6cfca1919
        exitCode: 1
        finishedAt: "2026-05-30T03:47:29Z"
        reason: Error
        startedAt: "2026-05-30T02:51:40Z"

Pod is retried, but fails prematurely
Retry pod deadline:
4 minutes after initial pod failure (56 mins in)

    state:
      terminated:
        containerID: containerd://75afccb4eeb1ff6b2e3752dfa88398743048c00ac56e92d0beec4e8df9912b12
        exitCode: 143
        finishedAt: "2026-05-30T03:51:12Z"
        reason: Error
        startedAt: "2026-05-30T03:47:45Z"

Failure forced due to argo deadline:

- name: ARGO_DEADLINE
  value: "2026-05-30T03:51:10Z"
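
The timestamps above can be checked with a little arithmetic. The following is a minimal sketch using only the values quoted in this description; it assumes the deadline equals the start of the retry window plus the 3600s `backoffMaxDuration`, and the helper name `parse_utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(ts: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as 2026-05-30T02:51:40Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Values copied from the pod status and ARGO_DEADLINE shown above.
first_start = parse_utc("2026-05-30T02:51:40Z")
first_end = parse_utc("2026-05-30T03:47:29Z")
retry_start = parse_utc("2026-05-30T03:47:45Z")
argo_deadline = parse_utc("2026-05-30T03:51:10Z")

# Assumption: the compiled cap of 3600s is what produced this deadline.
max_duration = timedelta(seconds=3600)

# How long the first attempt ran before failing.
first_attempt_runtime = first_end - first_start

# How much time the retry pod had left before the deadline killed it.
retry_budget = argo_deadline - retry_start

# Where the retry window must have started for the deadline to land here.
implied_window_start = argo_deadline - max_duration

print(first_attempt_runtime, retry_budget, implied_window_start.isoformat())
```

Under that assumption the retry pod had only a few minutes of budget left, and the implied window start falls slightly before the first container's `startedAt`, which would be consistent with the window being measured from the first attempt rather than from the retry.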

>1hr wait example:

Screenshot 2026-05-30 at 12 13 03 AM

Pod fails after more than 1hr of executing; no pods are retried and the workflow fails immediately

After:

Compiled retryPolicy with backoff_max_duration='4h'

retryPolicy:
  backoffMaxDuration: 14400s   # ← '4h' preserved
  maxRetryCount: 1
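
The difference between the two compiled outputs can be illustrated with a small stand-alone sketch. This is not the SDK's actual code: `duration_to_seconds`, `serialize_before`, and `serialize_after` are hypothetical names, and the parser only handles simple strings such as `'30m'` or `'4h'`.

```python
import re

# Seconds per unit for the simplified duration strings handled here.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

# The 1-hour cap that was applied before this change.
OLD_CAP_SECONDS = 3600


def duration_to_seconds(text: str) -> int:
    """Hypothetical parser for strings such as '30s', '10m', or '4h'."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def serialize_before(backoff_max_duration: str) -> str:
    """Old behavior: anything above one hour is silently reduced to 3600s."""
    seconds = min(duration_to_seconds(backoff_max_duration), OLD_CAP_SECONDS)
    return f"{seconds}s"


def serialize_after(backoff_max_duration: str) -> str:
    """New behavior: the user's value is passed through unchanged."""
    return f"{duration_to_seconds(backoff_max_duration)}s"


print(serialize_before("4h"), serialize_after("4h"))
```

Values at or below one hour serialize identically in both versions, which is why the default behavior stays the same.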

50 minute wait example:

Screenshot 2026-05-31 at 10 34 19 AM

Initial pod fails after 1hr, retry pod is spun up as expected
argo deadline on retry pod (4hrs after initial execution time):

    - name: ARGO_DEADLINE
      value: "2026-05-31T17:20:36Z"

61 minute wait example:

Screenshot 2026-05-31 at 10 45 14 AM

(initial pod ran past 1hr threshold, failed, pod was still retried due to the 4hr limit increase)
argo deadline on retry pod (4hrs after initial execution time):

    - name: ARGO_DEADLINE
      value: "2026-05-31T17:20:59Z"

Checklist:

argoWF when my pod attempts to retry after an hour:
Screenshot 2026-05-31 at 11 00 31 AM

@google-oss-prow

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@google-oss-prow google-oss-prow Bot requested review from droctothorpe and zazulam May 30, 2026 02:21
@JerT33 JerT33 force-pushed the feat/remove_retry_duration_cap branch from fbe1371 to 9a1d8cf Compare May 30, 2026 02:29
@google-oss-prow google-oss-prow Bot added size/S and removed size/M labels May 30, 2026
@JerT33 JerT33 changed the title remove retry max duration cap fix(sdk): remove retry max duration cap May 30, 2026
@JerT33 JerT33 force-pushed the feat/remove_retry_duration_cap branch from 9a1d8cf to 417adaf Compare May 30, 2026 02:42
@JerT33 JerT33 changed the title fix(sdk): remove retry max duration cap fix(sdk): remove max duration limit on retires May 30, 2026
@JerT33 JerT33 changed the title fix(sdk): remove max duration limit on retires fix(sdk): remove max duration limit on retries May 30, 2026
@ntny
Contributor

ntny commented May 30, 2026

@JerT33 Good point, this is a real issue.
I think changing the current backoff_max_duration is tricky because the SDK already writes 3600s into the IR by default, so argocompiler cannot tell whether the user really set 1h or whether it is just the old default.
Maybe we should deprecate this parameter and add a new optional one. The Argo compiler could then skip setting maxDuration when the new field is not set, and set it only if the user explicitly asks for it.
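
To make the ambiguity concrete, here is a minimal sketch. The class and function names (`AlwaysSerialized`, `OptionalField`, `user_set_it`) are hypothetical and only model the idea; they are not the real IR or compiler types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlwaysSerialized:
    """Models a field that is always written with a 3600s default."""
    max_duration_seconds: int = 3600


@dataclass
class OptionalField:
    """Models a new optional field that stays unset unless the user sets it."""
    max_duration_seconds: Optional[int] = None


def user_set_it(policy: OptionalField) -> bool:
    """With an optional field, presence itself carries the user's intent."""
    return policy.max_duration_seconds is not None


# An untouched default and an explicit 1h are indistinguishable downstream.
print(AlwaysSerialized() == AlwaysSerialized(3600))

# With an optional field the two cases can be told apart.
print(user_set_it(OptionalField()), user_set_it(OptionalField(3600)))
```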

@zazulam @droctothorpe hi! could you also take a look? This may be controversial, so I’d like to get a second opinion

@JerT33
Contributor Author

JerT33 commented May 30, 2026

@ntny thanks for the feedback!
Yeah, this was an option I considered as well. The thought here is that this implementation won't change the current default behavior, but it will let users override the 1hr max duration when needed for a longer-running component. This issue has likely been present since 2022, when retries were first implemented (#7867), so I wasn't sure how much we would want to change the existing default behavior. Agreed, would love some extra opinions here!

@JerT33 JerT33 marked this pull request as ready for review May 31, 2026 14:46
Copilot AI review requested due to automatic review settings May 31, 2026 14:46
@google-oss-prow google-oss-prow Bot requested review from VaniHaripriya and mprahl May 31, 2026 14:46
Signed-off-by: JerT33 <trestjeremiah@gmail.com>

remove some verbose comments

Signed-off-by: JerT33 <trestjeremiah@gmail.com>

fix lint
@JerT33 JerT33 force-pushed the feat/remove_retry_duration_cap branch from 417adaf to 524ca55 Compare May 31, 2026 14:47
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR removes the 1-hour cap on backoff_max_duration when serializing retry policies, and updates related documentation and tests to reflect the new behavior.

Changes:

  • Stop capping backoff_max_duration at 3600 seconds in retry policy proto generation.
  • Update proto/doc comments to remove the “1 hour max” wording.
  • Adjust unit tests to expect larger backoff_max_duration values.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| sdk/python/kfp/dsl/structures.py | Removes the 3600s cap when converting backoff_max_duration to a protobuf Duration. |
| sdk/python/kfp/dsl/structures_test.py | Updates retry policy serialization test to expect uncapped backoff_max_duration. |
| sdk/python/kfp/dsl/pipeline_task.py | Updates set_retry docstring to remove the stated 1-hour maximum. |
| sdk/python/kfp/compiler/compiler_test.py | Updates compiler test expectations for backoff_max_duration.seconds. |
| api/v2alpha1/pipeline_spec.proto | Updates spec comments to remove the stated 1-hour max/capping behavior. |
Files not reviewed (1)
  • api/v2alpha1/go/pipelinespec/pipeline_spec.pb.go: Language not supported

Comment thread sdk/python/kfp/dsl/structures_test.py
@ntny
Contributor

ntny commented May 31, 2026

@JerT33 Thanks, that makes sense. I agree that this is the most practical approach if we want to keep the change small and maintain backward compatibility.
The API semantics could probably be revisited separately in the future, but that's a much larger discussion than this PR. I'm happy with this direction.

For a possible follow-up, I could imagine something along these lines:

  • Deprecate the current backoff_max_duration parameter. The current name is somewhat misleading because it defines the maximum duration of all retries for a component combined rather than the duration of an individual retry. In addition, because the SDK serializes a default, downstream components cannot distinguish between an explicit user choice and a historical default.
  • Introduce a new optional parameter with a name that more clearly describes the total retry window semantics.
  • Allow platform administrators to configure a global default retry window at the ML Pipeline API Server deployment level instead of having it effectively defined by the SDK. This would allow each deployment to choose a default that makes sense for its environment, or to disable the default deadline entirely.
  • Use the platform default whenever the component-level value is not explicitly specified.
  • If a user explicitly configures a retry window for a component, that value should override the platform default.
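
The precedence described in the last two bullets could look roughly like the sketch below. It is only an illustration of the proposal: `resolve_retry_window` and its parameters are hypothetical, and `None` is assumed to mean "not set" (or, for the platform default, "no deadline").

```python
from typing import Optional


def resolve_retry_window(
    component_seconds: Optional[int],
    platform_default_seconds: Optional[int],
) -> Optional[int]:
    """Pick the total retry window for a component.

    An explicit component-level value wins. Otherwise the platform default
    applies. A result of None means no deadline is set at all.
    """
    if component_seconds is not None:
        return component_seconds
    return platform_default_seconds


# Explicit user value overrides the platform default.
print(resolve_retry_window(14400, 3600))
# No user value: fall back to the platform default.
print(resolve_retry_window(None, 3600))
# No user value and the default deadline disabled by the administrator.
print(resolve_retry_window(None, None))
```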

The main motivation would be to separate explicit user intent from historical SDK defaults and make platform-wide policy easier to evolve over time.

@ntny
Contributor

ntny commented May 31, 2026

@lgtm

@droctothorpe
Collaborator

Thanks for tackling this, @JerT33! And thanks for the review, @ntny!

/approve
/lgtm

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: droctothorpe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot merged commit c3c257d into kubeflow:master Jun 1, 2026
112 of 113 checks passed
@JerT33 JerT33 deleted the feat/remove_retry_duration_cap branch June 1, 2026 01:29