fix(sdk): remove max duration limit on retries#13452
Conversation
|
Skipping CI for Draft Pull Request. |
fbe1371 to
9a1d8cf
Compare
9a1d8cf to
417adaf
Compare
|
@JerT33 Good point, this is a real issue. @zazulam @droctothorpe hi! could you also take a look? This may be controversial, so I’d like to get a second opinion |
@ntny thanks for the feedback! |
Signed-off-by: JerT33 <trestjeremiah@gmail.com> remove some verbose comments Signed-off-by: JerT33 <trestjeremiah@gmail.com> fix lint
417adaf to
524ca55
Compare
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR removes the 1-hour cap on backoff_max_duration when serializing retry policies, and updates related documentation and tests to reflect the new behavior.
Changes:
- Stop capping
backoff_max_durationat 3600 seconds in retry policy proto generation. - Update proto/doc comments to remove the “1 hour max” wording.
- Adjust unit tests to expect larger
backoff_max_durationvalues.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| sdk/python/kfp/dsl/structures.py | Removes the 3600s cap when converting backoff_max_duration to a protobuf Duration. |
| sdk/python/kfp/dsl/structures_test.py | Updates retry policy serialization test to expect uncapped backoff_max_duration. |
| sdk/python/kfp/dsl/pipeline_task.py | Updates set_retry docstring to remove the stated 1-hour maximum. |
| sdk/python/kfp/compiler/compiler_test.py | Updates compiler test expectations for backoff_max_duration.seconds. |
| api/v2alpha1/pipeline_spec.proto | Updates spec comments to remove the stated 1-hour max/capping behavior. |
Files not reviewed (1)
- api/v2alpha1/go/pipelinespec/pipeline_spec.pb.go: Language not supported
@JerT33 Thanks, that makes sense. I agree that this is the most practical approach if we want to keep the change small and maintain backward compatibility. For a possible follow-up, I could imagine something along these lines:
The main motivation would be to separate explicit user intent from historical SDK defaults and make platform-wide policy easier to evolve over time. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: droctothorpe The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Description of your changes:
Currently there is a hard cap of 1 hour on retries, regardless of user configuration. Importantly, this 1hr limit is cumulative across all retry attempts of the same component (Argo measures it from the first attempt's start). Due to the interaction with Argo Workflows, once this limit is reached the active pod will fail or no pod will be retried (if pod ran over 1hr before failing).
This PR is intended to be a quick fix. The default behavior is all still the same, but now users can override it to get around this issue for long-running components.
Further discussion/changes can be proposed to fix the issue where this time is taken from the total component time, but that will require much larger configuration changes.
Live Cluster Evidence
Before (2.16.1):
Compiled retryPolicy with
backoff_max_duration='4h'50 minute wait example:
Initial pod start time:
2026-05-30T02:51:40Z:Pod is retried, but fails prematurely
Retry pod deadline:
4 minutes after initial pod failure (56 mins in)
Failure forced due to argo deadline:
>1hr wait example:
Pod fails after 1 of executing, no pods are retried, the workflow fails immediately
After:
Compiled retryPolicy with
backoff_max_duration='4h'50 minute wait example:
Initial pod fails after 1hr, retry pod is spun up as expected
argo deadline on retry pod (4hrs after initial execution time):
61 minute wait example:
Checklist:
argoWF when my pod attempts to retry after an hour:
