fix(backend): fix recurring runs with the latest pipeline version#13440
fix(backend): fix recurring runs with the latest pipeline version#13440jaewak wants to merge 6 commits into
Conversation
|
Hi @jaewak. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds lightweight validation for V2 pipeline job inputs and introduces idempotent handling for recurring-run-triggered run creation to avoid duplicate run submissions.
Changes:
- Add
V2Spec.ValidateJobInputsto validate runtime parameters without full workflow compilation. - Add a RunStore query method to detect existing runs by recurring run id + display name and use it for idempotent
CreateRun. - Add/extend tests covering input validation and recurring-run idempotency behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| backend/src/apiserver/template/v2_template.go | Adds ValidateJobInputs helper for V2 runtime parameter validation. |
| backend/src/apiserver/template/template_test.go | Adds unit tests for ValidateJobInputs. |
| backend/src/apiserver/storage/run_store.go | Adds DB lookup for existing run UUID by recurring run id + display name. |
| backend/src/apiserver/storage/run_store_test.go | Adds test for the new RunStore lookup method. |
| backend/src/apiserver/resource/resource_manager.go | Uses the lookup for idempotent recurring-run run creation; adds canonical label setting; updates job validation path. |
| backend/src/apiserver/resource/resource_manager_test.go | Adds test ensuring recurring-run duplicates return existing run and do not submit a new workflow. |
|
This is a great find! Thank you for taking the time to deeply debug this and lay it out clearly for us. I am surprised that the second bug where duplicate runs are created is due to replicas. I would think that the scheduled workflow API would be stateless and thus the number of replicas wouldn't impact the number of runs created. |
Thanks for the review! Looks like the controller is not stateless. Before creating a workflow, the controller checks the informers to see if the run has already been created (code path here) |
Signed-off-by: jaewak <jaewan.0907@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: jaewak <82007787+jaewak@users.noreply.github.com>
a5203d5 to
9ee6ad1
Compare
|
/ok-to-test |
|
@jaewak this is great, thank you so much for tackling it. Can you make sure to run the pre-commit hooks locally? The pre-commit CI stage is failing. |
There was a problem hiding this comment.
PR Overview
This fixes the nil dereference and adds good coverage, but the new recurring-run idempotency check is still racy under multi-replica scheduling.
Blocking feedback
- backend/src/apiserver/resource/resource_manager.go:642
GetRunByRecurringRunIdAndDisplayNameis only a preflight read, so two controller replicas can still both observe "no matching run" and continue.- Both requests then go on to create a Kubernetes workflow before the DB insert happens.
run_detailsonly has a primary key onUUID; there is no uniqueness on(JobUUID, DisplayName)to collapse that race into a safe no-op.- In that case the patch still allows duplicate or orphaned scheduled runs, which is the exact failure mode this change is trying to eliminate.
I think this needs a DB-backed idempotency mechanism (for example a unique key plus insert-on-conflict handling, or another lock around the create path) rather than a best-effort read-before-create.
^ Feedback from GPT-5.4
Signed-off-by: jaewak <jaewan.0907@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
…nflict Signed-off-by: jaewak <jaewan.0907@gmail.com>
Good call. I tried the unique index on (JobUUID, DisplayName) first but it gets messy since non-recurring runs all store So I leaned on the PK we already have instead. Recurring-run triggers now get a deterministic UUID (UUIDv5 of One thing to flag: the workflow is still created (with |
Signed-off-by: jaewak <jaewan.0907@gmail.com>
Description of your changes:
Fixes two bugs in recurring runs when "Always use latest pipeline version" is enabled (pipeline version ID is empty).
Bug 1: API server panic on recurring run creation
What we observed
When a user creates a recurring run in the UI with "Always use latest pipeline version" enabled, the API server returns an internal error. The run is never created. In the API server logs, we saw a nil pointer dereference panic originating from the Argo workflow compilation path.
This only happens when the pipeline version ID is left empty (the "always latest" path). Recurring runs with a specific pinned version work fine.
Why it happens
In
ResourceManager.CreateJob(), there's a validation step that fetches the latest pipeline version's template and callstmpl.ScheduledWorkflow(job)to verify that the user-provided runtime parameters are compatible with the pipeline's inputs.This compilation assumes various fields are populated (e.g., pipeline version metadata, image URIs from the specific version). When no version is pinned, these fields are nil, causing the panic.
Additionally, after this validation block, the code checks
tmpl.GetTemplateType() == template.V1without guarding againsttmplbeing nil, which is another crash path when the "always latest" code skips template fetching.How the fix works
New
ValidateJobInputs()method onV2Spec(v2_template.go): Validates that the job's runtime parameters are compatible with the pipeline spec'sInputDefinitions— the same validation thatScheduledWorkflow()does internally — but without performing the full Argo workflow compilation. It only converts the runtime config and callsvalidatePipelineJobInputs().Type assertion in
CreateJob(resource_manager.go): Instead of unconditionally callingtmpl.ScheduledWorkflow(job), we check if the template is a*template.V2Spec. If so, call the lightweightValidateJobInputs(). Otherwise, fall back toScheduledWorkflow()for V1 templates (which don't have this issue).Nil guard (
resource_manager.go): Addtmpl != nil &&before the V1 pipeline block check to prevent a nil dereference whentmplwasn't fetched.Bug 2: Run flood — multiple duplicate runs per trigger interval
What we observed
After fixing Bug 1 and successfully creating a recurring run with "always use latest version," we observed that each trigger interval (e.g., every 1 minute) was creating 8-10+ duplicate runs instead of just 1. The namespace quickly fills up with duplicate workflows.
This does NOT happen when a specific pipeline version is pinned. With a pinned version, exactly 1 run is created per interval.
Why it happens
The scheduled-workflow-controller is deployed with multiple replicas (2 in our cluster). When a trigger fires, the reconciliation loop runs on each replica concurrently. There's an existing idempotency mechanism:
This checks whether a workflow with the deterministic name (e.g.,
runofhello-world-abc123-1-2873143499) already exists. If it does, the controller skips submission.The problem: When using "always use latest version," the controller doesn't create the workflow directly — it calls the API server's
CreateRungRPC endpoint. The API server creates the Argo workflow with a name derived from the pipeline's display name (e.g.,echo-xxxxx), NOT the deterministic name the controller expects. So the controller'sGetcheck always returns "not found," and every replica proceeds to callCreateRun.With 2 controller replicas, each resyncing every 10 seconds, this creates 8-10+ runs per minute interval.
Secondary problem: Even if we prevent duplicate runs, the workflows created via the
CreateRungRPC path were missing the canonical labels that the controller needs:[scheduledworkflows.kubeflow.org/scheduledWorkflowName](http://scheduledworkflows.kubeflow.org/scheduledWorkflowName)[scheduledworkflows.kubeflow.org/workflowIndex](http://scheduledworkflows.kubeflow.org/workflowIndex)[scheduledworkflows.kubeflow.org/workflowEpoch](http://scheduledworkflows.kubeflow.org/workflowEpoch)[scheduledworkflows.kubeflow.org/isOwnedByScheduledWorkflow](http://scheduledworkflows.kubeflow.org/isOwnedByScheduledWorkflow)Without these labels, the controller's
workflowClient.List()(which uses label selectors) never finds the workflow. It can't detect it as active or completed, so the SWF status never advances —nextTriggeredTimeandworkflowHistorystay frozen, and the controller retries indefinitely.How the fix works
Server-side idempotency check (
resource_manager.go+run_store.go): At the very top ofResourceManager.CreateRun(), before any workflow creation or template fetching, we check: "does a run with thisRecurringRunId+DisplayNamealready exist in the DB?" If yes, return it immediately.This works because:
DisplayName(swf.NextResourceName()) for a given triggerNew method added to
RunStoreInterface:This does a simple
SELECT UUID FROM run_details WHERE JobUUID = ? AND DisplayName = ? LIMIT 1.Canonical labels (
resource_manager.go): After settingOwnerReferenceson the workflow, also callexecutionSpec.SetCannonicalLabels(swf.Name, epoch, nextIndex). This sets all four labels the controller needs to track the workflow via its label-basedListqueries.The index is computed from
swf.Status.Trigger.LastIndex + 1, matching what the controller would set if it were creating the workflow directly.Testing
Tested on a live multi-replica deployment (3 API server pods, 2 scheduled-workflow-controller pods) on EKS:
Bug 1 verification:
Bug 2 verification:
nextTriggeredTime,workflowHistory, andlastIndexall update as expectedChecklist: