fix(backend): fix recurring runs with the latest pipeline version by jaewak · Pull Request #13440 · kubeflow/pipelines

jaewak · 2026-05-27T21:07:53Z

Description of your changes:

Fixes two bugs in recurring runs when "Always use latest pipeline version" is enabled (pipeline version ID is empty).

Bug 1: API server panic on recurring run creation

What we observed

When a user creates a recurring run in the UI with "Always use latest pipeline version" enabled, the API server returns an internal error. The run is never created. In the API server logs, we saw a nil pointer dereference panic originating from the Argo workflow compilation path.

This only happens when the pipeline version ID is left empty (the "always latest" path). Recurring runs with a specific pinned version work fine.

Why it happens

In ResourceManager.CreateJob(), there's a validation step that fetches the latest pipeline version's template and calls tmpl.ScheduledWorkflow(job) to verify that the user-provided runtime parameters are compatible with the pipeline's inputs.

This compilation assumes various fields are populated (e.g., pipeline version metadata, image URIs from the specific version). When no version is pinned, these fields are nil, causing the panic.

Additionally, after this validation block, the code checks tmpl.GetTemplateType() == template.V1 without guarding against tmpl being nil, which is another crash path when the "always latest" code skips template fetching.

How the fix works

- New ValidateJobInputs() method on V2Spec (v2_template.go): Validates that the job's runtime parameters are compatible with the pipeline spec's InputDefinitions. This is the same validation that ScheduledWorkflow() does internally, but without performing the full Argo workflow compilation. It only converts the runtime config and calls validatePipelineJobInputs().
- Type assertion in CreateJob (resource_manager.go): Instead of unconditionally calling tmpl.ScheduledWorkflow(job), we check if the template is a *template.V2Spec. If so, call the lightweight ValidateJobInputs(). Otherwise, fall back to ScheduledWorkflow() for V1 templates (which don't have this issue).
- Nil guard (resource_manager.go): Add tmpl != nil && before the V1 pipeline block check to prevent a nil dereference when tmpl wasn't fetched.
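
The dispatch described above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: the `Template` interface, the `Inputs` field, the parameter maps, and `validateJob` are hypothetical stand-ins, and the simulated compilation failure only mimics the "no pinned version" condition.

```go
package main

import (
	"errors"
	"fmt"
)

// Template is a hypothetical stand-in for the API server's template interface.
type Template interface {
	// ScheduledWorkflow stands in for the full compilation path.
	ScheduledWorkflow(params map[string]string) error
}

// V2Spec is a hypothetical stand-in for template.V2Spec.
type V2Spec struct {
	Inputs map[string]bool // declared input parameter names (assumed shape)
}

// ScheduledWorkflow simulates full compilation failing when no version is pinned.
func (s *V2Spec) ScheduledWorkflow(params map[string]string) error {
	return errors.New("full compilation requires a pinned pipeline version")
}

// ValidateJobInputs checks parameters against declared inputs without compiling.
func (s *V2Spec) ValidateJobInputs(params map[string]string) error {
	for name := range params {
		if !s.Inputs[name] {
			return fmt.Errorf("unknown input parameter %q", name)
		}
	}
	return nil
}

// V1Spec is a hypothetical V1 template whose compilation path is unaffected.
type V1Spec struct{}

func (s *V1Spec) ScheduledWorkflow(params map[string]string) error { return nil }

// validateJob mirrors the three pieces of the fix: nil guard, type assertion
// to the lightweight V2 validation, and fallback to the original path.
func validateJob(tmpl Template, params map[string]string) error {
	if tmpl == nil {
		return nil // nothing was fetched, so there is nothing to validate against
	}
	if v2, ok := tmpl.(*V2Spec); ok {
		return v2.ValidateJobInputs(params)
	}
	return tmpl.ScheduledWorkflow(params)
}

func main() {
	v2 := &V2Spec{Inputs: map[string]bool{"message": true}}
	fmt.Println(validateJob(v2, map[string]string{"message": "hi"}))
	fmt.Println(validateJob(v2, map[string]string{"typo": "hi"}))
	fmt.Println(validateJob(&V1Spec{}, nil))
	fmt.Println(validateJob(nil, nil))
}
```

The type assertion keeps the V1 path untouched, so only V2 templates switch to the lighter validation.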

Bug 2: Run flood — multiple duplicate runs per trigger interval

What we observed

After fixing Bug 1 and successfully creating a recurring run with "always use latest version," we observed that each trigger interval (e.g., every 1 minute) was creating 8-10+ duplicate runs instead of just 1. The namespace quickly fills up with duplicate workflows.

This does NOT happen when a specific pipeline version is pinned. With a pinned version, exactly 1 run is created per interval.

Why it happens

The scheduled-workflow-controller is deployed with multiple replicas (2 in our cluster). When a trigger fires, the reconciliation loop runs on each replica concurrently. There's an existing idempotency mechanism:

// controller.go - submitNewWorkflowIfNotAlreadySubmitted()
_, isNotFoundError, err := c.workflowClient.Get(swf.Namespace, workflowName)
if err == nil {
    // Already exists, nothing to do
    return true, workflowName, nil
}

This checks whether a workflow with the deterministic name (e.g., runofhello-world-abc123-1-2873143499) already exists. If it does, the controller skips submission.

The problem: When using "always use latest version," the controller doesn't create the workflow directly — it calls the API server's CreateRun gRPC endpoint. The API server creates the Argo workflow with a name derived from the pipeline's display name (e.g., echo-xxxxx), NOT the deterministic name the controller expects. So the controller's Get check always returns "not found," and every replica proceeds to call CreateRun.

With 2 controller replicas, each resyncing every 10 seconds, this creates 8-10+ runs per minute interval.
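
A toy model of the name mismatch may help. Everything here is hypothetical (the `cluster` type, the suffix format, and the assumption that the trigger stays pending for six resyncs per replica); it only illustrates why a lookup by deterministic name never short-circuits when the server picks a different name.

```go
package main

import "fmt"

// cluster is a hypothetical in-memory set of workflow names.
type cluster struct {
	workflows map[string]bool
	seq       int
}

// createRun mimics the API server path: the workflow name is derived from the
// pipeline display name plus a generated suffix, not from the trigger.
func (c *cluster) createRun(pipelineName string) string {
	c.seq++
	name := fmt.Sprintf("%s-%05d", pipelineName, c.seq)
	c.workflows[name] = true
	return name
}

// reconcile mimics one controller resync for a pending trigger. It looks up
// the deterministic name and submits only if that name is missing.
func (c *cluster) reconcile(deterministicName, pipelineName string) bool {
	if c.workflows[deterministicName] {
		return false // already exists, nothing to do
	}
	c.createRun(pipelineName)
	return true
}

// countSubmissions counts how many runs are submitted for a single trigger
// when every replica resyncs the given number of times.
func countSubmissions(replicas, resyncs int) int {
	c := &cluster{workflows: map[string]bool{}}
	submitted := 0
	for r := 0; r < resyncs; r++ {
		for i := 0; i < replicas; i++ {
			if c.reconcile("runofhello-world-abc123-1-2873143499", "echo") {
				submitted++
			}
		}
	}
	return submitted
}

func main() {
	// 2 replicas, one resync every 10 seconds, over a 1-minute window.
	fmt.Println(countSubmissions(2, 6))
}
```

Under these assumptions every resync submits, so 2 replicas times 6 resyncs gives 12 as an upper bound; the observed count depends on timing.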

Secondary problem: Even if we prevent duplicate runs, the workflows created via the CreateRun gRPC path were missing the canonical labels that the controller needs:

- scheduledworkflows.kubeflow.org/scheduledWorkflowName
- scheduledworkflows.kubeflow.org/workflowIndex
- scheduledworkflows.kubeflow.org/workflowEpoch
- scheduledworkflows.kubeflow.org/isOwnedByScheduledWorkflow

Without these labels, the controller's workflowClient.List() (which uses label selectors) never finds the workflow. It can't detect it as active or completed, so the SWF status never advances — nextTriggeredTime and workflowHistory stay frozen, and the controller retries indefinitely.
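
To illustrate why an unlabeled workflow is invisible to a label-based list, here is a minimal sketch of equality-based selector matching. The `workflow` type, `matches`, and `listOwned` are hypothetical helpers, and the exact selector the controller uses is an assumption; only the label keys come from the description above.

```go
package main

import "fmt"

const (
	labelSWFName = "scheduledworkflows.kubeflow.org/scheduledWorkflowName"
	labelOwned   = "scheduledworkflows.kubeflow.org/isOwnedByScheduledWorkflow"
)

// workflow is a hypothetical, stripped-down workflow object.
type workflow struct {
	Name   string
	Labels map[string]string
}

// matches reports whether every key/value pair in selector is present in labels.
func matches(labels, selector map[string]string) bool {
	for k, want := range selector {
		got, ok := labels[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// listOwned returns the names of workflows that carry the labels tying them
// to the given ScheduledWorkflow (assumed selector shape).
func listOwned(all []workflow, swfName string) []string {
	selector := map[string]string{labelSWFName: swfName, labelOwned: "true"}
	var names []string
	for _, wf := range all {
		if matches(wf.Labels, selector) {
			names = append(names, wf.Name)
		}
	}
	return names
}

func main() {
	all := []workflow{
		{Name: "echo-xxxxx"}, // created without canonical labels
		{Name: "echo-yyyyy", Labels: map[string]string{labelSWFName: "swf-a", labelOwned: "true"}},
	}
	fmt.Println(listOwned(all, "swf-a"))
}
```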

How the fix works

Server-side idempotency check (resource_manager.go + run_store.go): At the very top of ResourceManager.CreateRun(), before any workflow creation or template fetching, we check: "does a run with this RecurringRunId + DisplayName already exist in the DB?" If yes, return it immediately.

This works because:
- The controller always passes the same deterministic DisplayName (swf.NextResourceName()) for a given trigger
- The DB is the single source of truth — no race window between replicas
- The first replica to reach the DB insert wins; subsequent replicas get the existing run back

New method added to RunStoreInterface:
```
GetRunByRecurringRunIdAndDisplayName(recurringRunId, displayName string) (string, error)
```
This does a simple `SELECT UUID FROM run_details WHERE JobUUID = ? AND DisplayName = ? LIMIT 1`.

Canonical labels (resource_manager.go): After setting OwnerReferences on the workflow, also call executionSpec.SetCannonicalLabels(swf.Name, epoch, nextIndex). This sets all four labels the controller needs to track the workflow via its label-based List queries.

The index is computed from swf.Status.Trigger.LastIndex + 1, matching what the controller would set if it were creating the workflow directly.
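
As a rough sketch of the lookup-before-create flow and the index computation, assuming an in-memory store in place of the real database. The `run` and `runStore` types, the `run-N` IDs, and skipping the lookup for runs without a recurring run ID are all assumptions made for illustration. The sketch is sequential and does not model concurrent callers.

```go
package main

import "fmt"

// run is a hypothetical row in run_details.
type run struct {
	UUID           string
	RecurringRunID string
	DisplayName    string
}

// runStore is a hypothetical in-memory stand-in for the run store.
type runStore struct {
	runs []run
}

// findRun plays the role of GetRunByRecurringRunIdAndDisplayName.
func (s *runStore) findRun(recurringRunID, displayName string) (string, bool) {
	for _, r := range s.runs {
		if r.RecurringRunID == recurringRunID && r.DisplayName == displayName {
			return r.UUID, true
		}
	}
	return "", false
}

// createRun returns the existing run for a recurring trigger if one is
// already stored; otherwise it inserts a new row.
func (s *runStore) createRun(recurringRunID, displayName string) (string, bool) {
	if recurringRunID != "" { // assumption: manual runs skip the lookup
		if id, ok := s.findRun(recurringRunID, displayName); ok {
			return id, false
		}
	}
	id := fmt.Sprintf("run-%d", len(s.runs)+1)
	s.runs = append(s.runs, run{UUID: id, RecurringRunID: recurringRunID, DisplayName: displayName})
	return id, true
}

// nextIndex mirrors "swf.Status.Trigger.LastIndex + 1" from the description.
func nextIndex(lastIndex int64) int64 { return lastIndex + 1 }

func main() {
	s := &runStore{}
	first, created := s.createRun("job-1", "runofhello-world-abc123-1-2873143499")
	fmt.Println(first, created)
	again, created := s.createRun("job-1", "runofhello-world-abc123-1-2873143499")
	fmt.Println(again, created)
	fmt.Println(nextIndex(1))
}
```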

Testing

Tested on a live multi-replica deployment (3 API server pods, 2 scheduled-workflow-controller pods) on EKS:

Bug 1 verification:

- Recurring runs with "always use latest version" can now be created without error
- The recurring run appears in the UI and the ScheduledWorkflow CRD is created successfully

Bug 2 verification:

- Each trigger interval produces exactly 1 run (previously 8-10+)
- Controller logs show "successfully submitted" on each replica, but no duplicate workflows are created (idempotency check returns existing run)

Workflows have correct canonical labels:

scheduledworkflows.kubeflow.org/isOwnedByScheduledWorkflow: "true"
scheduledworkflows.kubeflow.org/scheduledWorkflowName: "runofhello-world-internal25zgz"
scheduledworkflows.kubeflow.org/workflowIndex: "2"
scheduledworkflows.kubeflow.org/workflowEpoch: "1779896834"

- SWF status advances correctly: nextTriggeredTime, workflowHistory, and lastIndex all update as expected
- Confirmed stable over extended period (30+ trigger intervals with zero duplicates)

Checklist:

You have signed off your commits
The title for your pull request (PR) should follow our title convention. Learn more about the pull request title convention used in this repository.

google-oss-prow · 2026-05-27T21:08:05Z

Hi @jaewak. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds lightweight validation for V2 pipeline job inputs and introduces idempotent handling for recurring-run-triggered run creation to avoid duplicate run submissions.

Changes:

Add V2Spec.ValidateJobInputs to validate runtime parameters without full workflow compilation.
Add a RunStore query method to detect existing runs by recurring run id + display name and use it for idempotent CreateRun.
Add/extend tests covering input validation and recurring-run idempotency behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.


| File | Description |
| --- | --- |
| backend/src/apiserver/template/v2_template.go | Adds `ValidateJobInputs` helper for V2 runtime parameter validation. |
| backend/src/apiserver/template/template_test.go | Adds unit tests for `ValidateJobInputs`. |
| backend/src/apiserver/storage/run_store.go | Adds DB lookup for existing run UUID by recurring run id + display name. |
| backend/src/apiserver/storage/run_store_test.go | Adds test for the new RunStore lookup method. |
| backend/src/apiserver/resource/resource_manager.go | Uses the lookup for idempotent recurring-run run creation; adds canonical label setting; updates job validation path. |
| backend/src/apiserver/resource/resource_manager_test.go | Adds test ensuring recurring-run duplicates return existing run and do not submit a new workflow. |

cbartram · 2026-05-29T15:36:32Z

This is a great find! Thank you for taking the time to deeply debug this and lay it out clearly for us. I am surprised that the second bug where duplicate runs are created is due to replicas.

I would think that the scheduled workflow API would be stateless and thus the number of replicas wouldn't impact the number of runs created.

jaewak · 2026-05-29T16:07:25Z

> This is a great find! Thank you for taking the time to deeply debug this and lay it out clearly for us. I am surprised that the second bug where duplicate runs are created is due to replicas.
>
> I would think that the scheduled workflow API would be stateless and thus the number of replicas wouldn't impact the number of runs created.

Thanks for the review! Looks like the controller is not stateless. Before creating a workflow, the controller checks the informers to see if the run has already been created (code path here)


droctothorpe · 2026-05-29T19:42:02Z

/ok-to-test

droctothorpe · 2026-05-29T19:46:42Z

@jaewak this is great, thank you so much for tackling it. Can you make sure to run the pre-commit hooks locally? The pre-commit CI stage is failing.

droctothorpe

PR Overview

This fixes the nil dereference and adds good coverage, but the new recurring-run idempotency check is still racy under multi-replica scheduling.

Blocking feedback

backend/src/apiserver/resource/resource_manager.go:642
- GetRunByRecurringRunIdAndDisplayName is only a preflight read, so two controller replicas can still both observe "no matching run" and continue.
- Both requests then go on to create a Kubernetes workflow before the DB insert happens.
- run_details only has a primary key on UUID; there is no uniqueness on (JobUUID, DisplayName) to collapse that race into a safe no-op.
- In that case the patch still allows duplicate or orphaned scheduled runs, which is the exact failure mode this change is trying to eliminate.

I think this needs a DB-backed idempotency mechanism (for example a unique key plus insert-on-conflict handling, or another lock around the create path) rather than a best-effort read-before-create.

^ Feedback from GPT-5.4
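
The race described in this review can be reproduced deterministically by interleaving the two requests by hand. This is a minimal sketch with hypothetical types (`row`, `table`) and no real database; it only shows that a read-before-create check without a uniqueness constraint lets both requests insert.

```go
package main

import "fmt"

// row is a hypothetical run_details row identified only by these two columns.
type row struct {
	JobUUID     string
	DisplayName string
}

// table is a hypothetical table with no uniqueness on (JobUUID, DisplayName).
type table struct {
	rows []row
}

// exists is the preflight read.
func (t *table) exists(job, name string) bool {
	for _, r := range t.rows {
		if r.JobUUID == job && r.DisplayName == name {
			return true
		}
	}
	return false
}

// insert always succeeds because nothing enforces uniqueness.
func (t *table) insert(job, name string) {
	t.rows = append(t.rows, row{JobUUID: job, DisplayName: name})
}

// racyCreates interleaves two requests so that both reads happen before
// either insert, then returns the resulting number of rows.
func racyCreates(t *table, job, name string) int {
	aMissing := !t.exists(job, name) // request A reads
	bMissing := !t.exists(job, name) // request B reads before A inserts
	if aMissing {
		t.insert(job, name)
	}
	if bMissing {
		t.insert(job, name)
	}
	return len(t.rows)
}

func main() {
	fmt.Println(racyCreates(&table{}, "job-1", "trigger-a"))
}
```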


google-oss-prow · 2026-05-29T21:22:31Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from droctothorpe. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.


Needs approval from an approver in each of these files:

backend/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


jaewak · 2026-05-29T21:52:12Z

> PR Overview
>
> This fixes the nil dereference and adds good coverage, but the new recurring-run idempotency check is still racy under multi-replica scheduling.
>
> Blocking feedback
>
> backend/src/apiserver/resource/resource_manager.go:642
> - GetRunByRecurringRunIdAndDisplayName is only a preflight read, so two controller replicas can still both observe "no matching run" and continue.
> - Both requests then go on to create a Kubernetes workflow before the DB insert happens.
> - run_details only has a primary key on UUID; there is no uniqueness on (JobUUID, DisplayName) to collapse that race into a safe no-op.
> - In that case the patch still allows duplicate or orphaned scheduled runs, which is the exact failure mode this change is trying to eliminate.
>
> I think this needs a DB-backed idempotency mechanism (for example a unique key plus insert-on-conflict handling, or another lock around the create path) rather than a best-effort read-before-create.
>
> ^ Feedback from GPT-5.4

Good call.

I tried the unique index on (JobUUID, DisplayName) first but it gets messy since non-recurring runs all store JobUUID = "" and are allowed to share display names, so the index would reject valid manual runs.

So I leaned on the PK we already have instead. Recurring-run triggers now get a deterministic UUID (UUIDv5 of RecurringRunId + "/" + DisplayName), so racing triggers land on the same key, and RunStore.CreateRun catches the duplicate-key error and just returns the run that already won. Atomic, no new index, no migration, manual runs untouched. Kept the old preflight read as a cheap fast-path. Added a util.NewDeterministicUUID helper + tests for determinism, the idempotent insert, and the ID derivation.

One thing to flag: the workflow is still created (with generateName) before the DB insert, so in a true dead-heat the loser leaves one orphaned workflow. Fixing that means a deterministic workflow name or flipping the order (DB first). Happy to do it here or as a follow-up.
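
For readers unfamiliar with the approach, a deterministic primary key plus duplicate-key handling could look roughly like this. It is a sketch, not the PR's util.NewDeterministicUUID: the lowercase helper, its signature, the all-zero namespace, and the map-backed store are assumptions. The UUIDv5 construction itself (SHA-1 over namespace and name, then version and variant bits) follows the standard.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"sync"
)

// newDeterministicUUID derives a UUIDv5-style ID from RecurringRunId + "/" +
// DisplayName. The all-zero namespace is an arbitrary choice for this sketch.
func newDeterministicUUID(recurringRunID, displayName string) string {
	var namespace [16]byte
	h := sha1.New()
	h.Write(namespace[:])
	h.Write([]byte(recurringRunID + "/" + displayName))
	b := h.Sum(nil)[:16]
	b[6] = (b[6] & 0x0f) | 0x50 // version 5
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	s := hex.EncodeToString(b)
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}

// runStore is a hypothetical store whose map key plays the role of the
// primary key on UUID.
type runStore struct {
	mu   sync.Mutex
	rows map[string]string // UUID -> display name
}

// createRun inserts the run, or returns the existing ID when the primary key
// is already taken (the stand-in for catching a duplicate-key error).
func (s *runStore) createRun(recurringRunID, displayName string) (string, bool) {
	id := newDeterministicUUID(recurringRunID, displayName)
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, dup := s.rows[id]; dup {
		return id, false
	}
	s.rows[id] = displayName
	return id, true
}

func main() {
	s := &runStore{rows: map[string]string{}}
	id1, created1 := s.createRun("job-1", "trigger-a")
	id2, created2 := s.createRun("job-1", "trigger-a")
	fmt.Println(id1 == id2, created1, created2)
}
```

Because both racing triggers compute the same key, the check and the insert collapse into one atomic step on the store side, which is the property the preflight read alone cannot provide.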


Copilot AI review requested due to automatic review settings May 27, 2026 21:07

google-oss-prow Bot requested review from HumairAK and zazulam May 27, 2026 21:08

google-oss-prow Bot added size/L needs-ok-to-test labels May 27, 2026

Copilot AI reviewed May 27, 2026


jaewak marked this pull request as draft May 27, 2026 21:15

google-oss-prow Bot added the do-not-merge/work-in-progress label May 27, 2026

jaewak marked this pull request as ready for review May 28, 2026 19:53

google-oss-prow Bot removed the do-not-merge/work-in-progress label May 28, 2026

google-oss-prow Bot requested a review from droctothorpe May 28, 2026 19:53

jaewak and others added 2 commits May 29, 2026 15:41

fix(backend): fix recurring runs with latest pipeline version

9f00ebd

Signed-off-by: jaewak <jaewan.0907@gmail.com>

fix: add missing ORDER BY to the query

9ee6ad1

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: jaewak <82007787+jaewak@users.noreply.github.com>

droctothorpe force-pushed the fix/backend-recurring-run-scheduling branch from a5203d5 to 9ee6ad1 Compare May 29, 2026 19:42

google-oss-prow Bot added ok-to-test and removed needs-ok-to-test labels May 29, 2026

droctothorpe requested changes May 29, 2026


google-oss-prow Bot assigned droctothorpe May 29, 2026

build: fix precheck failures

6c5131b

Signed-off-by: jaewak <jaewan.0907@gmail.com>

jaewak and others added 2 commits May 29, 2026 17:22

Merge branch 'master' into fix/backend-recurring-run-scheduling

ca66d47

fix: make primary key uuid deterministic and resolve duplicate key co…

05bc78b

…nflict Signed-off-by: jaewak <jaewan.0907@gmail.com>

ci: re-trigger flaky E2E run

b6ef5ea

Signed-off-by: jaewak <jaewan.0907@gmail.com>

github-actions Bot added the ci-passed All CI tests on a pull request have passed label May 29, 2026
