
fix(backend): fix recurring runs with the latest pipeline version #13440

Open
jaewak wants to merge 6 commits into kubeflow:master from jaewak:fix/backend-recurring-run-scheduling

Conversation

@jaewak

@jaewak jaewak commented May 27, 2026

Description of your changes:

Fixes two bugs in recurring runs when "Always use latest pipeline version" is enabled (pipeline version ID is empty).


Bug 1: API server panic on recurring run creation

What we observed

When a user creates a recurring run in the UI with "Always use latest pipeline version" enabled, the API server returns an internal error. The run is never created. In the API server logs, we saw a nil pointer dereference panic originating from the Argo workflow compilation path.

This only happens when the pipeline version ID is left empty (the "always latest" path). Recurring runs with a specific pinned version work fine.

Why it happens

In ResourceManager.CreateJob(), there's a validation step that fetches the latest pipeline version's template and calls tmpl.ScheduledWorkflow(job) to verify that the user-provided runtime parameters are compatible with the pipeline's inputs.

This compilation assumes various fields are populated (e.g., pipeline version metadata, image URIs from the specific version). When no version is pinned, these fields are nil, causing the panic.

Additionally, after this validation block, the code checks tmpl.GetTemplateType() == template.V1 without guarding against tmpl being nil, which is another crash path when the "always latest" code skips template fetching.

How the fix works

  1. New ValidateJobInputs() method on V2Spec (v2_template.go): Validates that the job's runtime parameters are compatible with the pipeline spec's InputDefinitions — the same validation that ScheduledWorkflow() does internally — but without performing the full Argo workflow compilation. It only converts the runtime config and calls validatePipelineJobInputs().

  2. Type assertion in CreateJob (resource_manager.go): Instead of unconditionally calling tmpl.ScheduledWorkflow(job), we check if the template is a *template.V2Spec. If so, call the lightweight ValidateJobInputs(). Otherwise, fall back to ScheduledWorkflow() for V1 templates (which don't have this issue).

  3. Nil guard (resource_manager.go): Add tmpl != nil && before the V1 pipeline block check to prevent a nil dereference when tmpl wasn't fetched.
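
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Template` interface, the `V2Spec` fields, `IsV1`, and `validateJob` are hypothetical stand-ins, and the validation is reduced to "every supplied parameter must be a declared input". The sketch also simplifies the nil guard by returning early when no template was fetched.

```go
package main

import (
	"errors"
	"fmt"
)

// Template is a hypothetical stand-in for the API server's template interface.
type Template interface {
	// ScheduledWorkflow represents the full compilation path.
	ScheduledWorkflow(params map[string]string) error
	// IsV1 stands in for tmpl.GetTemplateType() == template.V1.
	IsV1() bool
}

// V2Spec is a hypothetical stand-in for *template.V2Spec.
type V2Spec struct {
	Inputs map[string]bool // parameter names declared by the pipeline spec
}

// ScheduledWorkflow fails here to mimic compilation without a pinned version.
func (s *V2Spec) ScheduledWorkflow(map[string]string) error {
	return errors.New("compilation requires a pinned pipeline version")
}

func (s *V2Spec) IsV1() bool { return false }

// ValidateJobInputs only checks that each supplied parameter is declared.
func (s *V2Spec) ValidateJobInputs(params map[string]string) error {
	for name := range params {
		if !s.Inputs[name] {
			return fmt.Errorf("parameter %q is not a pipeline input", name)
		}
	}
	return nil
}

// validateJob picks the lightweight check for V2 and tolerates a nil template.
func validateJob(tmpl Template, params map[string]string) error {
	if tmpl == nil {
		return nil // template was never fetched; nothing to validate against
	}
	if v2, ok := tmpl.(*V2Spec); ok {
		return v2.ValidateJobInputs(params)
	}
	return tmpl.ScheduledWorkflow(params) // V1 keeps the original path
}

func main() {
	spec := &V2Spec{Inputs: map[string]bool{"message": true}}
	fmt.Println(validateJob(spec, map[string]string{"message": "hi"}))
	fmt.Println(validateJob(spec, map[string]string{"typo": "hi"}))
	fmt.Println(validateJob(nil, nil))
}
```

The comma-ok form of the type assertion matters here: it yields `ok == false` instead of panicking when the template is some other implementation.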


Bug 2: Run flood — multiple duplicate runs per trigger interval

What we observed

After fixing Bug 1 and successfully creating a recurring run with "always use latest version," we observed that each trigger interval (e.g., every 1 minute) was creating 8-10+ duplicate runs instead of just 1. The namespace quickly fills up with duplicate workflows.

This does NOT happen when a specific pipeline version is pinned. With a pinned version, exactly 1 run is created per interval.

Why it happens

The scheduled-workflow-controller is deployed with multiple replicas (2 in our cluster). When a trigger fires, the reconciliation loop runs on each replica concurrently. There's an existing idempotency mechanism:

// controller.go - submitNewWorkflowIfNotAlreadySubmitted()
_, isNotFoundError, err := c.workflowClient.Get(swf.Namespace, workflowName)
if err == nil {
    // Already exists, nothing to do
    return true, workflowName, nil
}

This checks whether a workflow with the deterministic name (e.g., runofhello-world-abc123-1-2873143499) already exists. If it does, the controller skips submission.

The problem: When using "always use latest version," the controller doesn't create the workflow directly — it calls the API server's CreateRun gRPC endpoint. The API server creates the Argo workflow with a name derived from the pipeline's display name (e.g., echo-xxxxx), NOT the deterministic name the controller expects. So the controller's Get check always returns "not found," and every replica proceeds to call CreateRun.

With 2 controller replicas, each resyncing every 10 seconds, this creates 8-10+ runs per minute interval.
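
To make the failure mode concrete, here is a toy simulation of the check described above. It is only a sketch under stated assumptions: `cluster`, `createRunViaAPIServer`, and `simulate` are hypothetical names, the generated workflow names are made up, and the loop treats every resync of every replica as one attempt. With 2 replicas and 6 resyncs per minute it gives an upper bound of 12 attempts, which is in line with the 8-10+ runs observed.

```go
package main

import "fmt"

// cluster is a toy stand-in for the Kubernetes API: workflow name -> exists.
type cluster map[string]bool

// createRunViaAPIServer mimics the "always latest" path: the workflow is
// stored under a generated name, not the controller's deterministic one.
func createRunViaAPIServer(c cluster, seq int) string {
	name := fmt.Sprintf("echo-%05d", seq)
	c[name] = true
	return name
}

// submitIfNotAlreadySubmitted mirrors the controller's existence check.
// It reports whether a new run was submitted.
func submitIfNotAlreadySubmitted(c cluster, deterministicName string, seq *int) bool {
	if c[deterministicName] {
		return false // already exists, nothing to do
	}
	*seq++
	createRunViaAPIServer(c, *seq)
	return true
}

// simulate counts how many runs are created for a single trigger.
func simulate(replicas, resyncs int) int {
	c := cluster{}
	seq, created := 0, 0
	for r := 0; r < resyncs; r++ {
		for i := 0; i < replicas; i++ {
			if submitIfNotAlreadySubmitted(c, "runofhello-world-abc123-1-2873143499", &seq) {
				created++
			}
		}
	}
	return created
}

func main() {
	// 2 replicas, one resync every 10 seconds over a 1-minute interval.
	fmt.Println(simulate(2, 6)) // prints 12
}
```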

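Secondary problem: Even if we prevent duplicate runs, the workflows created via the CreateRun gRPC path were missing the four canonical labels that the controller needs (scheduledworkflows.kubeflow.org/isOwnedByScheduledWorkflow, scheduledworkflows.kubeflow.org/scheduledWorkflowName, scheduledworkflows.kubeflow.org/workflowIndex, and scheduledworkflows.kubeflow.org/workflowEpoch).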

Without these labels, the controller's workflowClient.List() (which uses label selectors) never finds the workflow. It can't detect it as active or completed, so the SWF status never advances — nextTriggeredTime and workflowHistory stay frozen, and the controller retries indefinitely.
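
A small sketch of why a label-based List misses an unlabeled workflow. The `workflow` type and `listBySelector` are hypothetical, and the selector shown (ownership plus scheduled workflow name) is an assumption about what the controller filters on; only the label keys come from this PR.

```go
package main

import "fmt"

const (
	labelOwned = "scheduledworkflows.kubeflow.org/isOwnedByScheduledWorkflow"
	labelName  = "scheduledworkflows.kubeflow.org/scheduledWorkflowName"
)

// workflow is a toy stand-in for an Argo workflow object.
type workflow struct {
	Name   string
	Labels map[string]string
}

// listBySelector returns workflows whose labels contain every selector pair.
func listBySelector(all []workflow, selector map[string]string) []workflow {
	var out []workflow
	for _, wf := range all {
		match := true
		for k, v := range selector {
			if wf.Labels[k] != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, wf)
		}
	}
	return out
}

func main() {
	selector := map[string]string{labelOwned: "true", labelName: "runofhello-world"}
	unlabeled := workflow{Name: "echo-xxxxx"}
	labeled := workflow{Name: "echo-yyyyy", Labels: map[string]string{
		labelOwned: "true", labelName: "runofhello-world",
	}}
	fmt.Println(len(listBySelector([]workflow{unlabeled}, selector))) // prints 0
	fmt.Println(len(listBySelector([]workflow{labeled}, selector)))   // prints 1
}
```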

How the fix works

  1. Server-side idempotency check (resource_manager.go + run_store.go): At the very top of ResourceManager.CreateRun(), before any workflow creation or template fetching, we check: "does a run with this RecurringRunId + DisplayName already exist in the DB?" If yes, return it immediately.

    This works because:

    • The controller always passes the same deterministic DisplayName (swf.NextResourceName()) for a given trigger
    • The DB is the single source of truth — no race window between replicas
    • The first replica to reach the DB insert wins; subsequent replicas get the existing run back

    New method added to RunStoreInterface:

    GetRunByRecurringRunIdAndDisplayName(recurringRunId, displayName string) (string, error)

    This does a simple SELECT UUID FROM run_details WHERE JobUUID = ? AND DisplayName = ? LIMIT 1.

  2. Canonical labels (resource_manager.go): After setting OwnerReferences on the workflow, also call executionSpec.SetCannonicalLabels(swf.Name, epoch, nextIndex). This sets all four labels the controller needs to track the workflow via its label-based List queries.

    The index is computed from swf.Status.Trigger.LastIndex + 1, matching what the controller would set if it were creating the workflow directly.
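
The preflight check in step 1 can be sketched with an in-memory stand-in for the run_details table. The `run` and `runStore` types and the `createRun` helper are hypothetical; returning an empty string when no row matches, and skipping the check for runs without a recurring run ID, are assumptions made for this sketch rather than confirmed behavior of the PR.

```go
package main

import "fmt"

// run holds the columns the lookup cares about.
type run struct {
	UUID, RecurringRunID, DisplayName string
}

// runStore is an in-memory stand-in for the run_details table.
type runStore struct {
	runs []run
}

// GetRunByRecurringRunIdAndDisplayName plays the role of
// SELECT UUID FROM run_details WHERE JobUUID = ? AND DisplayName = ? LIMIT 1.
// It returns "" when no row matches (an assumption of this sketch).
func (s *runStore) GetRunByRecurringRunIdAndDisplayName(recurringRunID, displayName string) (string, error) {
	for _, r := range s.runs {
		if r.RecurringRunID == recurringRunID && r.DisplayName == displayName {
			return r.UUID, nil
		}
	}
	return "", nil
}

// createRun returns the existing run for a repeated trigger, else inserts one.
func (s *runStore) createRun(recurringRunID, displayName string) (string, error) {
	if recurringRunID != "" {
		existing, err := s.GetRunByRecurringRunIdAndDisplayName(recurringRunID, displayName)
		if err != nil {
			return "", err
		}
		if existing != "" {
			return existing, nil // same trigger seen before
		}
	}
	uuid := fmt.Sprintf("run-%d", len(s.runs)+1)
	s.runs = append(s.runs, run{uuid, recurringRunID, displayName})
	return uuid, nil
}

func main() {
	store := &runStore{}
	first, _ := store.createRun("job-1", "runofhello-world-1")
	second, _ := store.createRun("job-1", "runofhello-world-1")
	fmt.Println(first == second, len(store.runs)) // prints true 1
}
```

Note that this check is sequential in the sketch; it says nothing about what happens when two requests run the lookup at the same time.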


Testing

Tested on a live multi-replica deployment (3 API server pods, 2 scheduled-workflow-controller pods) on EKS:

Bug 1 verification:

  • Recurring runs with "always use latest version" can now be created without error
  • The recurring run appears in the UI and the ScheduledWorkflow CRD is created successfully

Bug 2 verification:

  • Each trigger interval produces exactly 1 run (previously 8-10+)
  • Controller logs show "successfully submitted" on each replica, but no duplicate workflows are created (idempotency check returns existing run)
  • Workflows have correct canonical labels:
    scheduledworkflows.kubeflow.org/isOwnedByScheduledWorkflow: "true"
    scheduledworkflows.kubeflow.org/scheduledWorkflowName: "runofhello-world-internal25zgz"
    scheduledworkflows.kubeflow.org/workflowIndex: "2"
    scheduledworkflows.kubeflow.org/workflowEpoch: "1779896834"
    
  • SWF status advances correctly — nextTriggeredTime, workflowHistory, and lastIndex all update as expected
  • Confirmed stable over extended period (30+ trigger intervals with zero duplicates)

Checklist:

Copilot AI review requested due to automatic review settings May 27, 2026 21:07
@google-oss-prow google-oss-prow Bot requested review from HumairAK and zazulam May 27, 2026 21:08
@google-oss-prow

Hi @jaewak. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds lightweight validation for V2 pipeline job inputs and introduces idempotent handling for recurring-run-triggered run creation to avoid duplicate run submissions.

Changes:

  • Add V2Spec.ValidateJobInputs to validate runtime parameters without full workflow compilation.
  • Add a RunStore query method to detect existing runs by recurring run id + display name and use it for idempotent CreateRun.
  • Add/extend tests covering input validation and recurring-run idempotency behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| backend/src/apiserver/template/v2_template.go | Adds ValidateJobInputs helper for V2 runtime parameter validation. |
| backend/src/apiserver/template/template_test.go | Adds unit tests for ValidateJobInputs. |
| backend/src/apiserver/storage/run_store.go | Adds DB lookup for existing run UUID by recurring run id + display name. |
| backend/src/apiserver/storage/run_store_test.go | Adds test for the new RunStore lookup method. |
| backend/src/apiserver/resource/resource_manager.go | Uses the lookup for idempotent recurring-run run creation; adds canonical label setting; updates job validation path. |
| backend/src/apiserver/resource/resource_manager_test.go | Adds test ensuring recurring-run duplicates return existing run and do not submit a new workflow. |

Comment thread backend/src/apiserver/template/v2_template.go
Comment thread backend/src/apiserver/storage/run_store.go Outdated
Comment thread backend/src/apiserver/storage/run_store.go
Comment thread backend/src/apiserver/resource/resource_manager.go Outdated
Comment thread backend/src/apiserver/storage/run_store.go Outdated
Comment thread backend/src/apiserver/resource/resource_manager.go
@jaewak jaewak marked this pull request as draft May 27, 2026 21:15
@jaewak jaewak marked this pull request as ready for review May 28, 2026 19:53
@google-oss-prow google-oss-prow Bot requested a review from droctothorpe May 28, 2026 19:53
@cbartram
Contributor

This is a great find! Thank you for taking the time to deeply debug this and lay it out clearly for us. I am surprised that the second bug where duplicate runs are created is due to replicas.

I would think that the scheduled workflow API would be stateless and thus the number of replicas wouldn't impact the number of runs created.

@jaewak
Author

jaewak commented May 29, 2026


Thanks for the review! It looks like the controller is not stateless: before creating a workflow, it checks the informers to see whether the run has already been created (code path here).

jaewak and others added 2 commits May 29, 2026 15:41
Signed-off-by: jaewak <jaewan.0907@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: jaewak <82007787+jaewak@users.noreply.github.com>
@droctothorpe droctothorpe force-pushed the fix/backend-recurring-run-scheduling branch from a5203d5 to 9ee6ad1 Compare May 29, 2026 19:42
@droctothorpe
Collaborator

/ok-to-test

@droctothorpe
Collaborator

@jaewak this is great, thank you so much for tackling it. Can you make sure to run the pre-commit hooks locally? The pre-commit CI stage is failing.

Collaborator

@droctothorpe droctothorpe left a comment


PR Overview

This fixes the nil dereference and adds good coverage, but the new recurring-run idempotency check is still racy under multi-replica scheduling.

Blocking feedback

  • backend/src/apiserver/resource/resource_manager.go:642
    • GetRunByRecurringRunIdAndDisplayName is only a preflight read, so two controller replicas can still both observe "no matching run" and continue.
    • Both requests then go on to create a Kubernetes workflow before the DB insert happens.
    • run_details only has a primary key on UUID; there is no uniqueness on (JobUUID, DisplayName) to collapse that race into a safe no-op.
    • In that case the patch still allows duplicate or orphaned scheduled runs, which is the exact failure mode this change is trying to eliminate.

I think this needs a DB-backed idempotency mechanism (for example a unique key plus insert-on-conflict handling, or another lock around the create path) rather than a best-effort read-before-create.


^ Feedback from GPT-5.4

Signed-off-by: jaewak <jaewan.0907@gmail.com>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from droctothorpe. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jaewak
Author

jaewak commented May 29, 2026


Good call.

I tried the unique index on (JobUUID, DisplayName) first but it gets messy since non-recurring runs all store JobUUID = "" and are allowed to share display names, so the index would reject valid manual runs.

So I leaned on the PK we already have instead. Recurring-run triggers now get a deterministic UUID (UUIDv5 of RecurringRunId + "/" + DisplayName), so racing triggers land on the same key, and RunStore.CreateRun catches the duplicate-key error and just returns the run that already won. Atomic, no new index, no migration, manual runs untouched. Kept the old preflight read as a cheap fast-path. Added a util.NewDeterministicUUID helper + tests for determinism, the idempotent insert, and the ID derivation.
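
A rough sketch of the idea, using only the standard library. Everything named here is hypothetical: `newDeterministicUUID` is not the PR's `util.NewDeterministicUUID`, the namespace (the RFC 4122 DNS namespace UUID) is an arbitrary choice for illustration, and `runStore` is an in-memory map standing in for a table with a primary key on UUID.

```go
package main

import (
	"crypto/sha1"
	"errors"
	"fmt"
)

// namespace is an assumption for this sketch: the RFC 4122 DNS namespace UUID.
var namespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// newDeterministicUUID derives a version 5 UUID from the trigger identity.
func newDeterministicUUID(recurringRunID, displayName string) string {
	h := sha1.New()
	h.Write(namespace[:])
	h.Write([]byte(recurringRunID + "/" + displayName))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // set version 5
	u[8] = (u[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

var errDuplicateKey = errors.New("duplicate primary key")

// runStore maps UUID (the primary key) to display name.
type runStore struct {
	byUUID map[string]string
}

// insert fails if the primary key is already taken.
func (s *runStore) insert(uuid, displayName string) error {
	if _, taken := s.byUUID[uuid]; taken {
		return errDuplicateKey
	}
	s.byUUID[uuid] = displayName
	return nil
}

// createRecurringRun treats a duplicate key as "someone else already won".
func (s *runStore) createRecurringRun(recurringRunID, displayName string) (string, bool, error) {
	uuid := newDeterministicUUID(recurringRunID, displayName)
	err := s.insert(uuid, displayName)
	if errors.Is(err, errDuplicateKey) {
		return uuid, false, nil // return the run that is already stored
	}
	return uuid, err == nil, err
}

func main() {
	store := &runStore{byUUID: map[string]string{}}
	id1, created1, _ := store.createRecurringRun("job-1", "runofhello-world-1")
	id2, created2, _ := store.createRecurringRun("job-1", "runofhello-world-1")
	fmt.Println(id1 == id2, created1, created2, len(store.byUUID))
}
```

In a real database the primary-key constraint is what makes the second insert fail atomically; the map lookup above only imitates that outcome in a single goroutine.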

One thing to flag: the workflow is still created (with generateName) before the DB insert, so in a true dead-heat the loser leaves one orphaned workflow. Fixing that means a deterministic workflow name or flipping the order (DB first). Happy to do it here or as a follow-up.

Signed-off-by: jaewak <jaewan.0907@gmail.com>
@github-actions github-actions Bot added the ci-passed All CI tests on a pull request have passed label May 29, 2026