ci: make e2e tests more robust against flakes by vdemeester · Pull Request #127 · tektoncd-catalog/git-clone

vdemeester · 2026-06-08T09:36:57Z

Summary

Makes the e2e test runners resilient to two distinct flakes observed in CI (e.g. on #126, which is docs-only yet had E2E jobs fail).

Flake modes diagnosed

1. Control-plane readiness race (E2E (v1.6.2))

error: timed out waiting for the condition on deployments/tekton-pipelines-controller

The 120s wait on just the controller/webhook deployments was too tight on a cold kind node still pulling images.
A deployment reporting Available also doesn't guarantee that the admission webhook has ready endpoints, so the subsequent kubectl apply of the Task can race a webhook that is not yet serving.
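A minimal sketch of the readiness hardening for this race, not the exact code in the runners. The function names, the attempt count, and the poll interval are assumptions made for illustration. The namespace, the tekton-pipelines-webhook name, and the 300s timeout come from this PR's description.

```shell
# Sketch only: function names, attempt count and poll interval are assumptions.

# Wait until every deployment in the namespace reports Available,
# instead of only the controller and webhook deployments.
wait_for_deployments() {
  local ns="${1:-tekton-pipelines}" timeout="${2:-300s}"
  kubectl wait --for=condition=Available deployment --all \
    --namespace "${ns}" --timeout="${timeout}"
}

# Poll the webhook Endpoints object until at least one address is listed.
# An Available deployment does not imply the Service has ready endpoints yet.
wait_for_webhook_endpoints() {
  local ns="${1:-tekton-pipelines}" svc="${2:-tekton-pipelines-webhook}"
  local attempts="${3:-60}" interval="${4:-5}"
  local i=1 ips
  while [ "${i}" -le "${attempts}" ]; do
    ips="$(kubectl get endpoints "${svc}" --namespace "${ns}" \
      -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null || true)"
    if [ -n "${ips}" ]; then
      return 0
    fi
    sleep "${interval}"
    i=$((i + 1))
  done
  echo "endpoints for ${svc} in ${ns} never became ready" >&2
  return 1
}
```

A runner would call wait_for_deployments and then wait_for_webhook_endpoints before the first kubectl apply of any Tekton resource.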

2. Pod sandbox contention (E2E (v1.9.3), git-clone-run-delete-existing)

pod status "PodReadyToStartContainers":"False"; message: ""   (empty logs)

All 14 TaskRuns are created at once on a single-node kind cluster; one pod stalled at sandbox/network setup and never started its container. Nothing to do with git-clone logic.

Changes

Harden readiness wait (both e2e-tests.sh and e2e-bundle-test.sh): wait for all tekton-pipelines deployments (300s) and poll the tekton-pipelines-webhook endpoints until populated before applying any Tekton resources.
Retry a TaskRun once in e2e-tests.sh (delete + re-apply from its source manifest) before declaring failure, to absorb transient pod-startup blips.

No change to the Task, StepAction, or test fixtures — test-runner robustness only.
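The single retry described above could look roughly like the sketch below. The helper names, the polling parameters (TASKRUN_POLL_ATTEMPTS, TASKRUN_POLL_INTERVAL), and the assumption that each TaskRun has its own manifest file and a fixed name are all hypothetical; the delete and re-apply sequence is what this PR describes.

```shell
# Sketch only: helper names and polling parameters are assumptions.

# Poll the TaskRun's Succeeded condition: 0 on True, 1 on False or timeout.
wait_for_taskrun() {
  local name="$1" ns="$2"
  local attempts="${TASKRUN_POLL_ATTEMPTS:-120}" interval="${TASKRUN_POLL_INTERVAL:-5}"
  local i=1 status
  while [ "${i}" -le "${attempts}" ]; do
    status="$(kubectl get taskrun "${name}" --namespace "${ns}" \
      -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}' \
      2>/dev/null || true)"
    case "${status}" in
      True) return 0 ;;
      False) return 1 ;;
    esac
    sleep "${interval}"
    i=$((i + 1))
  done
  return 1
}

# Give a TaskRun one more chance: delete it, re-apply its source manifest,
# and only report failure if the second attempt also fails.
run_with_one_retry() {
  local name="$1" manifest="$2" ns="${3:-default}"
  if wait_for_taskrun "${name}" "${ns}"; then
    return 0
  fi
  echo "TaskRun ${name} failed, retrying once" >&2
  kubectl delete taskrun "${name}" --namespace "${ns}" --ignore-not-found
  kubectl apply --filename "${manifest}" --namespace "${ns}" || return 1
  wait_for_taskrun "${name}" "${ns}"
}
```

Retrying only once keeps a genuine regression visible: a TaskRun that fails for a real reason fails twice and is still reported, while a transient pod-startup stall is absorbed.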

Two distinct flake modes were observed in CI:

1. Control-plane readiness race: waiting only on the controller/webhook deployments with a 120s timeout was too tight on a cold kind node, and 'deployment available' does not guarantee the admission webhook has ready endpoints, so subsequent applies could race an unserved webhook.
2. Pod sandbox contention: with all TaskRuns created at once on a single-node kind cluster, a pod could stall at PodReadyToStartContainers=False (sandbox/network/image-pull blip) and never start its container.

Changes:

- Wait for all tekton-pipelines deployments (300s) and poll the webhook endpoints until populated before applying any Tekton resources.
- Retry a TaskRun once (delete + re-apply from source manifest) before declaring failure, to absorb transient pod-startup flakes.
- Apply the same readiness hardening to the bundle e2e test.

Signed-off-by: Vincent Demeester <vdemeest@redhat.com>

tekton-robot · 2026-06-08T09:37:02Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from vdemeester after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Port the robustness fixes from tektoncd-catalog/git-clone#127 to all four e2e runners. Two flake modes are addressed:

1. Control-plane readiness race: waiting only on the controller/webhook deployments with a 120s timeout was too tight on a cold kind node, and 'deployment available' does not guarantee the admission webhook has ready endpoints, so subsequent applies could race an unserved webhook. Now wait for all tekton-pipelines deployments (300s) and poll the webhook endpoints until populated before applying any Tekton resources.
2. Pod sandbox contention: with many PipelineRuns created at once on a single-node kind cluster, a pod could stall at PodReadyToStartContainers and never start its container. Each PipelineRun's spec is now snapshotted (dependency-free, via kubectl -o yaml) and retried once (delete + recreate) before being declared a failure.

Applies to e2e-tests.sh, e2e-tests-alpine.sh, e2e-stepactions.sh and e2e-bundle-test.sh.

Signed-off-by: Vincent Demeester <vdemeest@redhat.com>

tekton-robot requested review from QuanZhang-William and vinamra28 June 8, 2026 09:37

tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 8, 2026

vdemeester merged commit a962c33 into main Jun 8, 2026
8 checks passed

vdemeester deleted the fix/e2e-flakiness branch June 8, 2026 09:52

vdemeester mentioned this pull request Jun 8, 2026

ci: make e2e tests more robust against flakes tektoncd-catalog/golang#33

Merged
