ci: make e2e tests more robust against flakes #127
Merged
Two distinct flake modes were observed in CI:

1. Control-plane readiness race: waiting only on the controller/webhook deployments with a 120s timeout was too tight on a cold kind node, and 'deployment available' does not guarantee the admission webhook has ready endpoints, so subsequent applies could race an unserved webhook.
2. Pod sandbox contention: with all TaskRuns created at once on a single-node kind cluster, a pod could stall at PodReadyToStartContainers=False (sandbox/network/image-pull blip) and never start its container.

Changes:

- Wait for all tekton-pipelines deployments (300s) and poll the webhook endpoints until populated before applying any Tekton resources.
- Retry a TaskRun once (delete + re-apply from source manifest) before declaring failure, to absorb transient pod-startup flakes.
- Apply the same readiness hardening to the bundle e2e test.

Signed-off-by: Vincent Demeester <vdemeest@redhat.com>
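The readiness hardening described in this commit message could look roughly like the sketch below. It is not the code from e2e-tests.sh: the function names, the polling interval, and the attempt count are assumptions made for illustration. Only the tekton-pipelines namespace, the tekton-pipelines-webhook endpoints, and the 300s timeout come from the PR.

```shell
# Sketch only: function names, attempt count and interval are hypothetical.

# Wait until every deployment in the tekton-pipelines namespace is Available.
wait_for_tekton_deployments() {
  local timeout="${1:-300s}"
  kubectl wait --for=condition=Available deployment --all \
    --namespace tekton-pipelines --timeout="${timeout}"
}

# Poll the webhook Endpoints object until it lists at least one ready address.
# "Deployment available" alone does not guarantee this.
wait_for_webhook_endpoints() {
  local attempts="${1:-60}" interval="${2:-5}" ips i=1
  while [ "${i}" -le "${attempts}" ]; do
    ips="$(kubectl get endpoints tekton-pipelines-webhook \
      --namespace tekton-pipelines \
      -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null || true)"
    if [ -n "${ips}" ]; then
      echo "webhook endpoints ready: ${ips}"
      return 0
    fi
    sleep "${interval}"
    i=$((i + 1))
  done
  echo "timed out waiting for tekton-pipelines-webhook endpoints" >&2
  return 1
}
```

A runner would call `wait_for_tekton_deployments && wait_for_webhook_endpoints` before the first `kubectl apply` of a Tekton resource.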
[APPROVALNOTIFIER] This PR is NOT APPROVED
vdemeester added a commit to tektoncd-catalog/golang that referenced this pull request Jun 8, 2026
Port the robustness fixes from tektoncd-catalog/git-clone#127 to all four e2e runners. Two flake modes are addressed:

1. Control-plane readiness race: waiting only on the controller/webhook deployments with a 120s timeout was too tight on a cold kind node, and 'deployment available' does not guarantee the admission webhook has ready endpoints, so subsequent applies could race an unserved webhook. Now wait for all tekton-pipelines deployments (300s) and poll the webhook endpoints until populated before applying any Tekton resources.
2. Pod sandbox contention: with many PipelineRuns created at once on a single-node kind cluster, a pod could stall at PodReadyToStartContainers and never start its container. Each PipelineRun's spec is now snapshotted (dependency-free, via kubectl -o yaml) and retried once (delete + recreate) before being declared a failure.

Applies to e2e-tests.sh, e2e-tests-alpine.sh, e2e-stepactions.sh and e2e-bundle-test.sh.

Signed-off-by: Vincent Demeester <vdemeest@redhat.com>
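A minimal sketch of the snapshot-and-recreate step follows. The commit message does not say how server-populated fields are handled, so this assumes they are stripped with sed, since a create request that carries a resourceVersion is rejected by the API server. All function names are hypothetical. The sed filter also assumes kubectl's usual YAML layout: two-space indentation and a top-level status block at the end.

```shell
# Sketch only: these helpers are illustrative, not the code from the runners.

# Drop fields the API server fills in so the saved object can be created again.
# Assumes two-space indentation and a trailing top-level "status:" block.
strip_server_fields() {
  sed -e '/^status:/,$d' \
      -e '/^  resourceVersion:/d' \
      -e '/^  uid:/d' \
      -e '/^  creationTimestamp:/d' \
      -e '/^  generation:/d'
}

# Save a PipelineRun using kubectl and sed only (no yq or jq).
# $1 = PipelineRun name, $2 = output file
snapshot_pipelinerun() {
  kubectl get pipelinerun "$1" -o yaml | strip_server_fields > "$2"
}

# Delete the stalled PipelineRun and create it again from the snapshot.
# $1 = PipelineRun name, $2 = snapshot file
recreate_pipelinerun() {
  kubectl delete pipelinerun "$1" --ignore-not-found --wait=true
  kubectl create -f "$2"
}
```

The snapshot would be taken right after the PipelineRun is created, so that a later delete + recreate does not depend on the original manifest still being at hand. The commands operate in the current context's namespace.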
Summary
Makes the e2e test runners resilient to two distinct flakes observed in CI (e.g. on #126, which is docs-only yet had E2E jobs fail).
Flake modes diagnosed
1. Control-plane readiness race (E2E (v1.6.2))

- 'deployment available' also doesn't guarantee the admission webhook has ready endpoints, so the subsequent kubectl apply of the Task can race an unserved webhook.

2. Pod sandbox contention (E2E (v1.9.3), git-clone-run-delete-existing)

Changes

- e2e-tests.sh and e2e-bundle-test.sh: wait for all tekton-pipelines deployments (300s) and poll the tekton-pipelines-webhook endpoints until populated before applying any Tekton resources.
- e2e-tests.sh: retry a TaskRun once (delete + re-apply from its source manifest) before declaring failure, to absorb transient pod-startup blips.

No change to the Task, StepAction, or test fixtures; this is test-runner robustness only.
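The retry-once behaviour can be sketched as below. This is an illustration under stated assumptions, not the code in e2e-tests.sh: the function names and the 300s default are hypothetical, the TaskRun is assumed to have been applied already, and its manifest is assumed to use metadata.name rather than generateName.

```shell
# Sketch only: names and the default timeout are hypothetical.

# Block until the TaskRun reports Succeeded=True; non-zero on timeout.
# Note: a TaskRun that fails outright is only noticed once the timeout expires.
wait_for_taskrun() {
  kubectl wait --for=condition=Succeeded "taskrun/$1" --timeout="${2:-300s}"
}

# Wait for a TaskRun; on failure, delete it, re-apply its source manifest,
# and wait one more time before giving up.
# $1 = TaskRun name, $2 = source manifest path
run_taskrun_with_retry() {
  local name="$1" manifest="$2"
  if wait_for_taskrun "${name}"; then
    return 0
  fi
  echo "TaskRun ${name} did not succeed; retrying once" >&2
  kubectl delete taskrun "${name}" --ignore-not-found --wait=true
  kubectl apply -f "${manifest}"
  wait_for_taskrun "${name}"
}
```

Limiting the retry to a single attempt absorbs a one-off pod-startup stall while still surfacing a genuinely broken Task as a failure.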