ci: make e2e tests more robust against flakes#127

Merged: vdemeester merged 1 commit into main from fix/e2e-flakiness on Jun 8, 2026

@vdemeester (Member)

Summary

Makes the e2e test runners resilient to two distinct flakes observed in CI (for example on #126, a docs-only change whose E2E jobs still failed).

Flake modes diagnosed

1. Control-plane readiness race (E2E (v1.6.2))

error: timed out waiting for the condition on deployments/tekton-pipelines-controller
  • The 120s wait covered only the controller and webhook deployments and was too tight on a cold kind node that is still pulling images.
  • A deployment reporting Available also doesn't guarantee that the admission webhook has ready endpoints, so the subsequent kubectl apply of the Task can race a webhook that is not yet serving.

2. Pod sandbox contention (E2E (v1.9.3), git-clone-run-delete-existing)

pod status "PodReadyToStartContainers":"False"; message: ""   (empty logs)
  • All 14 TaskRuns are created at once on a single-node kind cluster; one pod stalled at sandbox/network setup and never started its container. The failure is unrelated to the git-clone logic.

Changes

  • Harden readiness wait (both e2e-tests.sh and e2e-bundle-test.sh): wait for all tekton-pipelines deployments (300s) and poll the tekton-pipelines-webhook endpoints until populated before applying any Tekton resources.
  • Retry a TaskRun once in e2e-tests.sh (delete + re-apply from its source manifest) before declaring failure, to absorb transient pod-startup blips.

No change to the Task, StepAction, or test fixtures — test-runner robustness only.
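
The hardened readiness wait described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the helper names `wait_for_deployments` and `wait_for_webhook_endpoints`, the attempt count, and the polling delay are assumptions made for illustration; only the `tekton-pipelines` namespace, the `tekton-pipelines-webhook` name, and the 300s timeout come from the description.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, attempt counts, and delays are assumptions.

# Block until every Deployment in the namespace reports Available.
wait_for_deployments() {
  local ns="${1:-tekton-pipelines}"
  local timeout="${2:-300s}"
  kubectl wait --for=condition=Available deployment --all \
    --namespace "$ns" --timeout="$timeout"
}

# Poll the webhook Endpoints object until it lists at least one ready address.
# An Available deployment alone does not prove the Service is being served.
wait_for_webhook_endpoints() {
  local ns="${1:-tekton-pipelines}"
  local svc="${2:-tekton-pipelines-webhook}"
  local attempts="${3:-60}"
  local delay="${4:-5}"
  local i=1
  local addrs
  while [ "$i" -le "$attempts" ]; do
    addrs="$(kubectl get endpoints "$svc" --namespace "$ns" \
      -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null || true)"
    if [ -n "$addrs" ]; then
      echo "webhook endpoints ready: $addrs"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for $svc endpoints" >&2
  return 1
}
```

A runner would call `wait_for_deployments && wait_for_webhook_endpoints` before its first `kubectl apply` of a Tekton resource, so that admission requests are not sent to a webhook with no backing pods.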

Two distinct flake modes were observed in CI:

1. Control-plane readiness race: waiting only on the controller/webhook
   deployments with a 120s timeout was too tight on a cold kind node, and
   'deployment available' does not guarantee the admission webhook has ready
   endpoints, so subsequent applies could race an unserved webhook.

2. Pod sandbox contention: with all TaskRuns created at once on a single-node
   kind cluster, a pod could stall at PodReadyToStartContainers=False
   (sandbox/network/image-pull blip) and never start its container.

Changes:
- Wait for all tekton-pipelines deployments (300s) and poll the webhook
  endpoints until populated before applying any Tekton resources.
- Retry a TaskRun once (delete + re-apply from source manifest) before
  declaring failure, to absorb transient pod-startup flakes.
- Apply the same readiness hardening to the bundle e2e test.

Signed-off-by: Vincent Demeester <vdemeest@redhat.com>
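
The retry-once behaviour from the change list above can be sketched as follows. This is an illustration under assumptions rather than the merged code: `taskrun_succeeded` and `run_with_retry` are hypothetical helper names, and the default timeout is a placeholder. It assumes each TaskRun has a fixed name and a source manifest that can be re-applied.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the default timeout are assumptions.

# Returns 0 once the TaskRun's Succeeded condition is True, and non-zero if
# that has not happened before the timeout (a failed run also ends up here).
taskrun_succeeded() {
  local name="$1"
  local timeout="${2:-300s}"
  kubectl wait --for=condition=Succeeded "taskrun/$name" --timeout="$timeout"
}

# Give a TaskRun one extra attempt: delete it, re-apply its source manifest,
# and wait again. The second result is final.
run_with_retry() {
  local name="$1"
  local manifest="$2"
  local timeout="${3:-300s}"
  if taskrun_succeeded "$name" "$timeout"; then
    return 0
  fi
  echo "TaskRun $name did not succeed; retrying once" >&2
  kubectl delete taskrun "$name" --ignore-not-found --wait=true
  kubectl apply -f "$manifest"
  taskrun_succeeded "$name" "$timeout"
}
```

Limiting the retry to a single attempt keeps a genuinely broken Task failing the job, while a one-off stall at pod startup gets a second chance.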
@tekton-robot


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from vdemeester after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 8, 2026
@vdemeester vdemeester merged commit a962c33 into main Jun 8, 2026
8 checks passed
@vdemeester vdemeester deleted the fix/e2e-flakiness branch June 8, 2026 09:52
vdemeester added a commit to tektoncd-catalog/golang that referenced this pull request Jun 8, 2026
Port the robustness fixes from tektoncd-catalog/git-clone#127 to all four
e2e runners.

Two flake modes are addressed:

1. Control-plane readiness race: waiting only on the controller/webhook
   deployments with a 120s timeout was too tight on a cold kind node, and
   'deployment available' does not guarantee the admission webhook has ready
   endpoints, so subsequent applies could race an unserved webhook. Now wait
   for all tekton-pipelines deployments (300s) and poll the webhook endpoints
   until populated before applying any Tekton resources.

2. Pod sandbox contention: with many PipelineRuns created at once on a
   single-node kind cluster, a pod could stall at PodReadyToStartContainers
   and never start its container. Each PipelineRun's spec is now snapshotted
   (dependency-free, via kubectl -o yaml) and retried once (delete + recreate)
   before being declared a failure.

Applies to e2e-tests.sh, e2e-tests-alpine.sh, e2e-stepactions.sh and
e2e-bundle-test.sh.

Signed-off-by: Vincent Demeester <vdemeest@redhat.com>
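
The commit message above mentions snapshotting each PipelineRun with `kubectl -o yaml` and recreating it on retry, but does not show how the snapshot is prepared. One possible shape, offered only as a sketch: because the API server generally rejects a create that still carries a `resourceVersion`, the example strips a few server-populated fields with `awk` before saving. The helper names (`strip_server_fields`, `snapshot_pipelinerun`, `recreate_pipelinerun`) and the list of stripped fields are assumptions, not taken from the referenced commit.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the set of stripped fields are assumptions.

# Drop server-populated fields from "kubectl get -o yaml" output so the object
# can be created again: the top-level status block, metadata.managedFields,
# and a few scalar metadata keys.
strip_server_fields() {
  awk '
    # A new top-level key: note whether we are in metadata or status.
    /^[^ #-]/ { in_meta = ($0 ~ /^metadata:/); skip_top = ($0 ~ /^status:/); skip_meta = 0 }
    skip_top { next }
    # A new two-space key inside metadata starts or ends the managedFields skip.
    in_meta && /^  [^ -]/ { skip_meta = ($0 ~ /^  managedFields:/) }
    skip_meta { next }
    in_meta && /^  (resourceVersion|uid|creationTimestamp|generation):/ { next }
    { print }
  '
}

# Save a re-creatable copy of a PipelineRun to a file.
snapshot_pipelinerun() {
  local name="$1"
  local out="$2"
  kubectl get pipelinerun "$name" -o yaml | strip_server_fields > "$out"
}

# Delete the stalled run and create it again from the saved snapshot.
recreate_pipelinerun() {
  local name="$1"
  local snapshot="$2"
  kubectl delete pipelinerun "$name" --ignore-not-found --wait=true
  kubectl create -f "$snapshot"
}
```

The filter relies on the two-space indentation that `kubectl` emits and needs nothing beyond `awk`, which is in keeping with the dependency-free intent; a real runner may need to strip additional fields depending on the cluster.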