
test: harden 3 load-induced release-gate flakes (solid fixes)#229

Merged
brandonrc merged 1 commit into main from fix/release-gate-flake-hardening
Jun 21, 2026


@brandonrc

Contributor

Makes the 3 flaky release-gate suites SOLID (real fixes, not retry-roulette). All of the flakes were load-induced by concurrent gate execution, and each is root-caused at the source.

  • format-jvm/test-sbt: idempotent artifacts GET hit a transient 5xx under fleet worker-starvation. Now retries 5xx/000 only (never 4xx).
  • security/test-regression-security: setup user-creates hit transient 5xx. Bounded retry on transient class only.
  • security/test-scan-dedup-1373: old assertion (total completed==1) was wrong — backend runs 2 scanners (dependency+grype), so a correctly-deduped artifact has 1 row PER scan_type. New assertion is per-scan_type (stronger, deterministic). NOTE: surfaced a real backend race (check-then-act on scan_results under concurrent triggers, no lock/unique constraint) — filed separately; the test serializes its own triggers so it's deterministic.
  • webhook/test-webhook-retry-recover: attempt_delta=0s was test contamination — a global webhook captured concurrent suites' deliveries. Now isolates by repo entity_id; windows account for 30s tick + jitter. Backend scheduler is correct.
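The per-scan_type assertion from the scan-dedup bullet could be sketched roughly as below. This is an illustration only, not the code in the test script: the function name `assert_one_completed_per_type` and the two-column `scan_type status` input format are assumptions, and the real test also checks that scan_id stays stable per type, which is omitted here.

```bash
# Hypothetical sketch; input format and function name are assumptions.
# Reads "scan_type status" lines on stdin. Succeeds only if every expected
# scan type (comma-separated in $1) has exactly one row with status "completed".
assert_one_completed_per_type() {
  awk -v expected="$1" '
    $2 == "completed" { count[$1]++ }
    END {
      n = split(expected, types, ",")
      for (i = 1; i <= n; i++) {
        if (count[types[i]] != 1) {
          printf "FAIL: %s has %d completed rows, expected 1\n", types[i], count[types[i]]
          bad = 1
        }
      }
      exit bad
    }'
}
```

With two registered scanners, a correctly deduped artifact yields two completed rows in total, which is why a total-count==1 check fails even when dedup works. A call might look like `printf '%s\n' 'dependency completed' 'grype completed' | assert_one_completed_per_type dependency,grype`.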

Verified locally under concurrent load where feasible (security 9/9 under load); webhook timing logic validated on synthetic logs (live delivery needs the kind pod network).
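The webhook timing check on synthetic logs can be illustrated with a small sketch. The `epoch_seconds entity_id` log format, the function names, and the window bounds (backoff minus 20% up to backoff plus 20% plus one 30s tick) are assumptions for illustration; the actual windows in the test may differ.

```bash
# Hypothetical sketch; log format, names, and window bounds are assumptions.

# deltas_for_entity ENTITY_ID
# Reads "epoch_seconds entity_id" lines on stdin, keeps only the given entity,
# sorts oldest-first, and prints the gap in seconds between consecutive deliveries.
deltas_for_entity() {
  awk -v id="$1" '$2 == id { print $1 }' | sort -n |
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

# in_window DELTA BACKOFF
# Succeeds if DELTA lies in the assumed window [0.8*BACKOFF, 1.2*BACKOFF + 30].
in_window() {
  local delta=$1 backoff=$2 tick=30
  local lo=$(( backoff * 80 / 100 ))
  local hi=$(( backoff * 120 / 100 + tick ))
  [ "$delta" -ge "$lo" ] && [ "$delta" -le "$hi" ]
}
```

For a synthetic log containing `1000 repo-a`, `1000 repo-b`, and `1041 repo-a`, measuring across all deliveries gives a first delta of 0s (the sibling suite's delivery), while `deltas_for_entity repo-a` gives 41, which passes `in_window 41 30`.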

- common.sh: add api_get_with_retry + create_test_user_with_retry — bounded
  retry-with-backoff on TRANSIENT class only (5xx/000); 4xx never retried so
  real client/auth errors are not masked.
- test-sbt.sh: artifacts GET uses api_get_with_retry (idempotent GET that hit a
  transient 5xx under fleet worker-starvation).
- test-regression-security.sh: 3 setup user-creates use create_test_user_with_retry.
- test-scan-dedup-short-circuit-1373.sh: assert exactly-1-completed-row PER
  scan_type (backend registers dependency+grype scanners) and stable scan_id
  per type across re-trigger — STRONGER + deterministic vs the old total-count==1.
  (Test serializes its own triggers; see filed backend race for the concurrent case.)
- test-webhook-retry-recover.sh: isolate measured deliveries by repo entity_id so
  concurrent suites' sibling deliveries to a global webhook no longer contaminate
  the timing (root cause of attempt_delta=0s); oldest-first; windows account for
  the 30s scheduler tick + 20% jitter; per-PID receiver port.
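A bounded retry restricted to the transient class might look like the sketch below. The real `api_get_with_retry` in common.sh is not shown in this PR description, so everything here is hypothetical: the names `is_transient_status`, `http_status`, and `get_with_retry`, the default of 4 attempts, and the doubling backoff are all assumptions.

```bash
# Hypothetical sketch; not the actual common.sh implementation.

# Succeeds when the status is in the transient class: 5xx, or 000
# (curl prints 000 for %{http_code} when no HTTP response was received).
is_transient_status() {
  case "$1" in
    5[0-9][0-9]|000) return 0 ;;
    *) return 1 ;;
  esac
}

# Transport: print only the HTTP status code for a GET of URL.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1" || true
}

# get_with_retry URL [MAX_ATTEMPTS] [INITIAL_DELAY_SECONDS]
# Prints the final status. Returns 0 on 2xx; returns 1 as soon as a
# non-transient status (e.g. any 4xx) is seen or the attempts run out.
get_with_retry() {
  local url=$1 max=${2:-4} delay=${3:-1} attempt=1 status
  while :; do
    status=$(http_status "$url")
    case "$status" in
      2[0-9][0-9]) echo "$status"; return 0 ;;
    esac
    if ! is_transient_status "$status" || [ "$attempt" -ge "$max" ]; then
      echo "$status"; return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))       # exponential backoff between attempts
    attempt=$(( attempt + 1 ))
  done
}
```

Because a 4xx returns immediately, a real client or auth error surfaces on the first attempt instead of being hidden behind retries. This shape only suits idempotent requests such as the artifacts GET.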
@brandonrc brandonrc merged commit e2e41d7 into main Jun 21, 2026
1 of 3 checks passed
@brandonrc brandonrc deleted the fix/release-gate-flake-hardening branch June 21, 2026 06:37
brandonrc added a commit that referenced this pull request Jun 21, 2026
…fits the 300s gate timeout (#230)

The #229 rewrite correctly fixed the deltas=0s contamination (entity_id
isolation) but set EXPECT_ATTEMPTS=3, which waits for backoff 30+60+120s + 30s
ticks + jitter (~330s) and overruns the run-suite per-test TEST_TIMEOUT=300s
(exit 124). Default EXPECT_ATTEMPTS to 1: reject attempt 1, recover on attempt 2,
validate the first real backoff interval (~30s) and the recover path — fits well
inside 300s and stays deterministic. Full 3-interval ladder still available via
EXPECT_ATTEMPTS=3 + larger timeouts. Keeps the entity_id isolation fix intact.
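The timeout arithmetic above can be reproduced with a rough estimate. The model here is an assumption based on a reading of the commit message, not taken from the scheduler code: backoff starts at 30s and doubles, each retry may wait up to one extra 30s tick, and jitter adds up to 20% to the backoff. The function name `retry_budget` is hypothetical.

```bash
# Hypothetical estimate; the backoff/tick/jitter model is an assumption.
# retry_budget INTERVALS
# Prints "base worst": total seconds without jitter, and with +20% jitter
# applied to the summed backoff.
retry_budget() {
  local intervals=$1 backoff=30 tick=30 sum=0 i=0
  while [ "$i" -lt "$intervals" ]; do
    sum=$(( sum + backoff ))
    backoff=$(( backoff * 2 ))
    i=$(( i + 1 ))
  done
  local base=$(( sum + intervals * tick ))
  local worst=$(( sum * 120 / 100 + intervals * tick ))
  echo "$base $worst"
}

retry_budget 3   # prints: 300 342
retry_budget 1   # prints: 60 66
```

Under this model, three intervals already reach the 300s limit before any jitter, which is consistent with the ~330s overrun reported above, while a single interval leaves ample headroom.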

Co-authored-by: brandonrc <brandonrc@users.noreply.github.com>