Skip to content

test(gc): raise gc_write_barrier_stress timeout 120s→300s to de-flake CI#5115

Merged
proggeramlug merged 1 commit into
mainfrom
fix/gc-stress-test-timeout
Jun 14, 2026
Merged

test(gc): raise gc_write_barrier_stress timeout 120s→300s to de-flake CI#5115
proggeramlug merged 1 commit into
mainfrom
fix/gc-stress-test-timeout

Conversation

@proggeramlug

@proggeramlug proggeramlug commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Problem

The cargo-test CI job intermittently fails on crates/perry/tests/gc_write_barrier_stress.rstenured_mutation_stress (and structured_clone_gc_churn_stress) panic with compiled binary timed out after 120s, regardless of the PR's changes (it fails identically across unrelated PRs and passes on lucky/idle runs).

Root cause — it's a timeout flake, not a correctness failure

compile_and_run executes the stress binaries under the slowest GC configuration:

run.env("PERRY_GC_FORCE_EVACUATE", "1")     // copy every marked object, every cycle
   .env("PERRY_GC_VERIFY_EVACUATION", "1"); // full-heap pointer-verification scan per cycle

against a heavy churn workload (churn(5) + churn(4) rounds × 30 000 allocations + explicit gc() each). Measured wall-clock:

config time
normal ~1.5s
FORCE_EVACUATE + VERIFY_EVACUATION ~21s (fast host)

~21s on a fast local host scales to roughly 60–130s on the slower, shared, heavily-parallel CI runners — straddling the old 120s deadline, so the run is killed (timeout panic) whenever the runner is loaded. (Issue #5029 fixed the underlying GC corruption and re-enabled these tests in #5043; this is purely the run deadline being too tight for the verify-evacuate config.)

Fix

Raise COMPILED_BINARY_TIMEOUT from 120s to 300s. The tests assert correctness (BARRIER_STRESS_OK), not speed, so widening the deadline de-flakes CI without trimming the stress coverage (still force-evacuate + full verify across all churn cycles). 300s is ~14× the measured fast-host time and ~2.5× the worst-case loaded-CI estimate.

Verified locally: both configs still print BARRIER_STRESS_OK; the verify-evacuate run completes in ~21s.

This de-flakes the cargo-test job for every open PR. No version bump / changelog per maintainer instruction.

Summary by CodeRabbit

  • Tests
    • Adjusted garbage collection stress test timeout parameters to accommodate slower execution environments and allow adequate headroom for test completion.

…e CI

The two stress binaries run under PERRY_GC_FORCE_EVACUATE=1 +
PERRY_GC_VERIFY_EVACUATION=1 (copy every object + full-heap verify scan per
GC cycle). Measured ~1.5s normal vs ~21s under that config on a fast host,
which scales to ~60-130s on slower/loaded CI runners — right at the 120s
budget, so the cargo-test job intermittently timed out (panic at the run
deadline) regardless of the PR's changes. The tests assert correctness
(BARRIER_STRESS_OK), not speed, so widen the deadline rather than trim the
stress coverage.
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 46129171-ed67-47bf-afe7-c60ba6daac44

📥 Commits

Reviewing files that changed from the base of the PR and between 5b61919 and 389e2ec.

📒 Files selected for processing (1)
  • crates/perry/tests/gc_write_barrier_stress.rs

📝 Walkthrough

Walkthrough

The COMPILED_BINARY_TIMEOUT constant in the GC write barrier stress test is increased from 120 seconds to 300 seconds. Explanatory comments are added noting that the test runs with forced GC evacuation and verification enabled, which can cause significantly longer runtimes on slower CI runners.

Changes

GC Write Barrier Stress Test Timeout

Layer / File(s) Summary
Stress test timeout constant
crates/perry/tests/gc_write_barrier_stress.rs
COMPILED_BINARY_TIMEOUT raised from 120s to 300s; comments added explaining forced evacuation and verification settings drive the longer allowed runtime.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

🐇 A bunny once waited with patience so grand,
Two minutes weren't enough — the GC had more planned!
With evacuation and verification in tow,
Five whole minutes now let the stress test flow.
Hop hop, no more timeouts — watch the barriers go! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely identifies the main change: raising the GC write barrier stress test timeout from 120s to 300s to fix CI flakiness.
Description check ✅ Passed The PR description provides a comprehensive explanation of the problem, root cause, fix, and verification, though it doesn't follow the template structure with explicit sections like Summary, Changes, Test plan, and Checklist.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/gc-stress-test-timeout

Comment @coderabbitai help to get the list of available commands and usage tips.

@proggeramlug proggeramlug merged commit 5881584 into main Jun 14, 2026
13 of 14 checks passed
@proggeramlug proggeramlug deleted the fix/gc-stress-test-timeout branch June 14, 2026 06:05
proggeramlug added a commit that referenced this pull request Jun 14, 2026
…ocking gate) (#5116)

The two GC write-barrier stress tests run compiled binaries under the
slowest GC configuration (PERRY_GC_FORCE_EVACUATE + PERRY_GC_VERIFY_EVACUATION)
to hunt a *rare* corruption window (#5029). They're ~200s each and
nondeterministic by nature, which makes them a poor fit for the blocking
per-PR `cargo-test` gate — one flake blocked every unrelated PR (e.g.
#5115, a one-line test-only change, failed `tenured_mutation_stress`).

- `#[ignore]` both tests so the per-PR `cargo test -p perry` skips them.
  The gate stays meaningful (all the fast, deterministic unit/integration
  tests still block) and runs ~6-7 min faster.
- Add an opt-in, non-blocking `gc-stress` CI job (`continue-on-error`,
  gated by the `run-extended-tests` label / `workflow_dispatch` /
  tag push, like the existing parity/compile-smoke jobs) that runs them
  with `--ignored`. The signal is preserved without blocking PRs.

Run locally with:
  cargo test -p perry --test gc_write_barrier_stress -- --ignored

The underlying corruption (#5029) is real (verify-evacuation only aborts
on a genuine un-rewritten live slot) and should stay tracked / reopened;
this change just stops a nondeterministic stress test from gating every PR.

Co-authored-by: Ralph Küpper <ralph2@skelpo.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant