
fix(gc): close the #5029 conservative-scan × evacuation corruption — re-enable write-barrier stress tests #5043

Merged: proggeramlug merged 1 commit into main from fix/gc-remembered-multipage on Jun 12, 2026


@proggeramlug

Fixes #5029.

The bug chain (all empirically measured — full trail in the #5029 comments)

The stress suite has been red since #4998 made explicit gc() run the Full conservative native-stack scan. The corruption is real (clone fields read recycled memory), and the root cause is an interaction chain, not a single defect:

  1. A stale stack word resurrects a dead old object. The churn loop leaves dead frame slots pointing at a previous round's 256 KB array backing (born directly in old-gen as a large allocation). The conservative Full scan marks it as a trace seed every manual gc().
  2. Its slots dangle into recycled nursery memory. The dead array's children were legitimately swept cycles ago; when fresh nursery blocks are later allocated over those freed ranges, the dead object's slots suddenly alias live young objects (pointer_in_nursery flips false→true for unchanged slot bits — measured at panic time: 7710 aliased slots, page-pattern all-clean, manual barrier replay covers instantly).
  3. Tracing/evacuating/rewriting through the aliases corrupts the heap, and the old-young-edge verifier reports them as the deterministic missing_edges=7710 signature.
  4. Independently, old-page defrag moved objects whose unmarked-old referrers were skipped by the rewrite pass (753 skipped stale referrers measured per evacuating cycle), and one un-rewritable referrer surface remains: codegen raw-pointer globals, tracked in #5042 ("GC: old-page defrag unsound on conservative-scan cycles", an un-rewritten referrer of moved old objects, class-keys global suspected).
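Step 2 of the chain is easy to misread, so here is a toy model of it. This is a minimal sketch, not the runtime's code: `ToyNursery`, `contains` and `aliased_slots` are hypothetical names, and the nursery is reduced to a list of address ranges. It only shows how a range check in the style of `pointer_in_nursery` can flip from false to true for slot bits that never changed.

```rust
use std::ops::Range;

/// Toy nursery: only the address ranges of currently allocated blocks.
/// Hypothetical model, not the runtime's data structure.
struct ToyNursery {
    blocks: Vec<Range<usize>>,
}

impl ToyNursery {
    /// Stand-in for a `pointer_in_nursery`-style range check.
    fn contains(&self, addr: usize) -> bool {
        self.blocks.iter().any(|b| b.contains(&addr))
    }
}

/// Count the slots whose raw bits currently land inside the nursery.
fn aliased_slots(slots: &[usize], nursery: &ToyNursery) -> usize {
    slots.iter().filter(|&&s| nursery.contains(s)).count()
}

fn main() {
    // Slots of a dead old object; the children they named were swept long ago.
    let dead_slots = [0x1000, 0x1040, 0x1080];

    // While the freed range is unallocated, no slot looks like a nursery pointer.
    let mut nursery = ToyNursery { blocks: vec![] };
    println!("before reuse: {}", aliased_slots(&dead_slots, &nursery));

    // A fresh nursery block lands on the freed range. The slot bits are
    // unchanged, yet every slot now aliases whatever lives in the new block.
    nursery.blocks.push(0x1000..0x2000);
    println!("after reuse:  {}", aliased_slots(&dead_slots, &nursery));
}
```

In this model nothing writes to the dead object between the two counts, which matches the observation that no write barrier could have recorded these edges.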

The four fixes

| # | Change | Failure mode it closes |
|---|--------|------------------------|
| 1 | Conservative discoveries classified Old are pin-only in minors (CONS_PINNED, no mark/trace seed); full collections keep mark+trace for #4977 | resurrection-tracing through dangling slots; the phantom verifier signature |
| 2 | rewrite_heap_objects / verify_heap_objects no longer skip unmarked non-nursery objects (unmarked ≠ dead outside the nursery in a minor) | un-rewritten old→old referrers of evacuated targets |
| 3 | restore_surviving_dirty_coverage(): post-clear remembered-set repair using the same walk the verifier uses (address-validated before any header deref; the reclaim unit tests seed synthetic entries) | coverage drops the from-scratch rebuild could disagree on; the copying-path to-survivor re-remember gap (ptrs.decode_bits → None for to-space children) |
| 4 | Old-page defrag skipped on conservative-scan cycles (copying minors never scan conservatively, so steady-state defrag is unaffected) | moved old objects with un-rewritable referrers; containment until #5042 lands |
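Fixes 1 and 4 are policy decisions, so they can be sketched as two small pure functions. This is an illustrative sketch under assumptions: `CycleKind`, `OldDiscoveryAction`, `old_discovery_action` and `old_page_defrag_allowed` are hypothetical names, not the runtime's API, and any pinning that full collections perform is deliberately not modelled.

```rust
/// Kind of collection cycle (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CycleKind {
    Minor,
    Full,
}

/// What to do with an old-generation object found by the conservative scan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OldDiscoveryAction {
    /// Pin only (the CONS_PINNED behaviour): no mark, no trace seed.
    PinOnly,
    /// Keep marking and tracing, as full collections do for #4977.
    MarkAndTrace,
}

/// Fix 1: old-generation conservative discoveries are pin-only in minors.
fn old_discovery_action(cycle: CycleKind) -> OldDiscoveryAction {
    match cycle {
        CycleKind::Minor => OldDiscoveryAction::PinOnly,
        CycleKind::Full => OldDiscoveryAction::MarkAndTrace,
    }
}

/// Fix 4: old-page defrag is skipped on cycles that ran the conservative scan.
fn old_page_defrag_allowed(ran_conservative_scan: bool) -> bool {
    !ran_conservative_scan
}

fn main() {
    println!("minor: {:?}", old_discovery_action(CycleKind::Minor));
    println!("full:  {:?}", old_discovery_action(CycleKind::Full));
    println!("defrag after conservative scan: {}", old_page_defrag_allowed(true));
    println!("defrag after copying minor:     {}", old_page_defrag_allowed(false));
}
```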

The two stress tests are re-enabled (they were #[ignore]d in #5033 to unblock CI).

Why a live old object loses nothing from fix 1

Its real old→young edges are barrier-covered: measured ~60k mark_dirty_old_page calls per inter-cycle window, all landing on the right pages — minors only ever find old→young edges through the remembered set anyway. Retention doesn't need the mark (minors don't sweep old-gen), and CONS_PINNED already blocks every evacuation path.
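The barrier-coverage argument can be made concrete with a toy page-granular remembered set. This is a sketch under assumptions: `ToyRememberedSet` and `covers` are hypothetical, the 4096-byte `PAGE_SIZE` is an assumed value, and `mark_dirty_old_page` here is a simplified stand-in that borrows the name only, not the runtime's signature.

```rust
use std::collections::BTreeSet;

/// Assumed page size for the toy model; the runtime's value may differ.
const PAGE_SIZE: usize = 4096;

/// Toy remembered set: the indices of old-gen pages that were written to.
#[derive(Default)]
struct ToyRememberedSet {
    dirty_pages: BTreeSet<usize>,
}

impl ToyRememberedSet {
    /// Toy write barrier: record the page that contains the written slot.
    fn mark_dirty_old_page(&mut self, slot_addr: usize) {
        self.dirty_pages.insert(slot_addr / PAGE_SIZE);
    }

    /// A minor would rescan this slot only if its page is dirty.
    fn covers(&self, slot_addr: usize) -> bool {
        self.dirty_pages.contains(&(slot_addr / PAGE_SIZE))
    }
}

fn main() {
    let mut remembered = ToyRememberedSet::default();

    // A live old object stores a young pointer; the barrier fires on the store.
    let live_slot = 0x20_0010;
    remembered.mark_dirty_old_page(live_slot);

    // The edge is found through the remembered set, with no mark bit involved.
    println!("live slot covered: {}", remembered.covers(live_slot));

    // A slot that was never written since the last clear has no coverage.
    println!("untouched slot covered: {}", remembered.covers(0x30_0000));
}
```

In this model a live object's edges are found whether or not the object is marked, which is the property fix 1 relies on.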

Validation

Follow-up

#5042 — register codegen raw-pointer globals (perry_class_keys_* et al.) as mutable roots, then lift the fix-4 defrag gate and extend PERRY_GC_VERIFY_EVACUATION to walk those tables.

No version bump / changelog — maintainer folds metadata at merge.

…re-enable write-barrier stress tests

Four coordinated fixes, each addressing a measured failure mode of the
gc_write_barrier_stress suite (red since #4998 made explicit gc() run the
Full conservative native-stack scan):

1. roots: conservative discoveries in the OLD generation are pin-only in
   MINOR cycles (CONS_PINNED, no mark, no trace seed). A stale stack word
   can resurrect a DEAD old object whose slots still point into long-swept
   nursery memory; once fresh nursery blocks land on those freed ranges the
   slots alias live young objects, and tracing/evacuating/rewriting through
   them corrupts the heap (and produced the deterministic
   missing_edges=7710 verifier signature on a dead 256 KB array backing).
   Minors never sweep the old gen, so the mark is not needed for survival,
   and a LIVE old object's real old->young edges are dirty-page-covered by
   the write barriers (measured: ~60k barrier calls per inter-cycle window,
   all landing correctly). FULL collections keep mark+trace (#4977).

2. verify/rewrite: rewrite_heap_objects and verify_heap_objects no longer
   skip UNMARKED non-nursery objects. Being unmarked in a minor is the
   normal state of a live old object, not a sign of death; old->old
   references have no remembered-set coverage, so this walk is the only
   pass that re-points an old referrer at an evacuated target before the
   forwarding stubs are released (measured: 753 skipped stale referrers
   per evacuating cycle).

3. remembered set: restore_surviving_dirty_coverage() re-derives kept
   pages after remembered_set_clear from the SAME walk the old-young-edge
   verifier uses, so a still-needed page can never be dropped (also closes
   the copying-path re-remember gap where ptrs.decode_bits returns None
   for freshly copied to-survivor children). External entries are
   validated address-first (page classify / malloc registry) before any
   header dereference - the reclaim unit tests seed synthetic entries.

4. policy: old-page defrag (C4b compaction) is skipped on cycles that ran
   the conservative stack scan. Conservative stack words cannot be
   rewritten after a move and CONS_PINNED only covers direct discoveries;
   the stress suite demonstrated a moved old object with an un-rewritten
   referrer (shape-table lookups through it returned recycled memory).
   Copying minors never run the conservative scan, so steady-state defrag
   is unaffected. Follow-up to lift the gate: #5042 (codegen raw-pointer
   globals, e.g. perry_class_keys_*, need mutable-root registration).
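
Item 2 above can be illustrated with a toy rewrite walk. This is a minimal sketch, not the actual rewrite_heap_objects: `ToyObject`, `rewrite_heap` and the `skip_unmarked_old` switch are hypothetical, and forwarding is modelled as a plain address map.

```rust
use std::collections::HashMap;

/// Toy heap object (hypothetical layout for illustration).
struct ToyObject {
    in_nursery: bool,
    marked: bool,
    /// Raw addresses of the objects this one refers to.
    slots: Vec<usize>,
}

/// Re-point slots at evacuated targets; returns how many slots were rewritten.
/// `skip_unmarked_old = true` reproduces the pre-fix behaviour.
fn rewrite_heap(
    objects: &mut [ToyObject],
    forwarding: &HashMap<usize, usize>,
    skip_unmarked_old: bool,
) -> usize {
    let mut rewritten = 0;
    for obj in objects.iter_mut() {
        // Unmarked nursery objects are dead in a minor and are skipped.
        // Unmarked old objects are normally live and must still be walked.
        if !obj.marked && (obj.in_nursery || skip_unmarked_old) {
            continue;
        }
        for slot in obj.slots.iter_mut() {
            if let Some(&new_addr) = forwarding.get(&*slot) {
                *slot = new_addr;
                rewritten += 1;
            }
        }
    }
    rewritten
}

fn main() {
    // One live but unmarked old referrer of an object that moved 0x100 -> 0x900.
    let forwarding: HashMap<usize, usize> = [(0x100, 0x900)].into_iter().collect();
    let mut heap = vec![ToyObject { in_nursery: false, marked: false, slots: vec![0x100] }];

    let before = rewrite_heap(&mut heap, &forwarding, true);
    println!("pre-fix walk rewrote {} slots, slot = {:#x}", before, heap[0].slots[0]);

    let after = rewrite_heap(&mut heap, &forwarding, false);
    println!("post-fix walk rewrote {} slots, slot = {:#x}", after, heap[0].slots[0]);
}
```

With the skip enabled the old referrer keeps the stale address, which in this model is the kind of dangling old->old reference that item 2 removes.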

Validation: gc_write_barrier_stress 2/2 across repeated runs (re-enabled,
previously #[ignore]d); standalone repro matrix 23/23 clean across
FORCE_EVACUATE+VERIFY_EVACUATION, PERRY_CONSERVATIVE_STACK_SCAN=full,
default, PERRY_GEN_GC=0 and PERRY_GEN_GC_EVACUATE=0; gc unit suite
377/377; perry-runtime lib green (single known macOS date flake); perry
bin+integration suites green.
@proggeramlug merged commit dac3e31 into main on Jun 12, 2026
13 checks passed
@proggeramlug deleted the fix/gc-remembered-multipage branch on June 12, 2026 16:25

Development

Successfully merging this pull request may close these issues.

CI: gc_write_barrier_stress red on main (missing old→young remembered-set edges → segfault)
