fix(gc): explicit gc() forces the conservative native-stack scan (#4977)#4998

Merged: proggeramlug merged 2 commits into main from fix/gc-explicit-collect-stack-scan-4977 on Jun 11, 2026

Conversation

@proggeramlug

Contributor

Fixes #4977.

Problem

Explicit gc() (a full collection with the default auto stack-scan mode) skipped the conservative native-stack scan: gc/roots.rs::conservative_stack_scan_decision_for maps Auto to SkipDisabled. At a gc() callsite, live module-init/top-level locals can be held only on the native stack, where neither the precise shadow-stack roots nor the module-var scanners cover them, so the collector reclaimed the whole object graph and later field reads returned dangling-pointer garbage. The result is silent corruption with no crash:

const keep = { nested: { deep: "leaf-string-4916" } };
gc();
console.log(keep.nested.deep); // garbage (reused string memory, e.g. "globalThis")
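
To make the failure mode concrete, here is a toy model of the repro in Rust. It is only a sketch under stated assumptions: `Obj`, `mark`, and `conservative_roots` are hypothetical names, heap "addresses" are plain vector indices, and none of this is the runtime's actual collector. It shows that an object graph referenced only from native-stack words is unreachable from the precise roots alone and becomes reachable once those words are scanned conservatively.

```rust
use std::collections::HashSet;

/// Toy heap object (hypothetical): only a list of child indices.
struct Obj {
    children: Vec<usize>,
}

/// Mark everything reachable from `roots` and return the live indices.
fn mark(heap: &[Obj], roots: &[usize]) -> HashSet<usize> {
    let mut live = HashSet::new();
    let mut work: Vec<usize> = roots.to_vec();
    while let Some(i) = work.pop() {
        if i < heap.len() && live.insert(i) {
            work.extend(heap[i].children.iter().copied());
        }
    }
    live
}

/// Conservative scan: every stack word that looks like a valid heap
/// index is treated as a root, whether or not it really is a pointer.
fn conservative_roots(stack_words: &[usize], heap_len: usize) -> Vec<usize> {
    stack_words.iter().copied().filter(|&w| w < heap_len).collect()
}

fn main() {
    // keep -> nested -> deep, mirroring the object graph in the repro.
    let heap = vec![
        Obj { children: vec![1] }, // 0: keep
        Obj { children: vec![2] }, // 1: nested
        Obj { children: vec![] },  // 2: deep (the string)
    ];
    // Assumption: no precise root covers `keep`; only a stack word does.
    let precise_roots: Vec<usize> = vec![];
    let stack_words = vec![0usize, 9999]; // `keep` plus an unrelated word

    // Scan skipped: nothing is marked, so the whole graph would be swept.
    let skipped = mark(&heap, &precise_roots);
    assert!(skipped.is_empty());

    // Scan enabled: the stack word keeps the whole graph alive.
    let mut roots = precise_roots.clone();
    roots.extend(conservative_roots(&stack_words, heap.len()));
    let scanned = mark(&heap, &roots);
    assert_eq!(scanned.len(), 3);

    println!("live without scan: {}, with scan: {}", skipped.len(), scanned.len());
}
```

The unrelated word `9999` is filtered out here because it falls outside the toy heap; a real conservative scanner has to make a comparable "does this word point into the heap" decision for every stack word.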

Fix

  • gc/roots.rs: new ManualGcScanGuard — pins the Full conservative scan for the duration of a manual collection, only when no per-thread override is already pinned. The GC unit tests pin Auto inside their controlled-root scopes (so forced collections still reclaim objects held only as native-stack test locals) and keep working unchanged; an explicit PERRY_CONSERVATIVE_STACK_SCAN env value beats any override either way, so the bisection escape hatch is intact.
  • gc/policy.rs: both manual-collect paths — direct js_gc_collect and the deferred-flush Collect(Manual) arm — now share one manual_gc_collect_now() helper that engages the guard (they previously duplicated the weakref + collect sequence).
  • gc/heap_snapshot.rs: the #4916 workaround ("Diagnostics fakes: v8 heap snapshot is an empty-but-valid graph; inspector/repl sessions look real but aren't") that wrapped its collect in the Full override is superseded and removed; js_gc_collect now provides the guarantee.
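
The guard behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in gc/roots.rs: `ScanMode`, `SCAN_OVERRIDE`, `effective_mode`, `SketchManualScanGuard`, and `manual_collect` are hypothetical stand-ins, and the env value is passed as a parameter instead of being read from PERRY_CONSERVATIVE_STACK_SCAN so the example stays deterministic.

```rust
use std::cell::Cell;

/// Hypothetical scan modes; the real enum in gc/roots.rs may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ScanMode {
    Auto,
    Full,
    Off,
}

thread_local! {
    /// Per-thread override; `None` means nothing is pinned.
    static SCAN_OVERRIDE: Cell<Option<ScanMode>> = Cell::new(None);
}

/// Resolution order as described in the PR: an explicit env value wins,
/// then a pinned per-thread override, then the default Auto.
fn effective_mode(env: Option<ScanMode>) -> ScanMode {
    env.or_else(|| SCAN_OVERRIDE.with(|o| o.get()))
        .unwrap_or(ScanMode::Auto)
}

/// RAII guard: pins Full only when the thread has no override pinned.
struct SketchManualScanGuard {
    engaged: bool,
}

impl SketchManualScanGuard {
    fn new() -> Self {
        let engaged = SCAN_OVERRIDE.with(|o| {
            if o.get().is_none() {
                o.set(Some(ScanMode::Full));
                true
            } else {
                false
            }
        });
        SketchManualScanGuard { engaged }
    }
}

impl Drop for SketchManualScanGuard {
    fn drop(&mut self) {
        // Only undo a pin that this guard created.
        if self.engaged {
            SCAN_OVERRIDE.with(|o| o.set(None));
        }
    }
}

/// Stand-in for a shared manual-collect helper: engages the guard and
/// reports the mode the collection would run with.
fn manual_collect(env: Option<ScanMode>) -> ScanMode {
    let _guard = SketchManualScanGuard::new();
    effective_mode(env)
}

fn main() {
    // Unpinned: the guard engages and the manual collection runs Full.
    assert_eq!(manual_collect(None), ScanMode::Full);
    // The pin is released once the guard drops.
    assert_eq!(effective_mode(None), ScanMode::Auto);

    // Pinned Auto (as the GC unit tests do): the guard leaves it alone.
    SCAN_OVERRIDE.with(|o| o.set(Some(ScanMode::Auto)));
    assert_eq!(manual_collect(None), ScanMode::Auto);

    // An explicit env value beats any override.
    assert_eq!(manual_collect(Some(ScanMode::Off)), ScanMode::Off);

    SCAN_OVERRIDE.with(|o| o.set(None));
    println!("ok");
}
```

Tying the pin to a guard's lifetime means the override is released on every exit path of the collection, including early returns, which is one plausible reason to prefer it over paired set/reset calls.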

Threshold-triggered automatic collections are intentionally untouched — the Auto skip exists for copied-minor eligibility and per-cycle cost; this change scopes the full scan to explicit collections where the caller observably holds live state at the callsite.

Validation

  • Repro (test-files/test_issue_4977_gc_toplevel_locals.ts, object literal + class instance): prints 16 / leaf-string-4916 / widget-name-4977 / 1 with the fix; previous binary printed 10 / globalThis / <empty> / 1.
  • PERRY_CONSERVATIVE_STACK_SCAN=0 still reproduces the skip (env precedence verified at runtime); PERRY_GEN_GC=0 legacy path unaffected.
  • RUST_TEST_THREADS=1 cargo test --release -p perry-runtime gc:: → 377 passed / 0 failed, including the new manual_gc_scan_guard_forces_full_scan_only_when_unpinned covering both the unpinned-engage and pinned-respect cases.

Code-only PR — no version bump / changelog (maintainer folds metadata at merge).

Ralph Küpper and others added 2 commits June 11, 2026 13:16
In the default auto scan mode a full collection skips the conservative
native-stack scan, but at a gc() callsite live module-init/top-level
locals may be held only on the native stack — neither the precise
shadow-stack roots nor the module-var scanners cover them — so the
collector reclaimed live object graphs and later field reads returned
dangling-pointer garbage (silent corruption, no crash).

Fix: ManualGcScanGuard pins the Full conservative scan for the duration
of a manual collection, on both the direct js_gc_collect path and the
deferred-flush arm (now shared via manual_gc_collect_now). The guard
respects an already-pinned per-thread override (the GC unit tests pin
Auto so forced collections still reclaim native-stack locals), and an
explicit PERRY_CONSERVATIVE_STACK_SCAN env value beats any override, so
the bisection escape hatch keeps working. The heap-snapshot workaround
that wrapped its collect in the Full override (#4916) is superseded and
removed.

Verified: repro prints 16 / leaf-string-4916 / widget-name-4977 (was
dangling-pointer garbage); PERRY_CONSERVATIVE_STACK_SCAN=0 still
reproduces the skip (env precedence intact); PERRY_GEN_GC=0 unaffected;
gc:: unit suite 377 passed / 0 failed.
@proggeramlug proggeramlug merged commit 44fd5c3 into main Jun 11, 2026
1 check passed
@proggeramlug proggeramlug deleted the fix/gc-explicit-collect-stack-scan-4977 branch June 11, 2026 11:18
proggeramlug pushed a commit that referenced this pull request Jun 12, 2026
…remembered-set fix

Both tests fail on a PRE-EXISTING remembered-set coverage bug exposed when
explicit gc() started using the Full conservative stack scan (#4998): minor
cycles drop legitimate old->young dirty-page coverage, live nursery children
of old-gen large objects are swept while still referenced, and forced
evacuation corrupts through the dangling slots. Full root-cause trail
(bisect, knob matrix, instrumentation) lives in #5029. Re-enable when the
coverage fix lands.
proggeramlug added a commit that referenced this pull request Jun 12, 2026
…re-enable write-barrier stress tests (#5043)

Four coordinated fixes, each addressing a measured failure mode of the
gc_write_barrier_stress suite (red since #4998 made explicit gc() run the
Full conservative native-stack scan):

1. roots: conservative discoveries in the OLD generation are pin-only in
   MINOR cycles (CONS_PINNED, no mark, no trace seed). A stale stack word
   can resurrect a DEAD old object whose slots still point into long-swept
   nursery memory; once fresh nursery blocks land on those freed ranges the
   slots alias live young objects, and tracing/evacuating/rewriting through
   them corrupts the heap (and produced the deterministic
   missing_edges=7710 verifier signature on a dead 256 KB array backing).
   Minors never sweep the old gen, so the mark is not needed for survival,
   and a LIVE old object's real old->young edges are dirty-page-covered by
   the write barriers (measured: ~60k barrier calls per inter-cycle window,
   all landing correctly). FULL collections keep mark+trace (#4977).

2. verify/rewrite: rewrite_heap_objects and verify_heap_objects no longer
   skip UNMARKED non-nursery objects. Being unmarked in a minor is the
   normal state of a live old object, not a sign of death; old->old
   references have no remembered-set coverage, so this walk is the only
   pass that re-points an old referrer at an evacuated target before the
   forwarding stubs are released (measured: 753 skipped stale referrers
   per evacuating cycle).

3. remembered set: restore_surviving_dirty_coverage() re-derives kept
   pages after remembered_set_clear from the SAME walk the old-young-edge
   verifier uses, so a still-needed page can never be dropped (also closes
   the copying-path re-remember gap where ptrs.decode_bits returns None
   for freshly copied to-survivor children). External entries are
   validated address-first (page classify / malloc registry) before any
   header dereference - the reclaim unit tests seed synthetic entries.

4. policy: old-page defrag (C4b compaction) is skipped on cycles that ran
   the conservative stack scan. Conservative stack words cannot be
   rewritten after a move and CONS_PINNED only covers direct discoveries;
   the stress suite demonstrated a moved old object with an un-rewritten
   referrer (shape-table lookups through it returned recycled memory).
   Copying minors never run the conservative scan, so steady-state defrag
   is unaffected. Follow-up to lift the gate: #5042 (codegen raw-pointer
   globals, e.g. perry_class_keys_*, need mutable-root registration).
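
The two policy decisions in items 1 and 4 above can be summarised as a pair of small pure functions. This is a hedged sketch: `Cycle`, `Generation`, `ConservativeAction`, `conservative_action`, and `may_defrag_old_pages` are hypothetical names, and the handling of nursery discoveries in minor cycles is an assumption, since the commit message only spells out the old-generation case.

```rust
/// Hypothetical cycle kinds, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Cycle {
    Minor,
    Full,
}

/// Hypothetical generations, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Generation {
    Nursery,
    Old,
}

/// What to do with an object discovered by the conservative stack scan.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ConservativeAction {
    /// Pin the object so it cannot move; do not mark it or trace from it.
    PinOnly,
    /// Pin, mark, and use the object as a trace seed.
    PinMarkAndTrace,
}

/// Item 1: old-generation discoveries are pin-only in minor cycles,
/// because a stale stack word may point at a dead old object whose
/// slots reference long-swept nursery memory.
fn conservative_action(cycle: Cycle, generation: Generation) -> ConservativeAction {
    match (cycle, generation) {
        (Cycle::Minor, Generation::Old) => ConservativeAction::PinOnly,
        // Full collections keep mark+trace. Treating nursery discoveries
        // in minors the same way is an assumption of this sketch.
        _ => ConservativeAction::PinMarkAndTrace,
    }
}

/// Item 4: old-page defrag is skipped on cycles that ran the
/// conservative scan, since stack words cannot be rewritten after a move.
fn may_defrag_old_pages(ran_conservative_scan: bool) -> bool {
    !ran_conservative_scan
}

fn main() {
    assert_eq!(
        conservative_action(Cycle::Minor, Generation::Old),
        ConservativeAction::PinOnly
    );
    assert_eq!(
        conservative_action(Cycle::Full, Generation::Old),
        ConservativeAction::PinMarkAndTrace
    );
    assert!(!may_defrag_old_pages(true));
    assert!(may_defrag_old_pages(false));
    println!("ok");
}
```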

Validation: gc_write_barrier_stress 2/2 across repeated runs (re-enabled,
previously #[ignore]d); standalone repro matrix 23/23 clean across
FORCE_EVACUATE+VERIFY_EVACUATION, PERRY_CONSERVATIVE_STACK_SCAN=full,
default, PERRY_GEN_GC=0 and PERRY_GEN_GC_EVACUATE=0; gc unit suite
377/377; perry-runtime lib green (single known macOS date flake); perry
bin+integration suites green.

Co-authored-by: Ralph Küpper <ralph@skelpo.com>
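
Item 2 of the commit message above (no longer skipping unmarked non-nursery objects during rewrite) can be illustrated with a toy forwarding pass. This is a sketch under stated assumptions, not the runtime's rewrite_heap_objects: `Obj`, `rewrite_heap`, and the `skip_unmarked_old` flag are hypothetical, and addresses are plain integers looked up in a forwarding map.

```rust
use std::collections::HashMap;

/// Toy heap object (hypothetical layout).
struct Obj {
    marked: bool,
    in_nursery: bool,
    /// Addresses of the objects this one refers to.
    slots: Vec<usize>,
}

/// Re-point every slot that refers to an evacuated object at its new
/// address. `skip_unmarked_old` models the previous behaviour of
/// skipping unmarked non-nursery objects. Returns the rewrite count.
fn rewrite_heap(
    objs: &mut [Obj],
    forwarding: &HashMap<usize, usize>,
    skip_unmarked_old: bool,
) -> usize {
    let mut rewritten = 0;
    for obj in objs.iter_mut() {
        if skip_unmarked_old && !obj.marked && !obj.in_nursery {
            continue; // previous behaviour: a live old referrer is missed
        }
        for slot in obj.slots.iter_mut() {
            if let Some(&new_addr) = forwarding.get(&*slot) {
                *slot = new_addr;
                rewritten += 1;
            }
        }
    }
    rewritten
}

fn main() {
    // The object at 0x1000 was evacuated to 0x2000.
    let mut forwarding: HashMap<usize, usize> = HashMap::new();
    forwarding.insert(0x1000, 0x2000);

    // A live old object that is unmarked, as is normal in a minor cycle.
    let make = || vec![Obj { marked: false, in_nursery: false, slots: vec![0x1000] }];

    // Skipping it leaves a stale slot behind.
    let mut skipped = make();
    assert_eq!(rewrite_heap(&mut skipped, &forwarding, true), 0);
    assert_eq!(skipped[0].slots, vec![0x1000]);

    // Walking it re-points the slot at the evacuated target.
    let mut walked = make();
    assert_eq!(rewrite_heap(&mut walked, &forwarding, false), 1);
    assert_eq!(walked[0].slots, vec![0x2000]);

    println!("ok");
}
```

In this model the stale slot in the skipped case would dangle as soon as the forwarding information is released, which matches the failure the commit message attributes to skipped stale referrers.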
Development

Successfully merging this pull request may close these issues.

GC: explicit gc() (full collect, default auto stack-scan) reclaims live top-level locals — string fields read back as garbage
