GC: old-page defrag unsound on conservative-scan cycles — un-rewritten referrer of moved old objects (class-keys global suspected) #5042

@proggeramlug

Context

Found while fixing #5029. With the #5029 fixes in place, one corruption vector remained. Experiments isolate it to old-page evacuation (C4b old-gen defrag): with defrag selection disabled, the structured_clone_gc_churn_stress workload passes 11/11 across the full knob matrix; with it enabled, the clone root is corrupted deterministically on the cycle where old_page_moved_objects > 0.

Evidence

  • The strengthened verify_heap_objects (now covering unmarked old objects) finds no stale forwarded refs in any walked heap object after the rewrite — so the dangling referrer is NOT a heap slot covered by rewrite_forwarded_references.
  • Failure shape: with a 302-property cloned object, shape-table-based property lookups break (cl["f" + i] undefined for ~295 props) while inline-offset fast-path fields (f0–f4) and a direct string field survive. This pattern points at a codegen-emitted global holding a raw pointer to a moved object — prime suspect: the per-class @perry_class_keys_<module>__<class> globals (shared keys_array pointer, built once at module init). If those globals are not registered as FFI mutable roots, an old-page move of the keys array leaves the global dangling.
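To illustrate the suspected mechanism, here is a toy model in Rust. It is not Perry's GC: `ToyHeap`, `rewrite_registered`, and the addresses are hypothetical, and object pointers are modelled as plain `u64` values. The sketch only shows why a slot the rewrite pass never visits keeps the pre-move address.

```rust
use std::collections::HashMap;

/// Toy model (assumption, not Perry's real layout): object "addresses" are
/// plain u64 values, and a move is recorded as old address -> new address.
struct ToyHeap {
    forwarding: HashMap<u64, u64>,
}

impl ToyHeap {
    /// Rewrite one slot in place if it points at a moved object.
    fn rewrite_slot(&self, slot: &mut u64) {
        if let Some(&new_addr) = self.forwarding.get(&*slot) {
            *slot = new_addr;
        }
    }

    /// Stand-in for a rewrite pass: it can only fix slots it was told about.
    fn rewrite_registered(&self, registered: &mut [&mut u64]) {
        for slot in registered.iter_mut() {
            self.rewrite_slot(&mut **slot);
        }
    }
}

/// Simulates one evacuation of a shared keys array and returns
/// (registered_global, unregistered_global) as they look afterwards.
fn simulate_keys_array_move() -> (u64, u64) {
    const OLD_KEYS: u64 = 0x1000; // hypothetical pre-move address
    const NEW_KEYS: u64 = 0x2000; // hypothetical post-move address

    // Two globals that both cache the keys-array pointer at "module init".
    let mut registered_global = OLD_KEYS;
    let unregistered_global = OLD_KEYS;

    // The collector moves the keys array and records the forwarding entry.
    let mut forwarding = HashMap::new();
    forwarding.insert(OLD_KEYS, NEW_KEYS);
    let heap = ToyHeap { forwarding };

    // Only the registered global is visible to the rewrite pass.
    heap.rewrite_registered(&mut [&mut registered_global]);

    (registered_global, unregistered_global)
}
```

In this model the registered global ends up at the new address while the unregistered one still holds the old address, which matches the failure shape above: anything reached through the stale pointer (here, the shared keys array) breaks, while data stored inline in the object is unaffected.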

Current mitigation (shipped with the #5029 PR)

gc_collect_inner_with_trigger skips old-page defrag selection on any cycle whose conservative-stack-scan decision is Scan. Copying minors (the steady-state path) never run the conservative scan, so defrag still operates there under its own policy. This contains the corruption but leaves defrag disabled for fallback minors (e.g. every explicit gc() since #4998).
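The gate described above can be sketched as a small predicate. This is an illustration under assumptions, not the code in gc_collect_inner_with_trigger: the `ConservativeScanDecision` enum, the `Skip` variant name, and the `policy_wants_defrag` parameter are hypothetical; only the `Scan` decision is named in this issue.

```rust
/// Hypothetical model of the per-cycle conservative-stack-scan decision.
/// Only `Scan` is named in the issue; `Skip` is an assumed counterpart.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ConservativeScanDecision {
    Scan,
    Skip,
}

/// Returns true if old-page defrag selection may run on this cycle.
/// `policy_wants_defrag` stands in for defrag's own selection policy.
fn defrag_selection_allowed(
    decision: ConservativeScanDecision,
    policy_wants_defrag: bool,
) -> bool {
    match decision {
        // Mitigation: never select old pages for evacuation on a cycle
        // that runs the conservative stack scan.
        ConservativeScanDecision::Scan => false,
        // Otherwise defer to the regular defrag policy.
        ConservativeScanDecision::Skip => policy_wants_defrag,
    }
}
```

The cost of this shape is visible in the first match arm: on a `Scan` cycle the policy input is ignored entirely, which is why fallback minors never defragment while the gate is in place.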

To do

  1. Audit codegen-emitted raw-pointer globals (perry_class_keys_*, any module-var data tables holding raw I64 object pointers) for FFI mutable-root registration so the rewrite pass can fix them after moves.
  2. Re-test: drop the conservative-scan defrag gate, then run gc_write_barrier_stress plus the repro matrix from #5029 ("CI: gc_write_barrier_stress red on main (missing old→young remembered-set edges → segfault)").
  3. Consider extending verify_evacuated_no_stale_forwarded_refs to walk codegen global tables so this class of dangling root is caught by PERRY_GC_VERIFY_EVACUATION instead of manifesting as silent corruption.
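As a rough sketch of what item 3 could look like, the check below walks a table of global slots and reports any that still hold a pre-move address. Everything here is an assumption for illustration: the `find_stale_globals` helper, the `(name, address)` table shape, and the `u64` forwarding map do not reflect how verify_evacuated_no_stale_forwarded_refs or Perry's codegen globals are actually represented.

```rust
use std::collections::HashMap;

/// Hypothetical verification helper: given the forwarding table of an
/// evacuation (old address -> new address) and a table of global slots
/// (name, current value), return the names of globals that still point
/// at a moved-from address.
fn find_stale_globals<'a>(
    forwarding: &HashMap<u64, u64>,
    globals: &[(&'a str, u64)],
) -> Vec<&'a str> {
    let mut stale = Vec::new();
    for &(name, addr) in globals {
        // A slot is stale if its value is a key in the forwarding table,
        // i.e. the object it refers to has been moved elsewhere.
        if forwarding.contains_key(&addr) {
            stale.push(name);
        }
    }
    stale
}
```

A verifier built this way would fail loudly on the evacuation cycle itself, naming the offending global, instead of letting the stale pointer surface later as broken property lookups.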
