Skip to content

GC: automate write-barrier store-site audit (CI gate), barrier structuredClone object fields, add e2e barrier stress tests#4886

Merged
proggeramlug merged 1 commit into
mainfrom
gc-write-barrier-audit
Jun 10, 2026
Merged

GC: automate write-barrier store-site audit (CI gate), barrier structuredClone object fields, add e2e barrier stress tests#4886
proggeramlug merged 1 commit into
mainfrom
gc-write-barrier-audit

Conversation

@proggeramlug

Copy link
Copy Markdown
Contributor

What

Automates write-barrier coverage verification for the generational GC and wires it into CI, so a new unbarriered heap-slot store path can no longer land silently. Closes the gap where scripts/gc_store_site_inventory.py existed but ran nowhere — it was failing with 21 unaudited store sites that had accumulated since the last manual audit.

Audit results — all 21 findings + expanded-coverage sites classified

Classification Sites Examples
Already barriered (raw store immediately re-stored by note_array_slot / replayed by rebuild_*_layout) 12 array/generic.rs map/materialize/splice, array/sort.rs write-back, object/arguments.rs descriptor builders, json_tape.rs
Stack/Vec temporaries 5 array/sort.rs merge buffers
Pointer-free stores (i32 fills, tape numerics, unit-test sentinel, POD scalars) 7 expr/native_memory.rs, json_tape.rs tape copy, pod_record.rs
Init-time store of non-GC payload 1 temporal/mod.rs cell init
Genuine missing barrier → fixed 1 builtins/globals.rs structuredClone object fields

Every benign site now carries a justified inline GC_STORE_AUDIT(<CLASS>): reason marker (the checker enforces marker-or-fail).

The fix (latent gap — could not be made to fail; see below)

js_structured_clone's object branch deep-cloned fields with a bare *fields.add(i) = js_structured_clone(field), while the sibling array branch right above uses the barriered note_array_slot. The recursive clone can run arbitrarily many minor GCs mid-loop, and a wide clone target (>16KB) is malloc-backed — i.e. outside the nursery — while fields are still being installed. The store now routes through object::store_object_field_slot (same shared helper thread.rs deserialization uses).

Honesty note on reproduction: I could not make the unfixed code fail, after four targeted attempts (wide dynamic objects ± explicit gc(), wide literals ± PERRY_GC_FORCE_EVACUATE/VERIFY_EVACUATION, heavy mid-clone allocation pressure, pre/post-churn verification against a clean main build). The reason is visible in gc/trace.rs: minor GC currently skips tracing only objects that are both tenured and physically in the old-gen arena — malloc-backed and tenured-in-nursery parents are still traced, so the missed remembered-set entry has no observable symptom today. It becomes load-bearing the moment clone targets can land in arena old-gen (pretenuring, evacuation-policy changes). The fix is consistency/hardening, priced at one helper call per cloned field.

The repro attempts surfaced three real pre-existing bugs, filed separately:

Checker upgrades (scripts/gc_store_site_inventory.py)

  • Scans all of perry-runtime/src (was a hand-picked module list that missed builtins/, json_tape.rs, temporal/, and anything added since). Collector internals (gc/) are excluded via the allowlist — the barrier implementation and the GC's own unit tests perform raw stores by design.
  • Allowlist file scripts/gc_store_site_allowlist.txt: path-prefix | line-substring-or-* | justification; entries without a justification fail the run (exit 2); unused entries warn.
  • Fixed a rotted carve-out: the typed-array pointer-free exemption matched the filename typedarray.rs and silently stopped applying when the module was split into typedarray/ — now module-based, typedarray_view.rs included.
  • Matches store_aligned / store_volatile emitters in perry-codegen (previously only .store().
  • Self-tests for all of the above (--self-test).

Run it: python3 scripts/gc_store_site_inventory.py (exit 0 = clean; --json-out <path> for the machine-readable packet).

CI

The lint job now runs the checker (self-test + gate) after the file-size gate. Current state: 808 files scanned, 213 audited sites, 60 allowlisted, 0 findings.

Stress tests (crates/perry/tests/gc_write_barrier_stress.rs)

Two e2e tests (compile with CARGO_BIN_EXE_perry, run with PERRY_GC_FORCE_EVACUATE=1 PERRY_GC_VERIFY_EVACUATION=1):

  1. tenured_mutation_stress — tenure 300 objects across 5 GC cycles, then point their object fields / array elements / closure captures / module globals at fresh nursery values, churn 4 more cycles, verify everything.
  2. structured_clone_gc_churn_stress — 300-field literal + 200-deep nested chain through structuredClone, churn, full verification.

Both avoid wide dynamic objects / wide literals deliberately (#4878, #4880 — documented in the file header).

Zero-regressions check

  • perry-runtime (RUST_TEST_THREADS=1): 1005 passed, 1 failed — date::tests::test_full_year_setters_revive_invalid_date_only fails identically on unmodified main in a clean worktree (assertion left:1.0 right:0.0 at date.rs:1867), i.e. pre-existing on main as of today, not from this PR.
  • perry-codegen: all green. perry (incl. both new stress tests): all green.
  • Full-workspace run on the dev box hit the known parallel-link contention on seven perry-ext-* crates; each passes individually (this is the exact failure mode the CI job serializes CARGO_BUILD_JOBS=1 for).
  • cargo fmt --check, check_file_size.sh, checker self-test + gate: green.
  • No symptom claim: this PR does not claim to explain the known flaky test262 TypedArray/forEach segfault — no currently-reproducible missed barrier was found.

Per contributor convention, no version bump / CHANGELOG edits — maintainer folds those in at merge.

Side note for CLAUDE.md (not changed here since it's metadata-adjacent): the documented workspace test one-liner is missing --exclude perry-ui-windows-winui (crate added in #4680); without it the suite fails to build on macOS hosts.

…bject fields

The gc_store_site_inventory.py checker existed but was never wired into
CI, so new raw heap-slot stores could land unaudited — and 21 had.

Audit result of all 21 (+ the sites surfaced by expanding scan coverage):
one latent gap fixed, everything else verified barriered/init/stack/
pointer-free and annotated with justified GC_STORE_AUDIT markers.

Fix: js_structured_clone's object branch wrote deep-cloned fields with a
bare slot store while the sibling array branch used the barriered helper.
The recursive clone can run minor GCs mid-loop, so the clone target can
be malloc-backed (>16KB wide) or tenured while fields are still being
installed. Now routes through object::store_object_field_slot. Latent
today (minor GC still traces malloc/tenured-in-nursery parents); becomes
load-bearing the moment clone targets can land in arena old-gen.

Checker upgrades:
- Scan ALL of perry-runtime/src (was a hand-picked module list that
  missed builtins/, json_tape.rs, temporal/, iterator helpers, ...);
  collector internals (gc/) excluded via the new allowlist.
- scripts/gc_store_site_allowlist.txt: path|substring|justification
  entries, malformed/missing-justification lines fail the run.
- Fix the typedarray carve-out that silently stopped matching when
  typedarray.rs was split into typedarray/ (filename compare -> module
  compare; typedarray_view.rs covered too).
- Match store_aligned/store_volatile emitters in perry-codegen.
- Self-tests for all of the above.

CI: lint job now runs the checker (self-test + gate).

Stress test: crates/perry/tests/gc_write_barrier_stress.rs — tenure
objects across GC cycles, mutate them to point at fresh nursery values
(object fields, array elements, closure captures, module globals),
churn, verify; plus structuredClone integrity under churn. Both run
under PERRY_GC_FORCE_EVACUATE=1 + PERRY_GC_VERIFY_EVACUATION=1.

Pre-existing bugs found while attempting a repro (filed separately):
#4878 gc() deadlock after wide dynamic-prop object, #4879 structuredClone
drops dynamic overflow properties, #4880 wide-literal compile blowup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant