GC: automate write-barrier store-site audit (CI gate), barrier structuredClone object fields, add e2e barrier stress tests#4886
Merged
Conversation
…bject fields The gc_store_site_inventory.py checker existed but was never wired into CI, so new raw heap-slot stores could land unaudited — and 21 had. Audit result of all 21 (+ the sites surfaced by expanding scan coverage): one latent gap fixed, everything else verified barriered/init/stack/ pointer-free and annotated with justified GC_STORE_AUDIT markers. Fix: js_structured_clone's object branch wrote deep-cloned fields with a bare slot store while the sibling array branch used the barriered helper. The recursive clone can run minor GCs mid-loop, so the clone target can be malloc-backed (>16KB wide) or tenured while fields are still being installed. Now routes through object::store_object_field_slot. Latent today (minor GC still traces malloc/tenured-in-nursery parents); becomes load-bearing the moment clone targets can land in arena old-gen. Checker upgrades: - Scan ALL of perry-runtime/src (was a hand-picked module list that missed builtins/, json_tape.rs, temporal/, iterator helpers, ...); collector internals (gc/) excluded via the new allowlist. - scripts/gc_store_site_allowlist.txt: path|substring|justification entries, malformed/missing-justification lines fail the run. - Fix the typedarray carve-out that silently stopped matching when typedarray.rs was split into typedarray/ (filename compare -> module compare; typedarray_view.rs covered too). - Match store_aligned/store_volatile emitters in perry-codegen. - Self-tests for all of the above. CI: lint job now runs the checker (self-test + gate). Stress test: crates/perry/tests/gc_write_barrier_stress.rs — tenure objects across GC cycles, mutate them to point at fresh nursery values (object fields, array elements, closure captures, module globals), churn, verify; plus structuredClone integrity under churn. Both run under PERRY_GC_FORCE_EVACUATE=1 + PERRY_GC_VERIFY_EVACUATION=1. Pre-existing bugs found while attempting a repro (filed separately): #4878 gc() deadlock after wide dynamic-prop object, #4879 structuredClone drops dynamic overflow properties, #4880 wide-literal compile blowup.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Automates write-barrier coverage verification for the generational GC and wires it into CI, so a new unbarriered heap-slot store path can no longer land silently. Closes the gap where
scripts/gc_store_site_inventory.pyexisted but ran nowhere — it was failing with 21 unaudited store sites that had accumulated since the last manual audit.Audit results — all 21 findings + expanded-coverage sites classified
note_array_slot/ replayed byrebuild_*_layout)array/generic.rsmap/materialize/splice,array/sort.rswrite-back,object/arguments.rsdescriptor builders,json_tape.rsarray/sort.rsmerge buffersexpr/native_memory.rs,json_tape.rstape copy,pod_record.rstemporal/mod.rscell initbuiltins/globals.rsstructuredClone object fieldsEvery benign site now carries a justified inline
GC_STORE_AUDIT(<CLASS>): reasonmarker (the checker enforces marker-or-fail).The fix (latent gap — could not be made to fail; see below)
js_structured_clone's object branch deep-cloned fields with a bare*fields.add(i) = js_structured_clone(field), while the sibling array branch right above uses the barrierednote_array_slot. The recursive clone can run arbitrarily many minor GCs mid-loop, and a wide clone target (>16KB) is malloc-backed — i.e. outside the nursery — while fields are still being installed. The store now routes throughobject::store_object_field_slot(same shared helperthread.rsdeserialization uses).Honesty note on reproduction: I could not make the unfixed code fail, after four targeted attempts (wide dynamic objects ± explicit
gc(), wide literals ±PERRY_GC_FORCE_EVACUATE/VERIFY_EVACUATION, heavy mid-clone allocation pressure, pre/post-churn verification against a clean main build). The reason is visible ingc/trace.rs: minor GC currently skips tracing only objects that are both tenured and physically in the old-gen arena — malloc-backed and tenured-in-nursery parents are still traced, so the missed remembered-set entry has no observable symptom today. It becomes load-bearing the moment clone targets can land in arena old-gen (pretenuring, evacuation-policy changes). The fix is consistency/hardening, priced at one helper call per cloned field.The repro attempts surfaced three real pre-existing bugs, filed separately:
gc()deadlocks (pthread mutex, main thread parked) after building a wide dynamic-property objectChecker upgrades (
scripts/gc_store_site_inventory.py)perry-runtime/src(was a hand-picked module list that missedbuiltins/,json_tape.rs,temporal/, and anything added since). Collector internals (gc/) are excluded via the allowlist — the barrier implementation and the GC's own unit tests perform raw stores by design.scripts/gc_store_site_allowlist.txt:path-prefix | line-substring-or-* | justification; entries without a justification fail the run (exit 2); unused entries warn.typedarray.rsand silently stopped applying when the module was split intotypedarray/— now module-based,typedarray_view.rsincluded.store_aligned/store_volatileemitters in perry-codegen (previously only.store().--self-test).Run it:
python3 scripts/gc_store_site_inventory.py(exit 0 = clean;--json-out <path>for the machine-readable packet).CI
The
lintjob now runs the checker (self-test + gate) after the file-size gate. Current state: 808 files scanned, 213 audited sites, 60 allowlisted, 0 findings.Stress tests (
crates/perry/tests/gc_write_barrier_stress.rs)Two e2e tests (compile with
CARGO_BIN_EXE_perry, run withPERRY_GC_FORCE_EVACUATE=1 PERRY_GC_VERIFY_EVACUATION=1):tenured_mutation_stress— tenure 300 objects across 5 GC cycles, then point their object fields / array elements / closure captures / module globals at fresh nursery values, churn 4 more cycles, verify everything.structured_clone_gc_churn_stress— 300-field literal + 200-deep nested chain through structuredClone, churn, full verification.Both avoid wide dynamic objects / wide literals deliberately (#4878, #4880 — documented in the file header).
Zero-regressions check
perry-runtime(RUST_TEST_THREADS=1): 1005 passed, 1 failed —date::tests::test_full_year_setters_revive_invalid_date_onlyfails identically on unmodified main in a clean worktree (assertion left:1.0 right:0.0 at date.rs:1867), i.e. pre-existing on main as of today, not from this PR.perry-codegen: all green.perry(incl. both new stress tests): all green.perry-ext-*crates; each passes individually (this is the exact failure mode the CI job serializesCARGO_BUILD_JOBS=1for).cargo fmt --check,check_file_size.sh, checker self-test + gate: green.Per contributor convention, no version bump / CHANGELOG edits — maintainer folds those in at merge.
Side note for CLAUDE.md (not changed here since it's metadata-adjacent): the documented workspace test one-liner is missing
--exclude perry-ui-windows-winui(crate added in #4680); without it the suite fails to build on macOS hosts.