PerryTS · andrewtdiz · Jun 17, 2026 · Jun 18, 2026 · Jun 18, 2026 · coderabbitai
diff --git a/PERF_RUN_LOG.md b/PERF_RUN_LOG.md
@@ -44,3 +44,126 @@
   - `benchmarks/baseline.json` is stale for this Linux environment; compare was run with `--warn-only` and the before/after comparison above uses the captured local baseline JSON.
   - Follow-up candidates remain in typed array and numeric array hot paths, but this cycle stopped at the isolated registration-hoist optimization.
 - PR: https://github.com/PerryTS/perry/pull/5295
+
+## 2026-06-17 - Guarded numeric array direct payload access
+
+- Start revision: `8d953ca7ad6f`
+- Branch: `codex/perry-performance-20260617`
+- Worker assignment: single Codex pass in this worktree
+- Benchmark environment: Linux `/usr/bin/time -v`; local `node` cannot execute `.ts` benchmark inputs, so Node columns and correctness comparisons were skipped by the harness
+- Baseline commands:
+  - `cargo build --release`
+  - `./benchmarks/compare.sh --quick --runs 3 --warn-only --json-out /tmp/perry-final-e816fc3e4.json`
+  - `./benchmarks/quick.sh`
+  - `target/release/perry compile --no-cache benchmarks/suite/16_matrix_multiply.ts -o /tmp/perry-matrix-multiply-final --quiet`
+  - `perf stat -e cycles,instructions,branches,branch-misses /tmp/perry-matrix-multiply-final`
+- Baseline results:
+  - compare quick medians: loop_overhead 74ms/18768KB, fibonacci 261ms/18920KB, math_intensive 69ms/18944KB, nested_loops 956ms/19152KB, factorial 94ms/18896KB
+  - quick: fibonacci 262ms/18MB, math_intensive 55ms/18MB, nested_loops 965ms/18MB, factorial 75ms/18MB, matrix_multiply 1842ms/28MB
+  - direct matrix binary: `matrix_multiply:1778`, `checksum:41079519680`
+  - `perf stat` direct matrix binary: 6,569,183,197 cycles, 30,876,077,204 instructions, 5,501,828,073 branches, 2,178,745 branch-misses, 1.8236s elapsed
+- Selected gap and evidence:
+  - After the registration hoist, `matrix_multiply` was still the slowest `quick.sh` case at 1842ms.
+  - LLVM trace for `benchmarks/suite/16_matrix_multiply.ts` showed hot-path calls to `js_array_numeric_get_f64_unboxed` and `js_array_numeric_set_f64_unboxed` after the existing typed-feedback numeric array guards.
+  - The guards prove a live, non-forwarded array, in-bounds index where required, raw-f64 numeric layout, and numeric set values; the runtime helpers then only repeat checks before loading or storing the raw-f64 payload.
+- Change:
+  - Inlined raw-f64 array element loads/stores in guarded numeric array index get/set lowering after the typed-feedback guard and codegen length checks.
+  - Recorded direct-load/direct-store native proof consumers and taught the verifier to accept them only with the existing consumed raw-f64 layout fact.
+  - Updated typed-feedback, typed-shape, and native-proof tests to expect direct payload access instead of helper calls on the guarded fast paths.
+- Post-change benchmark commands:
+  - `cargo build --release`
+  - `target/release/perry compile --no-cache benchmarks/suite/16_matrix_multiply.ts -o /tmp/perry-matrix-direct-final --trace llvm --quiet`
+  - `for i in 1 2 3 4 5; do /tmp/perry-matrix-direct-final; done`
+  - `perf stat -e cycles,instructions,branches,branch-misses /tmp/perry-matrix-direct-final`
+  - `./benchmarks/compare.sh --quick --runs 3 --warn-only --json-out /tmp/perry-direct-numeric-final-e816fc3e4.json`
+  - `./benchmarks/quick.sh`
+- Post-change results:
+  - traced matrix binary: 1736ms, 1730ms, 1729ms, 1738ms, 1714ms; checksum always `41079519680`
+  - `perf stat` direct matrix binary: 6,337,280,206 cycles, 28,036,164,989 instructions, 4,648,261,291 branches, 488,073 branch-misses, 1.7806s elapsed
+  - compare quick medians: loop_overhead 56ms/19040KB, fibonacci 239ms/18764KB, math_intensive 58ms/18756KB, nested_loops 921ms/18944KB, factorial 89ms/18828KB
+  - quick: fibonacci 264ms/18MB, math_intensive 55ms/18MB, nested_loops 928ms/18MB, factorial 76ms/18MB, matrix_multiply 1745ms/28MB
+- Measured impact:
+  - `16_matrix_multiply` quick: 1842ms -> 1745ms, 5.3% faster
+  - Direct matrix binary instructions: 30.88B -> 28.04B, 9.2% fewer
+  - Direct matrix binary branches: 5.50B -> 4.65B, 15.5% fewer
+  - `10_nested_loops` compare median: 956ms -> 921ms, 3.7% faster
+- Verification:
+  - `cargo fmt --check`
+  - `git diff --check`
+  - `cargo test -p perry-codegen --test typed_feedback`
+  - `cargo test -p perry-codegen --test typed_shape_descriptors`
+  - `cargo test -p perry-codegen --test native_proof_regressions artifact_records_numeric_array_f64_fast_paths_and_fallback_reasons`
+  - `cargo test -p perry-codegen native_value::verify::tests`
+  - `cargo build --release`
+  - `PERRY_BIN=target/release/perry python3 tests/test_typed_feedback_runtime_evidence.py`
+  - `tests/test_benchmark_output_verifier.sh`
+  - Trace check confirmed `js_array_numeric_get_f64_unboxed` and `js_array_numeric_set_f64_unboxed` are declared but have no `call` sites in the generated matrix module; raw-f64 `load double` and `store double` operations remain in the guarded paths.
+- Notes:
+  - `benchmarks/baseline.json` is stale for this Linux environment; compare was run with `--warn-only` and the before/after comparison above uses the captured local first-cycle results.
+  - This follow-up is intended as a stacked draft PR on top of the typed-feedback registration-hoist PR.
+- PR: https://github.com/PerryTS/perry/pull/5302
+
+## 2026-06-17 - Monomorphic array guard fast cache
+
+- Start revision: `ed71efde8585`
+- Branch: `codex/perry-array-guard-cache-fastpath`
+- Worker assignment: single Codex pass in this worktree
+- Benchmark environment: Linux `/usr/bin/time -v`; local `node` cannot execute `.ts` benchmark inputs, so Node columns and correctness comparisons were skipped by the harness
+- Baseline commands:
+  - `cargo build --release`
+  - `target/release/perry compile --no-cache benchmarks/suite/16_matrix_multiply.ts -o /tmp/perry-matrix-direct-final --trace llvm --quiet`
+  - `for i in 1 2 3 4 5; do /tmp/perry-matrix-direct-final; done`
+  - `perf stat -e cycles,instructions,branches,branch-misses /tmp/perry-matrix-direct-final`
+  - `PERRY_TYPED_FEEDBACK_TRACE=/tmp/perry-matrix-typed-feedback.json /tmp/perry-matrix-direct-final`
+  - `./benchmarks/compare.sh --quick --runs 3 --warn-only --json-out /tmp/perry-direct-numeric-final-e816fc3e4.json`
+  - `./benchmarks/quick.sh`
+- Baseline results:
+  - direct matrix binary: 1736ms, 1730ms, 1729ms, 1738ms, 1714ms; checksum always `41079519680`
+  - `perf stat` direct matrix binary: 6,337,280,206 cycles, 28,036,164,989 instructions, 4,648,261,291 branches, 488,073 branch-misses, 1.7806s elapsed
+  - typed-feedback trace for direct matrix binary: 33,619,968 numeric array index-get guard passes, 65,536 numeric array index-set guard passes, 0 get/set guard failures
+  - compare quick medians: loop_overhead 56ms/19040KB, fibonacci 239ms/18764KB, math_intensive 58ms/18756KB, nested_loops 921ms/18944KB, factorial 89ms/18828KB
+  - quick: fibonacci 264ms/18MB, math_intensive 55ms/18MB, nested_loops 928ms/18MB, factorial 76ms/18MB, matrix_multiply 1745ms/28MB
+- Selected gap and evidence:
+  - After direct raw-f64 payload access, `matrix_multiply` remained the slowest `quick.sh` case at 1745ms.
+  - Matrix trace showed 33.6M successful numeric array get guard calls and 65K set guard calls, all monomorphic with no get/set failures.
+  - Sampled profiling/disassembly of `/tmp/perry-matrix-direct-final` showed the inner loop still calling `js_typed_feedback_numeric_array_index_get_guard` twice per `k` iteration; the guard path enters `guard_observe`, locks the global typed-feedback registry, does a `HashMap` lookup, updates counters, and rechecks the same monomorphic observation.
+  - A narrower raw-f64 classification shortcut was tested first and discarded: five direct matrix runs took 1767ms, 1774ms, 1757ms, 1806ms, and 1763ms, which is slower and noisier than the 1714-1738ms baseline runs.
+- Change:
+  - Added a small lock-free, direct-mapped cache for array typed-feedback guard sites.
+  - The cache is seeded by the existing slow `guard_observe` path and fast-passes only when the current array observation exactly matches the cached feedback key and the runtime contract guard is valid.
+  - Slow paths still update the registry, failures, megamorphic state, invalidation-visible observations, and fallback counters; trace snapshots merge cache fast-pass counters back into `observed_count`, per-site guard passes, and by-guard totals.
+  - Direct non-guard observations also update or disable the cache so a reused site that becomes megamorphic cannot keep fast-passing from stale cache state.
+- Post-change benchmark commands:
+  - `cargo build --release`
+  - `target/release/perry compile --no-cache benchmarks/suite/16_matrix_multiply.ts -o /tmp/perry-matrix-array-guard-cache-final --quiet`
+  - `for i in 1 2 3 4 5; do /tmp/perry-matrix-array-guard-cache-final; done`
+  - `PERRY_TYPED_FEEDBACK_TRACE=/tmp/perry-matrix-array-guard-cache-final-trace.json /tmp/perry-matrix-array-guard-cache-final`
+  - `perf stat -e cycles,instructions,branches,branch-misses /tmp/perry-matrix-array-guard-cache-final`
+  - `./benchmarks/compare.sh --quick --runs 3 --warn-only --json-out /tmp/perry-array-guard-cache-final-ed71efde8.json`
+  - `./benchmarks/quick.sh`
+- Post-change results:
+  - direct matrix binary: 1239ms, 1258ms, 1223ms, 1247ms, 1226ms; checksum always `41079519680`
+  - final trace run: `matrix_multiply:1237`, checksum `41079519680`, 33,619,968 numeric array index-get guard passes, 65,536 numeric array index-set guard passes, 0 get/set guard failures
+  - `perf stat` direct matrix binary: 4,485,321,202 cycles, 16,737,765,528 instructions, 3,085,068,790 branches, 382,419 branch-misses, 1.2376s elapsed
+  - compare quick medians: loop_overhead 56ms/18728KB, fibonacci 240ms/18888KB, math_intensive 55ms/18768KB, nested_loops 662ms/22888KB, factorial 76ms/18836KB
+  - quick: fibonacci 268ms/18MB, math_intensive 74ms/18MB, nested_loops 670ms/22MB, factorial 75ms/18MB, matrix_multiply 1228ms/30MB
+- Measured impact:
+  - `16_matrix_multiply` direct median: 1730ms -> 1239ms, 28.4% faster
+  - `16_matrix_multiply` quick: 1745ms -> 1228ms, 29.6% faster
+  - Direct matrix binary instructions: 28.04B -> 16.74B, 40.3% fewer
+  - Direct matrix binary branches: 4.65B -> 3.09B, 33.6% fewer
+  - `10_nested_loops` compare median: 921ms -> 662ms, 28.1% faster
+- Verification:
+  - `cargo fmt --check`
+  - `git diff --check`
+  - `cargo test -p perry-runtime typed_feedback`
+  - `cargo test -p perry-codegen --test typed_feedback`
+  - `cargo test -p perry-codegen --test typed_shape_descriptors`
+  - `PERRY_BIN=target/release/perry python3 tests/test_typed_feedback_runtime_evidence.py`
+  - `tests/test_benchmark_output_verifier.sh`
+  - `cargo build --release`
+  - Typed-feedback trace confirmed aggregate and per-site guard pass counts remain consistent with the pre-cache trace despite fast-path counter merging.
+- Notes:
+  - `benchmarks/baseline.json` is stale for this Linux environment; compare was run with `--warn-only` and the before/after comparison above uses the captured local second-cycle results.
+  - This follow-up is intended as a stacked draft PR on top of the guarded numeric array direct payload access PR.
+- PR: https://github.com/PerryTS/perry/pull/5307
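Reviewer sketch for the "Monomorphic array guard fast cache" entry: the log describes a small lock-free, direct-mapped cache that is seeded by the slow `guard_observe` path and fast-passes only on an exact match. The code below is a minimal illustration of that idea, not the runtime's implementation. `GuardCache`, the slot count, the packed `site`/`key` layout, and every method name are assumptions made for the example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Number of slots (assumed). A power of two keeps indexing to a single mask.
const SLOTS: usize = 64;

/// Hypothetical direct-mapped guard cache. Each slot packs one observation:
/// guard site id in the high 32 bits, feedback key in the low 32 bits.
/// The value 0 means "empty or disabled", so site ids are assumed to start at 1.
pub struct GuardCache {
    slots: [AtomicU64; SLOTS],
    fast_passes: AtomicU64,
}

impl GuardCache {
    pub fn new() -> Self {
        Self {
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
            fast_passes: AtomicU64::new(0),
        }
    }

    fn pack(site: u32, key: u32) -> u64 {
        ((site as u64) << 32) | key as u64
    }

    fn slot(&self, site: u32) -> &AtomicU64 {
        &self.slots[site as usize & (SLOTS - 1)]
    }

    /// Fast path: one atomic load and one compare, no lock and no map lookup.
    /// Returns false on any mismatch so the caller falls back to the slow path.
    pub fn try_fast_pass(&self, site: u32, key: u32) -> bool {
        let want = Self::pack(site, key);
        let hit = want != 0 && self.slot(site).load(Ordering::Acquire) == want;
        if hit {
            // Counted separately so a trace snapshot can merge it back later.
            self.fast_passes.fetch_add(1, Ordering::Relaxed);
        }
        hit
    }

    /// Called by the slow path once it has validated the observation.
    pub fn seed(&self, site: u32, key: u32) {
        self.slot(site).store(Self::pack(site, key), Ordering::Release);
    }

    /// Called when a site stops being monomorphic. Clearing an entry that
    /// belongs to a colliding site only costs that site one slow-path reseed.
    pub fn disable(&self, site: u32) {
        self.slot(site).store(0, Ordering::Release);
    }

    pub fn fast_pass_count(&self) -> u64 {
        self.fast_passes.load(Ordering::Relaxed)
    }
}
```

In this sketch a miss is always safe: the caller simply takes the registry path, which is also the only place that seeds or disables a slot. That mirrors the log's statement that slow paths keep ownership of failures and megamorphic state.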
diff --git a/benchmarks/compiler_output/workloads.toml b/benchmarks/compiler_output/workloads.toml
@@ -632,8 +632,10 @@ detail = "numeric indexed read takes the guarded raw-f64 fast path and loads the
 
 [[workloads.numeric_arrays.ir_checks]]
 name = "numeric_array_uses_unboxed_set"
-contains = "js_array_numeric_set_f64_unboxed"
-detail = "numeric indexed write uses the guarded raw-f64 helper"
+contains = "js_typed_feedback_numeric_array_index_set_guard"
+regex = '''idxset\.inbounds\.\d+:[\s\S]*?inttoptr i64 %[-a-zA-Z$._0-9]+ to ptr\s*\n\s*store double %[-a-zA-Z$._0-9]+, ptr %[-a-zA-Z$._0-9]+[^\n]*\n\s*br label %idxset\.merge'''
+regex_none = ["call i32 @js_array_numeric_set_f64_unboxed"]
+detail = "numeric indexed write takes the guarded raw-f64 fast path and stores the slot inline (inttoptr + store double in idxset.inbounds; helper call elided)"
 
 [[workloads.numeric_arrays.stdout_checks]]
 name = "numeric_arrays_checksum"
@@ -674,7 +676,7 @@ rejected_fact_state = "invalidated"
 [[workloads.numeric_arrays.native_rep_checks.require_records]]
 name = "numeric_array_get_fast_f64"
 expr_kind = "NumericArrayIndexGet"
-consumer = "js_array_numeric_get_f64_unboxed"
+consumer = "numeric_array_index_get.raw_f64_load"
 native_rep_name = "f64"
 access_mode = "checked_native"
 bounds_state = "proven_or_guarded"
@@ -702,7 +704,7 @@ rejected_fact_state = "invalidated"
 [[workloads.numeric_arrays.native_rep_checks.require_records]]
 name = "numeric_array_set_fast_f64"
 expr_kind = "NumericArrayIndexSet"
-consumer = "js_array_numeric_set_f64_unboxed"
+consumer = "numeric_array_index_set.raw_f64_store"
 native_rep_name = "f64"
 access_mode = "checked_native"
 bounds_state = "proven_or_guarded"
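Reviewer note on the `ir_checks` regex: the character class `[-a-zA-Z$._0-9]` covers the characters LLVM allows in named values, while `\w` stops at `.`, `-`, and `$`. The sketch below compares the two classes on a dotted name without pulling in a regex crate. The helper names and the sample name `elem.ptr` are made up for the example, and the diff does not say whether Perry's emitted IR uses dotted value names.

```rust
/// ASCII subset of what the regex class `\w` accepts: letters, digits, underscore.
fn is_word_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Characters accepted by the class `[-a-zA-Z$._0-9]` from the updated check.
fn is_llvm_name_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || matches!(c, '-' | '$' | '.' | '_')
}

/// Length of the leading run of `name` accepted by `pred`, i.e. how far a
/// greedy `class+` would get before the next pattern element has to match.
fn matched_prefix_len(name: &str, pred: fn(char) -> bool) -> usize {
    name.chars().take_while(|&c| pred(c)).count()
}
```

With a value named `%elem.ptr`, `%\w+` consumes only `%elem`, so a following literal such as ` to ptr` cannot match. The wider class consumes the whole name.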

diff --git a/crates/perry-codegen/src/expr/index.rs b/crates/perry-codegen/src/expr/index.rs
@@ -212,11 +212,15 @@ pub(crate) fn lower_index_set_fast(
     {
         let blk = ctx.block();
         if require_numeric_layout {
-            blk.call(
-                I32,
-                "js_array_numeric_set_f64_unboxed",
-                &[(I64, &arr_handle), (I32, &idx_i32), (DOUBLE, val_double)],
-            );
+            let (_element_addr, element_ptr) = element_slot(blk, &arr_handle, &idx_i32);
+            // The numeric-array guard proves the receiver has raw-f64 numeric
+            // layout and the value is numeric; the preceding length check
+            // proves this specific store is in-bounds. Store the numeric
+            // payload directly instead of calling the runtime helper.
+            // GC_STORE_AUDIT(POINTER_FREE): the stored value is a guard-proven
+            // numeric f64 written into a raw-f64 array payload slot — no GC
+            // pointer is stored, so no write barrier is required.
+            blk.store(DOUBLE, val_double, &element_ptr);
         } else {
             let (element_addr, element_ptr) = element_slot(blk, &arr_handle, &idx_i32);
             // In-place overwrite of a non-raw-layout (e.g. downgraded `any[]`)
@@ -251,7 +255,7 @@ pub(crate) fn lower_index_set_fast(
         ctx.record_lowered_value_with_access_mode_and_facts(
             "NumericArrayIndexSet",
             Some(local_id),
-            "js_array_numeric_set_f64_unboxed",
+            "numeric_array_index_set.raw_f64_store",
             &stored,
             Some(BoundsState::Guarded {
                 guard_id: "numeric_array_index_set_guard".to_string(),

diff --git a/crates/perry-codegen/src/expr/index_get.rs b/crates/perry-codegen/src/expr/index_get.rs
@@ -289,6 +289,11 @@ fn lower_guarded_array_index_get(
     let fast_blk = ctx.block();
     let arr_bits = fast_blk.bitcast_double_to_i64(arr_box);
     let arr_handle = fast_blk.and(I64, &arr_bits, POINTER_MASK_I64);
+    let idx_i64 = fast_blk.zext(I32, idx_i32, I64);
+    let byte_offset = fast_blk.shl(I64, &idx_i64, "3");
+    let with_header = fast_blk.add(I64, &byte_offset, "8");
+    let element_addr = fast_blk.add(I64, &arr_handle, &with_header);
+    let element_ptr = fast_blk.inttoptr(I64, &element_addr);
     let fast_val = if require_numeric_layout {
         // The `numeric_array_index_get_guard` on the way into this block already
         // proved: a plain, non-forwarded `Array`, in raw-f64 numeric layout,
@@ -297,19 +302,10 @@ fn lower_guarded_array_index_get(
         // of calling `js_array_numeric_get_f64_unboxed`, whose hot path
         // re-validates exactly those same conditions and then does this load.
         // Raw-f64 arrays are dense (no HOLE slots) and the slot holds a raw f64,
-        // matching the runtime helper's `return *elements_ptr.add(index)`.
-        let idx_i64 = fast_blk.zext(I32, idx_i32, I64);
-        let byte_offset = fast_blk.shl(I64, &idx_i64, "3");
-        let with_header = fast_blk.add(I64, &byte_offset, "8");
-        let element_addr = fast_blk.add(I64, &arr_handle, &with_header);
-        let element_ptr = fast_blk.inttoptr(I64, &element_addr);
+        // matching the runtime helper's `return *elements_ptr.add(index)`. The
+        // `element_ptr` is hoisted above the branch since both arms reuse it.
         fast_blk.load(DOUBLE, &element_ptr)
     } else {
-        let idx_i64 = fast_blk.zext(I32, idx_i32, I64);
-        let byte_offset = fast_blk.shl(I64, &idx_i64, "3");
-        let with_header = fast_blk.add(I64, &byte_offset, "8");
-        let element_addr = fast_blk.add(I64, &arr_handle, &with_header);
-        let element_ptr = fast_blk.inttoptr(I64, &element_addr);
         let fast_raw = fast_blk.load(DOUBLE, &element_ptr);
         // `new Array(n)` slots are TAG_HOLE internally; JavaScript reads expose
         // `undefined`.
@@ -330,7 +326,7 @@ fn lower_guarded_array_index_get(
         ctx.record_lowered_value_with_access_mode_and_facts(
             "NumericArrayIndexGet",
             None,
-            "js_array_numeric_get_f64_unboxed",
+            "numeric_array_index_get.raw_f64_load",
             &fast,
             Some(BoundsState::Guarded {
                 guard_id: "numeric_array_index_get_guard".to_string(),
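Reviewer sketch of the address arithmetic that the hoisted `element_ptr` computation emits (`zext`, `shl 3`, `add 8`, `add` to the handle). The 8-byte header and 8-byte slot size come from the constants in the hunk. The byte-buffer model, the function names, and the use of base address 0 are assumptions for illustration, and the sketch makes no claim about what the header holds.

```rust
/// Bytes that precede the payload, per the `add ..., "8"` in the lowering.
const HEADER_BYTES: u64 = 8;

/// Mirrors the emitted arithmetic: zero-extend the index, scale by 8
/// (shift left by 3), skip the header, then add the array base.
fn element_addr(arr_handle: u64, idx: u32) -> u64 {
    let idx_i64 = idx as u64; // zext i32 -> i64
    let byte_offset = idx_i64 << 3; // one f64 slot is 8 bytes
    let with_header = byte_offset + HEADER_BYTES;
    arr_handle + with_header
}

/// Builds a model allocation: a zeroed 8-byte header followed by raw f64 slots.
fn model_array(values: &[f64]) -> Vec<u8> {
    let mut buf = vec![0u8; HEADER_BYTES as usize];
    for v in values {
        buf.extend_from_slice(&v.to_ne_bytes());
    }
    buf
}

/// Safe stand-in for `load double`: reads the f64 at the computed offset,
/// treating the start of `allocation` as base address 0.
fn load_f64(allocation: &[u8], idx: u32) -> f64 {
    let start = element_addr(0, idx) as usize;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&allocation[start..start + 8]);
    f64::from_ne_bytes(bytes)
}
```

The model panics on an out-of-range index because of slice bounds checking. The generated code has no such check at this point, which is why the lowering relies on the guard and the preceding length check.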

diff --git a/crates/perry-codegen/src/expr/index_set.rs b/crates/perry-codegen/src/expr/index_set.rs
@@ -439,11 +439,15 @@ pub(crate) fn lower(ctx: &mut FnCtx<'_>, expr: &Expr) -> Result<String> {
                                 let blk = ctx.block();
                                 let arr_bits = blk.bitcast_double_to_i64(&arr_box);
                                 let arr_handle = blk.and(I64, &arr_bits, POINTER_MASK_I64);
-                                blk.call(
-                                    I32,
-                                    "js_array_numeric_set_f64_unboxed",
-                                    &[(I64, &arr_handle), (I32, &idx_i32), (DOUBLE, &val_double)],
-                                );
+                                let idx_i64 = blk.zext(I32, &idx_i32, I64);
+                                let byte_offset = blk.shl(I64, &idx_i64, "3");
+                                let with_header = blk.add(I64, &byte_offset, "8");
+                                let element_addr = blk.add(I64, &arr_handle, &with_header);
+                                let element_ptr = blk.inttoptr(I64, &element_addr);
+                                // GC_STORE_AUDIT(POINTER_FREE): guard-proven
+                                // numeric f64 stored into a raw-f64 array
+                                // payload slot — no GC pointer, no barrier.
+                                blk.store(DOUBLE, &val_double, &element_ptr);
                                 blk.br(&merge_label);
                             }
                             let stored = LoweredValue {
@@ -455,7 +459,7 @@ pub(crate) fn lower(ctx: &mut FnCtx<'_>, expr: &Expr) -> Result<String> {
                             ctx.record_lowered_value_with_access_mode_and_facts(
                                 "NumericArrayIndexSet",
                                 Some(*arr_id),
-                                "js_array_numeric_set_f64_unboxed",
+                                "numeric_array_index_set.raw_f64_store",
                                 &stored,
                                 Some(BoundsState::Guarded {
                                     guard_id: "numeric_array_index_set_guard".to_string(),

diff --git a/crates/perry-codegen/src/native_value/verify.rs b/crates/perry-codegen/src/native_value/verify.rs
@@ -255,6 +255,8 @@ fn raw_f64_checked_native_consumer(record: &NativeRepRecord) -> bool {
         record.consumer.as_str(),
         "js_array_numeric_get_f64_unboxed"
             | "js_array_numeric_set_f64_unboxed"
+            | "numeric_array_index_get.raw_f64_load"
+            | "numeric_array_index_set.raw_f64_store"
             | "js_array_numeric_push_f64_unboxed"
             | "class_field_get.raw_f64_load"
             | "class_field_set.raw_f64_store"
@@ -1257,6 +1259,14 @@ mod tests {
         for (expr_kind, consumer) in [
             ("NumericArrayIndexGet", "js_array_numeric_get_f64_unboxed"),
             ("NumericArrayIndexSet", "js_array_numeric_set_f64_unboxed"),
+            (
+                "NumericArrayIndexGet",
+                "numeric_array_index_get.raw_f64_load",
+            ),
+            (
+                "NumericArrayIndexSet",
+                "numeric_array_index_set.raw_f64_store",
+            ),
             ("NumericArrayPush", "js_array_numeric_push_f64_unboxed"),
             ("ClassFieldGet", "class_field_get.raw_f64_load"),
             ("ClassFieldSet", "class_field_set.raw_f64_store"),
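Reviewer sketch of the acceptance rule described in the log: the verifier accepts the new direct-load and direct-store consumers only together with the consumed raw-f64 layout fact. The consumer strings below are copied from the hunk. The `Record` struct, its `has_raw_f64_layout_fact` field, and both function names are hypothetical simplifications of `NativeRepRecord` and the real verifier.

```rust
/// Hypothetical, trimmed-down record holding only what this sketch needs.
struct Record<'a> {
    consumer: &'a str,
    has_raw_f64_layout_fact: bool,
}

/// Allowlist of consumers that read or write a raw-f64 payload.
fn is_raw_f64_consumer(consumer: &str) -> bool {
    matches!(
        consumer,
        "js_array_numeric_get_f64_unboxed"
            | "js_array_numeric_set_f64_unboxed"
            | "numeric_array_index_get.raw_f64_load"
            | "numeric_array_index_set.raw_f64_store"
            | "js_array_numeric_push_f64_unboxed"
            | "class_field_get.raw_f64_load"
            | "class_field_set.raw_f64_store"
    )
}

/// A record passes only if its consumer is allowlisted AND the layout fact
/// that justifies skipping the helper's checks was actually consumed.
fn accepts(record: &Record) -> bool {
    is_raw_f64_consumer(record.consumer) && record.has_raw_f64_layout_fact
}
```

Keeping the old helper names in the list means records produced by paths that still call the helpers continue to verify.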