Closed
123 changes: 123 additions & 0 deletions PERF_RUN_LOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,3 +44,126 @@
- `benchmarks/baseline.json` is stale for this Linux environment; compare was run with `--warn-only` and the before/after comparison above uses the captured local baseline JSON.
- Follow-up candidates remain in typed array and numeric array hot paths, but this cycle stopped at the isolated registration-hoist optimization.
- PR: https://github.com/PerryTS/perry/pull/5295

## 2026-06-17 - Guarded numeric array direct payload access

- Start revision: `8d953ca7ad6f`
- Branch: `codex/perry-performance-20260617`
- Worker assignment: single Codex pass in this worktree
- Benchmark environment: Linux `/usr/bin/time -v`; local `node` cannot execute `.ts` benchmark inputs, so Node columns and correctness comparisons were skipped by the harness
- Baseline commands:
  - `cargo build --release`
  - `./benchmarks/compare.sh --quick --runs 3 --warn-only --json-out /tmp/perry-final-e816fc3e4.json`
  - `./benchmarks/quick.sh`
  - `target/release/perry compile --no-cache benchmarks/suite/16_matrix_multiply.ts -o /tmp/perry-matrix-multiply-final --quiet`
  - `perf stat -e cycles,instructions,branches,branch-misses /tmp/perry-matrix-multiply-final`
- Baseline results:
  - compare quick medians: loop_overhead 74ms/18768KB, fibonacci 261ms/18920KB, math_intensive 69ms/18944KB, nested_loops 956ms/19152KB, factorial 94ms/18896KB
  - quick: fibonacci 262ms/18MB, math_intensive 55ms/18MB, nested_loops 965ms/18MB, factorial 75ms/18MB, matrix_multiply 1842ms/28MB
  - direct matrix binary: `matrix_multiply:1778`, `checksum:41079519680`
  - `perf stat` direct matrix binary: 6,569,183,197 cycles, 30,876,077,204 instructions, 5,501,828,073 branches, 2,178,745 branch-misses, 1.8236s elapsed
- Selected gap and evidence:
  - After the registration hoist, `matrix_multiply` was still the slowest `quick.sh` case at 1842ms.
  - LLVM trace for `benchmarks/suite/16_matrix_multiply.ts` showed hot-path calls to `js_array_numeric_get_f64_unboxed` and `js_array_numeric_set_f64_unboxed` after the existing typed-feedback numeric array guards.
  - The guards prove a live, non-forwarded array, in-bounds index where required, raw-f64 numeric layout, and numeric set values; the runtime helpers then only repeat checks before loading or storing the raw-f64 payload.
- Change:
  - Inlined raw-f64 array element loads/stores in guarded numeric array index get/set lowering after the typed-feedback guard and codegen length checks.
  - Recorded direct-load/direct-store native proof consumers and taught the verifier to accept them only with the existing consumed raw-f64 layout fact.
  - Updated typed-feedback, typed-shape, and native-proof tests to expect direct payload access instead of helper calls on the guarded fast paths.
- Post-change benchmark commands:
  - `cargo build --release`
  - `target/release/perry compile --no-cache benchmarks/suite/16_matrix_multiply.ts -o /tmp/perry-matrix-direct-final --trace llvm --quiet`
  - `for i in 1 2 3 4 5; do /tmp/perry-matrix-direct-final; done`
  - `perf stat -e cycles,instructions,branches,branch-misses /tmp/perry-matrix-direct-final`
  - `./benchmarks/compare.sh --quick --runs 3 --warn-only --json-out /tmp/perry-direct-numeric-final-e816fc3e4.json`
  - `./benchmarks/quick.sh`
- Post-change results:
  - traced matrix binary: 1736ms, 1730ms, 1729ms, 1738ms, 1714ms; checksum always `41079519680`
  - `perf stat` direct matrix binary: 6,337,280,206 cycles, 28,036,164,989 instructions, 4,648,261,291 branches, 488,073 branch-misses, 1.7806s elapsed
  - compare quick medians: loop_overhead 56ms/19040KB, fibonacci 239ms/18764KB, math_intensive 58ms/18756KB, nested_loops 921ms/18944KB, factorial 89ms/18828KB
  - quick: fibonacci 264ms/18MB, math_intensive 55ms/18MB, nested_loops 928ms/18MB, factorial 76ms/18MB, matrix_multiply 1745ms/28MB
- Measured impact:
  - `16_matrix_multiply` quick: 1842ms -> 1745ms, 5.3% faster
  - Direct matrix binary instructions: 30.88B -> 28.04B, 9.2% fewer
  - Direct matrix binary branches: 5.50B -> 4.65B, 15.5% fewer
  - `10_nested_loops` compare median: 956ms -> 921ms, 3.7% faster
- Verification:
  - `cargo fmt --check`
  - `git diff --check`
  - `cargo test -p perry-codegen --test typed_feedback`
  - `cargo test -p perry-codegen --test typed_shape_descriptors`
  - `cargo test -p perry-codegen --test native_proof_regressions artifact_records_numeric_array_f64_fast_paths_and_fallback_reasons`
  - `cargo test -p perry-codegen native_value::verify::tests`
  - `cargo build --release`
  - `PERRY_BIN=target/release/perry python3 tests/test_typed_feedback_runtime_evidence.py`
  - `tests/test_benchmark_output_verifier.sh`
  - Trace check confirmed `js_array_numeric_get_f64_unboxed` and `js_array_numeric_set_f64_unboxed` are declared but have no `call` sites in the generated matrix module; raw-f64 `load double` and `store double` operations remain in the guarded paths.
- Notes:
  - `benchmarks/baseline.json` is stale for this Linux environment; compare was run with `--warn-only` and the before/after comparison above uses the captured local first-cycle results.
  - This follow-up is intended as a stacked draft PR on top of the typed-feedback registration-hoist PR.
- PR: https://github.com/PerryTS/perry/pull/5302
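
The percentages under "Measured impact" follow directly from the raw timings and `perf stat` counters recorded in this entry. The arithmetic can be sketched as below; `percent_reduction` is a hypothetical helper name used only for illustration and is not part of the Perry repository.

```rust
/// Percentage by which `after` is lower than `before`.
/// Hypothetical helper for illustration only.
fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

fn main() {
    // (label, before, after), copied from the baseline and post-change
    // results recorded in this log entry.
    let rows = [
        ("16_matrix_multiply quick (ms)", 1842.0, 1745.0),
        ("direct matrix binary instructions", 30_876_077_204.0, 28_036_164_989.0),
        ("direct matrix binary branches", 5_501_828_073.0, 4_648_261_291.0),
        ("10_nested_loops compare median (ms)", 956.0, 921.0),
    ];
    for (label, before, after) in rows {
        // One decimal place, matching the precision used in the log.
        println!("{}: {:.1}% lower", label, percent_reduction(before, after));
    }
}
```

Rounded to one decimal place, the four rows reproduce the 5.3%, 9.2%, 15.5%, and 3.7% figures listed above.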

## 2026-06-17 - Monomorphic array guard fast cache

- Start revision: `ed71efde8585`
- Branch: `codex/perry-array-guard-cache-fastpath`
- Worker assignment: single Codex pass in this worktree
- Benchmark environment: Linux `/usr/bin/time -v`; local `node` cannot execute `.ts` benchmark inputs, so Node columns and correctness comparisons were skipped by the harness
- Baseline commands:
  - `cargo build --release`
  - `target/release/perry compile --no-cache benchmarks/suite/16_matrix_multiply.ts -o /tmp/perry-matrix-direct-final --trace llvm --quiet`
  - `for i in 1 2 3 4 5; do /tmp/perry-matrix-direct-final; done`
  - `perf stat -e cycles,instructions,branches,branch-misses /tmp/perry-matrix-direct-final`
  - `PERRY_TYPED_FEEDBACK_TRACE=/tmp/perry-matrix-typed-feedback.json /tmp/perry-matrix-direct-final`
  - `./benchmarks/compare.sh --quick --runs 3 --warn-only --json-out /tmp/perry-direct-numeric-final-e816fc3e4.json`
  - `./benchmarks/quick.sh`
- Baseline results:
  - direct matrix binary: 1736ms, 1730ms, 1729ms, 1738ms, 1714ms; checksum always `41079519680`
  - `perf stat` direct matrix binary: 6,337,280,206 cycles, 28,036,164,989 instructions, 4,648,261,291 branches, 488,073 branch-misses, 1.7806s elapsed
  - typed-feedback trace for direct matrix binary: 33,619,968 numeric array index-get guard passes, 65,536 numeric array index-set guard passes, 0 get/set guard failures
  - compare quick medians: loop_overhead 56ms/19040KB, fibonacci 239ms/18764KB, math_intensive 58ms/18756KB, nested_loops 921ms/18944KB, factorial 89ms/18828KB
  - quick: fibonacci 264ms/18MB, math_intensive 55ms/18MB, nested_loops 928ms/18MB, factorial 76ms/18MB, matrix_multiply 1745ms/28MB
- Selected gap and evidence:
  - After direct raw-f64 payload access, `matrix_multiply` remained the slowest `quick.sh` case at 1745ms.
  - Matrix trace showed 33.6M successful numeric array get guard calls and 65K set guard calls, all monomorphic with no get/set failures.
  - Sampled profiling/disassembly of `/tmp/perry-matrix-direct-final` showed the inner loop still calling `js_typed_feedback_numeric_array_index_get_guard` twice per `k` iteration; the guard path enters `guard_observe`, locks the global typed-feedback registry, does a `HashMap` lookup, updates counters, and rechecks the same monomorphic observation.
  - A narrower raw-f64 classification shortcut was tested first and discarded: five direct matrix runs were 1767ms, 1774ms, 1757ms, 1806ms, 1763ms, which was slower/noisier than the 1714-1738ms baseline.
- Change:
  - Added a small lock-free, direct-mapped cache for array typed-feedback guard sites.
  - The cache is seeded by the existing slow `guard_observe` path and fast-passes only when the current array observation exactly matches the cached feedback key and the runtime contract guard is valid.
  - Slow paths still update the registry, failures, megamorphic state, invalidation-visible observations, and fallback counters; trace snapshots merge cache fast-pass counters back into `observed_count`, per-site guard passes, and by-guard totals.
  - Direct non-guard observations also update or disable the cache so a reused site that becomes megamorphic cannot keep fast-passing from stale cache state.
- Post-change benchmark commands:
  - `cargo build --release`
  - `target/release/perry compile --no-cache benchmarks/suite/16_matrix_multiply.ts -o /tmp/perry-matrix-array-guard-cache-final --quiet`
  - `for i in 1 2 3 4 5; do /tmp/perry-matrix-array-guard-cache-final; done`
  - `PERRY_TYPED_FEEDBACK_TRACE=/tmp/perry-matrix-array-guard-cache-final-trace.json /tmp/perry-matrix-array-guard-cache-final`
  - `perf stat -e cycles,instructions,branches,branch-misses /tmp/perry-matrix-array-guard-cache-final`
  - `./benchmarks/compare.sh --quick --runs 3 --warn-only --json-out /tmp/perry-array-guard-cache-final-ed71efde8.json`
  - `./benchmarks/quick.sh`
- Post-change results:
  - direct matrix binary: 1239ms, 1258ms, 1223ms, 1247ms, 1226ms; checksum always `41079519680`
  - final trace run: `matrix_multiply:1237`, checksum `41079519680`, 33,619,968 numeric array index-get guard passes, 65,536 numeric array index-set guard passes, 0 get/set guard failures
  - `perf stat` direct matrix binary: 4,485,321,202 cycles, 16,737,765,528 instructions, 3,085,068,790 branches, 382,419 branch-misses, 1.2376s elapsed
  - compare quick medians: loop_overhead 56ms/18728KB, fibonacci 240ms/18888KB, math_intensive 55ms/18768KB, nested_loops 662ms/22888KB, factorial 76ms/18836KB
  - quick: fibonacci 268ms/18MB, math_intensive 74ms/18MB, nested_loops 670ms/22MB, factorial 75ms/18MB, matrix_multiply 1228ms/30MB
- Measured impact:
  - `16_matrix_multiply` direct median: 1730ms -> 1239ms, 28.4% faster
  - `16_matrix_multiply` quick: 1745ms -> 1228ms, 29.6% faster
  - Direct matrix binary instructions: 28.04B -> 16.74B, 40.3% fewer
  - Direct matrix binary branches: 4.65B -> 3.09B, 33.6% fewer
  - `10_nested_loops` compare median: 921ms -> 662ms, 28.1% faster
- Verification:
  - `cargo fmt --check`
  - `git diff --check`
  - `cargo test -p perry-runtime typed_feedback`
  - `cargo test -p perry-codegen --test typed_feedback`
  - `cargo test -p perry-codegen --test typed_shape_descriptors`
  - `PERRY_BIN=target/release/perry python3 tests/test_typed_feedback_runtime_evidence.py`
  - `tests/test_benchmark_output_verifier.sh`
  - `cargo build --release`
  - Typed-feedback trace confirmed aggregate and per-site guard pass counts remain consistent with the pre-cache trace despite fast-path counter merging.
- Notes:
  - `benchmarks/baseline.json` is stale for this Linux environment; compare was run with `--warn-only` and the before/after comparison above uses the captured local second-cycle results.
  - This follow-up is intended as a stacked draft PR on top of the guarded numeric array direct payload access PR.
- PR: https://github.com/PerryTS/perry/pull/5307
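
The guard cache described under "Change" is not part of the files shown in this diff. As a rough illustration only, a lock-free direct-mapped cache could be sketched as below. Every name here (`GuardCache`, `pack`, `try_fast_pass`, `seed`, `disable`, `drain_fast_passes`), the slot count, and the packed 64-bit entry layout are assumptions made for this sketch; the actual runtime code may differ substantially.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Number of slots (assumed); a power of two so indexing is a cheap mask.
const SLOTS: usize = 64;
/// Slot value meaning "nothing cached". Site ids are assumed to start at 1.
const EMPTY: u64 = 0;

/// Hypothetical guard-site cache: one packed (site, key) word per slot.
struct GuardCache {
    entries: [AtomicU64; SLOTS],
    fast_passes: [AtomicU64; SLOTS],
}

/// Pack site id and observation key into one word so that a single
/// atomic load observes both halves consistently.
fn pack(site: u32, key: u32) -> u64 {
    ((site as u64) << 32) | (key as u64)
}

impl GuardCache {
    fn new() -> Self {
        GuardCache {
            entries: std::array::from_fn(|_| AtomicU64::new(EMPTY)),
            fast_passes: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    /// Direct mapping: each site has exactly one candidate slot.
    fn slot(site: u32) -> usize {
        (site as usize) & (SLOTS - 1)
    }

    /// Fast path: pass only when the slot holds exactly this (site, key).
    fn try_fast_pass(&self, site: u32, key: u32) -> bool {
        let i = Self::slot(site);
        if self.entries[i].load(Ordering::Acquire) == pack(site, key) {
            self.fast_passes[i].fetch_add(1, Ordering::Relaxed);
            true
        } else {
            false
        }
    }

    /// Called by the slow path once it has validated the observation.
    fn seed(&self, site: u32, key: u32) {
        self.entries[Self::slot(site)].store(pack(site, key), Ordering::Release);
    }

    /// Called when a site stops being monomorphic; clears only its own entry.
    fn disable(&self, site: u32) {
        let i = Self::slot(site);
        let cur = self.entries[i].load(Ordering::Acquire);
        if (cur >> 32) as u32 == site {
            // Losing a race here is fine: another thread already replaced the entry.
            let _ = self.entries[i].compare_exchange(
                cur, EMPTY, Ordering::AcqRel, Ordering::Acquire);
        }
    }

    /// Take the fast-pass count so a trace snapshot can merge it.
    /// Simplification: sites that collide on a slot share one counter.
    fn drain_fast_passes(&self, site: u32) -> u64 {
        self.fast_passes[Self::slot(site)].swap(0, Ordering::Relaxed)
    }
}

fn main() {
    let cache = GuardCache::new();
    assert!(!cache.try_fast_pass(7, 0xF64)); // cold: slow path must run
    cache.seed(7, 0xF64);
    assert!(cache.try_fast_pass(7, 0xF64)); // exact match: fast pass
    assert!(!cache.try_fast_pass(7, 0xA11)); // different observation: slow path
    cache.disable(7);
    assert!(!cache.try_fast_pass(7, 0xF64)); // disabled: no stale fast pass
    println!("fast passes to merge: {}", cache.drain_fast_passes(7));
}
```

The point of packing both values into one atomic word is that the hot path needs only a single load and compare, with no lock and no `HashMap` lookup, while anything that does not match exactly falls back to the slow path that owns the registry.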
10 changes: 6 additions & 4 deletions benchmarks/compiler_output/workloads.toml
Expand Up @@ -632,8 +632,10 @@ detail = "numeric indexed read takes the guarded raw-f64 fast path and loads the

[[workloads.numeric_arrays.ir_checks]]
name = "numeric_array_uses_unboxed_set"
-contains = "js_array_numeric_set_f64_unboxed"
-detail = "numeric indexed write uses the guarded raw-f64 helper"
+contains = "js_typed_feedback_numeric_array_index_set_guard"
+regex = '''idxset\.inbounds\.\d+:[\s\S]*?inttoptr i64 %\w+ to ptr\s*\n\s*store double %\w+, ptr %\w+[^\n]*\n\s*br label %idxset\.merge'''

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Broaden SSA identifier matching in the IR regex.

%\w+ is too restrictive for LLVM IR names and may fail on valid identifiers containing dots (e.g., %tmp.1), making this workload check flaky/overly brittle.

Suggested patch
-regex = '''idxset\.inbounds\.\d+:[\s\S]*?inttoptr i64 %\w+ to ptr\s*\n\s*store double %\w+, ptr %\w+[^\n]*\n\s*br label %idxset\.merge'''
+regex = '''idxset\.inbounds\.\d+:[\s\S]*?inttoptr i64 %[-a-zA-Z$._0-9]+ to ptr\s*\n\s*store double %[-a-zA-Z$._0-9]+, ptr %[-a-zA-Z$._0-9]+[^\n]*\n\s*br label %idxset\.merge'''
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `benchmarks/compiler_output/workloads.toml` at line 636, the regex in the
workload check uses `%\w+`, which is too restrictive for LLVM IR identifiers
and fails to match valid names containing dots such as `%tmp.1`. Broaden the SSA
identifier matching by replacing all three occurrences of `%\w+` with
`%[-a-zA-Z$._0-9]+`, as in the suggested patch, so the workload check is more
robust and less brittle to variations in the IR output.
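
To make the difference between the two character classes concrete, here is a small sketch that avoids any regex dependency and compares them by hand. The function names are hypothetical, and `\w` is approximated as ASCII `[A-Za-z0-9_]`, which is a simplification if the regex engine in use treats `\w` as Unicode-aware.

```rust
/// True if `c` is in `[-a-zA-Z$._0-9]`, the class used in the suggested patch.
fn is_llvm_ident_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || matches!(c, '-' | '$' | '.' | '_')
}

/// ASCII approximation of `\w`: `[A-Za-z0-9_]`.
fn is_word_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Number of leading characters of `name` accepted by `pred`.
fn matched_prefix(name: &str, pred: fn(char) -> bool) -> usize {
    name.chars().take_while(|&c| pred(c)).count()
}

fn main() {
    // Names as they would appear after the `%` sigil in LLVM IR.
    for name in ["tmp", "tmp.1", "idx-set", "42"] {
        let full = name.chars().count();
        let word = matched_prefix(name, is_word_char) == full;
        let broad = matched_prefix(name, is_llvm_ident_char) == full;
        println!("%{}: \\w+ covers whole name = {}, broadened class = {}", name, word, broad);
    }
}
```

With `%\w+`, a name like `%tmp.1` is only matched up to `%tmp`, so the literal text that the pattern expects next (for example `, ptr`) no longer lines up and the whole check fails.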

+regex_none = ["call i32 @js_array_numeric_set_f64_unboxed"]
+detail = "numeric indexed write takes the guarded raw-f64 fast path and stores the slot inline (inttoptr + store double in idxset.inbounds; helper call elided)"

[[workloads.numeric_arrays.stdout_checks]]
name = "numeric_arrays_checksum"
Expand Down Expand Up @@ -674,7 +676,7 @@ rejected_fact_state = "invalidated"
[[workloads.numeric_arrays.native_rep_checks.require_records]]
name = "numeric_array_get_fast_f64"
expr_kind = "NumericArrayIndexGet"
-consumer = "js_array_numeric_get_f64_unboxed"
+consumer = "numeric_array_index_get.raw_f64_load"
native_rep_name = "f64"
access_mode = "checked_native"
bounds_state = "proven_or_guarded"
Expand Down Expand Up @@ -702,7 +704,7 @@ rejected_fact_state = "invalidated"
[[workloads.numeric_arrays.native_rep_checks.require_records]]
name = "numeric_array_set_fast_f64"
expr_kind = "NumericArrayIndexSet"
-consumer = "js_array_numeric_set_f64_unboxed"
+consumer = "numeric_array_index_set.raw_f64_store"
native_rep_name = "f64"
access_mode = "checked_native"
bounds_state = "proven_or_guarded"
Expand Down
16 changes: 10 additions & 6 deletions crates/perry-codegen/src/expr/index.rs
Expand Up @@ -212,11 +212,15 @@ pub(crate) fn lower_index_set_fast(
{
let blk = ctx.block();
if require_numeric_layout {
-blk.call(
-I32,
-"js_array_numeric_set_f64_unboxed",
-&[(I64, &arr_handle), (I32, &idx_i32), (DOUBLE, val_double)],
-);
+let (_element_addr, element_ptr) = element_slot(blk, &arr_handle, &idx_i32);
+// The numeric-array guard proves the receiver has raw-f64 numeric
+// layout and the value is numeric; the preceding length check
+// proves this specific store is in-bounds. Store the numeric
+// payload directly instead of calling the runtime helper.
+// GC_STORE_AUDIT(POINTER_FREE): the stored value is a guard-proven
+// numeric f64 written into a raw-f64 array payload slot — no GC
+// pointer is stored, so no write barrier is required.
+blk.store(DOUBLE, val_double, &element_ptr);
} else {
let (element_addr, element_ptr) = element_slot(blk, &arr_handle, &idx_i32);
// In-place overwrite of a non-raw-layout (e.g. downgraded `any[]`)
Expand Down Expand Up @@ -251,7 +255,7 @@ pub(crate) fn lower_index_set_fast(
ctx.record_lowered_value_with_access_mode_and_facts(
"NumericArrayIndexSet",
Some(local_id),
-"js_array_numeric_set_f64_unboxed",
+"numeric_array_index_set.raw_f64_store",
&stored,
Some(BoundsState::Guarded {
guard_id: "numeric_array_index_set_guard".to_string(),
Expand Down
20 changes: 8 additions & 12 deletions crates/perry-codegen/src/expr/index_get.rs
Expand Up @@ -289,6 +289,11 @@ fn lower_guarded_array_index_get(
let fast_blk = ctx.block();
let arr_bits = fast_blk.bitcast_double_to_i64(arr_box);
let arr_handle = fast_blk.and(I64, &arr_bits, POINTER_MASK_I64);
+let idx_i64 = fast_blk.zext(I32, idx_i32, I64);
+let byte_offset = fast_blk.shl(I64, &idx_i64, "3");
+let with_header = fast_blk.add(I64, &byte_offset, "8");
+let element_addr = fast_blk.add(I64, &arr_handle, &with_header);
+let element_ptr = fast_blk.inttoptr(I64, &element_addr);
let fast_val = if require_numeric_layout {
// The `numeric_array_index_get_guard` on the way into this block already
// proved: a plain, non-forwarded `Array`, in raw-f64 numeric layout,
Expand All @@ -297,19 +302,10 @@ fn lower_guarded_array_index_get(
// of calling `js_array_numeric_get_f64_unboxed`, whose hot path
// re-validates exactly those same conditions and then does this load.
// Raw-f64 arrays are dense (no HOLE slots) and the slot holds a raw f64,
-// matching the runtime helper's `return *elements_ptr.add(index)`.
-let idx_i64 = fast_blk.zext(I32, idx_i32, I64);
-let byte_offset = fast_blk.shl(I64, &idx_i64, "3");
-let with_header = fast_blk.add(I64, &byte_offset, "8");
-let element_addr = fast_blk.add(I64, &arr_handle, &with_header);
-let element_ptr = fast_blk.inttoptr(I64, &element_addr);
+// matching the runtime helper's `return *elements_ptr.add(index)`. The
+// `element_ptr` is hoisted above the branch since both arms reuse it.
fast_blk.load(DOUBLE, &element_ptr)
} else {
-let idx_i64 = fast_blk.zext(I32, idx_i32, I64);
-let byte_offset = fast_blk.shl(I64, &idx_i64, "3");
-let with_header = fast_blk.add(I64, &byte_offset, "8");
-let element_addr = fast_blk.add(I64, &arr_handle, &with_header);
-let element_ptr = fast_blk.inttoptr(I64, &element_addr);
let fast_raw = fast_blk.load(DOUBLE, &element_ptr);
// `new Array(n)` slots are TAG_HOLE internally; JavaScript reads expose
// `undefined`.
Expand All @@ -330,7 +326,7 @@ fn lower_guarded_array_index_get(
ctx.record_lowered_value_with_access_mode_and_facts(
"NumericArrayIndexGet",
None,
-"js_array_numeric_get_f64_unboxed",
+"numeric_array_index_get.raw_f64_load",
&fast,
Some(BoundsState::Guarded {
guard_id: "numeric_array_index_get_guard".to_string(),
Expand Down
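
The address computation that this hunk hoists (`zext`, `shl` by 3, add 8 for the header, add the array handle) can be modeled in plain Rust to check the arithmetic. This is only a sketch: the 8-byte header and 8-byte slots come from the constants in the diff, but the value of `POINTER_MASK` and the NaN-box tag bits used in `main` are assumptions, since `POINTER_MASK_I64` is not defined in the files shown here.

```rust
/// ASSUMPTION: boxed values keep the pointer in the low 48 bits.
/// The real `POINTER_MASK_I64` is not shown in this diff.
const POINTER_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;
/// Header size taken from the `"8"` constant added to the byte offset.
const HEADER_BYTES: u64 = 8;

/// Models `bitcast_double_to_i64` followed by the `and` with the mask.
fn array_handle(arr_box: f64) -> u64 {
    arr_box.to_bits() & POINTER_MASK
}

/// Models the lowered sequence: zext, shl 3, add header, add handle.
fn element_addr(arr_handle: u64, idx: u32) -> u64 {
    let idx_i64 = idx as u64; // zext i32 -> i64
    let byte_offset = idx_i64 << 3; // each slot is one 8-byte f64
    let with_header = byte_offset + HEADER_BYTES;
    arr_handle + with_header
}

fn main() {
    // Hypothetical boxed array value: made-up tag bits over pointer 0x1000.
    let boxed = f64::from_bits(0x7FFD_0000_0000_1000);
    let handle = array_handle(boxed);
    assert_eq!(handle, 0x1000);
    assert_eq!(element_addr(handle, 0), 0x1008); // first slot follows the header
    assert_eq!(element_addr(handle, 2), 0x1018); // 0x1000 + 8 + 2 * 8
    println!("slot 2 lives at {:#x}", element_addr(handle, 2));
}
```

Because the address depends only on the handle and the index, computing it once before the `if require_numeric_layout` branch is behavior-preserving: both arms previously derived the same `element_ptr` from the same inputs.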
16 changes: 10 additions & 6 deletions crates/perry-codegen/src/expr/index_set.rs
Expand Up @@ -439,11 +439,15 @@ pub(crate) fn lower(ctx: &mut FnCtx<'_>, expr: &Expr) -> Result<String> {
let blk = ctx.block();
let arr_bits = blk.bitcast_double_to_i64(&arr_box);
let arr_handle = blk.and(I64, &arr_bits, POINTER_MASK_I64);
-blk.call(
-I32,
-"js_array_numeric_set_f64_unboxed",
-&[(I64, &arr_handle), (I32, &idx_i32), (DOUBLE, &val_double)],
-);
+let idx_i64 = blk.zext(I32, &idx_i32, I64);
+let byte_offset = blk.shl(I64, &idx_i64, "3");
+let with_header = blk.add(I64, &byte_offset, "8");
+let element_addr = blk.add(I64, &arr_handle, &with_header);
+let element_ptr = blk.inttoptr(I64, &element_addr);
+// GC_STORE_AUDIT(POINTER_FREE): guard-proven
+// numeric f64 stored into a raw-f64 array
+// payload slot — no GC pointer, no barrier.
+blk.store(DOUBLE, &val_double, &element_ptr);
blk.br(&merge_label);
}
let stored = LoweredValue {
Expand All @@ -455,7 +459,7 @@ pub(crate) fn lower(ctx: &mut FnCtx<'_>, expr: &Expr) -> Result<String> {
ctx.record_lowered_value_with_access_mode_and_facts(
"NumericArrayIndexSet",
Some(*arr_id),
-"js_array_numeric_set_f64_unboxed",
+"numeric_array_index_set.raw_f64_store",
&stored,
Some(BoundsState::Guarded {
guard_id: "numeric_array_index_set_guard".to_string(),
Expand Down
10 changes: 10 additions & 0 deletions crates/perry-codegen/src/native_value/verify.rs
Expand Up @@ -255,6 +255,8 @@ fn raw_f64_checked_native_consumer(record: &NativeRepRecord) -> bool {
record.consumer.as_str(),
"js_array_numeric_get_f64_unboxed"
| "js_array_numeric_set_f64_unboxed"
+| "numeric_array_index_get.raw_f64_load"
+| "numeric_array_index_set.raw_f64_store"
| "js_array_numeric_push_f64_unboxed"
| "class_field_get.raw_f64_load"
| "class_field_set.raw_f64_store"
Expand Down Expand Up @@ -1257,6 +1259,14 @@ mod tests {
for (expr_kind, consumer) in [
("NumericArrayIndexGet", "js_array_numeric_get_f64_unboxed"),
("NumericArrayIndexSet", "js_array_numeric_set_f64_unboxed"),
+(
+"NumericArrayIndexGet",
+"numeric_array_index_get.raw_f64_load",
+),
+(
+"NumericArrayIndexSet",
+"numeric_array_index_set.raw_f64_store",
+),
("NumericArrayPush", "js_array_numeric_push_f64_unboxed"),
("ClassFieldGet", "class_field_get.raw_f64_load"),
("ClassFieldSet", "class_field_set.raw_f64_store"),
Expand Down