Merged
65 changes: 65 additions & 0 deletions PERF_RUN_LOG.md
@@ -102,3 +102,68 @@
- `benchmarks/baseline.json` is stale for this Linux environment; compare was run with `--warn-only` and the before/after comparison above uses the captured local first-cycle results.
- This follow-up is intended as a stacked draft PR on top of the typed-feedback registration-hoist PR.
- PR: https://github.com/PerryTS/perry/pull/5302

## 2026-06-17 - Monomorphic array guard fast cache

- Start revision: `ed71efde8585`
- Branch: `codex/perry-array-guard-cache-fastpath`
- Worker assignment: single Codex pass in this worktree
- Benchmark environment: Linux `/usr/bin/time -v`; local `node` cannot execute `.ts` benchmark inputs, so Node columns and correctness comparisons were skipped by the harness
- Baseline commands:
  - `cargo build --release`
  - `target/release/perry compile --no-cache benchmarks/suite/16_matrix_multiply.ts -o /tmp/perry-matrix-direct-final --trace llvm --quiet`
  - `for i in 1 2 3 4 5; do /tmp/perry-matrix-direct-final; done`
  - `perf stat -e cycles,instructions,branches,branch-misses /tmp/perry-matrix-direct-final`
  - `PERRY_TYPED_FEEDBACK_TRACE=/tmp/perry-matrix-typed-feedback.json /tmp/perry-matrix-direct-final`
  - `./benchmarks/compare.sh --quick --runs 3 --warn-only --json-out /tmp/perry-direct-numeric-final-e816fc3e4.json`
  - `./benchmarks/quick.sh`
- Baseline results:
  - direct matrix binary: 1736ms, 1730ms, 1729ms, 1738ms, 1714ms; checksum always `41079519680`
  - `perf stat` direct matrix binary: 6,337,280,206 cycles, 28,036,164,989 instructions, 4,648,261,291 branches, 488,073 branch-misses, 1.7806s elapsed
  - typed-feedback trace for direct matrix binary: 33,619,968 numeric array index-get guard passes, 65,536 numeric array index-set guard passes, 0 get/set guard failures
  - compare quick medians: loop_overhead 56ms/19040KB, fibonacci 239ms/18764KB, math_intensive 58ms/18756KB, nested_loops 921ms/18944KB, factorial 89ms/18828KB
  - quick: fibonacci 264ms/18MB, math_intensive 55ms/18MB, nested_loops 928ms/18MB, factorial 76ms/18MB, matrix_multiply 1745ms/28MB
- Selected gap and evidence:
  - After direct raw-f64 payload access, `matrix_multiply` remained the slowest `quick.sh` case at 1745ms.
  - Matrix trace showed 33.6M successful numeric array get guard calls and 65K set guard calls, all monomorphic with no get/set failures.
  - Sampled profiling/disassembly of `/tmp/perry-matrix-direct-final` showed the inner loop still calling `js_typed_feedback_numeric_array_index_get_guard` twice per `k` iteration; the guard path enters `guard_observe`, locks the global typed-feedback registry, does a `HashMap` lookup, updates counters, and rechecks the same monomorphic observation.
  - A narrower raw-f64 classification shortcut was tested first and discarded: five direct matrix runs were 1767ms, 1774ms, 1757ms, 1806ms, 1763ms, which was slower/noisier than the 1714-1738ms baseline.
- Change:
  - Added a small lock-free, direct-mapped cache for array typed-feedback guard sites.
  - The cache is seeded by the existing slow `guard_observe` path and fast-passes only when the current array observation exactly matches the cached feedback key and the runtime contract guard is valid.
  - Slow paths still update the registry, failures, megamorphic state, invalidation-visible observations, and fallback counters; trace snapshots merge cache fast-pass counters back into `observed_count`, per-site guard passes, and by-guard totals.
  - Direct non-guard observations also update or disable the cache so a reused site that becomes megamorphic cannot keep fast-passing from stale cache state.
- Post-change benchmark commands:
  - `cargo build --release`
  - `target/release/perry compile --no-cache benchmarks/suite/16_matrix_multiply.ts -o /tmp/perry-matrix-array-guard-cache-final --quiet`
  - `for i in 1 2 3 4 5; do /tmp/perry-matrix-array-guard-cache-final; done`
  - `PERRY_TYPED_FEEDBACK_TRACE=/tmp/perry-matrix-array-guard-cache-final-trace.json /tmp/perry-matrix-array-guard-cache-final`
  - `perf stat -e cycles,instructions,branches,branch-misses /tmp/perry-matrix-array-guard-cache-final`
  - `./benchmarks/compare.sh --quick --runs 3 --warn-only --json-out /tmp/perry-array-guard-cache-final-ed71efde8.json`
  - `./benchmarks/quick.sh`
- Post-change results:
  - direct matrix binary: 1239ms, 1258ms, 1223ms, 1247ms, 1226ms; checksum always `41079519680`
  - final trace run: `matrix_multiply:1237`, checksum `41079519680`, 33,619,968 numeric array index-get guard passes, 65,536 numeric array index-set guard passes, 0 get/set guard failures
  - `perf stat` direct matrix binary: 4,485,321,202 cycles, 16,737,765,528 instructions, 3,085,068,790 branches, 382,419 branch-misses, 1.2376s elapsed
  - compare quick medians: loop_overhead 56ms/18728KB, fibonacci 240ms/18888KB, math_intensive 55ms/18768KB, nested_loops 662ms/22888KB, factorial 76ms/18836KB
  - quick: fibonacci 268ms/18MB, math_intensive 74ms/18MB, nested_loops 670ms/22MB, factorial 75ms/18MB, matrix_multiply 1228ms/30MB
- Measured impact:
  - `16_matrix_multiply` direct median: 1730ms -> 1239ms, 28.4% faster
  - `16_matrix_multiply` quick: 1745ms -> 1228ms, 29.6% faster
  - Direct matrix binary instructions: 28.04B -> 16.74B, 40.3% fewer
  - Direct matrix binary branches: 4.65B -> 3.09B, 33.6% fewer
  - `10_nested_loops` compare median: 921ms -> 662ms, 28.1% faster
- Verification:
  - `cargo fmt --check`
  - `git diff --check`
  - `cargo test -p perry-runtime typed_feedback`
  - `cargo test -p perry-codegen --test typed_feedback`
  - `cargo test -p perry-codegen --test typed_shape_descriptors`
  - `PERRY_BIN=target/release/perry python3 tests/test_typed_feedback_runtime_evidence.py`
  - `tests/test_benchmark_output_verifier.sh`
  - `cargo build --release`
  - Typed-feedback trace confirmed aggregate and per-site guard pass counts remain consistent with the pre-cache trace despite fast-path counter merging.
- Notes:
  - `benchmarks/baseline.json` is stale for this Linux environment; compare was run with `--warn-only` and the before/after comparison above uses the captured local second-cycle results.
  - This follow-up is intended as a stacked draft PR on top of the guarded numeric array direct payload access PR.
  - PR: https://github.com/PerryTS/perry/pull/5307
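
The percentages under "Measured impact" can be re-derived from the raw figures recorded in this entry. The sketch below is not part of the PR: `median_ms` and `percent_reduction` are hypothetical helper names introduced for illustration, and it assumes the reported direct figure is the median of the five runs and that "faster"/"fewer" means a plain `(before - after) / before` reduction.

```rust
/// Median of a non-empty slice of millisecond timings.
fn median_ms(samples: &[u64]) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        sorted[mid] as f64
    } else {
        (sorted[mid - 1] + sorted[mid]) as f64 / 2.0
    }
}

/// Relative reduction from `before` to `after`, in percent.
fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

fn main() {
    // Direct matrix binary timings copied from the log entry above.
    let baseline: [u64; 5] = [1736, 1730, 1729, 1738, 1714];
    let cached: [u64; 5] = [1239, 1258, 1223, 1247, 1226];
    let (before, after) = (median_ms(&baseline), median_ms(&cached));
    println!(
        "direct median: {before}ms -> {after}ms, {:.1}% faster",
        percent_reduction(before, after)
    );

    // `perf stat` counters copied from the log entry above.
    println!(
        "instructions: {:.1}% fewer",
        percent_reduction(28_036_164_989.0, 16_737_765_528.0)
    );
    println!(
        "branches: {:.1}% fewer",
        percent_reduction(4_648_261_291.0, 3_085_068_790.0)
    );
}
```

Under those assumptions the medians come out to 1730ms and 1239ms, and the three reductions round to the 28.4%, 40.3%, and 33.6% listed above.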
140 changes: 136 additions & 4 deletions crates/perry-runtime/src/typed_feedback.rs
@@ -7,8 +7,7 @@
 use std::collections::{BTreeMap, HashMap};
 #[cfg(any(feature = "diagnostics", test))]
 use std::sync::atomic::AtomicBool;
-#[cfg(any(feature = "diagnostics", test))]
-use std::sync::atomic::Ordering;
+use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};
 use std::sync::{LazyLock, Mutex};

 use crate::array::ArrayHeader;
@@ -19,11 +18,20 @@ use crate::value::{
};

const POLYMORPHIC_CAP: usize = 4;
const ARRAY_GUARD_FAST_CACHE_SIZE: usize = 4096;
const ARRAY_GUARD_FAST_CACHE_ENABLED: u8 = 1;
const ARRAY_GUARD_FAST_CACHE_DISABLED: u8 = 2;

static REGISTRY: LazyLock<Mutex<TypedFeedbackRegistry>> =
    LazyLock::new(|| Mutex::new(TypedFeedbackRegistry::default()));
#[cfg(any(feature = "diagnostics", test))]
static TRACE_DUMPED: AtomicBool = AtomicBool::new(false);
static ARRAY_GUARD_FAST_CACHE: LazyLock<Box<[ArrayGuardFastCacheEntry]>> = LazyLock::new(|| {
    (0..ARRAY_GUARD_FAST_CACHE_SIZE)
        .map(|_| ArrayGuardFastCacheEntry::default())
        .collect::<Vec<_>>()
        .into_boxed_slice()
});

#[cfg(not(test))]
static TYPED_FEEDBACK_ENABLED: LazyLock<bool> = LazyLock::new(|| {
@@ -329,8 +337,10 @@ pub struct GuardCounterSnapshot {
 }

 impl GuardCounterSnapshot {
-    fn add_site(&mut self, site: &TypedFeedbackSite) {
-        self.passes = self.passes.saturating_add(site.guard_passes);
+    fn add_site(&mut self, site: &TypedFeedbackSite, extra_guard_passes: u64) {
+        self.passes = self
+            .passes
+            .saturating_add(site.guard_passes.saturating_add(extra_guard_passes));
         self.failures = self.failures.saturating_add(site.guard_failures);
         self.fallback_calls = self.fallback_calls.saturating_add(site.fallback_calls);
     }
@@ -370,6 +380,122 @@ fn registry() -> crate::gc::GcRootRegistryGuard<'static, TypedFeedbackRegistry>
    crate::gc::lock_gc_root_registry(&REGISTRY)
}

#[derive(Default)]
struct ArrayGuardFastCacheEntry {
    site_id: AtomicU64,
    packed: AtomicU64,
    aux: AtomicU64,
    fast_passes: AtomicU64,
    state: AtomicU8,
}

fn array_guard_cache_index(site_id: u64) -> usize {
    let mixed = site_id ^ (site_id >> 32) ^ (site_id >> 17);
    (mixed as usize) & (ARRAY_GUARD_FAST_CACHE_SIZE - 1)
}

fn pack_array_guard_observation(observation: &Observation) -> Option<(u64, u64)> {
    if observation.source != ObservationSource::Array || observation.shape_addr != 0 {
        return None;
    }
    Some((
        (observation.class_id as u64)
            | ((observation.heap_type as u64) << 32)
            | ((observation.value_tag as u64) << 48),
        observation.aux,
    ))
}

fn array_guard_fast_pass(site_id: u64, observation: &Observation, contract_valid: bool) -> bool {
    if site_id == 0 || !contract_valid {
        return false;
    }
    let Some((packed, aux)) = pack_array_guard_observation(observation) else {
        return false;
    };
    let entry = &ARRAY_GUARD_FAST_CACHE[array_guard_cache_index(site_id)];
    if entry.state.load(Ordering::Acquire) != ARRAY_GUARD_FAST_CACHE_ENABLED {
        return false;
    }
    if entry.site_id.load(Ordering::Relaxed) != site_id {
        return false;
    }
    if entry.packed.load(Ordering::Relaxed) == packed && entry.aux.load(Ordering::Relaxed) == aux {
        entry.fast_passes.fetch_add(1, Ordering::Relaxed);
        return true;
    }
    false
}

fn note_array_guard_cache_slow_observation(
    site_id: u64,
    observation: &Observation,
    site: &TypedFeedbackSite,
) {
    if site_id == 0 {
        return;
    }
    let Some((packed, aux)) = pack_array_guard_observation(observation) else {
        return;
    };
    let entry = &ARRAY_GUARD_FAST_CACHE[array_guard_cache_index(site_id)];
    let existing_site = entry.site_id.load(Ordering::Acquire);
    if existing_site != site_id {
        if existing_site != 0 {
            return;
        }
        if entry
            .site_id
            .compare_exchange(0, site_id, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
        {
            return;
        }
    }
    if site.megamorphic {
        entry
            .state
            .store(ARRAY_GUARD_FAST_CACHE_DISABLED, Ordering::Release);
        return;
    }
    if site
        .observations
        .iter()
        .any(|seen| seen.same_feedback_key(observation))
    {
        entry.packed.store(packed, Ordering::Relaxed);
        entry.aux.store(aux, Ordering::Relaxed);
        entry
            .state
            .store(ARRAY_GUARD_FAST_CACHE_ENABLED, Ordering::Release);
    }
}

fn array_guard_cache_fast_passes(site_id: u64) -> u64 {
    if site_id == 0 {
        return 0;
    }
    let entry = &ARRAY_GUARD_FAST_CACHE[array_guard_cache_index(site_id)];
    if entry.site_id.load(Ordering::Acquire) == site_id {
        entry.fast_passes.load(Ordering::Relaxed)
    } else {
        0
    }
}

#[cfg(test)]
fn reset_array_guard_fast_cache_for_tests() {
    for entry in ARRAY_GUARD_FAST_CACHE.iter() {
        entry
            .state
            .store(ARRAY_GUARD_FAST_CACHE_DISABLED, Ordering::Release);
        entry.site_id.store(0, Ordering::Release);
        entry.packed.store(0, Ordering::Relaxed);
        entry.aux.store(0, Ordering::Relaxed);
        entry.fast_passes.store(0, Ordering::Relaxed);
    }
}

#[no_mangle]
pub extern "C" fn js_typed_feedback_register_site(
    site_id: u64,
@@ -730,6 +856,7 @@ fn observe(site_id: u64, fallback_kind: TypedFeedbackSiteKind, observation: Obse
        )
    });
    site.observe(observation);
    note_array_guard_cache_slow_observation(site_id, &observation, site);
}

fn site_entry(
@@ -762,6 +889,9 @@ fn guard_observe(
    if site_id == 0 || !typed_feedback_enabled() {
        return contract_valid;
    }
    if array_guard_fast_pass(site_id, &observation, contract_valid) {
        return true;
    }
    let mut reg = registry();
    let site = site_entry(&mut reg, site_id, fallback_kind);
    let guard_passed = contract_valid
@@ -777,6 +907,7 @@ fn guard_observe(
        site.guard_failures = site.guard_failures.saturating_add(1);
    }
    site.observe(observation);
    note_array_guard_cache_slow_observation(site_id, &observation, site);
    guard_passed
}

@@ -1863,6 +1994,7 @@ pub fn scan_typed_feedback_roots_mut(visitor: &mut crate::gc::RuntimeRootVisitor
#[cfg(test)]
pub(crate) fn reset_typed_feedback_for_tests() {
    TRACE_DUMPED.store(false, Ordering::Release);
    reset_array_guard_fast_cache_for_tests();
    let mut reg = registry();
    *reg = TypedFeedbackRegistry::default();
}
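
A direct-mapped cache has exactly one slot per index, so two site IDs that map to the same slot compete for it. Reading the diff above, `note_array_guard_cache_slow_observation` returns early when the slot is already claimed by a different nonzero site ID, which suggests the later site simply stays on the slow `guard_observe` path rather than evicting the first. The sketch below copies `array_guard_cache_index` and the cache size from the diff to show which IDs share a slot. The site IDs and `main` are hypothetical, chosen only for illustration, and are not part of the PR.

```rust
// Copied from the diff: 4096 slots, so the index keeps the low 12 bits
// after folding some higher bits of the site ID into them.
const ARRAY_GUARD_FAST_CACHE_SIZE: usize = 4096;

fn array_guard_cache_index(site_id: u64) -> usize {
    let mixed = site_id ^ (site_id >> 32) ^ (site_id >> 17);
    (mixed as usize) & (ARRAY_GUARD_FAST_CACHE_SIZE - 1)
}

fn main() {
    // Hypothetical site IDs: 29 and 30 match the IDs used in the new tests;
    // 4097 and 1 << 17 are picked because they land in the same slot as 1.
    for site_id in [1u64, 29, 30, 4097, 1 << 17] {
        println!(
            "site {site_id} -> slot {}",
            array_guard_cache_index(site_id)
        );
    }
}
```

Small sequential IDs such as 29 and 30 get distinct slots, while 1, 4097, and `1 << 17` all resolve to slot 1. If the reading above is right, only the first of those to reach the slow path would be fast-cached.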
84 changes: 84 additions & 0 deletions crates/perry-runtime/src/typed_feedback/tests.rs
@@ -505,6 +505,90 @@ fn typed_feedback_numeric_array_get_guard_requires_numeric_layout() {
    assert_eq!(site.fallback_calls, 0);
}

#[test]
fn typed_feedback_numeric_array_guard_fast_path_preserves_snapshot_counts() {
    let _guard = TYPED_FEEDBACK_TEST_LOCK.lock().unwrap();
    reset_typed_feedback_for_tests();
    register(29, TypedFeedbackSiteKind::ArrayElement, "arr[i]");

    let values = [1.0, 2.0];
    let arr = crate::array::js_array_from_f64(values.as_ptr(), values.len() as u32);
    let arr_box = crate::value::js_nanbox_pointer(arr as i64);

    assert_eq!(
        js_typed_feedback_numeric_array_index_get_guard(29, arr_box, 0.0, 0, 1),
        1
    );
    assert_eq!(
        js_typed_feedback_numeric_array_index_get_guard(29, arr_box, 0.0, 0, 1),
        1
    );
    assert_eq!(
        js_typed_feedback_numeric_array_index_get_guard(29, arr_box, 0.0, 0, 1),
        1
    );
    assert_eq!(array_guard_cache_fast_passes(29), 2);

    let snapshot = typed_feedback_snapshot();
    let site = &snapshot.sites[0];
    assert_eq!(site.guard_passes, 3);
    assert_eq!(site.guard_failures, 0);
    assert_eq!(site.observed_count, 3);
    assert_eq!(site.observation_count, 1);
}

#[test]
fn typed_feedback_numeric_array_guard_fast_path_respects_megamorphic_state() {
    let _guard = TYPED_FEEDBACK_TEST_LOCK.lock().unwrap();
    reset_typed_feedback_for_tests();
    register(30, TypedFeedbackSiteKind::ArrayElement, "arr[i]");

    let values = [1.0, 2.0];
    let arr = crate::array::js_array_from_f64(values.as_ptr(), values.len() as u32);
    let arr_box = crate::value::js_nanbox_pointer(arr as i64);

    assert_eq!(
        js_typed_feedback_numeric_array_index_get_guard(30, arr_box, 0.0, 0, 1),
        1
    );
    assert_eq!(
        js_typed_feedback_numeric_array_index_get_guard(30, arr_box, 0.0, 0, 1),
        1
    );
    assert_eq!(array_guard_cache_fast_passes(30), 1);

    for class_id in 1..=POLYMORPHIC_CAP {
        observe(
            30,
            TypedFeedbackSiteKind::ArrayElement,
            Observation {
                source: ObservationSource::Array,
                object_addr: 0,
                shape_addr: 0,
                key_hash: 0,
                class_id: class_id as u32,
                heap_type: crate::gc::GC_TYPE_ARRAY as u16,
                aux: pack_array_aux(
                    ARRAY_ACCESS_INDEXED_IN_BOUNDS,
                    ARRAY_LAYOUT_POINTER_FREE,
                    STABLE_VALUE_NUMBER,
                    0,
                ),
                value_tag: STABLE_VALUE_NUMBER,
            },
        );
    }

    let guard = js_typed_feedback_numeric_array_index_get_guard(30, arr_box, 0.0, 0, 1);
    assert_eq!(guard, 0);

    let snapshot = typed_feedback_snapshot();
    let site = &snapshot.sites[0];
    assert_eq!(site.state, "megamorphic");
    assert_eq!(site.guard_passes, 2);
    assert_eq!(site.guard_failures, 1);
}

#[test]
fn typed_feedback_numeric_array_set_guard_requires_numeric_value_and_layout() {
    let _guard = TYPED_FEEDBACK_TEST_LOCK.lock().unwrap();