perf(codegen): integer-specialize i < n loop guards for any/untyped bounds#5086
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (5)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughThis PR optimizes for-loop code generation with untyped (any) numeric bounds. It hoists a single is-number check and branches into a fast i32-compare path versus a slow coercion-preserving path, plus introduces an ABI-mirroring constant and bumps the workspace version to 0.5.1164. ChangesDynamic I32 Loop-Bound Optimization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/perry-codegen/src/stmt/loops.rs`:
- Around line 435-491: The dynamic fast path currently hoists is_number and
fptosi for a bound local without ensuring that bound_id is not mutated inside
the loop, so add a mutability/modified-local guard before creating
DynamicI32Bound: in the closure passed to and_then(|(counter_id, bound_id, op)|
{ ... }) (the classify_for_local_bound_dynamic path) check whether bound_id can
be written by the loop body or update (e.g. consult whichever
analysis/collection tracks locals modified by the loop such as a
loop-modified-locals set or a Local.is_mutable/modified-in-loop query) and
return None if it is mutated; only proceed to allocate flag_slot/bound_i32_slot
and return Some(DynamicI32Bound) when bound_id is guaranteed not to be changed
by the loop. Ensure this check uses the same identifying symbol bound_id and
prevents the one-time hoist when the local is mutable.
- Around line 480-483: The code currently does an unconditional fptosi on n_dbl
and stores it to bound_i32_slot, which can produce poison for
NaN/Inf/out-of-range values; instead, ensure the fptosi is only executed after
the finite/i32-range checks (i.e., inside the fast-path block or after the
flag_slot guard) — move the fptosi(&n_dbl, I32) and the store to bound_i32_slot
into the same basic block that is taken when the fast-number checks pass (or
emit a guarded conversion using the boolean guard that writes only when the
guard is true), leaving the pre-loop block to only allocate bound_i32_slot and
produce no fptosi result; update any uses that load bound_i32_slot (e.g., the
icmp compare) to occur in the fast block after the store so they never observe
an unguarded conversion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 60a74585-8d09-4f77-b456-7d5d2c1dcc88
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
CHANGELOG.mdCLAUDE.mdCargo.tomlcrates/perry-codegen/src/nanbox.rscrates/perry-codegen/src/stmt/loops.rs
| let dynamic_i32_bound: Option<DynamicI32Bound> = | ||
| if hoist_classification.is_none() && local_bound_classification.is_none() { | ||
| condition | ||
| .and_then(|cond| classify_for_local_bound_dynamic(cond, ctx)) | ||
| .and_then(|(counter_id, bound_id, op)| { | ||
| let bound_slot = ctx.locals.get(&bound_id).cloned()?; | ||
| // Ensure an i32 counter slot exists (the Let site allocates | ||
| // one for `integer_locals`, but allocate here if absent so | ||
| // the fast path and Update stay in sync). | ||
| let counter_i32_was_fresh = | ||
| if !ctx.i32_counter_slots.contains_key(&counter_id) { | ||
| let counter_slot = ctx.locals.get(&counter_id).cloned()?; | ||
| let i32_slot = ctx.func.alloca_entry(I32); | ||
| let cur_dbl = ctx.block().load(DOUBLE, &counter_slot); | ||
| let cur_i32 = ctx.block().fptosi(DOUBLE, &cur_dbl, I32); | ||
| ctx.block().store(I32, &cur_i32, &i32_slot); | ||
| ctx.i32_counter_slots.insert(counter_id, i32_slot); | ||
| true | ||
| } else { | ||
| false | ||
| }; | ||
| // One-time `is-number` test, mirroring runtime | ||
| // `JSValue::is_number`: a value is a number unless its tag | ||
| // bits fall in the Perry-owned band [SHORT_STRING_TAG, | ||
| // STRING_TAG]. | ||
| let n_dbl = ctx.block().load(DOUBLE, &bound_slot); | ||
| let n_bits = ctx.block().bitcast_double_to_i64(&n_dbl); | ||
| let tag = ctx.block().and( | ||
| I64, | ||
| &n_bits, | ||
| &crate::nanbox::i64_literal(crate::nanbox::TAG_MASK), | ||
| ); | ||
| let below = ctx.block().icmp_ult( | ||
| I64, | ||
| &tag, | ||
| &crate::nanbox::i64_literal(crate::nanbox::SHORT_STRING_TAG), | ||
| ); | ||
| let above = ctx.block().icmp_ugt( | ||
| I64, | ||
| &tag, | ||
| &crate::nanbox::i64_literal(crate::nanbox::STRING_TAG), | ||
| ); | ||
| let is_number = ctx.block().or(I1, &below, &above); | ||
| let flag_slot = ctx.func.alloca_entry(I1); | ||
| ctx.block().store(I1, &is_number, &flag_slot); | ||
| // `fptosi(n)` is valid only on the fast (is-number) path. | ||
| let bound_i32 = ctx.block().fptosi(DOUBLE, &n_dbl, I32); | ||
| let bound_i32_slot = ctx.func.alloca_entry(I32); | ||
| ctx.block().store(I32, &bound_i32, &bound_i32_slot); | ||
| Some(DynamicI32Bound { | ||
| counter_id, | ||
| op, | ||
| flag_slot, | ||
| bound_i32_slot, | ||
| counter_i32_was_fresh, | ||
| }) | ||
| }) |
There was a problem hiding this comment.
Reject the dynamic fast path when the bound local is mutable inside the loop.
This path hoists both is_number and fptosi(n) once, but it never proves that bound_id stays unchanged across the body/update. A loop like for (let i = 0; i < n; i++) { n = "0"; } will keep using the entry-time flag/bound and run too many iterations instead of re-reading n each trip.
Suggested guard
if hoist_classification.is_none() && local_bound_classification.is_none() {
condition
.and_then(|cond| classify_for_local_bound_dynamic(cond, ctx))
.and_then(|(counter_id, bound_id, op)| {
+ if stmts_mutate_local(body, bound_id)
+ || update.is_some_and(|expr| expr_mutates_local(expr, bound_id))
+ {
+ return None;
+ }
let bound_slot = ctx.locals.get(&bound_id).cloned()?;
// Ensure an i32 counter slot exists (the Let site allocates
// one for `integer_locals`, but allocate here if absent so🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/perry-codegen/src/stmt/loops.rs` around lines 435 - 491, The dynamic
fast path currently hoists is_number and fptosi for a bound local without
ensuring that bound_id is not mutated inside the loop, so add a
mutability/modified-local guard before creating DynamicI32Bound: in the closure
passed to and_then(|(counter_id, bound_id, op)| { ... }) (the
classify_for_local_bound_dynamic path) check whether bound_id can be written by
the loop body or update (e.g. consult whichever analysis/collection tracks
locals modified by the loop such as a loop-modified-locals set or a
Local.is_mutable/modified-in-loop query) and return None if it is mutated; only
proceed to allocate flag_slot/bound_i32_slot and return Some(DynamicI32Bound)
when bound_id is guaranteed not to be changed by the loop. Ensure this check
uses the same identifying symbol bound_id and prevents the one-time hoist when
the local is mutable.
| // `fptosi(n)` is valid only on the fast (is-number) path. | ||
| let bound_i32 = ctx.block().fptosi(DOUBLE, &n_dbl, I32); | ||
| let bound_i32_slot = ctx.func.alloca_entry(I32); | ||
| ctx.block().store(I32, &bound_i32, &bound_i32_slot); |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
LLVM LangRef fptosi: what happens when converting NaN, ±Inf, or out-of-range doublevalues toi32?
💡 Result:
In LLVM IR, the fptosi instruction converts a floating-point value to a signed integer by rounding towards zero [1][2]. If the source floating-point value (such as NaN, ±Inf, or a value that is out of range for the destination integer type) cannot be represented in the destination integer type, the result is a poison value [1][2]. Poison values are a form of undefined behavior in LLVM that propagate through instructions [1][2]. Because the conversion of NaN or out-of-range values produces poison, using the result of such an fptosi instruction in other operations can lead to undefined behavior for the program as a whole [3]. If you require defined behavior for these edge cases—such as clamping to the minimum or maximum representable integer—you should use the llvm.fptosi.sat intrinsic, which provides saturating conversion behavior [4][5].
Citations:
- 1: https://reviews.llvm.org/D47851
- 2: https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20180604/559300.html
- 3: https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20250915/1734007.html
- 4: https://releases.llvm.org/15.0.0/docs/LangRef.html
- 5: https://releases.llvm.org/14.0.0/docs/LangRef.html
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the relevant region around the reported lines
FILE="crates/perry-codegen/src/stmt/loops.rs"
nl -ba "$FILE" | sed -n '430,540p'Repository: PerryTS/perry
Length of output: 100
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Also search for the surrounding guard / fast-slow split context to see where the fptosi lands.
FILE="crates/perry-codegen/src/stmt/loops.rs"
rg -n "fptosi\\(|is-number|NaN|fast|slow|primitive number|DOUBLE|I32" "$FILE"Repository: PerryTS/perry
Length of output: 6052
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
awk 'NR>=430 && NR<=510 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,120p'Repository: PerryTS/perry
Length of output: 4893
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
awk 'NR>=440 && NR<=500 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,200p'Repository: PerryTS/perry
Length of output: 3534
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
rg -n "for\\.cond\\.fast|for\\.cond\\.slow|cond_br\\(&flag|flag_slot|bound_slot|bound_i32_slot|fptosi\\(DOUBLE, &n_dbl, I32\\)" "$FILE"Repository: PerryTS/perry
Length of output: 1597
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
# Show the fast/slow split logic around where flag is used
sed -n '540,660p' "$FILE"Repository: PerryTS/perry
Length of output: 5550
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
rg -n "fptosi\\.sat|llvm\\.fptosi\\.sat|fptosi\\(|freeze\\b|is_finite|isfinite|fcmp" "$FILE"Repository: PerryTS/perry
Length of output: 1879
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
# Find where the slow path computes the bound/compare to see if it still uses bound_i32_slot
rg -n "for\\.cond\\.slow|for\\.cond\\.fast|js_rel_|js_.*compare|CompareOp" "$FILE" -nRepository: PerryTS/perry
Length of output: 1679
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
# Show a wider window around the DynamicI32Bound usage
sed -n '560,640p' "$FILE"Repository: PerryTS/perry
Length of output: 3949
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
rg -n "classify_for_local_bound_dynamic" -S "$FILE"Repository: PerryTS/perry
Length of output: 279
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
# Print around the function definition and any helpers
sed -n '660,820p' "$FILE"Repository: PerryTS/perry
Length of output: 6618
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
# Show the fast/slow split logic around where flag is used
sed -n '540,660p' "$FILE"Repository: PerryTS/perry
Length of output: 5550
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
rg -n "fptosi\\.sat|llvm\\.fptosi\\.sat|fptosi\\(|freeze\\b|is_finite|isfinite|fcmp" "$FILE"Repository: PerryTS/perry
Length of output: 1879
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
# Show how the slow path computes/compares (look for compare helpers/calls)
rg -n "for\\.cond\\.slow|for\\.cond\\.fast|js_rel_|CompareOp::|js_.*rel|js_.*compare" "$FILE"Repository: PerryTS/perry
Length of output: 1135
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
# Wider window around DynamicI32Bound usage to confirm where bound_i32_slot is read
sed -n '560,640p' "$FILE"Repository: PerryTS/perry
Length of output: 3949
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
# Locate the classifier to see whether it excludes NaN/Inf/out-of-range (or only checks tag)
rg -n "classify_for_local_bound_dynamic" "$FILE"Repository: PerryTS/perry
Length of output: 279
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
# Print around potential helper definitions near the matcher result (best-effort)
# (If this misses, we'll narrow further after seeing line numbers from rg.)
sed -n '820,1040p' "$FILE"Repository: PerryTS/perry
Length of output: 8246
Avoid speculatively fptosi-ing dynamic loop bounds (poison/UB risk)
At crates/perry-codegen/src/stmt/loops.rs:480-483, fptosi(DOUBLE, &n_dbl, I32) is computed unconditionally in the pre-loop block, then stored to bound_i32_slot even though only flag_slot gates use in the fast/slow condition. LLVM fptosi yields poison when the input is NaN/±Inf or out-of-range for i32, and the fast path subsequently loads that result for the icmp comparison.
// `fptosi(n)` is valid only on the fast (is-number) path.
let bound_i32 = ctx.block().fptosi(DOUBLE, &n_dbl, I32);
let bound_i32_slot = ctx.func.alloca_entry(I32);
ctx.block().store(I32, &bound_i32, &bound_i32_slot);Strengthen the guard so only finite, i32-representable doubles reach fptosi (and/or emit the conversion only inside the fast block after those checks).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/perry-codegen/src/stmt/loops.rs` around lines 480 - 483, The code
currently does an unconditional fptosi on n_dbl and stores it to bound_i32_slot,
which can produce poison for NaN/Inf/out-of-range values; instead, ensure the
fptosi is only executed after the finite/i32-range checks (i.e., inside the
fast-path block or after the flag_slot guard) — move the fptosi(&n_dbl, I32) and
the store to bound_i32_slot into the same basic block that is taken when the
fast-number checks pass (or emit a guarded conversion using the boolean guard
that writes only when the guard is true), leaving the pre-loop block to only
allocate bound_i32_slot and produce no fptosi result; update any uses that load
bound_i32_slot (e.g., the icmp compare) to occur in the fast block after the
store so they never observe an unguarded conversion.
… bounds Tight integer loops whose bound is not statically typed `number` (most commonly an `any`-typed or un-annotated count, e.g. out of `JSON.parse`) lowered their `i < n` / `i <= n` guard to a generic per-iteration comparison: `sitofp` the i32 counter back to a double, keep `n` as a NaN-boxed f64, and `call @js_rel_lt` every iteration — ~50x slower than an integer induction variable + `icmp`, and it blocks SCEV / the loop vectorizer. The emitted IR is identical across x86_64 and arm64 (only the target-triple header differs). It looked arch-specific only because the macOS auto-optimize build inlines `js_rel_lt` and folds the call away, while the `--target linux` prebuilt-runtime build keeps the per-iteration `callq` — the cause of poor compute throughput on Lambda. Fix: runtime-guarded i32 specialization extending the existing `classify_for_local_bound` peephole. New `classify_for_local_bound_dynamic` matches the shape; the loop head hoists, once, an `is-number` check (NaN-box tag test mirroring `JSValue::is_number`) + `fptosi(n)`; the cond block branches on that loop-invariant flag into `for.cond.fast` (`icmp slt i32`, no per-iteration sitofp/call) and `for.cond.slow` (generic `js_rel_lt`, preserving full JS coercion for non-number values). LoopUnswitch peels the invariant branch into two loops at -O2+. Added the `SHORT_STRING_TAG` ABI-mirror constant to codegen's nanbox.rs. Verified: any-bound guards now lower to `icmp slt i32`; numeric any computes correctly via the fast path; string/<= cases keep correct coercion via the slow path. perry-codegen 16/16, perry-hir + perry 522/0 + integration green.
b87afd2 to
c564446
Compare
Problem
A tight
for (let i = 0; i < n; i++)integer loop whose boundnis not statically typednumber— most commonly anany-typed or un-annotated value (e.g. a count fromJSON.parse/untyped request data) — lowered itsi < n/i <= nguard to a generic per-iteration comparison:On the hot path of a compute kernel this is ~50× slower than an integer induction variable +
icmp, and the call blocks SCEV / the loop vectorizer.Why it looked arch-specific (it isn't)
The reporter observed this only on x86_64-linux, with arm64 "optimizing it correctly." But the emitted LLVM IR is byte-identical across both targets — only the
target tripleheader line differs.js_rel_ltis a#[no_mangle] extern "C"runtime function in a separate compilation unit:js_rel_ltinto the loop and the call vanishes → looks optimized.--target linux/ Lambda links a prebuiltlibperry_runtime.awith no cross-module inlining → the per-iterationcallqsurvives → the ~50× hit.The suboptimal IR was present on both arches; arm64 just masked it at link time. (A complementary follow-up — marking
js_rel_*inlinable / enabling cross-module inlining on the prebuilt-archive path — would help every un-specialized hot comparison on Lambda, not just loop guards.)Fix
A runtime-guarded i32 specialization extending the existing
i < arr.lengthandi < n(number-typed) peepholes toany/untyped bounds:classify_for_local_bound_dynamicmatchesi < n/i <= nwherenis an accessible (unboxed, non-module-global) local whose static type is notnumber/int32.is-numbercheck (NaN-box tag test mirroringJSValue::is_number) plusfptosi(n).for.cond.fast→icmp slt i32(no per-iterationsitofp, no call)for.cond.slow→ the genericjs_rel_ltpath, preserving full JS coercion semantics for non-number values.SHORT_STRING_TAGABI-mirror constant to codegen'snanbox.rs.When the bound is a primitive number, hoisting
fptosi(n)once carries the same documented trust-types trade-off as the staticnumber-typed path (a non-integer float bound shifts the trip count by at most one).Verification
any-bound guard now lowers tofor.cond.fast: icmp slt i32with no per-iterationsitofp/call; the entry block computes theis-numberflag +fptosi(n)once.anybound → fast path, correct sum (499999500000);"3"string bound → slow path, coerces to3;<=bound →15.perry-codegen16/16;perry-hir+perrylib 522/0 plus integration suites green. The residualperry-runtimedate/url failures and theissue_4909HTTP-timeout test are pre-existing load/timezone flakes — unchanged with this patch stashed, and pass in isolation.Touches only
crates/perry-codegen(stmt/loops.rs,nanbox.rs); +version bump/changelog.Summary by CodeRabbit
Bug Fixes
Documentation
Chores