
perf(codegen): integer-specialize i < n loop guards for any/untyped bounds #5086

Merged
proggeramlug merged 1 commit into main from feat/loop-guard-any-bound on Jun 13, 2026

@proggeramlug proggeramlug commented Jun 13, 2026

Problem

A tight for (let i = 0; i < n; i++) integer loop whose bound n is not statically typed number — most commonly an any-typed or un-annotated value (e.g. a count from JSON.parse/untyped request data) — lowered its i < n / i <= n guard to a generic per-iteration comparison:

for.cond:
  %i  = load i32, ptr %ctr
  %id = sitofp i32 %i to double          ; vcvtsi2sd — every iteration
  %nd = load double, ptr %n              ; n kept as a NaN-boxed f64 on the stack
  %r  = call double @js_rel_lt(%id, %nd) ; callq — every iteration
  %b  = icmp eq i64 bitcast(%r), <TAG_TRUE>
  br i1 %b, label %body, label %exit

On the hot path of a compute kernel this is ~50× slower than an integer induction variable + icmp, and the call blocks SCEV / the loop vectorizer.

Why it looked arch-specific (it isn't)

The reporter observed this only on x86_64-linux, with arm64 "optimizing it correctly." But the emitted LLVM IR is byte-identical across both targets — only the target triple header line differs. js_rel_lt is a #[no_mangle] extern "C" runtime function in a separate compilation unit:

  • macOS uses the default auto-optimize build, which rebuilds + inlines the runtime so LLVM folds js_rel_lt into the loop and the call vanishes → looks optimized.
  • --target linux / Lambda links a prebuilt libperry_runtime.a with no cross-module inlining → the per-iteration callq survives → the ~50× hit.

The suboptimal IR was present on both arches; arm64 just masked it at link time. (A complementary follow-up — marking js_rel_* inlinable / enabling cross-module inlining on the prebuilt-archive path — would help every un-specialized hot comparison on Lambda, not just loop guards.)

Fix

A runtime-guarded i32 specialization extending the existing i < arr.length and i < n (number-typed) peepholes to any/untyped bounds:

  • New classify_for_local_bound_dynamic matches i < n / i <= n where n is an accessible (unboxed, non-module-global) local whose static type is not number/int32.
  • The loop head hoists, once, an is-number check (NaN-box tag test mirroring JSValue::is_number) plus fptosi(n).
  • The cond block branches on that loop-invariant flag:
    • for.cond.fast → icmp slt i32 (no per-iteration sitofp, no call)
    • for.cond.slow → the generic js_rel_lt path, preserving full JS coercion semantics for non-number values.
  • LLVM's LoopUnswitch peels the invariant branch into two loops at -O2+; even when the loop is not unswitched, the hot (is-number) path runs pure integer compares.
  • Added the SHORT_STRING_TAG ABI-mirror constant to codegen's nanbox.rs.
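
The shape of the specialization listed above can be modelled in plain Rust. This is a minimal sketch, not Perry's codegen: the three tag constants use a made-up layout chosen only so that ordinary doubles fall outside the band, and `sum_below` and its `slow_lt` closure are hypothetical stand-ins for the loop and for the generic `js_rel_lt` path.

```rust
// ASSUMPTION: illustrative tag layout only. The real values live in Perry's
// runtime and in crates/perry-codegen/src/nanbox.rs and may differ.
const TAG_MASK: u64 = 0xFFFF_0000_0000_0000;
const SHORT_STRING_TAG: u64 = 0x7FF9_0000_0000_0000; // assumed low end of the band
const STRING_TAG: u64 = 0x7FFF_0000_0000_0000; // assumed high end of the band

/// A value counts as a number unless its tag bits fall inside the
/// band [SHORT_STRING_TAG, STRING_TAG].
fn is_number(bits: u64) -> bool {
    let tag = bits & TAG_MASK;
    tag < SHORT_STRING_TAG || tag > STRING_TAG
}

/// Hypothetical model of `for (let i = 0; i < n; i++) sum += i`.
/// `slow_lt` stands in for the generic, coercing comparison.
/// Returns the sum and whether the fast path was taken.
fn sum_below(n_bits: u64, slow_lt: impl Fn(i32) -> bool) -> (i64, bool) {
    // Hoisted once, before the loop: the flag and the truncated bound.
    let fast = is_number(n_bits);
    // Rust's `as` cast saturates; the conversion is only done for numbers here.
    let bound = if fast { f64::from_bits(n_bits) as i32 } else { 0 };

    let mut sum = 0i64;
    let mut i = 0i32;
    loop {
        // Loop-invariant branch: integer compare on the fast path,
        // generic comparison on the slow path.
        let keep_going = if fast { i < bound } else { slow_lt(i) };
        if !keep_going {
            break;
        }
        sum += i64::from(i);
        i += 1;
    }
    (sum, fast)
}
```

Calling `sum_below(1_000_000.0_f64.to_bits(), ...)` takes the fast path, while a value tagged inside the assumed band (for example `STRING_TAG | 1`) falls back to the closure on every iteration. Because the branch condition never changes inside the loop, an optimizer is free to split it into two separate loops.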

When the bound is a primitive number, hoisting fptosi(n) once carries the same documented trust-types trade-off as the static number-typed path (a non-integer float bound shifts the trip count by at most one).
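
The trip-count shift can be illustrated with a small Rust model. Both function names are hypothetical, and the sketch assumes a loop that starts at 0 and increments by 1.

```rust
/// Trip count when the guard compares as doubles on every iteration,
/// as the generic path does for `i < n`.
fn trips_f64_guard(n: f64) -> u32 {
    let mut i = 0i32;
    let mut trips = 0u32;
    while f64::from(i) < n {
        i += 1;
        trips += 1;
    }
    trips
}

/// Trip count when the bound is truncated to i32 once, before the loop.
/// For in-range values `as i32` truncates toward zero, like fptosi.
fn trips_hoisted_i32(n: f64) -> u32 {
    let bound = n as i32;
    let mut i = 0i32;
    let mut trips = 0u32;
    while i < bound {
        i += 1;
        trips += 1;
    }
    trips
}
```

With an integral bound such as 3.0 both functions agree. With 3.5 the double comparison admits i = 3 while the truncated bound does not, so the counts differ by one.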

Verification

  • The any-bound guard now lowers to for.cond.fast: icmp slt i32 with no per-iteration sitofp/call; the entry block computes the is-number flag + fptosi(n) once.
  • Runtime correctness: numeric any bound → fast path, correct sum (499999500000); "3" string bound → slow path, coerces to 3; <= bound → 15.
  • Tests: perry-codegen 16/16; perry-hir + perry lib 522/0 plus integration suites green. The residual perry-runtime date/url failures and the issue_4909 HTTP-timeout test are pre-existing load/timezone flakes — unchanged with this patch stashed, and pass in isolation.

Touches only crates/perry-codegen (stmt/loops.rs, nanbox.rs), plus the version bump and changelog entry.

Summary by CodeRabbit

  • Bug Fixes

    • Improved loop performance when loop bounds come from dynamic/untyped values, reducing per-iteration overhead and speeding execution.
  • Documentation

    • Added changelog entry for v0.5.1164 describing the performance improvement and verification status.
  • Chores

    • Version bumped to v0.5.1164.

coderabbitai Bot commented Jun 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: cde6a90f-71eb-436e-9fb0-1f030666e868

📥 Commits

Reviewing files that changed from the base of the PR and between b87afd2 and c564446.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • CHANGELOG.md
  • CLAUDE.md
  • Cargo.toml
  • crates/perry-codegen/src/nanbox.rs
  • crates/perry-codegen/src/stmt/loops.rs
✅ Files skipped from review due to trivial changes (2)
  • Cargo.toml
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • CLAUDE.md
  • crates/perry-codegen/src/nanbox.rs
  • crates/perry-codegen/src/stmt/loops.rs

Walkthrough

This PR optimizes for-loop code generation with untyped (any) numeric bounds. It hoists a single is-number check and branches into a fast i32-compare path versus a slow coercion-preserving path, plus introduces an ABI-mirroring constant and bumps the workspace version to 0.5.1164.

Changes

Dynamic I32 Loop-Bound Optimization

| Layer / File(s) | Summary |
|---|---|
| Version and changelog documentation (Cargo.toml, CLAUDE.md, CHANGELOG.md) | Workspace version incremented to 0.5.1164, with changelog documenting the loop-guard specialization for any-typed bounds and its verification status. |
| ABI constant for short-string tagging (crates/perry-codegen/src/nanbox.rs) | Added SHORT_STRING_TAG constant to mirror the runtime's NaN-boxed tag band for short-string payload classification during codegen. |
| Dynamic i32 bound loop optimization (crates/perry-codegen/src/stmt/loops.rs) | Extended for-loop lowering with DynamicI32Bound helper to carry hoisted is-number flag and fptosi-converted i32 bound. Hoisting logic allocates counter i32 slot if missing, computes is-number flag from nanbox tag bits, and precomputes i32 bound. Condition generation adds branching on hoisted flag: fast path uses i32 icmp, slow path falls back to full truthiness lowering. Added cleanup to remove dynamically allocated counter slots, and added classify_for_local_bound_dynamic to detect any-typed numeric bounds while excluding static cases. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🐰 A loop with bounds so shy,
We hoist its checks way up on high,
Fast paths race with i32 grace,
Slow paths keep every coercion in place,
Hops of joy — optimized pace!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is comprehensive, detailing the problem, root cause analysis, the fix implementation, and verification results. However, the CLAUDE.md and CHANGELOG.md were edited despite the template instructions forbidding this. | Remove edits to CLAUDE.md and CHANGELOG.md as per template instructions; maintainers handle version/changelog updates at merge time. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: a performance optimization that adds integer specialization for loop guards with any/untyped bounds. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/perry-codegen/src/stmt/loops.rs`:
- Around line 435-491: The dynamic fast path currently hoists is_number and
fptosi for a bound local without ensuring that bound_id is not mutated inside
the loop, so add a mutability/modified-local guard before creating
DynamicI32Bound: in the closure passed to and_then(|(counter_id, bound_id, op)|
{ ... }) (the classify_for_local_bound_dynamic path) check whether bound_id can
be written by the loop body or update (e.g. consult whichever
analysis/collection tracks locals modified by the loop such as a
loop-modified-locals set or a Local.is_mutable/modified-in-loop query) and
return None if it is mutated; only proceed to allocate flag_slot/bound_i32_slot
and return Some(DynamicI32Bound) when bound_id is guaranteed not to be changed
by the loop. Ensure this check uses the same identifying symbol bound_id and
prevents the one-time hoist when the local is mutable.
- Around line 480-483: The code currently does an unconditional fptosi on n_dbl
and stores it to bound_i32_slot, which can produce poison for
NaN/Inf/out-of-range values; instead, ensure the fptosi is only executed after
the finite/i32-range checks (i.e., inside the fast-path block or after the
flag_slot guard) — move the fptosi(&n_dbl, I32) and the store to bound_i32_slot
into the same basic block that is taken when the fast-number checks pass (or
emit a guarded conversion using the boolean guard that writes only when the
guard is true), leaving the pre-loop block to only allocate bound_i32_slot and
produce no fptosi result; update any uses that load bound_i32_slot (e.g., the
icmp compare) to occur in the fast block after the store so they never observe
an unguarded conversion.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 60a74585-8d09-4f77-b456-7d5d2c1dcc88

📥 Commits

Reviewing files that changed from the base of the PR and between e092362 and b87afd2.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • CHANGELOG.md
  • CLAUDE.md
  • Cargo.toml
  • crates/perry-codegen/src/nanbox.rs
  • crates/perry-codegen/src/stmt/loops.rs

Comment thread crates/perry-codegen/src/stmt/loops.rs Outdated
Comment on lines +435 to +491
let dynamic_i32_bound: Option<DynamicI32Bound> =
    if hoist_classification.is_none() && local_bound_classification.is_none() {
        condition
            .and_then(|cond| classify_for_local_bound_dynamic(cond, ctx))
            .and_then(|(counter_id, bound_id, op)| {
                let bound_slot = ctx.locals.get(&bound_id).cloned()?;
                // Ensure an i32 counter slot exists (the Let site allocates
                // one for `integer_locals`, but allocate here if absent so
                // the fast path and Update stay in sync).
                let counter_i32_was_fresh =
                    if !ctx.i32_counter_slots.contains_key(&counter_id) {
                        let counter_slot = ctx.locals.get(&counter_id).cloned()?;
                        let i32_slot = ctx.func.alloca_entry(I32);
                        let cur_dbl = ctx.block().load(DOUBLE, &counter_slot);
                        let cur_i32 = ctx.block().fptosi(DOUBLE, &cur_dbl, I32);
                        ctx.block().store(I32, &cur_i32, &i32_slot);
                        ctx.i32_counter_slots.insert(counter_id, i32_slot);
                        true
                    } else {
                        false
                    };
                // One-time `is-number` test, mirroring runtime
                // `JSValue::is_number`: a value is a number unless its tag
                // bits fall in the Perry-owned band [SHORT_STRING_TAG,
                // STRING_TAG].
                let n_dbl = ctx.block().load(DOUBLE, &bound_slot);
                let n_bits = ctx.block().bitcast_double_to_i64(&n_dbl);
                let tag = ctx.block().and(
                    I64,
                    &n_bits,
                    &crate::nanbox::i64_literal(crate::nanbox::TAG_MASK),
                );
                let below = ctx.block().icmp_ult(
                    I64,
                    &tag,
                    &crate::nanbox::i64_literal(crate::nanbox::SHORT_STRING_TAG),
                );
                let above = ctx.block().icmp_ugt(
                    I64,
                    &tag,
                    &crate::nanbox::i64_literal(crate::nanbox::STRING_TAG),
                );
                let is_number = ctx.block().or(I1, &below, &above);
                let flag_slot = ctx.func.alloca_entry(I1);
                ctx.block().store(I1, &is_number, &flag_slot);
                // `fptosi(n)` is valid only on the fast (is-number) path.
                let bound_i32 = ctx.block().fptosi(DOUBLE, &n_dbl, I32);
                let bound_i32_slot = ctx.func.alloca_entry(I32);
                ctx.block().store(I32, &bound_i32, &bound_i32_slot);
                Some(DynamicI32Bound {
                    counter_id,
                    op,
                    flag_slot,
                    bound_i32_slot,
                    counter_i32_was_fresh,
                })
            })

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject the dynamic fast path when the bound local is mutable inside the loop.

This path hoists both is_number and fptosi(n) once, but it never proves that bound_id stays unchanged across the body/update. A loop like for (let i = 0; i < n; i++) { n = "0"; } will keep using the entry-time flag/bound and run too many iterations instead of re-reading n each trip.

Suggested guard
         if hoist_classification.is_none() && local_bound_classification.is_none() {
             condition
                 .and_then(|cond| classify_for_local_bound_dynamic(cond, ctx))
                 .and_then(|(counter_id, bound_id, op)| {
+                    if stmts_mutate_local(body, bound_id)
+                        || update.is_some_and(|expr| expr_mutates_local(expr, bound_id))
+                    {
+                        return None;
+                    }
                     let bound_slot = ctx.locals.get(&bound_id).cloned()?;
                     // Ensure an i32 counter slot exists (the Let site allocates
                     // one for `integer_locals`, but allocate here if absent so
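
The hazard behind this finding can be shown with a small Rust model of a loop whose body assigns to its own bound. The function names are hypothetical, and numbers stand in for the JS values (the `"0"` in the example above coerces to 0).

```rust
/// Reference semantics: the guard re-reads `n` on every iteration.
fn trips_reread(mut n: i32, new_n: i32) -> u32 {
    let mut trips = 0u32;
    let mut i = 0i32;
    while i < n {
        n = new_n; // loop body assigns to the bound
        trips += 1;
        i += 1;
    }
    trips
}

/// Hoisted semantics: the guard uses a copy of `n` taken before the loop.
/// Returns the trip count and the final value of `n`.
fn trips_hoisted(mut n: i32, new_n: i32) -> (u32, i32) {
    let bound = n; // captured once, never refreshed
    let mut trips = 0u32;
    let mut i = 0i32;
    while i < bound {
        n = new_n; // the write is invisible to the guard
        trips += 1;
        i += 1;
    }
    (trips, n)
}
```

Starting from n = 5 with a body that sets n to 0, the re-reading loop stops after one iteration, while the hoisted version runs all five. That gap is why the hoist is only sound when the bound is not written inside the loop.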
Comment thread crates/perry-codegen/src/stmt/loops.rs Outdated
Comment on lines +480 to +483
// `fptosi(n)` is valid only on the fast (is-number) path.
let bound_i32 = ctx.block().fptosi(DOUBLE, &n_dbl, I32);
let bound_i32_slot = ctx.func.alloca_entry(I32);
ctx.block().store(I32, &bound_i32, &bound_i32_slot);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

LLVM LangRef fptosi: what happens when converting NaN, ±Inf, or out-of-range double values to i32?

💡 Result:

In LLVM IR, the fptosi instruction converts a floating-point value to a signed integer by rounding towards zero [1][2]. If the source floating-point value (such as NaN, ±Inf, or a value that is out of range for the destination integer type) cannot be represented in the destination integer type, the result is a poison value [1][2]. Poison values are a form of undefined behavior in LLVM that propagate through instructions [1][2]. Because the conversion of NaN or out-of-range values produces poison, using the result of such an fptosi instruction in other operations can lead to undefined behavior for the program as a whole [3]. If you require defined behavior for these edge cases—such as clamping to the minimum or maximum representable integer—you should use the llvm.fptosi.sat intrinsic, which provides saturating conversion behavior [4][5].

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant region around the reported lines
FILE="crates/perry-codegen/src/stmt/loops.rs"
nl -ba "$FILE" | sed -n '430,540p'

Repository: PerryTS/perry

Length of output: 100


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Also search for the surrounding guard / fast-slow split context to see where the fptosi lands.
FILE="crates/perry-codegen/src/stmt/loops.rs"
rg -n "fptosi\\(|is-number|NaN|fast|slow|primitive number|DOUBLE|I32" "$FILE"

Repository: PerryTS/perry

Length of output: 6052


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
awk 'NR>=430 && NR<=510 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,120p'

Repository: PerryTS/perry

Length of output: 4893


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
awk 'NR>=440 && NR<=500 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,200p'

Repository: PerryTS/perry

Length of output: 3534


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
rg -n "for\\.cond\\.fast|for\\.cond\\.slow|cond_br\\(&flag|flag_slot|bound_slot|bound_i32_slot|fptosi\\(DOUBLE, &n_dbl, I32\\)" "$FILE"

Repository: PerryTS/perry

Length of output: 1597


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"

# Show the fast/slow split logic around where flag is used
sed -n '540,660p' "$FILE"

Repository: PerryTS/perry

Length of output: 5550


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"

rg -n "fptosi\\.sat|llvm\\.fptosi\\.sat|fptosi\\(|freeze\\b|is_finite|isfinite|fcmp" "$FILE"

Repository: PerryTS/perry

Length of output: 1879


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"

# Find where the slow path computes the bound/compare to see if it still uses bound_i32_slot
rg -n "for\\.cond\\.slow|for\\.cond\\.fast|js_rel_|js_.*compare|CompareOp" "$FILE" -n

Repository: PerryTS/perry

Length of output: 1679


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"

# Show a wider window around the DynamicI32Bound usage
sed -n '560,640p' "$FILE"

Repository: PerryTS/perry

Length of output: 3949


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
rg -n "classify_for_local_bound_dynamic" -S "$FILE"

Repository: PerryTS/perry

Length of output: 279


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"
# Print around the function definition and any helpers
sed -n '660,820p' "$FILE"

Repository: PerryTS/perry

Length of output: 6618


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"

# Show how the slow path computes/compares (look for compare helpers/calls)
rg -n "for\\.cond\\.slow|for\\.cond\\.fast|js_rel_|CompareOp::|js_.*rel|js_.*compare" "$FILE"

Repository: PerryTS/perry

Length of output: 1135


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"

# Wider window around DynamicI32Bound usage to confirm where bound_i32_slot is read
sed -n '560,640p' "$FILE"

Repository: PerryTS/perry

Length of output: 3949


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"

# Locate the classifier to see whether it excludes NaN/Inf/out-of-range (or only checks tag)
rg -n "classify_for_local_bound_dynamic" "$FILE"

Repository: PerryTS/perry

Length of output: 279


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="crates/perry-codegen/src/stmt/loops.rs"

# Print around potential helper definitions near the matcher result (best-effort)
# (If this misses, we'll narrow further after seeing line numbers from rg.)
sed -n '820,1040p' "$FILE"

Repository: PerryTS/perry

Length of output: 8246


Avoid speculatively fptosi-ing dynamic loop bounds (poison/UB risk)

At crates/perry-codegen/src/stmt/loops.rs:480-483, fptosi(DOUBLE, &n_dbl, I32) is computed unconditionally in the pre-loop block, then stored to bound_i32_slot even though only flag_slot gates use in the fast/slow condition. LLVM fptosi yields poison when the input is NaN/±Inf or out-of-range for i32, and the fast path subsequently loads that result for the icmp comparison.

// `fptosi(n)` is valid only on the fast (is-number) path.
let bound_i32 = ctx.block().fptosi(DOUBLE, &n_dbl, I32);
let bound_i32_slot = ctx.func.alloca_entry(I32);
ctx.block().store(I32, &bound_i32, &bound_i32_slot);

Strengthen the guard so only finite, i32-representable doubles reach fptosi (and/or emit the conversion only inside the fast block after those checks).
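
A range check of the kind suggested here could look like the following, sketched in plain Rust rather than in Perry's IR builder. The function name `f64_to_i32_checked` is hypothetical; the bounds are chosen so that truncation toward zero always lands inside the i32 range.

```rust
/// Returns `Some` only when `x` is finite and its truncation toward zero
/// fits in an i32; NaN, infinities, and out-of-range values yield `None`.
fn f64_to_i32_checked(x: f64) -> Option<i32> {
    // Both literals are exactly representable as f64.
    // NaN fails every comparison, so it is rejected here too.
    if x > -2147483649.0 && x < 2147483648.0 {
        // Safe: the truncated value is within [i32::MIN, i32::MAX].
        Some(x as i32)
    } else {
        None
    }
}
```

In the codegen, the equivalent comparisons would feed the fast-path flag so the converted bound is only consumed when the check passes. Rust's own `as` cast already saturates and maps NaN to 0, which corresponds to the `llvm.fptosi.sat` behavior mentioned in the query result above, whereas a raw `fptosi` has no such guarantee.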


… bounds

Tight integer loops whose bound is not statically typed `number` (most
commonly an `any`-typed or un-annotated count, e.g. out of `JSON.parse`)
lowered their `i < n` / `i <= n` guard to a generic per-iteration comparison:
`sitofp` the i32 counter back to a double, keep `n` as a NaN-boxed f64, and
`call @js_rel_lt` every iteration — ~50x slower than an integer induction
variable + `icmp`, and it blocks SCEV / the loop vectorizer.

The emitted IR is identical across x86_64 and arm64 (only the target-triple
header differs). It looked arch-specific only because the macOS auto-optimize
build inlines `js_rel_lt` and folds the call away, while the `--target linux`
prebuilt-runtime build keeps the per-iteration `callq` — the cause of poor
compute throughput on Lambda.

Fix: runtime-guarded i32 specialization extending the existing
`classify_for_local_bound` peephole. New `classify_for_local_bound_dynamic`
matches the shape; the loop head hoists, once, an `is-number` check (NaN-box
tag test mirroring `JSValue::is_number`) + `fptosi(n)`; the cond block branches
on that loop-invariant flag into `for.cond.fast` (`icmp slt i32`, no
per-iteration sitofp/call) and `for.cond.slow` (generic `js_rel_lt`, preserving
full JS coercion for non-number values). LoopUnswitch peels the invariant
branch into two loops at -O2+. Added the `SHORT_STRING_TAG` ABI-mirror constant
to codegen's nanbox.rs.

Verified: any-bound guards now lower to `icmp slt i32`; numeric any computes
correctly via the fast path; string/<= cases keep correct coercion via the slow
path. perry-codegen 16/16, perry-hir + perry 522/0 + integration green.
@proggeramlug proggeramlug force-pushed the feat/loop-guard-any-bound branch from b87afd2 to c564446 on June 13, 2026 12:20
@proggeramlug proggeramlug merged commit 396e3ff into main Jun 13, 2026
13 of 14 checks passed
@proggeramlug proggeramlug deleted the feat/loop-guard-any-bound branch June 13, 2026 13:13