
codegen/perf: defensive struct-param copies + dead frame zero-fill + no small-aggregate scalarization #220

@0xGeorgii


Summary

For struct-heavy code, codegen emits a lot of avoidable linear-memory traffic. Three patterns, all visible in the simplest helpers:

  1. Struct parameters are defensively copied into the callee frame with memory.copy on entry, even when the function never mutates them.
  2. The whole frame is zero-filled with memory.fill in the prologue, then immediately and fully overwritten by those copies: a dead fill.
  3. Small fixed-size structs are never scalarized (no SROA, i.e. scalar replacement of aggregates). A 24-byte Vec3 always lives in linear memory and is accessed via load/store, so leaf readers allocate a shadow-stack frame they would not otherwise need.
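As a rough illustration of what an SROA eligibility test for pattern 3 might look like, the Rust sketch below works on a hypothetical, simplified summary of how a function uses one frame slot. `SlotUse`, `Layout`, `is_sroa_candidate`, `scalar_types` and the 32-byte `MAX_SROA_BYTES` threshold are illustrative assumptions, not part of infc. It also assumes Vec3 is three 8-byte i64 fields, which is consistent with the 24-byte size and the i64 loads in dot but is not confirmed by this report.

```rust
// Hypothetical SROA eligibility check; not infc's actual IR or API.

/// How a function touches one stack-frame slot holding an aggregate.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SlotUse {
    /// Load of `size` bytes at constant byte `offset` within the slot.
    Load { offset: u32, size: u32 },
    /// Store of `size` bytes at constant byte `offset` within the slot.
    Store { offset: u32, size: u32 },
    /// The slot's address is used for anything other than plain field
    /// access: passed to a call, stored to memory, returned, or indexed
    /// with a dynamic offset.
    AddressEscapes,
}

/// Aggregate layout as a list of (offset, size) scalar fields.
struct Layout {
    fields: Vec<(u32, u32)>,
}

/// Assumed size limit for "small" aggregates (illustrative value).
const MAX_SROA_BYTES: u32 = 32;

/// True if the aggregate is small enough, every use is a whole-field
/// load/store at a constant offset, and the address never escapes.
fn is_sroa_candidate(layout: &Layout, uses: &[SlotUse]) -> bool {
    let total = layout.fields.iter().map(|&(off, sz)| off + sz).max().unwrap_or(0);
    if total > MAX_SROA_BYTES {
        return false;
    }
    uses.iter().all(|u| match *u {
        SlotUse::Load { offset, size } | SlotUse::Store { offset, size } => {
            layout.fields.contains(&(offset, size))
        }
        SlotUse::AddressEscapes => false,
    })
}

/// Wasm value type of the local that would replace each field, or None if
/// some field has no matching scalar type.
fn scalar_types(layout: &Layout) -> Option<Vec<&'static str>> {
    layout
        .fields
        .iter()
        .map(|&(_, size)| match size {
            4 => Some("i32"),
            8 => Some("i64"),
            _ => None,
        })
        .collect()
}
```

A slot that passes this test could be replaced by one wasm local per field, so reads of a Vec3 temporary would become local.get rather than i64.load from the frame.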

Evidence — dot(a: Vec3, b: Vec3) -> i64 (a pure reader)

(func $dot (param $a i32) (param $b i32) (result i64)
  (local $f i64) (local $__frame_ptr i32)
  global.get 0  i32.const 48  i32.sub  local.tee $__frame_ptr  global.set 0
  local.get $__frame_ptr  i32.const 0  i32.const 48  memory.fill   ;; (2) zero whole frame…
  local.get $__frame_ptr  local.get $a  i32.const 24  memory.copy   ;; (1) …then copy param a in
  local.get $__frame_ptr  local.set $a
  local.get $__frame_ptr  i32.const 24  i32.add  local.get $b  i32.const 24  memory.copy  ;; …and b
  ...
  ;; only now: 6 i64.load + 2 i64.mul + 2 i64.add + i64.shr_s
)

dot only reads a and b, yet it allocates a 48-byte frame, zero-fills it, and copies both 24-byte params into it; the fill is fully clobbered by the two copies. With the copies elided, dot would need no frame at all and could load straight from the caller's pointers.
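One way to decide which parts of the prologue fill are dead is to collect the byte ranges that are unconditionally written before the first read of the frame and subtract them from the fill range. The Rust below is a minimal sketch, assuming those writes (here, the two param copies) have already been identified and are known to execute before every read; `residual_fill` is a hypothetical helper, not an infc API.

```rust
// Hypothetical dead-fill analysis; not infc's actual API.

/// Half-open byte range [start, end) relative to the frame pointer.
type Range = (u32, u32);

/// Given the prologue fill range and the ranges unconditionally written
/// before the first read of the frame, return the sub-ranges that still
/// need zeroing. An empty result means the whole memory.fill is dead.
fn residual_fill(fill: Range, mut writes: Vec<Range>) -> Vec<Range> {
    writes.sort_unstable();
    // Every byte of the fill below `cursor` is either overwritten or
    // already recorded as a gap.
    let (mut cursor, end) = fill;
    let mut gaps = Vec::new();
    for (start, stop) in writes {
        if cursor >= end {
            break; // fill range fully accounted for
        }
        if start > cursor {
            // Nothing writes [cursor, start): it must still be zeroed.
            gaps.push((cursor, start.min(end)));
        }
        cursor = cursor.max(stop);
    }
    if cursor < end {
        gaps.push((cursor, end));
    }
    gaps
}
```

For dot, the two copies cover [0, 24) and [24, 48), so the residual is empty and the memory.fill could be dropped entirely; a partially covered frame would keep a smaller fill for each remaining gap.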

Module-wide

For out/main.wasm (the ray tracer), the module contains 119 memory.copy, 26 memory.fill and 1492 local.get instructions, versus roughly 12 i64.mul and 2 i64.div_s of "real" arithmetic (most math is delegated to the linked fixmath). Hot-path examples: let sph: Sphere = scene[i] copies 80 bytes per sphere per bounce, and every Vec3 helper copies its parameters. None of this affects correctness (the output is correct), but it is the dominant cost for this workload.
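Module-wide numbers like these are static opcode occurrences. A rough way to reproduce such counts from wasmprinter's WAT output is sketched below in Rust. It is an approximation, not necessarily the method used for this report: it treats each whitespace-separated word (with parentheses trimmed) as a token, skips `;;` line comments, and does not handle `(; ;)` block comments or string literals.

```rust
use std::collections::BTreeMap;

/// Count static occurrences of the given opcodes in WAT text.
/// Approximation: tokens are whitespace-separated words with surrounding
/// parentheses trimmed; `;;` line comments are skipped, but `(; ;)` block
/// comments and string literals are not handled.
fn count_opcodes(wat: &str, opcodes: &[&str]) -> BTreeMap<String, usize> {
    let mut counts: BTreeMap<String, usize> =
        opcodes.iter().map(|op| (op.to_string(), 0)).collect();
    for line in wat.lines() {
        // Keep only the part of the line before a `;;` comment.
        let code = line.split(";;").next().unwrap_or("");
        for tok in code.split_whitespace() {
            let tok = tok.trim_matches(|c: char| c == '(' || c == ')');
            if let Some(n) = counts.get_mut(tok) {
                *n += 1;
            }
        }
    }
    counts
}
```

Run over the prologue lines of the dot excerpt, this reports 2 memory.copy and 1 memory.fill.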

Suggested directions

  • A simple mutation/escape analysis: pass struct params that are neither mutated nor escaping by reference, skipping the entry copy. Value semantics only require a copy when the callee mutates the param or its address escapes.
  • Skip the prologue memory.fill for frame regions that are fully overwritten before any read (here, the param slots). This is related to #188 ("Redundant zero-initialization in array literal codegen"), which fixed the all-zero array-literal case, but the trigger here is different: param copies rather than literal stores.
  • SROA for small fixed-size structs: keep Vec3-sized aggregates in i64 locals instead of linear memory, eliminating frame setup + load/store for temporaries.
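The first direction can be sketched as a per-parameter scan of how each struct parameter is used. The Rust below is a minimal, intraprocedural version under two stated assumptions: a callee-copies convention (each function decides for its own parameters, so forwarding the pointer to another call does not by itself force a copy), and no other path modifies the caller's storage while the call runs. Any use the analysis cannot track conservatively keeps today's copy. `Use`, `needs_entry_copy` and `copy_bytes_elided` are hypothetical names for illustration.

```rust
// Hypothetical per-parameter copy-elision check; not infc's actual API.

/// How a function body uses one by-pointer struct parameter.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Use {
    /// Load a field through the parameter pointer.
    ReadField,
    /// Store a field through the parameter pointer.
    WriteField,
    /// Forward the pointer as an argument to another call. Under the assumed
    /// callee-copies convention the callee applies this same rule, so this
    /// alone does not force a copy here.
    PassToCall,
    /// The pointer flows somewhere the analysis cannot track (stored to
    /// memory, returned, ...); conservatively keep the existing copy.
    Escape,
}

/// For each struct parameter, true if the entry memory.copy must be kept.
fn needs_entry_copy(uses_per_param: &[Vec<Use>]) -> Vec<bool> {
    uses_per_param
        .iter()
        .map(|uses| uses.iter().any(|u| matches!(u, Use::WriteField | Use::Escape)))
        .collect()
}

/// Total bytes of entry memory.copy removed, given each parameter's size.
fn copy_bytes_elided(param_sizes: &[u32], needs_copy: &[bool]) -> u32 {
    param_sizes
        .iter()
        .zip(needs_copy.iter())
        .filter(|&(_, &keep)| !keep)
        .map(|(&size, _)| size)
        .sum()
}
```

Under these assumptions dot keeps neither entry copy, which removes 48 bytes of memory.copy per call.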

Related: #188 (redundant zero-init in array-literal codegen, fixed).


Environment: infc 0.0.1 (commit df45600, ABI 1.0), Inferara/inference @ d82d935 (wasm-linker). Found while implementing a fixed-point ray tracer entirely in Inference (scratch/raytracing-in-one-weekend). WAT produced via wasmprinter.
