perf: rewrite composite_at as integer source-over kernel#10

Merged: AsPulse merged 1 commit into main from perf/composite-at on May 25, 2026

@AsPulse AsPulse commented May 25, 2026

Replaces composite_at — the per-pixel source-over kernel that every raster layer ultimately funnels into — with an integer fixed-point implementation. On memoization_benchmark (1920×1080 @ 60fps, 5s timeline) this drops wall-clock from 43.09s to 3.67s without the cache (~11.7x) and from 40.18s to 1.97s with CachingRenderContext (~20.4x). The previous PR identified composite_at as the bottleneck that memoization couldn't reach; this PR removes it as the bottleneck.

  • tellur-core::composite (new): single home for composite_at, replacing the two byte-identical copies that lived in tellur-core::layer and tellur-renderer::shadow. Future tuning has one site to touch.
  • Per-pixel math switches from f32 (with / 255.0, * 255.0, round(), clamp(), as u8) to u32 fixed-point with 255 as the unit. Porter-Duff straight-alpha source-over expands to out_a_x255 = sa * 255 + da * (255 - sa) and out_c = (sc * sa * 255 + dc * da * (255 - sa) + out_a_x255 / 2) / out_a_x255; intermediates peak at 255³ ≈ 1.7×10⁷, well inside u32. No as f32 or round/clamp left in the hot path.
  • Fully-transparent (sa == 0) source pixels continue without touching dst; fully-opaque (sa == 255) pixels go through a 4-byte copy_from_slice. The scalar blend only runs on genuinely partial coverage.
  • Inner loop reshaped row-wise: each row's dst_row / src_row slices are taken once and walked pixel by pixel via chunks_exact_mut(4).zip(chunks_exact(4)), so the per-channel bounds checks on src_pixels[idx + 0..3] collapse to one per pixel block and the codegen gets a friendlier shape to auto-vectorize.
  • composite::tests: property test sweeping 480 (sa, da, sr, dr) combinations confirms the integer kernel never diverges from an f64 straight-alpha oracle by more than 1 LSB on any channel. Plus targeted tests for the transparent / opaque / clipped / fully-clipped paths.
  • CachingRenderContext::DEFAULT_CAPACITY_BYTES lowered from 8 GiB to 1 GiB. With the faster kernel the cache no longer needs to be enormous to win: at 1 GiB the same scene caches 1019.99 MiB / evicts 2.69 GiB and still beats the 8 GiB run (1.97s vs 2.19s), because LRU naturally keeps the small high-reuse outputs and lets the large root images cycle out.
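
For readers who want the shape of the kernel without opening the diff, here is a minimal sketch of the fixed-point blend and the row-wise loop described in the bullets above. It is not the code from this PR: the names blend_pixel and composite_row are hypothetical, pixels are assumed to be straight-alpha RGBA8 with alpha in the last byte, and the rounding of the output alpha channel is an assumption, since the description only spells out the colour formula.

```rust
/// Blend one straight-alpha RGBA8 source pixel over a destination pixel.
/// Hypothetical helper (not the PR's actual function); both slices are
/// expected to hold at least 4 bytes, laid out as R, G, B, A.
#[inline]
pub fn blend_pixel(dst: &mut [u8], src: &[u8]) {
    let sa = src[3] as u32;
    if sa == 0 {
        // Fully transparent source: leave dst untouched.
        return;
    }
    if sa == 255 {
        // Fully opaque source: plain 4-byte copy.
        dst[..4].copy_from_slice(&src[..4]);
        return;
    }

    let da = dst[3] as u32;
    let inv_sa = 255 - sa;
    // Output alpha scaled by 255. Never zero here because sa >= 1.
    let out_a_x255 = sa * 255 + da * inv_sa;

    for c in 0..3 {
        let sc = src[c] as u32;
        let dc = dst[c] as u32;
        // Numerator stays near 255^3 (about 1.7e7), so u32 cannot overflow.
        // Adding out_a_x255 / 2 turns the truncating division into rounding.
        let num = sc * sa * 255 + dc * da * inv_sa + out_a_x255 / 2;
        dst[c] = (num / out_a_x255) as u8;
    }
    // Assumption: output alpha is rounded to nearest when scaled back to 0..=255.
    dst[3] = ((out_a_x255 + 127) / 255) as u8;
}

/// Composite one row of RGBA8 pixels. Walking both rows in 4-byte chunks
/// avoids per-channel index arithmetic on the full buffers.
pub fn composite_row(dst_row: &mut [u8], src_row: &[u8]) {
    for (d, s) in dst_row.chunks_exact_mut(4).zip(src_row.chunks_exact(4)) {
        blend_pixel(d, s);
    }
}
```

For example, half-transparent red [255, 0, 0, 128] over opaque blue [0, 0, 255, 255] comes out as [128, 0, 127, 255] under this sketch. Because zip stops at the shorter iterator, a clipped row would simply be passed in as a shorter slice pair.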

Per-type breakdown from memoization_benchmark (before → after):

| component | before (self) | after (self) |
| --- | --- | --- |
| Padding | 5.19s | 432.50ms |
| Stack | 3.47s | 339.58ms |
| BouncingDot | 219.04ms | 65.70ms |
| SolidRect | 5.18ms | 4.59ms |
| DropShadow | 1.24ms | 851µs |

@AsPulse AsPulse merged commit 5ac8377 into main May 25, 2026
14 checks passed
@AsPulse AsPulse deleted the perf/composite-at branch May 25, 2026 11:43