perf: rewrite composite_at as integer source-over kernel by AsPulse · Pull Request #10 · comnipl/tellur

AsPulse · 2026-05-25T11:33:58Z

Replaces composite_at, the per-pixel source-over kernel that every raster layer ultimately funnels into, with an integer fixed-point implementation. On memoization_benchmark (1920×1080 @ 60fps, 5s timeline) this drops wall-clock time from 43.09s to 3.67s without the cache (~11.7x) and from 40.18s to 1.97s with CachingRenderContext (~20.4x). The previous PR identified composite_at as the bottleneck that memoization couldn't reach; with this PR it is no longer the bottleneck.

tellur-core::composite (new): single home for composite_at, replacing the two byte-identical copies that lived in tellur-core::layer and tellur-renderer::shadow. Future tuning has one site to touch.
Per-pixel math switches from f32 (with / 255.0, * 255.0, round(), clamp(), as u8) to u32 fixed-point with 255 as the unit. With sa / da the source and destination alpha and sc / dc a source and destination color channel (all in 0..=255), Porter-Duff straight-alpha source-over expands to out_a_x255 = sa * 255 + da * (255 - sa) and out_c = (sc * sa * 255 + dc * da * (255 - sa) + out_a_x255 / 2) / out_a_x255; intermediates peak at about 255³ ≈ 1.7×10⁷, well inside u32. No f32 conversion, round(), or clamp() is left in the hot path.
Fully-transparent (sa == 0) source pixels continue without touching dst; fully-opaque (sa == 255) pixels go through a 4-byte copy_from_slice. The scalar blend only runs on genuinely partial coverage.
Inner loop reshaped row-wise: each row's dst_row / src_row is sliced once and then walked via chunks_exact_mut(4).zip(chunks_exact(4)), so the bounds check on src_pixels[idx + 0..3] collapses to one per pixel block and the codegen gets a friendlier shape to auto-vectorize.
composite::tests: property test sweeping 480 (sa, da, sr, dr) combinations confirms the integer kernel never diverges from an f64 straight-alpha oracle by more than 1 LSB on any channel. Plus targeted tests for the transparent / opaque / clipped / fully-clipped paths.
CachingRenderContext::DEFAULT_CAPACITY_BYTES lowered from 8 GiB to 1 GiB. With the faster kernel the cache no longer needs to be enormous to win: at 1 GiB the same scene caches 1019.99 MiB / evicts 2.69 GiB and still beats the 8 GiB run (1.97s vs 2.19s), because LRU naturally keeps the small high-reuse outputs and lets the large root images cycle out.
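The kernel changes listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the names blend_pixel, composite_rows and oracle_channel, the parameter layout, and the rounding used for the output alpha byte are all hypothetical; only the out_a_x255 / out_c formulas, the sa == 0 and sa == 255 fast paths, and the chunks_exact_mut(4).zip(chunks_exact(4)) row walk come from the description.

```rust
/// Blend one straight-alpha RGBA8 source pixel over `dst` in place.
/// Both slices must be exactly 4 bytes long (hypothetical helper).
pub fn blend_pixel(dst: &mut [u8], src: &[u8]) {
    let sa = src[3] as u32;
    match sa {
        0 => {}                          // fully transparent: dst untouched
        255 => dst.copy_from_slice(src), // fully opaque: 4-byte copy
        _ => {
            let da = dst[3] as u32;
            let inv = 255 - sa;
            // Output alpha scaled by 255; non-zero because sa > 0.
            let out_a_x255 = sa * 255 + da * inv;
            for c in 0..3 {
                let (sc, dc) = (src[c] as u32, dst[c] as u32);
                // Adding out_a_x255 / 2 rounds to nearest; the quotient is <= 255.
                let num = sc * sa * 255 + dc * da * inv + out_a_x255 / 2;
                dst[c] = (num / out_a_x255) as u8;
            }
            // Assumption: alpha byte is out_a_x255 / 255 rounded to nearest.
            dst[3] = ((out_a_x255 + 127) / 255) as u8;
        }
    }
}

/// Composite an RGBA8 `src` image over `dst` at offset (off_x, off_y),
/// clipping to the destination bounds (hypothetical signature).
pub fn composite_rows(
    dst: &mut [u8], dst_w: usize, dst_h: usize,
    src: &[u8], src_w: usize, src_h: usize,
    off_x: i32, off_y: i32,
) {
    let (ox, oy) = (off_x as i64, off_y as i64);
    // Visible rectangle in destination coordinates.
    let (x0, y0) = (ox.max(0), oy.max(0));
    let x1 = (ox + src_w as i64).min(dst_w as i64);
    let y1 = (oy + src_h as i64).min(dst_h as i64);
    if x0 >= x1 || y0 >= y1 {
        return; // fully clipped
    }
    let row_bytes = (x1 - x0) as usize * 4;
    for y in y0..y1 {
        let d = (y as usize * dst_w + x0 as usize) * 4;
        let s = ((y - oy) as usize * src_w + (x0 - ox) as usize) * 4;
        // Slice each row once, then walk both rows in 4-byte pixel chunks.
        let dst_row = &mut dst[d..d + row_bytes];
        let src_row = &src[s..s + row_bytes];
        for (dp, sp) in dst_row.chunks_exact_mut(4).zip(src_row.chunks_exact(4)) {
            blend_pixel(dp, sp);
        }
    }
}

/// f64 straight-alpha reference for one color channel, for comparison only.
pub fn oracle_channel(sc: u8, sa: u8, dc: u8, da: u8) -> f64 {
    let (sa, da) = (sa as f64 / 255.0, da as f64 / 255.0);
    let out_a = sa + da * (1.0 - sa);
    if out_a == 0.0 {
        return 0.0;
    }
    (sc as f64 * sa + dc as f64 * da * (1.0 - sa)) / out_a
}
```

In this sketch, blending the source pixel [255, 0, 0, 128] over the opaque destination [0, 0, 255, 255] yields [128, 0, 127, 255]: out_a_x255 is 128 * 255 + 255 * 127 = 65025, and each channel is the rounded quotient of its numerator by that value.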

Per-type breakdown from memoization_benchmark (before → after):

component	before self	after self
`Padding`	5.19s	432.50ms
`Stack`	3.47s	339.58ms
`BouncingDot`	219.04ms	65.70ms
`SolidRect`	5.18ms	4.59ms
`DropShadow`	1.24ms	851µs

perf: rewrite composite_at as integer source-over kernel

5c4cfe8

AsPulse merged commit 5ac8377 into main May 25, 2026
14 checks passed

AsPulse deleted the perf/composite-at branch May 25, 2026 11:43

AsPulse merged 1 commit into main from perf/composite-at
