perf: rewrite composite_at as integer source-over kernel #10
Merged
Replaces `composite_at`, the per-pixel source-over kernel that every raster layer ultimately funnels into, with an integer fixed-point implementation. On `memoization_benchmark` (1920×1080 @ 60fps, 5s timeline) this drops wall-clock from 43.09s to 3.67s without the cache (~11.7x) and from 40.18s to 1.97s with `CachingRenderContext` (~20.4x). The previous PR identified `composite_at` as the bottleneck that memoization couldn't reach; with this PR it is no longer the bottleneck.

- `tellur-core::composite` (new): single home for `composite_at`, replacing the two byte-identical copies that lived in `tellur-core::layer` and `tellur-renderer::shadow`. Future tuning has one site to touch.
- Arithmetic moves from `f32` (with `/ 255.0`, `* 255.0`, `round()`, `clamp()`, `as u8`) to `u32` fixed-point with 255 as the unit. Porter-Duff straight-alpha source-over expands to `out_a_x255 = sa * 255 + da * (255 - sa)` and `out_c = (sc * sa * 255 + dc * da * (255 - sa) + out_a_x255 / 2) / out_a_x255`; intermediates peak at 255³ ≈ 1.7×10⁷, well inside `u32`. No `as f32` or `round` / `clamp` left in the hot path.
- Fully-transparent (`sa == 0`) source pixels `continue` without touching `dst`; fully-opaque (`sa == 255`) pixels go through a 4-byte `copy_from_slice`. The scalar blend only runs on genuinely partial coverage.
- Each `dst_row` / `src_row` slice is sliced once via `chunks_exact_mut(4).zip(chunks_exact(4))`, so the bounds check on `src_pixels[idx + 0..3]` collapses to one per pixel block and the codegen gets a friendlier shape to auto-vectorize.
- `composite::tests`: property test sweeping 480 `(sa, da, sr, dr)` combinations confirms the integer kernel never diverges from an `f64` straight-alpha oracle by more than 1 LSB on any channel. Plus targeted tests for the transparent / opaque / clipped / fully-clipped paths.
- `CachingRenderContext::DEFAULT_CAPACITY_BYTES` lowered from 8 GiB to 1 GiB. With the faster kernel the cache no longer needs to be enormous to win: at 1 GiB the same scene caches 1019.99 MiB / evicts 2.69 GiB and still beats the 8 GiB run (1.97s vs 2.19s), because LRU naturally keeps the small high-reuse outputs and lets the large root images cycle out.

Per-type breakdown from `memoization_benchmark` (before → after) covers Padding, Stack, BouncingDot, SolidRect, and DropShadow.
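For reviewers who want to see the arithmetic concretely, here is a minimal sketch of the fixed-point source-over blend and the row loop described above. It is not the code in `tellur-core::composite`: the function names `blend_pixel` and `composite_row` are hypothetical, clipping and placement offsets are left out, and the rounding of the output alpha (`(out_a_x255 + 127) / 255`) is an assumption, since the description only spells out the colour-channel formula.

```rust
/// Blend one straight-alpha RGBA8 source pixel over a destination pixel.
/// Both slices are expected to hold exactly 4 bytes (R, G, B, A).
/// Hypothetical helper, written only to illustrate the formulas above.
fn blend_pixel(dst: &mut [u8], src: &[u8]) {
    let sa = src[3] as u32;

    // Fast path: fully transparent source leaves the destination untouched.
    if sa == 0 {
        return;
    }
    // Fast path: fully opaque source simply overwrites the destination.
    if sa == 255 {
        dst.copy_from_slice(src);
        return;
    }

    let da = dst[3] as u32;
    let inv_sa = 255 - sa;

    // Output alpha scaled by 255. Because sa >= 1 here, this is at least 255,
    // so the divisions below can never divide by zero.
    let out_a_x255 = sa * 255 + da * inv_sa;

    for c in 0..3 {
        let sc = src[c] as u32;
        let dc = dst[c] as u32;
        // Peak value is about 255^3 (~1.7e7), far below u32::MAX.
        let num = sc * sa * 255 + dc * da * inv_sa + out_a_x255 / 2;
        dst[c] = (num / out_a_x255) as u8;
    }

    // Assumption: round the alpha back to 8 bits to the nearest integer.
    dst[3] = ((out_a_x255 + 127) / 255) as u8;
}

/// Composite one row of RGBA8 pixels, pairing 4-byte chunks of both rows
/// so each pixel is bounds-checked as a block rather than per byte.
fn composite_row(dst_row: &mut [u8], src_row: &[u8]) {
    for (d, s) in dst_row.chunks_exact_mut(4).zip(src_row.chunks_exact(4)) {
        blend_pixel(d, s);
    }
}
```

Calling `composite_row(&mut dst, &src)` on two equally sized RGBA8 rows blends `src` over `dst` in place. Trailing bytes that do not form a whole pixel are ignored by `chunks_exact`, and a shorter row simply ends the `zip` early.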