perf: 1.7-2.4x faster SSIM (experimental, AI-generated) by lilith · Pull Request #183 · kornelski/dssim

lilith · 2026-03-04T03:54:32Z

Summary

Experimental branch showing 1.7-2.4x speedup (CPU user time) on typical images. This is AI-generated and not intended for merging as-is — it's meant as inspiration for future optimizations.

What's here

- Separable 5-tap blur: double 3-tap fused into a single H5+V5 pass, halving memory traffic
- blur_mul fusion: element-wise multiply folded into the horizontal blur pass
- AVX2+FMA SIMD (via archmage) for blur, RGB→Lab conversion, and 3-channel SSIM comparison
- Pure Rust image loading: png/zune-jpeg/moxcms replace lodepng/load_image/lcms2
- Edition 2024, MSRV 1.85, deny(unsafe_code) with targeted allows on FFI only
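
Why the fusion is valid: blurring is linear, so away from the image borders two passes with a 3-tap kernel equal one pass with that kernel convolved with itself, and the four passes H→V→H→V collapse into two (H5→V5). A minimal sketch of the kernel arithmetic, assuming a hypothetical symmetric kernel `K3` (not the weights dssim actually uses); `convolve` and `fused_k5` are illustrative names, not functions from this branch:

```rust
/// Hypothetical symmetric 3-tap kernel, chosen so the arithmetic is exact in f32.
/// These are NOT the weights used by dssim.
const K3: [f32; 3] = [0.25, 0.5, 0.25];

/// Full 1-D convolution of two kernels (output length = a.len() + b.len() - 1).
fn convolve(a: &[f32], b: &[f32]) -> Vec<f32> {
    let mut out = vec![0.0; a.len() + b.len() - 1];
    for (i, &x) in a.iter().enumerate() {
        for (j, &y) in b.iter().enumerate() {
            out[i + j] += x * y;
        }
    }
    out
}

/// 5-tap kernel equivalent to applying `K3` twice along one axis.
fn fused_k5() -> Vec<f32> {
    convolve(&K3, &K3)
}
```

With these example weights, `fused_k5()` yields `[0.0625, 0.25, 0.375, 0.25, 0.0625]`, which still sums to 1.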

Benchmarks

CPU user time, averaged over 3-5 runs:

| Image size | Baseline (s) | This branch (s) | Speedup |
|---|---|---|---|
| 512×512 (CSIQ photo) | 0.068 | 0.028 | 2.4× |
| 1024×1024 | 0.340 | 0.190 | 1.8× |
| 4096×4096 | 4.963 | 2.857 | 1.7× |

Caveats

- zenbitmaps is a git dependency (not published to crates.io)
- AI-generated code: needs human review for correctness and style
- Edge clamping in 5-tap blur differs slightly from double 3-tap at image borders (negligible for SSIM)
- No avif/webp/mozjpeg feature support (C deps removed)
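
The edge-clamping caveat can be illustrated on a single row. This is a sketch under stated assumptions: the kernels `K3` and `K5` are hypothetical (not dssim's weights), clamp-to-edge is assumed as the border policy, and `blur_clamped`, `double_3tap`, and `single_5tap` are illustrative names. With double 3-tap, the second pass clamps against already-blurred border values, so the result near the edge differs from a single clamped 5-tap pass.

```rust
/// Hypothetical 3-tap weights (not dssim's actual kernel).
const K3: [f32; 3] = [0.25, 0.5, 0.25];
/// `K3` convolved with itself.
const K5: [f32; 5] = [0.0625, 0.25, 0.375, 0.25, 0.0625];

/// 1-D blur with clamp-to-edge: out-of-range taps reuse the nearest border sample.
/// Panics on an empty row.
fn blur_clamped(src: &[f32], kernel: &[f32]) -> Vec<f32> {
    assert!(!src.is_empty(), "row must not be empty");
    let radius = (kernel.len() / 2) as isize;
    let last = src.len() as isize - 1;
    (0..src.len() as isize)
        .map(|i| {
            kernel
                .iter()
                .enumerate()
                .map(|(k, &w)| {
                    let idx = (i + k as isize - radius).clamp(0, last);
                    w * src[idx as usize]
                })
                .sum::<f32>()
        })
        .collect()
}

/// Two clamped 3-tap passes along the same axis.
fn double_3tap(src: &[f32]) -> Vec<f32> {
    blur_clamped(&blur_clamped(src, &K3), &K3)
}

/// One clamped 5-tap pass.
fn single_5tap(src: &[f32]) -> Vec<f32> {
    blur_clamped(src, &K5)
}
```

For the row `[16, 0, 0, 0, 0, 0]` (an impulse on the border), `double_3tap` gives `[10, 5, 1, 0, 0, 0]` and `single_5tap` gives `[11, 5, 1, 0, 0, 0]`: only the border sample differs, and an impulse placed away from the borders produces identical output from both.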

Apply cargo fmt across all source files. Replace GammaComponent::max_value() method with COMPONENT_MAX const. Fix std::f64::EPSILON -> f64::EPSILON legacy constant. Suppress clippy::enum_variant_names on vImage_Flags (macOS FFI).

Rewrite blur as separable horizontal + vertical passes. Fuse double 3-tap (H→V→H→V) into single 5-tap (H5→V5), halving memory traffic. Add blur_mul: fuses element-wise multiply into the horizontal blur pass, used for img_sq_blur and img1_img2_blur. Add compare_scale_3ch with manually unrolled 3-channel SSIM loop, replacing LAB interleaving via multizip. Parallelize 3-channel img1_img2_blur across channels.
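
The blur_mul idea can be sketched on one row: instead of writing `a[i] * b[i]` into a temporary buffer and then blurring that buffer, the product is formed while the horizontal taps are gathered. This is an illustrative sketch, not the branch's code; the kernel `K5`, the clamp-to-edge border policy, and the helper names `blur_row_with`, `mul_then_blur`, and `blur_mul_row` are assumptions.

```rust
/// Hypothetical 5-tap weights for illustration (not dssim's actual kernel).
const K5: [f32; 5] = [0.0625, 0.25, 0.375, 0.25, 0.0625];

/// Horizontal 5-tap blur over `len` samples produced by `sample(index)`,
/// with clamp-to-edge at the borders. Panics if `len` is zero.
fn blur_row_with(len: usize, sample: impl Fn(usize) -> f32) -> Vec<f32> {
    assert!(len > 0, "row must not be empty");
    let last = len as isize - 1;
    (0..len as isize)
        .map(|i| {
            K5.iter()
                .enumerate()
                .map(|(k, &w)| w * sample((i + k as isize - 2).clamp(0, last) as usize))
                .sum::<f32>()
        })
        .collect()
}

/// Unfused: materialize the product row, then blur it (one extra buffer).
fn mul_then_blur(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    let product: Vec<f32> = a.iter().zip(b).map(|(x, y)| x * y).collect();
    blur_row_with(product.len(), |i| product[i])
}

/// Fused: multiply on the fly while gathering taps, with no temporary buffer.
fn blur_mul_row(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    blur_row_with(a.len(), |i| a[i] * b[i])
}
```

In this sketch `blur_mul_row(a, a)` corresponds to the squared-image case (img_sq_blur) and `blur_mul_row(a, b)` to the cross term (img1_img2_blur). The naive version recomputes each product for every tap that touches it; the point shown is only that the intermediate product image never has to be written to memory.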

Add archmage and magetypes as optional dependencies behind fma feature. SIMD blur: blur_avx2, blur_in_place_avx2, blur_mul_avx2 using f32x8. SIMD tolab: vectorized RGB→XYZ matrix multiply + cbrt polynomial. SIMD SSIM: compare_3ch_avx2 processes 8 pixels per iteration. Runtime dispatch via X64V3Token::summon() with scalar fallback. All archmage code is safe — no unsafe blocks in SIMD modules.
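
The dispatch pattern (detect CPU features once at run time, otherwise fall back to scalar code) can be sketched with the standard library alone. This deliberately does not use archmage: `is_x86_feature_detected!` from std stands in for the token-based check, and the "wide" path is plain safe Rust that accumulates in 8 lanes to mimic an f32x8 loop. All function names here are hypothetical.

```rust
/// Scalar fallback: plain sum of products.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Stand-in for a SIMD path: 8 independent accumulators, like one f32x8 register.
/// (Written in portable Rust; a real AVX2 path would use vector types.)
fn dot_wide(a: &[f32], b: &[f32]) -> f32 {
    let mut lanes = [0.0f32; 8];
    let (ca, cb) = (a.chunks_exact(8), b.chunks_exact(8));
    // Leftover elements that do not fill a whole 8-wide chunk.
    let tail: f32 = ca.remainder().iter().zip(cb.remainder()).map(|(x, y)| x * y).sum();
    for (xa, xb) in ca.zip(cb) {
        for i in 0..8 {
            lanes[i] += xa[i] * xb[i];
        }
    }
    lanes.iter().sum::<f32>() + tail
}

/// True when the running CPU reports both AVX2 and FMA.
#[cfg(target_arch = "x86_64")]
fn has_avx2_fma() -> bool {
    is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma")
}

/// Non-x86_64 targets always take the scalar path.
#[cfg(not(target_arch = "x86_64"))]
fn has_avx2_fma() -> bool {
    false
}

/// Runtime dispatch with scalar fallback.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    if has_avx2_fma() { dot_wide(a, b) } else { dot_scalar(a, b) }
}
```

Because the two paths add terms in a different order, their floating-point results can differ in the last bits on general input; they agree exactly on small integer-valued data.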

Replace lodepng/load_image/lcms2 C dependencies with pure Rust:

- PNG via `png` crate (8/16-bit, ICC profiles)
- JPEG via `zune-jpeg` (RGB and grayscale, ICC profiles)
- PNM/PAM via `zenbitmaps`
- ICC color management via moxcms (sRGB parametric TRC)
- Opaque alpha stripping, 16-bit big-endian byte-swap
- Remove avif/webp/mozjpeg features (C deps eliminated)
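
Two of the loader steps listed above are simple enough to sketch with the standard library only. PNG stores 16-bit samples big-endian, so they need converting to native `u16`; and an RGBA buffer whose alpha is 255 everywhere can be reduced to RGB. The function names and the exact policy (return `None` when any pixel is not fully opaque) are assumptions for illustration, not the branch's API.

```rust
/// Convert big-endian 16-bit samples (PNG byte order) into native `u16` values.
/// A trailing odd byte, if any, is ignored.
fn be_bytes_to_u16(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect()
}

/// Drop the alpha channel from RGBA8 data when every pixel is fully opaque.
/// Returns `None` if any pixel has alpha below 255.
fn strip_opaque_alpha(rgba: &[u8]) -> Option<Vec<u8>> {
    if rgba.chunks_exact(4).any(|px| px[3] != 255) {
        return None;
    }
    Some(
        rgba.chunks_exact(4)
            .flat_map(|px| px[..3].iter().copied())
            .collect(),
    )
}
```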

- Edition 2021 → 2024, rust-version 1.72 → 1.85
- #[no_mangle] → #[unsafe(no_mangle)] in c_api.rs
- Add unsafe {} blocks inside unsafe fns (edition 2024 requirement)
- cargo fmt for edition 2024 import sorting rules
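
The edition-related syntax changes look roughly like this. The functions `example_add` and `example_sum` are hypothetical, not taken from c_api.rs or ffi.rs, and the `abs` declaration is the C library function used only as a familiar FFI target. The sketch assumes Rust 1.82 or newer, where `#[unsafe(no_mangle)]` and `unsafe extern` are accepted.

```rust
use std::ffi::c_int;

// Extern blocks are written `unsafe extern` in edition 2024.
unsafe extern "C" {
    /// C standard library `abs`.
    fn abs(input: c_int) -> c_int;
}

/// Exported C symbol: `no_mangle` is wrapped in `unsafe(...)` in edition 2024.
#[unsafe(no_mangle)]
pub extern "C" fn example_add(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

/// # Safety
/// `ptr` must be valid for reads of `len` consecutive `f32` values.
pub unsafe fn example_sum(ptr: *const f32, len: usize) -> f32 {
    // Edition 2024 flags unsafe operations in an `unsafe fn` body unless they
    // sit in an explicit block (the `unsafe_op_in_unsafe_fn` lint).
    let slice = unsafe { std::slice::from_raw_parts(ptr, len) };
    slice.iter().sum()
}

/// Safe wrapper around the foreign function.
pub fn example_abs(x: i32) -> i32 {
    // SAFETY: `abs` has no pointer arguments; i32::MIN is avoided by the caller here.
    unsafe { abs(x) }
}
```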

- Add #![deny(unsafe_code)] crate-level lint to dssim-core
- Replace uninit_f32_vec (unsafe set_len) with zeroed_f32_vec (vec![0.0; n])
- #[allow(unsafe_code)] only on c_api and ffi modules (FFI boundary)
- blur module is now fully safe — no allow needed
- unsafe extern "C" block in ffi.rs (edition 2024 requirement)
- All non-FFI modules are verified unsafe-free (including SIMD via archmage)
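
The buffer-allocation change is small enough to show in full. The function name comes from the commit message; the body below is an assumption based on the `vec![0.0; n]` hint, and the removed pattern is shown only as a comment.

```rust
/// Safe replacement for an uninitialized buffer: allocate and zero-fill.
/// The standard library typically services an all-zero `vec!` with a zeroed
/// allocation, so this is usually cheaper than an explicit fill loop.
fn zeroed_f32_vec(len: usize) -> Vec<f32> {
    vec![0.0; len]
}

// The removed pattern looked roughly like this (assumed shape, needs `unsafe`):
//
//     let mut v = Vec::with_capacity(len);
//     unsafe { v.set_len(len) }; // contents uninitialized until written
//
// Reading such a buffer before every element is written is undefined behavior,
// which is why it cannot live under `#![deny(unsafe_code)]`.
```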

Point at imazen/zenbitmaps rev 818992c instead of local path.

lilith added 7 commits March 3, 2026 19:44

refactor: cargo fmt and clippy cleanup

16004ea

chore: update to Rust edition 2024, MSRV 1.85

725127d

chore: switch zenbitmaps from path dep to git rev

bacaad7

lilith wants to merge 7 commits into kornelski:main from lilith:pr-cleanup
