perf: 1.7-2.4x faster SSIM (experimental, AI-generated) #183

Open
lilith wants to merge 7 commits into kornelski:main from lilith:pr-cleanup

lilith commented Mar 4, 2026

Summary

Experimental branch showing a 1.7-2.4x speedup in CPU user time on typical images. It is AI-generated and not intended for merging as-is; it is meant as inspiration for future optimizations.

What's here

  • Separable 5-tap blur — fused double 3-tap into single H5+V5 pass, halving memory traffic
  • blur_mul fusion — element-wise multiply folded into horizontal blur pass
  • AVX2+FMA SIMD (via archmage) for blur, RGB→Lab conversion, and 3-channel SSIM comparison
  • Pure Rust image loading — png/zune-jpeg/moxcms replace lodepng/load_image/lcms2
  • Edition 2024, MSRV 1.85, deny(unsafe_code) with targeted allows on FFI only
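
To illustrate why two 3-tap passes can be fused into one 5-tap pass: convolving a 3-tap kernel with itself yields the equivalent 5-tap kernel. The sketch below assumes a binomial kernel [1, 2, 1]/4 purely for illustration (the weights used in this branch are not shown here), and `fuse_3tap` is a hypothetical helper name.

```rust
/// Convolve a 3-tap kernel with itself, giving the 5-tap kernel that is
/// equivalent to applying the 3-tap kernel twice (away from image borders).
fn fuse_3tap(k: [f32; 3]) -> [f32; 5] {
    let mut out = [0.0f32; 5];
    for i in 0..3 {
        for j in 0..3 {
            // Tap i of the first pass combined with tap j of the second
            // pass lands at offset i + j in the fused kernel.
            out[i + j] += k[i] * k[j];
        }
    }
    out
}
```

With the assumed kernel `[0.25, 0.5, 0.25]`, this returns `[0.0625, 0.25, 0.375, 0.25, 0.0625]`, i.e. [1, 4, 6, 4, 1]/16.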

Benchmarks

CPU user time, averaged over 3-5 runs:

| Image size | Baseline (s) | This branch (s) | Speedup |
|---|---|---|---|
| 512×512 (CSIQ photo) | 0.068 | 0.028 | 2.4× |
| 1024×1024 | 0.340 | 0.190 | 1.8× |
| 4096×4096 | 4.963 | 2.857 | 1.7× |

Caveats

  • zenbitmaps is a git dependency (not published to crates.io)
  • AI-generated code — needs human review for correctness and style
  • Edge clamping in 5-tap blur differs slightly from double 3-tap at image borders (negligible for SSIM)
  • No avif/webp/mozjpeg feature support (C deps removed)
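
The edge-clamping caveat can be seen in one dimension. This is a minimal sketch, assuming replicate-edge clamping and a binomial [1, 2, 1]/4 kernel; neither is confirmed to match the code in this branch, and `clamp_idx` and `blur_1d` are hypothetical names.

```rust
/// Clamp a possibly out-of-range index into 0..len (replicate-edge policy).
fn clamp_idx(i: isize, len: usize) -> usize {
    i.clamp(0, len as isize - 1) as usize
}

/// Apply an odd-length kernel to a 1-D signal, clamping reads at the borders.
fn blur_1d(src: &[f32], kernel: &[f32]) -> Vec<f32> {
    let radius = (kernel.len() / 2) as isize;
    (0..src.len())
        .map(|x| {
            kernel
                .iter()
                .enumerate()
                .map(|(t, &w)| {
                    let i = clamp_idx(x as isize + t as isize - radius, src.len());
                    w * src[i]
                })
                .sum::<f32>()
        })
        .collect()
}
```

For the input `[16, 0, 0, 0, 0, 0]`, two passes with `[0.25, 0.5, 0.25]` give `[10, 5, 1, 0, 0, 0]`, while one pass with `[0.0625, 0.25, 0.375, 0.25, 0.0625]` gives `[11, 5, 1, 0, 0, 0]`. Only the border sample differs, because the double pass clamps an already-blurred value while the single pass clamps the raw input.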

lilith added 7 commits March 3, 2026 19:44
Apply cargo fmt across all source files.
Replace GammaComponent::max_value() method with COMPONENT_MAX const.
Fix std::f64::EPSILON -> f64::EPSILON legacy constant.
Suppress clippy::enum_variant_names on vImage_Flags (macOS FFI).
Rewrite blur as separable horizontal + vertical passes.
Fuse double 3-tap (H→V→H→V) into single 5-tap (H5→V5), halving
memory traffic.
Add blur_mul: fuses element-wise multiply into the horizontal blur
pass, used for img_sq_blur and img1_img2_blur.
Add compare_scale_3ch with manually unrolled 3-channel SSIM loop,
replacing LAB interleaving via multizip.
Parallelize 3-channel img1_img2_blur across channels.
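
A rough sketch of what folding the multiply into the horizontal pass could look like for a single row. The name `blur_mul_row`, its signature, and the clamped-border handling are assumptions for illustration, not the code from this commit.

```rust
/// Horizontal 5-tap blur of the element-wise product a[i] * b[i] for one row.
/// The product is formed while reading, so no intermediate buffer is written.
fn blur_mul_row(a: &[f32], b: &[f32], kernel: [f32; 5], out: &mut [f32]) {
    assert!(a.len() == b.len() && a.len() == out.len());
    let n = a.len();
    if n == 0 {
        return;
    }
    for x in 0..n {
        let mut acc = 0.0f32;
        for t in 0..5 {
            // Source index x + t - 2, clamped to the row bounds.
            let i = (x + t).saturating_sub(2).min(n - 1);
            acc += kernel[t] * (a[i] * b[i]);
        }
        out[x] = acc;
    }
}
```

Passing the same slice as both `a` and `b` would give a blurred square, and passing two different images would give a blurred cross term, which is presumably how the fused pass serves img_sq_blur and img1_img2_blur.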
Add archmage and magetypes as optional dependencies behind fma feature.
SIMD blur: blur_avx2, blur_in_place_avx2, blur_mul_avx2 using f32x8.
SIMD tolab: vectorized RGB→XYZ matrix multiply + cbrt polynomial.
SIMD SSIM: compare_3ch_avx2 processes 8 pixels per iteration.
Runtime dispatch via X64V3Token::summon() with scalar fallback.
All archmage code is safe — no unsafe blocks in SIMD modules.
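
For readers unfamiliar with runtime dispatch, the general pattern looks roughly like the sketch below. It deliberately does not use archmage; it swaps in the standard library's `is_x86_feature_detected!` macro, which (unlike the token approach described above) needs an `unsafe` call. The function names are hypothetical.

```rust
/// Scalar fallback: y[i] = a * x[i] + y[i].
fn scale_add_scalar(a: f32, x: &[f32], y: &mut [f32]) {
    for (yi, &xi) in y.iter_mut().zip(x) {
        *yi = a * xi + *yi;
    }
}

/// Same loop compiled with AVX2+FMA enabled, so the compiler may
/// auto-vectorize it with 256-bit instructions.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn scale_add_avx2(a: f32, x: &[f32], y: &mut [f32]) {
    for (yi, &xi) in y.iter_mut().zip(x) {
        *yi = a * xi + *yi;
    }
}

/// Pick the fastest available implementation at runtime.
pub fn scale_add(a: f32, x: &[f32], y: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") && std::is_x86_feature_detected!("fma") {
            // SAFETY: the required CPU features were detected just above.
            unsafe { scale_add_avx2(a, x, y) };
            return;
        }
    }
    scale_add_scalar(a, x, y)
}
```

Callers only see `scale_add`; on CPUs without AVX2 and FMA, or on other architectures, the scalar path runs instead.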
Replace lodepng/load_image/lcms2 C dependencies with pure Rust:
- PNG via `png` crate (8/16-bit, ICC profiles)
- JPEG via `zune-jpeg` (RGB and grayscale, ICC profiles)
- PNM/PAM via `zenbitmaps`
- ICC color management via moxcms (sRGB parametric TRC)
- Opaque alpha stripping, 16-bit big-endian byte-swap
- Remove avif/webp/mozjpeg features (C deps eliminated)
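
Two of the loading steps above are simple enough to sketch. PNG stores 16-bit samples big-endian, so they need converting to native `u16`, and an alpha channel that is fully opaque can be dropped. Both helpers below are hypothetical illustrations, not the functions from this commit.

```rust
/// Convert big-endian 16-bit sample bytes into native u16 values.
/// A trailing odd byte, if any, is ignored.
fn be_bytes_to_u16(raw: &[u8]) -> Vec<u16> {
    raw.chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect()
}

/// If every pixel of an 8-bit RGBA buffer is fully opaque, return the RGB
/// data without the alpha channel; otherwise return None.
fn strip_opaque_alpha(rgba: &[u8]) -> Option<Vec<u8>> {
    if rgba.chunks_exact(4).all(|px| px[3] == 255) {
        Some(
            rgba.chunks_exact(4)
                .flat_map(|px| [px[0], px[1], px[2]])
                .collect(),
        )
    } else {
        None
    }
}
```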
- Edition 2021 → 2024, rust-version 1.72 → 1.85
- #[no_mangle] → #[unsafe(no_mangle)] in c_api.rs
- Add unsafe {} blocks inside unsafe fns (edition 2024 requirement)
- cargo fmt for edition 2024 import sorting rules
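
As a small illustration of the edition 2024 changes listed above, an exported C function might look like the following. `example_sum` is a made-up function, not part of c_api.rs, and the `#[unsafe(no_mangle)]` syntax assumes a toolchain that accepts it (Rust 1.82 or later).

```rust
/// Sum `len` floats starting at `ptr`. Hypothetical FFI entry point.
///
/// # Safety
/// `ptr` must be null or point to `len` readable, initialized f32 values.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn example_sum(ptr: *const f32, len: usize) -> f32 {
    if ptr.is_null() {
        return 0.0;
    }
    // Edition 2024 warns on unsafe operations in an unsafe fn unless they
    // sit in an explicit unsafe block, so the block is written out here.
    let values = unsafe { std::slice::from_raw_parts(ptr, len) };
    values.iter().sum()
}
```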
- Add #![deny(unsafe_code)] crate-level lint to dssim-core
- Replace uninit_f32_vec (unsafe set_len) with zeroed_f32_vec (vec![0.0; n])
- #[allow(unsafe_code)] only on c_api and ffi modules (FFI boundary)
- blur module is now fully safe — no allow needed
- unsafe extern "C" block in ffi.rs (edition 2024 requirement)
- All non-FFI modules are verified unsafe-free (including SIMD via archmage)
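
The uninit-to-zeroed swap is essentially a one-liner. In this sketch the function body follows the `vec![0.0; n]` form named in the commit message, while the item-level `#[deny(unsafe_code)]` stands in for the crate-level `#![deny(unsafe_code)]` so the snippet compiles on its own.

```rust
/// Allocate `n` zero-initialized floats without any unsafe code.
/// Unlike reserving capacity and calling `set_len`, every element is
/// initialized before the vector is handed to the caller.
#[deny(unsafe_code)]
fn zeroed_f32_vec(n: usize) -> Vec<f32> {
    vec![0.0; n]
}
```

The trade-off is an extra zero-fill of the buffer in exchange for removing the `unsafe` block.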
Point at imazen/zenbitmaps rev 818992c instead of local path.