PERF: AlignSections Filters OoC Optimization #1560

Draft
joeykleingers wants to merge 3 commits into BlueQuartzSoftware:develop from joeykleingers:worktree-OptimizeGroupE

@joeykleingers joeykleingers commented Mar 5, 2026

Summary

Adds slice-buffered OOC algorithm paths for the 4 AlignSections filters, using dual dispatch (Strategy C) to add OOC-optimized variants while leaving the original in-core code untouched.

Changes

Base class (AlignSections.cpp)

  • New AlignSectionsTransferDataOocImpl<T> dispatched when any cell array is OOC
  • Reads each Z-slice into a local buffer, applies the 2D X/Y shift in memory, writes back — eliminates per-tuple chunk thrashing
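
The read, shift, write-back pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in AlignSections.cpp: `ElementStore` and `shiftSlicesBuffered` are hypothetical names, tuples are assumed to have a single component, and the shift sign convention (destination cell `(x, y)` takes its value from `(x + xShift, y + yShift)`, zero-filled when out of bounds) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a data store whose operator[] is expensive
// when the array lives out-of-core.
template <typename T>
struct ElementStore
{
  std::vector<T> data;
  T& operator[](std::size_t i)
  {
    return data[i];
  }
};

// Applies a per-slice 2D shift using two slice-sized in-memory buffers.
// Each element of the store is touched once for reading and once for writing,
// both in sequential order.
template <typename T>
void shiftSlicesBuffered(ElementStore<T>& store, std::size_t dimX, std::size_t dimY, std::size_t dimZ,
                         const std::vector<std::int64_t>& xShifts, const std::vector<std::int64_t>& yShifts)
{
  assert(xShifts.size() == dimZ && yShifts.size() == dimZ);
  assert(store.data.size() == dimX * dimY * dimZ);

  const std::size_t sliceSize = dimX * dimY;
  std::vector<T> src(sliceSize);
  std::vector<T> dst(sliceSize);

  for(std::size_t z = 0; z < dimZ; z++)
  {
    const std::size_t base = z * sliceSize;

    // 1. Sequential read of the whole slice into the local buffer.
    for(std::size_t i = 0; i < sliceSize; i++)
    {
      src[i] = store[base + i];
    }

    // 2. Apply the X/Y shift entirely in memory.
    for(std::size_t y = 0; y < dimY; y++)
    {
      for(std::size_t x = 0; x < dimX; x++)
      {
        const std::int64_t srcX = static_cast<std::int64_t>(x) + xShifts[z];
        const std::int64_t srcY = static_cast<std::int64_t>(y) + yShifts[z];
        const bool inBounds = srcX >= 0 && srcX < static_cast<std::int64_t>(dimX) && srcY >= 0 && srcY < static_cast<std::int64_t>(dimY);
        dst[y * dimX + x] = inBounds ? src[static_cast<std::size_t>(srcY) * dimX + static_cast<std::size_t>(srcX)] : T{};
      }
    }

    // 3. Sequential write-back of the shifted slice.
    for(std::size_t i = 0; i < sliceSize; i++)
    {
      store[base + i] = dst[i];
    }
  }
}
```

The point of the two buffers is that the random-access part of the shift only ever touches local memory, so the store sees strictly sequential traffic.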

AlignSectionsMisorientation

  • New findShiftsOoc() buffers 2 adjacent Z-slices of quats, cellPhases, and mask before the convergence loop
  • Pre-loads first reference slice, then std::swap cur→ref each iteration to avoid re-reading the reference from DataStore
  • Eliminates repeated ZarrStore reads during the 7×7 candidate shift grid convergence
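
The reference-slice swap can be illustrated with a small sketch. `SliceReader` and `visitAdjacentSlices` are hypothetical names, the volume is a plain `float` array rather than quats/phases/mask, and the iteration order (ascending Z) is an assumption. What it shows is that each slice is loaded from the store exactly once, with the convergence loop working only on the two buffers.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical slice loader that counts how many slice reads hit the store.
class SliceReader
{
public:
  SliceReader(const std::vector<float>& volume, std::size_t sliceSize)
  : m_Volume(volume)
  , m_SliceSize(sliceSize)
  {
  }

  void load(std::size_t z, std::vector<float>& out)
  {
    const auto begin = m_Volume.begin() + static_cast<std::ptrdiff_t>(z * m_SliceSize);
    out.assign(begin, begin + static_cast<std::ptrdiff_t>(m_SliceSize));
    m_Loads++;
  }

  std::size_t loads() const
  {
    return m_Loads;
  }

private:
  const std::vector<float>& m_Volume;
  std::size_t m_SliceSize;
  std::size_t m_Loads = 0;
};

// Visits every adjacent (reference, current) slice pair.
// onPair is where a shift-search loop would run, reading only the buffers.
template <typename PairFn>
void visitAdjacentSlices(SliceReader& reader, std::size_t numSlices, PairFn onPair)
{
  if(numSlices < 2)
  {
    return;
  }
  std::vector<float> ref;
  std::vector<float> cur;
  reader.load(0, ref); // pre-load the first reference slice
  for(std::size_t z = 1; z < numSlices; z++)
  {
    reader.load(z, cur);
    onPair(z, ref, cur);
    // O(1) pointer swap: this iteration's current slice becomes the next reference.
    std::swap(ref, cur);
  }
}
```

Without the swap, each iteration would load both slices, so a volume of N slices would cost 2(N-1) slice reads instead of N.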

AlignSectionsMutualInformation

  • New formFeaturesSectionsOoc() buffers one Z-slice of quats, cellPhases, and mask for the per-slice 2D flood-fill segmentation
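
As a rough illustration of running a per-slice flood fill against buffered data, the sketch below segments one already-buffered slice. It is deliberately simplified: the hypothetical `segmentSlice` groups cells by equal phase with 4-connectivity, whereas the filter's real grouping criterion also involves the quats. The property it demonstrates is that the random neighbor accesses of a flood fill only touch local vectors.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Labels connected regions of one buffered 2D slice.
// featureIds receives 1-based labels; 0 means masked out.
// Returns the number of features found.
std::int32_t segmentSlice(const std::vector<std::int32_t>& phases, const std::vector<std::uint8_t>& mask, std::size_t dimX, std::size_t dimY,
                          std::vector<std::int32_t>& featureIds)
{
  const std::size_t sliceSize = dimX * dimY;
  featureIds.assign(sliceSize, 0);
  std::int32_t featureCount = 0;
  std::vector<std::size_t> stack;

  for(std::size_t seed = 0; seed < sliceSize; seed++)
  {
    if(mask[seed] == 0 || featureIds[seed] != 0)
    {
      continue; // masked out or already assigned
    }
    featureCount++;
    featureIds[seed] = featureCount;
    stack.push_back(seed);

    while(!stack.empty())
    {
      const std::size_t idx = stack.back();
      stack.pop_back();
      const std::size_t x = idx % dimX;
      const std::size_t y = idx / dimX;

      // Simplified grouping rule (an assumption): same phase as the neighbor.
      auto tryVisit = [&](std::size_t n) {
        if(mask[n] != 0 && featureIds[n] == 0 && phases[n] == phases[idx])
        {
          featureIds[n] = featureCount;
          stack.push_back(n);
        }
      };
      if(x > 0) { tryVisit(idx - 1); }
      if(x + 1 < dimX) { tryVisit(idx + 1); }
      if(y > 0) { tryVisit(idx - dimX); }
      if(y + 1 < dimY) { tryVisit(idx + dimX); }
    }
  }
  return featureCount;
}
```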

AlignSectionsFeatureCentroid & AlignSectionsListFilter

  • Benefit from transfer phase optimization only (findShifts access patterns are already sequential/trivial)

Tests

  • All 9 correctness tests now exercise both in-core and OOC algorithm paths via GENERATE(false, true) + ForceOocAlgorithmGuard
  • Benchmark tests (200³ programmatic datasets) left without GENERATE for clean timing
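
For readers unfamiliar with the pattern: Catch2's `GENERATE(false, true)` reruns the enclosing test case once per value, and an RAII guard can pin the algorithm path for the duration of each run. The sketch below shows the guard idea in isolation, without Catch2. `ForceOocGuard`, `forceOocFlag`, and `sumValues` are hypothetical and are not the actual `ForceOocAlgorithmGuard` implementation, whose interface is not shown in this PR description.

```cpp
#include <cstddef>
#include <vector>

namespace sketch
{
// Hypothetical process-wide switch consulted by the dispatch code.
inline bool& forceOocFlag()
{
  static bool s_Flag = false;
  return s_Flag;
}

// RAII guard: sets the flag on construction, restores the old value on destruction.
class ForceOocGuard
{
public:
  explicit ForceOocGuard(bool force)
  : m_Previous(forceOocFlag())
  {
    forceOocFlag() = force;
  }
  ~ForceOocGuard()
  {
    forceOocFlag() = m_Previous;
  }
  ForceOocGuard(const ForceOocGuard&) = delete;
  ForceOocGuard& operator=(const ForceOocGuard&) = delete;

private:
  bool m_Previous;
};

// Toy dual dispatch: both paths must produce the same result.
inline long sumValues(const std::vector<int>& values)
{
  long total = 0;
  if(forceOocFlag())
  {
    // "OOC" path: copy into a local buffer first, then process the buffer.
    const std::vector<int> buffer(values.begin(), values.end());
    for(int v : buffer)
    {
      total += v;
    }
  }
  else
  {
    // "In-core" path: read the values directly.
    for(std::size_t i = 0; i < values.size(); i++)
    {
      total += values[i];
    }
  }
  return total;
}
} // namespace sketch
```

In a Catch2 test case, the equivalent would be to obtain the boolean from `GENERATE(false, true)` and construct the guard with it before running the filter, so the same assertions check both paths.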

Benchmark Results (200×200×200)

| Filter | In-Core (Before → After) | OOC (Before → After) | OOC Speedup |
| --- | --- | --- | --- |
| AlignSectionsMisorientation | 0.74s → 0.79s (~1.0x) | 32.89s → 16.14s | 2.0x |
| AlignSectionsMutualInformation | 0.49s → 0.52s (~1.0x) | 15.61s → 15.14s | ~1.0x |
| AlignSectionsFeatureCentroid | 0.24s → 0.28s (~1.0x) | 8.41s → 8.82s | ~1.0x |
| AlignSectionsListFilter | 0.22s → 0.25s (~1.0x) | 7.50s → 7.95s | ~1.0x |

Optimization Ceiling Analysis

The OOC speedups are more modest than Groups B/C/D because the transfer phase dominates OOC runtime and is bottlenecked by ZarrStore's per-element overhead (~55–75ns per operator[]: mutex lock/unlock + chunk lookup vs ~1ns for in-core DataStore). The Misorientation filter shows the most benefit (2.0x) because its findShifts convergence loop re-reads the same 2 slices many times — slice buffering plus reference-slice swap (reusing the previous iteration's current-slice buffer as the next iteration's reference) eliminates that repeated I/O.

Further improvement requires deeper OOC infrastructure changes:

  1. Bulk read/write API on AbstractDataStore — eliminates ~47.8M mutex lock/unlock cycles per filter (~1–2s savings)
  2. Chunk-level bulk transfer in FileCore — bypasses per-element chunk lookup entirely (estimated 3–5x transfer phase improvement)
  3. Larger FIFO cache or per-array cache isolation — enables parallel OOC array processing
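
To make the first item concrete, the sketch below contrasts per-element access with a range read on a toy store that takes a mutex on every call. `LockedStore`, `getValue`, and `readRange` are hypothetical names chosen for illustration; they are not existing AbstractDataStore methods, and the real cost model (chunk lookup, cache eviction) is not represented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Toy store that locks on every access and counts lock acquisitions.
class LockedStore
{
public:
  explicit LockedStore(std::vector<float> values)
  : m_Values(std::move(values))
  {
  }

  // Per-element access: one lock/unlock cycle per value read.
  float getValue(std::size_t index)
  {
    std::lock_guard<std::mutex> lock(m_Mutex);
    m_LockCount++;
    return m_Values[index];
  }

  // Hypothetical bulk read: one lock/unlock cycle for the whole range.
  void readRange(std::size_t start, std::size_t count, float* out)
  {
    std::lock_guard<std::mutex> lock(m_Mutex);
    m_LockCount++;
    assert(start + count <= m_Values.size());
    std::copy_n(m_Values.begin() + static_cast<std::ptrdiff_t>(start), count, out);
  }

  std::size_t lockCount() const
  {
    return m_LockCount;
  }

private:
  std::vector<float> m_Values;
  std::mutex m_Mutex;
  std::size_t m_LockCount = 0;
};
```

Reading a 1000-element slice through `getValue` costs 1000 lock cycles in this toy model, while `readRange` costs one, which is the kind of reduction the first item is aiming for.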

Test Plan

  • All 9 correctness tests pass on in-core build (simplnx-Rel)
  • All 9 correctness tests pass on OOC build (simplnx-ooc-Rel)
  • GENERATE(false, true) exercises both algorithm paths in both builds
  • Benchmark tests show no meaningful in-core regression (all four filters within 0.05s of their baseline times)

Add slice-buffered OOC paths for the AlignSections filter family:
- AlignSectionsMisorientation: OOC findShiftsOoc() with 2-slice quats/phases/mask buffering (1.6x OOC speedup)
- AlignSectionsMutualInformation: OOC formFeaturesSectionsOoc() with per-slice buffering
- AlignSectionsFeatureCentroid: Transfer phase optimization only
- AlignSectionsListFilter: Transfer phase optimization only

Base class AlignSections::execute() now dispatches to AlignSectionsTransferDataOocImpl
when any cell array is OOC, using sequential read-into-buffer then write-back-shifted
pattern that eliminates per-tuple chunk thrashing.

All correctness tests now exercise both in-core and OOC algorithm paths via
GENERATE(false, true) + ForceOocAlgorithmGuard.

Signed-off-by: BlueQuartz Software <info@bluequartz.net>
@joeykleingers joeykleingers force-pushed the worktree-OptimizeGroupE branch from 200daaa to 8251640 Compare March 5, 2026 15:12
@imikejackson imikejackson changed the title PERF: OOC optimization for AlignSections family (Group E) PERF: OoC optimization for AlignSections Filters Mar 5, 2026
…ndShifts

Pre-load the first reference slice before the convergence loop and swap
cur→ref buffers at each iteration instead of re-reading the reference
from DataStore. Halves per-iteration DataStore reads, improving OOC
Misorientation from 21s to 16s (2.0x vs baseline). Add Doxygen comments
for private OOC methods in Misorientation and MutualInformation headers.
- Remove unused #include <iostream> from AlignSectionsMisorientation.cpp
- Remove duplicate cancel check (m_ShouldCancel before getCancel())
- Fix local variable naming: m_CellPhases/m_CrystalStructures → cellPhases/crystalStructures in formFeaturesSections
- Use hidden Catch2 tag [.Benchmark] so benchmark tests don't run in default CI
- Run clang-format on all PR files
@imikejackson imikejackson changed the title PERF: OoC optimization for AlignSections Filters PERF: AlignSections Filters OoC Optimization Mar 9, 2026
@joeykleingers joeykleingers marked this pull request as draft March 10, 2026 01:16