Closed
Conversation
…ation, FTRTRI blocked base case, double->float conversion for ftrsm/ftrmm/ftrsv, and pre-existing bug fixes
- Make primary MMHelper template minimal (recLevel + parseq only), suitable for non-bounded modes (DefaultTag, ConvertTo, etc.) - Extract bounds tracking into MMHelperBounded base class with all delayed reduction machinery (Amin/Amax, Bmin/Bmax, etc.) - Specialize MMHelper for LazyTag, DelayedTag, DefaultBoundedTag via inheritance from MMHelperBounded - Add IsBoundedMode trait to identify bounded mode categories - Introduce AddHelper<IsSub> to replace NeedPreAddReduction and NeedPreSubReduction free functions (29 call sites updated) - Restore inline comments in ftrtri_basecase - Update TODO to mark simplification items as done
Removed macOS job configuration from CI workflow.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Improved base cases, double→float conversion, and MMHelper simplification
Summary
This PR addresses four items from the TODO list:
AddHelperreplacingNeedPreAddReductionAll 43 tests pass with zero warnings. Benchmarks show 3–5× speedup for
Modular<int32_t>operations with no regression forModular<double>.Changes
1. LUdivine-PLUQ (
ffpack_pluq.inl,ffpack_ludivine.inl,ffpack.h)#ifdef LEFTLOOKINGwith a runtimeFFPACK_LU_TAGenum(
LAPACK_CROUT,LAPACK_LEFT_LOOKING,TILE_CROUT,TILE_LEFT_LOOKING), defaulting toLAPACK_CROUT.floatanddoublespecializations into a single SFINAE-basedtemplate, eliminating ~120 lines of duplicated code.
cyclic_shift_row_colusedElement_ptrinstead ofElementinan intermediate variable.
2. FTRTRI base case optimization (
ffpack_ftrtr.inl)ftrtri_basecase()with a blocked algorithm (block size 4): splits the matrix into asmall block and remainder, using
ftrmmon blocks instead of single rows.ftrtri()now delegates toftrtri_basecase()at the threshold.3. Double→float conversion (
fflas_ftrsm.inl,fflas_ftrmm.inl,fflas_ftrsv.inl)_try_convertSFINAE helpers inFFLAS::Protectedto attempt conversion fromModular<double>toModular<float>when p² < 2²⁴ (single-precision safe).ftrsm,ftrmm, andftrsv— matching the existing pattern infgemm.4. MMHelper simplification (
fflas_helpers.inl,fflas_fgemm.inl, + 3 schedule files)MMHelpernow contains onlyrecLevelandparseqbydefault, suitable for non-bounded modes (
DefaultTag,ConvertTo, etc.). The previousDefaultTagandConvertTopartial specializations are removed (now redundant).MMHelperBoundedbase class: all bounds tracking machinery (Amin/Amax,Bmin/Bmax,Cmin/Cmax,Outmin/Outmax,FieldMin/FieldMax,MaxStorableValue,delayedField,MaxDelayedDim,setOutBounds,checkA/B/Out, etc.) extracted into this base.MMHelper<..., LazyTag, ...>,MMHelper<..., DelayedTag, ...>,and
MMHelper<..., DefaultBoundedTag, ...>inherit fromMMHelperBoundedviausing Base::Base.IsBoundedMode<ModeTrait>trait: identifies bounded mode categories at compile time.AddHelper<IsSub>: replacesNeedPreAddReduction(IsSub=false) andNeedPreSubReduction(IsSub=true) free functions. 29 call sites updated acrossschedule_winograd.inl,schedule_winograd_acc.inl, andfflas_fsyrk_strassen.inl.NeedDoublePreAddReduction,NeedPreScalReduction, andNeedPreAxpyReductionare keptas-is (different signatures).
Bug fixes
ffpack_permutation.inl: fixedcyclic_shift_row_colusingElement_ptrwhereElementwas needed.
Files changed (14 files, +620 −374)
ffpack/ffpack.hFFPACK_LU_TAGenumffpack/ffpack_pluq.inlffpack/ffpack_ludivine.inlffpack/ffpack_ftrtr.inlftrtri_basecase+ inline commentsffpack/ffpack_permutation.inlcyclic_shift_row_colfflas/fflas_ftrsm.inlfflas/fflas_ftrmm.inlfflas/fflas_ftrsv.inlfflas/fflas_helpers.inlMMHelperBoundedbasefflas/fflas_fgemm.inlAddHelper<IsSub>replacingNeedPre*Reductionfflas/fflas_fgemm/schedule_winograd.inlfflas/fflas_fgemm/schedule_winograd_acc.inlfflas/fflas_fsyrk_strassen.inlTODOTesting
make check -j 16: 43/43 PASS, 0 FAIL, 0 warningsftrtri,fgemm,ftrsmonModular<int32_t>andModular<double>:3–5× speedup for integer modular types, no regression for floating-point types.