Improved basecases #419

Closed
cxzhong wants to merge 5 commits into linbox-team:master from cxzhong:improved-basecases

@cxzhong cxzhong commented Mar 22, 2026

Improved base cases, double→float conversion, and MMHelper simplification

Summary

This PR addresses four items from the TODO list:

  1. LUdivine-PLUQ cleanup — unified routine with runtime dispatch
  2. FTRTRI base case optimization — blocked base case for better cache locality
  3. Double→float conversion — extended from fgemm to ftrsm, ftrmm, ftrsv
  4. MMHelper simplification — minimal primary template + AddHelper replacing NeedPreAddReduction

All 43 tests pass with zero warnings. Benchmarks show 3–5× speedup for Modular<int32_t> operations with no regression for Modular<double>.


Changes

1. LUdivine-PLUQ (ffpack_pluq.inl, ffpack_ludivine.inl, ffpack.h)

  • PLUQ: replaced compile-time #ifdef LEFTLOOKING with a runtime FFPACK_LU_TAG enum
    (LAPACK_CROUT, LAPACK_LEFT_LOOKING, TILE_CROUT, TILE_LEFT_LOOKING), defaulting to LAPACK_CROUT.
  • LUdivine: unified the float and double specializations into a single SFINAE-based
    template, eliminating ~120 lines of duplicated code.
  • Fixed a pre-existing bug in ffpack_permutation.inl: cyclic_shift_row_col declared an
    intermediate variable as Element_ptr where Element was needed.
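
The move from a compile-time #ifdef to a runtime tag can be sketched as below. This is only an illustration of the dispatch pattern: the enumerator names are taken from the description above, while the type name LuVariant, the function pluq_dispatch, and the four stand-in kernels are hypothetical and do not reflect the actual FFPACK signatures.

```cpp
#include <string>

// Hypothetical mirror of the tag described above. The enumerator names come
// from the PR text; the type name and the underlying values are assumptions.
enum class LuVariant {
    LAPACK_CROUT,
    LAPACK_LEFT_LOOKING,
    TILE_CROUT,
    TILE_LEFT_LOOKING
};

// Stand-ins for the four elimination kernels; each one only reports its name.
inline std::string lapack_crout()        { return "lapack-crout"; }
inline std::string lapack_left_looking() { return "lapack-left-looking"; }
inline std::string tile_crout()          { return "tile-crout"; }
inline std::string tile_left_looking()   { return "tile-left-looking"; }

// Single entry point: the variant is selected by an argument at run time,
// so one binary contains every variant and callers can switch between them.
inline std::string pluq_dispatch(LuVariant tag = LuVariant::LAPACK_CROUT) {
    switch (tag) {
        case LuVariant::LAPACK_CROUT:        return lapack_crout();
        case LuVariant::LAPACK_LEFT_LOOKING: return lapack_left_looking();
        case LuVariant::TILE_CROUT:          return tile_crout();
        case LuVariant::TILE_LEFT_LOOKING:   return tile_left_looking();
    }
    return lapack_crout();  // not reached for valid tags; silences warnings
}
```

With this shape, a benchmark can compare variants in a single build instead of recompiling with a different macro.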

2. FTRTRI base case optimization (ffpack_ftrtr.inl)

  • Added ftrtri_basecase() with a blocked algorithm (block size 4): splits the matrix into a
    small block and remainder, using ftrmm on blocks instead of single rows.
  • The row-by-row micro base case is retained for N ≤ 4.
  • The recursive ftrtri() now delegates to ftrtri_basecase() at the threshold.
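
A minimal sketch of the blocked idea, under simplifying assumptions: it works on an upper triangular, non-unit matrix of doubles in row-major storage rather than on a finite field, and it replaces ftrmm with plain loops. The names invert_upper_micro, invert_upper_blocked, and inverse_error are hypothetical. For T = [[A, B], [0, C]] the inverse is [[A⁻¹, −A⁻¹·B·C⁻¹], [0, C⁻¹]], so a 4×4 block A is inverted row by row and the off-diagonal block is updated with two triangular products.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;   // dense row-major n x n storage
constexpr std::size_t kBlock = 4;     // block size quoted in the PR

// Unblocked in-place inverse of the upper triangular block [lo, hi) x [lo, hi).
void invert_upper_micro(Matrix& T, std::size_t n, std::size_t lo, std::size_t hi) {
    for (std::size_t j = lo; j < hi; ++j) {
        T[j * n + j] = 1.0 / T[j * n + j];
        const double neg_inv = -T[j * n + j];
        // Column j above the diagonal: -(inverted leading block) * column / T[j][j].
        for (std::size_t i = lo; i < j; ++i) {
            double acc = 0.0;
            for (std::size_t k = i; k < j; ++k) acc += T[i * n + k] * T[k * n + j];
            T[i * n + j] = acc * neg_inv;
        }
    }
}

// Blocked variant: peel off a kBlock x kBlock block A, invert the remainder C,
// then update the off-diagonal block B <- -inv(A) * B * inv(C).
void invert_upper_blocked(Matrix& T, std::size_t n, std::size_t lo, std::size_t hi) {
    if (hi - lo <= kBlock) { invert_upper_micro(T, n, lo, hi); return; }
    const std::size_t mid = lo + kBlock;
    invert_upper_micro(T, n, lo, mid);     // A <- inv(A)
    invert_upper_blocked(T, n, mid, hi);   // C <- inv(C)
    // B <- inv(A) * B; ascending rows keep the not-yet-used entries intact.
    for (std::size_t j = mid; j < hi; ++j)
        for (std::size_t i = lo; i < mid; ++i) {
            double acc = 0.0;
            for (std::size_t k = i; k < mid; ++k) acc += T[i * n + k] * T[k * n + j];
            T[i * n + j] = acc;
        }
    // B <- -B * inv(C); descending columns for the same reason.
    for (std::size_t i = lo; i < mid; ++i)
        for (std::size_t j = hi; j-- > mid;) {
            double acc = 0.0;
            for (std::size_t k = mid; k <= j; ++k) acc += T[i * n + k] * T[k * n + j];
            T[i * n + j] = -acc;
        }
}

// Largest entry of |T * X - I|, used to check a computed inverse X.
double inverse_error(const Matrix& T, const Matrix& X, std::size_t n) {
    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k) acc += T[i * n + k] * X[k * n + j];
            worst = std::max(worst, std::fabs(acc - (i == j ? 1.0 : 0.0)));
        }
    return worst;
}
```

The two update loops are where a library routine such as ftrmm would operate on whole blocks, which is the source of the cache-locality benefit claimed above.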

3. Double→float conversion (fflas_ftrsm.inl, fflas_ftrmm.inl, fflas_ftrsv.inl)

  • Added _try_convert SFINAE helpers in FFLAS::Protected to attempt conversion from
    Modular<double> to Modular<float> when p² < 2²⁴ (single-precision safe).
  • Applied to ftrsm, ftrmm, and ftrsv — matching the existing pattern in fgemm.
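
The safety condition can be illustrated with a small sketch. The criterion p² < 2²⁴ is the one stated above; the function names fits_single_precision and mul_mod_via_float are hypothetical and are not the _try_convert helpers themselves, whose signatures are not shown in this PR description.

```cpp
#include <cmath>
#include <cstdint>

// True when p*p < 2^24, i.e. any product of two residues in [0, p-1] is an
// integer below 2^24 and therefore exactly representable in a float.
constexpr bool fits_single_precision(std::uint64_t p) {
    return p <= 0xFFFFFFFFull && p * p < (std::uint64_t{1} << 24);
}

// Illustrative modular product that takes the float path when it is safe and
// falls back to double otherwise. Inputs are assumed to be integers in [0, p-1].
inline double mul_mod_via_float(double a, double b, double p) {
    if (fits_single_precision(static_cast<std::uint64_t>(p))) {
        const float prod = static_cast<float>(a) * static_cast<float>(b);  // exact
        return static_cast<double>(std::fmod(prod, static_cast<float>(p)));
    }
    return std::fmod(a * b, p);  // double path, as before the conversion
}
```

The bound works out to p ≤ 4095, so only small moduli qualify for the single-precision path.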

4. MMHelper simplification (fflas_helpers.inl, fflas_fgemm.inl, + 3 schedule files)

  • Minimal primary template: MMHelper now contains only recLevel and parseq by
    default, suitable for non-bounded modes (DefaultTag, ConvertTo, etc.). The previous
    DefaultTag and ConvertTo partial specializations are removed (now redundant).
  • MMHelperBounded base class: all bounds tracking machinery (Amin/Amax, Bmin/Bmax,
    Cmin/Cmax, Outmin/Outmax, FieldMin/FieldMax, MaxStorableValue, delayedField,
    MaxDelayedDim, setOutBounds, checkA/B/Out, etc.) extracted into this base.
  • Bounded specializations: MMHelper<..., LazyTag, ...>, MMHelper<..., DelayedTag, ...>,
    and MMHelper<..., DefaultBoundedTag, ...> inherit from MMHelperBounded via using Base::Base.
  • IsBoundedMode<ModeTrait> trait: identifies bounded mode categories at compile time.
  • AddHelper<IsSub>: replaces NeedPreAddReduction (IsSub=false) and
    NeedPreSubReduction (IsSub=true) free functions. 29 call sites updated across
    schedule_winograd.inl, schedule_winograd_acc.inl, and fflas_fsyrk_strassen.inl.
    NeedDoublePreAddReduction, NeedPreScalReduction, and NeedPreAxpyReduction are kept
    as-is (different signatures).
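
The resulting layout might look roughly like the sketch below. It is a reduced model, not the library code: the real MMHelper takes more template parameters, the bounds here are plain doubles, and the overflow test in AddHelper is a simplified interval check whose exact form in fflas-ffpack may differ. The names Helper, HelperBounded, and Bounds are hypothetical; the tag names, IsBoundedMode, AddHelper, recLevel, and MaxStorableValue come from the description above.

```cpp
#include <type_traits>

// Mode tags named in the PR description (empty structs for illustration).
struct DefaultTag {};
struct LazyTag {};
struct DelayedTag {};
struct DefaultBoundedTag {};

// Compile-time trait: true only for modes that track value bounds.
template <class Mode> struct IsBoundedMode : std::false_type {};
template <> struct IsBoundedMode<LazyTag> : std::true_type {};
template <> struct IsBoundedMode<DelayedTag> : std::true_type {};
template <> struct IsBoundedMode<DefaultBoundedTag> : std::true_type {};

// Minimal primary template: no bounds machinery at all.
template <class Mode> struct Helper {
    int recLevel = -1;
};

// Base class that carries the bounds state for bounded modes.
struct HelperBounded {
    int recLevel = -1;
    double MaxStorableValue;
    explicit HelperBounded(double maxStorable) : MaxStorableValue(maxStorable) {}
};

// A bounded specialization inherits state and constructors from the base.
template <> struct Helper<LazyTag> : HelperBounded {
    using Base = HelperBounded;
    using Base::Base;
};

struct Bounds { double min, max; };

// One template replaces the separate add and sub predicates.
template <bool IsSub> struct AddHelper {
    // Interval bounds of a + b (IsSub == false) or a - b (IsSub == true).
    static Bounds out_bounds(Bounds a, Bounds b) {
        return IsSub ? Bounds{a.min - b.max, a.max - b.min}
                     : Bounds{a.min + b.min, a.max + b.max};
    }
    // A reduction is needed when the result could leave the storable range.
    static bool need_reduction(Bounds a, Bounds b, const HelperBounded& h) {
        const Bounds o = out_bounds(a, b);
        return o.max > h.MaxStorableValue || o.min < -h.MaxStorableValue;
    }
};
```

A call site then picks the operation with a template argument, for example AddHelper<true>::need_reduction(a, b, helper) for a subtraction.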

Bug fixes

  • ffpack_permutation.inl: fixed cyclic_shift_row_col using Element_ptr where Element
    was needed.

Files changed (14 files, +620 −374)

| File | Description |
|------|-------------|
| ffpack/ffpack.h | Added FFPACK_LU_TAG enum |
| ffpack/ffpack_pluq.inl | Runtime LU variant dispatch |
| ffpack/ffpack_ludivine.inl | Unified float/double via SFINAE |
| ffpack/ffpack_ftrtr.inl | Blocked ftrtri_basecase + inline comments |
| ffpack/ffpack_permutation.inl | Bug fix in cyclic_shift_row_col |
| fflas/fflas_ftrsm.inl | Double→float conversion |
| fflas/fflas_ftrmm.inl | Double→float conversion |
| fflas/fflas_ftrsv.inl | Double→float conversion |
| fflas/fflas_helpers.inl | Minimal primary MMHelper + MMHelperBounded base |
| fflas/fflas_fgemm.inl | AddHelper<IsSub> replacing NeedPre*Reduction |
| fflas/fflas_fgemm/schedule_winograd.inl | Updated 14 call sites |
| fflas/fflas_fgemm/schedule_winograd_acc.inl | Updated 4 call sites |
| fflas/fflas_fsyrk_strassen.inl | Updated 11 call sites |
| TODO | Marked items as done |

Testing

  • make check -j 16: 43/43 PASS, 0 FAIL, 0 warnings
  • Benchmarked ftrtri, fgemm, ftrsm on Modular<int32_t> and Modular<double>:
    3–5× speedup for integer modular types, no regression for floating-point types.

cxzhong added 2 commits March 22, 2026 21:09
…ation, FTRTRI blocked base case, double->float conversion for ftrsm/ftrmm/ftrsv, and pre-existing bug fixes
- Make primary MMHelper template minimal (recLevel + parseq only),
  suitable for non-bounded modes (DefaultTag, ConvertTo, etc.)
- Extract bounds tracking into MMHelperBounded base class with all
  delayed reduction machinery (Amin/Amax, Bmin/Bmax, etc.)
- Specialize MMHelper for LazyTag, DelayedTag, DefaultBoundedTag
  via inheritance from MMHelperBounded
- Add IsBoundedMode trait to identify bounded mode categories
- Introduce AddHelper<IsSub> to replace NeedPreAddReduction and
  NeedPreSubReduction free functions (29 call sites updated)
- Restore inline comments in ftrtri_basecase
- Update TODO to mark simplification items as done
@cxzhong cxzhong marked this pull request as ready for review March 22, 2026 15:21
@cxzhong cxzhong closed this Apr 3, 2026
@cxzhong cxzhong deleted the improved-basecases branch April 3, 2026 09:06