Add LoopVectorization tensor-product fast path extension by AshtonSBradley · Pull Request #369 · SciML/SciMLOperators.jl

AshtonSBradley · 2026-05-05T21:07:58Z

Revision note

Thanks for the early review comments. I revised the PR to separate fast-path availability/control flow from the mutating implementation call, rewrote the k == 1/batched fast-path branch so the flow is explicit, and removed unrelated cache-slice edits plus a redundant test block. Hopefully this is now back on track as a minimal extension-backed optimization.

Summary

This PR adds an optional fast path for batched TensorProductOperator multiplication when the outer operator is a MatrixOperator wrapping a CPU StridedMatrix.

The core tensor implementation now exposes a small internal backend-neutral hook, _tensor_outer_mul_fast!, guarded by a separate _has_tensor_outer_mul_fast predicate. The existing generic implementation remains the fallback and still uses only the operator interface (mul!) after reshaping/permuting into the layout it needs. That keeps arbitrary matrix-free operators, sparse matrices, GPU arrays, and other non-strided operators on the abstraction-preserving path.

A new SciMLOperatorsLoopVectorizationExt extension overloads this hook with an @turbo loop for the narrow dense-strided MatrixOperator case. LoopVectorization is a weak dependency and a test dependency, not a hard dependency of SciMLOperators.

Backend design

The hook is intentionally not named after LoopVectorization. Core owns a generic availability predicate plus an implementation hook:

_has_tensor_outer_mul_fast(outer) = false
function _tensor_outer_mul_fast! end

The predicate keeps control flow separate from the mutating fast-path call. Backend extensions opt in by overloading the predicate for supported operator types and providing the corresponding _tensor_outer_mul_fast! methods.

That means LoopVectorization is only the first backend. If LoopModels, or another successor backend, becomes the preferred implementation later, it can live in its own extension and overload the same hook without changing the core tensor-product algorithm or public API.

The intended backend-switching model is:

Core SciMLOperators owns the generic fallback and the backend-neutral hook.
SciMLOperatorsLoopVectorizationExt owns today’s @turbo dense-strided implementation.
The same extension design could support an opt-in @tturbo method later, for users who want to experiment with threaded LoopVectorization behavior, without changing the core tensor code. This PR intentionally uses only @turbo as the safer default because it avoids thread oversubscription concerns with threaded BLAS.
A future SciMLOperatorsLoopModelsExt can provide the same hook for LoopModels once its API is ready.
A future GPU backend could also overload the same hook from a CUDA/KernelAbstractions-style extension, but this PR does not add a GPU kernel.
If no backend is loaded, behavior remains unchanged.

GPU compatibility

This PR should not change existing GPU behavior by dispatch. The new LoopVectorization method only applies to MatrixOperator wrapping a StridedMatrix, so GPU arrays such as CuArray should not be caught by the CPU scalar-indexing @turbo path. GPU-backed operators continue to use the existing generic fallback, which relies on the current permutedims!/mul! behavior for those array/operator types.

I did not run GPU tests locally, so this is a dispatch/abstraction argument rather than a measured GPU validation.

Performance notes

Local benchmarks on the existing benchmarks/tensor.jl setup showed the extension active only when LoopVectorization is loaded. These numbers were collected on Apple Silicon, so they should be treated as a conservative local check rather than the expected best case. The improvement may be larger on Intel AVX/AVX2 machines, which are closer to the hardware context discussed in #58 and where LoopVectorization's SIMD code generation has historically shown larger gains.

I also tried the other idea from #58: replacing explicit permutedims! calls with PermutedDimsArray. That benchmarked substantially worse for the dense batched cases, likely because BLAS then sees a lazy/strided permuted input instead of contiguous storage. This PR therefore keeps the existing explicit permutation fallback and only bypasses it for the narrow @turbo extension path.

Without LoopVectorization, the fallback path remains at baseline:

⊗(A, B) cached mul!:          ~29.1 μs, 5 allocs
⊗(A, I) cached mul!:          ~19.8 μs, 4 allocs
⊗(⊗(A, B), C) cached mul!:   ~283.7 μs, 10 allocs
⊗(A, ⊗(B, C)) cached mul!:   ~283.5 μs, 10 allocs

With LoopVectorization loaded:

⊗(A, B) cached mul!:          ~17.0 μs, 3 allocs
⊗(A, I) cached mul!:           ~7.1 μs, 2 allocs
⊗(⊗(A, B), C) cached mul!:   ~234.5 μs, 8 allocs
⊗(A, ⊗(B, C)) cached mul!:   ~188.7 μs, 6 allocs

This follows the discussion in #58: avoid the generic permutedims! + mul! + permutedims! path only where fast indexing is actually valid, rather than assuming all operators are indexable.

Tests

Added LoopVectorization to the test environment so the existing dense batched TensorProductOperator plain/scaled mul! tests exercise the extension path.
Pkg.test() passes locally with LoopVectorization in the test environment:

SciMLOperators | 823 passed, 2 broken

ChrisRackauckas · 2026-05-06T04:09:05Z

+_tensor_outer_mul_fast!(w, outer, C, mi::Int, mo::Int, no::Int, k::Int) = false
+function _tensor_outer_mul_fast!(w, outer, C, mi::Int, mo::Int, no::Int, k::Int, α, β)
+    return false
+end
+


seems weird to mix the control flow with the actual call?

this has been separated

ChrisRackauckas · 2026-05-06T06:33:09Z

        C1 = reshape(C1, (mi, no))
        mul!(transpose(W), outer, transpose(C1))
        return w
+    elseif _has_tensor_outer_mul_fast(outer)


if it's the elseif, isn't it bypassed?

yes, good catch

AshtonSBradley marked this pull request as ready for review May 5, 2026 21:17

ChrisRackauckas reviewed May 6, 2026

View reviewed changes

AshtonSBradley force-pushed the tensor-product-fast-paths branch from 9fe4176 to 2253d76 Compare May 6, 2026 06:31

ChrisRackauckas reviewed May 6, 2026

View reviewed changes

AshtonSBradley force-pushed the tensor-product-fast-paths branch from 2253d76 to bd70324 Compare May 6, 2026 07:10

Add LoopVectorization tensor fast path extension

aff55b2

AshtonSBradley force-pushed the tensor-product-fast-paths branch from bd70324 to aff55b2 Compare May 6, 2026 08:34

ChrisRackauckas merged commit d4df0f6 into SciML:master May 6, 2026
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add LoopVectorization tensor-product fast path extension#369

Add LoopVectorization tensor-product fast path extension#369
ChrisRackauckas merged 1 commit into
SciML:masterfrom
AshtonSBradley:tensor-product-fast-paths

AshtonSBradley commented May 5, 2026 •

edited

Loading

Uh oh!

ChrisRackauckas May 6, 2026

Uh oh!

AshtonSBradley May 6, 2026

Uh oh!

ChrisRackauckas May 6, 2026

Uh oh!

AshtonSBradley May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

AshtonSBradley commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Revision note

Summary

Backend design

GPU compatibility

Performance notes

Tests

Uh oh!

ChrisRackauckas May 6, 2026

Choose a reason for hiding this comment

Uh oh!

AshtonSBradley May 6, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisRackauckas May 6, 2026

Choose a reason for hiding this comment

Uh oh!

AshtonSBradley May 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AshtonSBradley commented May 5, 2026 •

edited

Loading