Skip to content

Add LoopVectorization tensor-product fast path extension#369

Merged
ChrisRackauckas merged 1 commit into
SciML:masterfrom
AshtonSBradley:tensor-product-fast-paths
May 6, 2026
Merged

Add LoopVectorization tensor-product fast path extension#369
ChrisRackauckas merged 1 commit into
SciML:masterfrom
AshtonSBradley:tensor-product-fast-paths

Conversation

@AshtonSBradley
Copy link
Copy Markdown
Contributor

@AshtonSBradley AshtonSBradley commented May 5, 2026

Revision note

Thanks for the early review comments. I revised the PR to separate fast-path availability/control flow from the mutating implementation call, rewrote the k == 1/batched fast-path branch so the flow is explicit, and removed unrelated cache-slice edits plus a redundant test block. Hopefully this is now back on track as a minimal extension-backed optimization.

Summary

This PR adds an optional fast path for batched TensorProductOperator multiplication when the outer operator is a MatrixOperator wrapping a CPU StridedMatrix.

The core tensor implementation now exposes a small internal backend-neutral hook, _tensor_outer_mul_fast!, guarded by a separate _has_tensor_outer_mul_fast predicate. The existing generic implementation remains the fallback and still uses only the operator interface (mul!) after reshaping/permuting into the layout it needs. That keeps arbitrary matrix-free operators, sparse matrices, GPU arrays, and other non-strided operators on the abstraction-preserving path.

A new SciMLOperatorsLoopVectorizationExt extension overloads this hook with an @turbo loop for the narrow dense-strided MatrixOperator case. LoopVectorization is a weak dependency and a test dependency, not a hard dependency of SciMLOperators.

Backend design

The hook is intentionally not named after LoopVectorization. Core owns a generic availability predicate plus an implementation hook:

_has_tensor_outer_mul_fast(outer) = false
function _tensor_outer_mul_fast! end

The predicate keeps control flow separate from the mutating fast-path call. Backend extensions opt in by overloading the predicate for supported operator types and providing the corresponding _tensor_outer_mul_fast! methods.

That means LoopVectorization is only the first backend. If LoopModels, or another successor backend, becomes the preferred implementation later, it can live in its own extension and overload the same hook without changing the core tensor-product algorithm or public API.

The intended backend-switching model is:

  • Core SciMLOperators owns the generic fallback and the backend-neutral hook.
  • SciMLOperatorsLoopVectorizationExt owns today’s @turbo dense-strided implementation.
  • The same extension design could support an opt-in @tturbo method later, for users who want to experiment with threaded LoopVectorization behavior, without changing the core tensor code. This PR intentionally uses only @turbo as the safer default because it avoids thread oversubscription concerns with threaded BLAS.
  • A future SciMLOperatorsLoopModelsExt can provide the same hook for LoopModels once its API is ready.
  • A future GPU backend could also overload the same hook from a CUDA/KernelAbstractions-style extension, but this PR does not add a GPU kernel.
  • If no backend is loaded, behavior remains unchanged.

GPU compatibility

This PR should not change existing GPU behavior by dispatch. The new LoopVectorization method only applies to MatrixOperator wrapping a StridedMatrix, so GPU arrays such as CuArray should not be caught by the CPU scalar-indexing @turbo path. GPU-backed operators continue to use the existing generic fallback, which relies on the current permutedims!/mul! behavior for those array/operator types.

I did not run GPU tests locally, so this is a dispatch/abstraction argument rather than a measured GPU validation.

Performance notes

Local benchmarks on the existing benchmarks/tensor.jl setup showed the extension active only when LoopVectorization is loaded. These numbers were collected on Apple Silicon, so they should be treated as a conservative local check rather than the expected best case. The improvement may be larger on Intel AVX/AVX2 machines, which are closer to the hardware context discussed in #58 and where LoopVectorization's SIMD code generation has historically shown larger gains.

I also tried the other idea from #58: replacing explicit permutedims! calls with PermutedDimsArray. That benchmarked substantially worse for the dense batched cases, likely because BLAS then sees a lazy/strided permuted input instead of contiguous storage. This PR therefore keeps the existing explicit permutation fallback and only bypasses it for the narrow @turbo extension path.

Without LoopVectorization, the fallback path remains at baseline:

⊗(A, B) cached mul!:          ~29.1 μs, 5 allocs
⊗(A, I) cached mul!:          ~19.8 μs, 4 allocs
⊗(⊗(A, B), C) cached mul!:   ~283.7 μs, 10 allocs
⊗(A, ⊗(B, C)) cached mul!:   ~283.5 μs, 10 allocs

With LoopVectorization loaded:

⊗(A, B) cached mul!:          ~17.0 μs, 3 allocs
⊗(A, I) cached mul!:           ~7.1 μs, 2 allocs
⊗(⊗(A, B), C) cached mul!:   ~234.5 μs, 8 allocs
⊗(A, ⊗(B, C)) cached mul!:   ~188.7 μs, 6 allocs

This follows the discussion in #58: avoid the generic permutedims! + mul! + permutedims! path only where fast indexing is actually valid, rather than assuming all operators are indexable.

Tests

  • Added LoopVectorization to the test environment so the existing dense batched TensorProductOperator plain/scaled mul! tests exercise the extension path.
  • Pkg.test() passes locally with LoopVectorization in the test environment:
SciMLOperators | 823 passed, 2 broken

@AshtonSBradley AshtonSBradley marked this pull request as ready for review May 5, 2026 21:17
Comment thread src/tensor.jl Outdated
Comment on lines +416 to +420
_tensor_outer_mul_fast!(w, outer, C, mi::Int, mo::Int, no::Int, k::Int) = false
function _tensor_outer_mul_fast!(w, outer, C, mi::Int, mo::Int, no::Int, k::Int, α, β)
return false
end

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems weird to mix the control flow with the actual call?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this has been separated

@AshtonSBradley AshtonSBradley force-pushed the tensor-product-fast-paths branch from 9fe4176 to 2253d76 Compare May 6, 2026 06:31
Comment thread src/tensor.jl Outdated
C1 = reshape(C1, (mi, no))
mul!(transpose(W), outer, transpose(C1))
return w
elseif _has_tensor_outer_mul_fast(outer)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's the elseif, isn't it bypassed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, good catch

@AshtonSBradley AshtonSBradley force-pushed the tensor-product-fast-paths branch from 2253d76 to bd70324 Compare May 6, 2026 07:10
@AshtonSBradley AshtonSBradley force-pushed the tensor-product-fast-paths branch from bd70324 to aff55b2 Compare May 6, 2026 08:34
@ChrisRackauckas ChrisRackauckas merged commit d4df0f6 into SciML:master May 6, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants