Share same cache among similar operators in AddedOperator#370
Share same cache among similar operators in AddedOperator#370albertomercurio wants to merge 6 commits into
AddedOperator#370Conversation
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
|
I think this PR is ready. It improves the cache handling for |
ChrisRackauckas-Claude
left a comment
There was a problem hiding this comment.
Thanks for the work on this — the memory wins are real for the benchmark case, but I think there are correctness issues that need to be fixed before merging. I dug into the diff, ran reproducers locally against this branch on Julia 1.11.9, and found the following.
1. InexactError on mixed-eltype ComposedOperators in the same AddedOperator (correctness)
A real-world QM-style pattern fails:
using SciMLOperators, LinearAlgebra
Ar = MatrixOperator(rand(Float64, 4, 4)); Br = MatrixOperator(rand(Float64, 4, 4))
Ac = MatrixOperator(rand(ComplexF64, 4, 4)); Bc = MatrixOperator(rand(ComplexF64, 4, 4))
L = (Ar*Br) + (Ac*Bc)
Ls = cache_operator(L, rand(Float64, 4))
mul!(similar(rand(Float64,4), ComplexF64), Ls, rand(Float64,4))
# ERROR: InexactError: Float64(... + 0.121im)Both sub-ops are ComposedOperator, so the asker Ac*Bc is forced to share Ar*Br's cache. The donor's cache is Float64 (because v is real and cache_self uses promote_type(eltype(op), eltype(zero(v)))). The asker would normally allocate ComplexF64 slots, since its inner Bc is ComplexF64. But _cache_compatible at src/interface.jl:210 only checks size(hint) == shape — no eltype check — so sharing proceeds. Then mul!(L.cache[1], Bc, v) writes a complex result into a Float64 buffer.
The reverse order (Ac*Bc) + (Ar*Br) also breaks, in the 5-arg path: copy!(cache, w) with w::ComplexF64 and cache::Float64.
Worth noting that Ls * v happens to return correct results because Base.:*(L::AddedOperator, v) (src/basic.jl:676) does an out-of-place sum(op -> op*v, L.ops) that bypasses the cache entirely. So the new testset asserts L * u ≈ expected, but that doesn't exercise the in-place path that the cache is built for. The mul!(w, L, v) path is what users on hot loops will hit.
2. _get_cache_shapes(::ComposedOperator) is off-by-one (logic bug)
# A is 5×3, B is 3×4, C = A*B, v: 4-vec
SciMLOperators._get_cache_shapes(cache_operator(A*B, v), v)
# ((5,), (3,)) ← claimed
map(size, cache_operator(A*B, v).cache)
# ((3,), (4,)) ← actual_get_cache_shapes at src/basic.jl:959 claims slot i has shape (size(L.ops[i],1), K). But cache_self at src/basic.jl:969 actually allocates slot i (for i<N) with the next operator's row dim — (size(L.ops[i+1],1), K) — and slot N with size(v), not (size(L.ops[N],1), K).
Effect:
- For non-square
ComposedOperators, the spec-vs-actual mismatch makes_cache_compatiblereject the donor's cache → sharing never triggers for any non-square inner ops. The benchmark wins disappear in this regime. - For square inner ops, the bug is masked because
size(op,1) == size(op,2). But that's also exactly where the eltype-mismatch bug bites (#1).
Better to have a single source of truth here — either compute shapes inside cache_self and stash them, or factor out a helper that both functions use, so they can't drift.
3. FunctionOperator._get_cache_shapes ignores size(v,2) and eltype
src/func.jl:593:
function _get_cache_shapes(L::FunctionOperator, v::AbstractVecOrMat)
return (L.traits.sizes[1], L.traits.sizes[2])
endFor matrix input it returns single-column shapes. That'll always reject sharing for a matrix v, but could also falsely match against a same-vector-shaped donor and graft the wrong buffer. Should incorporate size(v,2) and output_eltype.
4. No device check in _cache_compatible
A CPU donor and GPU asker (or vice versa) with the same size(...) would silently graft a wrong-device cache. Unlikely to occur in normal usage but the API permits it.
5. Aliasing is brittle — worth a comment
After sharing, several L.ops[i].cache are ===. This works only because AddedOperator's mul! is strictly serial. Any future change that parallelizes L.ops evaluation (Threads, separate CUDA streams, async) creates data races on the shared buffers. A comment near the donor selection or near the cache field would help future maintainers not break this invariant.
6. Test coverage gap
The new testset at test/basic.jl:411 only uses square Float64 MatrixOperator(rand(N,N)) matrices and asserts L * u (the path that bypasses the cache). None of the failure modes above are covered. Suggest adding:
- Mixed-eltype
ComposedOperators (e.g. real + complex inner ops). - Non-square inner ops in
ComposedOperator. FunctionOperatorinside anAddedOperatorwith both vector and matrixv.mul!(w, L, v)andmul!(w, L, v, α, β)against the cachedAddedOperator(not just*).- The benchmark in the PR description also only measures
Base.summarysize; an in-placemul!benchmark would catch correctness regressions of this kind.
Suggested minimal fixes
- Extend
_cache_compatible(or_get_cache_shapes) to carry and check eltype. - Fix
_get_cache_shapes(::ComposedOperator)so it matches whatcache_selfallocates, ideally by sharing a common helper. - Fix
_get_cache_shapes(::FunctionOperator)to includesize(v,2)andoutput_eltype. - Add the missing tests above so the eltype/non-square/in-place paths are covered.
Happy to take a closer look at a follow-up commit if useful. Reproducers I used:
repro scripts
# repro3.jl — confirms #1
using SciMLOperators, LinearAlgebra
Ar = MatrixOperator(rand(Float64, 4, 4)); Br = MatrixOperator(rand(Float64, 4, 4))
Ac = MatrixOperator(rand(ComplexF64, 4, 4)); Bc = MatrixOperator(rand(ComplexF64, 4, 4))
CR = Ar * Br; CC = Ac * Bc
L = CR + CC
v = rand(Float64, 4)
Ls = cache_operator(L, v)
@show map(eltype, Ls.ops[1].cache) map(eltype, Ls.ops[2].cache) Ls.ops[2].cache === Ls.ops[1].cache
w = similar(v, ComplexF64); fill!(w, 0)
mul!(w, Ls, v) # InexactError# repro.jl — confirms #2
using SciMLOperators
A = MatrixOperator(rand(5, 3)); B = MatrixOperator(rand(3, 4))
C = A * B; v = rand(4)
Cc = cache_operator(C, v)
@show SciMLOperators._get_cache_shapes(Cc, v) # ((5,), (3,))
@show map(size, Cc.cache) # ((3,), (4,))|
Ok I should have fixed all the issues. I'm not sure about the FunctionOperator. But the others should be fixed |
Checklist
contributor guidelines, in particular the SciML Style Guide and
COLPRAC.
Additional context
When caching an
AddedOperator, we currently cache all the suboperators iteratively. However, this is suboptimal, as many of them can in principle share the same cache when possible.In this PR I implement this cache sharing in the
AddedOperator. Let's imagine I have the sum ofComposed + Tensor + Tensor + Composed + Matrix + Tensor, many of them require some cache, but they could share the same cache. We can be sure they can share the same cache ifComposedOperator)the first one can be checked even at compile time, while the second requires to define
_get_cache_shapes(op, v), which returns the shape of the cache.Benchmarks
This method allows to save a lot of memory, as also shown in this example