Skip to content

Share same cache among similar operators in AddedOperator#370

Open
albertomercurio wants to merge 6 commits into
SciML:masterfrom
albertomercurio:master
Open

Share same cache among similar operators in AddedOperator#370
albertomercurio wants to merge 6 commits into
SciML:masterfrom
albertomercurio:master

Conversation

@albertomercurio
Copy link
Copy Markdown
Contributor

Checklist

  • Appropriate tests were added
  • Any code changes were done in a way that does not break public API
  • All documentation related to code changes were updated
  • The new code follows the
    contributor guidelines, in particular the SciML Style Guide and
    COLPRAC.
  • Any new documentation only uses public API

Additional context

When caching an AddedOperator, we currently cache all the suboperators iteratively. However, this is suboptimal, as many of them can in principle share the same cache when possible.

In this PR I implement this cache sharing in the AddedOperator. Let's imagine I have the sum of Composed + Tensor + Tensor + Composed + Matrix + Tensor, many of them require some cache, but they could share the same cache. We can be sure they can share the same cache if

  1. They are the same constructor (e.g., ComposedOperator)
  2. The size of their cache is the same

the first one can be checked even at compile time, while the second requires to define _get_cache_shapes(op, v), which returns the shape of the cache.

Benchmarks

This method allows to save a lot of memory, as also shown in this example

using SparseArrays
using SciMLOperators
using CairoMakie

function generate_op(N)
    M = 10
    sparsity = 1 / N

    A_list = ntuple(i -> MatrixOperator(sprand(N, N, sparsity)) * MatrixOperator(sprand(N, N, sparsity)), Val(M))

    op = SciMLOperators.AddedOperator(A_list)

    u = rand(N, N)
    return cache_operator(op, u)
    # return op
end

# %%

N_list = 4 .^ (2:7)
sizes_main = [Base.summarysize(generate_op(N)) for N in N_list]
sizes_pr = [Base.summarysize(generate_op(N)) for N in N_list]

# %%

fig = Figure()
ax = Axis(fig[1, 1], xlabel="N", ylabel="Size (MB)", xscale=log10, yscale=log10)

scatterlines!(ax, N_list, sizes_main ./ 1e6, label="main")
scatterlines!(ax, N_list, sizes_pr ./ 1e6, label="PR")

axislegend(ax; position = :lt)

fig
image

albertomercurio and others added 4 commits May 17, 2026 14:16
@albertomercurio
Copy link
Copy Markdown
Contributor Author

I think this PR is ready. It improves the cache handling for AddedOperator

Copy link
Copy Markdown
Contributor

@ChrisRackauckas-Claude ChrisRackauckas-Claude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work on this — the memory wins are real for the benchmark case, but I think there are correctness issues that need to be fixed before merging. I dug into the diff, ran reproducers locally against this branch on Julia 1.11.9, and found the following.

1. InexactError on mixed-eltype ComposedOperators in the same AddedOperator (correctness)

A real-world QM-style pattern fails:

using SciMLOperators, LinearAlgebra
Ar = MatrixOperator(rand(Float64, 4, 4));    Br = MatrixOperator(rand(Float64, 4, 4))
Ac = MatrixOperator(rand(ComplexF64, 4, 4)); Bc = MatrixOperator(rand(ComplexF64, 4, 4))
L  = (Ar*Br) + (Ac*Bc)
Ls = cache_operator(L, rand(Float64, 4))
mul!(similar(rand(Float64,4), ComplexF64), Ls, rand(Float64,4))
# ERROR: InexactError: Float64(... + 0.121im)

Both sub-ops are ComposedOperator, so the asker Ac*Bc is forced to share Ar*Br's cache. The donor's cache is Float64 (because v is real and cache_self uses promote_type(eltype(op), eltype(zero(v)))). The asker would normally allocate ComplexF64 slots, since its inner Bc is ComplexF64. But _cache_compatible at src/interface.jl:210 only checks size(hint) == shapeno eltype check — so sharing proceeds. Then mul!(L.cache[1], Bc, v) writes a complex result into a Float64 buffer.

The reverse order (Ac*Bc) + (Ar*Br) also breaks, in the 5-arg path: copy!(cache, w) with w::ComplexF64 and cache::Float64.

Worth noting that Ls * v happens to return correct results because Base.:*(L::AddedOperator, v) (src/basic.jl:676) does an out-of-place sum(op -> op*v, L.ops) that bypasses the cache entirely. So the new testset asserts L * u ≈ expected, but that doesn't exercise the in-place path that the cache is built for. The mul!(w, L, v) path is what users on hot loops will hit.

2. _get_cache_shapes(::ComposedOperator) is off-by-one (logic bug)

# A is 5×3, B is 3×4, C = A*B, v: 4-vec
SciMLOperators._get_cache_shapes(cache_operator(A*B, v), v)
# ((5,), (3,))         ← claimed
map(size, cache_operator(A*B, v).cache)
# ((3,), (4,))         ← actual

_get_cache_shapes at src/basic.jl:959 claims slot i has shape (size(L.ops[i],1), K). But cache_self at src/basic.jl:969 actually allocates slot i (for i<N) with the next operator's row dim — (size(L.ops[i+1],1), K) — and slot N with size(v), not (size(L.ops[N],1), K).

Effect:

  • For non-square ComposedOperators, the spec-vs-actual mismatch makes _cache_compatible reject the donor's cache → sharing never triggers for any non-square inner ops. The benchmark wins disappear in this regime.
  • For square inner ops, the bug is masked because size(op,1) == size(op,2). But that's also exactly where the eltype-mismatch bug bites (#1).

Better to have a single source of truth here — either compute shapes inside cache_self and stash them, or factor out a helper that both functions use, so they can't drift.

3. FunctionOperator._get_cache_shapes ignores size(v,2) and eltype

src/func.jl:593:

function _get_cache_shapes(L::FunctionOperator, v::AbstractVecOrMat)
    return (L.traits.sizes[1], L.traits.sizes[2])
end

For matrix input it returns single-column shapes. That'll always reject sharing for a matrix v, but could also falsely match against a same-vector-shaped donor and graft the wrong buffer. Should incorporate size(v,2) and output_eltype.

4. No device check in _cache_compatible

A CPU donor and GPU asker (or vice versa) with the same size(...) would silently graft a wrong-device cache. Unlikely to occur in normal usage but the API permits it.

5. Aliasing is brittle — worth a comment

After sharing, several L.ops[i].cache are ===. This works only because AddedOperator's mul! is strictly serial. Any future change that parallelizes L.ops evaluation (Threads, separate CUDA streams, async) creates data races on the shared buffers. A comment near the donor selection or near the cache field would help future maintainers not break this invariant.

6. Test coverage gap

The new testset at test/basic.jl:411 only uses square Float64 MatrixOperator(rand(N,N)) matrices and asserts L * u (the path that bypasses the cache). None of the failure modes above are covered. Suggest adding:

  • Mixed-eltype ComposedOperators (e.g. real + complex inner ops).
  • Non-square inner ops in ComposedOperator.
  • FunctionOperator inside an AddedOperator with both vector and matrix v.
  • mul!(w, L, v) and mul!(w, L, v, α, β) against the cached AddedOperator (not just *).
  • The benchmark in the PR description also only measures Base.summarysize; an in-place mul! benchmark would catch correctness regressions of this kind.

Suggested minimal fixes

  1. Extend _cache_compatible (or _get_cache_shapes) to carry and check eltype.
  2. Fix _get_cache_shapes(::ComposedOperator) so it matches what cache_self allocates, ideally by sharing a common helper.
  3. Fix _get_cache_shapes(::FunctionOperator) to include size(v,2) and output_eltype.
  4. Add the missing tests above so the eltype/non-square/in-place paths are covered.

Happy to take a closer look at a follow-up commit if useful. Reproducers I used:

repro scripts

# repro3.jl — confirms #1
using SciMLOperators, LinearAlgebra
Ar = MatrixOperator(rand(Float64, 4, 4));    Br = MatrixOperator(rand(Float64, 4, 4))
Ac = MatrixOperator(rand(ComplexF64, 4, 4)); Bc = MatrixOperator(rand(ComplexF64, 4, 4))
CR = Ar * Br; CC = Ac * Bc
L = CR + CC
v = rand(Float64, 4)
Ls = cache_operator(L, v)
@show map(eltype, Ls.ops[1].cache) map(eltype, Ls.ops[2].cache) Ls.ops[2].cache === Ls.ops[1].cache
w = similar(v, ComplexF64); fill!(w, 0)
mul!(w, Ls, v)   # InexactError
# repro.jl — confirms #2
using SciMLOperators
A = MatrixOperator(rand(5, 3)); B = MatrixOperator(rand(3, 4))
C = A * B; v = rand(4)
Cc = cache_operator(C, v)
@show SciMLOperators._get_cache_shapes(Cc, v)   # ((5,), (3,))
@show map(size, Cc.cache)                       # ((3,), (4,))

@albertomercurio
Copy link
Copy Markdown
Contributor Author

Ok I should have fixed all the issues. I'm not sure about the FunctionOperator. But the others should be fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants