EDIT:
Julia 1.11.1
Enzyme v0.13.13
StaticArrays v1.9.8
6.11.6-2-cachyos
AMD Ryzen 7 8845hs
Newbie in Julia and Enzyme here so forgive me if it's a skill issue.
First here's the gist with the problem documented: https://gist.github.com/ernestds/8dd881253ad7b49617da57aa22e61548
All I am trying to do is to calculate the hessian of some function.
I've identified three problems:
1)
Current scope:
; Function Attrs: alwaysinline mustprogress willreturn
error when trying to run F1, which goes away by replacing the loop with a nested one (F2)
F1:
F2 (nested loop):
for i = 1:200
for j=1:2
:)
...
The problem also doesn't happen when doing fewer iterations (~ < 40) (F1_less).
I've also tested the outer loop with more than 10000 iterations and it also worked; however, if I increase the inner loop above ~20 I get the same error.
-
The F3 definition gives the wrong result. This one uses the function cos.(x) in a for loop. Note that this doesn't happen if I manually nest the function call (F4) or manually (F5)*.
Additionally, I've found out that if I run F2 (correct result, not the same as the previous F2 of problem (1) because I messed up naming, I think it is clear in the code tho) after F3 (wrong) WITHOUT zeroing dx, dy and hess, F2 also gets the wrong result (numerically the same as F3). This doesn't happen if I zero the derivative arrays after F3 or if I run F2 two consecutive times (without zeroing the arrays).
*F5 has its own problem.
-
F5 (loop unrolled) doesn't compile if I include z[1] and z[2] in the output. (note that for F52, doesn't compile, y[1] = z[1]*z[1] + z[2]*z[2] and for F5, which does compile, y[1] = z[1]*z[1].
Also F5 takes much longer to compile than anything else.
I'll also just throw the question here: Note the amount of allocations when using the cos.(x) function (F3 and F5) - and to a lesser extent F4. Is this expected, a bug or user error? Would the correct way to do this be like F1 SVector{N,T}(cos(zz) for zz in z) or something?
To verify the values I've used Casadi in Python:
x = ca.SX.sym("x",2,1)
def f(x):
y = ca.cos(x)
for i in range(20):
y = ca.cos(y)
return y[0]**2 + y[1]**2
f = ca.Function("f",[x],ca.cse(ca.hessian(f(x),x)),["x"],["y","y2"],{"jit":True,"jit_options":{"flags":["-O3", "-march=native"]}})
#time_function(f,s=3)
f(1)
#hess = [[0.000270065, 00],
# [00, 0.000270065]]
#grad = [-0.000366693, -0.000366693]
Just for fun, the above Casadi function (but with 400 iterations loop) benchmarked:
Mean: 21.562 us; standard deviation: 1.006 us; min: 21.219 us; max: 54.193 us
vs ~9.3us with Enzyme. Pretty cool.
For completeness here's the code again:
using Enzyme
using StaticArrays
using BenchmarkTools
function cos_vec(z::SVector{N,T}) where {N,T}
return SVector{N,T}(cos(zz) for zz in z)
end
function cos_vec(z::NTuple{N,T}) where {N,T}
return NTuple{N,T}(cos(zz) for zz in z)
end
function cos_vec(z::AbstractArray{T}) where {T}
return T[cos(zz) for zz in z]
end
function f1!(x, y)
z1,z2 = cos_vec((x[1],x[2]))
for i = 1:400
z1,z2 = cos_vec((z1,z2))
end
y[1] = z1*z1 + z2*z2
return nothing
end
function f1_less!(x, y)
z1,z2 = cos_vec((x[1],x[2]))
for i = 1:20
z1,z2 = cos_vec((z1,z2))
end
y[1] = z1*z1 + z2*z2
return nothing
end
function f2!(x, y)
z1,z2 = cos_vec((x[1],x[2]))
for i = 1:200
for j=1:2
z1,z2 = cos_vec((z1,z2))
end
end
y[1] = z1*z1 + z2*z2
return nothing
end
function f2_less!(x, y)
z1,z2 = cos_vec((x[1],x[2]))
for i = 1:10
for j=1:2
z1,z2 = cos_vec((z1,z2))
end
end
y[1] = z1*z1 + z2*z2
return nothing
end
function grad!(x, dx, y, dy,f)
Enzyme.autodiff(Reverse, f, Duplicated(x, dx), DuplicatedNoNeed(y, dy),)
nothing
end
x = [1.0, 1.0]
y = Vector{Float64}(undef, 1)
dx = [0.0, 0.0]
dy = [1.0]
hess = ([0.0, 0.0], [0.0, 0.0])
F1:
Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f1!))
Current scope:
; Function Attrs: alwaysinline mustprogress willreturn
define void @preprocess_diffejulia_f1__99290wrap({} addrspace(10)* %0, {} addrspace(10)* nocapture readonly %1, {} addrspace(10)* %2, {} addrspace(10)* nocapture readonly %3) local_unnamed_addr #16 !dbg !910 {
entry:
%4 = call {}*** @julia.get_pgcstack() #18
F1 with less iterations: works with correct result
x = [1.0, 1.0]
y = Vector{Float64}(undef, 1)
dx = [0.0, 0.0]
dy = [1.0]
hess = ([0.0, 0.0], [0.0, 0.0])
Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f1_less!))
@show hess;
@show dx;
@benchmark Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f1_less!))
hess = ([0.000270064862399231, 0.0], [0.0, 0.000270064862399231])
dx = [-0.00036669251092078, -0.00036669251092078]
BenchmarkTools.Trial: 10000 samples with 104 evaluations.
Range (min … max): 786.481 ns … 1.147 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 802.385 ns ┊ GC (median): 0.00%
Time (mean ± σ): 810.563 ns ± 24.679 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂███▇▅▃▁ ▁ ▁▂▂▂▃▃▃▃▂▂▁ ▂
▃▆█▇████████▇▇██████████████▇▇▇▇▇▇▆▆▇▅▆▇▅▇▆▆▅▂▅▅▅▆▃▄▅▄▅▆▆▆▅▅ █
786 ns Histogram: log(frequency) by time 918 ns <
Memory estimate: 96 bytes, allocs estimate: 4.
x = [1.0, 1.0]
y = Vector{Float64}(undef, 1)
dx = [0.0, 0.0]
dy = [1.0]
hess = ([0.0, 0.0], [0.0, 0.0])
### F2
# works (result is also correct)
Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f2!))
@show hess;
@show dx;
@benchmark Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f2!))
@benchmark Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f2_less!))
hess = ([1.687031797366739e-69, 0.0], [0.0, 1.687031797366739e-69])
dx = [-2.2911801638083815e-69, -2.2911801638083815e-69]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 9.307 μs … 22.642 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 9.618 μs ┊ GC (median): 0.00%
Time (mean ± σ): 9.809 μs ± 980.584 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▃▆█▆▃▂▁ ▁
████████▇▆▆▆▅▅▄▅▅▃▄▄▅▃▃▃▄▅▄▃▅▄▅▅▄▅▄▅▅▅▄▅▅▆▄▄▃▃▅▄▄▃▄▅▅▆▅▆▆▆▄ █
9.31 μs Histogram: log(frequency) by time 14.7 μs <
Memory estimate: 96 bytes, allocs estimate: 4.
BenchmarkTools.Trial: 10000 samples with 101 evaluations.
Range (min … max): 795.634 ns … 2.424 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 809.812 ns ┊ GC (median): 0.00%
Time (mean ± σ): 837.734 ns ± 86.157 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅█▇▃▂▃▄▃▂▁▁▂▁▁▁▁▂▁▁▁▁▁▁ ▁
████████████████████████▇▆▆▅▅▅▅▅▅▅▅▄▅▅▃▄▄▃▄▂▂▅▅▃▃▃▄▄▃▃▄▃▄▃▂▅ █
796 ns Histogram: log(frequency) by time 1.21 μs <
Memory estimate: 96 bytes, allocs estimate: 4.
ground truth from Casadi:
x = ca.SX.sym("x",2,1)
def f(x):
y = ca.cos(x)
for i in range(20):
y = ca.cos(y)
return y[0]**2 + y[1]**2
f = ca.Function("f",[x],ca.cse(ca.hessian(f(x),x)),["x"],["y","y2"],{"jit":True,"jit_options":{"flags":["-O3", "-march=native"]}})
f(1)
#hess = [[0.000270065, 00],
# [00, 0.000270065]]
#grad = [-0.000366693, -0.000366693]
Problem 2:
F2a (different than the last F2 because I messed up the naming along the way, result is correct)
function f2!(x, y)
z1,z2 = cos_vec((x[1],x[2]))
# z1,z2 = cos_vec(SVector{2,Float64}(x))
for i = 1:20
z1,z2 = cos_vec((z1,z2))
end
# y[1] = z1*z1 + z2*z2
y[1] = z1*z1 + z2*z2
return nothing
end
dx = [0.0, 0.0]
dy = [1.0]
vx = ([1.0, 0.0], [0.0, 1.0])
hess = ([0.0, 0.0], [0.0, 0.0])
Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f2!));
@show hess;
@show dx;
Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f2!));
@show hess;
@show dx;
@benchmark Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f2!))
hess = ([0.000270064862399231, 0.0], [0.0, 0.000270064862399231])
dx = [-0.00036669251092078, -0.00036669251092078]
hess = ([0.000270064862399231, 0.0], [0.0, 0.000270064862399231])
dx = [-0.00036669251092078, -0.00036669251092078]
BenchmarkTools.Trial: 10000 samples with 103 evaluations.
Range (min … max): 781.369 ns … 2.937 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 802.864 ns ┊ GC (median): 0.00%
Time (mean ± σ): 831.833 ns ± 97.917 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄█▆▂▃▄▅▄▄▃▂▂▂▁ ▁
▇█████████████████▇▇▇█▇▆▆▆▆▅▆▆▅▅▅▅▅▆▆▆▅▆▆▆▅▅▅▅▄▄▄▂▄▄▆▅▅▅▃▅▅▅ █
781 ns Histogram: log(frequency) by time 1.19 μs <
Memory estimate: 96 bytes, allocs estimate: 4.
F3 (result is wrong)
function f3!(x, y)
z = cos.(x)
for i = 1:20
z = cos.(z)
end
y[1] = z[1]*z[1] + z[2]*z[2]
return nothing
end
dx = [0.0, 0.0]
dy = [1.0]
vx = ([1.0, 0.0], [0.0, 1.0])
hess = ([0.0, 0.0], [0.0, 0.0])
Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f3!));
sleep(1)
@show hess;
@show dx;
@benchmark Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f3!))
hess = ([-0.0002354505535806518, 0.0], [0.0, -0.0002354505535806518])
dx = [-0.00036669251092078, -0.00036669251092078]
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
Range (min … max): 3.249 μs … 2.134 ms ┊ GC (min … max): 0.00% … 99.25%
Time (median): 5.194 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.926 μs ± 34.013 μs ┊ GC (mean ± σ): 11.10% ± 1.98%
██ ▁
▂▇██▆▄▃▃▂▂▂▂▃▄▄▄▃▄▅▇█▇▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂ ▃
3.25 μs Histogram: frequency by time 12.1 μs <
Memory estimate: 9.94 KiB, allocs estimate: 256.
F2 after F3
interestingly the result of F2 also goes wrong (coinciding with F3) if I run it directly after F3 without zeroing dx, dy and hess:
dx = [0.0, 0.0]
dy = [1.0]
vx = ([1.0, 0.0], [0.0, 1.0])
hess = ([0.0, 0.0], [0.0, 0.0])
Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f3!));
println("f2 after f3")
Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f2!));
@show hess;
@show dx;
dx = [0.0, 0.0]
dy = [1.0]
hess = ([0.0, 0.0], [0.0, 0.0])
println("reseting dx/hess/dy")
Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f2!));
@show hess;
@show dx;
f2 after f3
hess = ([-0.0002354505535806518, 0.0], [0.0, -0.0002354505535806518])
dx = [-0.00036669251092078, -0.00036669251092078]
reseting dx/hess/dy
hess = ([0.000270064862399231, 0.0], [0.0, 0.000270064862399231])
dx = [-0.00036669251092078, -0.00036669251092078]
Just some other tries:
F4 (works fine):
function f4!(x, y)
z = cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(cos.(x)))))))))))))))))))))
y[1] = z[1]*z[1] + z[2]*z[2]
return nothing
end
dx = [0.0, 0.0]
dy = [1.0]
vx = ([1.0, 0.0], [0.0, 1.0])
hess = ([0.0, 0.0], [0.0, 0.0])
Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f4!));
@show hess;
@show dx;
hess = ([0.00027006486239923106, 0.0], [0.0, 0.00027006486239923106])
dx = [-0.00036669251092077997, -0.00036669251092077997]
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.184 μs … 6.682 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.214 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.308 μs ± 238.067 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄█▆▅▄▂ ▁▁▁▁▁ ▁
███████▇▇▆▆▇▇▇▇▇▇▇▇███████████████▇█▇▆▆▆▆▇▇▇█▇▇▇▇█▇▇▇▆▇▆▅▅▃ █
1.18 μs Histogram: log(frequency) by time 2.07 μs <
Memory estimate: 416 bytes, allocs estimate: 11.
F5 (note I changed the output to be only z[1]^2, works):
function f5!(x, y)
z = cos.(x)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
y[1] = z[1]*z[1]# + z[2]*z[2]
return nothing
end
dx = [0.0, 0.0]
dy = [1.0]
vx = ([1.0, 0.0], [0.0, 1.0])
hess = ([0.0, 0.0], [0.0, 0.0])
Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f5!));
@show hess;
@show dx;
hess = ([0.000270064862399231, 0.0], [0.0, 0.0])
dx = [-0.00036669251092078, 0.0]
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
Range (min … max): 3.996 μs … 2.585 ms ┊ GC (min … max): 0.00% … 99.46%
Time (median): 4.877 μs ┊ GC (median): 0.00%
Time (mean ± σ): 6.813 μs ± 38.121 μs ┊ GC (mean ± σ): 10.71% ± 1.98%
▆ █▃
▄█▄▃▄███▅▂▂▂▂▁▁▁▁▁▁▂▂▁▂▂▂▂▃▃▃▃▃▃▃▂▂▃▄▅▅▅▅▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
4 μs Histogram: frequency by time 10.3 μs <
Memory estimate: 9.75 KiB, allocs estimate: 250.
F5_2 (I changed back the output to z[1]^2 + z[2]^2, compilation fails):
function f52!(x, y)
z = cos.(x)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
z = cos.(z)
y[1] = z[1]*z[1] + z[2]*z[2]
return nothing
end
dx = [0.0, 0.0]
dy = [1.0]
vx = ([1.0, 0.0], [0.0, 1.0])
hess = ([0.0, 0.0], [0.0, 0.0])
Enzyme.autodiff(Enzyme.Forward, grad!, Enzyme.BatchDuplicated(x, vx), Enzyme.BatchDuplicated(dx, hess), Const(y), Const(dy),Const(f52!));
@show hess;
@show dx;
Enzyme compilation failed.
Current scope:
define void @julia_f52__102846({} addrspace(10)* nocapture noundef nonnull readonly align 8 dereferenceable(24) "enzyme_type"="{[-1]:Pointer, [-1,0]:Pointer, [-1,0,-1]:Float@double, [-1,8]:Pointer, [-1,8,0]:Integer, [-1,8,1]:Integer, [-1,8,2]:Integer, [-1,8,3]:Integer, [-1,8,4]:Integer, [-1,8,5]:Integer, [-1,8,6]:Integer, [-1,8,7]:Integer, [-1,8,8]:Pointer, [-1,8,8,-1]:Float@double, [-1,16]:Integer, [-1,17]:Integer, [-1,18]:Integer, [-1,19]:Integer, [-1,20]:Integer, [-1,21]:Integer, [-1,22]:Integer, [-1,23]:Integer}" "enzymejl_parmtype"="127076369513104" "enzymejl_parmtype_ref"="2" %0, {} addrspace(10)* noundef nonnull align 8 dereferenceable(24) "enzyme_type"="{[-1]:Pointer, [-1,0]:Pointer, [-1,0,-1]:Float@double, [-1,8]:Pointer, [-1,8,0]:Integer, [-1,8,1]:Integer, [-1,8,2]:Integer, [-1,8,3]:Integer, [-1,8,4]:Integer, [-1,8,5]:Integer, [-1,8,6]:Integer, [-1,8,7]:Integer, [-1,8,8]:Pointer, [-1,8,8,-1]:Float@double, [-1,16]:Integer, [-1,17]:Integer, [-1,18]:Integer, [-1,19]:Integer, [-1,20]:Integer, [-1,21]:Integer, [-1,22]:Integer, [-1,23]:Integer}" "enzymejl_parmtype"="127076369513104" "enzymejl_parmtype_ref"="2" %1) local_unnamed_addr #0 !dbg !10 {
top:
EDIT:
Julia 1.11.1
Enzyme v0.13.13
StaticArrays v1.9.8
6.11.6-2-cachyos
AMD Ryzen 7 8845hs
Newbie in Julia and Enzyme here so forgive me if it's a skill issue.
First here's the gist with the problem documented: https://gist.github.com/ernestds/8dd881253ad7b49617da57aa22e61548
All I am trying to do is to calculate the hessian of some function.
I've identified three problems:
1)
error when trying to run F1, which goes away by replacing the loop with a nested one (F2)
F1:
F2 (nested loop):
The problem also doesn't happen when doing fewer iterations (~ < 40) (F1_less).
I've also tested the outer loop with more than 10000 iterations and it also worked; however, if I increase the inner loop above ~20 I get the same error.
The F3 definition gives the wrong result. This one uses the function cos.(x) in a for loop. Note that this doesn't happen if I manually nest the function call (F4) or manually (F5)*.
Additionally, I've found out that if I run F2 (correct result, not the same as the previous F2 of problem (1) because I messed up naming, I think it is clear in the code tho) after F3 (wrong) WITHOUT zeroing
dx,dyandhess, F2 also gets the wrong result (numerically the same as F3). This doesn't happen if I zero the derivative arrays after F3 or if I run F2 two consecutive times (without zeroing the arrays).*F5 has its own problem.
F5 (loop unrolled) doesn't compile if I include z[1] and z[2] in the output. (note that for F52, doesn't compile,
y[1] = z[1]*z[1] + z[2]*z[2]and for F5, which does compile,y[1] = z[1]*z[1].Also F5 takes much longer to compile than anything else.
I'll also just throw the question here: Note the amount of allocations when using the
cos.(x)function (F3 and F5) - and to a lesser extent F4. Is this expected, a bug or user error? Would the correct way to do this be like F1SVector{N,T}(cos(zz) for zz in z)or something?To verify the values I've used Casadi in Python:
Just for fun, the above Casadi function (but with 400 iterations loop) benchmarked:
Mean: 21.562 us; standard deviation: 1.006 us; min: 21.219 us; max: 54.193 usvs ~9.3us with Enzyme. Pretty cool.
For completeness here's the code again:
F1:
F1 with less iterations: works with correct result
ground truth from Casadi:
Problem 2:
F2a (different than the last F2 because I messed up the naming along the way, result is correct)
F3 (result is wrong)
F2 after F3
interestingly the result of F2 also goes wrong (coinciding with F3) if I run it directly after F3 without zeroing dx, dy and hess:
Just some other tries:
F4 (works fine):
F5 (note I changed the output to be only z[1]^2, works):
F5_2 (I changed back the output to z[1]^2 + z[2]^2, compilation fails):