Is your feature request related to a problem? Please describe.
To get kernel performance matching `clang`, we have had to add fast-math flags such as `contract` (which `clang` and `nvcc` enable by default). Currently we do this with an ugly hack; see, for example, lines 21 to 57 of CUDA.jl/perf/volumerhs.jl at commit bb37b50:
```julia
# HACK: module-local versions of core arithmetic; needed to get FMA
for (jlf, f) in zip((:+, :*, :-), (:add, :mul, :sub))
    for (T, llvmT) in ((:Float32, "float"), (:Float64, "double"))
        ir = """
            %x = f$f contract nsz $llvmT %0, %1
            ret $llvmT %x
        """
        @eval begin
            # the @pure is necessary so that we can constant propagate.
            @inline Base.@pure function $jlf(a::$T, b::$T)
                Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
            end
        end
    end
    @eval function $jlf(args...)
        Base.$jlf(args...)
    end
end

let (jlf, f) = (:div_arcp, :div)
    for (T, llvmT) in ((:Float32, "float"), (:Float64, "double"))
        ir = """
            %x = f$f fast $llvmT %0, %1
            ret $llvmT %x
        """
        @eval begin
            # the @pure is necessary so that we can constant propagate.
            @inline Base.@pure function $jlf(a::$T, b::$T)
                Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
            end
        end
    end
    @eval function $jlf(args...)
        Base.$jlf(args...)
    end
end
rcp(x) = div_arcp(one(x), x) # still leads to rcp.rn which is also a function call
```
Describe the solution you'd like
I would like a macro like `@fastmath` that gives fine-grained control over the fast-math flags.
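To make the request concrete, here is a rough sketch of what such a macro could look like. Everything in it is hypothetical: `@fmf`, `fmf_op`, `rewrite_fmf`, `llvm_instr`, and `FMF_FLAGS` are made-up names, not an existing Julia or CUDA.jl API. It rewrites `+`, `-`, `*`, and `/` calls with two or more arguments into `Base.llvmcall` instructions that carry only the requested LLVM flags, and falls back to the ordinary operators for anything that is not a pair of `Float32` or `Float64` values. I have only thought about it for CPU code; that it would behave the same inside GPU kernels is an assumption.

```julia
# Hypothetical sketch: none of these names exist in Base or CUDA.jl.

# LLVM fast-math flags accepted by the sketch.
const FMF_FLAGS = (:nnan, :ninf, :nsz, :arcp, :contract, :afn, :reassoc, :fast)

# Map a Julia operator to the matching LLVM floating-point instruction.
llvm_instr(op::Symbol) =
    op === :+ ? "fadd" :
    op === :- ? "fsub" :
    op === :* ? "fmul" :
    op === :/ ? "fdiv" : error("unsupported operator: $op")

# Emit a single instruction with exactly the requested flags.
@generated function fmf_op(::Val{op}, ::Val{flags}, a::T, b::T) where {op,flags,T<:Union{Float32,Float64}}
    llvmT = T === Float32 ? "float" : "double"
    ir = """
        %x = $(llvm_instr(op)) $(join(flags, ' ')) $llvmT %0, %1
        ret $llvmT %x
        """
    return quote
        $(Expr(:meta, :inline))
        Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
    end
end

# Any other argument types fall back to the ordinary Base operator.
fmf_op(::Val{op}, ::Val, a, b) where {op} = getfield(Base, op)(a, b)

# Literals, symbols, and line-number nodes are left untouched.
rewrite_fmf(ex, flags) = ex

# Recursively replace arithmetic calls with flagged `fmf_op` calls.
function rewrite_fmf(ex::Expr, flags)
    args = Any[rewrite_fmf(a, flags) for a in ex.args]
    if ex.head === :call && length(args) >= 3 && args[1] in (:+, :-, :*, :/)
        op = args[1]
        # `a + b + c` parses as one n-ary call, so fold it into binary calls.
        return foldl((l, r) -> Expr(:call, fmf_op, Val(op), Val(flags), l, r), args[2:end])
    end
    return Expr(ex.head, args...)
end

# Usage: @fmf flag1 flag2 ... expression
macro fmf(args...)
    flags = args[1:end-1]
    all(f -> f in FMF_FLAGS, flags) || error("unknown fast-math flag in $(flags)")
    return esc(rewrite_fmf(args[end], flags))
end

# Only allow contraction (FMA formation) and ignoring the sign of zero.
fma_like(a, b, c) = @fmf contract nsz (a * b + c)

# Only allow the reciprocal transformation on the division.
rcp_approx(x) = @fmf arcp (one(x) / x)
```

One way to check the result is `@code_llvm fma_like(1f0, 2f0, 3f0)`, which should show the flags on the `fmul` and `fadd` instructions. Unlike the hack above, this does not shadow `+`, `*`, and `-` for the whole module, so the flags apply only to the annotated expression.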
Describe alternatives you've considered
KernelAbstractions used to do this with Cassette.jl (https://github.com/JuliaLabs/Cassette.jl). Other people use macros, although that exposes fewer optimization opportunities and is therefore less desirable. I don't know whether CassetteOverlay.jl (https://github.com/JuliaDebug/CassetteOverlay.jl) can be used with kernels, but it might be a possible way to implement this.
It would be nice if this functionality were eventually added to Base Julia.