Reduce usage of libdevice, relying more on LLVM #3149
Merged
vchuravy
reviewed
May 20, 2026
GPUCompiler's `PTXFDivFastPass` handles `afn`-flagged fdiv (covering `@fastmath` per-call and the `fastmath=true` job kwarg), and NVPTX already pattern-matches plain `fdiv 1.0, x` to `rcp.rn`. The only remaining override is `FastMath.inv_fast(::AbstractFloat)`, which Julia upstream doesn't implement for floats; route it through `div_fast` so the pass sees `afn`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
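As a rough CPU-side illustration of the `inv_fast` routing in the commit message above, the sketch below defines a hypothetical `inv_via_div_fast` (the name is illustrative and not taken from the PR) that forwards to `Base.FastMath.div_fast`, then prints the LLVM IR so the fast-math flags on the `fdiv` can be inspected. It assumes a stock Julia install and does not involve CUDA.jl or GPUCompiler; the actual override in the PR may be written differently.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical helper (illustrative name, not CUDA.jl's): compute 1/x through
# div_fast so the division instruction carries fast-math flags.
inv_via_div_fast(x::T) where {T<:Union{Float32,Float64}} =
    Base.FastMath.div_fast(one(T), x)

# Capture the LLVM IR as a string so it can be printed or searched.
inv_ir = sprint(io -> code_llvm(io, inv_via_div_fast, (Float32,); debuginfo=:none))
println(inv_ir)

# The result is only guaranteed to be approximately 1/x under fast-math.
println(inv_via_div_fast(4.0f0))
```

On the GPU side, the flags on that `fdiv` are what a pass such as `PTXFDivFastPass` would key on; this CPU sketch only shows where they come from.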
isfinite/isinf/isnan, signbit/copysign/abs, trunc/ceil/floor, fma, and muladd inherit from Julia. Julia emits canonical LLVM ops (`llvm.fabs`, `llvm.floor`, `llvm.copysign`, `llvm.fma`, `fmul contract + fadd contract`, etc.), and the NVPTX backend lowers them to the same single-instruction PTX the libdevice overrides used to produce after inlining. `Base.fma(::Float16,...)` is the lone exception: its `jl_have_fma` runtime call isn't recognized by GPUCompiler's `cpu_features!`, so the branch survives the optimizer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
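The claim that Julia already emits canonical LLVM ops for these functions can be checked on the host without a GPU. The following is a minimal sketch, assuming a stock Julia install; `llvm_ir` is a hypothetical helper name introduced here, and the exact IR text may vary between Julia versions.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical helper: return the LLVM IR of f specialized on `types` as a string.
llvm_ir(f, types) = sprint(io -> code_llvm(io, f, types; debuginfo=:none))

# Print the IR for a few of the functions whose overrides were dropped.
for (f, types) in ((abs, (Float32,)),
                   (floor, (Float64,)),
                   (copysign, (Float32, Float32)),
                   (muladd, (Float64, Float64, Float64)))
    println("== ", f, types, " ==")
    println(llvm_ir(f, types))
end
```

The same functions compiled for the GPU go through GPUCompiler and the NVPTX backend, so the PTX itself is not visible from this sketch; it only shows the IR the backend starts from.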
Pin down PTX for the ops whose `@device_override`s were dropped — abs,
floor/ceil/trunc, isnan/isinf/isfinite/signbit, copysign, min/max — across
{f32, f64}, plain vs. `@fastmath` where it matters, and with job-wide
`fastmath=true` (which also flips f32 ops to their `.ftz` variants via
`apply_fastmath!`'s `denormal-fp-math-f32` attribute).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CUDA.jl Benchmarks
| Benchmark suite | Current: 986ca42 | Previous: c0534dd | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 99469 ns | 99381 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 74699 ns | 74944 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1575060 ns | 1574920 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 140156 ns | 140278 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 652352 ns | 652596 ns | 1.00 |
| array/accumulate/Int64/1d | 116679 ns | 116586 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 78628 ns | 78207 ns | 1.01 |
| array/accumulate/Int64/dims=1L | 1683947 ns | 1682646 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 150782 ns | 150838 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 959162 ns | 959136 ns | 1.00 |
| array/broadcast | 20011 ns | 19887 ns | 1.01 |
| array/construct | 1185.9 ns | 1190.3 ns | 1.00 |
| array/copy | 16565 ns | 17076 ns | 0.97 |
| array/copyto!/cpu_to_gpu | 212210 ns | 211721 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 279922 ns | 279606 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 10401 ns | 10658 ns | 0.98 |
| array/iteration/findall/bool | 130793 ns | 131160 ns | 1.00 |
| array/iteration/findall/int | 145151 ns | 145504 ns | 1.00 |
| array/iteration/findfirst/bool | 78686 ns | 78933 ns | 1.00 |
| array/iteration/findfirst/int | 80187 ns | 80313 ns | 1.00 |
| array/iteration/findmin/1d | 63791 ns | 67478 ns | 0.95 |
| array/iteration/findmin/2d | 101586 ns | 111587 ns | 0.91 |
| array/iteration/logical | 187789 ns | 189045 ns | 0.99 |
| array/iteration/scalar | 64529 ns | 65291 ns | 0.99 |
| array/permutedims/2d | 49215 ns | 49578 ns | 0.99 |
| array/permutedims/3d | 49748 ns | 50060 ns | 0.99 |
| array/permutedims/4d | 49496 ns | 49640 ns | 1.00 |
| array/random/rand/Float32 | 11940 ns | 11793 ns | 1.01 |
| array/random/rand/Int64 | 23373 ns | 23510 ns | 0.99 |
| array/random/rand!/Float32 | 8159.333333333333 ns | 8143.666666666667 ns | 1.00 |
| array/random/rand!/Int64 | 20269 ns | 20558 ns | 0.99 |
| array/random/randn/Float32 | 34507 ns | 35246 ns | 0.98 |
| array/random/randn!/Float32 | 24040 ns | 24445 ns | 0.98 |
| array/reductions/mapreduce/Float32/1d | 32769 ns | 32850 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1 | 37846 ns | 38019 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 50060 ns | 50260 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 55201 ns | 55487 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2L | 66587 ns | 66867 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 40014 ns | 39021 ns | 1.03 |
| array/reductions/mapreduce/Int64/dims=1 | 40604 ns | 41069 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1L | 86187 ns | 86620 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 57917 ns | 57870 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 82557 ns | 82290 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 32651 ns | 32691 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 37928 ns | 38016 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 50240 ns | 50354 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 55386 ns | 55481 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 66932 ns | 67234 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 39259 ns | 38804 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 40346 ns | 40976 ns | 0.98 |
| array/reductions/reduce/Int64/dims=1L | 86048 ns | 86550 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2 | 57824 ns | 57935 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 82149 ns | 82488 ns | 1.00 |
| array/reverse/1d | 16626 ns | 16936 ns | 0.98 |
| array/reverse/1dL | 67580 ns | 67830 ns | 1.00 |
| array/reverse/1dL_inplace | 65280 ns | 65354 ns | 1.00 |
| array/reverse/1d_inplace | 8294.333333333334 ns | 10092 ns | 0.82 |
| array/reverse/2d | 19779 ns | 19828 ns | 1.00 |
| array/reverse/2dL | 71532 ns | 71681 ns | 1.00 |
| array/reverse/2dL_inplace | 65121 ns | 65299 ns | 1.00 |
| array/reverse/2d_inplace | 9697 ns | 9999 ns | 0.97 |
| array/sorting/1d | 2714988 ns | 2721914 ns | 1.00 |
| array/sorting/2d | 1064147 ns | 1066145 ns | 1.00 |
| array/sorting/by | 3280513 ns | 3288145 ns | 1.00 |
| cuda/synchronization/context/auto | 1097.2 ns | 1130.5 ns | 0.97 |
| cuda/synchronization/context/blocking | 894.2045454545455 ns | 895.8947368421053 ns | 1.00 |
| cuda/synchronization/context/nonblocking | 5957 ns | 6056.2 ns | 0.98 |
| cuda/synchronization/stream/auto | 968.7894736842105 ns | 976.7333333333333 ns | 0.99 |
| cuda/synchronization/stream/blocking | 791.1923076923077 ns | 793.7916666666666 ns | 1.00 |
| cuda/synchronization/stream/nonblocking | 5850.5 ns | 5977.285714285715 ns | 0.98 |
| integration/byval/reference | 143210 ns | 143365 ns | 1.00 |
| integration/byval/slices=1 | 145163 ns | 145406 ns | 1.00 |
| integration/byval/slices=2 | 283899 ns | 283824 ns | 1.00 |
| integration/byval/slices=3 | 422144 ns | 422343 ns | 1.00 |
| integration/cudadevrt | 101658 ns | 101785 ns | 1.00 |
| integration/volumerhs | 11240669 ns | 11093917 ns | 1.01 |
| kernel/indexing | 12483 ns | 12618 ns | 0.99 |
| kernel/indexing_checked | 13360 ns | 13432 ns | 0.99 |
| kernel/launch | 2116.777777777778 ns | 2074 ns | 1.02 |
| kernel/occupancy | 698.3197278911565 ns | 695.625850340136 ns | 1.00 |
| kernel/rand | 15798 ns | 13974 ns | 1.13 |
| latency/import | 3847309915 ns | 3869060589 ns | 0.99 |
| latency/precompile | 4623850023 ns | 4623502823 ns | 1.00 |
| latency/ttfp | 4492330513 ns | 4510157748 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
vchuravy
approved these changes
May 20, 2026
gbaraldi
reviewed
May 20, 2026
    # `llvm.fmuladd` (rather than Julia's default `fmul contract + fadd contract`)
    # keeps the fusion robust under vectorization (per JuliaGPU/CUDA.jl#3149).
    @device_override Base.muladd(x::Float64, y::Float64, z::Float64) =
        ccall("llvm.fmuladd.f64", llvmcall, Cdouble, (Cdouble, Cdouble, Cdouble), x, y, z)
We should really do this upstream as well. Our muladd could technically leak in the way we emit it.
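For readers who want to see the difference between the two forms on the host, here is a small sketch that calls the `llvm.fmuladd.f64` intrinsic from plain Julia and prints the resulting IR. `fmuladd_f64` is a hypothetical name used only for illustration; the PR itself attaches the `ccall` to `Base.muladd` through `@device_override`, which this CPU-only sketch does not attempt to reproduce.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical stand-in for the override: call the intrinsic directly instead
# of emitting separate fmul/fadd instructions that carry the `contract` flag.
fmuladd_f64(x::Float64, y::Float64, z::Float64) =
    ccall("llvm.fmuladd.f64", llvmcall, Cdouble, (Cdouble, Cdouble, Cdouble), x, y, z)

# Capture the IR; the intrinsic call should still be visible at this level.
fmuladd_ir = sprint(io -> code_llvm(io, fmuladd_f64, (Float64, Float64, Float64);
                                    debuginfo=:none))
println(fmuladd_ir)

# 2*3 + 4 is exactly representable, so fused and unfused results agree here.
println(fmuladd_f64(2.0, 3.0, 4.0))
```

Comparing this IR with that of `muladd` on the same argument types shows the single intrinsic call versus the `fmul contract` / `fadd contract` pair that the code comment refers to.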
rsqrt now uses high-level `@fastmath 1/sqrt(x)`; GPUCompiler's new
PTXRSqrtFastPass lowers it to `nvvm.rsqrt.approx.{f,d}` directly. Adds a
PTX FileCheck test pinning the lowering.
Pin `arch=sm"80"` on the min.NaN.f32 / max.NaN.f32 PTX checks so they
pass on sm_75 CI runners.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `@fastmath 1/sqrt(x)` form stamps `fast` (nnan/ninf/...) on the IR
operations, which let LLVM DCE caller-side `isnan(rsqrt(x))` and
`isinf(rsqrt(x))` checks before our PTXRSqrtFastPass folded the pattern —
a behavior regression versus the libdevice path. Direct `ccall` to
`llvm.nvvm.rsqrt.approx.{f,d}` is opaque to fast-math reasoning, matches
what libdevice itself does (a thin wrapper around the same intrinsic),
and produces strictly cleaner IR than libdevice (single rsqrt call +
select rather than phi + duplicate call).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report
All modified and coverable lines are covered by tests.

Coverage diff, main vs. #3149:

| Metric | main | #3149 | +/- |
|---|---|---|---|
| Coverage | 16.40% | 16.41% | +0.01% |
| Files | 124 | 124 | |
| Lines | 9827 | 9827 | |
| Hits | 1612 | 1613 | +1 |
| Misses | 8215 | 8214 | -1 |
Building on JuliaGPU/GPUCompiler.jl#805, JuliaGPU/GPUCompiler.jl#804, and JuliaGPU/GPUCompiler.jl#800, this PR aims to avoid some uses of libdevice's intrinsics, instead emitting vanilla LLVM IR and having GPUCompiler.jl post-process it into what we need in PTX. This has several advantages, including (potentially) better optimization and compatibility with LLVM tools like Enzyme.

cc @vchuravy