Reduce usage of libdevice, relying more on LLVM #3149

Merged: maleadt merged 10 commits into main from tb/fastmath on May 21, 2026
@maleadt maleadt commented May 20, 2026

Building on JuliaGPU/GPUCompiler.jl#805, JuliaGPU/GPUCompiler.jl#804, JuliaGPU/GPUCompiler.jl#800, this PR aims to avoid some of the uses of libdevice's intrinsics, instead emitting vanilla LLVM IR and having GPUCompiler.jl post-process it into what we need in PTX. This has many advantages, including (potentially) better optimization, compatibility with LLVM tools like Enzyme, etc.

cc @vchuravy

maleadt and others added 6 commits May 20, 2026 17:15
GPUCompiler's `PTXFDivFastPass` handles `afn`-flagged fdiv (covering
`@fastmath` per-call and the `fastmath=true` job kwarg), and NVPTX
already pattern-matches plain `fdiv 1.0, x` to `rcp.rn`. The only
remaining override is `FastMath.inv_fast(::AbstractFloat)`, which
Julia upstream doesn't implement for floats — route through `div_fast`
so the pass sees `afn`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
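
As a rough illustration of the `inv_fast` routing described in this commit, the sketch below can be run on a CPU-only Julia session. `my_inv_fast` and `llvm_ir` are hypothetical names introduced here, not the actual `@device_override` from the PR. The point is only that a reciprocal written via `Base.FastMath.div_fast` should carry fast-math flags on the `fdiv` (LLVM's `fast` shorthand includes `afn`), which is the kind of instruction the commit message says `PTXFDivFastPass` picks up.

```julia
using InteractiveUtils  # provides code_llvm

# Hypothetical stand-in for the override: express the reciprocal as a
# fast division so the resulting `fdiv` carries fast-math flags.
my_inv_fast(x::T) where {T<:AbstractFloat} = Base.FastMath.div_fast(one(T), x)

# Helper (an assumption of this sketch): render the LLVM IR for a
# function and argument types as a string, without debug info.
llvm_ir(f, types) = sprint(io -> code_llvm(io, f, types; debuginfo=:none))

ir = llvm_ir(my_inv_fast, (Float32,))
println(ir)  # look for an `fdiv` instruction with fast-math flags
```

This only inspects host-side IR; what the NVPTX backend finally emits depends on the GPUCompiler passes and is not shown here.
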
isfinite/isinf/isnan, signbit/copysign/abs, trunc/ceil/floor, fma, and
muladd inherit from Julia. Julia emits canonical LLVM ops (`llvm.fabs`,
`llvm.floor`, `llvm.copysign`, `llvm.fma`, `fmul contract + fadd contract`,
etc.), and the NVPTX backend lowers them to the same single-instruction
PTX the libdevice overrides used to produce after inlining.

`Base.fma(::Float16,...)` is the lone exception — its `jl_have_fma`
runtime call isn't recognized by GPUCompiler's `cpu_features!`, so the
branch survives the optimizer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
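
The claim that Julia already emits canonical LLVM operations for these functions can be spot-checked on the host. This is a minimal sketch, assuming a CPU-only session; `llvm_ir` is a hypothetical helper, and the exact IR text (in particular for `muladd`) may differ between Julia versions.

```julia
using InteractiveUtils  # provides code_llvm

# Helper (an assumption of this sketch): LLVM IR of `f` for `types` as a string.
llvm_ir(f, types) = sprint(io -> code_llvm(io, f, types; debuginfo=:none))

# A few of the functions whose libdevice overrides were dropped.
cases = (
    (abs,      (Float32,)),
    (floor,    (Float32,)),
    (copysign, (Float32, Float32)),
    (muladd,   (Float32, Float32, Float32)),
)

for (f, types) in cases
    println("== ", f, " ", types)
    println(llvm_ir(f, types))
end
```

On the host, these are expected to show calls such as `llvm.fabs`, `llvm.floor`, and `llvm.copysign` rather than library calls; per the commit message, the NVPTX backend then lowers the same operations to single PTX instructions.
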
Pin down PTX for the ops whose `@device_override`s were dropped — abs,
floor/ceil/trunc, isnan/isinf/isfinite/signbit, copysign, min/max — across
{f32, f64}, plain vs. `@fastmath` where it matters, and with job-wide
`fastmath=true` (which also flips f32 ops to their `.ftz` variants via
`apply_fastmath!`'s `denormal-fp-math-f32` attribute).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot left a comment

CUDA.jl Benchmarks

| Benchmark suite | Current: 986ca42 | Previous: c0534dd | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 99469 ns | 99381 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 74699 ns | 74944 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1575060 ns | 1574920 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 140156 ns | 140278 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 652352 ns | 652596 ns | 1.00 |
| array/accumulate/Int64/1d | 116679 ns | 116586 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 78628 ns | 78207 ns | 1.01 |
| array/accumulate/Int64/dims=1L | 1683947 ns | 1682646 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 150782 ns | 150838 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 959162 ns | 959136 ns | 1.00 |
| array/broadcast | 20011 ns | 19887 ns | 1.01 |
| array/construct | 1185.9 ns | 1190.3 ns | 1.00 |
| array/copy | 16565 ns | 17076 ns | 0.97 |
| array/copyto!/cpu_to_gpu | 212210 ns | 211721 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 279922 ns | 279606 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 10401 ns | 10658 ns | 0.98 |
| array/iteration/findall/bool | 130793 ns | 131160 ns | 1.00 |
| array/iteration/findall/int | 145151 ns | 145504 ns | 1.00 |
| array/iteration/findfirst/bool | 78686 ns | 78933 ns | 1.00 |
| array/iteration/findfirst/int | 80187 ns | 80313 ns | 1.00 |
| array/iteration/findmin/1d | 63791 ns | 67478 ns | 0.95 |
| array/iteration/findmin/2d | 101586 ns | 111587 ns | 0.91 |
| array/iteration/logical | 187789 ns | 189045 ns | 0.99 |
| array/iteration/scalar | 64529 ns | 65291 ns | 0.99 |
| array/permutedims/2d | 49215 ns | 49578 ns | 0.99 |
| array/permutedims/3d | 49748 ns | 50060 ns | 0.99 |
| array/permutedims/4d | 49496 ns | 49640 ns | 1.00 |
| array/random/rand/Float32 | 11940 ns | 11793 ns | 1.01 |
| array/random/rand/Int64 | 23373 ns | 23510 ns | 0.99 |
| array/random/rand!/Float32 | 8159.333333333333 ns | 8143.666666666667 ns | 1.00 |
| array/random/rand!/Int64 | 20269 ns | 20558 ns | 0.99 |
| array/random/randn/Float32 | 34507 ns | 35246 ns | 0.98 |
| array/random/randn!/Float32 | 24040 ns | 24445 ns | 0.98 |
| array/reductions/mapreduce/Float32/1d | 32769 ns | 32850 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1 | 37846 ns | 38019 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 50060 ns | 50260 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 55201 ns | 55487 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2L | 66587 ns | 66867 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 40014 ns | 39021 ns | 1.03 |
| array/reductions/mapreduce/Int64/dims=1 | 40604 ns | 41069 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1L | 86187 ns | 86620 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 57917 ns | 57870 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 82557 ns | 82290 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 32651 ns | 32691 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 37928 ns | 38016 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 50240 ns | 50354 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 55386 ns | 55481 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 66932 ns | 67234 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 39259 ns | 38804 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 40346 ns | 40976 ns | 0.98 |
| array/reductions/reduce/Int64/dims=1L | 86048 ns | 86550 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2 | 57824 ns | 57935 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 82149 ns | 82488 ns | 1.00 |
| array/reverse/1d | 16626 ns | 16936 ns | 0.98 |
| array/reverse/1dL | 67580 ns | 67830 ns | 1.00 |
| array/reverse/1dL_inplace | 65280 ns | 65354 ns | 1.00 |
| array/reverse/1d_inplace | 8294.333333333334 ns | 10092 ns | 0.82 |
| array/reverse/2d | 19779 ns | 19828 ns | 1.00 |
| array/reverse/2dL | 71532 ns | 71681 ns | 1.00 |
| array/reverse/2dL_inplace | 65121 ns | 65299 ns | 1.00 |
| array/reverse/2d_inplace | 9697 ns | 9999 ns | 0.97 |
| array/sorting/1d | 2714988 ns | 2721914 ns | 1.00 |
| array/sorting/2d | 1064147 ns | 1066145 ns | 1.00 |
| array/sorting/by | 3280513 ns | 3288145 ns | 1.00 |
| cuda/synchronization/context/auto | 1097.2 ns | 1130.5 ns | 0.97 |
| cuda/synchronization/context/blocking | 894.2045454545455 ns | 895.8947368421053 ns | 1.00 |
| cuda/synchronization/context/nonblocking | 5957 ns | 6056.2 ns | 0.98 |
| cuda/synchronization/stream/auto | 968.7894736842105 ns | 976.7333333333333 ns | 0.99 |
| cuda/synchronization/stream/blocking | 791.1923076923077 ns | 793.7916666666666 ns | 1.00 |
| cuda/synchronization/stream/nonblocking | 5850.5 ns | 5977.285714285715 ns | 0.98 |
| integration/byval/reference | 143210 ns | 143365 ns | 1.00 |
| integration/byval/slices=1 | 145163 ns | 145406 ns | 1.00 |
| integration/byval/slices=2 | 283899 ns | 283824 ns | 1.00 |
| integration/byval/slices=3 | 422144 ns | 422343 ns | 1.00 |
| integration/cudadevrt | 101658 ns | 101785 ns | 1.00 |
| integration/volumerhs | 11240669 ns | 11093917 ns | 1.01 |
| kernel/indexing | 12483 ns | 12618 ns | 0.99 |
| kernel/indexing_checked | 13360 ns | 13432 ns | 0.99 |
| kernel/launch | 2116.777777777778 ns | 2074 ns | 1.02 |
| kernel/occupancy | 698.3197278911565 ns | 695.625850340136 ns | 1.00 |
| kernel/rand | 15798 ns | 13974 ns | 1.13 |
| latency/import | 3847309915 ns | 3869060589 ns | 0.99 |
| latency/precompile | 4623850023 ns | 4623502823 ns | 1.00 |
| latency/ttfp | 4492330513 ns | 4510157748 ns | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

Comment thread CUDACore/src/device/intrinsics/math.jl
# `llvm.fmuladd` (rather than Julia's default `fmul contract + fadd contract`)
# keeps the fusion robust under vectorization (per JuliaGPU/CUDA.jl#3149).
@device_override Base.muladd(x::Float64, y::Float64, z::Float64) =
    ccall("llvm.fmuladd.f64", llvmcall, Cdouble, (Cdouble, Cdouble, Cdouble), x, y, z)
We should really do this upstream as well. Our `muladd` could technically leak in the way we emit it.
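
To compare the two emission strategies under discussion, the following sketch reproduces the `ccall` pattern from the reviewed snippet as an ordinary host function and prints its IR next to that of Base's `muladd`. `fmuladd64` and `llvm_ir` are hypothetical names used only for this illustration; it is not the device override itself and says nothing about how vectorization behaves on the GPU.

```julia
using InteractiveUtils  # provides code_llvm

# Host-side copy of the pattern from the snippet above (no @device_override):
# call the `llvm.fmuladd.f64` intrinsic directly.
fmuladd64(x::Float64, y::Float64, z::Float64) =
    ccall("llvm.fmuladd.f64", llvmcall, Cdouble, (Cdouble, Cdouble, Cdouble), x, y, z)

# Helper (an assumption of this sketch): LLVM IR of `f` for `types` as a string.
llvm_ir(f, types) = sprint(io -> code_llvm(io, f, types; debuginfo=:none))

sig = (Float64, Float64, Float64)
println(llvm_ir(fmuladd64, sig))  # a single intrinsic call
println(llvm_ir(muladd, sig))     # Base's form; may be separate fmul/fadd ops
```

The intrinsic keeps the multiply and add together as one IR-level operation, whereas separate `fmul`/`fadd` instructions rely on the `contract` flag surviving later transformations.
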

rsqrt now uses high-level `@fastmath 1/sqrt(x)`; GPUCompiler's new
PTXRSqrtFastPass lowers it to `nvvm.rsqrt.approx.{f,d}` directly. Adds a
PTX FileCheck test pinning the lowering.

Pin `arch=sm"80"` on the min.NaN.f32 / max.NaN.f32 PTX checks so they
pass on sm_75 CI runners.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `@fastmath 1/sqrt(x)` form stamps `fast` (nnan/ninf/...) on the IR
operations, which let LLVM DCE caller-side `isnan(rsqrt(x))` and
`isinf(rsqrt(x))` checks before our PTXRSqrtFastPass folded the pattern —
a behavior regression versus the libdevice path. Direct `ccall` to
`llvm.nvvm.rsqrt.approx.{f,d}` is opaque to fast-math reasoning, matches
what libdevice itself does (a thin wrapper around the same intrinsic),
and produces strictly cleaner IR than libdevice (single rsqrt call +
select rather than phi + duplicate call).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
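
The fast-math concern in this commit can be explored on the host with a small sketch. `rsqrt_fast`, `guarded`, and `llvm_ir` are hypothetical names for illustration; the sketch uses the `@fastmath` form that the commit moved away from, not the `llvm.nvvm.rsqrt.approx` intrinsic, which is only available when targeting NVPTX.

```julia
using InteractiveUtils  # provides code_llvm

# The high-level form: every operation is stamped with fast-math flags.
rsqrt_fast(x::Float32) = @fastmath one(x) / sqrt(x)

# A caller that tries to guard against NaN results. Because the value comes
# from operations flagged as never producing NaN, the optimizer is allowed
# to assume the check is always false and drop it.
guarded(x::Float32) = (r = rsqrt_fast(x); isnan(r) ? zero(x) : r)

# Helper (an assumption of this sketch): LLVM IR of `f` for `types` as a string.
llvm_ir(f, types) = sprint(io -> code_llvm(io, f, types; debuginfo=:none))

println(llvm_ir(rsqrt_fast, (Float32,)))  # look for `fast` on sqrt and fdiv
println(llvm_ir(guarded, (Float32,)))     # check whether the NaN test survives
```

Whether the guard is actually removed depends on the Julia and LLVM versions in use, so the second listing is worth reading rather than assuming. An opaque intrinsic call carries no such flags, which is why the commit describes it as invisible to fast-math reasoning.
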
codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.41%. Comparing base (c0534dd) to head (986ca42).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3149      +/-   ##
==========================================
+ Coverage   16.40%   16.41%   +0.01%     
==========================================
  Files         124      124              
  Lines        9827     9827              
==========================================
+ Hits         1612     1613       +1     
+ Misses       8215     8214       -1     

@maleadt maleadt merged commit 2fe75d6 into main May 21, 2026
2 checks passed
@maleadt maleadt deleted the tb/fastmath branch May 21, 2026 11:18

3 participants