Reduce usage of `libdevice`, relying more on LLVM by maleadt · Pull Request #3149 · JuliaGPU/CUDA.jl

maleadt · 2026-05-20T14:30:37Z

Building on JuliaGPU/GPUCompiler.jl#805, JuliaGPU/GPUCompiler.jl#804, JuliaGPU/GPUCompiler.jl#800, this PR aims to avoid some of the uses of libdevice's intrinsics, instead emitting vanilla LLVM IR and having GPUCompiler.jl post-process it into what we need in PTX. This has many advantages, including (potentially) better optimization, compatibility with LLVM tools like Enzyme, etc.

cc @vchuravy

GPUCompiler's `PTXFDivFastPass` handles `afn`-flagged fdiv (covering `@fastmath` per-call and the `fastmath=true` job kwarg), and NVPTX already pattern-matches plain `fdiv 1.0, x` to `rcp.rn`. The only remaining override is `FastMath.inv_fast(::AbstractFloat)`, which Julia upstream doesn't implement for floats — route through `div_fast` so the pass sees `afn`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
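As a rough CPU-side illustration of the routing described above (not the code from this PR), the sketch below computes a reciprocal through `Base.FastMath.div_fast` and inspects the generated LLVM IR. The helper name `inv_via_div_fast` is hypothetical. The expectation is that the division shows up as `fdiv fast`, and `fast` implies `afn`, the flag the commit message says `PTXFDivFastPass` keys on.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical stand-in for the override described above: compute 1/x via
# div_fast so the resulting fdiv carries fast-math flags (which include afn).
inv_via_div_fast(x::T) where {T<:AbstractFloat} = Base.FastMath.div_fast(one(T), x)

# Capture the optimized LLVM IR for the Float32 method as a string.
ir = sprint(io -> code_llvm(io, inv_via_div_fast, (Float32,)))

println(ir)
println("fast fdiv present: ", occursin("fdiv fast", ir))
```

This only shows what Julia hands to LLVM on the host; the PTX-specific lowering happens later in GPUCompiler.jl and is not exercised here.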

isfinite/isinf/isnan, signbit/copysign/abs, trunc/ceil/floor, fma, and muladd inherit from Julia. Julia emits canonical LLVM ops (`llvm.fabs`, `llvm.floor`, `llvm.copysign`, `llvm.fma`, `fmul contract + fadd contract`, etc.), and the NVPTX backend lowers them to the same single-instruction PTX the libdevice overrides used to produce after inlining. `Base.fma(::Float16,...)` is the lone exception — its `jl_have_fma` runtime call isn't recognized by GPUCompiler's `cpu_features!`, so the branch survives the optimizer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
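The claim that Julia already emits canonical LLVM operations can be checked on the host without a GPU. The following is a minimal sketch, assuming only the standard `InteractiveUtils.code_llvm`; the helper `llvm_ir` is a hypothetical name introduced for the example. It prints the IR for a few of the functions listed above so the `llvm.fabs`, `llvm.floor`, and `llvm.copysign` calls can be seen directly. The exact form printed for `muladd` may differ between Julia versions.

```julia
using InteractiveUtils

# Return the optimized LLVM IR Julia generates for `f` with the given argument types.
llvm_ir(f, types) = sprint(io -> code_llvm(io, f, types))

cases = (
    (abs,      (Float32,)),
    (floor,    (Float32,)),
    (copysign, (Float32, Float32)),
    (muladd,   (Float32, Float32, Float32)),
)

for (f, types) in cases
    println("== ", f, " ==")
    println(llvm_ir(f, types))
end
```

This is host IR, so it says nothing about the final PTX; it only illustrates the input the NVPTX backend would receive once the `@device_override`s are gone.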

Pin down PTX for the ops whose `@device_override`s were dropped — abs, floor/ceil/trunc, isnan/isinf/isfinite/signbit, copysign, min/max — across {f32, f64}, plain vs. `@fastmath` where it matters, and with job-wide `fastmath=true` (which also flips f32 ops to their `.ftz` variants via `apply_fastmath!`'s `denormal-fp-math-f32` attribute). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions

CUDA.jl Benchmarks

Benchmark suite	Current: `986ca42`	Previous: `c0534dd`	Ratio
`array/accumulate/Float32/1d`	`99469` ns	`99381` ns	`1.00`
`array/accumulate/Float32/dims=1`	`74699` ns	`74944` ns	`1.00`
`array/accumulate/Float32/dims=1L`	`1575060` ns	`1574920` ns	`1.00`
`array/accumulate/Float32/dims=2`	`140156` ns	`140278` ns	`1.00`
`array/accumulate/Float32/dims=2L`	`652352` ns	`652596` ns	`1.00`
`array/accumulate/Int64/1d`	`116679` ns	`116586` ns	`1.00`
`array/accumulate/Int64/dims=1`	`78628` ns	`78207` ns	`1.01`
`array/accumulate/Int64/dims=1L`	`1683947` ns	`1682646` ns	`1.00`
`array/accumulate/Int64/dims=2`	`150782` ns	`150838` ns	`1.00`
`array/accumulate/Int64/dims=2L`	`959162` ns	`959136` ns	`1.00`
`array/broadcast`	`20011` ns	`19887` ns	`1.01`
`array/construct`	`1185.9` ns	`1190.3` ns	`1.00`
`array/copy`	`16565` ns	`17076` ns	`0.97`
`array/copyto!/cpu_to_gpu`	`212210` ns	`211721` ns	`1.00`
`array/copyto!/gpu_to_cpu`	`279922` ns	`279606` ns	`1.00`
`array/copyto!/gpu_to_gpu`	`10401` ns	`10658` ns	`0.98`
`array/iteration/findall/bool`	`130793` ns	`131160` ns	`1.00`
`array/iteration/findall/int`	`145151` ns	`145504` ns	`1.00`
`array/iteration/findfirst/bool`	`78686` ns	`78933` ns	`1.00`
`array/iteration/findfirst/int`	`80187` ns	`80313` ns	`1.00`
`array/iteration/findmin/1d`	`63791` ns	`67478` ns	`0.95`
`array/iteration/findmin/2d`	`101586` ns	`111587` ns	`0.91`
`array/iteration/logical`	`187789` ns	`189045` ns	`0.99`
`array/iteration/scalar`	`64529` ns	`65291` ns	`0.99`
`array/permutedims/2d`	`49215` ns	`49578` ns	`0.99`
`array/permutedims/3d`	`49748` ns	`50060` ns	`0.99`
`array/permutedims/4d`	`49496` ns	`49640` ns	`1.00`
`array/random/rand/Float32`	`11940` ns	`11793` ns	`1.01`
`array/random/rand/Int64`	`23373` ns	`23510` ns	`0.99`
`array/random/rand!/Float32`	`8159.333333333333` ns	`8143.666666666667` ns	`1.00`
`array/random/rand!/Int64`	`20269` ns	`20558` ns	`0.99`
`array/random/randn/Float32`	`34507` ns	`35246` ns	`0.98`
`array/random/randn!/Float32`	`24040` ns	`24445` ns	`0.98`
`array/reductions/mapreduce/Float32/1d`	`32769` ns	`32850` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=1`	`37846` ns	`38019` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=1L`	`50060` ns	`50260` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=2`	`55201` ns	`55487` ns	`0.99`
`array/reductions/mapreduce/Float32/dims=2L`	`66587` ns	`66867` ns	`1.00`
`array/reductions/mapreduce/Int64/1d`	`40014` ns	`39021` ns	`1.03`
`array/reductions/mapreduce/Int64/dims=1`	`40604` ns	`41069` ns	`0.99`
`array/reductions/mapreduce/Int64/dims=1L`	`86187` ns	`86620` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2`	`57917` ns	`57870` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2L`	`82557` ns	`82290` ns	`1.00`
`array/reductions/reduce/Float32/1d`	`32651` ns	`32691` ns	`1.00`
`array/reductions/reduce/Float32/dims=1`	`37928` ns	`38016` ns	`1.00`
`array/reductions/reduce/Float32/dims=1L`	`50240` ns	`50354` ns	`1.00`
`array/reductions/reduce/Float32/dims=2`	`55386` ns	`55481` ns	`1.00`
`array/reductions/reduce/Float32/dims=2L`	`66932` ns	`67234` ns	`1.00`
`array/reductions/reduce/Int64/1d`	`39259` ns	`38804` ns	`1.01`
`array/reductions/reduce/Int64/dims=1`	`40346` ns	`40976` ns	`0.98`
`array/reductions/reduce/Int64/dims=1L`	`86048` ns	`86550` ns	`0.99`
`array/reductions/reduce/Int64/dims=2`	`57824` ns	`57935` ns	`1.00`
`array/reductions/reduce/Int64/dims=2L`	`82149` ns	`82488` ns	`1.00`
`array/reverse/1d`	`16626` ns	`16936` ns	`0.98`
`array/reverse/1dL`	`67580` ns	`67830` ns	`1.00`
`array/reverse/1dL_inplace`	`65280` ns	`65354` ns	`1.00`
`array/reverse/1d_inplace`	`8294.333333333334` ns	`10092` ns	`0.82`
`array/reverse/2d`	`19779` ns	`19828` ns	`1.00`
`array/reverse/2dL`	`71532` ns	`71681` ns	`1.00`
`array/reverse/2dL_inplace`	`65121` ns	`65299` ns	`1.00`
`array/reverse/2d_inplace`	`9697` ns	`9999` ns	`0.97`
`array/sorting/1d`	`2714988` ns	`2721914` ns	`1.00`
`array/sorting/2d`	`1064147` ns	`1066145` ns	`1.00`
`array/sorting/by`	`3280513` ns	`3288145` ns	`1.00`
`cuda/synchronization/context/auto`	`1097.2` ns	`1130.5` ns	`0.97`
`cuda/synchronization/context/blocking`	`894.2045454545455` ns	`895.8947368421053` ns	`1.00`
`cuda/synchronization/context/nonblocking`	`5957` ns	`6056.2` ns	`0.98`
`cuda/synchronization/stream/auto`	`968.7894736842105` ns	`976.7333333333333` ns	`0.99`
`cuda/synchronization/stream/blocking`	`791.1923076923077` ns	`793.7916666666666` ns	`1.00`
`cuda/synchronization/stream/nonblocking`	`5850.5` ns	`5977.285714285715` ns	`0.98`
`integration/byval/reference`	`143210` ns	`143365` ns	`1.00`
`integration/byval/slices=1`	`145163` ns	`145406` ns	`1.00`
`integration/byval/slices=2`	`283899` ns	`283824` ns	`1.00`
`integration/byval/slices=3`	`422144` ns	`422343` ns	`1.00`
`integration/cudadevrt`	`101658` ns	`101785` ns	`1.00`
`integration/volumerhs`	`11240669` ns	`11093917` ns	`1.01`
`kernel/indexing`	`12483` ns	`12618` ns	`0.99`
`kernel/indexing_checked`	`13360` ns	`13432` ns	`0.99`
`kernel/launch`	`2116.777777777778` ns	`2074` ns	`1.02`
`kernel/occupancy`	`698.3197278911565` ns	`695.625850340136` ns	`1.00`
`kernel/rand`	`15798` ns	`13974` ns	`1.13`
`latency/import`	`3847309915` ns	`3869060589` ns	`0.99`
`latency/precompile`	`4623850023` ns	`4623502823` ns	`1.00`
`latency/ttfp`	`4492330513` ns	`4510157748` ns	`1.00`

This comment was automatically generated by workflow using github-action-benchmark.

gbaraldi · 2026-05-20T18:19:10Z

+# `llvm.fmuladd` (rather than Julia's default `fmul contract + fadd contract`)
+# keeps the fusion robust under vectorization (per JuliaGPU/CUDA.jl#3149).
+@device_override Base.muladd(x::Float64, y::Float64, z::Float64) =
+    ccall("llvm.fmuladd.f64", llvmcall, Cdouble, (Cdouble, Cdouble, Cdouble), x, y, z)


We should really do this upstream as well. Our `muladd` could technically leak in the way we emit it.
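The `ccall` form quoted in the diff above also works in ordinary host code, which makes it easy to try out. This is a small sketch under that assumption; `fmuladd_f64` is a hypothetical name and is not part of CUDA.jl. It calls the intrinsic directly and then confirms that the call survives into the optimized IR instead of being split into separate multiply and add instructions.

```julia
using InteractiveUtils

# Hypothetical host-side helper mirroring the override quoted above: call the
# llvm.fmuladd intrinsic directly instead of relying on contract flags.
fmuladd_f64(x::Float64, y::Float64, z::Float64) =
    ccall("llvm.fmuladd.f64", llvmcall, Cdouble, (Cdouble, Cdouble, Cdouble), x, y, z)

println(fmuladd_f64(2.0, 3.0, 1.0))  # 2*3 + 1, exactly representable in Float64

ir = sprint(io -> code_llvm(io, fmuladd_f64, (Float64, Float64, Float64)))
println("intrinsic call present: ", occursin("llvm.fmuladd.f64", ir))
```

Whether the backend then emits a fused instruction is up to the target; the intrinsic only permits fusion.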

rsqrt now uses high-level `@fastmath 1/sqrt(x)`; GPUCompiler's new PTXRSqrtFastPass lowers it to `nvvm.rsqrt.approx.{f,d}` directly. Adds a PTX FileCheck test pinning the lowering. Pin `arch=sm"80"` on the min.NaN.f32 / max.NaN.f32 PTX checks so they pass on sm_75 CI runners. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The `@fastmath 1/sqrt(x)` form stamps `fast` (nnan/ninf/...) on the IR operations, which let LLVM DCE caller-side `isnan(rsqrt(x))` and `isinf(rsqrt(x))` checks before our PTXRSqrtFastPass folded the pattern — a behavior regression versus the libdevice path. Direct `ccall` to `llvm.nvvm.rsqrt.approx.{f,d}` is opaque to fast-math reasoning, matches what libdevice itself does (a thin wrapper around the same intrinsic), and produces strictly cleaner IR than libdevice (single rsqrt call + select rather than phi + duplicate call). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
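To see the kind of IR the commit message above is concerned with, the high-level form can be inspected on the host. This is a hedged sketch only: `rsqrt_fastmath` and `rsqrt_is_nan` are hypothetical names, the code runs on the CPU rather than through GPUCompiler.jl, and whether the `isnan` check is actually folded away depends on the LLVM version and pass pipeline in use.

```julia
using InteractiveUtils

# Hypothetical helper: the high-level form that stamps `fast` on the IR operations.
rsqrt_fastmath(x::Float32) = @fastmath 1 / sqrt(x)

# Caller-side guard of the kind the commit message describes.
rsqrt_is_nan(x::Float32) = isnan(rsqrt_fastmath(x))

println(rsqrt_fastmath(4.0f0))  # approximately 0.5

# Inspect the optimized IR: if no floating-point comparison remains, the
# optimizer has used the no-NaN assumption to remove the check.
ir = sprint(io -> code_llvm(io, rsqrt_is_nan, (Float32,)))
println(ir)
```

A direct intrinsic call, as adopted in the final commit, carries no such flags, so the optimizer has no license to drop the caller's check.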

codecov · 2026-05-21T11:13:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.41%. Comparing base (c0534dd) to head (986ca42).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3149      +/-   ##
==========================================
+ Coverage   16.40%   16.41%   +0.01%     
==========================================
  Files         124      124              
  Lines        9827     9827              
==========================================
+ Hits         1612     1613       +1     
+ Misses       8215     8214       -1


vchuravy reviewed May 20, 2026

Comment thread CUDACore/src/device/intrinsics/math.jl Outdated

vchuravy reviewed May 20, 2026

Comment thread CUDACore/src/device/intrinsics/math.jl

vchuravy reviewed May 20, 2026

Comment thread test/core/codegen.jl

maleadt and others added 6 commits May 20, 2026 17:15

Simplify.

30a9b7e

Improve tests.

f518896

Use FileCheck more widely.

7ab5b8c

github-actions Bot reviewed May 20, 2026

maleadt added 2 commits May 20, 2026 17:24

Fix CI.

bc47367

Address review comments.

0b0dc66

maleadt force-pushed the tb/fastmath branch from 2c63939 to 0b0dc66 Compare May 20, 2026 15:42

vchuravy reviewed May 20, 2026

Comment thread CUDACore/src/device/intrinsics/math.jl Outdated

vchuravy reviewed May 20, 2026

Comment thread CUDACore/src/device/intrinsics/math.jl

vchuravy approved these changes May 20, 2026

gbaraldi reviewed May 20, 2026

vchuravy mentioned this pull request May 20, 2026

Emit llvm.fmuladd again for muladd intrinsics JuliaLang/julia#61865

Open

maleadt mentioned this pull request May 21, 2026

PTX: add PTXRSqrtFastPass to fold afn 1/sqrt(x) to nvvm.rsqrt.approx JuliaGPU/GPUCompiler.jl#807

Merged

maleadt force-pushed the tb/fastmath branch from b0e3570 to 986ca42 Compare May 21, 2026 09:23

maleadt merged commit 2fe75d6 into main May 21, 2026
2 checks passed

maleadt deleted the tb/fastmath branch May 21, 2026 11:18

maleadt merged 10 commits into main from tb/fastmath

3 participants
