Switch to the new GPUArrays RNG. #767

Merged: maleadt merged 1 commit into main from tb/rng on Apr 16, 2026

@maleadt maleadt commented Apr 15, 2026

Adapts to JuliaGPU/GPUArrays.jl#707. The new RNG is generally faster than the MPS one (and much faster than the native one), so switch over for all operations.

Analysis by 🤖 below.

Performance comparison

Apple M-series GPU, Metal.@sync timing, minimum of 30 samples per call.
"ratio" is MPS / GPUArrays — values below 1.0 mean MPS is faster.

Uniform rand!

| type | n | ratio | GPUArrays | MPS | Kernel | GA GB/s | MPS GB/s | K GB/s |
|---|---|---|---|---|---|---|---|---|
| Float16 | 1024 | | 138.7 μs | | 133.6 μs | 0.01 | | 0.02 |
| Float16 | 1048576 | | 250.4 μs | | 407.9 μs | 8.37 | | 5.14 |
| Float16 | 16777216 | | 464.2 μs | | 1.70 ms | 72.28 | | 19.70 |
| Float32 | 1024 | 1.08× | 138.7 μs | 150.0 μs | 128.9 μs | 0.03 | 0.03 | 0.03 |
| Float32 | 1048576 | 1.22× | 276.4 μs | 336.5 μs | 437.6 μs | 15.17 | 12.46 | 9.59 |
| Float32 | 16777216 | 1.32× | 600.6 μs | 792.5 μs | 1.70 ms | 111.73 | 84.68 | 39.51 |
| UInt8 | 1024 | 1.06× | 135.4 μs | 143.2 μs | 134.0 μs | 0.01 | 0.01 | 0.01 |
| UInt8 | 1048576 | 0.88× | 225.5 μs | 199.4 μs | 402.4 μs | 4.65 | 5.26 | 2.61 |
| UInt8 | 16777216 | 1.09× | 456.6 μs | 496.5 μs | 1.69 ms | 36.75 | 33.79 | 9.90 |
| Int8 | 1024 | 1.28× | 125.5 μs | 160.3 μs | 133.5 μs | 0.01 | 0.01 | 0.01 |
| Int8 | 1048576 | 0.85× | 268.8 μs | 227.6 μs | 344.8 μs | 3.90 | 4.61 | 3.04 |
| Int8 | 16777216 | 0.67× | 493.5 μs | 329.9 μs | 1.68 ms | 33.99 | 50.86 | 9.96 |
| UInt16 | 1024 | 0.98× | 141.4 μs | 139.2 μs | 144.4 μs | 0.01 | 0.01 | 0.01 |
| UInt16 | 1048576 | 1.09× | 257.8 μs | 281.5 μs | 428.0 μs | 8.13 | 7.45 | 4.90 |
| UInt16 | 16777216 | 0.77× | 620.2 μs | 475.2 μs | 1.76 ms | 54.10 | 70.62 | 19.04 |
| Int16 | 1024 | 0.81× | 166.5 μs | 134.6 μs | 157.3 μs | 0.01 | 0.02 | 0.01 |
| Int16 | 1048576 | 0.99× | 282.3 μs | 279.4 μs | 445.7 μs | 7.43 | 7.51 | 4.71 |
| Int16 | 16777216 | 0.73× | 612.5 μs | 448.1 μs | 1.69 ms | 54.78 | 74.88 | 19.86 |
| UInt32 | 1024 | 1.06× | 135.4 μs | 143.9 μs | 137.2 μs | 0.03 | 0.03 | 0.03 |
| UInt32 | 1048576 | 1.92× | 175.5 μs | 337.1 μs | 435.6 μs | 23.90 | 12.44 | 9.63 |
| UInt32 | 16777216 | 1.36× | 678.2 μs | 922.2 μs | 1.72 ms | 98.96 | 72.77 | 38.93 |
| Int32 | 1024 | 1.00× | 175.0 μs | 175.5 μs | 168.5 μs | 0.02 | 0.02 | 0.02 |
| Int32 | 1048576 | 1.19× | 288.9 μs | 342.5 μs | 451.4 μs | 14.52 | 12.25 | 9.29 |
| Int32 | 16777216 | 1.25× | 640.1 μs | 802.3 μs | 1.76 ms | 104.84 | 83.64 | 38.22 |
| UInt64 | 1024 | 1.03× | 174.9 μs | 180.6 μs | 161.8 μs | 0.05 | 0.05 | 0.05 |
| UInt64 | 1048576 | 1.33× | 394.3 μs | 524.9 μs | 442.4 μs | 21.28 | 15.98 | 18.96 |
| UInt64 | 16777216 | 1.27× | 1.14 ms | 1.45 ms | 1.76 ms | 117.64 | 92.55 | 76.09 |
| Int64 | 1024 | 0.95× | 147.6 μs | 140.7 μs | 133.8 μs | 0.06 | 0.06 | 0.06 |
| Int64 | 1048576 | 1.31× | 372.4 μs | 488.9 μs | 451.3 μs | 22.53 | 17.16 | 18.59 |
| Int64 | 16777216 | 1.22× | 1.18 ms | 1.45 ms | 1.73 ms | 113.56 | 92.77 | 77.58 |
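
The ratio and GB/s columns can be reproduced from the timings alone. The sketch below is plain Julia and needs no GPU; the helper names `bandwidth_gbs` and `mps_over_gpuarrays` are made up for illustration, and it assumes the bandwidth figures count only the bytes written to the output array, with 1 GB = 1e9 bytes. Under that assumption it matches the Float32, n = 16777216 row up to rounding of the printed times.

```julia
# Effective bandwidth in GB/s for filling `n` elements of type `T` in `seconds`.
# Assumption: only the output bytes are counted, and 1 GB = 1e9 bytes.
bandwidth_gbs(::Type{T}, n::Integer, seconds::Real) where {T} =
    n * sizeof(T) / seconds / 1e9

# Ratio as defined for the tables: MPS time divided by GPUArrays time,
# so values above 1.0 mean GPUArrays is faster.
mps_over_gpuarrays(t_mps::Real, t_gpuarrays::Real) = t_mps / t_gpuarrays

# Float32, n = 16777216 row of the uniform rand! table.
n     = 16_777_216
t_ga  = 600.6e-6   # GPUArrays time in seconds
t_mps = 792.5e-6   # MPS time in seconds

ga_bw  = bandwidth_gbs(Float32, n, t_ga)
mps_bw = bandwidth_gbs(Float32, n, t_mps)
ratio  = mps_over_gpuarrays(t_mps, t_ga)

println("GA GB/s:  ", round(ga_bw; digits = 2))
println("MPS GB/s: ", round(mps_bw; digits = 2))
println("ratio:    ", round(ratio; digits = 2), "×")
```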

Normal randn!

| type | n | ratio | GPUArrays | MPS | Kernel | GA GB/s | MPS GB/s | K GB/s |
|---|---|---|---|---|---|---|---|---|
| Float16 | 1024 | | 148.2 μs | | 138.4 μs | 0.01 | | 0.01 |
| Float16 | 1048576 | | 280.7 μs | | 327.5 μs | 7.47 | | 6.40 |
| Float16 | 16777216 | | 583.0 μs | | 948.3 μs | 57.55 | | 35.38 |
| Float32 | 1024 | 1.03× | 142.7 μs | 146.8 μs | 135.3 μs | 0.03 | 0.03 | 0.03 |
| Float32 | 1048576 | 1.38× | 298.7 μs | 411.1 μs | 372.0 μs | 14.04 | 10.20 | 11.27 |
| Float32 | 16777216 | 1.50× | 658.4 μs | 986.8 μs | 936.2 μs | 101.93 | 68.01 | 71.68 |
| ComplexF32 | 1024 | | 150.2 μs | | 175.0 μs | 0.05 | | 0.05 |
| ComplexF32 | 1048576 | | 430.7 μs | | 542.4 μs | 19.48 | | 15.47 |
| ComplexF32 | 16777216 | | 1.09 ms | | 1.45 ms | 123.49 | | 92.79 |

Summary

Looking at the 16M-element results (where launch overhead is negligible):

  • For rand!, GPUArrays wins on every type that both GPUArrays and MPS support,
    except Int8, Int16 and UInt16, where MPS is ~1.3–1.5× faster; even there the
    absolute gap is sub-millisecond.
  • GPUArrays is faster for randn! on Float32, the only type both support: 1.50×
    at 16M elements and 1.03–1.38× at the smaller sizes, on top of fixing
    the NaN bug.
  • GPUArrays handles types MPS can't: Float16, all complex types, Int128/UInt128.
  • KernelRNG is consistently slower than the GPUArrays RNG: about 2.5–3.7× for
    rand! on types up to 32 bits, and about 1.3–1.6× for 64-bit integers and randn!.

Routing all in-place and out-of-place calls through GPUArrays simplifies the code
(no per-type dispatch table) and makes Metal's random-number path consistent with
CUDA.jl. MPS stays available behind Metal.mps_rng() for users who want it.
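
The 16M-element comparison can be recomputed directly from the rand! timings in the uniform table. This is a minimal sketch in plain Julia (no GPU needed): the timings are copied from the n = 16777216 rows, and `times_us` and `mps_wins` are illustrative names. It prints the MPS / GPUArrays ratio per type and lists the types where MPS finishes first.

```julia
# (GPUArrays, MPS) times in microseconds for rand! at n = 16777216,
# copied from the uniform rand! table. Float16 is omitted because the
# table has no MPS timing for it.
times_us = [
    "Float32" => (600.6, 792.5),
    "UInt8"   => (456.6, 496.5),
    "Int8"    => (493.5, 329.9),
    "UInt16"  => (620.2, 475.2),
    "Int16"   => (612.5, 448.1),
    "UInt32"  => (678.2, 922.2),
    "Int32"   => (640.1, 802.3),
    "UInt64"  => (1140.0, 1450.0),
    "Int64"   => (1180.0, 1450.0),
]

# Types where MPS is faster, i.e. the MPS / GPUArrays ratio is below 1.0.
mps_wins = [name for (name, (ga, mps)) in times_us if mps / ga < 1.0]

for (name, (ga, mps)) in times_us
    println(rpad(name, 8), round(mps / ga; digits = 2), "×")
end
println("MPS faster for: ", join(mps_wins, ", "))
```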

codecov Bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 27.77778% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.70%. Comparing base (b94fd4b) to head (d1047e1).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/random.jl | 27.77% | 39 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #767      +/-   ##
==========================================
- Coverage   83.09%   80.70%   -2.39%     
==========================================
  Files          62       61       -1     
  Lines        2851     2846       -5     
==========================================
- Hits         2369     2297      -72     
- Misses        482      549      +67     

@github-actions github-actions Bot left a comment

Metal Benchmarks

Benchmark suite Current: d1047e1 Previous: b94fd4b Ratio
array/accumulate/Float32/1d 1119417 ns 1098958 ns 1.02
array/accumulate/Float32/dims=1 1563917 ns 1554708 ns 1.01
array/accumulate/Float32/dims=1L 9845771 ns 9848583.5 ns 1.00
array/accumulate/Float32/dims=2 1877792 ns 1886771 ns 1.00
array/accumulate/Float32/dims=2L 7236750 ns 7256459 ns 1.00
array/accumulate/Int64/1d 1245896 ns 1261958 ns 0.99
array/accumulate/Int64/dims=1 1841479 ns 1824291.5 ns 1.01
array/accumulate/Int64/dims=1L 11601292 ns 11664208.5 ns 0.99
array/accumulate/Int64/dims=2 2165917 ns 2170333.5 ns 1.00
array/accumulate/Int64/dims=2L 9751208 ns 10120062.5 ns 0.96
array/broadcast 608708 ns 605916 ns 1.00
array/construct 6334 ns 6292 ns 1.01
array/permutedims/2d 1170333 ns 1168125 ns 1.00
array/permutedims/3d 1677687 ns 1673084 ns 1.00
array/permutedims/4d 2388812.5 ns 2365959 ns 1.01
array/private/copy 565958 ns 545792 ns 1.04
array/private/copyto!/cpu_to_gpu 809375 ns 802916 ns 1.01
array/private/copyto!/gpu_to_cpu 811416 ns 801917 ns 1.01
array/private/copyto!/gpu_to_gpu 636333 ns 634458 ns 1.00
array/private/iteration/findall/bool 1413167 ns 1402750 ns 1.01
array/private/iteration/findall/int 1561625 ns 1564021 ns 1.00
array/private/iteration/findfirst/bool 2040000 ns 2055916 ns 0.99
array/private/iteration/findfirst/int 2066708 ns 2064479.5 ns 1.00
array/private/iteration/findmin/1d 2491959 ns 2499959 ns 1.00
array/private/iteration/findmin/2d 1775917 ns 1790791 ns 0.99
array/private/iteration/logical 2656688 ns 2631896 ns 1.01
array/private/iteration/scalar 4529875 ns 5047625 ns 0.90
array/random/rand/Float32 1164792 ns 582958 ns 2.00
array/random/rand/Int64 1326750 ns 775667 ns 1.71
array/random/rand!/Float32 924250 ns 574750 ns 1.61
array/random/rand!/Int64 874750 ns 550792 ns 1.59
array/random/randn/Float32 1068458.5 ns 1006937.5 ns 1.06
array/random/randn!/Float32 820146 ns 755666 ns 1.09
array/reductions/mapreduce/Float32/1d 1033375 ns 1029500 ns 1.00
array/reductions/mapreduce/Float32/dims=1 835917 ns 840875 ns 0.99
array/reductions/mapreduce/Float32/dims=1L 1618958.5 ns 1324000 ns 1.22
array/reductions/mapreduce/Float32/dims=2 805354 ns 860875 ns 0.94
array/reductions/mapreduce/Float32/dims=2L 1818229.5 ns 1799541 ns 1.01
array/reductions/mapreduce/Int64/1d 1552833 ns 1374875 ns 1.13
array/reductions/mapreduce/Int64/dims=1 1105583 ns 1097625 ns 1.01
array/reductions/mapreduce/Int64/dims=1L 2027500 ns 2002854 ns 1.01
array/reductions/mapreduce/Int64/dims=2 1177312 ns 1145000 ns 1.03
array/reductions/mapreduce/Int64/dims=2L 3619958 ns 3614000 ns 1.00
array/reductions/reduce/Float32/1d 1028125 ns 1028437.5 ns 1.00
array/reductions/reduce/Float32/dims=1 831125 ns 832667 ns 1.00
array/reductions/reduce/Float32/dims=1L 1316542 ns 1318416.5 ns 1.00
array/reductions/reduce/Float32/dims=2 853563 ns 853041.5 ns 1.00
array/reductions/reduce/Float32/dims=2L 1814042 ns 1810250 ns 1.00
array/reductions/reduce/Int64/1d 1496937.5 ns 1516958 ns 0.99
array/reductions/reduce/Int64/dims=1 1116833 ns 1095375 ns 1.02
array/reductions/reduce/Int64/dims=1L 2011937.5 ns 2023499.5 ns 0.99
array/reductions/reduce/Int64/dims=2 1157459 ns 1240750 ns 0.93
array/reductions/reduce/Int64/dims=2L 4242729 ns 4233875 ns 1.00
array/shared/copy 242125 ns 252417 ns 0.96
array/shared/copyto!/cpu_to_gpu 81750 ns 80750 ns 1.01
array/shared/copyto!/gpu_to_cpu 81709 ns 80667 ns 1.01
array/shared/copyto!/gpu_to_gpu 82584 ns 83083 ns 0.99
array/shared/iteration/findall/bool 1421500 ns 1427208.5 ns 1.00
array/shared/iteration/findall/int 1558458 ns 1559875 ns 1.00
array/shared/iteration/findfirst/bool 1622000 ns 1649000 ns 0.98
array/shared/iteration/findfirst/int 1635500 ns 1672458 ns 0.98
array/shared/iteration/findmin/1d 2093458 ns 2115583 ns 0.99
array/shared/iteration/findmin/2d 1783604.5 ns 1792625 ns 0.99
array/shared/iteration/logical 2427709 ns 2292167 ns 1.06
array/shared/iteration/scalar 201250 ns 199958 ns 1.01
integration/byval/reference 1582500 ns 1544250 ns 1.02
integration/byval/slices=1 1588625 ns 1560229.5 ns 1.02
integration/byval/slices=2 2614875 ns 2598333.5 ns 1.01
integration/byval/slices=3 7820666.5 ns 8092333 ns 0.97
integration/metaldevrt 878645.5 ns 868125 ns 1.01
kernel/indexing 621958 ns 592667 ns 1.05
kernel/indexing_checked 628125 ns 598292 ns 1.05
kernel/launch 11584 ns 11791.5 ns 0.98
kernel/rand 567917 ns 570709 ns 1.00
latency/import 1417687042 ns 1425597062.5 ns 0.99
latency/precompile 25459122750 ns 25453724708 ns 1.00
latency/ttfp 2335347354.5 ns 2341177208 ns 1.00
metal/synchronization/context 19792 ns 19667 ns 1.01
metal/synchronization/stream 18833 ns 18459 ns 1.02

This comment was automatically generated by workflow using github-action-benchmark.

@maleadt maleadt merged commit 65fac52 into main Apr 16, 2026
16 checks passed
@maleadt maleadt deleted the tb/rng branch April 16, 2026 11:49