Replace Tausworthe+LCG RNG with stateless Philox4x32-10 by maleadt · Pull Request #707 · JuliaGPU/GPUArrays.jl

maleadt · 2026-04-14T19:34:29Z

Replace the hybrid Tausworthe+LCG generator with a stateless Philox4x32-10 counter-based RNG. Use @fastmath log + sincospi for Box-Muller randn!, letting each GPU backend provide optimized intrinsics via @device_override.

The generic rand! uses an immutable ElementRNG struct that implements the minimal AbstractRNG interface (just rand for UInt64 + rng_native_52). Julia's Random stdlib handles all type conversions (Float16, Complex, Bool, integers) automatically. The subcounter state lives in a stack-allocated Ref with a pointer in the struct, avoiding mutable struct heap allocation on GPU.

This was prompted by JuliaGPU/CUDA.jl#3056 / inspired on https://github.com/medyan-dev/PhiloxRNG.jl

Performance is much better than CUDA.jl's NativeRNG, so I'll probably deprecate that in favor of this.

cc @nhz2

@fastmath

Replace the hybrid Tausworthe+LCG generator with a stateless Philox4x32-10 counter-based RNG using KernelAbstractions, portable across GPU backends. rand! uses batched kernels for common types (Float16/32/64, integers, Bool, Complex) that produce multiple values per Philox call. A generic ElementRNG fallback handles all other types via Julia's Random stdlib dispatch. randn! uses specialized Box-Muller kernels with @fastmath log + sincospi, letting each backend provide optimized intrinsics via @device_override. These can't be made generic like rand! because (1) Random's Ziggurat needs table lookups requiring device overlays, (2) the generic Marsaglia polar fallback has a rejection loop causing warp divergence, and (3) direct Box-Muller produces value pairs that must be batched to stay bandwidth-bound. The old RNG(state::AbstractGPUArray) constructor and seed!(rng, ::Vector) are kept as deprecated methods for backwards compatibility with existing backends. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously used fldmod with typemax(UInt32) = 2^32 - 1, which wraps the counter to 1 (not 0) when it saturates. Not a correctness issue today since we only advance by 1, but the logic was confusing. Replace with a plain increment that lets UInt32 overflow naturally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously the seed was UInt32 and the Philox key's upper lane was always zero, so different 64-bit seeds colliding mod 2^32 produced identical streams. Store the seed as UInt64 and split both halves into the key's two lanes via a new philox_key helper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Construct the Float64 bit-pattern directly instead of calling the expensive Float64(::UInt64) conversion and a subsequent fma. On consumer GPUs with 1:64 FP64 throughput this is a ~40x speedup for rand!(F64) on a 100M-element array (0.9 vs 38 ms on an RTX 5080, now memory-bound). Force the low mantissa bit so the result is strictly in (0, 1), which is required by Box-Muller's log(u). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the four randn_*_kernel!'s with a single randn_batched_kernel! mirroring rand_batched_kernel!, parameterized on the element type via a philox_to_normals helper (analogous to philox_to_vals). Same performance, half the kernel code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vals_per_call already returned 1 for 16-byte types; add the corresponding philox_to_vals method and extend the union so rand! dispatches to the batched kernel rather than the generic fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

nhz2 · 2026-04-15T05:13:49Z

For randn it would be nice to avoid returning exactly zero (though not strictly needed). This is handled by the fixed point math in https://github.com/medyan-dev/PhiloxRNG.jl/blob/main/src/fastsincospi.jl and https://github.com/medyan-dev/PhiloxRNG.jl/blob/main/src/fastlog.jl

The fastsincospi also has a nice exact 8 way symmetry around the unit circle.

These custom functions seem fast on my RTX 3080 with Float32, but maybe in other cases the intrinsics are faster? I haven't looked much into Float64 or other hardware yet.

Both variants had the same (::Type{T}, u1, u2) signature with different output types; dispatch on Complex{T} vs T instead of using a separate name, and keep both methods next to each other at the top of the file. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Allowed any AbstractRNG to fill a GPUArray by generating values on the CPU and copying them over. That hides accidentally-slow code paths; let unsupported (rng, eltype) combinations fail explicitly via MethodError instead. Also drop the matching cpu_rng exercise from the test suite. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Mirrors the rand! structure: batched kernel for Float16/32/64 and their complex variants, generic ElementRNG-based kernel for every other AbstractFloat / Complex{<:AbstractFloat}. Non-batched types go through Random's generic polar Box-Muller fallback, which is GPU-safe (ziggurat is never reached because its signatures cover only Float16/32/64, which dispatch to the batched kernel). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- u01 no longer claims "(0, 1]"; the Float64 path is strictly (0, 1). - The generic rand! fallback comment listed types that have since been moved into BatchedRandTypes. - The randn! intro claimed randn! couldn't use ElementRNG; there's now a generic ElementRNG fallback too. Reframe the three-point rationale as "why the batched path exists" rather than "why there's no generic path". - Clarify the per-call batching table to cover Int128/UInt128 and Complex{Float32} explicitly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ElementRNG and its Random stdlib dispatch methods are shared between the generic rand! and randn! fallbacks; hoist them above both kernels rather than living inside the "Generic rand! fallback" section. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@inline

These @inline overrides of the generic `rand(::AbstractRNG, ::Type{T})` and `SamplerType{Complex{T}}` methods were needed when ElementRNG was a mutable struct, to keep the dispatch chain transparent enough for the Ref-to-ptr trick to stack-allocate. With ElementRNG immutable, empirical checks (LLVM inspection, BFloat16 rand and Complex{Float16} randn benchmarks) show identical generated code and identical performance whether the overrides are present or not. Base.Random defines both methods with identical bodies, so nothing is lost by removing them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Base.sincospi(::Float32) widens to Float64 internally (see base/special/trig.jl:744 `sinpi_kernel_wide`), which breaks Metal (no Float64 support) and forces FP64 emulation on other backends that hit that path. Vendor PhiloxRNG.jl's minimax polynomial: entirely Float32, consumes the Philox UInt32 output directly. Rewrite the ≤32-bit boxmuller methods to take UInt32 arguments and update philox_to_normals to match. Float64 keeps using Base.sincospi (backends with FP64 have intrinsics, and the polynomial alternative is ~8x slower on consumer GPUs — see the follow-up log commit for measurements). Note: Base.log(::Float32) has the same FP64-widening problem and is fixed in the next commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Same Metal problem as sincospi in the previous commit: Base.log(::Float32) widens to Float64 internally (see base/special/log.jl:242, `Float64(2f0*f)/(2.0+f)`), which breaks Metal and forces FP64 emulation on other backends that end up on that path. Vendor PhiloxRNG.jl's Float32 polynomial log (itself ported from fdlibm's e_logf.c). Fold the u01 conversion into the first fma so the ≤32-bit boxmuller never touches Float64. Float64 still uses Base.log_fast — backends with FP64 support have intrinsics for it, and the polynomial alternative is ~8x slower on consumer GPUs (benchmarked: randn! F64 drops from ~97 GB/s with intrinsics to ~12 GB/s with polynomials on an RTX 5080). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Base's randn(rng, ::BitFloatType) is the ziggurat path that reads global wi/ki/fi tables, which aren't device-accessible — and the Float64 tables can't even be loaded on Metal. Direct randn!(rng, A) for Float16/32/64 dispatches to the batched kernel so this seemed dormant, but Base's randn(rng, Complex{T}) recurses into randn(rng, T): a randn!(rng, A::GPUArray{Complex{Float16}}) call therefore hit ziggurat via the generic path. Override randn(rng::ElementRNG, ::Type{Float{16,32,64}}) to use our Box-Muller directly. Covers any current or future caller that recurses through these types. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

maleadt · 2026-04-15T07:00:40Z

Yeah I switched back to your versions because sincospi(::Float32) uses Float64, which is incompatible with some of our back-ends.

FYI, Claude also came up with a trick to improve Float64 performance:

Use bit-pattern trick for u01(Float64, UInt64) — eliminates the
Float64(::UInt64) conversion + fma that hit the 1:64 FP64 wall on
consumer GPUs. ~40× speedup for rand!(F64).

Final performance now:

Benchmark	Tausworthe (old)	PhiloxRNG (raw)	cuRAND	this PR
rand! F32	138	915	270	912
rand! F64	264^†	21	—	911
randn! F32	170	916	548	912
randn! F64	56	12	57	97

^† old Tausworthe F64 uses only 32 bits of entropy per value.

Passes SmallCrush, but I haven't run BigCrush yet.

These are internal names but not meant as private underscore-prefixed Julia convention; the leading _ was just carried over from PhiloxRNG. Rename to fast_log / fast_sincospi / SP_F32 / CP_F32 / SQRT_HALF_I32 / LOG_ODD_F32 / LOG_EVEN_F32. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> [ci skip]

`GPUArrays.RNG{AT}` now carries the destination array type, so out-of-place `rand(rng, T, dims)` / `randn(rng, T, dims)` allocate an `AT{T}` instead of falling back to Random stdlib's CPU `Array` constructor. The fill paths (`Random.rand!` / `Random.randn!` on `AnyGPUArray`) ignore `AT` and dispatch on the destination's KernelAbstractions backend, so a single RNG can fill any GPU array. A new CPU-array fallback uses `AT` to allocate a temporary, fill it on-device, then `copyto!` the destination — prevents Random's scalar-iter fallback when filling a `Vector` from a GPU RNG. Removes the `default_rng(::Type{<:AnyGPUArray})` interface: it was only called by the testsuite (now uses `GPUArrays.RNG{AT}()` directly) and by the per-backend shims (now redundant — the type parameter replaces them). JLArrays' GLOBAL_RNG ref + override go away.

similar is the documented Base interface for type-driven array construction (base/abstractarray.jl:873), and its default implementation calls T(undef, dims). Lets array types that override similar(::Type{T}, ::Dims) hook in cleanly.

Metal.jl and other backends still extend GPUArrays.default_rng even though the interface is no longer called. Without a function declaration in GPUArrays, those extension definitions fail to parse with UndefVarError. Empty function stub is enough for the transition. Also fix the RNG(state::AbstractGPUArray) deprecation: it used to call RNG() (zero-arg) which no longer exists. Now extracts AT from the state array's type and calls RNG{AT}().

Without this, e.g. `randn!(Random.TaskLocalRNG(), ::CuArray)` falls through to the stdlib scalar setindex! path. Regression from #707.

maleadt force-pushed the tb/philox_rng branch from aed6b9b to a9db6b6 Compare April 14, 2026 20:16

maleadt and others added 5 commits April 15, 2026 07:07

maleadt and others added 9 commits April 15, 2026 07:14

maleadt mentioned this pull request Apr 15, 2026

Override ^(Float, Int64) to avoid Float64 widening JuliaGPU/OpenCL.jl#428

Merged

maleadt added 3 commits April 15, 2026 10:40

maleadt marked this pull request as ready for review April 15, 2026 09:16

maleadt merged commit 5985115 into master Apr 15, 2026
18 checks passed

maleadt deleted the tb/philox_rng branch April 15, 2026 10:20

maleadt mentioned this pull request Apr 15, 2026

Switch to the new GPUArrays RNG. JuliaGPU/Metal.jl#767

Merged

maleadt added a commit that referenced this pull request Apr 16, 2026

Restore CPU-RNG to GPU-array fallback for rand!/randn!.

64c8c61

Without this, e.g. `randn!(Random.TaskLocalRNG(), ::CuArray)` falls through to the stdlib scalar setindex! path. Regression from #707.

This was referenced Apr 16, 2026

Restore CPU-RNG to GPU-array fallback for rand!/randn!. #710

Merged

On-device rand! JuliaGPU/AcceleratedKernels.jl#80

Open

maleadt added a commit that referenced this pull request Apr 16, 2026

Restore CPU-RNG to GPU-array fallback for rand!/randn!. (#710)

1c60b59

Without this, e.g. `randn!(Random.TaskLocalRNG(), ::CuArray)` falls through to the stdlib scalar setindex! path. Regression from #707.

nhz2 mentioned this pull request Apr 19, 2026

Use an extra bit of entropy in the Float64 uniform 0-1 distribution. #712

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace Tausworthe+LCG RNG with stateless Philox4x32-10#707

Replace Tausworthe+LCG RNG with stateless Philox4x32-10#707
maleadt merged 19 commits into
masterfrom
tb/philox_rng

maleadt commented Apr 14, 2026

Uh oh!

nhz2 commented Apr 15, 2026

Uh oh!

maleadt commented Apr 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

maleadt commented Apr 14, 2026

Uh oh!

nhz2 commented Apr 15, 2026

Uh oh!

maleadt commented Apr 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants