Replace Tausworthe+LCG RNG with stateless Philox4x32-10#707
Conversation
Replace the hybrid Tausworthe+LCG generator with a stateless Philox4x32-10 counter-based RNG using KernelAbstractions, portable across GPU backends. rand! uses batched kernels for common types (Float16/32/64, integers, Bool, Complex) that produce multiple values per Philox call. A generic ElementRNG fallback handles all other types via Julia's Random stdlib dispatch. randn! uses specialized Box-Muller kernels with @fastmath log + sincospi, letting each backend provide optimized intrinsics via @device_override. These can't be made generic like rand! because (1) Random's Ziggurat needs table lookups requiring device overlays, (2) the generic Marsaglia polar fallback has a rejection loop causing warp divergence, and (3) direct Box-Muller produces value pairs that must be batched to stay bandwidth-bound. The old RNG(state::AbstractGPUArray) constructor and seed!(rng, ::Vector) are kept as deprecated methods for backwards compatibility with existing backends. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously used fldmod with typemax(UInt32) = 2^32 - 1, which wraps the counter to 1 (not 0) when it saturates. Not a correctness issue today since we only advance by 1, but the logic was confusing. Replace with a plain increment that lets UInt32 overflow naturally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously the seed was UInt32 and the Philox key's upper lane was always zero, so different 64-bit seeds colliding mod 2^32 produced identical streams. Store the seed as UInt64 and split both halves into the key's two lanes via a new philox_key helper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Construct the Float64 bit-pattern directly instead of calling the expensive Float64(::UInt64) conversion and a subsequent fma. On consumer GPUs with 1:64 FP64 throughput this is a ~40x speedup for rand!(F64) on a 100M-element array (0.9 vs 38 ms on an RTX 5080, now memory-bound). Force the low mantissa bit so the result is strictly in (0, 1), which is required by Box-Muller's log(u). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the four randn_*_kernel!'s with a single randn_batched_kernel! mirroring rand_batched_kernel!, parameterized on the element type via a philox_to_normals helper (analogous to philox_to_vals). Same performance, half the kernel code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vals_per_call already returned 1 for 16-byte types; add the corresponding philox_to_vals method and extend the union so rand! dispatches to the batched kernel rather than the generic fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
For The fastsincospi also has a nice exact 8 way symmetry around the unit circle. These custom functions seem fast on my RTX 3080 with Float32, but maybe in other cases the intrinsics are faster? I haven't looked much into Float64 or other hardware yet. |
Both variants had the same (::Type{T}, u1, u2) signature with different
output types; dispatch on Complex{T} vs T instead of using a separate
name, and keep both methods next to each other at the top of the file.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allowed any AbstractRNG to fill a GPUArray by generating values on the CPU and copying them over. That hides accidentally-slow code paths; let unsupported (rng, eltype) combinations fail explicitly via MethodError instead. Also drop the matching cpu_rng exercise from the test suite. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mirrors the rand! structure: batched kernel for Float16/32/64 and their
complex variants, generic ElementRNG-based kernel for every other
AbstractFloat / Complex{<:AbstractFloat}. Non-batched types go through
Random's generic polar Box-Muller fallback, which is GPU-safe (ziggurat
is never reached because its signatures cover only Float16/32/64, which
dispatch to the batched kernel).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- u01 no longer claims "(0, 1]"; the Float64 path is strictly (0, 1).
- The generic rand! fallback comment listed types that have since been
moved into BatchedRandTypes.
- The randn! intro claimed randn! couldn't use ElementRNG; there's now a
generic ElementRNG fallback too. Reframe the three-point rationale as
"why the batched path exists" rather than "why there's no generic path".
- Clarify the per-call batching table to cover Int128/UInt128 and
Complex{Float32} explicitly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ElementRNG and its Random stdlib dispatch methods are shared between the generic rand! and randn! fallbacks; hoist them above both kernels rather than living inside the "Generic rand! fallback" section. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These @inline overrides of the generic `rand(::AbstractRNG, ::Type{T})` and `SamplerType{Complex{T}}` methods were needed when ElementRNG was a mutable struct, to keep the dispatch chain transparent enough for the Ref-to-ptr trick to stack-allocate. With ElementRNG immutable, empirical checks (LLVM inspection, BFloat16 rand and Complex{Float16} randn benchmarks) show identical generated code and identical performance whether the overrides are present or not. Base.Random defines both methods with identical bodies, so nothing is lost by removing them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Base.sincospi(::Float32) widens to Float64 internally (see base/special/trig.jl:744 `sinpi_kernel_wide`), which breaks Metal (no Float64 support) and forces FP64 emulation on other backends that hit that path. Vendor PhiloxRNG.jl's minimax polynomial: entirely Float32, consumes the Philox UInt32 output directly. Rewrite the ≤32-bit boxmuller methods to take UInt32 arguments and update philox_to_normals to match. Float64 keeps using Base.sincospi (backends with FP64 have intrinsics, and the polynomial alternative is ~8x slower on consumer GPUs — see the follow-up log commit for measurements). Note: Base.log(::Float32) has the same FP64-widening problem and is fixed in the next commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same Metal problem as sincospi in the previous commit: Base.log(::Float32) widens to Float64 internally (see base/special/log.jl:242, `Float64(2f0*f)/(2.0+f)`), which breaks Metal and forces FP64 emulation on other backends that end up on that path. Vendor PhiloxRNG.jl's Float32 polynomial log (itself ported from fdlibm's e_logf.c). Fold the u01 conversion into the first fma so the ≤32-bit boxmuller never touches Float64. Float64 still uses Base.log_fast — backends with FP64 support have intrinsics for it, and the polynomial alternative is ~8x slower on consumer GPUs (benchmarked: randn! F64 drops from ~97 GB/s with intrinsics to ~12 GB/s with polynomials on an RTX 5080). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Base's randn(rng, ::BitFloatType) is the ziggurat path that reads
global wi/ki/fi tables, which aren't device-accessible — and the Float64
tables can't even be loaded on Metal. Direct randn!(rng, A) for
Float16/32/64 dispatches to the batched kernel so this seemed dormant,
but Base's randn(rng, Complex{T}) recurses into randn(rng, T): a
randn!(rng, A::GPUArray{Complex{Float16}}) call therefore hit ziggurat
via the generic path.
Override randn(rng::ElementRNG, ::Type{Float{16,32,64}}) to use our
Box-Muller directly. Covers any current or future caller that recurses
through these types.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Yeah I switched back to your versions because FYI, Claude also came up with a trick to improve Float64 performance:
Final performance now:
† old Tausworthe F64 uses only 32 bits of entropy per value. Passes SmallCrush, but I haven't run BigCrush yet. |
These are internal names but not meant as private underscore-prefixed Julia convention; the leading _ was just carried over from PhiloxRNG. Rename to fast_log / fast_sincospi / SP_F32 / CP_F32 / SQRT_HALF_I32 / LOG_ODD_F32 / LOG_EVEN_F32. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> [ci skip]
`GPUArrays.RNG{AT}` now carries the destination array type, so out-of-place
`rand(rng, T, dims)` / `randn(rng, T, dims)` allocate an `AT{T}` instead of
falling back to Random stdlib's CPU `Array` constructor.
The fill paths (`Random.rand!` / `Random.randn!` on `AnyGPUArray`) ignore
`AT` and dispatch on the destination's KernelAbstractions backend, so a
single RNG can fill any GPU array. A new CPU-array fallback uses `AT` to
allocate a temporary, fill it on-device, then `copyto!` the destination —
prevents Random's scalar-iter fallback when filling a `Vector` from a GPU
RNG.
Removes the `default_rng(::Type{<:AnyGPUArray})` interface: it was only
called by the testsuite (now uses `GPUArrays.RNG{AT}()` directly) and by
the per-backend shims (now redundant — the type parameter replaces them).
JLArrays' GLOBAL_RNG ref + override go away.
similar is the documented Base interface for type-driven array construction
(base/abstractarray.jl:873), and its default implementation calls T(undef, dims).
Lets array types that override similar(::Type{T}, ::Dims) hook in cleanly.
Metal.jl and other backends still extend GPUArrays.default_rng even
though the interface is no longer called. Without a function declaration
in GPUArrays, those extension definitions fail to parse with
UndefVarError. Empty function stub is enough for the transition.
Also fix the RNG(state::AbstractGPUArray) deprecation: it used to call
RNG() (zero-arg) which no longer exists. Now extracts AT from the state
array's type and calls RNG{AT}().
Without this, e.g. `randn!(Random.TaskLocalRNG(), ::CuArray)` falls through to the stdlib scalar setindex! path. Regression from #707.
Without this, e.g. `randn!(Random.TaskLocalRNG(), ::CuArray)` falls through to the stdlib scalar setindex! path. Regression from #707.
Replace the hybrid Tausworthe+LCG generator with a stateless Philox4x32-10 counter-based RNG. Use
@fastmath log+sincospifor Box-Muller randn!, letting each GPU backend provide optimized intrinsics via@device_override.The generic
rand!uses an immutable ElementRNG struct that implements the minimal AbstractRNG interface (just rand for UInt64 + rng_native_52). Julia's Random stdlib handles all type conversions (Float16, Complex, Bool, integers) automatically. The subcounter state lives in a stack-allocated Ref with a pointer in the struct, avoiding mutable struct heap allocation on GPU.This was prompted by JuliaGPU/CUDA.jl#3056 / inspired on https://github.com/medyan-dev/PhiloxRNG.jl
Performance is much better than CUDA.jl's NativeRNG, so I'll probably deprecate that in favor of this.
cc @nhz2