Skip to content

Replace Tausworthe+LCG RNG with stateless Philox4x32-10#707

Merged
maleadt merged 19 commits into
masterfrom
tb/philox_rng
Apr 15, 2026
Merged

Replace Tausworthe+LCG RNG with stateless Philox4x32-10#707
maleadt merged 19 commits into
masterfrom
tb/philox_rng

Conversation

@maleadt
Copy link
Copy Markdown
Member

@maleadt maleadt commented Apr 14, 2026

Replace the hybrid Tausworthe+LCG generator with a stateless Philox4x32-10 counter-based RNG. Use @fastmath log + sincospi for Box-Muller randn!, letting each GPU backend provide optimized intrinsics via @device_override.

The generic rand! uses an immutable ElementRNG struct that implements the minimal AbstractRNG interface (just rand for UInt64 + rng_native_52). Julia's Random stdlib handles all type conversions (Float16, Complex, Bool, integers) automatically. The subcounter state lives in a stack-allocated Ref with a pointer in the struct, avoiding mutable struct heap allocation on GPU.

This was prompted by JuliaGPU/CUDA.jl#3056 / inspired on https://github.com/medyan-dev/PhiloxRNG.jl

Performance is much better than CUDA.jl's NativeRNG, so I'll probably deprecate that in favor of this.

cc @nhz2

Replace the hybrid Tausworthe+LCG generator with a stateless Philox4x32-10
counter-based RNG using KernelAbstractions, portable across GPU backends.

rand! uses batched kernels for common types (Float16/32/64, integers, Bool,
Complex) that produce multiple values per Philox call. A generic ElementRNG
fallback handles all other types via Julia's Random stdlib dispatch.

randn! uses specialized Box-Muller kernels with @fastmath log + sincospi,
letting each backend provide optimized intrinsics via @device_override.
These can't be made generic like rand! because (1) Random's Ziggurat needs
table lookups requiring device overlays, (2) the generic Marsaglia polar
fallback has a rejection loop causing warp divergence, and (3) direct
Box-Muller produces value pairs that must be batched to stay bandwidth-bound.

The old RNG(state::AbstractGPUArray) constructor and seed!(rng, ::Vector)
are kept as deprecated methods for backwards compatibility with existing
backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
maleadt and others added 5 commits April 15, 2026 07:07
Previously used fldmod with typemax(UInt32) = 2^32 - 1, which wraps the
counter to 1 (not 0) when it saturates. Not a correctness issue today
since we only advance by 1, but the logic was confusing. Replace with
a plain increment that lets UInt32 overflow naturally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously the seed was UInt32 and the Philox key's upper lane was
always zero, so different 64-bit seeds colliding mod 2^32 produced
identical streams. Store the seed as UInt64 and split both halves
into the key's two lanes via a new philox_key helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Construct the Float64 bit-pattern directly instead of calling the
expensive Float64(::UInt64) conversion and a subsequent fma. On consumer
GPUs with 1:64 FP64 throughput this is a ~40x speedup for rand!(F64) on
a 100M-element array (0.9 vs 38 ms on an RTX 5080, now memory-bound).

Force the low mantissa bit so the result is strictly in (0, 1), which is
required by Box-Muller's log(u).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the four randn_*_kernel!'s with a single randn_batched_kernel!
mirroring rand_batched_kernel!, parameterized on the element type via a
philox_to_normals helper (analogous to philox_to_vals). Same performance,
half the kernel code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vals_per_call already returned 1 for 16-byte types; add the
corresponding philox_to_vals method and extend the union so rand!
dispatches to the batched kernel rather than the generic fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nhz2
Copy link
Copy Markdown
Contributor

nhz2 commented Apr 15, 2026

For randn it would be nice to avoid returning exactly zero (though not strictly needed). This is handled by the fixed point math in https://github.com/medyan-dev/PhiloxRNG.jl/blob/main/src/fastsincospi.jl and https://github.com/medyan-dev/PhiloxRNG.jl/blob/main/src/fastlog.jl

The fastsincospi also has a nice exact 8 way symmetry around the unit circle.

These custom functions seem fast on my RTX 3080 with Float32, but maybe in other cases the intrinsics are faster? I haven't looked much into Float64 or other hardware yet.

maleadt and others added 9 commits April 15, 2026 07:14
Both variants had the same (::Type{T}, u1, u2) signature with different
output types; dispatch on Complex{T} vs T instead of using a separate
name, and keep both methods next to each other at the top of the file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allowed any AbstractRNG to fill a GPUArray by generating values on the
CPU and copying them over. That hides accidentally-slow code paths;
let unsupported (rng, eltype) combinations fail explicitly via
MethodError instead.

Also drop the matching cpu_rng exercise from the test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mirrors the rand! structure: batched kernel for Float16/32/64 and their
complex variants, generic ElementRNG-based kernel for every other
AbstractFloat / Complex{<:AbstractFloat}. Non-batched types go through
Random's generic polar Box-Muller fallback, which is GPU-safe (ziggurat
is never reached because its signatures cover only Float16/32/64, which
dispatch to the batched kernel).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- u01 no longer claims "(0, 1]"; the Float64 path is strictly (0, 1).
- The generic rand! fallback comment listed types that have since been
  moved into BatchedRandTypes.
- The randn! intro claimed randn! couldn't use ElementRNG; there's now a
  generic ElementRNG fallback too. Reframe the three-point rationale as
  "why the batched path exists" rather than "why there's no generic path".
- Clarify the per-call batching table to cover Int128/UInt128 and
  Complex{Float32} explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ElementRNG and its Random stdlib dispatch methods are shared between the
generic rand! and randn! fallbacks; hoist them above both kernels rather
than living inside the "Generic rand! fallback" section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These @inline overrides of the generic `rand(::AbstractRNG, ::Type{T})`
and `SamplerType{Complex{T}}` methods were needed when ElementRNG was a
mutable struct, to keep the dispatch chain transparent enough for the
Ref-to-ptr trick to stack-allocate. With ElementRNG immutable, empirical
checks (LLVM inspection, BFloat16 rand and Complex{Float16} randn
benchmarks) show identical generated code and identical performance
whether the overrides are present or not. Base.Random defines both
methods with identical bodies, so nothing is lost by removing them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Base.sincospi(::Float32) widens to Float64 internally (see
base/special/trig.jl:744 `sinpi_kernel_wide`), which breaks Metal
(no Float64 support) and forces FP64 emulation on other backends that
hit that path.

Vendor PhiloxRNG.jl's minimax polynomial: entirely Float32, consumes
the Philox UInt32 output directly. Rewrite the ≤32-bit boxmuller
methods to take UInt32 arguments and update philox_to_normals to
match. Float64 keeps using Base.sincospi (backends with FP64 have
intrinsics, and the polynomial alternative is ~8x slower on
consumer GPUs — see the follow-up log commit for measurements).

Note: Base.log(::Float32) has the same FP64-widening problem and is
fixed in the next commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same Metal problem as sincospi in the previous commit:
Base.log(::Float32) widens to Float64 internally (see
base/special/log.jl:242, `Float64(2f0*f)/(2.0+f)`), which breaks Metal
and forces FP64 emulation on other backends that end up on that path.

Vendor PhiloxRNG.jl's Float32 polynomial log (itself ported from
fdlibm's e_logf.c). Fold the u01 conversion into the first fma so the
≤32-bit boxmuller never touches Float64.

Float64 still uses Base.log_fast — backends with FP64 support have
intrinsics for it, and the polynomial alternative is ~8x slower on
consumer GPUs (benchmarked: randn! F64 drops from ~97 GB/s with
intrinsics to ~12 GB/s with polynomials on an RTX 5080).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Base's randn(rng, ::BitFloatType) is the ziggurat path that reads
global wi/ki/fi tables, which aren't device-accessible — and the Float64
tables can't even be loaded on Metal. Direct randn!(rng, A) for
Float16/32/64 dispatches to the batched kernel so this seemed dormant,
but Base's randn(rng, Complex{T}) recurses into randn(rng, T): a
randn!(rng, A::GPUArray{Complex{Float16}}) call therefore hit ziggurat
via the generic path.

Override randn(rng::ElementRNG, ::Type{Float{16,32,64}}) to use our
Box-Muller directly. Covers any current or future caller that recurses
through these types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@maleadt
Copy link
Copy Markdown
Member Author

maleadt commented Apr 15, 2026

Yeah I switched back to your versions because sincospi(::Float32) uses Float64, which is incompatible with some of our back-ends.

FYI, Claude also came up with a trick to improve Float64 performance:

Use bit-pattern trick for u01(Float64, UInt64) — eliminates the
Float64(::UInt64) conversion + fma that hit the 1:64 FP64 wall on
consumer GPUs. ~40× speedup for rand!(F64).

Final performance now:

Benchmark Tausworthe (old) PhiloxRNG (raw) cuRAND this PR
rand! F32 138 915 270 912
rand! F64 264 21 911
randn! F32 170 916 548 912
randn! F64 56 12 57 97

old Tausworthe F64 uses only 32 bits of entropy per value.

Passes SmallCrush, but I haven't run BigCrush yet.

These are internal names but not meant as private underscore-prefixed
Julia convention; the leading _ was just carried over from PhiloxRNG.
Rename to fast_log / fast_sincospi / SP_F32 / CP_F32 / SQRT_HALF_I32 /
LOG_ODD_F32 / LOG_EVEN_F32.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

[ci skip]
maleadt added 3 commits April 15, 2026 10:40
`GPUArrays.RNG{AT}` now carries the destination array type, so out-of-place
`rand(rng, T, dims)` / `randn(rng, T, dims)` allocate an `AT{T}` instead of
falling back to Random stdlib's CPU `Array` constructor.

The fill paths (`Random.rand!` / `Random.randn!` on `AnyGPUArray`) ignore
`AT` and dispatch on the destination's KernelAbstractions backend, so a
single RNG can fill any GPU array. A new CPU-array fallback uses `AT` to
allocate a temporary, fill it on-device, then `copyto!` the destination —
prevents Random's scalar-iter fallback when filling a `Vector` from a GPU
RNG.

Removes the `default_rng(::Type{<:AnyGPUArray})` interface: it was only
called by the testsuite (now uses `GPUArrays.RNG{AT}()` directly) and by
the per-backend shims (now redundant — the type parameter replaces them).
JLArrays' GLOBAL_RNG ref + override go away.
similar is the documented Base interface for type-driven array construction
(base/abstractarray.jl:873), and its default implementation calls T(undef, dims).
Lets array types that override similar(::Type{T}, ::Dims) hook in cleanly.
Metal.jl and other backends still extend GPUArrays.default_rng even
though the interface is no longer called. Without a function declaration
in GPUArrays, those extension definitions fail to parse with
UndefVarError. Empty function stub is enough for the transition.

Also fix the RNG(state::AbstractGPUArray) deprecation: it used to call
RNG() (zero-arg) which no longer exists. Now extracts AT from the state
array's type and calls RNG{AT}().
@maleadt maleadt marked this pull request as ready for review April 15, 2026 09:16
@maleadt maleadt merged commit 5985115 into master Apr 15, 2026
18 checks passed
@maleadt maleadt deleted the tb/philox_rng branch April 15, 2026 10:20
maleadt added a commit that referenced this pull request Apr 16, 2026
Without this, e.g. `randn!(Random.TaskLocalRNG(), ::CuArray)` falls through
to the stdlib scalar setindex! path. Regression from #707.
maleadt added a commit that referenced this pull request Apr 16, 2026
Without this, e.g. `randn!(Random.TaskLocalRNG(), ::CuArray)` falls through
to the stdlib scalar setindex! path. Regression from #707.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants