Add PTXFDivFastPass to lower fdiv fast to NVPTX approximate division #800

Draft
vchuravy wants to merge 1 commit into main from vc/ptx_fast_div

@vchuravy (Member)

The overarching goal is to move fast-math handling from CUDA.jl into the GPUCompiler backend.

The LLVM NVPTX backend handles fdiv fast for Float32 (→ div.approx.ftz.f32)
but has no fast path for Float64. This IR-level pass covers both:

  • Float32: replaces fdiv with __nv_fast_fdividef (libdevice)
  • Float64: replaces fdiv with rcp.approx.ftz.d + Newton refinement,
    matching CUDA.jl's inv_fast(::Float64) algorithm
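
As a rough illustration of the Float64 path, the Python sketch below emulates a low-precision reciprocal and refines it with a single cubic step. It is not the pass's implementation. `truncated_reciprocal` is a hypothetical stand-in for the hardware instruction (it simply zeroes the low 32 bits of an exact reciprocal), the refinement sequence is one standard cubic form and is an assumption about what inv_fast(::Float64) does, and forming the quotient as a * (1/b) is likewise assumed.

```python
import struct


def truncated_reciprocal(x: float) -> float:
    """Hypothetical stand-in for an approximate reciprocal.

    Computes 1/x exactly, then clears the low 32 bits of the result so that
    only about 20 mantissa bits remain meaningful.
    """
    bits = struct.unpack("<Q", struct.pack("<d", 1.0 / x))[0]
    bits &= 0xFFFFFFFF00000000  # keep sign, exponent, top 20 mantissa bits
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def refine_reciprocal(x: float, r: float) -> float:
    """One cubic refinement step for r ~ 1/x (assumed form).

    With e = 1 - r*x, the update r * (1 + e + e^2) leaves a relative
    error of about e^3.
    """
    e = 1.0 - r * x   # residual of the initial guess
    e = e * e + e     # e + e^2
    return e * r + r  # r * (1 + e + e^2)


def fast_div(a: float, b: float) -> float:
    """Approximate a / b as a * (refined reciprocal of b)."""
    return a * refine_reciprocal(b, truncated_reciprocal(b))


if __name__ == "__main__":
    for b in (3.0, 7.0, 1e-3, -12345.678):
        rough = truncated_reciprocal(b)
        print(b, abs(rough * b - 1.0), abs(fast_div(1.0, b) * b - 1.0))
```

On a GPU the three refinement steps would normally be fused multiply-adds. Plain Python arithmetic is used here only to show how one step recovers the precision lost in the initial guess.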

The pass fires when the instruction carries the afn fast-math flag (set by
@fastmath) or when target.fastmath=true. It follows the NVVMReflectPass
pattern already in ptx.jl.
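
The firing condition and the per-type dispatch described above can be modelled in a few lines of Python. This is only a sketch of the decision logic as the description states it: the function name `choose_lowering`, its parameters, and the string labels are hypothetical, and leaving other element types untouched is an assumption.

```python
from typing import Optional


def choose_lowering(elem_type: str, has_afn: bool, target_fastmath: bool) -> Optional[str]:
    """Return a label for the replacement of an fdiv, or None to leave it alone.

    elem_type       -- "float" (Float32) or "double" (Float64); hypothetical encoding
    has_afn         -- the instruction carries the afn fast-math flag
    target_fastmath -- the target was configured with fastmath=true
    """
    if not (has_afn or target_fastmath):
        return None  # the pass does not fire
    if elem_type == "float":
        return "__nv_fast_fdividef"
    if elem_type == "double":
        return "rcp.approx.ftz.d + Newton refinement"
    return None  # assumption: other types are not rewritten


if __name__ == "__main__":
    print(choose_lowering("float", has_afn=True, target_fastmath=False))
    print(choose_lowering("double", has_afn=False, target_fastmath=True))
    print(choose_lowering("double", has_afn=False, target_fastmath=False))
```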

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>