Add AMD GPU support via the HIP backend #400

Open: jeffdaily wants to merge 1 commit into mumax:master from jeffdaily:moat-port
@jeffdaily

This adds an additive, opt-in HIP backend so mumax3 runs on AMD GPUs, alongside the existing CUDA path. The default CUDA build is behaviorally unchanged.

How it works

Each generated CUDA *_wrapper.go gains a //go:build !hip constraint (true in the default build); under -tags hip the matching generated *_wrapper_hip.go compiles instead. The HIP wrappers are cuda2go-generated and committed, exactly like the existing CUDA wrappers, so users build without the device toolchain.
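
The selection rule can be checked with the standard `go/build/constraint` package. The sketch below is illustrative only: the helper `fileIncluded` is hypothetical, and it assumes the HIP wrappers carry a plain `//go:build hip` line, which the description implies but does not quote.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// fileIncluded reports whether a file carrying the given //go:build line
// would be compiled when exactly the given build tags are set.
// Hypothetical helper for illustration; it is not part of this PR.
func fileIncluded(buildLine string, tags ...string) (bool, error) {
	expr, err := constraint.Parse(buildLine)
	if err != nil {
		return false, err
	}
	set := make(map[string]bool, len(tags))
	for _, t := range tags {
		set[t] = true
	}
	return expr.Eval(func(tag string) bool { return set[tag] }), nil
}

func main() {
	// "//go:build hip" on the HIP wrappers is an assumption (see lead-in).
	for _, line := range []string{"//go:build !hip", "//go:build hip"} {
		inDefault, _ := fileIncluded(line)
		inHip, _ := fileIncluded(line, "hip")
		fmt.Printf("%-18s default build: %v, -tags hip: %v\n", line, inDefault, inHip)
	}
}
```

Under these assumptions exactly one wrapper of each pair is compiled in either build, so the two files can define the same symbols without conflict.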

The CUDA path embeds PTX, a forward-compatible virtual ISA the driver finalizes at load. The faithful AMD analog is amdgcnspirv: hipcc --genco --offload-arch=amdgcnspirv emits one generic SPIR-V image that the ROCm runtime finalizes for the present GPU at hipModuleLoadData time. The backend embeds a single generic image per kernel, so there is no per-arch code-object matrix and no arch-detection loader: one build runs on any supported GPU with no rebuild for new hardware (no CUDA_CC for the HIP build).

Device math is reached through the HIP driver API (the analog of the CUDA driver API already used here); cuRAND/cuFFT are served by hipRAND/hipFFT. Two AMD wave-semantics fixes were needed: a reduction that drops the unrolled 32-lane tail in favor of an all-__syncthreads tree (correct on both wave64 CDNA and wave32 RDNA), and an atomicCAS-based fmaxabs (HIP drops int atomicMax on coarse-grained memory). The device kernels are otherwise unchanged.

Building

    go install -tags hip github.com/mumax/3/...

Needs a ROCm install (HIP runtime + headers; cgo defaults to /opt/rocm). The committed device images mean no hipcc is required for a plain build. The README documents this alongside the CUDA build.

Validation

Tested on real GPUs across three architectures:

| GPU | arch | wave | OS | ROCm |
|---|---|---|---|---|
| Instinct MI250X | gfx90a | 64 | Linux | 7.2.1 |
| Radeon Pro W7800 | gfx1100 | 32 | Linux | 7.2.1 |
| Radeon RX 9070 XT | gfx1201 | 32 | Windows | 7.14 |

    go test -tags hip -count=1 ./cuda/ ./cuda/cu/      # 8/8, 12/12 PASS
    go test -tags hip -count=1 ./data/... ./httpfs/...  # regression PASS
    mumax3 ... test/standardproblem4.mx3   # M.Average within 1e-5 (gate 1e-3)
    mumax3 ... test/standardproblem5.mx3   # mx/my/mz within 1e-4

standardproblem4 M.Average agrees to ~5e-7 between gfx90a and gfx1100; standardproblem5 cross-arch results match to 5 significant figures.

This work was authored with assistance from an AI coding assistant (Claude).

Adds an additive, opt-in HIP backend so mumax3 runs on AMD GPUs, alongside the
existing CUDA path. Build it with `go install -tags hip github.com/mumax/3/...`
(or `make BACKEND=hip`).

The default CUDA build is behaviorally unchanged: each generated CUDA
*_wrapper.go gains only a `//go:build !hip` constraint (true in the default
build), so it is excluded only under `-tags hip`, where the matching generated
*_wrapper_hip.go compiles instead. As with the existing CUDA wrappers, the HIP
wrappers are cuda2go-generated and committed, so users build without the device
toolchain.

Embedding model: the CUDA path embeds PTX, a forward-compatible virtual ISA the
driver finalizes (JITs) at load. The faithful AMD analog is amdgcnspirv: hipcc
--genco --offload-arch=amdgcnspirv emits one generic SPIR-V image that the ROCm
runtime finalizes for the present GPU at hipModuleLoadData time. The HIP backend
embeds a single generic image per kernel and needs no per-arch code-object
matrix and no arch-detection loader: one image runs on any supported gfx arch
with no rebuild for new GPUs, so CUDA_CC does not drive the HIP build.

Device math is reached through the HIP driver API (the analog of the CUDA driver
API already used here); cuRAND/cuFFT are served by hipRAND/hipFFT. Two AMD
wave-semantics fixes were needed: a reduction that drops the unrolled 32-lane
tail in favor of an all-__syncthreads tree (correct on both wave64 CDNA and
wave32 RDNA), and an atomicCAS-based fmaxabs (HIP drops int atomicMax on
coarse-grained memory). The device .cu/.cuh kernels are otherwise unchanged:
amdgcnspirv defines __HIP_PLATFORM_AMD__ so the existing AMD guards resolve, and
no kernel assumes a compile-time wave width, so none needed a dynamic-warpSize
fix.

To review: cuda/Makefile (BACKEND=hip compiles one amdgcnspirv image per kernel)
and cuda/cuda2go.go (the hip template embeds one base64 blob and a <name>_image)
drive the codegen; the regenerated cuda/*_wrapper_hip.go follow mechanically.
cuda/fatbin_hip.go hands the blob straight to ModuleLoadData. The cuda/cu and
cuda/cufft *_hip.go files are the HIP driver-API plumbing; engine/*_hip.go is the
device-name reporting. README.md documents the AMD build alongside the CUDA one.
The bulk of the diff is the generated *_wrapper_hip.go files, mechanical like
their committed CUDA counterparts.

Hardware validated:
  Instinct MI250X    gfx90a  (CDNA2, wave64)  Linux    ROCm 7.2.1
  Radeon Pro W7800   gfx1100 (RDNA3, wave32)  Linux    ROCm 7.2.1
  Radeon RX 9070 XT  gfx1201 (RDNA4, wave32)  Windows  ROCm 7.14 (TheRock)

Test Plan:

    cd cuda && make wrappers BACKEND=hip && cd ..
    go install -tags hip github.com/mumax/3/...
    go test -tags hip -count=1 ./cuda/ ./cuda/cu/
    go test -tags hip -count=1 ./data/... ./httpfs/...
    mumax3 -paranoid=false -cache /tmp -http "" test/standardproblem4.mx3
    mumax3 -paranoid=false -cache /tmp -http "" test/standardproblem5.mx3

Results on all three GPUs: ./cuda/ 8/8 PASS, ./cuda/cu/ 12/12 PASS (TestModule
on the generic image, cufft FFT1D PASS), non-GPU regression PASS.
standardproblem4 M.Average() within 1e-5 (gate 1e-3), gfx90a vs gfx1100 agree to
~5e-7; standardproblem5 mx/my/mz within 1e-4, cross-arch match to 5 significant
figures. Boot log reports "using generic amdgcnspirv image" -- the single
generic image loads, no gfx-specific code-object selection remains.
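
The tolerance checks quoted above amount to a per-component comparison against a reference. The sketch below shows one way to express that; `maxAbsDiff` and `withinTol` are hypothetical helpers, and the vectors in `main` are placeholders chosen for illustration, not results from this PR.

```go
package main

import (
	"fmt"
	"math"
)

// maxAbsDiff returns the largest per-component absolute difference
// between two 3-vectors such as an averaged magnetization (mx, my, mz).
func maxAbsDiff(got, want [3]float64) float64 {
	worst := 0.0
	for i := range got {
		worst = math.Max(worst, math.Abs(got[i]-want[i]))
	}
	return worst
}

// withinTol reports whether every component of got is within tol of want.
func withinTol(got, want [3]float64, tol float64) bool {
	return maxAbsDiff(got, want) <= tol
}

func main() {
	// Placeholder numbers only; they are not measurements from this PR.
	reference := [3]float64{0.5, 0.25, 0.125}
	measured := [3]float64{0.500004, 0.249998, 0.125001}

	fmt.Println("within 1e-5:", withinTol(measured, reference, 1e-5))
	fmt.Println("within 1e-3:", withinTol(measured, reference, 1e-3))
}
```

Reporting the worst component rather than an average keeps a single badly wrong axis from being hidden by two good ones.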

This work was authored with assistance from an AI coding assistant (Claude).

jeffdaily added a commit to jeffdaily/moat that referenced this pull request on Jun 11, 2026
JLeliaert requested a review from JonathanMaes on June 11, 2026, 06:59