Add AMD GPU support via the HIP backend #400
Open
jeffdaily wants to merge 1 commit into
Adds an additive, opt-in HIP backend so mumax3 runs on AMD GPUs, alongside the
existing CUDA path. Build it with `go install -tags hip github.com/mumax/3/...`
(or `make BACKEND=hip`).
The default CUDA build is behaviorally unchanged: each generated CUDA
`*_wrapper.go` gains only a `//go:build !hip` constraint (true in the default
build), so it is excluded only under `-tags hip`, where the matching generated
`*_wrapper_hip.go` compiles instead. As with the existing CUDA wrappers, the HIP
wrappers are `cuda2go`-generated and committed, so users build without the device
toolchain.
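The build-tag switch described above can be sketched in a single file. This is a minimal illustration, not one of the generated wrappers: the `backendName` function and the file's contents are hypothetical, and only the `//go:build !hip` constraint comes from the description.

```go
//go:build !hip

// Hypothetical stand-in for a generated CUDA wrapper file.
// The constraint above is true unless the build passes -tags hip.
package main

import "fmt"

// backendName is illustrative only. A sibling file guarded by
// "//go:build hip" would define the same symbol and return "hip",
// so exactly one definition is compiled in any given build.
func backendName() string { return "cuda" }

func main() {
	fmt.Println("backend:", backendName())
}
```

Built without tags, this file is included; built with `-tags hip`, it is skipped and the sibling file would have to supply `backendName` instead.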
Embedding model: the CUDA path embeds PTX, a forward-compatible virtual ISA that
the driver finalizes (JITs) at load. The faithful AMD analog is amdgcnspirv:
`hipcc --genco --offload-arch=amdgcnspirv` emits one generic SPIR-V image that the
ROCm runtime finalizes for the present GPU at `hipModuleLoadData` time. The HIP
backend therefore embeds a single generic image per kernel and needs neither a
per-arch code-object matrix nor an arch-detection loader. One image runs on any
supported gfx arch with no rebuild for new GPUs, so `CUDA_CC` does not drive the
HIP build.
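The embedding model can be pictured as one base64 string per kernel that is decoded and handed to the runtime at load. The sketch below assumes that shape only. The names `kernelImageB64`, `decodeImage`, and `loadModule` are hypothetical, the blob is placeholder text rather than a real SPIR-V module, and `loadModule` merely stands in for the runtime's module-load call.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// kernelImageB64 stands in for the single base64 blob a generated wrapper
// would embed. The payload is placeholder text, not a device image.
var kernelImageB64 = base64.StdEncoding.EncodeToString([]byte("placeholder device image"))

// decodeImage turns the embedded base64 text back into raw bytes.
func decodeImage(b64 string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(b64)
}

// loadModule is a hypothetical stand-in for passing the image to the runtime.
// It only validates the input and reports how many bytes it was given.
func loadModule(image []byte) (int, error) {
	if len(image) == 0 {
		return 0, errors.New("empty device image")
	}
	return len(image), nil
}

func main() {
	image, err := decodeImage(kernelImageB64)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	n, err := loadModule(image)
	if err != nil {
		fmt.Println("load failed:", err)
		return
	}
	fmt.Println("loaded image bytes:", n)
}
```

Because the same bytes are passed regardless of which GPU is present, nothing in this path needs to inspect the architecture before loading.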
Device math is reached through the HIP driver API (the analog of the CUDA driver
API already used here); cuRAND/cuFFT are served by hipRAND/hipFFT. Two AMD
wave-semantics fixes were needed: a reduction that drops the unrolled 32-lane
tail in favor of an all-`__syncthreads` tree (correct on both wave64 CDNA and
wave32 RDNA), and an `atomicCAS`-based `fmaxabs` (HIP drops int `atomicMax` on
coarse-grained memory). The device `.cu`/`.cuh` kernels are otherwise unchanged:
amdgcnspirv defines `__HIP_PLATFORM_AMD__` so the existing AMD guards resolve, and
no kernel assumes a compile-time wave width, so none needed a dynamic-`warpSize`
fix.
To review: cuda/Makefile (BACKEND=hip compiles one amdgcnspirv image per kernel)
and cuda/cuda2go.go (the hip template embeds one base64 blob and a <name>_image)
drive the codegen; the regenerated cuda/*_wrapper_hip.go follow mechanically.
cuda/fatbin_hip.go hands the blob straight to ModuleLoadData. The cuda/cu and
cuda/cufft *_hip.go files are the HIP driver-API plumbing; engine/*_hip.go is the
device-name reporting. README.md documents the AMD build alongside the CUDA one.
The bulk of the diff is the generated *_wrapper_hip.go files, mechanical like
their committed CUDA counterparts.
Hardware validated:

| GPU | Arch | Family (wave size) | OS | ROCm |
|---|---|---|---|---|
| Instinct MI250X | gfx90a | CDNA2 (wave64) | Linux | 7.2.1 |
| Radeon Pro W7800 | gfx1100 | RDNA3 (wave32) | Linux | 7.2.1 |
| Radeon RX 9070 XT | gfx1201 | RDNA4 (wave32) | Windows | 7.14 (TheRock) |

Test Plan:
cd cuda && make wrappers BACKEND=hip && cd ..
go install -tags hip github.com/mumax/3/...
go test -tags hip -count=1 ./cuda/ ./cuda/cu/
go test -tags hip -count=1 ./data/... ./httpfs/...
mumax3 -paranoid=false -cache /tmp -http "" test/standardproblem4.mx3
mumax3 -paranoid=false -cache /tmp -http "" test/standardproblem5.mx3
Results on all three GPUs: ./cuda/ 8/8 PASS, ./cuda/cu/ 12/12 PASS (TestModule
on the generic image, cufft FFT1D PASS), non-GPU regression PASS.
standardproblem4 M.Average() within 1e-5 (gate 1e-3), gfx90a vs gfx1100 agree to
~5e-7; standardproblem5 mx/my/mz within 1e-4, cross-arch match to 5 significant
figures. The boot log reports "using generic amdgcnspirv image": the single
generic image loads, and no gfx-specific code-object selection remains.
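A gate check of the kind quoted above (observed deviation well inside a looser pass threshold) might look like the following sketch. The `withinGate` helper and all numbers in `main` are made up for illustration; they are not results from this PR or values from the test scripts.

```go
package main

import (
	"fmt"
	"math"
)

// withinGate reports whether every component of got lies within tol of want.
func withinGate(got, want [3]float64, tol float64) bool {
	for i := range got {
		if math.Abs(got[i]-want[i]) > tol {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative values only, not measurements.
	reference := [3]float64{0.5, 0.25, -0.125}
	measured := [3]float64{0.500004, 0.25, -0.125}

	fmt.Println("passes 1e-3 gate:", withinGate(measured, reference, 1e-3))
	fmt.Println("passes 1e-6 gate:", withinGate(measured, reference, 1e-6))
}
```

Comparing per component against an absolute tolerance keeps the check simple; a relative tolerance would be an alternative if the reference components differed widely in magnitude.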
This work was authored with assistance from an AI coding assistant (Claude).
jeffdaily added a commit to jeffdaily/moat that referenced this pull request on Jun 11, 2026
This adds an additive, opt-in HIP backend so mumax3 runs on AMD GPUs, alongside the existing CUDA path. The default CUDA build is behaviorally unchanged.
How it works
Each generated CUDA `*_wrapper.go` gains a `//go:build !hip` constraint (true in the default build); under `-tags hip` the matching generated `*_wrapper_hip.go` compiles instead. The HIP wrappers are `cuda2go`-generated and committed, exactly like the existing CUDA wrappers, so users build without the device toolchain.
The CUDA path embeds PTX, a forward-compatible virtual ISA the driver finalizes at load. The faithful AMD analog is amdgcnspirv: `hipcc --genco --offload-arch=amdgcnspirv` emits one generic SPIR-V image that the ROCm runtime finalizes for the present GPU at `hipModuleLoadData` time. The backend embeds a single generic image per kernel, so there is no per-arch code-object matrix and no arch-detection loader: one build runs on any supported GPU with no rebuild for new hardware (no `CUDA_CC` for the HIP build).
Device math is reached through the HIP driver API (the analog of the CUDA driver API already used here); cuRAND/cuFFT are served by hipRAND/hipFFT. Two AMD wave-semantics fixes were needed: a reduction that drops the unrolled 32-lane tail in favor of an all-`__syncthreads` tree (correct on both wave64 CDNA and wave32 RDNA), and an `atomicCAS`-based `fmaxabs` (HIP drops int `atomicMax` on coarse-grained memory). The device kernels are otherwise unchanged.
Building
Needs a ROCm install (HIP runtime + headers; cgo defaults to `/opt/rocm`). The committed device images mean no `hipcc` is required for a plain build. The README documents this alongside the CUDA build.
Validation
Tested on real GPUs across three architectures: gfx90a (CDNA2, wave64), gfx1100 (RDNA3, wave32), and gfx1201 (RDNA4, wave32).
standardproblem4 agrees to ~5e-7 between gfx90a and gfx1100; standardproblem5 cross-arch results match to 5 significant figures.