hipMallocFromPoolAsync broken on ROCm 7.2 #901

@luraess

It seems AMDGPU.jl fails to allocate any device memory, as reported in #869 and, more recently, as observed on a CI runner using a ROCm 7.2 container.

Description

On ROCm 7.2 (AMD Radeon RX 7800 XT / gfx1101), any attempt to allocate device memory via
AMDGPU.jl fails immediately. Even the simplest operation crashes:

julia> using AMDGPU
julia> AMDGPU.ROCArray([1, 2, 3])
# → error / silent NULL pointer

Possible cause

AMDGPU.jl currently uses hipMallocFromPoolAsync exclusively for all HIPBuffer allocations.
On ROCm 7.2 this function appears to be broken (returns a NULL pointer or errors) for at least
some GPU/driver combinations, with no fallback path.

This can be confirmed with:

using AMDGPU, AMDGPU.HIP

dev          = AMDGPU.device()
stream       = AMDGPU.HIPStream(); # pretty printing fix in #900 
default_pool = HIP.default_memory_pool(dev)

ptr_ref = Ref{Ptr{Cvoid}}()
HIP.hipMallocFromPoolAsync(ptr_ref, 24, default_pool, stream)
HIP.wait(stream)
println("ptr: $(ptr_ref[])")   # returns NULL on affected systems

The synchronous fallback, by contrast, works correctly:

ptr_ref = Ref{Ptr{Cvoid}}()
HIP.hipMalloc(ptr_ref, 24)
println("ptr: $(ptr_ref[])")   # non-NULL, works fine

Environment

ROCm version: 7.2 (rocm/dev-ubuntu-24.04:7.2-complete)
GPU: AMD Radeon RX 7800 XT (DID 0x747e, gfx1101)
Julia: 1.12
AMDGPU.jl: master

Possible fix

Branch lr/rocm7 adds a per-device "broken" flag with an automatic fallback to hipMalloc:

  • A global POOL_ALLOC_BROKEN::LockedObject(Set{Int64}) tracks device IDs where pool
    allocation has failed.
  • On first allocation, hipMallocFromPoolAsync is attempted. If it returns NULL (or exhausts
    all retry phases), the device is added to POOL_ALLOC_BROKEN and all subsequent allocations
    on that device use synchronous hipMalloc.
  • HIPBuffer gains a pool_alloc::Bool field so the correct free function is used
    (hipFreeAsync for pool buffers, hipFree otherwise) — mixing them causes memory corruption.

The change is in src/runtime/memory/hip.jl.
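
To make the fallback logic easier to discuss, here is a minimal, GPU-free sketch of the idea. It is not the code from the lr/rocm7 branch: the allocators (`mock_pool_alloc`, `mock_sync_alloc`), the `MockBuffer` type, the `POOL_BROKEN` set with its plain `ReentrantLock`, and the `BROKEN_HARDWARE` set are all hypothetical stand-ins, assumed only for illustration.

```julia
# Hypothetical stand-ins for the real allocators so the sketch runs without a GPU.
# Pretend that device 1 has a broken pool allocator that returns NULL.
const BROKEN_HARDWARE = Set([1])

mock_pool_alloc(dev::Int, bytes::Int) =
    dev in BROKEN_HARDWARE ? C_NULL : Ptr{Cvoid}(UInt(0x1000))
mock_sync_alloc(dev::Int, bytes::Int) = Ptr{Cvoid}(UInt(0x2000))

# Device IDs on which pool allocation has already failed, guarded by a lock.
const POOL_BROKEN = Set{Int}()
const POOL_BROKEN_LOCK = ReentrantLock()

pool_broken(dev::Int) = lock(() -> dev in POOL_BROKEN, POOL_BROKEN_LOCK)
mark_pool_broken!(dev::Int) = lock(() -> push!(POOL_BROKEN, dev), POOL_BROKEN_LOCK)

# Buffer that remembers which allocator produced it.
struct MockBuffer
    ptr::Ptr{Cvoid}
    pool_alloc::Bool
end

function alloc(dev::Int, bytes::Int)
    if !pool_broken(dev)
        # Try the pool allocator first; on NULL, flag the device as broken.
        ptr = mock_pool_alloc(dev, bytes)
        ptr != C_NULL && return MockBuffer(ptr, true)
        mark_pool_broken!(dev)
    end
    # Synchronous fallback, used for every later allocation on a flagged device.
    ptr = mock_sync_alloc(dev, bytes)
    ptr == C_NULL && error("allocation of $bytes bytes failed on device $dev")
    return MockBuffer(ptr, false)
end

# Pick the free function that matches the allocator that produced the buffer.
free_function(buf::MockBuffer) = buf.pool_alloc ? :hipFreeAsync : :hipFree
```

In this sketch, `alloc(0, 24)` yields a buffer with `pool_alloc == true`, while `alloc(1, 24)` flags device 1 and returns a buffer with `pool_alloc == false`, so `free_function` selects `:hipFree` for it. The retry phases mentioned above are left out for brevity.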

I would be happy to get any insights, and ideally others with access to AMD GPUs could test this. I will try it on MI300A as soon as I finalise the setup there.

Disclaimer: I iterated on the issue with Claude Code. The main purpose is debugging, and ideally we can work towards a proper fix after input from others.
