Skip to content

hipFile: Ensure HIP Runtime is initialized before calling into hipAmdFileRead/hipAmdFileWrite#223

Merged
kurtmcmillan merged 4 commits into
developfrom
kumcmill/rocm-20222
Mar 23, 2026
Merged

hipFile: Ensure HIP Runtime is initialized before calling into hipAmdFileRead/hipAmdFileWrite#223
kurtmcmillan merged 4 commits into
developfrom
kumcmill/rocm-20222

Conversation

@kurtmcmillan
Copy link
Copy Markdown
Collaborator

@kurtmcmillan kurtmcmillan commented Mar 17, 2026

Motivation

Avoid causing a SEGFAULT in the HIP Runtime.

AIHIPFILE-157

Technical Details

A storage partner recently hit a SEGFAULT when testing hipFile with vllm. vllm performs IO within a thread and the first call into the HIP Runtime is hipAmdFileRead/hipAmdFileWrite through the fastpath backend. Since these HIP Runtime APIs are private, it wasn't anticipated that they would need to perform any initialization. Library initialization checks are being added to hipAmdFileRead/hipAmdFileWrite (ROCm/rocm-systems#4068) This change is a temporary while there is the possibility that hipFile will run with an unpatched HIP Runtime. ROCM-20222

@kurtmcmillan kurtmcmillan marked this pull request as ready for review March 17, 2026 20:49
Copilot AI review requested due to automatic review settings March 17, 2026 20:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an explicit HIP runtime initialization step to the Fastpath backend before calling the private hipAmdFileRead/hipAmdFileWrite APIs, to avoid a HIP runtime segfault when those are the first HIP calls on a newly created thread.

Changes:

  • Add a Hip::hipInit() wrapper method and a corresponding MHip mock method.
  • Add a per-thread initialization guard in Fastpath::io() that calls hipInit() once per thread.
  • Update Fastpath unit tests to expect an initialization call.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/amd_detail/backend/fastpath.cpp Adds per-thread HIP runtime init guard in Fastpath::io()
src/amd_detail/hip.h Extends Hip interface with hipInit()
src/amd_detail/hip.cpp Implements Hip::hipInit() via ::hipInit(0)
test/amd_detail/mhip.h Adds hipInit() mock to MHip
test/amd_detail/fastpath.cpp Updates Fastpath IO tests to expect hipInit()
CHANGELOG.md Documents the Fastpath initialization safeguard
Comments suppressed due to low confidence (1)

test/amd_detail/fastpath.cpp:346

  • The new EXPECT_CALL(mhip, hipInit); expectation is set in every expect_io() call, but the production code uses a thread_local guard and will only call hipInit() once per thread for the whole test process. Since gtest typically runs these parametrized tests sequentially on the same thread, after the first Fastpath().io(...) call hipInit() will no longer be invoked and later tests will fail due to an unsatisfied expectation. Adjust the tests to allow hipInit() to be called 0 or 1 times (or restructure to run the IO on a fresh thread / reset the per-thread init state in a test-only way).
struct FastpathIoParam : public FastpathTestBase, public TestWithParam<IoType> {

    StrictMock<MHip> mhip;

    void expect_io()
    {
        EXPECT_CALL(mhip, hipInit);
        EXPECT_CALL(*mbuffer, getBuffer).WillOnce(Return(DEFAULT_BUFFER_ADDR));
        EXPECT_CALL(*mbuffer, getLength).WillOnce(Return(DEFAULT_BUFFER_LENGTH));
        EXPECT_CALL(*mfile, getUnbufferedFd).WillOnce(Return(DEFAULT_UNBUFFERED_FD));
    }

    void expect_io(optional<int> fd, void *bufptr, size_t buflen)
    {
        EXPECT_CALL(mhip, hipInit);
        EXPECT_CALL(*mbuffer, getBuffer).WillOnce(Return(bufptr));
        EXPECT_CALL(*mbuffer, getLength).WillOnce(Return(buflen));
        EXPECT_CALL(*mfile, getUnbufferedFd).WillOnce(Return(fd));
    }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/amd_detail/backend/fastpath.cpp Outdated
hipFileRead & hipFileWrite use private/hidden HIP Runtime APIs which,
initially, did not ensure that the HIP Runtime was initialized which resulted
in a SEGFAULT.

This change ensures that the HIP Runtime is initialized when
hipFileRead/hipFileWrite are the first HIP API calls of a thread.
Copilot AI review requested due to automatic review settings March 23, 2026 21:20
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a defensive HIP runtime initialization step in the AMD fastpath backend to prevent a crash when hipAmdFileRead/hipAmdFileWrite are the first HIP calls made on a newly-created thread (e.g., IO initiated from a worker thread).

Changes:

  • Add a Hip::hipInit() wrapper and mock support for it.
  • Call hipInit() once per thread in Fastpath::io() before invoking hipAmdFileRead/hipAmdFileWrite.
  • Update fastpath unit tests and document the behavior in the changelog.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/amd_detail/backend/fastpath.cpp Adds per-thread HIP runtime initialization in Fastpath::io() prior to fastpath HIP IO calls.
src/amd_detail/hip.h Extends the Hip wrapper interface with hipInit().
src/amd_detail/hip.cpp Implements Hip::hipInit() via ::hipInit(0) with existing error handling.
test/amd_detail/mhip.h Adds hipInit to the HIP mock interface.
test/amd_detail/fastpath.cpp Adjusts test expectations to include HIP init on valid IO paths (but currently conflicts with per-thread one-time behavior).
CHANGELOG.md Notes the new fastpath HIP runtime initialization check.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/amd_detail/fastpath.cpp
@kurtmcmillan kurtmcmillan merged commit 1744b22 into develop Mar 23, 2026
45 checks passed
@kurtmcmillan
Copy link
Copy Markdown
Collaborator Author

I looked into writing a system test for this change but ran into a few hurdles

  1. When AIS_TESTING is defined (whenever tests are built) then some checks are not cached. This makes a system test that reproduces the error impossible to write because HIP API calls that would normally be cached in non-test builds are no longer cached and trigger the HIP runtime to be initialized even when hipFileRead/Write are the first HIP API calls of a new thread.
  2. Stats server is initialized as soon as the process starts. This is why we have AIS_TESTING disable the static on some variables. Stats server calls into configuration and gets the stats level. On non-test builds, this value is cached and not recomputed. The static is disabled in test builds so that unit tests can test the functionality of getting the stats value. If static wasn't disabled in test builds then we would not be able to test this.

We may want to look at not performing static initialization in the future.

@kurtmcmillan kurtmcmillan deleted the kumcmill/rocm-20222 branch March 23, 2026 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants