
[Issue] RDMA PushRequest protocol mismatch causes ibv_post_send failure and data_num != 3 assertion on non-GDR (Ascend NPU) environment #46

@Dong-Jiahuan


Summary

While porting StepMesh to a non-CUDA Ascend NPU (ARM64) environment, I found that Push operations crash consistently.
Initially the run failed with:

ibv_post_send failed.

But after further instrumentation, the crash was traced to the following protocol-layer assertion:

rdma_transport.h: PS_CHECK_EQ(buffer_ctx->data_num, 3)

This confirms the failure is not caused by RDMA verbs, but by an incorrect PushRequest data layout on the sender side.


Environment

  • Hardware: Ascend NPU (ARM64)

  • Architecture: aarch64

  • StepMesh version: master (2025-01)

  • RDMA: RoCE

  • No CUDA / no GDR support


How I Identified the Real Root Cause

The initial crash happened inside ibv_post_send(). To rule out inline-send issues (EINVAL is common when the inline payload exceeds the device limit), I modified RDMAWriteWithImm:

if (inline_write) {
  wr.send_flags |= IBV_SEND_INLINE;
}

// Force-disable inline (important for ruling out EINVAL)
wr.send_flags &= ~IBV_SEND_INLINE;

if (prev_wr == nullptr) {
  PS_CHECK_EQ(ibv_post_send(qp, &wr, &bad_wr), 0);
} else {
  prev_wr->next = &wr;
  PS_CHECK_EQ(ibv_post_send(qp, prev_wr, &bad_wr), 0);
}
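
A less intrusive way to rule out the inline limit, instead of force-clearing the flag, would be to gate the inline decision on the device limit. The sketch below is only an illustration: CanSendInline is a hypothetical helper that does not exist in StepMesh, and max_inline_data is assumed to be the value reported in ibv_qp_cap::max_inline_data after the QP was created.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of StepMesh): returns true only when the
// payload fits within the inline limit reported by the device.
// `max_inline_data` is assumed to come from ibv_qp_cap::max_inline_data.
inline bool CanSendInline(std::size_t payload_len, std::uint32_t max_inline_data) {
  // A limit of 0 means the QP does not support inline sends at all.
  return max_inline_data > 0 && payload_len <= max_inline_data;
}
```

With such a guard, IBV_SEND_INLINE would be set only when both inline_write and CanSendInline(...) hold, so an oversized inline payload could not be what makes ibv_post_send fail.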

After disabling inline writes, the failure moved deterministically to:

rdma_transport.h: PS_CHECK_EQ(buffer_ctx->data_num, 3)

This shows:

  1. RDMA verbs and QP are functional.

  2. Inline was not the real cause.

  3. The true bug is that the PushRequest sender is constructing an incorrect number of data segments.

Thus the crash is purely a protocol mismatch, not a hardware issue.


Expected Behavior

Per StepMesh internal RDMA protocol:

PushRequest (worker → server)
  data_num   = 3
  segment[0] = keys
  segment[1] = vals
  segment[2] = lens

RecvPushRequest enforces this strictly:

PS_CHECK_EQ(buffer_ctx->data_num, 3);
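
To make the expected layout concrete, here is a minimal sketch of a sender-side helper that always produces the three segments in the order keys, vals, lens. Segment, PushSegment and MakePushRequestSegments are illustrative names I made up for this issue; StepMesh's real buffer types differ.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-in for one data segment (raw bytes); not a StepMesh type.
using Segment = std::string;

// Hypothetical named indices for the PushRequest layout described above.
enum PushSegment : std::size_t { kKeys = 0, kVals = 1, kLens = 2, kPushSegmentCount = 3 };

// Builds the data vector with exactly three segments, in protocol order.
inline std::vector<Segment> MakePushRequestSegments(Segment keys, Segment vals, Segment lens) {
  std::vector<Segment> data(kPushSegmentCount);
  data[kKeys] = std::move(keys);
  data[kVals] = std::move(vals);
  data[kLens] = std::move(lens);
  return data;
}
```

Building the vector through a single helper like this means a backend cannot accidentally emit one or two segments.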



Actual Behavior on NPU Port

The sender constructs 1 or 2 segments instead of 3.
As a result:

  1. msg_buf->data.size() is wrong

  2. SendRendezvousBegin sends a wrong data_num to peer

  3. Receiver fails the assertion (data_num != 3)

  4. Before disabling inline, this propagated as ibv_post_send errors (invalid WR state)


Root Cause

StepMesh currently relies on implicit, scattered assumptions for how many data segments each message type should contain.

The real contract is:

| Message Type | Expected data_num | Required Segments |
| -- | -- | -- |
| PushRequest | 3 | keys / vals / lens |
| PullRequest | 2 | keys / empty vals |
| PullResponse | 3 | keys / vals / lens |
| PushResponse | 0 | none |

For PushRequest (non-GDR), the NPU sender must construct exactly 3 segments.
Because this convention is not centralized or validated, it is easy for non-GPU backends to violate it and cause fatal RDMA failures.
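
As a rough idea of what "centralized" could look like, the contract from the table can live in one function that both sender and receiver consult. MsgKind, ExpectedDataNum and HasValidDataNum are hypothetical names for illustration only, not existing StepMesh symbols.

```cpp
#include <cstddef>

// Hypothetical message kinds; StepMesh may classify messages differently.
enum class MsgKind { kPushRequest, kPullRequest, kPullResponse, kPushResponse };

// Single source of truth for the expected data_num per message type,
// using the values from the table in this issue.
inline std::size_t ExpectedDataNum(MsgKind kind) {
  switch (kind) {
    case MsgKind::kPushRequest:  return 3;  // keys / vals / lens
    case MsgKind::kPullRequest:  return 2;  // keys / empty vals
    case MsgKind::kPullResponse: return 3;  // keys / vals / lens
    case MsgKind::kPushResponse: return 0;  // no data segments
  }
  return 0;  // not reached for valid enum values
}

// True when a message carries the number of segments its type requires.
inline bool HasValidDataNum(MsgKind kind, std::size_t data_num) {
  return data_num == ExpectedDataNum(kind);
}
```

A sender could call HasValidDataNum before starting the rendezvous and fail locally with a clear message, rather than tripping the assertion on the peer.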

Attachments

Relevant source locations:

rdma_transport.h – receives PushRequest and asserts data_num == 3

rdma_van.h – converts msg.data into MessageBuffer

rdma_utils.h – Rendezvous structures

van.cc – SendMsg() code path

Closing

This issue is not RDMA-hardware related.
After disabling inline writes, the crash clearly originates from:

PushRequest sender not constructing 3 data segments.

A centralized protocol definition or normalization step would fix the issue and make StepMesh portable beyond CUDA/GDR environments.
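
One possible shape for the normalization step is sketched below: pad a PushRequest to exactly three segments before the rendezvous message is built. This is an assumption-heavy illustration. Segment and NormalizePushRequest are made-up names, and whether the receiver tolerates an empty vals or lens segment is something I have not verified against rdma_transport.h.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative stand-in for one data segment (raw bytes); not a StepMesh type.
using Segment = std::string;

// Hypothetical sender-side normalization: pads a PushRequest to exactly
// three segments (keys, vals, lens). Assumes empty trailing segments are
// acceptable to the receiver, which is unverified.
inline void NormalizePushRequest(std::vector<Segment>& data) {
  constexpr std::size_t kExpected = 3;
  if (data.empty()) {
    throw std::invalid_argument("PushRequest needs at least a keys segment");
  }
  if (data.size() > kExpected) {
    throw std::invalid_argument("PushRequest has more than 3 segments");
  }
  data.resize(kExpected);  // appends empty segments for any that are missing
}
```

Throwing on an impossible layout keeps the failure on the sender, where the offending call site is visible, instead of surfacing later as a remote assertion.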
