Summary
While porting StepMesh to a non-CUDA, Ascend NPU (ARM64) environment, Push operations crash consistently.
Initially the run failed with:
ibv_post_send failed.
But after further instrumentation, the crash was traced to the following protocol-layer assertion:
rdma_transport.h: PS_CHECK_EQ(buffer_ctx->data_num, 3)
This confirms the failure is not caused by RDMA verbs, but by an incorrect PushRequest data layout on the sender side.
Environment
- Hardware: Ascend NPU (ARM64)
- Architecture: aarch64
- StepMesh version: master (2025-01)
- RDMA: RoCE
- No CUDA / no GDR support
How I Identified the Real Root Cause
The initial crash happened inside ibv_post_send(). To rule out inline-send problems (EINVAL is a common result when the inline size exceeds the device limit), I modified RDMAWriteWithImm to clear the inline flag unconditionally:
if (inline_write) {
  wr.send_flags |= IBV_SEND_INLINE;
}
// Force-disable inline (important for ruling out EINVAL)
wr.send_flags &= ~IBV_SEND_INLINE;
if (prev_wr == nullptr) {
  PS_CHECK_EQ(ibv_post_send(qp, &wr, &bad_wr), 0);
} else {
  prev_wr->next = &wr;
  PS_CHECK_EQ(ibv_post_send(qp, prev_wr, &bad_wr), 0);
}
After disabling inline writes, the failure moved deterministically to:
rdma_transport.h: PS_CHECK_EQ(buffer_ctx->data_num, 3)
This shows:
- RDMA verbs and QP are functional.
- Inline was not the real cause.
- The true bug is that the PushRequest sender is constructing an incorrect number of data segments.
Thus the crash is purely a protocol mismatch, not a hardware issue.
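For anyone repeating this experiment, a less blunt alternative to force-disabling inline is to gate the flag on the QP's inline limit. The helper below is only a sketch: CanSendInline is a hypothetical name, not part of StepMesh, and it assumes the caller has kept the max_inline_data value that ibv_create_qp() writes back into ibv_qp_init_attr.cap.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a StepMesh function).
// max_inline_data is assumed to be the value reported in
// ibv_qp_init_attr.cap.max_inline_data after ibv_create_qp().
inline bool CanSendInline(bool inline_requested,
                          std::size_t payload_len,
                          std::uint32_t max_inline_data) {
  // Inline is only safe when it was requested and the payload fits
  // within the limit the QP was actually created with.
  return inline_requested && payload_len <= max_inline_data;
}
```

In RDMAWriteWithImm this would replace the unconditional flag handling with something like `if (CanSendInline(inline_write, len, max_inline)) wr.send_flags |= IBV_SEND_INLINE;`, where len and max_inline are placeholders for values the surrounding code would have to supply.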
Expected Behavior
Per StepMesh internal RDMA protocol:
PushRequest (worker → server)
data_num = 3
segment[0] = keys
segment[1] = vals
segment[2] = lens
RecvPushRequest enforces this strictly:
PS_CHECK_EQ(buffer_ctx->data_num, 3);
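The same contract could be checked on the sender before anything is posted, so a bad layout fails with a readable message instead of a remote assertion. The following is a minimal sketch, not StepMesh code: MsgKind, ExpectedDataNum, and CheckDataNum are hypothetical names, and the only value taken from the source is the count of 3 for PushRequest.

```cpp
#include <cstddef>
#include <string>

// Hypothetical message classification; StepMesh's real one may differ.
enum class MsgKind { kPushRequest, kOther };

// Single place that states how many data segments a message carries.
// Returns 0 when this report does not specify a count for the kind.
inline std::size_t ExpectedDataNum(MsgKind kind) {
  return kind == MsgKind::kPushRequest ? 3 : 0;  // keys, vals, lens
}

// Returns an empty string when the count is acceptable, otherwise a
// diagnostic the sender can log before calling into the transport.
inline std::string CheckDataNum(MsgKind kind, std::size_t data_num) {
  const std::size_t expected = ExpectedDataNum(kind);
  if (expected == 0 || expected == data_num) return "";
  return "expected " + std::to_string(expected) +
         " data segments, got " + std::to_string(data_num);
}
```

Calling CheckDataNum(MsgKind::kPushRequest, msg.data.size()) in the send path would surface the NPU port's 1- or 2-segment messages at their origin.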
Actual Behavior on NPU Port
Sender constructs 1 or 2 segments, not 3.
As a result:
- msg_buf->data.size() is wrong
- SendRendezvousBegin sends a wrong data_num to the peer
- Receiver fails the assertion (data_num != 3)
- Before disabling inline, this propagated as ibv_post_send errors (invalid WR state)
Root Cause
StepMesh currently relies on implicit, scattered assumptions for how many data segments each message type should contain.
The real contract is:
For PushRequest (non-GDR), the NPU sender must construct exactly 3 segments.
Because this convention is not centralized or validated, it is easy for non-GPU backends to violate it and cause fatal RDMA failures.
Relevant source locations:
rdma_transport.h – receives PushRequest and asserts data_num == 3
rdma_van.h – converts msg.data into MessageBuffer
rdma_utils.h – Rendezvous structures
van.cc – SendMsg() code path
Closing
This issue is not RDMA-hardware related.
After disabling inline writes, the crash clearly originates from:
PushRequest sender not constructing 3 data segments.
A centralized protocol definition or normalization step would fix the issue and make StepMesh portable beyond CUDA/GDR environments.
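One possible shape for such a normalization step is sketched below. It is an illustration only: Segment and NormalizePushRequest are hypothetical names, and padding a missing lens segment with an empty one is an assumption that would need to be checked against what the server does with lens.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stand-in for one data segment; the real StepMesh type differs.
using Segment = std::vector<char>;

// keys, vals, lens: the count asserted in rdma_transport.h.
constexpr std::size_t kPushRequestSegments = 3;

// Hypothetical normalization for a non-GDR PushRequest.
// ASSUMPTION: a message with only keys and vals may be padded with an
// empty lens segment. Any other count is rejected locally.
inline std::vector<Segment> NormalizePushRequest(std::vector<Segment> data) {
  if (data.size() == 2) {
    data.emplace_back();  // append an empty lens segment
  }
  if (data.size() != kPushRequestSegments) {
    throw std::invalid_argument(
        "PushRequest needs 3 segments (keys, vals, lens), got " +
        std::to_string(data.size()));
  }
  return data;
}
```

Running this before msg.data is converted into a MessageBuffer would mean the sender either sends the 3-segment layout or fails immediately with a message that names the problem.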