Open
Labels: enhancement (New feature or request)
Description
After a deep dive into the code, we noticed that each worker calls kv_.SendMsg(msg, i) once for each server. Inside SendMsg, the message is only sent to the endpoint that has the same local GPU id as the sender:
void SendMsg(Message& msg, int dst) {
  int group_server_rank = dst;
  int instance_server_id = postoffice_->GroupServerRankToInstanceID(
      group_server_rank, instance_idx_);
  msg.meta.app_id = obj_->app_id();
  msg.meta.customer_id = obj_->customer_id();
  msg.meta.recver = instance_server_id;
  postoffice_->van()->Send(msg);
}
When I change instance_server_id so that it targets a different local GPU index, as below:
int instance_server_id = postoffice_->GroupServerRankToInstanceID(
    group_server_rank, (instance_idx_ + 1) % postoffice_->group_size());
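To make the effect of this change concrete, here is a minimal sketch of the index arithmetic only. The helper names `SameInstanceIdx` and `ShiftedInstanceIdx` are hypothetical and are not part of StepMesh; they model just the second argument passed to `GroupServerRankToInstanceID`, not what that function does internally.

```cpp
#include <cassert>

// Hypothetical helpers (not StepMesh API). They only compute which local GPU
// index is handed to GroupServerRankToInstanceID as its second argument.

// Original code path: target the server instance with the same local GPU index.
inline int SameInstanceIdx(int instance_idx, int group_size) {
  assert(group_size > 0 && instance_idx >= 0 && instance_idx < group_size);
  return instance_idx;
}

// Modified code path: target the next local GPU index, wrapping around at
// group_size, i.e. (instance_idx_ + 1) % group_size().
inline int ShiftedInstanceIdx(int instance_idx, int group_size) {
  assert(group_size > 0 && instance_idx >= 0 && instance_idx < group_size);
  return (instance_idx + 1) % group_size;
}
```

Assuming a group size of 8, the worker instance on local GPU 0 would now address the server instance on local GPU 1, and the one on local GPU 7 would wrap around to local GPU 0.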
I get the following error log about an RDMA transport failure:
terminate called after throwing an instance of 'dmlc::Error'
what(): [03:11:48] /infrawaves/StepMesh/src/./rdma_van.h:777: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status
Work Request Flushed Error 5 140016533081976 249 0 OTHER postoffice ptr: 0x56251093a210
Stack trace returned 6 entries:
[bt] (0) /infrawaves/StepMesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(+0x53daf) [0x7f5a41101daf]
[bt] (1) /infrawaves/StepMesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(+0x540e3) [0x7f5a411020e3]
[bt] (2) /infrawaves/StepMesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(ps::RDMAVan::PollCQ()+0x4ed) [0x7f5a411708dd]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5c3b6b0253]
[bt] (4) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f5c9b2aaac3]
[bt] (5) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f5c9b33c850]
terminate called after throwing an instance of 'dmlc::Error'
what(): [03:11:48] /infrawaves/StepMesh/src/./rdma_van.h:777: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status
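For reading the log, the number 5 printed after "Work Request Flushed Error" matches `IBV_WC_WR_FLUSH_ERR` in the libibverbs `enum ibv_wc_status`. The sketch below is a local mirror of the first few enum values so it compiles without the RDMA headers; `WcStatusName` is a hypothetical helper, and real code would use `ibv_wc_status_str()` from `<infiniband/verbs.h>` instead.

```cpp
#include <string>

// Hypothetical helper (not StepMesh or libibverbs API): maps the numeric
// completion status from the log to its libibverbs enum name.
// Only the first six values of enum ibv_wc_status are mirrored here.
inline std::string WcStatusName(int status) {
  switch (status) {
    case 0: return "IBV_WC_SUCCESS";
    case 1: return "IBV_WC_LOC_LEN_ERR";
    case 2: return "IBV_WC_LOC_QP_OP_ERR";
    case 3: return "IBV_WC_LOC_EEC_OP_ERR";
    case 4: return "IBV_WC_LOC_PROT_ERR";
    case 5: return "IBV_WC_WR_FLUSH_ERR";
    default: return "NOT_LISTED_IN_THIS_SKETCH";
  }
}
```

A flush error is reported for work requests that were still outstanding when the queue pair entered the error state, so it is usually a consequence of an earlier failure on that queue pair rather than the root cause itself.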
So I wonder what causes this error, and whether StepMesh only connects endpoints between GPUs that share the same local GPU id. If not, why does this small change cause the system to abort?