[Question] Why is communication between Worker and Server restricted to GPUs with the same local id? #31

@Zhangmj0621

After a deep dive into the code, we noticed that each worker calls kv_.SendMsg(msg, i) once for each server. Inside that function, the message is only sent to the endpoint that has the same local GPU id:

void SendMsg(Message& msg, int dst) {
  int group_server_rank = dst;
  int instance_server_id = postoffice_->GroupServerRankToInstanceID(
      group_server_rank, instance_idx_);

  msg.meta.app_id = obj_->app_id();
  msg.meta.customer_id = obj_->customer_id();
  msg.meta.recver = instance_server_id;
  postoffice_->van()->Send(msg);
}

When I change instance_server_id to target a different GPU index, as below:

int instance_server_id = postoffice_->GroupServerRankToInstanceID(
        group_server_rank, (instance_idx_+1)%postoffice_->group_size());
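To make the difference between the two variants concrete, here is a minimal standalone sketch. It assumes, purely for illustration, that instance IDs within a server group are laid out contiguously as group_server_rank * group_size + instance_idx. That layout and the names ModelInstanceId, SameIndexTarget and NextIndexTarget are hypothetical; this is not StepMesh's actual GroupServerRankToInstanceID implementation, which may map IDs differently.

```cpp
// Hypothetical model only: the real GroupServerRankToInstanceID in StepMesh
// may use a different layout. Assumption: each server group owns a contiguous
// block of group_size instance IDs.
constexpr int ModelInstanceId(int group_server_rank, int instance_idx,
                              int group_size) {
  return group_server_rank * group_size + instance_idx;
}

// Original SendMsg behaviour: the destination uses the sender's own local index.
constexpr int SameIndexTarget(int group_server_rank, int instance_idx,
                              int group_size) {
  return ModelInstanceId(group_server_rank, instance_idx, group_size);
}

// Modified behaviour from this issue: the destination uses the next local
// index, wrapping around at group_size.
constexpr int NextIndexTarget(int group_server_rank, int instance_idx,
                              int group_size) {
  return ModelInstanceId(group_server_rank, (instance_idx + 1) % group_size,
                         group_size);
}

// With 8 instances per group, local index 7 targeting server group 0:
static_assert(SameIndexTarget(0, 7, 8) == 7, "same local index is kept");
static_assert(NextIndexTarget(0, 7, 8) == 0, "next index wraps to 0");
// Local index 3 targeting server group 1:
static_assert(SameIndexTarget(1, 3, 8) == 11, "same local index is kept");
static_assert(NextIndexTarget(1, 3, 8) == 12, "next index is one higher");
```

Under this assumed layout, the modified call always addresses a server instance whose local index differs from the sender's, which is the case that triggers the failure.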

I get an RDMA transport failure in the log, as shown below:

terminate called after throwing an instance of 'dmlc::Error'
  what():  [03:11:48] /infrawaves/StepMesh/src/./rdma_van.h:777: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status 
Work Request Flushed Error 5 140016533081976 249 0 OTHER postoffice ptr: 0x56251093a210

Stack trace returned 6 entries:
[bt] (0) /infrawaves/StepMesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(+0x53daf) [0x7f5a41101daf]
[bt] (1) /infrawaves/StepMesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(+0x540e3) [0x7f5a411020e3]
[bt] (2) /infrawaves/StepMesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(ps::RDMAVan::PollCQ()+0x4ed) [0x7f5a411708dd]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5c3b6b0253]
[bt] (4) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f5c9b2aaac3]
[bt] (5) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f5c9b33c850]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [03:11:48] /infrawaves/StepMesh/src/./rdma_van.h:777: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status 

So I wonder what causes this error, and whether StepMesh only connects endpoints between GPUs with the same local GPU id. If not, why does this small change cause the system to abort?
