[Question] Why is communication between Worker and Server restricted to GPUs with the same local ID?

After digging into the code, we noticed that each worker calls kv_.SendMsg(msg, i) once for each server. Inside SendMsg, the message is only sent to the endpoint that has the same local GPU ID as the sender.

```
void SendMsg(Message& msg, int dst) {
    int group_server_rank = dst;
    int instance_server_id = postoffice_->GroupServerRankToInstanceID(
        group_server_rank, instance_idx_);

    msg.meta.app_id = obj_->app_id();
    msg.meta.customer_id = obj_->customer_id();
    msg.meta.recver = instance_server_id;
    postoffice_->van()->Send(msg);
  }
```

When I change instance_server_id to target a different GPU index, as below:

```
int instance_server_id = postoffice_->GroupServerRankToInstanceID(
        group_server_rank, (instance_idx_+1)%postoffice_->group_size());
```
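To make the effect of this change concrete, here is a small self-contained sketch. `ToyInstanceId` is a hypothetical stand-in, not the real `GroupServerRankToInstanceID` (whose implementation is not shown in this issue); it only assumes that each (group rank, instance index) pair maps to a distinct ID. The point is that the modified expression targets the next local GPU slot and wraps around at the group size.

```cpp
#include <cstdio>

// Hypothetical mapping: assumes IDs are laid out contiguously, with
// group_size instances per server group. NOT the real StepMesh logic.
constexpr int ToyInstanceId(int group_server_rank, int instance_idx,
                            int group_size) {
  return group_server_rank * group_size + instance_idx;
}

// Index used by the original SendMsg: the sender's own local index.
constexpr int SameIndex(int instance_idx) { return instance_idx; }

// Index used by the modified SendMsg: next local index, wrapping around.
constexpr int NextIndex(int instance_idx, int group_size) {
  return (instance_idx + 1) % group_size;
}

// Print the original and modified targets for every local index.
inline void PrintTargets(int group_server_rank, int group_size) {
  for (int idx = 0; idx < group_size; ++idx) {
    std::printf("local idx %d: original target %d, modified target %d\n", idx,
                ToyInstanceId(group_server_rank, SameIndex(idx), group_size),
                ToyInstanceId(group_server_rank, NextIndex(idx, group_size),
                              group_size));
  }
}
```

Calling `PrintTargets(0, 8)` lists, for each local index, which toy ID the original and the modified code would address. If connections are only set up between peers with equal local indices (an assumption this issue is asking about, not a confirmed fact), every modified target would be an endpoint without an established connection.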

I get an error log reporting an RDMA transport failure:

```
terminate called after throwing an instance of 'dmlc::Error'
  what():  [03:11:48] /infrawaves/StepMesh/src/./rdma_van.h:777: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status 
Work Request Flushed Error 5 140016533081976 249 0 OTHER postoffice ptr: 0x56251093a210

Stack trace returned 6 entries:
[bt] (0) /infrawaves/StepMesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(+0x53daf) [0x7f5a41101daf]
[bt] (1) /infrawaves/StepMesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(+0x540e3) [0x7f5a411020e3]
[bt] (2) /infrawaves/StepMesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(ps::RDMAVan::PollCQ()+0x4ed) [0x7f5a411708dd]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5c3b6b0253]
[bt] (4) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f5c9b2aaac3]
[bt] (5) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f5c9b33c850]

terminate called after throwing an instance of '
dmlc::Error'
  what():  [03:11:48] /infrawaves/StepMesh/src/./rdma_van.h:777: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status 
```
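For readers of the log: the value 5 after "Work Request Flushed Error" matches `IBV_WC_WR_FLUSH_ERR` in the `ibv_wc_status` enum from libibverbs. A flush error is reported for work requests that were still outstanding when their queue pair entered the error state, so it is often a follow-on symptom rather than the first failure. The sketch below mirrors the first few enum values locally so that it builds without `<infiniband/verbs.h>`; the names `Describe` and `IsLikelySecondary` are illustrative helpers, not part of StepMesh or libibverbs.

```cpp
#include <string>

// Map a raw completion status code to the corresponding ibv_wc_status
// enumerator name (first six values only; mirrored by hand).
inline std::string Describe(int status) {
  switch (status) {
    case 0: return "IBV_WC_SUCCESS";
    case 1: return "IBV_WC_LOC_LEN_ERR";
    case 2: return "IBV_WC_LOC_QP_OP_ERR";
    case 3: return "IBV_WC_LOC_EEC_OP_ERR";
    case 4: return "IBV_WC_LOC_PROT_ERR";
    case 5: return "IBV_WC_WR_FLUSH_ERR";
    default: return "other (not mirrored in this sketch)";
  }
}

// A flushed work request usually means the queue pair had already moved
// to the error state, so an earlier completion may hold the root cause.
inline bool IsLikelySecondary(int status) { return status == 5; }
```

When debugging, it may help to look for the first completion with a status other than 5 on either side of the connection, since that one is more likely to point at the original failure.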

So I wonder what causes this error, and whether StepMesh only connects endpoints between GPUs with the same local GPU ID. If not, why does this small change cause the system to abort?

[Question] Why is communication between Worker and Server restricted to GPUs with the same local ID? #31
