Description
I ran the run_single_gpu.sh example, but got this error (output from the scheduler, server, and worker processes is interleaved):
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 29.27.177.139
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:601: Bind to [role=scheduler, id=1, ip=29.27.177.139, port=8123, is_recovery=0, aux_id=-1, num_ports=1]
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 1, My_Node=1
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7162
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7163
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7164
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 1 with Transport=RDMA QP_NUM 7164
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7165
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 1 with Transport=RDMA QP_NUM 7165
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:562: interface and ip from env: enp194s0f1 (29.27.177.139)
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 29.27.177.139
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:601: Bind to [role=worker, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=-1, num_ports=1]
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 1, My_Node=32767
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:562: interface and ip from env: enp194s0f1 (29.27.177.139)
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 29.27.177.139
[[16:07:3216:07:32] ] workerserver 00 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc::198601: : qp created: pd=Bind to 0x4001dc001030[role=server, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=-1, num_ports=1] , cq=
0x4001dc001160[, qp=16:07:327166]
server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 1, My_Node=32767
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7167
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030[ , cq=16:07:320x4001dc001160] , qp=scheduler7168
0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7169
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7170
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 32767 with Transport=RDMA QP_NUM 7169
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7171
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 32767 with Transport=RDMA QP_NUM 7171
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 32767 OnConnected to 1
could not set CPU affinity: gpu 0-> cpu18
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=[0x40030000116016:07:32, qp=] 7172worker
0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 32767 OnConnected to 1
[could not set CPU affinity: gpu 0-> cpu17[
16:07:3216:07:32] ] schedulerworker 00 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc::731701: : rdmardma 132767 received: sent: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=worker, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=worker, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:312: AddNode (1/2): [role=worker, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1]
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 32767 with Transport=RDMA QP_NUM 7172
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7173
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 32767 with Transport=RDMA QP_NUM 7173
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 32767 OnConnected to 1
could not set CPU affinity: gpu 0-> cpu13
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 32767
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 32767
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 32767
[16:07:32[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:] 1097server: 10 OnConnected to 32767/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h
:1097: 32767 OnConnected to 1
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
could not set CPU affinity: gpu 0-> cpu12
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc[:16:07:32701] : schedulerrdma 032767 sent: /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=server, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0731
: rdma 1 received: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=server, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:211: rank detected for node [role=server, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1]
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:211: rank detected for node [role=worker, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1]
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:262: assign id=8 to node [role=server, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1]
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 8, My_Node=1
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7174
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7175
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7176
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 32767 OnConnect to 1 with Transport=RDMA QP_NUM 7176
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7177
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 32767 OnConnect to 1 with Transport=RDMA QP_NUM 7177
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 8
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 32767 OnConnected to 1
[16:07:32[16:07:32] server 0 ] /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.hscheduler: 10970: 32767/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h OnConnected to :11097
: 1 OnConnected to 8
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:262: assign id=9 to node [role=worker, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1]
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 9, My_Node=1
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7178
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7179
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7180
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 32767 OnConnect to 1 with Transport=RDMA QP_NUM 7180
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7181
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 32767 OnConnect to 1 with Transport=RDMA QP_NUM 7181
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 9
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 32767 OnConnected to 1
[[16:07:3316:07:33] ] schedulerworker 00 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h::10971097: : 132767 OnConnected to OnConnected to 91
[[16:07:3316:07:33] ] schedulerworker 00 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc::701731: : rdmardma 132767 sent: received: ? => 9. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=server, id=8, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1] [role=worker, id=9, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1] [role=scheduler, id=1, ip=29.27.177.139, port=8123, is_recovery=0, aux_id=-1, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 01 => 32767. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=server, id=8, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1] [role=worker, id=9, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1] [role=scheduler, id=1, ip=29.27.177.139, port=8123, is_recovery=0, aux_id=-1, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] worker [[016:07:3316:07:33 ] ] /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.hschedulerserver: 17800: Connecting to Node /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h8::, My_Node=7014359: :
rdmaSendRendezvousReply (GDR Server): meta_len= 19441, data_len= sent: 0
? => 8. Meta: request=0, timestamp=1, control={ cmd=ADD_NODE, node={ [role=server, id=8, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1] [role=worker, id=9, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1] [role=scheduler, id=1, ip=29.27.177.139, port=8123, is_recovery=0, aux_id=-1, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc[:16:07:33286] : serverThe scheduler is connected to 10 workers and 1/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h servers:
490: GDR Server Reply: meta_addr=400354010000, meta_rkey=247dfb, data_addr=0, data_rkey=0
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:731: rdma 32767 received: 1 => 32767. Meta: request=0, timestamp=1, control={ cmd=ADD_NODE, node={ [role=server, id=8, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1] [role=worker, id=9, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1] [role=scheduler, id=1, ip=29.27.177.139, port=8123, is_recovery=0, aux_id=-1, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 8, My_Node=8
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7182
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:701: rdma 1 sent: [16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:731: rdma 1 received: ? => 1. Meta: request=1, timestamp=2, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0
1 => 1. Meta: request=1, timestamp=2, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:385: Instance barrier count for 7 : 1
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7183
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7184
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7185
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7186
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 8 OnConnect to 9 with Transport=RDMA QP_NUM 7186
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7187
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 8 OnConnect to 9 with Transport=RDMA QP_NUM 7187
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 9 OnConnected to 8
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7188
[[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 9 OnConnected to 8
16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 9, My_Node=9
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7189
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 8 OnConnect to 8 with Transport=RDMA QP_NUM 7188
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7190
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 8 OnConnect to 8 with Transport=RDMA QP_NUM 7190
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 8 OnConnected to 9
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 8 OnConnected to 9
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 9, My_Node=8
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7191
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7192
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7193
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 9 OnConnect to 9 with Transport=RDMA QP_NUM 7192
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7194
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 9 OnConnect to 9 with Transport=RDMA QP_NUM 7194
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7195
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7196
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 9 OnConnect to 8 with Transport=RDMA QP_NUM 7196
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 1, My_Node=9
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 8 OnConnected to 9
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7197
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 9 OnConnect to 8 with Transport=RDMA QP_NUM 7197
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 9 OnConnected to 8
[[16:07:33] worker[ 16:07:330] server/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h :01097 : /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h9: OnConnected to 10978:
8 OnConnected to 9
16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 1, My_Node=8
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7198
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7199
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7200
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7201
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7202
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 9 with Transport=RDMA QP_NUM 7201
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7203
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 9 with Transport=RDMA QP_NUM 7203
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 9 OnConnected to 1
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7204
[[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 9 OnConnected to 1
16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:520: W[9] is connected to others
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 8 with Transport=RDMA QP_NUM 7204
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7205
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 8 with Transport=RDMA QP_NUM 7205
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 8 OnConnected to 1
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 9
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 9
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 8
[16:07:33[16:07:33] scheduler 0] server/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h :01097 : /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h1: OnConnected to 10978:
8 OnConnected to 1
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:520: S[8] is connected to others
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:[70116:07:33: ] rdmascheduler 90 sent: ? => 1. Meta: request=1, timestamp=1, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc
:731: rdma 1 received: 9 => 1. Meta: request=1, timestamp=1, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:385: Instance barrier count for 7 : 2
[16:07:33] server 0 [/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc16:07:33:] 701scheduler: rdma0 8/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc sent: :731? => 1. Meta: request=1, timestamp=1, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0:
rdma 1 received: 8 => 1. Meta: request=1, timestamp=1, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:385: Instance barrier count for 7 : 3
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:701: rdma 1[ sent: 16:07:33? => 9. Meta: request=0, timestamp=3, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0]
worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:[731[16:07:33: 16:07:33] rdma] scheduler server 9 0 received: 0 1 => 9. Meta: request=0, timestamp=3, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc
/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h::701435: : rdmaSendRendezvousReply (GDR Server): meta_len= 2881, data_len= sent: 0? => 8. Meta: request=0, timestamp=4, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0
[[16:07:3316:07:33] ] serverscheduler 00 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc::490701: : GDR Server Reply: meta_addr=rdma400354030000 , meta_rkey=13c252a sent: , data_addr=? => 1. Meta: request=0, timestamp=5, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 00
, data_rkey=[016:07:33
] scheduler[ 16:07:330] server/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc :0731 : /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.ccrdma: 7311: received: rdma1 => 1. Meta: request=0, timestamp=5, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0
8 received: 1 => 8. Meta: request=0, timestamp=4, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:701: rdma 1 sent: [? => 1. Meta: request=1, timestamp=6, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0
16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:731: rdma 1 received: 1 => 1. Meta: request=1, timestamp=6, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:385: Instance barrier count for 7 : 1
KVWorker instance_idx,0
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/ps/kv_app.h:116: Enable worker zero-copy pull
could not set CPU affinity: gpu 0-> cpu19[
16:07:33[] [16:07:33server16:07:33] ] workerscheduler 000 /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/ps/af_tensor_app.h/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:::447701731: : : AFTensorServer runs on gpu rdmardma0
91 sent: received: ? => 1. Meta: request=1, timestamp=2, control={ cmd=INSTANCE_BARRIER, barrier_group=6 }. NOT DATA MSG!, Slave QP Count: 09 => 1. Meta: request=1, timestamp=2, control={ cmd=INSTANCE_BARRIER, barrier_group=6 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:385: Instance barrier count for 6 : 1
could not set CPU affinity: gpu 0-> cpu11
[[16:07:33] [server16:07:33 ] 0scheduler /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/ps/af_tensor_app.h0: 604/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc: :Start ResponseWorker 7310:
rdma16:07:33 ] 1server received: 8 => 1. Meta: request=1, timestamp=2, control={ cmd=INSTANCE_BARRIER, barrier_group=6 }. NOT DATA MSG!, Slave QP Count: 00
[/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc16:07:33:] 701scheduler: rdma0 8/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc sent: :? => 1. Meta: request=1, timestamp=2, control={ cmd=INSTANCE_BARRIER, barrier_group=6 }. NOT DATA MSG!, Slave QP Count: 0385
: Instance barrier count for 6 : 2
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:701: rdma[ 16:07:331] sent: worker? => 9. Meta: request=0, timestamp=7, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0
0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:731: rdma[[ 16:07:3316:07:339] ] received: schedulerserver1 => 9. Meta: request=0, timestamp=7, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0
00 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h::701435: : rdmaSendRendezvousReply (GDR Server): meta_len= 2881, data_len= sent: 0? => 8. Meta: request=0, timestamp=8, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:490: GDR Server Reply: meta_addr=400354050000, meta_rkey=3c454b, data_addr=0, data_rkey=0
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:731: rdma 8 received: 1 => 8. Meta: request=0, timestamp=8, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/ps/af_tensor_app.h:249: ZPush_ addr: 0x4001db800000 val_len: 32768
[[16:07:3316:07:33] ] workerserver 00 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h::701435: : rdmaSendRendezvousReply (GDR Server): meta_len= 2969, data_len= sent: 32776? => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, sid=0, head=1, key=0, dtype={ UINT64 OTHER }, Slave QP Count: 0 Body: { CPU(0)->CPU(0) data_size=[8,32768,] }
[16:07:33] worker [016:07:33 ] /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/ps/af_tensor_app.hserver: 2490: ZPush_ addr: 0x/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h4001db808000: val_len: 45032768:
Alloc new gpu buffer: key=0, size=32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:452: GPU buffer allocated: key=0, addr=70377401233648, size=32768
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:701: rdma 9 sent: ? => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, sid=0, head=2, key=1, dtype={ UINT64 OTHER }, Slave QP Count: 0 Body: { CPU(0)->CPU(0) data_size=[8,32768,] }
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/ps/af_tensor_app.h:249: ZPush_ addr: 0x4001db810000 val_len: 32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:202: Allocated new GPU buffer for key=0 addr=0x40034bc08000 size=32768
[[16:07:3316:07:33] ] serverworker 00 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc::490701: : GDR Server Reply: meta_addr=rdma400354070000 , meta_rkey=93c494d sent: , data_addr=? => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=1, sid=0, head=2, key=2, dtype={ UINT64 OTHER }, Slave QP Count: 0 Body: { CPU(0)->CPU(0) data_size=[8,32768,] }40034bc08000
, data_rkey=3c5358
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:435: SendRendezvousReply (GDR Server): meta_len=296, data_len=32776
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:450: Alloc new gpu buffer: key=1, size=32768
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:701: rdma 9 sent: ? => 8. Meta: request=1, timestamp=3, app_id=0, customer_id=0, simple_app=0, push=0, sid=0, head=3, key=0, dtype={ UINT64 OTHER }, Slave QP Count: 0 Body: { CPU(0)->CPU(0) data_size=[8,32768,] }
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:452: GPU buffer allocated: key=1, addr=70377401235776, size=32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:202: Allocated new GPU buffer for key=1 addr=0x40034bc10000 size=32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:490: GDR Server Reply: meta_addr=400354090000, meta_rkey=3c5a5e, data_addr=40034bc10000, data_rkey=3c5d62
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:435: SendRendezvousReply (GDR Server): meta_len=296, data_len=32776
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:450: Alloc new gpu buffer: key=2, size=32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:452: GPU buffer allocated: key=2, addr=70377401237008, size=32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:202: Allocated new GPU buffer for key=2 addr=0x40034bc18000 size=32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:490: GDR Server Reply: meta_addr=4003540b0000, meta_rkey=3c5f64, data_addr=40034bc18000, data_rkey=3c656b
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/dmlc/logging.h:301: [16:07:33] /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:785: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status
local protection error 4 0 81 128 RECV postoffice ptr: 0xaaaecfb860f0
Stack trace returned 6 entries:
[bt] (0) /home/bingxing2/home/scx7753/mxy/StepMesh-main/fserver_lib.cpython-310-aarch64-linux-gnu.so(+0x63a00) [0x400193063a00]
[bt] (1) /home/bingxing2/home/scx7753/mxy/StepMesh-main/fserver_lib.cpython-310-aarch64-linux-gnu.so(+0x63d14) [0x400193063d14]
[bt] (2) /home/bingxing2/home/scx7753/mxy/StepMesh-main/fserver_lib.cpython-310-aarch64-linux-gnu.so(ps::RDMAVan::PollCQ()+0x2b0) [0x4001930c44a0]
[bt] (3) /home/bingxing2/apps/anaconda/2021.11/envs/py310torch251cu121/lib/libstdc++.so.6(+0xdd5ec) [0x4000fa91d5ec]
[bt] (4) /usr/lib64/libpthread.so.0(+0x87ac) [0x4000d81e87ac]
[bt] (5) /usr/lib64/libc.so.6(+0xd60fc) [0x4000d84260fc]
terminate called after throwing an instance of 'dmlc::Error'
what(): [16:07:33] /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:785: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status
local protection error 4 0 81 128 RECV postoffice ptr: 0xaaaecfb860f0
Stack trace returned 6 entries:
[bt] (0) /home/bingxing2/home/scx7753/mxy/StepMesh-main/fserver_lib.cpython-310-aarch64-linux-gnu.so(+0x63a00) [0x400193063a00]
[bt] (1) /home/bingxing2/home/scx7753/mxy/StepMesh-main/fserver_lib.cpython-310-aarch64-linux-gnu.so(+0x63d14) [0x400193063d14]
[bt] (2) /home/bingxing2/home/scx7753/mxy/StepMesh-main/fserver_lib.cpython-310-aarch64-linux-gnu.so(ps::RDMAVan::PollCQ()+0x2b0) [0x4001930c44a0]
[bt] (3) /home/bingxing2/apps/anaconda/2021.11/envs/py310torch251cu121/lib/libstdc++.so.6(+0xdd5ec) [0x4000fa91d5ec]
[bt] (4) /usr/lib64/libpthread.so.0(+0x87ac) [0x4000d81e87ac]
[bt] (5) /usr/lib64/libc.so.6(+0xd60fc) [0x4000d84260fc]
/var/spool/slurmd/job1063841/slurm_script: line 57: 3072967 Aborted DMLC_ROLE=worker numactl -m 0 python3 $THIS_DIR/$BIN.py $@
+ cleanup
+ echo 'kill all testing process of ps lite for user scx7753'
My script is:
function cleanup() {
echo "kill all testing process of ps lite for user $USER" # cleanup function: force-kill every process whose name contains test_remote_moe or test_fserver
# pkill -9 -f test_bench
pkill -9 -f test_remote_moe
pkill -9 -f test_fserver
sleep 1
}
trap cleanup EXIT # run cleanup when the script exits
# cleanup
# common setup
export ROLE=joint
export RNIC=enp194s0f1
export CUDA_VISIBLE_DEVICES=0,1,2,3
# common setup
export BIN=${BIN:-test_fserver}
# export DMLC_INTERFACE=${RNIC:-brainpf_bond0}
export SCHEDULER_IP=$(ip -o -4 addr | grep ${RNIC} | awk '{print $4}' | cut -d'/' -f1)
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=$SCHEDULER_IP # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=8123 # scheduler's port (can be chosen arbitrarily)
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_INTERFACE=auto
# export STEPMESH_BIND_CPU_CORE=1
export DMLC_NODE_HOST=${SCHEDULER_IP}
# export DMLC_INTERFACE=auto
export STEPMESH_SPLIT_QP_LAG=0
export STEPMESH_BIND_CPU_CORE=1
export STEPMESH_GPU=0
export PS_VERBOSE=2
DMLC_ROLE=scheduler numactl -m 0 python3 $THIS_DIR/$BIN.py &
export STEPMESH_CPU_START_OFFSET=10
DMLC_ROLE=server numactl -m 0 python3 $THIS_DIR/$BIN.py $@ &
# DMLC_ROLE=worker python3 $THIS_DIR/$BIN.py $@ &
# export STEPMESH_DROP_RATE=1
export STEPMESH_CPU_START_OFFSET=15
DMLC_ROLE=worker numactl -m 0 python3 $THIS_DIR/$BIN.py $@
wait
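One thing that might be worth ruling out is the `SCHEDULER_IP` lookup in the script: `grep ${RNIC}` matches substrings, so it can return no address or several. The following is only a sketch of a stricter lookup, not something taken from StepMesh. `iface_ipv4` and `require_single_value` are hypothetical helper names, and the first one assumes it receives `ip -o -4 addr` output on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of StepMesh): print the first IPv4 address
# of exactly the named interface, reading `ip -o -4 addr` output from stdin.
iface_ipv4() {
    local ifname="$1"
    awk -v ifname="$ifname" '
        # In `ip -o` output: $2 is the interface name, $4 is ADDR/PREFIX.
        $2 == ifname && $3 == "inet" {
            split($4, parts, "/")
            print parts[1]
            exit
        }'
}

# Hypothetical helper: succeed only if the argument is one non-empty line.
require_single_value() {
    [ -n "$1" ] && [ "$(printf '%s\n' "$1" | wc -l | tr -d ' ')" -eq 1 ]
}
```

In the script this could be used as `SCHEDULER_IP=$(ip -o -4 addr | iface_ipv4 "$RNIC")`, followed by `require_single_value "$SCHEDULER_IP" || exit 1` before the roles are launched.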
My machine is an A100 platform with 4 GPUs per node. Here is the NIC information:
$ ibdev2netdev
mlx5_0 port 1 ==> enp194s0f0 (Up)
mlx5_1 port 1 ==> enp194s0f1 (Up)
mlx5_2 port 1 ==> enp226s0f0 (Up)
mlx5_3 port 1 ==> enp226s0f1 (Up)
$ lsmod | grep nvidia_peermem
nvidia_peermem 262144 0
ib_core 589824 9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia 57081856 109 nvidia_uvm,nvidia_peermem,nvidia_modeset
$ ip addr show enp194s0f1
4: enp194s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4500 qdisc mq state UP group default qlen 1000
link/ether 08:c0:eb:89:8a:fb brd ff:ff:ff:ff:ff:ff
inet 29.27.26.104/16 brd 29.27.255.255 scope global noprefixroute enp194s0f1
valid_lft forever preferred_lft forever
I also tested RDMA with ib_write_bw, and the test passed.
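Because the node exposes four mlx5 ports, it may help to confirm that the ib_write_bw run used the HCA that backs `RNIC`. This is a minimal sketch, assuming `ibdev2netdev` output in the format shown above arrives on stdin. `ibdev_for_netdev` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the IB device that backs a given netdev,
# reading `ibdev2netdev` output (e.g. "mlx5_1 port 1 ==> enp194s0f1 (Up)")
# from stdin. Prints nothing if the netdev is not listed.
ibdev_for_netdev() {
    awk -v netdev="$1" '$4 == "==>" && $5 == netdev { print $1; exit }'
}
```

For example, `dev=$(ibdev2netdev | ibdev_for_netdev "$RNIC")` should yield the device name to pass to `ib_write_bw -d "$dev"`, so the bandwidth test exercises the same port as the script.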