Issues with the single-node example #44

@menggerSherry

I ran the run_single_gpu.sh example, but it failed with the following output:

[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 29.27.177.139
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:601: Bind to [role=scheduler, id=1, ip=29.27.177.139, port=8123, is_recovery=0, aux_id=-1, num_ports=1]
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 1, My_Node=1
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7162
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7163
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7164
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 1 with Transport=RDMA QP_NUM 7164
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7165
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 1 with Transport=RDMA QP_NUM 7165
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:562:  interface and ip from env: enp194s0f1 (29.27.177.139)
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 29.27.177.139
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:601: Bind to [role=worker, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=-1, num_ports=1]
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 1, My_Node=32767
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:562:  interface and ip from env: enp194s0f1 (29.27.177.139)
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 29.27.177.139
[[16:07:3216:07:32] ] workerserver  00  /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc::198601: : qp created: pd=Bind to 0x4001dc001030[role=server, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=-1, num_ports=1] , cq=
0x4001dc001160[, qp=16:07:327166] 
server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 1, My_Node=32767
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7167
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030[ , cq=16:07:320x4001dc001160] , qp=scheduler7168 
0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7169
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7170
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 32767 with Transport=RDMA QP_NUM 7169
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7171
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 32767 with Transport=RDMA QP_NUM 7171
[16:07:32] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 32767 OnConnected to 1
could not set CPU affinity: gpu 0-> cpu18
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=[0x40030000116016:07:32, qp=] 7172worker
 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 32767 OnConnected to 1
[could not set CPU affinity: gpu 0-> cpu17[
16:07:3216:07:32] ] schedulerworker  00  /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc::731701: : rdmardma  132767	received: 	sent: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=worker, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=worker, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0

[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:312: AddNode (1/2): [role=worker, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1]
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 32767 with Transport=RDMA QP_NUM 7172
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7173
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 32767 with Transport=RDMA QP_NUM 7173
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 32767 OnConnected to 1
could not set CPU affinity: gpu 0-> cpu13
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 32767
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 32767
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 32767
[16:07:32[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:] 1097server:  10 OnConnected to  32767/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h
:1097: 32767 OnConnected to 1
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
could not set CPU affinity: gpu 0-> cpu12
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc[:16:07:32701] : schedulerrdma  032767 	sent: /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=server, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0731
: rdma 1	received: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=server, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:211: rank detected for node [role=server, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1]
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:211: rank detected for node [role=worker, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1]
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:262: assign id=8 to node [role=server, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1]
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 8, My_Node=1
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7174
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7175
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7176
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 32767 OnConnect to 1 with Transport=RDMA QP_NUM 7176
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7177
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 32767 OnConnect to 1 with Transport=RDMA QP_NUM 7177
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 8
[16:07:32] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 32767 OnConnected to 1
[16:07:32[16:07:32] server 0 ] /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.hscheduler: 10970:  32767/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h OnConnected to :11097
: 1 OnConnected to 8
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:262: assign id=9 to node [role=worker, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1]
[16:07:32] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 9, My_Node=1
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7178
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7179
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7180
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 32767 OnConnect to 1 with Transport=RDMA QP_NUM 7180
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7181
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 32767 OnConnect to 1 with Transport=RDMA QP_NUM 7181
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 9
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 32767 OnConnected to 1
[[16:07:3316:07:33] ] schedulerworker  00  /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h::10971097: : 132767 OnConnected to  OnConnected to 91

[[16:07:3316:07:33] ] schedulerworker  00  /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc::701731: : rdmardma  132767	sent: 	received: ? => 9. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=server, id=8, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1] [role=worker, id=9, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1] [role=scheduler, id=1, ip=29.27.177.139, port=8123, is_recovery=0, aux_id=-1, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 01 => 32767. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=server, id=8, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1] [role=worker, id=9, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1] [role=scheduler, id=1, ip=29.27.177.139, port=8123, is_recovery=0, aux_id=-1, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0

[16:07:33] worker [[016:07:3316:07:33 ] ] /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.hschedulerserver:  17800:   Connecting to Node /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h8::, My_Node=7014359: : 
rdmaSendRendezvousReply (GDR Server): meta_len= 19441, data_len=	sent: 0
? => 8. Meta: request=0, timestamp=1, control={ cmd=ADD_NODE, node={ [role=server, id=8, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1] [role=worker, id=9, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1] [role=scheduler, id=1, ip=29.27.177.139, port=8123, is_recovery=0, aux_id=-1, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc[:16:07:33286] : serverThe scheduler is connected to  10 workers and  1/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h servers:
490: GDR Server Reply: meta_addr=400354010000, meta_rkey=247dfb, data_addr=0, data_rkey=0
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:731: rdma 32767	received: 1 => 32767. Meta: request=0, timestamp=1, control={ cmd=ADD_NODE, node={ [role=server, id=8, ip=29.27.177.139, port=53921, is_recovery=0, aux_id=0, num_ports=1] [role=worker, id=9, ip=29.27.177.139, port=58679, is_recovery=0, aux_id=0, num_ports=1] [role=scheduler, id=1, ip=29.27.177.139, port=8123, is_recovery=0, aux_id=-1, num_ports=1] } }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 8, My_Node=8
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7182
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:701: rdma 1	sent: [16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:731: rdma 1	received: ? => 1. Meta: request=1, timestamp=2, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0
1 => 1. Meta: request=1, timestamp=2, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:385: Instance barrier count for 7 : 1
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7183
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7184
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7185
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7186
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 8 OnConnect to 9 with Transport=RDMA QP_NUM 7186
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7187
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 8 OnConnect to 9 with Transport=RDMA QP_NUM 7187
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 9 OnConnected to 8
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7188
[[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 9 OnConnected to 8
16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 9, My_Node=9
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7189
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 8 OnConnect to 8 with Transport=RDMA QP_NUM 7188
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7190
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 8 OnConnect to 8 with Transport=RDMA QP_NUM 7190
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 8 OnConnected to 9
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 8 OnConnected to 9
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 9, My_Node=8
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7191
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7192
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7193
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 9 OnConnect to 9 with Transport=RDMA QP_NUM 7192
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7194
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 9 OnConnect to 9 with Transport=RDMA QP_NUM 7194
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7195
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7196
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 9 OnConnect to 8 with Transport=RDMA QP_NUM 7196
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 1, My_Node=9
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 8 OnConnected to 9
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7197
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 9 OnConnect to 8 with Transport=RDMA QP_NUM 7197
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 9 OnConnected to 8
[[16:07:33] worker[ 16:07:330]  server/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h :01097 : /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h9: OnConnected to 10978: 
8 OnConnected to 9
16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:178: Connecting to Node 1, My_Node=8
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7198
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7199
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x4001dc001030 , cq=0x4001dc001160, qp=7200
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7201
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x40034c001030 , cq=0x40034c001160, qp=7202
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 9 with Transport=RDMA QP_NUM 7201
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7203
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 9 with Transport=RDMA QP_NUM 7203
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 9 OnConnected to 1
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7204
[[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 9 OnConnected to 1
16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:520: W[9] is connected to others
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 8 with Transport=RDMA QP_NUM 7204
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:198: qp created: pd=0x400300001030 , cq=0x400300001160, qp=7205
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1014: 1 OnConnect to 8 with Transport=RDMA QP_NUM 7205
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 8 OnConnected to 1
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 9
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 9
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:1097: 1 OnConnected to 8
[16:07:33[16:07:33] scheduler 0]  server/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h :01097 : /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h1: OnConnected to 10978: 
8 OnConnected to 1
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:148: Initialized BackendMemoryAllocator for GPU 0 with pd 0x40034c001030
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:520: S[8] is connected to others
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:[70116:07:33: ] rdmascheduler  90	sent:  ? => 1. Meta: request=1, timestamp=1, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc
:731: rdma 1	received: 9 => 1. Meta: request=1, timestamp=1, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:385: Instance barrier count for 7 : 2
[16:07:33] server 0 [/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc16:07:33:] 701scheduler:  rdma0  8/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc	sent: :731? => 1. Meta: request=1, timestamp=1, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0: 
rdma 1	received: 8 => 1. Meta: request=1, timestamp=1, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:385: Instance barrier count for 7 : 3
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:701: rdma 1[	sent: 16:07:33? => 9. Meta: request=0, timestamp=3, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0] 
worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:[731[16:07:33: 16:07:33] rdma] scheduler server 9 0	received: 0 1 => 9. Meta: request=0, timestamp=3, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc
/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h::701435: : rdmaSendRendezvousReply (GDR Server): meta_len= 2881, data_len=	sent: 0? => 8. Meta: request=0, timestamp=4, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0

[[16:07:3316:07:33] ] serverscheduler  00  /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc::490701: : GDR Server Reply: meta_addr=rdma400354030000 , meta_rkey=13c252a	sent: , data_addr=? => 1. Meta: request=0, timestamp=5, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 00
, data_rkey=[016:07:33
] scheduler[ 16:07:330]  server/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc :0731 : /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.ccrdma: 7311: 	received: rdma1 => 1. Meta: request=0, timestamp=5, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0 
8	received: 1 => 8. Meta: request=0, timestamp=4, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:701: rdma 1	sent: [? => 1. Meta: request=1, timestamp=6, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0
16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:731: rdma 1	received: 1 => 1. Meta: request=1, timestamp=6, control={ cmd=INSTANCE_BARRIER, barrier_group=7 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:385: Instance barrier count for 7 : 1
KVWorker instance_idx,0
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/ps/kv_app.h:116: Enable worker zero-copy pull
could not set CPU affinity: gpu 0-> cpu19[
16:07:33[] [16:07:33server16:07:33]  ] workerscheduler  000   /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/ps/af_tensor_app.h/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:::447701731: : : AFTensorServer runs on gpu rdmardma0  
91	sent: 	received: ? => 1. Meta: request=1, timestamp=2, control={ cmd=INSTANCE_BARRIER, barrier_group=6 }. NOT DATA MSG!, Slave QP Count: 09 => 1. Meta: request=1, timestamp=2, control={ cmd=INSTANCE_BARRIER, barrier_group=6 }. NOT DATA MSG!, Slave QP Count: 0

[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:385: Instance barrier count for 6 : 1
could not set CPU affinity: gpu 0-> cpu11
[[16:07:33] [server16:07:33 ] 0scheduler  /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/ps/af_tensor_app.h0: 604/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc: :Start ResponseWorker 7310: 
rdma16:07:33 ] 1server	received:  8 => 1. Meta: request=1, timestamp=2, control={ cmd=INSTANCE_BARRIER, barrier_group=6 }. NOT DATA MSG!, Slave QP Count: 00
 [/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc16:07:33:] 701scheduler:  rdma0  8/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc	sent: :? => 1. Meta: request=1, timestamp=2, control={ cmd=INSTANCE_BARRIER, barrier_group=6 }. NOT DATA MSG!, Slave QP Count: 0385
: Instance barrier count for 6 : 2
[16:07:33] scheduler 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:701: rdma[ 16:07:331] 	sent: worker? => 9. Meta: request=0, timestamp=7, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0 
0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:731: rdma[[ 16:07:3316:07:339] ] 	received: schedulerserver1 => 9. Meta: request=0, timestamp=7, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0  
00  /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h::701435: : rdmaSendRendezvousReply (GDR Server): meta_len= 2881, data_len=	sent: 0? => 8. Meta: request=0, timestamp=8, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0

[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:490: GDR Server Reply: meta_addr=400354050000, meta_rkey=3c454b, data_addr=0, data_rkey=0
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:731: rdma 8	received: 1 => 8. Meta: request=0, timestamp=8, control={ cmd=INSTANCE_BARRIER, barrier_group=0 }. NOT DATA MSG!, Slave QP Count: 0
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/ps/af_tensor_app.h:249: ZPush_ addr: 0x4001db800000 val_len: 32768
[[16:07:3316:07:33] ] workerserver  00  /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h::701435: : rdmaSendRendezvousReply (GDR Server): meta_len= 2969, data_len=	sent: 32776? => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, sid=0, head=1, key=0, dtype={ UINT64 OTHER }, Slave QP Count: 0 Body: { CPU(0)->CPU(0) data_size=[8,32768,] }

[16:07:33] worker [016:07:33 ] /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/ps/af_tensor_app.hserver: 2490:  ZPush_ addr: 0x/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h4001db808000: val_len: 45032768: 
Alloc new gpu buffer: key=0, size=32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:452: GPU buffer allocated: key=0, addr=70377401233648, size=32768
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc:701: rdma 9	sent: ? => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, sid=0, head=2, key=1, dtype={ UINT64 OTHER }, Slave QP Count: 0 Body: { CPU(0)->CPU(0) data_size=[8,32768,] }
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/ps/af_tensor_app.h:249: ZPush_ addr: 0x4001db810000 val_len: 32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:202: Allocated new GPU buffer for key=0 addr=0x40034bc08000 size=32768
[[16:07:3316:07:33] ] serverworker  00  /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc::490701: : GDR Server Reply: meta_addr=rdma400354070000 , meta_rkey=93c494d	sent: , data_addr=? => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=1, sid=0, head=2, key=2, dtype={ UINT64 OTHER }, Slave QP Count: 0 Body: { CPU(0)->CPU(0) data_size=[8,32768,] }40034bc08000
, data_rkey=3c5358
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:435: SendRendezvousReply (GDR Server): meta_len=296, data_len=32776
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h[:16:07:33450] : workerAlloc new gpu buffer: key= 10, size= 32768/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/van.cc
:[70116:07:33: ] rdmaserver  90	sent:  ? => 8. Meta: request=1, timestamp=3, app_id=0, customer_id=0, simple_app=0, push=0, sid=0, head=3, key=0, dtype={ UINT64 OTHER }, Slave QP Count: 0 Body: { CPU(0)->CPU(0) data_size=[8,32768,] }/home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h
:452: GPU buffer allocated: key=1, addr=70377401235776, size=32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:202: Allocated new GPU buffer for key=1 addr=0x40034bc10000 size=32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:490: GDR Server Reply: meta_addr=400354090000, meta_rkey=3c5a5e, data_addr=40034bc10000, data_rkey=3c5d62
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:435: SendRendezvousReply (GDR Server): meta_len=296, data_len=32776
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:450: Alloc new gpu buffer: key=2, size=32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:452: GPU buffer allocated: key=2, addr=70377401237008, size=32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./././rdma_utils.h:202: Allocated new GPU buffer for key=2 addr=0x40034bc18000 size=32768
[16:07:33] server 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/././rdma_transport.h:490: GDR Server Reply: meta_addr=4003540b0000, meta_rkey=3c5f64, data_addr=40034bc18000, data_rkey=3c656b
[16:07:33] worker 0 /home/bingxing2/home/scx7753/mxy/StepMesh-main/include/dmlc/logging.h:301: [16:07:33] /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:785: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status 
local protection error 4 0 81 128 RECV postoffice ptr: 0xaaaecfb860f0

Stack trace returned 6 entries:
[bt] (0) /home/bingxing2/home/scx7753/mxy/StepMesh-main/fserver_lib.cpython-310-aarch64-linux-gnu.so(+0x63a00) [0x400193063a00]
[bt] (1) /home/bingxing2/home/scx7753/mxy/StepMesh-main/fserver_lib.cpython-310-aarch64-linux-gnu.so(+0x63d14) [0x400193063d14]
[bt] (2) /home/bingxing2/home/scx7753/mxy/StepMesh-main/fserver_lib.cpython-310-aarch64-linux-gnu.so(ps::RDMAVan::PollCQ()+0x2b0) [0x4001930c44a0]
[bt] (3) /home/bingxing2/apps/anaconda/2021.11/envs/py310torch251cu121/lib/libstdc++.so.6(+0xdd5ec) [0x4000fa91d5ec]
[bt] (4) /usr/lib64/libpthread.so.0(+0x87ac) [0x4000d81e87ac]
[bt] (5) /usr/lib64/libc.so.6(+0xd60fc) [0x4000d84260fc]


terminate called after throwing an instance of 'dmlc::Error'
  what():  [16:07:33] /home/bingxing2/home/scx7753/mxy/StepMesh-main/src/./rdma_van.h:785: Check failed: wc[i].status == IBV_WC_SUCCESS Failed status 
local protection error 4 0 81 128 RECV postoffice ptr: 0xaaaecfb860f0

Stack trace returned 6 entries:
[bt] (0) /home/bingxing2/home/scx7753/mxy/StepMesh-main/fserver_lib.cpython-310-aarch64-linux-gnu.so(+0x63a00) [0x400193063a00]
[bt] (1) /home/bingxing2/home/scx7753/mxy/StepMesh-main/fserver_lib.cpython-310-aarch64-linux-gnu.so(+0x63d14) [0x400193063d14]
[bt] (2) /home/bingxing2/home/scx7753/mxy/StepMesh-main/fserver_lib.cpython-310-aarch64-linux-gnu.so(ps::RDMAVan::PollCQ()+0x2b0) [0x4001930c44a0]
[bt] (3) /home/bingxing2/apps/anaconda/2021.11/envs/py310torch251cu121/lib/libstdc++.so.6(+0xdd5ec) [0x4000fa91d5ec]
[bt] (4) /usr/lib64/libpthread.so.0(+0x87ac) [0x4000d81e87ac]
[bt] (5) /usr/lib64/libc.so.6(+0xd60fc) [0x4000d84260fc]


/var/spool/slurmd/job1063841/slurm_script: line 57: 3072967 Aborted                 DMLC_ROLE=worker numactl -m 0 python3 $THIS_DIR/$BIN.py $@
+ cleanup
+ echo 'kill all testing process of ps lite for user scx7753'
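
Note on reading the failure line: the `4` printed after `local protection error` appears to be the numeric `ibv_wc_status` of the failed work completion (an assumption about the log format; the meaning of the remaining fields `0 81 128` is not confirmed here). In `<infiniband/verbs.h>`, value 4 is `IBV_WC_LOC_PROT_ERR`, which the verbs API reports when the buffers of a locally posted work request do not fall inside a memory region that is valid for the requested operation. A minimal sketch for decoding such codes, where `wc_status_name` is a hypothetical helper and not part of StepMesh:

```shell
# Hypothetical helper: map a numeric ibv_wc_status value to its enum name.
# Only a subset of the enum from <infiniband/verbs.h> is listed here.
wc_status_name() {
    case "$1" in
        0)  echo IBV_WC_SUCCESS ;;
        1)  echo IBV_WC_LOC_LEN_ERR ;;
        2)  echo IBV_WC_LOC_QP_OP_ERR ;;
        4)  echo IBV_WC_LOC_PROT_ERR ;;
        5)  echo IBV_WC_WR_FLUSH_ERR ;;
        8)  echo IBV_WC_LOC_ACCESS_ERR ;;
        10) echo IBV_WC_REM_ACCESS_ERR ;;
        12) echo IBV_WC_RETRY_EXC_ERR ;;
        13) echo IBV_WC_RNR_RETRY_EXC_ERR ;;
        *)  echo "UNKNOWN($1)" ;;
    esac
}
```

For example, `wc_status_name 4` prints the enum name for the status seen in the trace above.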

My script is:

function cleanup() {
    echo "kill all testing process of ps lite for user $USER" # cleanup function: force-kill every process whose name contains test_remote_moe or test_fserver
    # pkill -9 -f test_bench
    pkill -9 -f test_remote_moe
    pkill -9 -f test_fserver
    sleep 1
}
trap cleanup EXIT # run cleanup when the script exits
# cleanup



# common setup
export ROLE=joint
export RNIC=enp194s0f1
export CUDA_VISIBLE_DEVICES=0,1,2,3      
# common setup
export BIN=${BIN:-test_fserver}
# export DMLC_INTERFACE=${RNIC:-brainpf_bond0}
export SCHEDULER_IP=$(ip -o -4 addr | grep ${RNIC} | awk '{print $4}' | cut -d'/' -f1)
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=$SCHEDULER_IP  # scheduler's RDMA interface IP 
export DMLC_PS_ROOT_PORT=8123     # scheduler's port (can be chosen arbitrarily)
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_INTERFACE=auto
# export STEPMESH_BIND_CPU_CORE=1

export DMLC_NODE_HOST=${SCHEDULER_IP}
# export DMLC_INTERFACE=auto
export STEPMESH_SPLIT_QP_LAG=0
export STEPMESH_BIND_CPU_CORE=1
export STEPMESH_GPU=0
export PS_VERBOSE=2

DMLC_ROLE=scheduler numactl -m 0 python3 $THIS_DIR/$BIN.py &
export STEPMESH_CPU_START_OFFSET=10
DMLC_ROLE=server numactl -m 0 python3 $THIS_DIR/$BIN.py $@ &
# DMLC_ROLE=worker python3 $THIS_DIR/$BIN.py $@ &
# export STEPMESH_DROP_RATE=1
export STEPMESH_CPU_START_OFFSET=15
DMLC_ROLE=worker numactl -m 0 python3 $THIS_DIR/$BIN.py $@

wait
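
Note on the `SCHEDULER_IP` line in the script: `grep ${RNIC}` matches any line of `ip -o -4 addr` output that contains the interface name anywhere, so it can return more than one address if another interface name or alias contains the same string. A minimal sketch of a stricter lookup that compares the interface column exactly; `iface_ipv4` is a hypothetical helper name, and it assumes the usual one-line-per-address layout of `ip -o -4 addr` (interface name in field 2, `address/prefix` in field 4):

```shell
# Hypothetical helper: read `ip -o -4 addr` output on stdin and print the
# first IPv4 address whose interface column equals "$1" exactly.
iface_ipv4() {
    awk -v ifc="$1" '
        $2 == ifc {            # exact match on the interface name
            split($4, a, "/")  # field 4 is "address/prefix"
            print a[1]         # keep only the address part
            exit               # stop at the first match
        }'
}
```

It could be used as `SCHEDULER_IP=$(ip -o -4 addr | iface_ipv4 "$RNIC")`, followed by a check that the result is non-empty before any role is started.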

My machine is an A100 platform with 4 GPUs per node. Here is the NIC information:

$ibdev2netdev
mlx5_0 port 1 ==> enp194s0f0 (Up)
mlx5_1 port 1 ==> enp194s0f1 (Up)
mlx5_2 port 1 ==> enp226s0f0 (Up)
mlx5_3 port 1 ==> enp226s0f1 (Up)

$ lsmod | grep nvidia_peermem
nvidia_peermem        262144  0
ib_core               589824  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              57081856  109 nvidia_uvm,nvidia_peermem,nvidia_modeset

$ ip addr show enp194s0f1
4: enp194s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4500 qdisc mq state UP group default qlen 1000
    link/ether 08:c0:eb:89:8a:fb brd ff:ff:ff:ff:ff:ff
    inet 29.27.26.104/16 brd 29.27.255.255 scope global noprefixroute enp194s0f1
       valid_lft forever preferred_lft forever

I also tested RDMA with ib_write_bw, and the test passed.
