This repository was archived by the owner on Sep 18, 2025. It is now read-only.
Hello,
My program hangs when I use a string as the tag.
Client code:
flag = 0

async def main():
    global flag
    ep = await ucp.create_endpoint(host, port)
    arr = torch.zeros((n_bytes,), dtype=torch.float32, device='cuda')
    print("Send Original torch tensor")
    await ep.send(arr, tag=str(flag))  # send the real message
    flag += 1
    print("Receive Incremented torch tensor")
    resp = torch.empty_like(arr)
    await ep.recv(resp, tag=str(flag))  # receive the echo
    await ep.close()
    assert torch.allclose(resp, arr + 1)
Server code:
flag = 0

async def send(ep: ucp.Endpoint):
    global flag
    arr = torch.empty((n_bytes,), dtype=dtype, device="cuda")
    await ep.recv(arr, tag=str(flag))
    assert torch.count_nonzero(arr).item() == 0
    print("Received torch tensor")
    flag += 1
    arr += 1
    print("Sending incremented torch tensor")
    await ep.send(arr, tag=str(flag))
    await ep.close()
    lf.close()
Replacing str(flag) with flag in the code above (that is, passing the integer itself as the tag) resolves the hang.
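One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the library derives the wire tag from Python's built-in hash() of the tag argument, then string tags cannot match across processes, because str hashes are randomized per interpreter (PYTHONHASHSEED) while int hashes are not. The sketch below only demonstrates that Python behavior; it does not touch UCX. `hash_in_new_process` and `stable_tag` are hypothetical helper names introduced for illustration.

```python
import os
import subprocess
import sys
import zlib


def hash_in_new_process(expr: str, seed: str) -> int:
    """Return hash(<expr>) as computed by a fresh interpreter started with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({expr}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


def stable_tag(name: str) -> int:
    """Hypothetical helper: map a string label to an integer that is the same in every process."""
    return zlib.crc32(name.encode("utf-8"))


if __name__ == "__main__":
    # Two different seeds stand in for two independently started processes
    # (for example, the client and the server).
    print("str tag:", hash_in_new_process("'0'", "1"), hash_in_new_process("'0'", "2"))
    print("int tag:", hash_in_new_process("0", "1"), hash_in_new_process("0", "2"))
    print("stable_tag('0'):", stable_tag("0"))
```

If a string label is still wanted, both sides could pass an integer derived from it, for example tag=stable_tag(str(flag)) in the send and recv calls, since integer tags are reported above to work.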
The client's debug log indicates that it keeps trying to resend the message:
Send Original torch tensor
[1718732246.687108] [n121-014-226:3530201:0] mpool.c:281 UCX DEBUG mpool ud_tx_skb: allocated chunk 0x7fc112000018 of 6291432 bytes with 1489 elements
[1718732247.208728] [n121-014-226:3530201:a] ud_ep.c:93 UCX DEBUG ep: 0x70e97a0 ca drop@cwnd = 2 in flight: 1
[1718732247.208755] [n121-014-226:3530201:a] ud_ep.c:1427 UCX DEBUG ep(0x70e97a0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1718732247.208762] [n121-014-226:3530201:a] ud_ep.c:1433 UCX DEBUG ep(0x70e97a0): resending completed
[1718732266.929299] [n121-014-226:3530201:a] ud_ep.c:93 UCX DEBUG ep: 0x70e97a0 ca drop@cwnd = 3 in flight: 1
[1718732266.929326] [n121-014-226:3530201:a] ud_ep.c:1427 UCX DEBUG ep(0x70e97a0): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1718732266.929338] [n121-014-226:3530201:a] ud_ep.c:1433 UCX DEBUG ep(0x70e97a0): resending completed
[1718732287.450339] [n121-014-226:3530201:a] ud_ep.c:93 UCX DEBUG ep: 0x70e97a0 ca drop@cwnd = 3 in flight: 1
[1718732287.450370] [n121-014-226:3530201:a] ud_ep.c:1427 UCX DEBUG ep(0x70e97a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1718732287.450398] [n121-014-226:3530201:a] ud_ep.c:1433 UCX DEBUG ep(0x70e97a0): resending completed
The server waits after allocating its buffers:
[1718732226.689047] [n123-017-156:3656205:0] ud_ep.c:406 UCX DEBUG created ep ep=0x5623d098d000 iface=0x5623d09e1460 id=0
[1718732226.689052] [n123-017-156:3656205:0] wireup_ep.c:483 UCX DEBUG ep 0x7f0083f40000: wireup_ep 0x5623d0f50930 created next_ep 0x5623d098d000 to <no debug data> using ud_mlx5/mlx5_5:1
[1718732226.689091] [n123-017-156:3656205:0] async.c:231 UCX DEBUG added async handler 0x5623d0f50c30 [id=1000035 ref 1] ???() to hash
[1718732226.689106] [n123-017-156:3656205:0] ud_ep.c:691 UCX DEBUG mlx5_5:1/RoCE slid 0 qpn 0x11186 epid 0 connected to ::ffff:192.168.6.226pkey 0xffff qpn 0x10fff epid 0
[1718732226.689113] [n123-017-156:3656205:0] ib_iface.c:796 UCX DEBUG iface 0x5623d09e1460: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.168.6.226 sgid_index=3 traffic_class=106
[1718732226.690859] [n123-017-156:3656205:0] mpool.c:281 UCX DEBUG mpool ud_recv_skb: allocated chunk 0x7efba1000018 of 20971496 bytes with 4964 elements
[1718732226.690904] [n123-017-156:3656205:0] wireup_ep.c:415 UCX DEBUG ep 0x7f0083f40000: destroy wireup ep 0x5623d0f89980
[1718732226.690908] [n123-017-156:3656205:0] wireup_ep.c:415 UCX DEBUG ep 0x7f0083f40000: destroy wireup ep 0x5623d0681c10
[1718732226.690913] [n123-017-156:3656205:0] wireup_ep.c:415 UCX DEBUG ep 0x7f0083f40000: destroy wireup ep 0x5623d0f50930
[1718732226.864187] [n123-017-156:3656205:0] mpool.c:281 UCX DEBUG mpool ucp_am_bufs: allocated chunk 0x5623d1e13bf4 of 24660 bytes with 128 elements
[1718732246.685229] [n123-017-156:3656205:0] mpool.c:281 UCX DEBUG mpool ud_tx_skb: allocated chunk 0x7f0072c00018 of 6291432 bytes with 1489 elements