Skip to content
This repository was archived by the owner on Sep 18, 2025. It is now read-only.
This repository was archived by the owner on Sep 18, 2025. It is now read-only.

[BUG] Hanging when using string as tag #1049

@cctry

Description

@cctry

Hello,

My program hangs when using string as the tag.

Client code

flag = 0
async def main():
    global flag
    ep = await ucp.create_endpoint(host, port)
    arr = torch.zeros((n_bytes, ) , dtype=torch.float32, device='cuda')
    print("Send Original torch tensor")
    await ep.send(arr, tag=str(flag))  # send the real message
    flag += 1

    print("Receive Incremented torch tensor")
    resp = torch.empty_like(arr)
    await ep.recv(resp, tag=str(flag))  # receive the echo
    await ep.close()
    assert torch.allclose(resp, arr + 1)

Server code

flag = 0
async def send(ep: ucp.Endpoint):
    global flag
    arr = torch.empty((n_bytes, ), dtype=dtype, device="cuda")
    await ep.recv(arr, tag=str(flag))
    assert torch.count_nonzero(arr).item() == 0
    print("Received torch tensor")
    flag += 1

    arr += 1
    print("Sending incremented torch tensor")
    await ep.send(arr, tag=str(flag))
    await ep.close()
    lf.close()

Replacing the above str(flag) with flag resolves the hanging.

The debug log of client indicates that it tries to resend the message

Send Original torch tensor
[1718732246.687108] [n121-014-226:3530201:0]           mpool.c:281  UCX  DEBUG mpool ud_tx_skb: allocated chunk 0x7fc112000018 of 6291432 bytes with 1489 elements
[1718732247.208728] [n121-014-226:3530201:a]           ud_ep.c:93   UCX  DEBUG ep: 0x70e97a0 ca drop@cwnd = 2 in flight: 1
[1718732247.208755] [n121-014-226:3530201:a]           ud_ep.c:1427 UCX  DEBUG ep(0x70e97a0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1718732247.208762] [n121-014-226:3530201:a]           ud_ep.c:1433 UCX  DEBUG ep(0x70e97a0): resending completed
[1718732266.929299] [n121-014-226:3530201:a]           ud_ep.c:93   UCX  DEBUG ep: 0x70e97a0 ca drop@cwnd = 3 in flight: 1
[1718732266.929326] [n121-014-226:3530201:a]           ud_ep.c:1427 UCX  DEBUG ep(0x70e97a0): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1718732266.929338] [n121-014-226:3530201:a]           ud_ep.c:1433 UCX  DEBUG ep(0x70e97a0): resending completed
[1718732287.450339] [n121-014-226:3530201:a]           ud_ep.c:93   UCX  DEBUG ep: 0x70e97a0 ca drop@cwnd = 3 in flight: 1
[1718732287.450370] [n121-014-226:3530201:a]           ud_ep.c:1427 UCX  DEBUG ep(0x70e97a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1718732287.450398] [n121-014-226:3530201:a]           ud_ep.c:1433 UCX  DEBUG ep(0x70e97a0): resending completed

The server waits after allocating

[1718732226.689047] [n123-017-156:3656205:0]           ud_ep.c:406  UCX  DEBUG   created ep ep=0x5623d098d000 iface=0x5623d09e1460 id=0
[1718732226.689052] [n123-017-156:3656205:0]       wireup_ep.c:483  UCX  DEBUG   ep 0x7f0083f40000: wireup_ep 0x5623d0f50930 created next_ep 0x5623d098d000 to <no debug data> using ud_mlx5/mlx5_5:1
[1718732226.689091] [n123-017-156:3656205:0]           async.c:231  UCX  DEBUG   added async handler 0x5623d0f50c30 [id=1000035 ref 1] ???() to hash
[1718732226.689106] [n123-017-156:3656205:0]           ud_ep.c:691  UCX  DEBUG   mlx5_5:1/RoCE slid 0 qpn 0x11186 epid 0 connected to ::ffff:192.168.6.226pkey 0xffff  qpn 0x10fff epid 0
[1718732226.689113] [n123-017-156:3656205:0]        ib_iface.c:796  UCX  DEBUG   iface 0x5623d09e1460: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.168.6.226 sgid_index=3 traffic_class=106
[1718732226.690859] [n123-017-156:3656205:0]           mpool.c:281  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7efba1000018 of 20971496 bytes with 4964 elements
[1718732226.690904] [n123-017-156:3656205:0]       wireup_ep.c:415  UCX  DEBUG ep 0x7f0083f40000: destroy wireup ep 0x5623d0f89980
[1718732226.690908] [n123-017-156:3656205:0]       wireup_ep.c:415  UCX  DEBUG ep 0x7f0083f40000: destroy wireup ep 0x5623d0681c10
[1718732226.690913] [n123-017-156:3656205:0]       wireup_ep.c:415  UCX  DEBUG ep 0x7f0083f40000: destroy wireup ep 0x5623d0f50930
[1718732226.864187] [n123-017-156:3656205:0]           mpool.c:281  UCX  DEBUG mpool ucp_am_bufs: allocated chunk 0x5623d1e13bf4 of 24660 bytes with 128 elements
[1718732246.685229] [n123-017-156:3656205:0]           mpool.c:281  UCX  DEBUG mpool ud_tx_skb: allocated chunk 0x7f0072c00018 of 6291432 bytes with 1489 elements

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions