This repository was archived by the owner on Sep 18, 2025. It is now read-only.

High Inter-node Latency for Small Messages for UCX-Py using InfiniBand #563

Description

@aamirshafi

We are trying to reproduce the host-based (NumPy objects) UCX-Py numbers shown on slide 22 of the GTC 2019 talk (https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9679-ucx-python-a-flexible-communication-library-for-python-applications.pdf) but are getting much higher latency. We see around 54 us for a 4-byte message in latency-bound mode, and around 82 us in throughput-bound mode. The same benchmark at the UCX level reports 2 to 3 us.
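
For comparison at the UCX level, a tag-matching latency test of the kind we mean can be run with ucx_perftest; a representative invocation (parameters shown for illustration, not a verbatim copy of our command) would be:

ucx_perftest -t tag_lat -s 4 -n 10000              # server side
ucx_perftest <server-ip> -t tag_lat -s 4 -n 10000  # client side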

The latency here seems to be on the high side. What could be the reason for this?

Some details on the test setup.

The test runs between two nodes connected via InfiniBand. The benchmark is https://github.com/rapidsai/ucx-py/blob/branch-0.15/benchmarks/local-send-recv.py; we only modified it to report latency numbers (see latency.patch.txt), along the lines of the sketch below.
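
A minimal sketch of the kind of ping-pong timing loop we mean (not the actual patch; the buffer setup, port, and variable names are illustrative assumptions) using ucx-py:

# Run with asyncio.run(server()) on one node and
# asyncio.run(client("<server-ip>")) on the other.
import asyncio
import time
import numpy as np
import ucp

n_bytes, n_iter, port = 4, 10000, 13337   # illustrative values

async def server():
    async def handler(ep):
        buf = np.empty(n_bytes, dtype="u1")
        for _ in range(n_iter):
            await ep.recv(buf)             # receive ping
            await ep.send(buf)             # echo it back
        await ep.close()
    lf = ucp.create_listener(handler, port)
    while not lf.closed():
        await asyncio.sleep(0.1)

async def client(server_address):
    ep = await ucp.create_endpoint(server_address, port)
    msg = np.zeros(n_bytes, dtype="u1")
    resp = np.empty_like(msg)
    start = time.monotonic()
    for _ in range(n_iter):
        await ep.send(msg)                 # ping
        await ep.recv(resp)                # wait for the echo
    elapsed = time.monotonic() - start
    # report half of the per-iteration round-trip time as the one-way latency
    print("latency: %.2f us" % (elapsed / n_iter / 2 * 1e6))
    await ep.close()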

Exact commands are as follows.

Server:

UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc,sm,self python local-send-recv.py --n-bytes 4 --reuse-alloc --server-only --n-iter 10000 --object_type numpy

Server Running at X:Y
Client:

UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,sm,self python local-send-recv.py --n-bytes 4 --client-only --server-address X --reuse-alloc --port Y --n-iter 10000 --object_type numpy
