We are trying to reproduce the host-based (NumPy objects) UCX-Py numbers shown on slide 22 of the GTC 2019 talk (https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9679-ucx-python-a-flexible-communication-library-for-python-applications.pdf), but we are seeing much higher latency. For a 4-byte message we measure around 54 us in latency-bound mode and around 82 us in throughput-bound mode. The same benchmark at the UCX level reports 2 to 3 us.
The UCX-Py latency seems high by comparison. What could be the reason for this?
Some details on the test setup:
The two nodes are connected via InfiniBand (IB). The benchmark is https://github.com/rapidsai/ucx-py/blob/branch-0.15/benchmarks/local-send-recv.py. Our only modification was to make it report latency numbers (see latency.patch.txt).
The exact commands are as follows.
Server:
UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc,sm,self python local-send-recv.py --n-bytes 4 --reuse-alloc --server-only --n-iter 10000 --object_type numpy
Server Running at X:Y
Client:
UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,sm,self python local-send-recv.py --n-bytes 4 --client-only --server-address X --reuse-alloc --port Y --n-iter 10000 --object_type numpy
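For readers without the attachment, the kind of latency calculation referred to above can be sketched roughly as follows. This is an illustration only, not the contents of latency.patch.txt: the helper name one_way_latency_us, the warm-up count, and the convention of reporting half the round-trip time are all assumptions, and round_trip stands in for whatever send/recv pair the benchmark performs.

```python
import statistics
import time


def one_way_latency_us(round_trip, n_iter=10000, n_warmup=100):
    """Return the median half-round-trip time of `round_trip()` in microseconds.

    `round_trip` is a hypothetical zero-argument callable that performs one
    send followed by one matching receive (a ping-pong).
    """
    # Warm-up iterations are run but not timed.
    for _ in range(n_warmup):
        round_trip()

    samples = []
    for _ in range(n_iter):
        start = time.perf_counter()
        round_trip()
        elapsed = time.perf_counter() - start
        # Half of the round trip, converted from seconds to microseconds.
        samples.append(elapsed / 2 * 1e6)

    # The median is less sensitive to occasional slow iterations than the mean.
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a network.
    print(one_way_latency_us(lambda: time.sleep(0.001), n_iter=20, n_warmup=2))
```

Timing each iteration separately, as here, adds Python-level overhead to every sample, which matters when the expected latency is only a few microseconds.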