Add detailed metrics stating the exact source _and_ destination of communicated data #996

@pentschev

Summary

Detailed communication metrics are important for determining which transport layers are actually in use, allowing us to identify optimal configurations for compute nodes based on their topology and to prioritize appropriate optimization routes.

Option 1: Debug information from UCX(X) requests

UCX can provide per-request debug information after a transfer completes, including details such as the transport layer used and the bandwidth achieved. This information could be plugged directly into the UCXX Communicator and aggregated into statistics for all requests without modifying call sites.

This is currently blocked as it requires rapidsai/ucxx#437.
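
To make the aggregation step in Option 1 concrete, here is a minimal Python sketch of how per-request records could be folded into per-transport statistics. Everything in it is an assumption for illustration: `RequestInfo`, `TransportStats`, and `RequestStatistics` are hypothetical names, and the fields shown (transport name, byte count, elapsed time) are only a guess at what UCX might expose, not an existing UCX or UCXX API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestInfo:
    """Hypothetical per-request record; real fields depend on what UCX exposes."""

    transport: str  # transport layer name reported for the request
    nbytes: int  # payload size in bytes
    seconds: float  # elapsed transfer time


@dataclass
class TransportStats:
    """Running totals for all requests that used one transport."""

    requests: int = 0
    nbytes: int = 0
    seconds: float = 0.0

    @property
    def bandwidth(self) -> float:
        """Average bytes per second over all recorded requests."""
        return self.nbytes / self.seconds if self.seconds > 0 else 0.0


class RequestStatistics:
    """Aggregates completed requests by transport, independent of call sites."""

    def __init__(self) -> None:
        self._by_transport = defaultdict(TransportStats)

    def record(self, info: RequestInfo) -> None:
        stats = self._by_transport[info.transport]
        stats.requests += 1
        stats.nbytes += info.nbytes
        stats.seconds += info.seconds

    def summary(self) -> dict:
        return dict(self._by_transport)
```

In this sketch the communicator would call `record()` once per completed request, so collectives built on top of it would not need to change.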

Option 2: Scaffolding in RapidsMPF call sites

Collectives that use a Communicator can collect statistics on whether they are sending data from host or device memory, or receiving data into host or device memory. On its own, however, this is insufficient for a full picture, because the transport layer can change dramatically depending on the memory resource at the remote source or destination. Inferring the transport layer used requires information about both the send and the receive memory resources.

To satisfy this requirement, we would need additional scaffolding to retrieve the receiver's memory resource type on the sender side, the sender's memory resource type on the receiver side, or both. The sender's type could probably be added as metadata before the message is sent and retrieved by the receiver, which would then generate statistics on the receiver side. Providing statistics on the sender side would be more complicated, as it would require an additional message, sent after the transfer completes, reporting the memory resource type used. Regardless of the chosen approach, this would introduce additional complexity at all Communicator::send()/Communicator::recv*() call sites.
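
The receiver-side variant described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `MemoryType`, `pack_message`, `unpack_message`, `TransferStatistics`, and `receive` are hypothetical names, the one-byte header layout is invented for the example, and none of it reflects the actual RapidsMPF message format or Communicator API.

```python
import struct
from collections import Counter
from enum import IntEnum


class MemoryType(IntEnum):
    """Hypothetical tag for where a buffer lives."""

    HOST = 0
    DEVICE = 1


# Assumed header layout: a single byte carrying the sender's memory type.
_HEADER = struct.Struct("<B")


def pack_message(payload: bytes, src: MemoryType) -> bytes:
    """Sender side: prepend the sender's memory type to the payload."""
    return _HEADER.pack(int(src)) + payload


def unpack_message(message: bytes) -> tuple:
    """Receiver side: split the header from the payload."""
    (src,) = _HEADER.unpack_from(message)
    return MemoryType(src), message[_HEADER.size:]


class TransferStatistics:
    """Counts received bytes per (source, destination) memory type pair."""

    def __init__(self) -> None:
        self.bytes_by_route = Counter()

    def record(self, src: MemoryType, dst: MemoryType, nbytes: int) -> None:
        self.bytes_by_route[(src, dst)] += nbytes


def receive(message: bytes, dst: MemoryType, stats: TransferStatistics) -> bytes:
    """Unpack a message and record its route using the local destination type."""
    src, payload = unpack_message(message)
    stats.record(src, dst, len(payload))
    return payload
```

Only the receiver learns both ends of the route here. Sender-side statistics would still need the extra message described above.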

This option would also require knowledge of the underlying system and of the UCX configuration, since the transport layers depend on both and are not easily retrievable.
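
As a rough illustration of why that extra knowledge is needed, the Python sketch below infers a transport from a memory type pair using a lookup table that would have to be written per system and per UCX configuration. `Route`, `EXAMPLE_TABLE`, and `infer_transport` are hypothetical, and the table entries are placeholders rather than a statement of what UCX selects on any given machine.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    """Hypothetical description of one transfer's endpoints."""

    src: str  # "host" or "device"
    dst: str  # "host" or "device"
    same_node: bool  # whether both ranks run on the same node


# Placeholder entries only: a real table must be filled in for each system
# and UCX configuration, since the same route can map to different transports.
EXAMPLE_TABLE = {
    Route("device", "device", True): "cuda_ipc",
    Route("host", "host", False): "tcp",
}


def infer_transport(route: Route, table: dict) -> Optional[str]:
    """Return the transport the table predicts for a route, or None if unknown."""
    return table.get(route)
```

Any route missing from the table yields `None`, which makes explicit that the inference is only as good as the system knowledge encoded in it.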
