Communication and compute on separate Streams do not overlap

Cross-posting [this issue](https://github.com/intel/intel-extension-for-pytorch/issues/599) from `ipex`, in case the `torch-ccl` team is not aware of it.

Key issues:
* Compute and collective communications do not overlap on Intel GPU devices, even when issued on separate streams
* Collectives block the host thread instead of launching a kernel and returning immediately (as they do on NVIDIA devices)
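
A rough, device-agnostic sketch of how the second point could be checked: time how long the host thread spends inside the launch call versus the later synchronization. The helper `time_launch_and_sync` and the two `fake_*` stand-ins are hypothetical names used only so the snippet runs without a GPU; they are not part of `torch`, `ipex`, or `torch-ccl`. On real hardware, `launch` would presumably wrap something like `dist.all_reduce(t, async_op=True)` and `sync` a device-wide synchronize.

```python
import threading
import time


def time_launch_and_sync(launch, sync):
    """Return (launch_seconds, sync_seconds) as seen by the host thread."""
    t0 = time.perf_counter()
    handle = launch()          # time spent here is host time inside the launch call
    t1 = time.perf_counter()
    sync(handle)               # time spent here is host time waiting for completion
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1


def fake_async_collective(duration=0.2):
    # Hypothetical stand-in for a non-blocking launch:
    # the work runs in the background and the call returns right away.
    worker = threading.Thread(target=time.sleep, args=(duration,))
    worker.start()
    return worker


def fake_blocking_collective(duration=0.2):
    # Hypothetical stand-in for a blocking launch:
    # the host thread is held until the work has finished.
    time.sleep(duration)
    return None


async_launch, async_sync = time_launch_and_sync(
    fake_async_collective, lambda worker: worker.join()
)
blocking_launch, blocking_sync = time_launch_and_sync(
    fake_blocking_collective, lambda _: None
)
print(f"async:    launch={async_launch:.3f}s sync={async_sync:.3f}s")
print(f"blocking: launch={blocking_launch:.3f}s sync={blocking_sync:.3f}s")
```

With a non-blocking launch, almost all of the time lands in the `sync` column; with a blocking launch, it lands in the `launch` column, which matches the behavior described for the Intel devices.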

The PyTorch profiler traces highlight the issues (copied from the other thread):

## A100 Trace

<img width="1491" alt="nvidia_a100_trace" src="https://github.com/intel/torch-ccl/assets/44747910/f86b7311-1734-4091-b8f4-4d2f04ed4e81">

Non-blocking kernel launch and comms/compute overlap.

## Intel Max 1550 Trace

<img width="1491" alt="intel_1550_trace" src="https://github.com/intel/torch-ccl/assets/44747910/08bafa4a-e1d6-407f-a0c8-7952feecf0b4">

Blocking kernel launch and no comms/compute overlap.

See the other thread for more details.
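
To put a number on what the traces show, one option is to compute the fraction of communication time during which any compute kernel is also running. The sketch below assumes the kernel start/end timestamps have already been extracted from a profiler trace into `(start, end)` tuples; the function name `overlap_fraction` and the sample intervals are hypothetical and are not taken from the traces above.

```python
def overlap_fraction(comm, compute):
    """Fraction of total communication time that coincides with compute.

    comm, compute: lists of (start, end) tuples in the same time unit.
    Communication intervals are assumed not to overlap one another.
    """
    # Merge compute intervals so time covered by several kernels counts once.
    merged = []
    for start, end in sorted(compute):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    total = sum(end - start for start, end in comm)
    if total == 0:
        return 0.0

    # Sum the intersection of every comm interval with every merged compute interval.
    overlapped = 0
    for c_start, c_end in comm:
        for m_start, m_end in merged:
            overlapped += max(0, min(c_end, m_end) - max(c_start, m_start))
    return overlapped / total


# Made-up intervals: a collective running alongside two compute kernels...
overlapping = overlap_fraction([(0, 10)], [(2, 6), (4, 8)])
# ...versus a collective that only starts once compute has finished.
serialized = overlap_fraction([(10, 20)], [(0, 10)])
print(overlapping, serialized)
```

A value near zero corresponds to the serialized pattern reported for the Max 1550, while a larger value corresponds to the overlapped pattern reported for the A100.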
