[feat] Add HCCL tensor transport for RDT#21
Conversation
CLA Signature Guide@KaisennHu , thanks for your pull request. The following commit(s) are not associated with a signed Contributor License Agreement (CLA).
To sign CLA, click here. To check if your email is configured correctly, refer to the FAQs. Once you've signed the CLA or updating your email, please comment |
|
/check-cla |
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
7d52991 to
8732aa7
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
8732aa7 to
0c0828e
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
0c0828e to
54a2e6e
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
54a2e6e to
5fe39a1
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
5fe39a1 to
161ef28
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
1 similar comment
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
7ebce6d to
dc6769b
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
dc6769b to
1837cad
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
1837cad to
b7c0480
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
b7c0480 to
2f9d321
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
2f9d321 to
eda04eb
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
eda04eb to
3315690
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
3315690 to
548d921
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
16380b6 to
a30d0ec
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
|
I would suggest a "version compatible design". for older ray versions (without register_collective_backend), hccl and hccl rdt will not be activated but yr and hixl can still be used. if ray>=2.56, then all apis are available @KaisennHu |
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
| import ray | ||
| from ray.util import collective | ||
| from ray_ascend.collective import HCCLGroup | ||
| from ray_ascend.collective.hccl_collective_group import HCCLGroup |
There was a problem hiding this comment.
import HCCLGroup is not necessary. just create_collective_group
| def backend(cls) -> Backend: | ||
| """Return the backend type for this group.""" | ||
| return Backend.HCCL | ||
| return "HCCL" |
There was a problem hiding this comment.
here automatically setattr. so here Backend.HCCL should work
|
|
||
| | Ray Version | YR Transport | HCCL Collective | HCCL Tensor Transport (RDT) | | ||
| |-------------|-------------|-----------------|-----------------------------| | ||
| | 2.55 | ✅ | ❌ | ❌ | |
There was a problem hiding this comment.
change 2.55.0
to
>=2.55.0, <2.56.0
to avoid ambuguity
| @@ -0,0 +1,234 @@ | |||
| import pytest | |||
There was a problem hiding this comment.
since this test works, how about deleting the pure hccl tests?
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
| try: | ||
| from ray.util.collective.backend_registry import register_collective_backend | ||
| except ImportError as e: | ||
| import ray | ||
|
|
||
| raise RuntimeError( | ||
| f"register_hccl_collective_backend requires Ray >= 2.56, " | ||
| f"but Ray {ray.__version__} is installed. " | ||
| f"Please upgrade: pip install 'ray>=2.56'" | ||
| ) from e | ||
|
|
||
| from .collective.hccl_collective_group import HCCLGroup | ||
|
|
||
| register_collective_backend("HCCL", HCCLGroup) |
There was a problem hiding this comment.
should we check if hccl is installed? like,
raise ImportError(
"YR tensor transport requires the [yr] extra dependency. "
"Please install it with: pip install ray-ascend[yr]"
) from eThere was a problem hiding this comment.
Being able to import torch_npu means that hccl is definitely installed.
| ) | ||
|
|
||
|
|
||
| class HCCLTensorTransport(CollectiveTensorTransport): |
There was a problem hiding this comment.
should we implement get_communicator_metadata()?
There was a problem hiding this comment.
not needed. It is aligned with NcclTensorTransport.
There was a problem hiding this comment.
We need to update ray-ascend version
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
|
|
||
| if importlib.util.find_spec("torch_npu") is None: | ||
| raise ImportError("torch_npu not found") | ||
| ctypes.CDLL("libhccl.so") |
There was a problem hiding this comment.
where is libhccl.so? In ascend-toolkit/?
There was a problem hiding this comment.
should we remind users to set PATH?
There was a problem hiding this comment.
Yes, libhccl.so is indeed part of the ascend-toolkit (CANN). Users are already reminded via check_backend_availability.
There was a problem hiding this comment.
we should update the type annotation of tensor
dpj135
left a comment
There was a problem hiding this comment.
LGTM. There are still some comments.
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Description
HCCLTensorTransportto integrate HCCL with RDT tensor transport flow and enable explicit registrationregister_collective_backendAPIregister_hccl_collective_backend,register_hccl_tensor_transport, andregister_yr_tensor_transportAPISelf-Check Result
register_collective_backendAPI:Related issues
Closes #9 #3