[feat] Add HCCL tensor transport for RDT #21

Open
KaisennHu wants to merge 8 commits into Ascend:master from KaisennHu:feat/rdt-hccl

Conversation

@KaisennHu (Collaborator) commented Feb 14, 2026

Description

  • Implement HCCLTensorTransport to integrate HCCL with the RDT tensor transport flow and enable explicit registration
  • Add HCCL collective backend tests that use the register_collective_backend API
  • Add tests covering HCCL tensor transport usage in RDT
  • Provide the register_hccl_collective_backend, register_hccl_tensor_transport, and register_yr_tensor_transport APIs
  • Update the usage doc

Self-Check Result

  • HCCL collective backend tests using the register_collective_backend API:
[root@devserver-bms-54 ray-ascend]# pytest -v /home/hhc/ray_wp/ray-ascend/tests/collective/test_hccl_via_registry.py
====================================================================== test session starts =======================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
Test order randomisation NOT enabled. Enable with --random-order or --random-order-bucket=<bucket_type>
rootdir: /home/hhc/ray_wp/ray-ascend
configfile: pyproject.toml
plugins: cov-6.2.1, hydra-core-1.3.2, jaxtyping-0.3.2, anyio-4.8.0, random-order-1.2.0, mock-3.14.1, typeguard-4.3.0
collected 6 items                                                                                                                                                

tests/collective/test_hccl_via_registry.py::test_allreduce PASSED                                                                                          [ 16%]
tests/collective/test_hccl_via_registry.py::test_broadcast PASSED                                                                                          [ 33%]
tests/collective/test_hccl_via_registry.py::test_allgather PASSED                                                                                          [ 50%]
tests/collective/test_hccl_via_registry.py::test_reduce PASSED                                                                                             [ 66%]
tests/collective/test_hccl_via_registry.py::test_reducescatter PASSED                                                                                      [ 83%]
tests/collective/test_hccl_via_registry.py::test_send_recv PASSED                                                                                          [100%]

======================================================================== warnings summary ========================================================================
tests/collective/test_hccl_via_registry.py::test_allreduce
  /usr/local/python3.10.16/lib/python3.10/site-packages/ray/_private/worker.py:2052: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================= tests coverage =========================================================================
________________________________________________________ coverage: platform linux, python 3.10.16-final-0 ________________________________________________________

Name                                                      Stmts   Miss  Cover   Missing
---------------------------------------------------------------------------------------
ray_ascend/__init__.py                                        0      0   100%
ray_ascend/_version.py                                       13     13     0%   4-34
ray_ascend/collective/__init__.py                             2      0   100%
ray_ascend/collective/hccl_collective_group.py              334    259    22%   30, 34-36, 65-83, 94-102, 118-119, 122-123, 126-130, 136-146, 150-162, 167, 173-181, 197-222, 241-274, 291-315, 326-333, 347-374, 393-426, 441-461, 476-496, 499-512, 524-554, 558-560, 572-585, 589-615, 625-636, 640-642, 647-653, 668-677
ray_ascend/direct_transport/__init__.py                       3      3     0%   1-8
ray_ascend/direct_transport/hccl_tensor_transport.py          9      9     0%   1-22
ray_ascend/direct_transport/yr_tensor_transport.py          116    116     0%   1-253
ray_ascend/direct_transport/yr_tensor_transport_util.py      44     44     0%   1-91
---------------------------------------------------------------------------------------
TOTAL                                                       521    444    15%
Coverage HTML written to dir htmlcov
Coverage XML written to file coverage.xml
============================================================ 6 passed, 1 warning in 60.06s (0:01:00) =============================================================
  • HCCL tensor transport tests:
[root@devserver-bms-54 ray-ascend]# pytest -v /home/hhc/ray_wp/ray-ascend/tests/direct_transport/test_hccl_tensor_transport.py
====================================================================== test session starts =======================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
Test order randomisation NOT enabled. Enable with --random-order or --random-order-bucket=<bucket_type>
rootdir: /home/hhc/ray_wp/ray-ascend
configfile: pyproject.toml
plugins: cov-6.2.1, hydra-core-1.3.2, jaxtyping-0.3.2, anyio-4.8.0, random-order-1.2.0, mock-3.14.1, typeguard-4.3.0
collected 1 item                                                                                                                                                 

tests/direct_transport/test_hccl_tensor_transport.py::test_hccl_tensor_transport PASSED                                                                    [100%]

======================================================================== warnings summary ========================================================================
tests/direct_transport/test_hccl_tensor_transport.py::test_hccl_tensor_transport
  /usr/local/python3.10.16/lib/python3.10/site-packages/ray/_private/worker.py:2052: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================= tests coverage =========================================================================
________________________________________________________ coverage: platform linux, python 3.10.16-final-0 ________________________________________________________

Name                                                      Stmts   Miss  Cover   Missing
---------------------------------------------------------------------------------------
ray_ascend/__init__.py                                        0      0   100%
ray_ascend/_version.py                                       13     13     0%   4-34
ray_ascend/collective/__init__.py                             2      0   100%
ray_ascend/collective/hccl_collective_group.py              334    259    22%   30, 34-36, 65-83, 94-102, 118-119, 122-123, 126-130, 136-146, 150-162, 167, 173-181, 197-222, 241-274, 291-315, 326-333, 347-374, 393-426, 441-461, 476-496, 499-512, 524-554, 558-560, 572-585, 589-615, 625-636, 640-642, 647-653, 668-677
ray_ascend/direct_transport/__init__.py                       3      0   100%
ray_ascend/direct_transport/hccl_tensor_transport.py          9      2    78%   21-22
ray_ascend/direct_transport/yr_tensor_transport.py          116     82    29%   47-49, 52, 56, 60, 66-104, 107-125, 140-152, 160-173, 185, 194-220, 238-248
ray_ascend/direct_transport/yr_tensor_transport_util.py      44     22    50%   9-10, 22-23, 46, 49, 53-55, 58-59, 62-63, 66, 71, 79, 82-83, 86-87, 90-91
---------------------------------------------------------------------------------------
TOTAL                                                       521    378    27%
Coverage HTML written to dir htmlcov
Coverage XML written to file coverage.xml
============================================================ 1 passed, 1 warning in 68.25s (0:01:08) =============================================================

Related issues

Closes #9 #3

@ascend-robot

CLA Signature Guide

@KaisennHu , thanks for your pull request.

The following commit(s) are not associated with a signed Contributor License Agreement (CLA).

Commit: 7d52991 ([feat]: Support hccl tensor tra...)
Reason: The email used in the commit is not linked to a signed CLA. Please verify that it matches the email you used when signing the CLA.

To sign CLA, click here.

To check if your email is configured correctly, refer to the FAQs.

Once you've signed the CLA or updated your email, please comment /check-cla to revalidate the CLA status.

@KaisennHu
Collaborator Author

/check-cla

@ascend-robot

CLA Signature Pass

KaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍

@tianyi-ge
Collaborator

I would suggest a "version-compatible design": for older Ray versions (without register_collective_backend), HCCL and HCCL RDT will not be activated, but YR and HIXL can still be used. If ray>=2.56, then all APIs are available. @KaisennHu
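
One way to read this suggestion is as a simple version gate. The sketch below is illustrative only: `parse_version` and `available_features` are hypothetical helpers, the feature keys are made up for this example, and YR transport is assumed to be usable on every Ray version, as the comment implies.

```python
import re


def parse_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.56.0' or '2.56.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def available_features(ray_version: str) -> dict[str, bool]:
    """Hypothetical helper: which features a given Ray version would enable."""
    # Assumption from the comment: the registry API first appears in Ray 2.56.
    has_registry = parse_version(ray_version) >= (2, 56)
    return {
        "yr_transport": True,  # assumed independent of the registry API
        "hccl_collective": has_registry,
        "hccl_tensor_transport": has_registry,
    }


print(available_features("2.55.0"))
print(available_features("2.56.0"))
```

Comparing `(major, minor)` tuples avoids the pitfalls of comparing version strings lexically, for example "2.9" sorting after "2.56".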

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

Comment thread docs/user_guide/hccl_collective.md Outdated
import ray
from ray.util import collective
from ray_ascend.collective import HCCLGroup
from ray_ascend.collective.hccl_collective_group import HCCLGroup
Collaborator

Importing HCCLGroup is not necessary; just call create_collective_group.

def backend(cls) -> Backend:
"""Return the backend type for this group."""
return Backend.HCCL
return "HCCL"
Collaborator

setattr is applied automatically here, so Backend.HCCL should work.

Comment thread README.md Outdated

| Ray Version | YR Transport | HCCL Collective | HCCL Tensor Transport (RDT) |
|-------------|-------------|-----------------|-----------------------------|
| 2.55 | ✅ | ❌ | ❌ |
Collaborator

Change 2.55.0 to >=2.55.0, <2.56.0 to avoid ambiguity.

@@ -0,0 +1,234 @@
import pytest
Collaborator

Since this test works, how about deleting the pure HCCL tests?

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

Comment thread ray_ascend/__init__.py
Comment on lines +121 to +134
try:
    from ray.util.collective.backend_registry import register_collective_backend
except ImportError as e:
    import ray

    raise RuntimeError(
        f"register_hccl_collective_backend requires Ray >= 2.56, "
        f"but Ray {ray.__version__} is installed. "
        f"Please upgrade: pip install 'ray>=2.56'"
    ) from e

from .collective.hccl_collective_group import HCCLGroup

register_collective_backend("HCCL", HCCLGroup)
Collaborator

Should we check whether HCCL is installed? For example:

raise ImportError(
    "YR tensor transport requires the [yr] extra dependency. "
    "Please install it with: pip install ray-ascend[yr]"
) from e

Collaborator Author

Being able to import torch_npu means that hccl is definitely installed.
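
The kind of check discussed in this thread could be sketched as follows. `require_module` is a hypothetical helper, not part of ray-ascend; only `importlib.util.find_spec` is a real standard-library call, and the install hint is supplied by the caller.

```python
import importlib.util


def require_module(module_name: str, install_hint: str) -> None:
    """Raise ImportError with an actionable hint if a top-level module is missing."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is required but was not found. {install_hint}"
        )


# A standard-library module is always importable, so this passes silently.
require_module("json", "It ships with Python.")

try:
    # Hypothetical module name chosen so that the lookup fails.
    require_module("surely_missing_module_xyz", "Install the matching extra first.")
except ImportError as exc:
    print(exc)
```

Using `find_spec` checks for the package without importing it, so the check stays cheap even when the real dependency is heavy to load.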

)


class HCCLTensorTransport(CollectiveTensorTransport):
Collaborator

Should we implement get_communicator_metadata()?

Collaborator Author

Not needed. It is aligned with NcclTensorTransport.

Comment thread README.md Outdated
Collaborator

We need to update the ray-ascend version.

Collaborator Author

Deleted.

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>


if importlib.util.find_spec("torch_npu") is None:
    raise ImportError("torch_npu not found")
ctypes.CDLL("libhccl.so")
Collaborator

Where is libhccl.so located? In ascend-toolkit/?

Collaborator

Should we remind users to set PATH?

Collaborator Author

@KaisennHu KaisennHu Jun 1, 2026

Yes, libhccl.so is indeed part of the ascend-toolkit (CANN). Users are already reminded via check_backend_availability.
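
For the question about locating libhccl.so, a loader check might look like the sketch below. `try_load_library` is a hypothetical helper and not the check_backend_availability function mentioned above. `ctypes.CDLL` raises `OSError` when the dynamic loader cannot find or open a library, and on Linux the loader consults `LD_LIBRARY_PATH` rather than `PATH`, so that is the variable a reminder would most likely need to name. Whether a given CANN installation sets it is not something this sketch verifies.

```python
import ctypes


def try_load_library(name: str) -> tuple[bool, str]:
    """Try to load a shared library; return (ok, message) instead of raising."""
    try:
        ctypes.CDLL(name)
    except OSError as exc:
        hint = (
            f"Could not load {name}: {exc}. "
            "On Linux, check that its directory is listed in LD_LIBRARY_PATH."
        )
        return False, hint
    return True, f"{name} loaded"


# The result depends on the machine: True only where the library is discoverable.
ok, message = try_load_library("libhccl.so")
print(ok, message)
```

Returning a message instead of raising lets the caller decide whether a missing library is fatal or just disables one backend.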

Collaborator

We should update the type annotation of tensor.

Collaborator

@dpj135 dpj135 left a comment

LGTM. There are still some comments.

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

Development

Successfully merging this pull request may close these issues.

[feat] support hccl tensor transport

6 participants