Ascend · tianyi-ge · Jun 1, 2026 · Mar 5, 2026 · May 30, 2026 · May 30, 2026
diff --git a/.github/workflows/pre-commit.yml b/.github/workflows/pre-commit.yml
@@ -4,9 +4,6 @@ name: pre-commit
 # No need to avoid / cancel lightweight pre-commit jobs
 on:
   pull_request:
-  push:
-    branches:
-      - master
 
 # Declare permissions just read content.
 permissions:

diff --git a/README.md b/README.md
@@ -38,12 +38,6 @@ For performance benchmarks, see the [Performance Benchmark Report](./docs/develo
   - torch == 2.7.1, torch-npu == 2.7.1.post1
   - Ray (same version as ray-ascend)
 
-## Version
-
-| Version   | Release Type             | Doc |
-| --------- | ------------------------ | --- |
-| 0.54.0rc1 | Latest Release Candidate |     |
-
 ## Quick Start
 
 ### Installation
@@ -57,9 +51,14 @@ pip install "ray-ascend[yr]"
 ```python
 import ray
 from ray.util import collective
-from ray_ascend.collective import HCCLGroup
+from ray_ascend import register_hccl_collective_backend
+
+register_hccl_collective_backend()
 
-ray.register_collective_backend("HCCL", HCCLGroup)
+@ray.remote(resources={"NPU": 1})
+class RayActor:
+    def __init__(self):
+        register_hccl_collective_backend()
 
 collective.create_collective_group(
     actors,
@@ -79,16 +78,15 @@ collective.broadcast(tensor, src_rank=0, group_name="my_group")
 import ray
 import torch
 from ray.util.collective import create_collective_group
-from ray.experimental import register_tensor_transport
-from ray_ascend.collective import HCCLGroup
-from ray_ascend.direct_transport import HCCLTensorTransport
-
-ray.register_collective_backend("HCCL", HCCLGroup)
-register_tensor_transport("HCCL", ["npu"], HCCLTensorTransport, torch.Tensor)
+from ray_ascend import register_hccl_tensor_transport
 
+register_hccl_tensor_transport()
 
 @ray.remote(resources={"NPU": 1})
 class RayActor:
+    def __init__(self):
+        register_hccl_tensor_transport()
+
     @ray.method(tensor_transport="HCCL")
     def random_tensor(self):
         return torch.zeros(1024, device="npu")
@@ -135,6 +133,13 @@ npu_tensor = ray.get(sender.transfer_npu_tensor_via_hccs.remote())
 cpu_tensor = ray.get(sender.transfer_cpu_tensor_via_rdma.remote())
 ```
 
+## Ray Version Compatibility
+
+| Ray Version     | YR Transport | HCCL Collective | HCCL Tensor Transport (RDT) |
+| --------------- | ------------ | --------------- | --------------------------- |
+| >= 2.55, < 2.56 | ✅           | ❌              | ❌                          |
+| >= 2.56         | ✅           | ✅              | ✅                          |
+
 ## Contributing
 
 See [CONTRIBUTING](./CONTRIBUTING.md) and [developer guide](https://ascend.github.io/ray-ascend/developer_guide/) for more details—a step-by-step guide to help

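Review note on the README hunk: the compatibility table added above can be read as a simple version gate. The sketch below is illustrative only and is not part of ray-ascend; the function name `supported_features` and the dictionary keys are hypothetical, and it only encodes the two rows shown in the table.

```python
import re


def supported_features(ray_version: str) -> dict:
    """Map a Ray version string to the feature columns of the README table.

    Hypothetical helper for illustration; ray-ascend may gate features differently.
    """
    match = re.match(r"(\d+)\.(\d+)", ray_version)
    if match is None:
        raise ValueError(f"unrecognized Ray version: {ray_version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    if major_minor < (2, 55):
        # The table has no row for versions below 2.55, so refuse to guess.
        raise ValueError(f"Ray {ray_version} is not covered by the table")
    # Both HCCL columns switch on together at Ray 2.56.
    hccl = major_minor >= (2, 56)
    return {
        "yr_transport": True,
        "hccl_collective": hccl,
        "hccl_tensor_transport": hccl,
    }


print(supported_features("2.55.1"))
print(supported_features("2.56.0"))
```

Comparing `(major, minor)` tuples rather than raw strings keeps a version such as `2.100.0` ordered correctly after `2.56.0`.
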
diff --git a/docs/user_guide/api_reference.md b/docs/user_guide/api_reference.md
@@ -1,66 +1,48 @@
 # API Reference
 
-> _Last updated: 03/24/2026_
+> _Last updated: 05/30/2026_
 
-## HCCLGroup
+## register_hccl_collective_backend
 
-The main class for HCCL collective communication.
-
-### Constructor
+Register HCCL collective backend for Ray. Requires Ray >= 2.56.
 
 ```python
-HCCLGroup(world_size: int, rank: int, group_name: str)
-```
+from ray_ascend import register_hccl_collective_backend
 
-### Methods
-
-| Method                                                      | Description                     |
-| ----------------------------------------------------------- | ------------------------------- |
-| `broadcast(tensor, broadcast_options)`                      | Broadcast tensor from root rank |
-| `allreduce(tensor, allreduce_options)`                      | All-reduce tensor across group  |
-| `allgather(tensor_list, tensor, allgather_options)`         | Gather tensors from all ranks   |
-| `reduce(tensor, reduce_options)`                            | Reduce tensor to root rank      |
-| `reducescatter(tensor, tensor_list, reducescatter_options)` | Reduce and scatter              |
-| `send(tensor, send_options)`                                | Send tensor to peer             |
-| `recv(tensor, recv_options)`                                | Receive tensor from peer        |
-| `barrier(barrier_options)`                                  | Synchronize all ranks           |
-| `destroy_group()`                                           | Clean up communicator resources |
+register_hccl_collective_backend()
+```
 
-## YRTensorTransport
+Must be called in both the driver process and each actor's `__init__`.
 
-The main class for YR direct tensor transport.
+## register_hccl_tensor_transport
 
-### Constructor
+Register HCCL backend and tensor transport for Ray. Requires Ray >= 2.56.
 
 ```python
-YRTensorTransport()
-```
-
-### Environment Variables
+from ray_ascend import register_hccl_tensor_transport
 
-| Variable            | Description                                         |
-| ------------------- | --------------------------------------------------- |
-| `YR_DS_WORKER_HOST` | Host address of the YR DataSystem worker (required) |
-| `YR_DS_WORKER_PORT` | Port of the YR DataSystem worker (required)         |
-
-### Methods
+register_hccl_tensor_transport()
+```
 
-| Method                                                                            | Description                               |
-| --------------------------------------------------------------------------------- | ----------------------------------------- |
-| `tensor_transport_backend()`                                                      | Returns "YR"                              |
-| `is_one_sided()`                                                                  | Returns True (one-sided communication)    |
-| `get_ds_client(device_type)`                                                      | Get or create the DataSystem client       |
-| `actor_has_tensor_transport(actor)`                                               | Check if actor has YR transport available |
-| `extract_tensor_transport_metadata(obj_id, gpu_object)`                           | Extract metadata for transport            |
-| `recv_multiple_tensors(obj_id, tensor_transport_metadata, communicator_metadata)` | Receive tensors                           |
-| `garbage_collect(obj_id, tensor_transport_meta)`                                  | Clean up resources                        |
+Must be called in both the driver process and each actor's `__init__`.
 
-### Registration
+## register_yr_tensor_transport
 
-Register YR tensor transport with:
+Register YR tensor transport for Ray and initialize YR backend.
 
 ```python
 from ray_ascend import register_yr_tensor_transport
 
 register_yr_tensor_transport(["npu", "cpu"])
 ```
+
+Must be called in both the driver process and each actor's `__init__`.
+
+### Environment Variables
+
+| Variable               | Default     | Description                                 |
+| ---------------------- | ----------- | ------------------------------------------- |
+| `YR_DS_INIT_MODE`      | `metastore` | Initialization mode (`metastore` or `etcd`) |
+| `YR_DS_WORKER_PORT`    | `31501`     | YR DS worker port                           |
+| `YR_DS_METASTORE_PORT` | `2379`      | Metastore service port                      |
+| `YR_DS_ETCD_ADDRESS`   | -           | Etcd address (required for etcd mode)       |
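
Review note on the environment-variable table: the defaults and the etcd requirement listed above could be resolved roughly as follows. This is a minimal sketch, not ray-ascend's actual loading code; `resolve_yr_ds_settings` and the returned keys are hypothetical names, and only the variable names and defaults come from the table.

```python
import os
from typing import Mapping, Optional


def resolve_yr_ds_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve YR DS settings from environment variables using the table's defaults.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    mode = env.get("YR_DS_INIT_MODE", "metastore")
    if mode not in ("metastore", "etcd"):
        raise ValueError(f"YR_DS_INIT_MODE must be 'metastore' or 'etcd', got {mode!r}")
    settings = {
        "init_mode": mode,
        "worker_port": int(env.get("YR_DS_WORKER_PORT", "31501")),
        "metastore_port": int(env.get("YR_DS_METASTORE_PORT", "2379")),
        "etcd_address": env.get("YR_DS_ETCD_ADDRESS"),  # no default in the table
    }
    if mode == "etcd" and not settings["etcd_address"]:
        # The table marks the etcd address as required in etcd mode.
        raise ValueError("YR_DS_ETCD_ADDRESS is required when YR_DS_INIT_MODE=etcd")
    return settings


print(resolve_yr_ds_settings({}))
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check without touching the real process environment.
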
diff --git a/docs/user_guide/best_practices.md b/docs/user_guide/best_practices.md
@@ -9,7 +9,7 @@
 1. **Device Consistency**: Ensure all tensors used in collective operations reside on
    the same NPU device that was used during communicator initialization.
 
-1. **Group Cleanup**: Always call `destroy_group()` when done to free communicator
+1. **Group Cleanup**: Destroy the collective group when done to free communicator
    resources.
 
 1. **Rank Coordination**: All ranks must participate in collective operations in the
@@ -47,7 +47,7 @@
 **Problem**: "Collective ops must use the same device as communicator initialization"
 
 **Solution**: Ensure the tensor you're passing is on the same NPU device that was
-current when the `HCCLGroup` was created.
+current when the collective group was created.
 
 ### YR Transport Issues
 

diff --git a/docs/user_guide/hccl_collective.md b/docs/user_guide/hccl_collective.md
@@ -1,10 +1,12 @@
 # HCCL Collective Communication
 
-> _Last updated: 03/24/2026_
+> _Last updated: 05/30/2026_
 
 ray-ascend provides HCCL (Huawei Collective Communication Library) support for
 distributed collective operations across Ray actors.
 
+> **Note**: The HCCL collective backend requires Ray >= 2.56.
+
 ## Available Collective Operations
 
 - **broadcast**: Send data from one rank to all ranks
@@ -15,44 +17,30 @@ distributed collective operations across Ray actors.
 - **send/recv**: Point-to-point communication
 - **barrier**: Synchronize all ranks
 
-## Quick Example: HCCL Collective Group
+## Quick Example
 
 ```python
 import ray
+import torch
 from ray.util import collective
-from ray_ascend.collective import HCCLGroup
+from ray_ascend import register_hccl_collective_backend
 
-# Initialize Ray
 ray.init()
+register_hccl_collective_backend()
 
-# Register the HCCL backend
-ray.register_collective_backend("HCCL", HCCLGroup)
-
-# Create actors with NPU resources
 @ray.remote(resources={"NPU": 1})
 class Worker:
     def __init__(self):
-        import torch
-        import torch_npu
-        self.device = torch.npu.current_device()
-
-    def setup_group(self, world_size, rank, group_name):
-        self.group = HCCLGroup(world_size, rank, group_name)
+        register_hccl_collective_backend()
 
     def do_allreduce(self, data):
-        import torch
         tensor = torch.tensor(data, dtype=torch.float32).npu()
-        self.group.allreduce(tensor)
+        collective.allreduce(tensor, group_name="my_hccl_group")
         return tensor.cpu().tolist()
 
-    def destroy(self):
-        self.group.destroy_group()
-
-# Create workers
 world_size = 2
 actors = [Worker.remote() for _ in range(world_size)]
 
-# Create collective group
 collective.create_collective_group(
     actors,
     world_size,
@@ -61,42 +49,59 @@ collective.create_collective_group(
     group_name="my_hccl_group",
 )
 
-# Perform allreduce
 results = ray.get([
     actors[i].do_allreduce.remote([1.0 * (i + 1), 2.0 * (i + 1)])
     for i in range(world_size)
 ])
-print("Allreduce results:", results)  # Both should show [3.0, 6.0]
+print("Allreduce results:", results)
 
-# Cleanup
-ray.get([actor.destroy.remote() for actor in actors])
 ray.shutdown()
 ```
 
-## Using Ray's Collective API
+## Point-to-Point Communication
 
-You can also use Ray's high-level collective API:
+HCCL supports send/recv operations between specific ranks in a collective group:
 
 ```python
 import ray
+import torch
 from ray.util import collective
-from ray_ascend.collective import HCCLGroup
+from ray_ascend import register_hccl_collective_backend
 
 ray.init()
-ray.register_collective_backend("HCCL", HCCLGroup)
+register_hccl_collective_backend()
 
 @ray.remote(resources={"NPU": 1})
 class Worker:
-    def broadcast_tensor(self, src_rank=0):
-        import torch
-        tensor = torch.ones(10).npu() if self.rank == src_rank else torch.zeros(10).npu()
-        collective.broadcast(tensor, src_rank=src_rank, group_name="my_hccl_group")
+    def __init__(self):
+        register_hccl_collective_backend()
+
+    def send_tensor(self, data, dst_rank):
+        tensor = torch.tensor(data, dtype=torch.float32).npu()
+        collective.send(tensor, dst_rank=dst_rank, group_name="p2p_group")
+
+    def recv_tensor(self, shape, src_rank):
+        tensor = torch.zeros(shape, dtype=torch.float32).npu()
+        collective.recv(tensor, src_rank=src_rank, group_name="p2p_group")
         return tensor.cpu().tolist()
 
-# Create and setup group...
+world_size = 2
+actors = [Worker.remote() for _ in range(world_size)]
 
-# Each actor broadcasts in SPMD manner
-results = ray.get([actor.broadcast_tensor.remote() for actor in actors])
+collective.create_collective_group(
+    actors,
+    world_size,
+    list(range(world_size)),
+    backend="HCCL",
+    group_name="p2p_group",
+)
+
+# Rank 0 sends to rank 1; launch both sides before waiting so neither call blocks
+send_ref = actors[0].send_tensor.remote([7.0, 8.0, 9.0], dst_rank=1)
+_, result = ray.get([send_ref, actors[1].recv_tensor.remote((3,), src_rank=0)])
+print("Received:", result)  # [7.0, 8.0, 9.0]
+
+ray.shutdown()
 ```
 
 ## Supported Tensor Types

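Review note on the allreduce example: the diff drops the inline comment stating the expected result. The pure-Python reference below reproduces what a sum allreduce would return for the same inputs. It assumes the default reduction is a sum and does not touch Ray or HCCL; `reference_allreduce_sum` is a hypothetical name used only here.

```python
def reference_allreduce_sum(per_rank_inputs):
    """Return the per-rank results of an elementwise sum allreduce.

    Pure-Python reference for illustration; every rank ends up with the same list.
    """
    if not per_rank_inputs:
        raise ValueError("need at least one rank")
    length = len(per_rank_inputs[0])
    if any(len(values) != length for values in per_rank_inputs):
        raise ValueError("all ranks must contribute tensors of the same length")
    # Sum position by position across ranks.
    reduced = [sum(column) for column in zip(*per_rank_inputs)]
    # Each rank receives its own copy of the reduced values.
    return [list(reduced) for _ in per_rank_inputs]


world_size = 2
inputs = [[1.0 * (i + 1), 2.0 * (i + 1)] for i in range(world_size)]
print(reference_allreduce_sum(inputs))  # [[3.0, 6.0], [3.0, 6.0]]
```

With `world_size = 2`, rank 0 contributes `[1.0, 2.0]` and rank 1 contributes `[2.0, 4.0]`, so both ranks should report `[3.0, 6.0]`.
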
diff --git a/docs/user_guide/hccl_transport.md b/docs/user_guide/hccl_transport.md
@@ -0,0 +1,56 @@
+# HCCL Tensor Transport
+
+> _Last updated: 05/30/2026_
+
+HCCL tensor transport enables zero-copy transfer of NPU tensors between Ray actors via
+HCCS (Huawei Cache Coherence System).
+
+> **Note**: The HCCL tensor transport requires Ray >= 2.56.
+
+## Quick Example
+
+```python
+import ray
+import torch
+from ray.util.collective import create_collective_group
+from ray_ascend import register_hccl_tensor_transport
+
+ray.init()
+register_hccl_tensor_transport()
+
+@ray.remote(resources={"NPU": 1})
+class RayActor:
+    def __init__(self):
+        register_hccl_tensor_transport()
+
+    @ray.method(tensor_transport="HCCL")
+    def random_tensor(self):
+        return torch.zeros(1024, device="npu")
+
+    def sum(self, tensor: torch.Tensor):
+        return torch.sum(tensor)
+
+sender, receiver = RayActor.remote(), RayActor.remote()
+group = create_collective_group([sender, receiver], backend="HCCL")
+
+tensor = sender.random_tensor.remote()
+result = receiver.sum.remote(tensor)
+print(ray.get(result))
+
+ray.shutdown()
+```
+
+## How It Works
+
+`register_hccl_tensor_transport()` registers both the HCCL collective backend and the
+HCCL tensor transport. It must be called in the driver process and in each actor's
+`__init__`.
+
+Under the hood, HCCL tensor transport uses Ray's `CollectiveTensorTransport`
+infrastructure, which reuses the HCCL collective communicator for point-to-point tensor
+transfers. A collective group must be created between the sender and receiver actors
+before using `@ray.method(tensor_transport="HCCL")`.
+
+## Supported Device Types
+
+- **NPU**: Tensors on Ascend NPU devices (via HCCS)
diff --git a/docs/user_guide/index.md b/docs/user_guide/index.md
@@ -38,7 +38,8 @@ pip install "ray-ascend[yr]"
 
 - [Installation](installation.md): Detailed installation and setup instructions
 - [HCCL Collective Communication](hccl_collective.md): Collective operations guide
-- [YR Direct Transport](yr_transport.md): Tensor transport guide
+- [HCCL Tensor Transport](hccl_transport.md): NPU tensor transport via HCCS
+- [YR Direct Transport](yr_transport.md): CPU/NPU tensor transport via RDMA/HCCS
 - [API Reference](api_reference.md): Complete API documentation
 - [Best Practices](best_practices.md): Best practices, troubleshooting, and FAQ