Merged
3 changes: 0 additions & 3 deletions .github/workflows/pre-commit.yml
@@ -4,9 +4,6 @@ name: pre-commit
 # No need to avoid / cancel lightweight pre-commit jobs
 on:
   pull_request:
-  push:
-    branches:
-      - master

 # Declare permissions just read content.
 permissions:
33 changes: 19 additions & 14 deletions README.md
@@ -38,12 +38,6 @@ For performance benchmarks, see the [Performance Benchmark Report](./docs/develo
 - torch == 2.7.1, torch-npu == 2.7.1.post1
 - Ray (same version as ray-ascend)

-## Version
-
-| Version | Release Type | Doc |
-| --------- | ------------------------ | --- |
-| 0.54.0rc1 | Latest Release Candidate | |
-
 ## Quick Start

 ### Installation
@@ -57,9 +51,14 @@ pip install "ray-ascend[yr]"
 ```python
 import ray
 from ray.util import collective
-from ray_ascend.collective import HCCLGroup
+from ray_ascend import register_hccl_collective_backend
+
+register_hccl_collective_backend()

-ray.register_collective_backend("HCCL", HCCLGroup)
+@ray.remote(resources={"NPU": 1})
+class RayActor:
+    def __init__(self):
+        register_hccl_collective_backend()

 collective.create_collective_group(
     actors,
@@ -79,16 +78,15 @@ collective.broadcast(tensor, src_rank=0, group_name="my_group")
 import ray
 import torch
 from ray.util.collective import create_collective_group
-from ray.experimental import register_tensor_transport
-from ray_ascend.collective import HCCLGroup
-from ray_ascend.direct_transport import HCCLTensorTransport
-
-ray.register_collective_backend("HCCL", HCCLGroup)
-register_tensor_transport("HCCL", ["npu"], HCCLTensorTransport, torch.Tensor)
+from ray_ascend import register_hccl_tensor_transport

+register_hccl_tensor_transport()

 @ray.remote(resources={"NPU": 1})
 class RayActor:
+    def __init__(self):
+        register_hccl_tensor_transport()
+
     @ray.method(tensor_transport="HCCL")
     def random_tensor(self):
         return torch.zeros(1024, device="npu")
@@ -135,6 +133,13 @@ npu_tensor = ray.get(sender.transfer_npu_tensor_via_hccs.remote())
 cpu_tensor = ray.get(sender.transfer_cpu_tensor_via_rdma.remote())
 ```

+## Ray Version Compatibility
+
+| Ray Version | YR Transport | HCCL Collective | HCCL Tensor Transport (RDT) |
+|-------------|-------------|-----------------|-----------------------------|
+| >=2.55, <2.56 | ✅ | ❌ | ❌ |
+| >= 2.56 | ✅ | ✅ | ✅ |
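
The Ray Version Compatibility table can be read as a small lookup from Ray version to feature availability. The sketch below is only an illustration of that table: `FEATURES`, `parse_major_minor`, and `supported_features` are hypothetical names, not part of ray-ascend, and versions below 2.55 are not listed in the table, so treating them as unsupported is an assumption.

```python
import re
from typing import Dict, Tuple

# Hypothetical feature names mirroring the three columns of the table.
FEATURES = ("yr_transport", "hccl_collective", "hccl_tensor_transport")


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as '2.56.0' or '2.55.1rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supported_features(ray_version: str) -> Dict[str, bool]:
    """Return feature availability for a Ray version, following the table rows."""
    major_minor = parse_major_minor(ray_version)
    if major_minor < (2, 55):
        # Assumption: versions not listed in the table are treated as unsupported.
        return {name: False for name in FEATURES}
    # Row ">= 2.56": everything is available; row ">=2.55, <2.56": YR only.
    hccl_ok = major_minor >= (2, 56)
    return {
        "yr_transport": True,
        "hccl_collective": hccl_ok,
        "hccl_tensor_transport": hccl_ok,
    }


print(supported_features("2.55.0"))
print(supported_features("2.56.1"))
```

In a real environment the argument would typically be `ray.__version__`; the literals above keep the sketch runnable without Ray installed.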

## Contributing

See [CONTRIBUTING](./CONTRIBUTING.md) and [developer guide](https://ascend.github.io/ray-ascend/developer_guide/) for more details—a step-by-step guide to help
70 changes: 26 additions & 44 deletions docs/user_guide/api_reference.md
@@ -1,66 +1,48 @@
 # API Reference

-> _Last updated: 03/24/2026_
+> _Last updated: 05/30/2026_

-## HCCLGroup
+## register_hccl_collective_backend

-The main class for HCCL collective communication.
-
-### Constructor
+Register HCCL collective backend for Ray. Requires Ray >= 2.56.

 ```python
-HCCLGroup(world_size: int, rank: int, group_name: str)
-```
+from ray_ascend import register_hccl_collective_backend

-### Methods
-
-| Method | Description |
-| ----------------------------------------------------------- | ------------------------------- |
-| `broadcast(tensor, broadcast_options)` | Broadcast tensor from root rank |
-| `allreduce(tensor, allreduce_options)` | All-reduce tensor across group |
-| `allgather(tensor_list, tensor, allgather_options)` | Gather tensors from all ranks |
-| `reduce(tensor, reduce_options)` | Reduce tensor to root rank |
-| `reducescatter(tensor, tensor_list, reducescatter_options)` | Reduce and scatter |
-| `send(tensor, send_options)` | Send tensor to peer |
-| `recv(tensor, recv_options)` | Receive tensor from peer |
-| `barrier(barrier_options)` | Synchronize all ranks |
-| `destroy_group()` | Clean up communicator resources |
+register_hccl_collective_backend()
+```

-## YRTensorTransport
+Must be called in both the driver process and each actor's `__init__`.

-The main class for YR direct tensor transport.
+## register_hccl_tensor_transport

-### Constructor
+Register HCCL backend and tensor transport for Ray. Requires Ray >= 2.56.

 ```python
-YRTensorTransport()
-```
-
-### Environment Variables
+from ray_ascend import register_hccl_tensor_transport

-| Variable | Description |
-| ------------------- | --------------------------------------------------- |
-| `YR_DS_WORKER_HOST` | Host address of the YR DataSystem worker (required) |
-| `YR_DS_WORKER_PORT` | Port of the YR DataSystem worker (required) |
-
-### Methods
+register_hccl_tensor_transport()
+```

-| Method | Description |
-| --------------------------------------------------------------------------------- | ----------------------------------------- |
-| `tensor_transport_backend()` | Returns "YR" |
-| `is_one_sided()` | Returns True (one-sided communication) |
-| `get_ds_client(device_type)` | Get or create the DataSystem client |
-| `actor_has_tensor_transport(actor)` | Check if actor has YR transport available |
-| `extract_tensor_transport_metadata(obj_id, gpu_object)` | Extract metadata for transport |
-| `recv_multiple_tensors(obj_id, tensor_transport_metadata, communicator_metadata)` | Receive tensors |
-| `garbage_collect(obj_id, tensor_transport_meta)` | Clean up resources |
+Must be called in both the driver process and each actor's `__init__`.

-### Registration
+## register_yr_tensor_transport

-Register YR tensor transport with:
+Register YR tensor transport for Ray and initialize YR backend.

 ```python
 from ray_ascend import register_yr_tensor_transport

 register_yr_tensor_transport(["npu", "cpu"])
 ```
+
+Must be called in both the driver process and each actor's `__init__`.
+
+### Environment Variables
+
+| Variable | Default | Description |
+| ---------------------- | ----------- | ------------------------------------------- |
+| `YR_DS_INIT_MODE` | `metastore` | Initialization mode (`metastore` or `etcd`) |
+| `YR_DS_WORKER_PORT` | `31501` | YR DS worker port |
+| `YR_DS_METASTORE_PORT` | `2379` | Metastore service port |
+| `YR_DS_ETCD_ADDRESS` | - | Etcd address (required for etcd mode) |
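
To make the defaults and the "required for etcd mode" rule in the new Environment Variables table concrete, here is a minimal sketch of resolving those four variables. It is not taken from ray-ascend: `YRDSConfig` and `load_yr_ds_config` are hypothetical names, and how the library itself reads or validates these variables is not shown in this diff.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class YRDSConfig:
    """Hypothetical container for the settings in the table (not a ray-ascend class)."""

    init_mode: str
    worker_port: int
    metastore_port: int
    etcd_address: Optional[str]


def load_yr_ds_config(env: Mapping[str, str] = os.environ) -> YRDSConfig:
    """Resolve the documented variables, applying the defaults listed in the table."""
    init_mode = env.get("YR_DS_INIT_MODE", "metastore")
    if init_mode not in ("metastore", "etcd"):
        raise ValueError(f"YR_DS_INIT_MODE must be 'metastore' or 'etcd', got {init_mode!r}")

    etcd_address = env.get("YR_DS_ETCD_ADDRESS")
    if init_mode == "etcd" and not etcd_address:
        # The table marks the etcd address as required only in etcd mode.
        raise ValueError("YR_DS_ETCD_ADDRESS is required when YR_DS_INIT_MODE=etcd")

    return YRDSConfig(
        init_mode=init_mode,
        worker_port=int(env.get("YR_DS_WORKER_PORT", "31501")),
        metastore_port=int(env.get("YR_DS_METASTORE_PORT", "2379")),
        etcd_address=etcd_address,
    )


# With no variables set, the documented defaults apply.
print(load_yr_ds_config({}))
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise in isolation.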
4 changes: 2 additions & 2 deletions docs/user_guide/best_practices.md
@@ -9,7 +9,7 @@
 1. **Device Consistency**: Ensure all tensors used in collective operations reside on
    the same NPU device that was used during communicator initialization.

-1. **Group Cleanup**: Always call `destroy_group()` when done to free communicator
+1. **Group Cleanup**: Clean up collective group resources when done to free communicator
    resources.

 1. **Rank Coordination**: All ranks must participate in collective operations in the
@@ -47,7 +47,7 @@
 **Problem**: "Collective ops must use the same device as communicator initialization"

 **Solution**: Ensure the tensor you're passing is on the same NPU device that was
-current when the `HCCLGroup` was created.
+current when the collective group was created.

 ### YR Transport Issues

77 changes: 41 additions & 36 deletions docs/user_guide/hccl_collective.md
@@ -1,10 +1,12 @@
 # HCCL Collective Communication

-> _Last updated: 03/24/2026_
+> _Last updated: 05/30/2026_

 ray-ascend provides HCCL (Huawei Collective Communication Library) support for
 distributed collective operations across Ray actors.

+> **Note**: HCCL collective backend requires Ray >= 2.56.
+
 ## Available Collective Operations

 - **broadcast**: Send data from one rank to all ranks
@@ -15,44 +17,30 @@ distributed collective operations across Ray actors.
 - **send/recv**: Point-to-point communication
 - **barrier**: Synchronize all ranks

-## Quick Example: HCCL Collective Group
+## Quick Example

 ```python
 import ray
+import torch
 from ray.util import collective
-from ray_ascend.collective import HCCLGroup
+from ray_ascend import register_hccl_collective_backend

-# Initialize Ray
 ray.init()
+register_hccl_collective_backend()

-# Register the HCCL backend
-ray.register_collective_backend("HCCL", HCCLGroup)
-
-# Create actors with NPU resources
 @ray.remote(resources={"NPU": 1})
 class Worker:
     def __init__(self):
-        import torch
-        import torch_npu
-        self.device = torch.npu.current_device()
-
-    def setup_group(self, world_size, rank, group_name):
-        self.group = HCCLGroup(world_size, rank, group_name)
+        register_hccl_collective_backend()

     def do_allreduce(self, data):
-        import torch
         tensor = torch.tensor(data, dtype=torch.float32).npu()
-        self.group.allreduce(tensor)
+        collective.allreduce(tensor, group_name="my_hccl_group")
         return tensor.cpu().tolist()

-    def destroy(self):
-        self.group.destroy_group()
-
-# Create workers
 world_size = 2
 actors = [Worker.remote() for _ in range(world_size)]

-# Create collective group
 collective.create_collective_group(
     actors,
     world_size,
@@ -61,42 +49,59 @@ collective.create_collective_group(
     group_name="my_hccl_group",
 )

-# Perform allreduce
 results = ray.get([
     actors[i].do_allreduce.remote([1.0 * (i + 1), 2.0 * (i + 1)])
     for i in range(world_size)
 ])
-print("Allreduce results:", results) # Both should show [3.0, 6.0]
+print("Allreduce results:", results)

-# Cleanup
-ray.get([actor.destroy.remote() for actor in actors])
 ray.shutdown()
 ```
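
As a sanity check on the values the allreduce example should produce, the arithmetic can be reproduced in plain Python with no Ray or NPU involved. This is only a sketch of the semantics, assuming the reduction is an elementwise sum (which the removed `[3.0, 6.0]` comment implies); `simulate_allreduce_sum` is a hypothetical helper, not an API of Ray or ray-ascend.

```python
from typing import List, Sequence


def simulate_allreduce_sum(per_rank_inputs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return what each rank holds after a sum allreduce.

    Every rank ends up with the same elementwise sum of all contributions.
    """
    lengths = {len(tensor) for tensor in per_rank_inputs}
    if len(lengths) != 1:
        raise ValueError("all ranks must contribute tensors of the same length")
    reduced = [sum(values) for values in zip(*per_rank_inputs)]
    # Each rank receives its own copy of the reduced result.
    return [list(reduced) for _ in per_rank_inputs]


world_size = 2
# Same inputs as the example: rank 0 sends [1.0, 2.0], rank 1 sends [2.0, 4.0].
inputs = [[1.0 * (i + 1), 2.0 * (i + 1)] for i in range(world_size)]
print(simulate_allreduce_sum(inputs))  # [[3.0, 6.0], [3.0, 6.0]]
```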

-## Using Ray's Collective API
+## Point-to-Point Communication

-You can also use Ray's high-level collective API:
+HCCL supports send/recv operations between specific ranks in a collective group:

 ```python
 import ray
+import torch
 from ray.util import collective
-from ray_ascend.collective import HCCLGroup
+from ray_ascend import register_hccl_collective_backend

 ray.init()
-ray.register_collective_backend("HCCL", HCCLGroup)
+register_hccl_collective_backend()

 @ray.remote(resources={"NPU": 1})
 class Worker:
-    def broadcast_tensor(self, src_rank=0):
-        import torch
-        tensor = torch.ones(10).npu() if self.rank == src_rank else torch.zeros(10).npu()
-        collective.broadcast(tensor, src_rank=src_rank, group_name="my_hccl_group")
+    def __init__(self):
+        register_hccl_collective_backend()
+
+    def send_tensor(self, data, dst_rank):
+        tensor = torch.tensor(data, dtype=torch.float32).npu()
+        collective.send(tensor, dst_rank=dst_rank, group_name="p2p_group")
+
+    def recv_tensor(self, shape, src_rank):
+        tensor = torch.zeros(shape, dtype=torch.float32).npu()
+        collective.recv(tensor, src_rank=src_rank, group_name="p2p_group")
         return tensor.cpu().tolist()

-# Create and setup group...
+world_size = 2
+actors = [Worker.remote() for _ in range(world_size)]

-# Each actor broadcasts in SPMD manner
-results = ray.get([actor.broadcast_tensor.remote() for actor in actors])
+collective.create_collective_group(
+    actors,
+    world_size,
+    list(range(world_size)),
+    backend="HCCL",
+    group_name="p2p_group",
+)
+
+# Rank 0 sends to rank 1
+ray.get(actors[0].send_tensor.remote([7.0, 8.0, 9.0], dst_rank=1))
+result = ray.get(actors[1].recv_tensor.remote((3,), src_rank=0))
+print("Received:", result) # [7.0, 8.0, 9.0]
+
+ray.shutdown()
 ```

 ## Supported Tensor Types
56 changes: 56 additions & 0 deletions docs/user_guide/hccl_transport.md
@@ -0,0 +1,56 @@
+# HCCL Tensor Transport
+
+> _Last updated: 05/30/2026_
+
+HCCL tensor transport enables zero-copy transfer of NPU tensors between Ray actors via
+HCCS (Huawei Cache Coherence System).
+
+> **Note**: HCCL tensor transport requires Ray >= 2.56.
+
+## Quick Example
+
+```python
+import ray
+import torch
+from ray.util.collective import create_collective_group
+from ray_ascend import register_hccl_tensor_transport
+
+ray.init()
+register_hccl_tensor_transport()
+
+@ray.remote(resources={"NPU": 1})
+class RayActor:
+    def __init__(self):
+        register_hccl_tensor_transport()
+
+    @ray.method(tensor_transport="HCCL")
+    def random_tensor(self):
+        return torch.zeros(1024, device="npu")
+
+    def sum(self, tensor: torch.Tensor):
+        return torch.sum(tensor)
+
+sender, receiver = RayActor.remote(), RayActor.remote()
+group = create_collective_group([sender, receiver], backend="HCCL")
+
+tensor = sender.random_tensor.remote()
+result = receiver.sum.remote(tensor)
+print(ray.get(result))
+
+ray.shutdown()
+```
+
+## How It Works
+
+`register_hccl_tensor_transport()` registers both the HCCL collective backend and the
+HCCL tensor transport. It must be called in the driver process and in each actor's
+`__init__`.
+
+Under the hood, HCCL tensor transport uses Ray's `CollectiveTensorTransport`
+infrastructure, which reuses the HCCL collective communicator for point-to-point tensor
+transfers. A collective group must be created between the sender and receiver actors
+before using `@ray.method(tensor_transport="HCCL")`.
+
+## Supported Device Types
+
+- **NPU**: Tensors on Ascend NPU devices (via HCCS)
3 changes: 2 additions & 1 deletion docs/user_guide/index.md
@@ -38,7 +38,8 @@ pip install "ray-ascend[yr]"

 - [Installation](installation.md): Detailed installation and setup instructions
 - [HCCL Collective Communication](hccl_collective.md): Collective operations guide
-- [YR Direct Transport](yr_transport.md): Tensor transport guide
+- [HCCL Tensor Transport](hccl_transport.md): NPU tensor transport via HCCS
+- [YR Direct Transport](yr_transport.md): CPU/NPU tensor transport via RDMA/HCCS
 - [API Reference](api_reference.md): Complete API documentation
 - [Best Practices](best_practices.md): Best practices, troubleshooting, and FAQ
