FlagCX is part of FlagOS, a unified, open-source AI system software stack that aims to foster an open technology ecosystem by seamlessly integrating various models, systems, and chips. Guided by the principle of "develop once, migrate across various chips," FlagOS seeks to unlock the full computational potential of hardware, break down the barriers between chip-specific software stacks, and effectively reduce migration costs.
FlagCX is a scalable and adaptive cross-chip communication library. It serves as a platform where developers, researchers, and AI engineers can collaborate on various projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community.
FlagCX leverages native collective communication libraries to provide full single-chip communication support across platforms. Beyond its native x-CCL integrations, FlagCX introduces original device-buffer IPC and device-buffer RDMA technologies, enabling high-performance P2P operations for both cross-chip and single-chip scenarios. These mechanisms can be seamlessly combined with native x-CCL backends to deliver optimized performance for cross-chip collective communications.
The following table summarizes the currently supported communication backends and their corresponding capabilities.
| Backend | NCCL | IXCCL | CNCL | MCCL | XCCL | DUCCL | HCCL | MUSACCL | RCCL | TCCL |
|---|---|---|---|---|---|---|---|---|---|---|
| Mode | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero |
| send | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/✓ | ✓/✓ |
| recv | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/✓ | ✓/✓ |
| broadcast | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/✓ | ✓/✓ |
| gather | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ☓/☓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/✓ | ✓/✓ |
| scatter | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/✓ | ✓/✓ |
| reduce | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/✓ | ✓/✓ |
| allreduce | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/✓ | ✓/✓ |
| allgather | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/✓ | ✓/✓ |
| reducescatter | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/✓ | ✓/✓ |
| alltoall | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/✓ | ✓/✓ |
| alltoallv | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/✓ | ✓/✓ |
| group ops | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/✓ | ✓/✓ | ✓/✓ |
Note that the Homo and Hetero modes refer to communication within homogeneous and across heterogeneous clusters, respectively. The native collective communication libraries are listed below (in alphabetical order):
- CNCL, Cambricon Communications Library.
- DUCCL, DU Collective Communications Library.
- HCCL, Ascend Communications Library.
- IXCCL, Iluvatar Corex Collective Communications Library.
- MCCL, Metax Collective Communications Library.
- MUSACCL, Musa Collective Communications Library.
- NCCL, NVIDIA Collective Communications Library.
- RCCL, ROCm Communication Collectives Library.
- TCCL, TsingMicro Communication Collectives Library.
- XCCL, Kunlunxin XPU Collective Communications Library.
Additionally, FlagCX supports three collective communication libraries for host-side communication:
- BOOTSTRAP: Host-side communication library built using the FlagCX bootstrap component.
- GLOO: Gloo Collective Communications Library.
- MPI: Message Passing Interface (MPI) standard.
FlagCX integrates with upper-layer applications such as PyTorch and PaddlePaddle.
The table below lists the frameworks supported by FlagCX and their related communication operations, where the batch_XXX and XXX_coalesced ops refer to the usage of group primitives (a short usage sketch follows the table).
| Framework | PyTorch | PaddlePaddle |
|---|---|---|
| send | ✓ | ✓ |
| recv | ✓ | ✓ |
| all_gather | ✓ | ✓ |
| all_gather_into_tensor_coalesced | ✓ (in order, no aggregation) | ☓ |
| all_reduce | ✓ | ✓ |
| all_reduce_coalesced | ✓ (in order, no aggregation) | ☓ |
| all_to_all | ✓ | ✓ |
| all_to_all_single | ✓ | ✓ |
| barrier | ✓ | ✓ |
| batch_isend_irecv | ✓ | ✓ |
| broadcast | ✓ | ✓ |
| gather | ✓ | ✓ |
| reduce | ✓ | ✓ |
| reduce_scatter | ✓ | ✓ |
| reduce_scatter_tensor_coalesced | ✓ (in order, no aggregation) | ☓ |
| scatter | ✓ | ✓ |
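For illustration, below is a minimal sketch of driving the batched group primitives from PyTorch. It assumes a process group has already been initialized with the FlagCX backend (see the initialization sketch further below) and uses only standard `torch.distributed` calls; the `ring_exchange` helper itself is hypothetical.

```python
import torch
import torch.distributed as dist

def ring_exchange(tensor: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: send `tensor` to the next rank and receive one
    from the previous rank via batch_isend_irecv, which FlagCX serves
    through its group primitives."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    send_to = (rank + 1) % world_size
    recv_from = (rank - 1) % world_size

    recv_buf = torch.empty_like(tensor)
    ops = [
        dist.P2POp(dist.isend, tensor, send_to),
        dist.P2POp(dist.irecv, recv_buf, recv_from),
    ]
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()
    return recv_buf
```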
Note that PyTorch support is enabled via the FlagCX Torch plugin, which provides native integration with the PyTorch distributed backend. This plugin has undergone comprehensive validation across diverse communication backends and hardware platforms, ensuring robust functionality, consistent performance, and compatibility in multi-chip heterogeneous environments.
| FlagCX Backend | NCCL | IXCCL | CNCL | MCCL | XCCL | DUCCL | HCCL | MUSACCL | RCCL |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch Support | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Tip: To enable heterogeneous cross-chip communication using the PyTorch DDP FlagCX backend, it is recommended to use identical PyTorch versions across all nodes. Mismatched versions may lead to initialization failures during process group setup.
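As a hedged sketch (not taken verbatim from the FlagCX documentation), process-group initialization with the plugin follows the standard `torch.distributed` setup. The backend name `"flagcx"` and the device selection below are assumptions; consult the plugin's build and usage guides for the exact registration details.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def init_flagcx_ddp(model: torch.nn.Module) -> DDP:
    # Assumption: the FlagCX Torch plugin is installed and registers a
    # backend named "flagcx" with torch.distributed.
    # Rank/world-size are read from the environment (e.g. set by torchrun).
    dist.init_process_group(backend="flagcx")

    # Device placement depends on the vendor runtime; CUDA is used here
    # purely as an illustration.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device("cuda", local_rank)
    model = model.to(device)
    return DDP(model, device_ids=[local_rank])
```

When launched with a standard launcher such as torchrun, DDP gradient synchronization then goes through FlagCX on each node.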
Please check the guides on building and testing the software:
After building and testing FlagCX, you can start training models with upper-layer deep learning frameworks such as PyTorch or PaddlePaddle, using FlagCX as the communication backend. We provide detailed user guides for both homogeneous and heterogeneous training across different hardware platforms. Please refer to the docs below:
-
We warmly welcome community contributions to help expand and strengthen the validation matrix.
-
Join our Discussion Channel
This project is licensed under the Apache License (Version 2.0).
