feat(experimental): integrate Ray RDT for weight syncing #1305
Code Review
This pull request introduces the Ray Direct Transport (RDT) weight update backend, which utilizes one-sided RDMA (YR for NPU, NIXL for GPU) for weight synchronization between training and inference workers. Key additions include new HTTP endpoints for both services, a scheduler bridge for the inference service, and an FSDP adapter for the training service. Feedback highlights a potential bug in response handling within the gateway, opportunities to reduce code duplication in parameter unfusing logic, and a suggestion to make an internal dispatch method private to prevent API misuse.
Hi @garrett4wade, the
)

WEIGHT_UPDATE_BACKEND_ENV = "AREAL_WEIGHT_UPDATE_BACKEND"
BACKEND_AWEX = "awex"
This environment variable is currently only used in scheduler.py, while the other places that need it still hard-code the value. It is recommended to unify them.
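One way to act on this suggestion is to route every lookup through a single helper instead of reading the variable in several places. The sketch below is only an illustration: `WEIGHT_UPDATE_BACKEND_ENV` and `BACKEND_AWEX` come from the diff above, while `BACKEND_RDT`, the helper name `resolve_weight_update_backend`, and the choice of `awex` as the default are assumptions, not part of this PR.

```python
import os

# Constants taken from the diff under review.
WEIGHT_UPDATE_BACKEND_ENV = "AREAL_WEIGHT_UPDATE_BACKEND"
BACKEND_AWEX = "awex"
# Assumption: an "rdt" constant mirroring the `mode=rdt` value seen in the logs.
BACKEND_RDT = "rdt"

_KNOWN_BACKENDS = (BACKEND_AWEX, BACKEND_RDT)


def resolve_weight_update_backend(default=BACKEND_AWEX, environ=None):
    """Return the selected backend name (hypothetical helper).

    `environ` can be injected for testing; it falls back to os.environ.
    The default of "awex" is an assumption made for this sketch.
    """
    env = os.environ if environ is None else environ
    value = env.get(WEIGHT_UPDATE_BACKEND_ENV, default).strip().lower()
    if value not in _KNOWN_BACKENDS:
        raise ValueError(
            f"{WEIGHT_UPDATE_BACKEND_ENV}={value!r} is not one of {_KNOWN_BACKENDS}"
        )
    return value
```

Callers in scheduler.py and elsewhere could then import the helper rather than repeating the string literal, so a rename or a new backend would touch one module.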
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
Hi @garrett4wade @sitabulaixizawaluduo, I benchmarked AWEX (118ms) and RDT (307ms) backend performance:
(AReaL) 20260521-16:46:57.646 WeightUpdateController INFO: Connected pair 'test_megatron_dp_e2e' (mode=awex, colocate=False)
(AReaL) 20260521-16:46:57.762 AwexSGLangAdapter INFO: [Awex-IW-Timing] prep=0.0ms | get_params=1.9ms | build_ops=3.0ms | nccl_recv=103.2ms | copy_non_contiguous=0.0ms | barrier=0.1ms | total=108.4ms
(AReaL) 20260521-16:46:57.762 AwexMegatronAdapter INFO: [Awex-TW-Timing] prep=0.0ms | get_params=89.5ms | build_ops=5.3ms | nccl_send=13.3ms | barrier=0.2ms | total=108.4ms
(AReaL) 20260521-16:46:57.767 WeightUpdateGateway INFO: Weight update completed for pair 'test_megatron_dp_e2e' v1 (118.0ms)
(AReaL) 20260526-09:51:14.894 WeightUpdateGateway INFO: Connected RDT pair 'test_rdt_megatron_dp_e2e'
(AReaL) 20260526-09:51:14.895 WeightUpdateController INFO: Connected pair 'test_rdt_megatron_dp_e2e' (mode=rdt, colocate=False)
(WeightTransportActor pid=4050813) 2026-05-26 09:51:11 NIXL INFO _api.py:369 Backend UCX was instantiated
(WeightTransportActor pid=4050813) 2026-05-26 09:51:11 NIXL INFO _api.py:247 Initialized NIXL agent: a8fba89883cffc63ebcb4ca924000000
(AReaL) 20260526-09:51:14.987 RDTTWBlueprint INFO: [RDT-TW-Timing] get_params=34.7ms | slice_ipc=10.7ms | store_handles=36.2ms | total=81.7ms
(AReaL) 20260526-09:51:14.987 RDTTWBlueprint INFO: [RDT-TW] Prepared weights for pair 'test_rdt_megatron_dp_e2e' v1
(AReaL) 20260526-09:51:14.993 RDTSGLangAdapter INFO: TransferPlan: send_ranks=[1], tw_indices=[0], infer_world_size=1
(AReaL) 20260526-09:51:14.993 RDTSGLangAdapter INFO: [RDT-IW] Pulling from TW shards [0] for pair 'test_rdt_megatron_dp_e2e' v1
(AReaL) 20260526-09:51:14.996 RDTSGLangAdapter INFO: [RDT-IW] Submitted 1 RPCs, calling ray.get...
(AReaL) 20260526-09:51:15.169 RDTSGLangAdapter INFO: [RDT-IW] Unpacked 1 buffers, total tensors=310, total_bytes=1136.9MB, unpack_time=4.3ms
(AReaL) 20260526-09:51:15.179 RDTSGLangAdapter INFO: Applied TransferPlan: 0 non-contiguous pairs handled
(AReaL) 20260526-09:51:15.198 RDTSGLangAdapter INFO: [RDT-IW-Timing] prep=0.2ms | rpc_submit=2.6ms | ray_get=168.5ms | unpack=4.3ms | apply_model=11.0ms | cleanup=18.9ms | total=205.5ms
(AReaL) 20260526-09:51:15.202 WeightUpdateGateway INFO: RDT timing breakdown: tw_prepare_ipc_handles=87.3ms, iw_pull_weights=211.9ms
(AReaL) 20260526-09:51:15.206 WeightUpdateGateway INFO: Weight update completed for pair 'test_rdt_megatron_dp_e2e' v1 (307.4ms)
RDT currently underperforms because TW and IW run as standalone processes rather than native Ray Actors, introducing the following overhead:
RDT offers plug-and-play flexibility for dynamic node scaling without extra communication groups, provided that IW and TW are implemented as Ray Actors.
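To compare the two backends stage by stage, the `*-Timing` log lines quoted above can be parsed into dictionaries. The following is a minimal sketch that assumes the `name=value ms | ...` layout shown in those lines; `parse_timing` and `dominant_stage` are hypothetical helper names, not functions from AReaL.

```python
import re

# Matches fields such as "ray_get=168.5ms" in a timing log line.
_FIELD = re.compile(r"(\w+)=([\d.]+)ms")


def parse_timing(line):
    """Map each stage name in a '[...-Timing]' log line to milliseconds."""
    # Only look at the text after the "Timing]" tag to skip the timestamp.
    _, _, payload = line.partition("Timing]")
    return {name: float(value) for name, value in _FIELD.findall(payload)}


def dominant_stage(timing):
    """Return (stage, share of total) for the slowest stage, ignoring 'total'."""
    stages = {k: v for k, v in timing.items() if k != "total"}
    name = max(stages, key=stages.get)
    return name, stages[name] / timing["total"]


rdt_iw = parse_timing(
    "(AReaL) 20260526-09:51:15.198 RDTSGLangAdapter INFO: [RDT-IW-Timing] "
    "prep=0.2ms | rpc_submit=2.6ms | ray_get=168.5ms | unpack=4.3ms | "
    "apply_model=11.0ms | cleanup=18.9ms | total=205.5ms"
)
awex_iw = parse_timing(
    "(AReaL) 20260521-16:46:57.762 AwexSGLangAdapter INFO: [Awex-IW-Timing] "
    "prep=0.0ms | get_params=1.9ms | build_ops=3.0ms | nccl_recv=103.2ms | "
    "copy_non_contiguous=0.0ms | barrier=0.1ms | total=108.4ms"
)
```

Applied to the lines above, `dominant_stage(rdt_iw)` picks out `ray_get` and `dominant_stage(awex_iw)` picks out `nccl_recv`, so on the inference side both backends spend most of their time in the transfer step itself.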
Description
This PR implements the RDT (Ray Direct Transport) weight syncing backend.
Core changes:
Key features:
Related Issue
Fixes #1243
Type of Change
Checklist
pre-commit run --all-files
./docs/build_all.sh
main
/review-pr command
/create-pr