[feat] Support metastore mode for Yuanrong backend init#74
Conversation
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
3bcdf8b to
852c27e
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
There was a problem hiding this comment.
Pull request overview
Adds a new “metastore mode” startup path for the Yuanrong backend so transfer_queue.init() can bring up Yuanrong datasystem without managing an external etcd, targeting multi-node deployments where the metastore runs inside the datasystem worker.
Changes:
- Added metastore connectivity polling and refactored Yuanrong worker startup into helper functions.
- Refactored Yuanrong
auto_initto support both metastore mode and the existing etcd-based mode. - Updated Yuanrong cleanup logic in
close()and extended default config withmetastore_mode/metastore_address.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
transfer_queue/interface.py |
Implements metastore readiness waiting, metastore-mode worker startup (head/worker), refactors etcd/dscli startup & cleanup paths. |
transfer_queue/config.yaml |
Adds metastore-related config keys and removes the documented host field in the Yuanrong defaults. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
852c27e to
33bed8a
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
33bed8a to
89332c6
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
89332c6 to
bbae4c1
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
bbae4c1 to
be6fe9f
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
be6fe9f to
c6b54ea
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
c6b54ea to
fa6db11
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
fa6db11 to
655a553
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
655a553 to
747ecb0
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
747ecb0 to
7073b4a
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
| bundles = [] | ||
| for node in ordered_nodes: | ||
| node_ip = node["NodeManagerAddress"] | ||
| bundles.append({"CPU": 0.1, f"node:{node_ip}": 0.001}) |
There was a problem hiding this comment.
does it require users to start ray with --resources='{"node:xxx": 1}'? Verl users do not likely set this config
51c689d to
2f9bf95
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
2f9bf95 to
8a1c985
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
8a1c985 to
5679d7d
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
| cmd.extend(worker_args.split()) | ||
|
|
||
| node_type = "head node" if is_head else "worker node" | ||
| logger.info(f"Starting Yuanrong datasystem ({node_type}, metastore mode) at {worker_address}") |
There was a problem hiding this comment.
I would suggest printing worker_args here, so the user know their config is correct
|
Could we add |
| stop_exceptions = [] | ||
| # Stop worker nodes (all except head node 0) first | ||
| if len(worker_actors) > 1: | ||
| stop_refs = [actor.stop.remote() for actor in worker_actors[1:]] |
There was a problem hiding this comment.
Why we need to split the stop for head worker and other workers? To make sure the metastore on the head worker will not raise any error or warnings due to the heartbeat loss of worker?
| import ray | ||
| from omegaconf import DictConfig | ||
|
|
||
| from transfer_queue.storage.clients.yuanrong_client import get_local_ip_addresses |
There was a problem hiding this comment.
We can move get_local_ip_addresses into this file
5679d7d to
c0155af
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
c0155af to
54c6b1f
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
54c6b1f to
583171a
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
583171a to
c608bc6
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
c608bc6 to
15bb28b
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
15bb28b to
18c392a
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
tianyi-ge
left a comment
There was a problem hiding this comment.
generally looking good. yuanrong example under docs/ should be updated after this pr
|
Please also update the config in |
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
18c392a to
04fd717
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Description
Simplify Yuanrong backend initialization to exclusively support metastore mode, removing the external etcd dependency. This refactor uses Ray's native cluster discovery and placement groups to manage distributed Yuanrong datasystem workers.
Changes
- Removed: etcd_address, host, port, metastore_mode, metastore_address
- Kept: auto_init, worker_port, metastore_port
- Added: worker_args for additional dscli start arguments
- Host IPs are now auto-detected from ray.nodes() via NodeManagerAddress
- Replace ~100 lines of inline initialization with single function call to initialize_yuanrong_backend(conf)
- Simplify close() to single call to cleanup_yuanrong_resources(value)
- Remove unused imports: shutil, get_local_ip_addresses, etcd-related functions
Related issues
Fixes #50