-
Notifications
You must be signed in to change notification settings - Fork 213
Fix issue related to Mooncake Backend data corruption in CI #1114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
👋 Hi magicYang1573! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
|
/build |
|
/ok to test 544596c |
|
the test failure looks relevant |
| mkdir build && cd build && \ | ||
| cmake .. -DBUILD_SHARED_LIBS=ON && \ | ||
| make -j2 && \ | ||
| cmake .. -DBUILD_SHARED_LIBS=ON -DUSE_CUDA=ON&& \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| cmake .. -DBUILD_SHARED_LIBS=ON -DUSE_CUDA=ON&& \ | |
| cmake .. -DBUILD_SHARED_LIBS=ON -DUSE_CUDA=ON && \ |
|
@magicYang1573 Looking good to me. |
|
@magicYang1573 casn you pls check CI issues? |
|
/build |
|
/ok to test 544596c |
| const std::string &remote_conn_info) { | ||
| std::lock_guard<std::mutex> lock(mutex_); | ||
| auto segment_id = openSegment(engine_, remote_conn_info.c_str()); | ||
| auto segment_id = openSegmentNoCache(engine_, remote_conn_info.c_str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to merge this change this week if we want it to make it into NIXL 0.9.0. But there are some CI issues due to the build and test scripts changes. I suggest considering opening a PR with a minimal change that fixes the memory corruption (I suppose this line) to be able to merge faster and changing the tests separately, if possible
What?
Fix data corruption bug in mooncake backend. Mooncake backend test in CI is passed now.
Why?
Mooncake backend cannot pass NIXL CI due to data corruption issue.
How?
When NIXL opens segment in Mooncake TransferEngine, the metadata in the last segment allocation and deallocation is not released. An additional flush is added.