Conversation

@magicYang1573

What?

Fix a data corruption bug in the Mooncake backend. The Mooncake backend test now passes in CI.

Why?

The Mooncake backend cannot pass NIXL CI because of a data corruption issue.

How?

When NIXL opens a segment in the Mooncake TransferEngine, the metadata from the last segment allocation and deallocation is not released. An additional flush is added to release it.
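
As a rough illustration of the failure mode described above, the following C++ sketch shows how a cached segment lookup can keep returning metadata from an earlier registration after the peer has re-registered its memory, and how an uncached lookup avoids that. Everything here is hypothetical: `SegmentDesc`, `MetadataStore`, `Client`, `openCached`, and `openNoCache` are illustrative stand-ins, not Mooncake TransferEngine or NIXL APIs, and the real `openSegmentNoCache` may behave differently.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a remote segment descriptor (not Mooncake's type).
struct SegmentDesc {
    std::uint64_t base_addr;
    std::uint64_t generation;
};

// Toy metadata store: the source of truth that peers publish to.
class MetadataStore {
public:
    void publish(const std::string &name, SegmentDesc desc) {
        std::lock_guard<std::mutex> lock(mutex_);
        table_[name] = desc;
    }
    SegmentDesc fetch(const std::string &name) const {
        std::lock_guard<std::mutex> lock(mutex_);
        return table_.at(name);
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<std::string, SegmentDesc> table_;
};

// Toy client with a local descriptor cache (assumed behaviour, for illustration).
class Client {
public:
    explicit Client(const MetadataStore &store) : store_(store) {}

    // Cached open: returns the first descriptor ever fetched for this name.
    SegmentDesc openCached(const std::string &name) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(name);
        if (it != cache_.end()) return it->second;
        SegmentDesc desc = store_.fetch(name);
        cache_.emplace(name, desc);
        return desc;
    }

    // Uncached open: always re-fetches and overwrites the cached entry.
    SegmentDesc openNoCache(const std::string &name) {
        std::lock_guard<std::mutex> lock(mutex_);
        SegmentDesc desc = store_.fetch(name);
        cache_[name] = desc;
        return desc;
    }

private:
    const MetadataStore &store_;
    std::mutex mutex_;
    std::unordered_map<std::string, SegmentDesc> cache_;
};

// Returns true when the cached path is stale and the uncached path is fresh.
bool staleCacheReproduced() {
    MetadataStore store;
    Client client(store);

    store.publish("peer", {0x1000, 1});   // first registration
    client.openCached("peer");            // populates the local cache
    store.publish("peer", {0x2000, 2});   // peer frees and re-registers memory

    SegmentDesc stale = client.openCached("peer");
    SegmentDesc fresh = client.openNoCache("peer");
    assert(fresh.generation == 2);        // uncached lookup sees the new entry

    return stale.base_addr == 0x1000 && fresh.base_addr == 0x2000;
}
```

In this toy model, a transfer that targets `stale.base_addr` would write to memory the peer no longer owns, which is one way stale metadata could surface as data corruption.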

@magicYang1573 requested a review from a team as a code owner December 12, 2025 07:51
@copy-pr-bot

copy-pr-bot bot commented Dec 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi magicYang1573! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@pull-request-size bot added size/M and removed size/S labels Dec 12, 2025
@ovidiusm
Contributor

/build

@ovidiusm
Contributor

/ok to test 544596c

@brminich
Contributor

the test failure looks relevant

[2025-12-15T09:38:58.042Z] + ./bin/mooncake_backend_test
[2025-12-15T09:38:58.042Z] WARNING: Logging before InitGoogleLogging() is written to STDERR
[2025-12-15T09:38:58.042Z] I20251215 09:38:57.981823 134840 transfer_engine.cpp:559] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
[2025-12-15T09:38:58.042Z] I20251215 09:38:57.981873 134840 transfer_engine.cpp:101] Transfer Engine parseHostNameWithPort. server_name: 10.209.226.35 port: 12001
[2025-12-15T09:38:58.042Z] I20251215 09:38:57.981906 134840 transfer_engine.cpp:168] Transfer Engine RPC using P2P handshake, listening on 10.209.226.35:16041
[2025-12-15T09:38:58.042Z] I20251215 09:38:57.982048 134840 transfer_engine.cpp:213] Auto-discovering topology...
[2025-12-15T09:38:58.042Z] I20251215 09:38:57.993137 134840 transfer_engine.cpp:228] Topology discovery complete. Found 1 HCAs.
[2025-12-15T09:38:58.042Z] I20251215 09:38:57.993201 134840 rdma_context.cpp:77] Using SIEVE endpoint store
[2025-12-15T09:38:58.042Z] E20251215 09:38:57.996500 134840 rdma_context.cpp:442] Failed to query GID 2 on mlx5_0/�: No data available [61]
[2025-12-15T09:38:58.042Z] E20251215 09:38:57.996513 134840 rdma_context.cpp:442] Failed to query GID 3 on mlx5_0/�: No data available [61]
[2025-12-15T09:38:58.042Z] E20251215 09:38:57.996517 134840 rdma_context.cpp:442] Failed to query GID 4 on mlx5_0/�: No data available [61]
[2025-12-15T09:38:58.042Z] E20251215 09:38:57.996520 134840 rdma_context.cpp:442] Failed to query GID 5 on mlx5_0/�: No data available [61]
[2025-12-15T09:38:58.042Z] E20251215 09:38:57.996523 134840 rdma_context.cpp:442] Failed to query GID 6 on mlx5_0/�: No data available [61]
[2025-12-15T09:38:58.042Z] E20251215 09:38:57.996527 134840 rdma_context.cpp:442] Failed to query GID 7 on mlx5_0/�: No data available [61]
[2025-12-15T09:38:58.042Z] E20251215 09:38:57.996531 134840 rdma_context.cpp:442] Failed to query GID 8 on mlx5_0/�: No data available [61]
[2025-12-15T09:38:58.042Z] E20251215 09:38:57.996533 134840 rdma_context.cpp:442] Failed to query GID 9 on mlx5_0/�: No data available [61]
[2025-12-15T09:38:58.042Z] E20251215 09:38:57.996537 134840 rdma_context.cpp:442] Failed to query GID 10 on mlx5_0/�: No data available [61]
[2025-12-15T09:38:58.042Z] E20251215 09:38:57.996541 134840 rdma_context.cpp:442] Failed to query GID 11 on mlx5_0/�: No data available [61]
[2025-12-15T09:38:58.042Z] E20251215 09:38:57.996543 134840 rdma_context.cpp:442] Failed to query GID 12 on mlx5_0/�: No data available [61]
[2025-12-15T09:38:58.042Z] E20251215 09:38:57.996547 134840 rdma_context.cpp:442] Failed to query GID 13 on mlx5_0/�: No data available [61]

mkdir build && cd build && \
cmake .. -DBUILD_SHARED_LIBS=ON && \
make -j2 && \
cmake .. -DBUILD_SHARED_LIBS=ON -DUSE_CUDA=ON&& \
Contributor

Suggested change
- cmake .. -DBUILD_SHARED_LIBS=ON -DUSE_CUDA=ON&& \
+ cmake .. -DBUILD_SHARED_LIBS=ON -DUSE_CUDA=ON && \

@alogfans
Contributor

@magicYang1573 Looking good to me.

@brminich
Contributor

brminich commented Jan 5, 2026

@magicYang1573 can you please check the CI issues?

@ovidiusm
Contributor

ovidiusm commented Jan 5, 2026

/build

@ovidiusm
Contributor

ovidiusm commented Jan 5, 2026

/ok to test 544596c

const std::string &remote_conn_info) {
std::lock_guard<std::mutex> lock(mutex_);
- auto segment_id = openSegment(engine_, remote_conn_info.c_str());
+ auto segment_id = openSegmentNoCache(engine_, remote_conn_info.c_str());
Contributor

@ovidiusm ovidiusm Jan 5, 2026

We need to merge this change this week if we want it to make it into NIXL 0.9.0, but there are some CI issues caused by the build and test script changes. I suggest opening a separate PR with a minimal change that fixes the memory corruption (this line, I suppose) so it can be merged faster, and changing the tests separately if possible.
