Skip to content

feat: continuous metadata publish retry via heartbeat thread#214

Open
zhengluo-nv wants to merge 1 commit into
mainfrom
zheluo/continuous-metadata-publish
Open

feat: continuous metadata publish retry via heartbeat thread#214
zhengluo-nv wants to merge 1 commit into
mainfrom
zheluo/continuous-metadata-publish

Conversation

@zhengluo-nv
Copy link
Copy Markdown
Contributor

@zhengluo-nv zhengluo-nv commented Apr 8, 2026

Summary

  • Refresh PR feat: continuous metadata publish retry via heartbeat thread #214 onto current main.
  • Keep the still-relevant behavior from the original PR: initial metadata publish can now run from HeartbeatThread and retry on heartbeat ticks until success or MX_PUBLISH_TIMEOUT_SECS expires.
  • Keep current-main metadata package structure, k8s-service behavior, SGLang integration, and pool-registration code instead of reintroducing obsolete older branch changes.
  • Allow P2P WorkerGrpcServer to start before mx_source_id is known, then late-bind it after publish succeeds.

Test plan

  • uv run --extra dev pytest tests/test_heartbeat.py tests/test_vllm_loader.py tests/test_k8s_service_client.py

Summary by CodeRabbit

  • New Features

    • Added deferred metadata publication with configurable timeout for heartbeat operations.
    • Implemented automatic retry logic with exponential backoff for failed metadata publishing attempts.
    • Enhanced P2P mode with improved server initialization and metadata propagation.
  • Refactor

    • Refactored metadata publication to support optional initialization parameters and deferred publishing workflows.
    • Improved error handling for retryable versus non-retryable failures during metadata operations.

Review Change Stack

@zhengluo-nv zhengluo-nv self-assigned this Apr 8, 2026
@zhengluo-nv zhengluo-nv added the enhancement New feature or request label Apr 8, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 19d87d16-6153-4234-afc7-d14c782bb6b8

📥 Commits

Reviewing files that changed from the base of the PR and between 31f3ccf and d35f7e5.

📒 Files selected for processing (5)
  • modelexpress_client/python/modelexpress/metadata/heartbeat.py
  • modelexpress_client/python/modelexpress/metadata/publish.py
  • modelexpress_client/python/modelexpress/metadata/worker_server.py
  • modelexpress_client/python/tests/test_heartbeat.py
  • modelexpress_client/python/tests/test_vllm_loader.py

Walkthrough

This PR refactors metadata publication to defer gRPC calls from startup into an async callback. The worker gRPC server now supports optional/deferred mx_source_id, HeartbeatThread gains publish_fn callback support with retry logic, and the orchestrator wires both P2P and non-P2P flows through the deferred path. Tests validate the mechanism and integration.

Changes

Deferred metadata publication flow

Layer / File(s) Summary
Worker gRPC server deferred mx_source_id support
modelexpress_client/python/modelexpress/metadata/worker_server.py
WorkerServiceServicer and WorkerGrpcServer now treat mx_source_id as optional (str | None). Server gains a port property and set_mx_source_id() method to update the servicer after construction, enabling deferred ID assignment from publish callbacks. GetTensorManifest returns empty string when mx_source_id is unset.
HeartbeatThread deferred publication mechanism
modelexpress_client/python/modelexpress/metadata/heartbeat.py
HeartbeatThread constructor now accepts optional publish_fn callback and publish_timeout_secs. When publish_fn is provided, the first _tick() calls the callback with exponential backoff retries until success or timeout. New mx_source_id property exposes the deferred result. Normal READY heartbeats begin after first publish succeeds.
Metadata publication orchestration with deferred callbacks
modelexpress_client/python/modelexpress/metadata/publish.py
publish_metadata_and_ready now creates a publish_fn closure that defers the actual gRPC publish call until HeartbeatThread's first tick. In P2P mode, WorkerGrpcServer is created immediately, then publish_fn publishes and calls set_mx_source_id() to configure it. In non-P2P mode, publish_fn publishes without gRPC update.
HeartbeatThread deferred publication unit tests
modelexpress_client/python/tests/test_heartbeat.py
New TestHeartbeatPublishAndReady class validates that HeartbeatThread correctly invokes publish_fn on first tick, sets mx_source_id from the callback result, and retries transient failures until success.
Integration tests for deferred publication flow
modelexpress_client/python/tests/test_vllm_loader.py
TestPublishMetadataAndReady refactored to assert that publication is deferred from startup, publish_fn() is extracted from HeartbeatThread and invoked to perform gRPC publish with expected retry/backoff, and mx_source_id result is propagated to the gRPC server. P2P mode tests verify WorkerGrpcServer creation order and set_mx_source_id wiring.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops with deferred delight,
Publishing later, not left nor right,
The heartbeat thread now takes the call,
While gRPC awaits, patient through it all.
Callbacks and retries dance in the night. 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: continuous metadata publish retry via heartbeat thread' accurately and specifically describes the main change—implementing retryable metadata publishing via the heartbeat thread with configurable timeout behavior.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelexpress_client/python/modelexpress/worker_server.py (1)

34-49: 🛠️ Refactor suggestion | 🟠 Major

Handle the pre-mx_source_id window in GetTensorManifest().

With the new late-bound flow, the gRPC server can accept manifest requests before set_mx_source_id() runs. fetch_tensor_manifest() always sends a non-empty id, so the current check turns that legitimate startup race into FAILED_PRECONDITION, and the target-side RDMA load can fail before it ever retries another source.

One simple way to make bootstrap requests succeed
     def GetTensorManifest(self, request, context):
-        if request.mx_source_id and request.mx_source_id != self._mx_source_id:
+        if (
+            self._mx_source_id is not None
+            and request.mx_source_id
+            and request.mx_source_id != self._mx_source_id
+        ):
             context.abort(
                 grpc.StatusCode.FAILED_PRECONDITION,
                 f"mx_source_id mismatch: expected {self._mx_source_id}, "
                 f"got {request.mx_source_id}",
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelexpress_client/python/modelexpress/worker_server.py` around lines 34 -
49, GetTensorManifest currently aborts when request.mx_source_id differs from
self._mx_source_id, which breaks the startup race where the server hasn't run
set_mx_source_id yet; modify GetTensorManifest so that if self._mx_source_id is
None and request.mx_source_id is provided, it accepts the request (and
optionally initializes self._mx_source_id = request.mx_source_id) instead of
aborting, otherwise keep the existing mismatch check and abort behavior; update
logic in GetTensorManifest (referencing GetTensorManifest, set_mx_source_id, and
_mx_source_id) so bootstrap manifest requests succeed when the server-side id is
not yet set.
🧹 Nitpick comments (1)
modelexpress_client/python/tests/test_vllm_loader.py (1)

453-458: Force centralized mode in the non-P2P tests.

Both tests inherit MX_P2P_METADATA from the runner because the env patch only sets MX_CONTIGUOUS_REG. If CI or a local shell already has P2P enabled, start_metadata_publisher() takes the other branch and these assertions become flaky.

Proposed test hardening
-        with patch.dict("os.environ", {"MX_CONTIGUOUS_REG": "0"}), \
+        with patch.dict("os.environ", {"MX_CONTIGUOUS_REG": "0", "MX_P2P_METADATA": "0"}), \
             patch("modelexpress.metadata.HeartbeatThread", return_value=mock_hb) as hb_cls:
             start_metadata_publisher(
                 mx_client, nixl_manager, tensors,
                 global_rank=2, device_id=0, identity=identity, worker_id="inst-uuid",
             )
-        with patch.dict("os.environ", {"MX_CONTIGUOUS_REG": "0"}), \
+        with patch.dict("os.environ", {"MX_CONTIGUOUS_REG": "0", "MX_P2P_METADATA": "0"}), \
             patch("modelexpress.metadata.HeartbeatThread", return_value=mock_hb) as hb_cls:
             start_metadata_publisher(
                 mx_client, nixl_manager, {}, global_rank=0,
                 device_id=0, identity=identity, worker_id="w-1",
             )

Also applies to: 483-488

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelexpress_client/python/tests/test_vllm_loader.py` around lines 453 - 458,
The test is flaky because it only patches MX_CONTIGUOUS_REG and leaves
MX_P2P_METADATA inherited from the environment; update the test patch in
modelexpress_client/python/tests/test_vllm_loader.py so the with patch.dict(...)
includes "MX_P2P_METADATA": "0" (in the same context where MX_CONTIGUOUS_REG is
set) to force centralized mode before calling
start_metadata_publisher(mx_client, nixl_manager, tensors, ...); apply the same
change to the other similar test block that calls start_metadata_publisher
around the 483-488 region.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelexpress_client/python/modelexpress/heartbeat.py`:
- Around line 169-180: After a successful call to _publish_fn() the code sets
_mx_source_id but leaves _started False until the next READY heartbeat, so
_mark_stale() will return early on shutdown and not remove the published source;
fix by marking the worker as started (set self._started = True) immediately
after assigning self._mx_source_id in the try block (right after the successful
publish log) so that shutdown/_mark_stale() can proceed to mark the source STALE
if the process exits before the next READY heartbeat.

---

Outside diff comments:
In `@modelexpress_client/python/modelexpress/worker_server.py`:
- Around line 34-49: GetTensorManifest currently aborts when
request.mx_source_id differs from self._mx_source_id, which breaks the startup
race where the server hasn't run set_mx_source_id yet; modify GetTensorManifest
so that if self._mx_source_id is None and request.mx_source_id is provided, it
accepts the request (and optionally initializes self._mx_source_id =
request.mx_source_id) instead of aborting, otherwise keep the existing mismatch
check and abort behavior; update logic in GetTensorManifest (referencing
GetTensorManifest, set_mx_source_id, and _mx_source_id) so bootstrap manifest
requests succeed when the server-side id is not yet set.

---

Nitpick comments:
In `@modelexpress_client/python/tests/test_vllm_loader.py`:
- Around line 453-458: The test is flaky because it only patches
MX_CONTIGUOUS_REG and leaves MX_P2P_METADATA inherited from the environment;
update the test patch in modelexpress_client/python/tests/test_vllm_loader.py so
the with patch.dict(...) includes "MX_P2P_METADATA": "0" (in the same context
where MX_CONTIGUOUS_REG is set) to force centralized mode before calling
start_metadata_publisher(mx_client, nixl_manager, tensors, ...); apply the same
change to the other similar test block that calls start_metadata_publisher
around the 483-488 region.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 02994a4f-1d8d-427a-9b79-669465eb2783

📥 Commits

Reviewing files that changed from the base of the PR and between 049e9f4 and 31f3ccf.

📒 Files selected for processing (9)
  • examples/p2p_transfer_k8s/client/vllm/vllm-single-node-p2p.yaml
  • modelexpress_client/python/modelexpress/heartbeat.py
  • modelexpress_client/python/modelexpress/load_strategy/base.py
  • modelexpress_client/python/modelexpress/load_strategy/default_strategy.py
  • modelexpress_client/python/modelexpress/load_strategy/rdma_strategy.py
  • modelexpress_client/python/modelexpress/metadata.py
  • modelexpress_client/python/modelexpress/nixl_transfer.py
  • modelexpress_client/python/modelexpress/worker_server.py
  • modelexpress_client/python/tests/test_vllm_loader.py
💤 Files with no reviewable changes (1)
  • modelexpress_client/python/modelexpress/nixl_transfer.py

Comment thread modelexpress_client/python/modelexpress/metadata/heartbeat.py
@zhengluo-nv zhengluo-nv changed the title refactor: offload metadata publishing to heartbeat thread with 30-min… feat: continuous metadata publish retry via heartbeat thread Apr 8, 2026
@github-actions github-actions Bot added feat and removed refactor labels Apr 8, 2026
@zhengluo-nv zhengluo-nv force-pushed the zheluo/continuous-metadata-publish branch from 31f3ccf to 909ac74 Compare April 8, 2026 23:33
@zhengluo-nv zhengluo-nv force-pushed the zheluo/continuous-metadata-publish branch from 909ac74 to f9910de Compare April 8, 2026 23:34
@zhengluo-nv
Copy link
Copy Markdown
Contributor Author

Addressed code rabbit comments

@zhengluo-nv
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 12, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Signed-off-by: Zheng Luo <zheluo@nvidia.com>
@zhengluo-nv zhengluo-nv force-pushed the zheluo/continuous-metadata-publish branch from d35f7e5 to 99f4ad0 Compare May 12, 2026 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request feat size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant