feat: continuous metadata publish retry via heartbeat thread by zhengluo-nv · Pull Request #214 · ai-dynamo/modelexpress

zhengluo-nv · 2026-04-08T23:25:50Z

Summary

Refresh PR feat: continuous metadata publish retry via heartbeat thread #214 onto current main.
Keep the still-relevant behavior from the original PR: initial metadata publish can now run from HeartbeatThread and retry on heartbeat ticks until success or MX_PUBLISH_TIMEOUT_SECS expires.
Keep current-main metadata package structure, k8s-service behavior, SGLang integration, and pool-registration code instead of reintroducing obsolete older branch changes.
Allow P2P WorkerGrpcServer to start before mx_source_id is known, then late-bind it after publish succeeds.

Test plan

uv run --extra dev pytest tests/test_heartbeat.py tests/test_vllm_loader.py tests/test_k8s_service_client.py

Summary by CodeRabbit

New Features
- Added deferred metadata publication with configurable timeout for heartbeat operations.
- Implemented automatic retry logic with exponential backoff for failed metadata publishing attempts.
- Enhanced P2P mode with improved server initialization and metadata propagation.
Refactor
- Refactored metadata publication to support optional initialization parameters and deferred publishing workflows.
- Improved error handling for retryable versus non-retryable failures during metadata operations.

coderabbitai · 2026-04-08T23:31:20Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 19d87d16-6153-4234-afc7-d14c782bb6b8

📥 Commits

Reviewing files that changed from the base of the PR and between 31f3ccf and d35f7e5.

📒 Files selected for processing (5)

modelexpress_client/python/modelexpress/metadata/heartbeat.py
modelexpress_client/python/modelexpress/metadata/publish.py
modelexpress_client/python/modelexpress/metadata/worker_server.py
modelexpress_client/python/tests/test_heartbeat.py
modelexpress_client/python/tests/test_vllm_loader.py

Walkthrough

This PR refactors metadata publication to defer gRPC calls from startup into an async callback. The worker gRPC server now supports optional/deferred mx_source_id, HeartbeatThread gains publish_fn callback support with retry logic, and the orchestrator wires both P2P and non-P2P flows through the deferred path. Tests validate the mechanism and integration.

Changes

Deferred metadata publication flow

Layer / File(s)	Summary
Worker gRPC server deferred mx_source_id support `modelexpress_client/python/modelexpress/metadata/worker_server.py`	`WorkerServiceServicer` and `WorkerGrpcServer` now treat `mx_source_id` as optional (`str \| None`). Server gains a `port` property and `set_mx_source_id()` method to update the servicer after construction, enabling deferred ID assignment from publish callbacks. `GetTensorManifest` returns empty string when `mx_source_id` is unset.
HeartbeatThread deferred publication mechanism `modelexpress_client/python/modelexpress/metadata/heartbeat.py`	`HeartbeatThread` constructor now accepts optional `publish_fn` callback and `publish_timeout_secs`. When `publish_fn` is provided, the first `_tick()` calls the callback with exponential backoff retries until success or timeout. New `mx_source_id` property exposes the deferred result. Normal READY heartbeats begin after first publish succeeds.
Metadata publication orchestration with deferred callbacks `modelexpress_client/python/modelexpress/metadata/publish.py`	`publish_metadata_and_ready` now creates a `publish_fn` closure that defers the actual gRPC publish call until HeartbeatThread's first tick. In P2P mode, `WorkerGrpcServer` is created immediately, then `publish_fn` publishes and calls `set_mx_source_id()` to configure it. In non-P2P mode, `publish_fn` publishes without gRPC update.
HeartbeatThread deferred publication unit tests `modelexpress_client/python/tests/test_heartbeat.py`	New `TestHeartbeatPublishAndReady` class validates that HeartbeatThread correctly invokes `publish_fn` on first tick, sets `mx_source_id` from the callback result, and retries transient failures until success.
Integration tests for deferred publication flow `modelexpress_client/python/tests/test_vllm_loader.py`	`TestPublishMetadataAndReady` refactored to assert that publication is deferred from startup, `publish_fn()` is extracted from HeartbeatThread and invoked to perform gRPC publish with expected retry/backoff, and `mx_source_id` result is propagated to the gRPC server. P2P mode tests verify `WorkerGrpcServer` creation order and `set_mx_source_id` wiring.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops with deferred delight,
Publishing later, not left nor right,
The heartbeat thread now takes the call,
While gRPC awaits, patient through it all.
Callbacks and retries dance in the night. 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 38.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: continuous metadata publish retry via heartbeat thread' accurately and specifically describes the main change—implementing retryable metadata publishing via the heartbeat thread with configurable timeout behavior.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

modelexpress_client/python/modelexpress/worker_server.py (1)

34-49: 🛠️ Refactor suggestion | 🟠 Major

Handle the pre-mx_source_id window in GetTensorManifest().

With the new late-bound flow, the gRPC server can accept manifest requests before set_mx_source_id() runs. fetch_tensor_manifest() always sends a non-empty id, so the current check turns that legitimate startup race into FAILED_PRECONDITION, and the target-side RDMA load can fail before it ever retries another source.

One simple way to make bootstrap requests succeed

     def GetTensorManifest(self, request, context):
-        if request.mx_source_id and request.mx_source_id != self._mx_source_id:
+        if (
+            self._mx_source_id is not None
+            and request.mx_source_id
+            and request.mx_source_id != self._mx_source_id
+        ):
             context.abort(
                 grpc.StatusCode.FAILED_PRECONDITION,
                 f"mx_source_id mismatch: expected {self._mx_source_id}, "
                 f"got {request.mx_source_id}",
             )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@modelexpress_client/python/modelexpress/worker_server.py` around lines 34 -
49, GetTensorManifest currently aborts when request.mx_source_id differs from
self._mx_source_id, which breaks the startup race where the server hasn't run
set_mx_source_id yet; modify GetTensorManifest so that if self._mx_source_id is
None and request.mx_source_id is provided, it accepts the request (and
optionally initializes self._mx_source_id = request.mx_source_id) instead of
aborting, otherwise keep the existing mismatch check and abort behavior; update
logic in GetTensorManifest (referencing GetTensorManifest, set_mx_source_id, and
_mx_source_id) so bootstrap manifest requests succeed when the server-side id is
not yet set.

🧹 Nitpick comments (1)

modelexpress_client/python/tests/test_vllm_loader.py (1)

453-458: Force centralized mode in the non-P2P tests.

Both tests inherit MX_P2P_METADATA from the runner because the env patch only sets MX_CONTIGUOUS_REG. If CI or a local shell already has P2P enabled, start_metadata_publisher() takes the other branch and these assertions become flaky.

Proposed test hardening

-        with patch.dict("os.environ", {"MX_CONTIGUOUS_REG": "0"}), \
+        with patch.dict("os.environ", {"MX_CONTIGUOUS_REG": "0", "MX_P2P_METADATA": "0"}), \
             patch("modelexpress.metadata.HeartbeatThread", return_value=mock_hb) as hb_cls:
             start_metadata_publisher(
                 mx_client, nixl_manager, tensors,
                 global_rank=2, device_id=0, identity=identity, worker_id="inst-uuid",
             )

-        with patch.dict("os.environ", {"MX_CONTIGUOUS_REG": "0"}), \
+        with patch.dict("os.environ", {"MX_CONTIGUOUS_REG": "0", "MX_P2P_METADATA": "0"}), \
             patch("modelexpress.metadata.HeartbeatThread", return_value=mock_hb) as hb_cls:
             start_metadata_publisher(
                 mx_client, nixl_manager, {}, global_rank=0,
                 device_id=0, identity=identity, worker_id="w-1",
             )

Also applies to: 483-488

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@modelexpress_client/python/tests/test_vllm_loader.py` around lines 453 - 458,
The test is flaky because it only patches MX_CONTIGUOUS_REG and leaves
MX_P2P_METADATA inherited from the environment; update the test patch in
modelexpress_client/python/tests/test_vllm_loader.py so the with patch.dict(...)
includes "MX_P2P_METADATA": "0" (in the same context where MX_CONTIGUOUS_REG is
set) to force centralized mode before calling
start_metadata_publisher(mx_client, nixl_manager, tensors, ...); apply the same
change to the other similar test block that calls start_metadata_publisher
around the 483-488 region.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelexpress_client/python/modelexpress/heartbeat.py`:
- Around line 169-180: After a successful call to _publish_fn() the code sets
_mx_source_id but leaves _started False until the next READY heartbeat, so
_mark_stale() will return early on shutdown and not remove the published source;
fix by marking the worker as started (set self._started = True) immediately
after assigning self._mx_source_id in the try block (right after the successful
publish log) so that shutdown/_mark_stale() can proceed to mark the source STALE
if the process exits before the next READY heartbeat.

---

Outside diff comments:
In `@modelexpress_client/python/modelexpress/worker_server.py`:
- Around line 34-49: GetTensorManifest currently aborts when
request.mx_source_id differs from self._mx_source_id, which breaks the startup
race where the server hasn't run set_mx_source_id yet; modify GetTensorManifest
so that if self._mx_source_id is None and request.mx_source_id is provided, it
accepts the request (and optionally initializes self._mx_source_id =
request.mx_source_id) instead of aborting, otherwise keep the existing mismatch
check and abort behavior; update logic in GetTensorManifest (referencing
GetTensorManifest, set_mx_source_id, and _mx_source_id) so bootstrap manifest
requests succeed when the server-side id is not yet set.

---

Nitpick comments:
In `@modelexpress_client/python/tests/test_vllm_loader.py`:
- Around line 453-458: The test is flaky because it only patches
MX_CONTIGUOUS_REG and leaves MX_P2P_METADATA inherited from the environment;
update the test patch in modelexpress_client/python/tests/test_vllm_loader.py so
the with patch.dict(...) includes "MX_P2P_METADATA": "0" (in the same context
where MX_CONTIGUOUS_REG is set) to force centralized mode before calling
start_metadata_publisher(mx_client, nixl_manager, tensors, ...); apply the same
change to the other similar test block that calls start_metadata_publisher
around the 483-488 region.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 02994a4f-1d8d-427a-9b79-669465eb2783

📥 Commits

Reviewing files that changed from the base of the PR and between 049e9f4 and 31f3ccf.

📒 Files selected for processing (9)

examples/p2p_transfer_k8s/client/vllm/vllm-single-node-p2p.yaml
modelexpress_client/python/modelexpress/heartbeat.py
modelexpress_client/python/modelexpress/load_strategy/base.py
modelexpress_client/python/modelexpress/load_strategy/default_strategy.py
modelexpress_client/python/modelexpress/load_strategy/rdma_strategy.py
modelexpress_client/python/modelexpress/metadata.py
modelexpress_client/python/modelexpress/nixl_transfer.py
modelexpress_client/python/modelexpress/worker_server.py
modelexpress_client/python/tests/test_vllm_loader.py

💤 Files with no reviewable changes (1)

modelexpress_client/python/modelexpress/nixl_transfer.py

zhengluo-nv · 2026-04-08T23:41:07Z

Addressed code rabbit comments

zhengluo-nv · 2026-05-12T19:28:40Z

@coderabbitai review

coderabbitai · 2026-05-12T19:28:46Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Signed-off-by: Zheng Luo <zheluo@nvidia.com>

zhengluo-nv requested review from AndyDai-nv and nv-hwoo April 8, 2026 23:25

zhengluo-nv self-assigned this Apr 8, 2026

zhengluo-nv added the enhancement New feature or request label Apr 8, 2026

pull-request-size Bot added the size/L label Apr 8, 2026

zhengluo-nv had a problem deploying to GITLAB April 8, 2026 23:25 — with GitHub Actions Error

github-actions Bot added the refactor label Apr 8, 2026

coderabbitai Bot reviewed Apr 8, 2026

View reviewed changes

Comment thread modelexpress_client/python/modelexpress/metadata/heartbeat.py

zhengluo-nv changed the title ~~refactor: offload metadata publishing to heartbeat thread with 30-min…~~ feat: continuous metadata publish retry via heartbeat thread Apr 8, 2026

github-actions Bot added feat and removed refactor labels Apr 8, 2026

zhengluo-nv force-pushed the zheluo/continuous-metadata-publish branch from 31f3ccf to 909ac74 Compare April 8, 2026 23:33

zhengluo-nv had a problem deploying to GITLAB April 8, 2026 23:33 — with GitHub Actions Error

zhengluo-nv force-pushed the zheluo/continuous-metadata-publish branch from 909ac74 to f9910de Compare April 8, 2026 23:34

zhengluo-nv had a problem deploying to GITLAB April 8, 2026 23:34 — with GitHub Actions Error

zhengluo-nv temporarily deployed to GITLAB April 8, 2026 23:40 — with GitHub Actions Inactive

zhengluo-nv force-pushed the zheluo/continuous-metadata-publish branch from 40dbf7f to 674a442 Compare April 9, 2026 18:22

zhengluo-nv had a problem deploying to GITLAB April 9, 2026 18:23 — with GitHub Actions Error

zhengluo-nv force-pushed the zheluo/continuous-metadata-publish branch from 674a442 to 0a6fb27 Compare April 9, 2026 18:23

zhengluo-nv temporarily deployed to GITLAB April 9, 2026 18:23 — with GitHub Actions Inactive

zhengluo-nv temporarily deployed to GITLAB April 10, 2026 17:13 — with GitHub Actions Inactive

zhengluo-nv force-pushed the zheluo/continuous-metadata-publish branch from 0d52585 to d35f7e5 Compare May 12, 2026 19:23

zhengluo-nv temporarily deployed to GITLAB May 12, 2026 19:24 — with GitHub Actions Inactive

feat: retry metadata publish from heartbeat

99f4ad0

Signed-off-by: Zheng Luo <zheluo@nvidia.com>

zhengluo-nv force-pushed the zheluo/continuous-metadata-publish branch from d35f7e5 to 99f4ad0 Compare May 12, 2026 21:06

zhengluo-nv temporarily deployed to GITLAB May 12, 2026 21:06 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: continuous metadata publish retry via heartbeat thread#214

feat: continuous metadata publish retry via heartbeat thread#214
zhengluo-nv wants to merge 1 commit into
mainfrom
zheluo/continuous-metadata-publish

zhengluo-nv commented Apr 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 8, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

zhengluo-nv commented Apr 8, 2026

Uh oh!

zhengluo-nv commented May 12, 2026

Uh oh!

coderabbitai Bot commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zhengluo-nv commented Apr 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

zhengluo-nv commented Apr 8, 2026

Uh oh!

zhengluo-nv commented May 12, 2026

Uh oh!

coderabbitai Bot commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

zhengluo-nv commented Apr 8, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 8, 2026 •

edited

Loading