feat: add TRT-LLM engine integration#280
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
WalkthroughThis PR adds TensorRT-LLM engine support to ModelExpress using the shared load strategy framework. It extends LoadContext with metadata configuration fields, refactors strategies and adapters to use context instead of environment variables, implements a TrtllmAdapter and MXCheckpointLoader, and includes comprehensive tests and Docker build infrastructure. ChangesTensorRT-LLM Integration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
examples/p2p_transfer_k8s/client/trtllm/Dockerfile (1)
4-5: ⚡ Quick winPin the TRT-LLM base image to a fixed tag/digest.
Using
:latestmakes this image non-reproducible and can break patch assumptions as upstream changes. Prefer a fixed release tag (and ideally digest) as the default.Suggested change
-ARG TRTLLM_IMAGE=nvcr.io/nvidia/tensorrt-llm/release:latest +ARG TRTLLM_IMAGE=nvcr.io/nvidia/tensorrt-llm/release:<fixed-version>@sha256:<digest>🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/p2p_transfer_k8s/client/trtllm/Dockerfile` around lines 4 - 5, The Dockerfile currently uses ARG TRTLLM_IMAGE with a :latest default which makes builds non-reproducible; change ARG TRTLLM_IMAGE to a fixed release tag or (preferably) an image digest (e.g., nvcr.io/...:vX.Y.Z or nvcr.io/...@sha256:...) and keep FROM ${TRTLLM_IMAGE} so callers can still override the ARG; update any README/build scripts to mention the pinned default and how to override TRTLLM_IMAGE if needed.examples/p2p_transfer_k8s/client/trtllm/install_modelexpress_client.sh (1)
9-12: ⚡ Quick winPin
grpcioandgrpcio-toolsto the same exact version.Range-based constraints here can drift and produce non-reproducible builds or tooling/runtime skew for generated gRPC code.
Suggested change
pip install --no-cache-dir \ - "grpcio>=1.66.2" \ - "grpcio-tools<=1.66.2" \ + "grpcio==1.66.2" \ + "grpcio-tools==1.66.2" \ "protobuf>=5.27.0,<6.0.0"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/p2p_transfer_k8s/client/trtllm/install_modelexpress_client.sh` around lines 9 - 12, The pip install command currently uses range constraints for grpcio and grpcio-tools which can drift; update the install line to pin both packages to the same exact version (e.g., replace "grpcio>=1.66.2" and "grpcio-tools<=1.66.2" with exact pins like "grpcio==1.66.2" and "grpcio-tools==1.66.2") so generated gRPC code and runtime use the identical grpc versions while leaving the protobuf constraint as-is.examples/p2p_transfer_k8s/client/trtllm/fix_nixl_runpath.py (1)
19-21: ⚡ Quick winAvoid hardcoding Python 3.12
dist-packagesfor binding discovery.This will fail if the base image Python/site-packages layout changes. Discover candidate site-packages dynamically, then glob under
nixl_cu13.Suggested change
import glob import os +import site import subprocess @@ - bindings = glob.glob("/usr/local/lib/python3.12/dist-packages/nixl_cu13/_bindings*.so") + bindings: list[str] = [] + for pkg_dir in site.getsitepackages(): + bindings.extend(glob.glob(f"{pkg_dir}/nixl_cu13/_bindings*.so"))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/p2p_transfer_k8s/client/trtllm/fix_nixl_runpath.py` around lines 19 - 21, The code currently hardcodes "/usr/local/lib/python3.12/dist-packages" when building the glob for bindings (variable bindings), which breaks on different Python/site-packages layouts; change the discovery to iterate site-packages locations obtained from site.getsitepackages(), sysconfig.get_paths()["purelib"], and sys.path (filtering existing directories), and then run glob.glob(os.path.join(site_pkg, "nixl_cu13", "_bindings*.so")) across those locations, collecting matches into bindings; keep the existing check on len(bindings) and nixl_lib_dir usage but replace the hardcoded path construction with this dynamic site-packages discovery approach.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelexpress_client/python/modelexpress/load_strategy/base.py`:
- Around line 188-204: publish_loaded_model currently always registers tensors,
publishes metadata, and retains source runtime even when the incoming LoadResult
is marked publishable=False; modify publish_loaded_model so after calling
_as_load_result(result_or_model) you check result.publishable and return early
when it's False, avoiding calls to register_tensors, publish_metadata, and
_retain_source_runtime for non-publishable results (while still handling cases
where a raw nn.Module is passed by relying on _as_load_result semantics). Ensure
the check uses the LoadResult.publishable attribute and that the rest of the
function (register_tensors, publish_metadata, _retain_source_runtime) only runs
when publishable is True.
---
Nitpick comments:
In `@examples/p2p_transfer_k8s/client/trtllm/Dockerfile`:
- Around line 4-5: The Dockerfile currently uses ARG TRTLLM_IMAGE with a :latest
default which makes builds non-reproducible; change ARG TRTLLM_IMAGE to a fixed
release tag or (preferably) an image digest (e.g., nvcr.io/...:vX.Y.Z or
nvcr.io/...@sha256:...) and keep FROM ${TRTLLM_IMAGE} so callers can still
override the ARG; update any README/build scripts to mention the pinned default
and how to override TRTLLM_IMAGE if needed.
In `@examples/p2p_transfer_k8s/client/trtllm/fix_nixl_runpath.py`:
- Around line 19-21: The code currently hardcodes
"/usr/local/lib/python3.12/dist-packages" when building the glob for bindings
(variable bindings), which breaks on different Python/site-packages layouts;
change the discovery to iterate site-packages locations obtained from
site.getsitepackages(), sysconfig.get_paths()["purelib"], and sys.path
(filtering existing directories), and then run glob.glob(os.path.join(site_pkg,
"nixl_cu13", "_bindings*.so")) across those locations, collecting matches into
bindings; keep the existing check on len(bindings) and nixl_lib_dir usage but
replace the hardcoded path construction with this dynamic site-packages
discovery approach.
In `@examples/p2p_transfer_k8s/client/trtllm/install_modelexpress_client.sh`:
- Around line 9-12: The pip install command currently uses range constraints for
grpcio and grpcio-tools which can drift; update the install line to pin both
packages to the same exact version (e.g., replace "grpcio>=1.66.2" and
"grpcio-tools<=1.66.2" with exact pins like "grpcio==1.66.2" and
"grpcio-tools==1.66.2") so generated gRPC code and runtime use the identical
grpc versions while leaving the protobuf constraint as-is.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7ad1806e-5aac-4b8f-a1bd-3c803e3ff931
📒 Files selected for processing (32)
examples/p2p_transfer_k8s/client/trtllm/Dockerfileexamples/p2p_transfer_k8s/client/trtllm/Dockerfile.ph3-gcp-gb200examples/p2p_transfer_k8s/client/trtllm/README.mdexamples/p2p_transfer_k8s/client/trtllm/fix_nixl_runpath.pyexamples/p2p_transfer_k8s/client/trtllm/install_modelexpress_client.shexamples/p2p_transfer_k8s/client/trtllm/kimi-disagg-mx-tp8-dgd.yamlexamples/p2p_transfer_k8s/client/trtllm/kimi-source-decode-dgd.yamlexamples/p2p_transfer_k8s/client/trtllm/mx-infra-decode.yamlexamples/p2p_transfer_k8s/client/trtllm/patch_trtllm_mx_runtime.pymodelexpress_client/python/modelexpress/engines/sglang/adapter.pymodelexpress_client/python/modelexpress/engines/trtllm/__init__.pymodelexpress_client/python/modelexpress/engines/trtllm/adapter.pymodelexpress_client/python/modelexpress/engines/trtllm/loader.pymodelexpress_client/python/modelexpress/engines/vllm/adapter.pymodelexpress_client/python/modelexpress/load_strategy/__init__.pymodelexpress_client/python/modelexpress/load_strategy/base.pymodelexpress_client/python/modelexpress/load_strategy/context.pymodelexpress_client/python/modelexpress/load_strategy/default_strategy.pymodelexpress_client/python/modelexpress/load_strategy/model_streamer_strategy.pymodelexpress_client/python/modelexpress/load_strategy/rdma_strategy.pymodelexpress_client/python/modelexpress/metadata/client_factory.pymodelexpress_client/python/modelexpress/trtllm_live_transfer.pymodelexpress_client/python/tests/test_k8s_service_client.pymodelexpress_client/python/tests/test_model_streamer_strategy.pymodelexpress_client/python/tests/test_sglang_loader.pymodelexpress_client/python/tests/test_trtllm_loader.pymodelexpress_client/python/tests/test_vllm_adapter.pymodelexpress_client/python/tests/test_vllm_loader.pytrtllm_patches/v1.3.0rc5/README.mdtrtllm_patches/v1.3.0rc5/apply_patches.pytrtllm_patches/v1.3.0rc5/patch_model_loader.pytrtllm_patches/v1.3.0rc5/patch_tp_allgather.py
💤 Files with no reviewable changes (9)
- trtllm_patches/v1.3.0rc5/README.md
- examples/p2p_transfer_k8s/client/trtllm/Dockerfile.ph3-gcp-gb200
- examples/p2p_transfer_k8s/client/trtllm/mx-infra-decode.yaml
- trtllm_patches/v1.3.0rc5/patch_tp_allgather.py
- trtllm_patches/v1.3.0rc5/apply_patches.py
- examples/p2p_transfer_k8s/client/trtllm/kimi-source-decode-dgd.yaml
- examples/p2p_transfer_k8s/client/trtllm/kimi-disagg-mx-tp8-dgd.yaml
- modelexpress_client/python/modelexpress/trtllm_live_transfer.py
- trtllm_patches/v1.3.0rc5/patch_model_loader.py
342970f to
a747b15
Compare
a747b15 to
ebc1971
Compare
ebc1971 to
22c4fd9
Compare
|
Addressed the remaining CodeRabbit nitpicks in
Validation: |
22c4fd9 to
cd9635b
Compare
Signed-off-by: Zheng Luo <zheluo@nvidia.com>
cd9635b to
21b91f6
Compare
zhengluo-nv
left a comment
There was a problem hiding this comment.
Do not merge until we came to agreement with TRTLLM on the integration interface
Overview
Adds the ModelExpress-owned side of the TRT-LLM integration. This pairs with NVIDIA/TensorRT-LLM#14151, where TRT-LLM keeps only the minimal checkpoint-loader delegation surface.
TRT-LLM passes its initialized model and mapping into
modelexpress.engines.trtllm.loader.MXCheckpointLoader. ModelExpress owns source discovery, identity construction, metadata lookup/publication, NIXL transfer, heartbeat, fallback behavior, and TRT-LLM-specific tensor discovery.TRT-LLM does not have separate source and target launch modes in this path. Every server starts the same way; the first compatible server loads natively and publishes metadata, and later servers can receive weights from a ready source through the shared ModelExpress strategy chain.
What changed
TRT-LLM adapter and loader: adds
modelexpress.engines.trtllmwithTrtllmAdapterandMXCheckpointLoader. The loader runs through the sharedLoadStrategyChain, matching the vLLM and SGLang integration shape.Load-format naming: the canonical TRT-LLM load format is
modelexpress, avoidingmxbecause that conflicts with MXFP4/MXFP8 terminology. vLLM native support also usesmodelexpress; vLLM plugin registration keepsmxas a backward-compatible alias for older deployments.Shared strategy behavior: adds small shared hooks needed by TRT-LLM, including selected-strategy tracking and a source-publish path that can run after engine post-load processing.
Post-load publication: TRT-LLM publishes only after its weight loading, post-load processing, and CUDA stream synchronization have completed. This avoids advertising tensor addresses/content before TRT-LLM has finished mutating the model, while still allowing a successful P2P receiver to become a source for later replicas.
Engine-neutral config cleanup: loading config is carried through
LoadContextinstead of re-reading environment variables after context construction.MX_MODEL_URIremains a gate for ModelStreamer, and the resolved model URI is frozen intoLoadContextby each engine adapter.Legacy cleanup: removes the old
modelexpress.trtllm_live_transferimplementation and deletes the historicaltrtllm_patches/v1.3.0rc5patch scripts.TRT-LLM example image: keeps only the minimum image-build helpers under
examples/p2p_transfer_k8s/client/trtllm: install the local ModelExpress client and fix NIXL runpath. Runtime TRT-LLM patching is intentionally not committed here; local test-only patches can live outside the repo.Compatibility notes
TRTLLM_IMAGEalready contains the TRT-LLM ModelExpress hooks; overrideTRTLLM_IMAGEwith a local validation image until those hooks are available in a released TRT-LLM image.Testing
Unit and local validation
Focused validation after rebasing onto current
origin/main:Result: 53 passed.
Additional local checks:
Result: passed.
TRT-LLM e2e
Validated direct
trtllm-servewithnvidia/Kimi-K2.5-NVFP4on the nscale B200 cluster.nvcr.io/nvidian/dynamo-dev/model-express-dev-containers:trtllm-mx-1.3.0rc11-narrow-2992272-publishfix2-amd64/healthand/v1/modelsreturned HTTP 200Follow-ups