
ESM2 changes to work with vLLM #1473

Open

gagank1 wants to merge 10 commits into main from gkaushik/esm2-vllm

Conversation

@gagank1
Collaborator

@gagank1 gagank1 commented Feb 18, 2026

Description

This PR makes the ESM2 model compatible with vLLM. The primary issues were a naming incompatibility (vLLM expects a model. prefix while ESM2 uses esm.) and that NVEsmModel defaults to add_pooling_layer=True when loading the checkpoint, even though the checkpoint is exported without pooler weights.
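As a rough illustration of the naming fix (the actual change lives in convert.py/export.py; the helper and key names below are hypothetical), the prefix remapping amounts to:

```python
def remap_esm_keys(state_dict):
    """Rename 'esm.'-prefixed keys to 'model.' so vLLM's weight loader,
    which resolves weights under the 'model.' prefix, can find them."""
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith("esm."):
            key = "model." + key[len("esm."):]
        remapped[key] = value
    return remapped

# Illustrative keys, not the real ESM2 checkpoint contents:
old = {"esm.encoder.layers.0.weight": 1.0, "lm_head.decoder.weight": 2.0}
new = remap_esm_keys(old)
# 'esm.'-prefixed keys are renamed; all other keys pass through unchanged.
```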

Usage

Run python test_esm2_golden_values.py from inside the container; instructions to build and run the container are provided.

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

Triggering Code Rabbit AI Review

To trigger a code review from CodeRabbit, comment on the pull request with one of its review commands.

See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

@gagank1 gagank1 self-assigned this Feb 18, 2026
@copy-pr-bot

copy-pr-bot bot commented Feb 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 18, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

The ESM2 model architecture was refactored to rename the base model namespace from "esm" to "model" across core components, model classes, and related tests. A new add_pooling_layer configuration option was introduced to control pooler presence. Additionally, vLLM integration infrastructure was added, with Docker setup and validation tooling.

Changes

Cohort / File(s) — Summary

  • Core Model Architecture Refactoring — bionemo-recipes/models/esm2/convert.py, export.py, modeling_esm_te.py
    Namespace renaming from "esm" to "model" in weight mappings, base model references, and state dict handling. Added an add_pooling_layer configuration option (defaults to True) wired through NVEsmConfig and NVEsmModel. Updated base_model_prefix and tied weights references. Removed the quantized_model_init wrapper in favor of the fp8_model_init context. Changed state_dict filtering to exclude both _extra_state and .inv_freq keys for vLLM compatibility.
  • ESM2 Test Suite Updates — bionemo-recipes/models/esm2/tests/test_cp_bshd.py, test_cp_thd.py, test_distributed_fp8.py, test_distributed_strategies.py, test_modeling_esm_te.py
    Updated model layer access paths from model.esm.encoder.layers to model.model.encoder.layers across distributed testing, context-parallel gradient validation, and state_dict verification. No functional logic changes, only attribute path corrections.
  • Recipe Implementation Propagation — bionemo-recipes/recipes/esm2_accelerate_te/..., esm2_native_te/..., esm2_peft_te/.../esm_nv.py
    Applied the namespace refactoring, pooling layer configuration, and weight initialization updates from the core model to the recipe implementations. Updated NVEsmPreTrainedModel, NVEsmForMaskedLM, NVEsmForTokenClassification, and NVEsmLMHead across all recipe variants. Replaced direct "esm" references with "model" and removed legacy TE quantization contexts.
  • Training Script Updates — bionemo-recipes/recipes/esm2_native_te/tests/test_stop_and_go.py, train_ddp.py, train_ddp_cp.py, train_fsdp2.py, train_fsdp2_cp.py
    Updated contact_head deletion and encoder layer access paths from model.esm.* to model.model.* to align with the new model structure. No logic changes, only path corrections for distributed training configurations (DDP, FSDP2, context parallelism).
  • vLLM Integration — bionemo-recipes/vllm/Dockerfile, README.md, launch.sh, test_esm2_golden_values.py
    New vLLM inference infrastructure: the Dockerfile extends the vLLM base image with CUDA 13.1 tooling and transformer_engine built from source; launch.sh provides a Docker run wrapper with optional mounts and a headless mode; test_esm2_golden_values.py validates ESM2 embedding consistency across vLLM, HuggingFace, and hub models with numerical comparison metrics.
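The state_dict filtering described in the core refactoring above (excluding both _extra_state and .inv_freq keys) can be sketched as follows; the helper and key names are illustrative, not the actual export.py code:

```python
def filter_for_vllm(state_dict):
    """Drop Transformer Engine '_extra_state' serialization blobs and
    rotary '.inv_freq' buffers, which vLLM's weight loader does not expect."""
    return {
        k: v
        for k, v in state_dict.items()
        if not k.endswith("_extra_state") and ".inv_freq" not in k
    }

# Illustrative state dict mixing real weights with serialization-only keys:
sd = {
    "model.encoder.layers.0.self_attention._extra_state": b"...",
    "model.encoder.layers.0.rotary_emb.inv_freq": [0.1],
    "model.encoder.layers.0.weight": [1.0],
}
filtered = filter_for_vllm(sd)
# Only the real weight tensor survives filtering.
```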

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 Hop, hop! The model's now refined,
From "esm" paths, we've left behind,
A "model" namespace, clean and bright,
With pooling layers—flexible might!
And vLLM joins the warren's flight, 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check — ❓ Inconclusive. The description covers the basic what (making ESM2 compatible with vLLM) and usage instructions, but is missing a detailed explanation of the changes, their purpose, and how they achieve vLLM compatibility. Resolution: add more comprehensive details about the specific changes made, the rationale behind the namespace refactoring from 'esm' to 'model', the pooling layer changes, and how these changes enable vLLM compatibility.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed. The title 'ESM2 changes to work with vLLM' clearly describes the main objective of the PR: making ESM2 compatible with vLLM.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 96.30%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@broland-hat
Collaborator

@gagank1 : Are you working on a readme?

@gagank1
Collaborator Author

gagank1 commented Feb 23, 2026

/ok to test

@copy-pr-bot

copy-pr-bot bot commented Feb 23, 2026

/ok to test

@gagank1, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@gagank1 gagank1 marked this pull request as ready for review February 23, 2026 17:06
@gagank1
Collaborator Author

gagank1 commented Feb 23, 2026

/ok to test

@copy-pr-bot

copy-pr-bot bot commented Feb 23, 2026

/ok to test

@gagank1, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@gagank1
Collaborator Author

gagank1 commented Feb 23, 2026

/ok to test c34c09b

@gagank1
Collaborator Author

gagank1 commented Feb 23, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Feb 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@gagank1 gagank1 requested a review from broland-hat February 23, 2026 17:17
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (3)
bionemo-recipes/recipes/esm2_peft_te/example_8m_checkpoint/esm_nv.py (1)

405-407: Same _tied_weights_keys dict type concern.

See comment on modeling_esm_te.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/esm2_peft_te/example_8m_checkpoint/esm_nv.py` around
lines 405 - 407, The class variable _tied_weights_keys uses the modern built-in
generic dict[...] typing which may be incompatible with the rest of the
codebase; change its annotation to use typing.Dict[str, str] (and ensure Dict is
imported) or use typing.Mapping if immutability is desired, mirroring the fix
applied in modeling_esm_te.py so the declaration becomes ClassVar[Dict[str,
str]] with the same key/value entries retained.
bionemo-recipes/recipes/esm2_native_te/example_8m_checkpoint/esm_nv.py (1)

405-407: Same _tied_weights_keys dict type concern as in modeling_esm_te.py.

See comment on the canonical file — HF expects list[str], not dict[str, str].

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/esm2_native_te/example_8m_checkpoint/esm_nv.py`
around lines 405 - 407, The _tied_weights_keys typed as ClassVar[dict[str, str]]
should be changed to ClassVar[list[str]] to match HF expectations (same fix as
in modeling_esm_te.py); replace the dict literal with a list of the relevant
parameter names (e.g. ["lm_head.decoder.weight",
"model.embeddings.word_embeddings.weight"]) and update any usages that assume
dict semantics to use the list order or explicit pairing where needed.
bionemo-recipes/recipes/esm2_accelerate_te/example_8m_checkpoint/esm_nv.py (1)

405-407: Same _tied_weights_keys dict type concern.

See comment on modeling_esm_te.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/esm2_accelerate_te/example_8m_checkpoint/esm_nv.py`
around lines 405 - 407, _tied_weights_keys is annotated as a mutable dict
ClassVar which repeats the same typing concern as in modeling_esm_te.py; change
the annotation to an immutable mapping type (e.g., ClassVar[Mapping[str, str]]
from typing) and, to avoid accidental mutation, assign a read-only view (e.g.,
types.MappingProxyType({"lm_head.decoder.weight":
"model.embeddings.word_embeddings.weight"})); update the import list to include
typing.Mapping and types if not present and mirror the same pattern used/fixed
in modeling_esm_te.py.
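The three duplicate comments above propose slightly different fixes (typing.Dict, list[str], or an immutable Mapping). As a sketch of the list-based form one of them suggests, with a stand-in class and a single illustrative key; which annotation HF actually expects depends on the installed transformers version:

```python
from typing import ClassVar

class NVEsmForMaskedLM:
    """Stand-in for the real PreTrainedModel subclass, shown only to
    illustrate the list[str] annotation proposed in the review comment."""

    # One of the suggested forms: a list of tied parameter names rather
    # than a dict mapping. Key name follows the renamed 'model.' namespace.
    _tied_weights_keys: ClassVar[list[str]] = ["lm_head.decoder.weight"]
```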
🧹 Nitpick comments (4)
bionemo-recipes/vllm/launch.sh (1)

50-50: exec $DOCKER_CMD is unquoted — word splitting will break paths with spaces.

If PROJECT_ROOT contains spaces (e.g., /home/user/my projects/...), the -v argument will be incorrectly split into multiple tokens. Use a Bash array to avoid this:

🔧 Proposed fix (array-based approach)

Replace the string-based DOCKER_CMD with an array throughout the script:

-DOCKER_CMD="docker run -itd ..."
+DOCKER_CMD=("docker" "run" "-itd" "--gpus" "all" "--network" "host" "--ipc=host" "-e" "HF_TOKEN" "--rm" "--name" "${CONTAINER}_dev")
 
 if [ "$MOUNT_DIR" = true ]; then
     PROJECT_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
-    DOCKER_CMD="$DOCKER_CMD -v ${PROJECT_ROOT}:/workspace/bionemo-framework"
+    DOCKER_CMD+=("-v" "${PROJECT_ROOT}:/workspace/bionemo-framework")
 fi
 
-DOCKER_CMD="$DOCKER_CMD $CONTAINER /bin/bash"
+DOCKER_CMD+=("$CONTAINER" "/bin/bash")
 
-exec $DOCKER_CMD
+exec "${DOCKER_CMD[@]}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/vllm/launch.sh` at line 50, The exec $DOCKER_CMD call uses an
unquoted string which allows word-splitting and breaks mount paths with spaces
(e.g., PROJECT_ROOT used in -v). Convert the string DOCKER_CMD into a Bash array
(e.g., DOCKER_CMD=(docker run ...)) and update all places that build/append to
DOCKER_CMD so they push elements into the array, then replace exec $DOCKER_CMD
with exec "${DOCKER_CMD[@]}" so each argument (including the -v PROJECT_ROOT
value) is preserved; update any helper code that concatenates DOCKER_CMD to use
array operations instead.
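The word-splitting failure mode called out above can be demonstrated in a few self-contained lines (the path and helper are illustrative, not launch.sh's real values):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Helper that reports how many arguments it received.
count_args() { echo "$#"; }

# A mount path containing a space, as in the review comment's example.
PROJECT_ROOT="/home/user/my projects"

# String form: unquoted expansion splits the path into two words,
# so the single "-v path:target" pair becomes three arguments.
CMD="count_args -v ${PROJECT_ROOT}:/workspace/bionemo-framework"
$CMD

# Array form: each element stays a single argument, space and all.
CMD_ARR=(count_args -v "${PROJECT_ROOT}:/workspace/bionemo-framework")
"${CMD_ARR[@]}"
```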
bionemo-recipes/vllm/Dockerfile (2)

2-3: Base image is hosted on an internal NVIDIA GitLab registry — not pullable outside NVIDIA.

gitlab-master.nvidia.com:5005/dl/dgx/vllm:main-py3.43005406-devel requires internal network/credentials access. The commented-out nvcr.io/nvidia/vllm:26.01-py3 alternative on line 1 is the publicly accessible equivalent. Once an NGC release with vLLM ≥ 0.14 becomes available, switching to the public image will make this recipe usable by external contributors without additional setup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/vllm/Dockerfile` around lines 2 - 3, The Dockerfile's FROM
line currently uses an internal image tag
"gitlab-master.nvidia.com:5005/dl/dgx/vllm:main-py3.43005406-devel" which is not
pullable externally; replace that base image with the public equivalent
"nvcr.io/nvidia/vllm:26.01-py3" (or parametrize the base via a build ARG) in the
FROM instruction so external contributors can build without internal
credentials, and retain a short comment noting the vLLM>=0.14 requirement and to
switch back when an official public image with the needed version is available.

30-30: Pin the transformer_engine version for reproducible builds.

pip install --no-build-isolation transformer_engine[pytorch] with no version specifier will install whichever version is latest at build time. TE releases frequently and has had breaking API changes between major versions (e.g., 1.x → 2.x). A silent version bump can break the integration without any change to this file.

🔧 Proposed fix
-RUN pip install --no-build-isolation transformer_engine[pytorch]
+RUN pip install --no-build-isolation "transformer_engine[pytorch]==<verified_version>"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/vllm/Dockerfile` at line 30, The Dockerfile currently
installs transformer_engine without a version pin (the RUN pip install
--no-build-isolation transformer_engine[pytorch] line); change that to install a
specific, tested TE release by updating that RUN to include an exact version
specifier (for example: RUN pip install --no-build-isolation
transformer_engine==<MAJOR.MINOR.PATCH>[pytorch]) so builds are reproducible and
won’t break on upstream major/minor bumps; optionally add a short comment noting
the chosen compatible version.
bionemo-recipes/vllm/test_esm2_golden_values.py (1)

46-63: sys.path.insert + os.chdir is fragile for test infrastructure.

sys.path.insert(0, ...) at module level (line 46) and os.chdir inside fresh_export make this script sensitive to working directory and import order. This is acceptable for a standalone validation script run manually inside a container, but consider adding a note that this is not designed to run as part of a pytest suite.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/vllm/test_esm2_golden_values.py` around lines 46 - 63, The
test module mutates import paths and working directory (sys.path.insert(0, ...),
os.chdir(...) inside fresh_export) which is fragile for pytest; update the file
to document this by adding a clear module-level comment or docstring near
sys.path.insert and a brief note on fresh_export explaining it intentionally
changes cwd for export_hf_checkpoint and that the script is not intended to be
run under pytest/parallel test runners (referencing sys.path.insert,
ESM2_MODEL_DIR, and fresh_export by name), or alternatively guard execution with
a main-check so pytest won't import/run it implicitly.
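The main-guard alternative mentioned in the prompt above could look like this sketch; ESM2_MODEL_DIR and the path arithmetic are illustrative stand-ins for the script's real values:

```python
import os
import sys

# Stand-in for the script's real model-directory constant.
ESM2_MODEL_DIR = os.path.abspath(os.path.join(os.getcwd(), "..", "models", "esm2"))

def main() -> None:
    # Side effects (sys.path edits, os.chdir) stay inside main(), so merely
    # importing this module, e.g. during pytest collection, is harmless.
    sys.path.insert(0, ESM2_MODEL_DIR)
    print("running standalone golden-value validation")

if __name__ == "__main__":
    main()
```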
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bionemo-recipes/vllm/Dockerfile`:
- Around line 1-36: Add a non-root user and switch to it in the Dockerfile so
the container does not run as root; create a user/group (e.g., bionemo), chown
the application directory (/workspace/bionemo) and any cache/venv paths to that
user, and add a final USER bionemo line. Locate the Dockerfile sections around
WORKDIR /workspace/bionemo and COPY . . and insert user creation
(useradd/addgroup or groupadd) and chown before switching context, then add USER
bionemo at the end so subsequent runtime processes run unprivileged.

In `@bionemo-recipes/vllm/launch.sh`:
- Around line 36-40: The script hardcodes "--name vllm_dev" in DOCKER_CMD which
prevents multiple instances and misrepresents the positional $CONTAINER
argument; update the usage comment to clarify that the positional argument is
the image name (or image_name) and change the DOCKER_CMD assignments in the
HEADLESS branch to use a container name derived from $CONTAINER (for example
--name "$CONTAINER" or --name "${CONTAINER}_dev") instead of "vllm_dev" so the
image argument is also used as the container name and avoids name collisions.

In `@bionemo-recipes/vllm/README.md`:
- Around line 21-23: Change the fenced code block that currently uses the wrong
language tag; locate the block containing the shell command "python
test_esm2_golden_values.py" in the README and replace the opening fence language
identifier from ```python to ```bash so the command is treated as a shell
snippet and highlighted correctly.
- Line 17: The README's "or use `launch.sh`" is misleading because `launch.sh
--mount_dir` mounts the repo to `/workspace/bionemo-framework` while the manual
`docker run` mounts to `/workspace/bionemo`, causing `python
test_esm2_golden_values.py` to fail when run from WORKDIR; fix by either (A)
update `launch.sh` to mount the project root into `/workspace/bionemo` instead
of `/workspace/bionemo-framework` (adjust mount target and any downstream path
references in launch.sh), or (B) update README.md to explicitly document that
`launch.sh --mount_dir` mounts at `/workspace/bionemo-framework` and instruct
users to cd into the correct subdirectory (where `test_esm2_golden_values.py`
lives) before running the test; mention the affected files `launch.sh`,
`README.md`, and `test_esm2_golden_values.py` so reviewers can locate the
changes.

In `@bionemo-recipes/vllm/test_esm2_golden_values.py`:
- Around line 147-210: The script currently only prints comparisons and never
fails; add assertions that enforce the golden-value tolerances using RTOL and
ATOL: for each pair in pairs (refer to the pairs list and variables a, b),
assert np.allclose(a, b, rtol=RTOL, atol=ATOL) (or equivalently assert
(np.abs(a-b) <= ATOL + RTOL * np.abs(b)).all()) and fail the test if not, and
also assert cosine_sim(a, b) exceeds a sensible threshold or that exact is True
when ATOL/RTOL are zero; add per-sequence assertions inside the per-sequence
loop to ensure each sequence max-diff <= ATOL + RTOL * max(|b_i|) so the test
fails on unacceptable drift.
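The enforcement described above could be sketched as follows; the RTOL/ATOL values and array contents are illustrative, not the script's actual tolerances:

```python
import numpy as np

RTOL, ATOL = 1e-3, 1e-5  # illustrative tolerances, not the PR's values

def assert_close(a, b, name):
    """Fail loudly on drift instead of only printing a comparison."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    max_diff = np.max(np.abs(a - b))
    bound = ATOL + RTOL * np.max(np.abs(b))
    assert np.allclose(a, b, rtol=RTOL, atol=ATOL), (
        f"{name}: max diff {max_diff:.3e} exceeds bound {bound:.3e}"
    )

# e.g. comparing a vLLM embedding against a HuggingFace reference:
ref = np.array([0.10, -0.20, 0.30])
out = ref + 1e-6  # within tolerance, so no assertion fires
assert_close(out, ref, "vllm_vs_hf")
```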


ℹ️ Review info

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73cd13d and c34c09b.

📒 Files selected for processing (20)
  • bionemo-recipes/models/esm2/convert.py
  • bionemo-recipes/models/esm2/export.py
  • bionemo-recipes/models/esm2/modeling_esm_te.py
  • bionemo-recipes/models/esm2/tests/test_cp_bshd.py
  • bionemo-recipes/models/esm2/tests/test_cp_thd.py
  • bionemo-recipes/models/esm2/tests/test_distributed_fp8.py
  • bionemo-recipes/models/esm2/tests/test_distributed_strategies.py
  • bionemo-recipes/models/esm2/tests/test_modeling_esm_te.py
  • bionemo-recipes/recipes/esm2_accelerate_te/example_8m_checkpoint/esm_nv.py
  • bionemo-recipes/recipes/esm2_native_te/example_8m_checkpoint/esm_nv.py
  • bionemo-recipes/recipes/esm2_native_te/tests/test_stop_and_go.py
  • bionemo-recipes/recipes/esm2_native_te/train_ddp.py
  • bionemo-recipes/recipes/esm2_native_te/train_ddp_cp.py
  • bionemo-recipes/recipes/esm2_native_te/train_fsdp2.py
  • bionemo-recipes/recipes/esm2_native_te/train_fsdp2_cp.py
  • bionemo-recipes/recipes/esm2_peft_te/example_8m_checkpoint/esm_nv.py
  • bionemo-recipes/vllm/Dockerfile
  • bionemo-recipes/vllm/README.md
  • bionemo-recipes/vllm/launch.sh
  • bionemo-recipes/vllm/test_esm2_golden_values.py

Comment on lines 1 to 36
# FROM nvcr.io/nvidia/vllm:26.01-py3
FROM gitlab-master.nvidia.com:5005/dl/dgx/vllm:main-py3.43005406-devel
# using this because we need vllm >= 0.14 to work with Transformers v5. no released nvidia version with this yet.

# The vLLM image has CUDA 13.1 runtime and nvcc, but missing dev headers (cusparse.h, nvtx, etc.)
# Install cuda-keyring to add NVIDIA's apt repo, then install the dev headers for transformer_engine
RUN apt-get update && apt-get install -y --no-install-recommends wget && \
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb && \
dpkg -i cuda-keyring_1.1-1_all.deb && \
rm cuda-keyring_1.1-1_all.deb && \
apt-get update && apt-get install -y --no-install-recommends \
cuda-nvtx-13-1 \
cuda-cupti-dev-13-1 \
cuda-nvml-dev-13-1 \
libcusparse-dev-13-1 \
libcusolver-dev-13-1 \
libcufft-dev-13-1 \
libnvjitlink-dev-13-1 \
libnvjpeg-dev-13-1 \
libcublasmp0-dev-cuda-13 \
libcudnn9-cuda-13 \
&& rm -rf /var/lib/apt/lists/*

# Install remaining dependencies
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=requirements.txt,target=/requirements.txt \
pip install -r /requirements.txt

# Install transformer_engine from source (force build for CUDA 13.1, not pre-built cu12 wheel)
RUN pip install --no-build-isolation transformer_engine[pytorch]

RUN pip install transformers[torch]==5.0.0


WORKDIR /workspace/bionemo
COPY . .
Contributor


⚠️ Potential issue | 🟠 Major

Container runs as root — add a USER directive.

No USER command is present, so all processes run as root inside the container. This is flagged by Trivy (DS-0002) and violates the principle of least privilege.

🔒 Proposed fix
 WORKDIR /workspace/bionemo
 COPY . .
+
+RUN useradd -m -u 1000 appuser && chown -R appuser /workspace/bionemo
+USER appuser
🧰 Tools
🪛 Trivy (0.69.1)

[error] 1-1: Image user should not be 'root'

Specify at least 1 USER command in Dockerfile with non-root user as argument

Rule: DS-0002

Learn more

(IaC/Dockerfile)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/vllm/Dockerfile` around lines 1 - 36, Add a non-root user and
switch to it in the Dockerfile so the container does not run as root; create a
user/group (e.g., bionemo), chown the application directory (/workspace/bionemo)
and any cache/venv paths to that user, and add a final USER bionemo line. Locate
the Dockerfile sections around WORKDIR /workspace/bionemo and COPY . . and
insert user creation (useradd/addgroup or groupadd) and chown before switching
context, then add USER bionemo at the end so subsequent runtime processes run
unprivileged.

Comment on lines 36 to 40
if [ "$HEADLESS" = true ]; then
DOCKER_CMD="docker run -itd --gpus all --network host --ipc=host -e HF_TOKEN --rm --name vllm_dev"
else
DOCKER_CMD="docker run -it --gpus all --network host --ipc=host -e HF_TOKEN --rm --name vllm_dev"
fi
Contributor


⚠️ Potential issue | 🟡 Minor

Hardcoded --name vllm_dev prevents running multiple containers and shadows the $CONTAINER argument's meaning.

The script's positional argument ($CONTAINER) is the Docker image name, but the running container name is always vllm_dev. This has two problems:

  1. Starting a second instance (e.g., a second headless container) will fail with docker: Error response from daemon: Conflict. The container name "/vllm_dev" is already in use.
  2. The usage comment says <container_name> but it's really <image_name>, which is confusing.

Consider using $CONTAINER as both the image name and the container name (or derive the container name from it):

🔧 Proposed fix
-DOCKER_CMD="docker run -itd --gpus all --network host --ipc=host -e HF_TOKEN --rm --name vllm_dev"
+DOCKER_CMD="docker run -itd --gpus all --network host --ipc=host -e HF_TOKEN --rm --name ${CONTAINER}_dev"
-DOCKER_CMD="docker run -it --gpus all --network host --ipc=host -e HF_TOKEN --rm --name vllm_dev"
+DOCKER_CMD="docker run -it --gpus all --network host --ipc=host -e HF_TOKEN --rm --name ${CONTAINER}_dev"

Also update the usage comment on line 4:

-# Usage: ./launch.sh <container_name> [--mount_dir] [--headless]
+# Usage: ./launch.sh <image_name> [--mount_dir] [--headless]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 if [ "$HEADLESS" = true ]; then
-DOCKER_CMD="docker run -itd --gpus all --network host --ipc=host -e HF_TOKEN --rm --name vllm_dev"
+DOCKER_CMD="docker run -itd --gpus all --network host --ipc=host -e HF_TOKEN --rm --name ${CONTAINER}_dev"
 else
-DOCKER_CMD="docker run -it --gpus all --network host --ipc=host -e HF_TOKEN --rm --name vllm_dev"
+DOCKER_CMD="docker run -it --gpus all --network host --ipc=host -e HF_TOKEN --rm --name ${CONTAINER}_dev"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/vllm/launch.sh` around lines 36 - 40, The script hardcodes
"--name vllm_dev" in DOCKER_CMD which prevents multiple instances and
misrepresents the positional $CONTAINER argument; update the usage comment to
clarify that the positional argument is the image name (or image_name) and
change the DOCKER_CMD assignments in the HEADLESS branch to use a container name
derived from $CONTAINER (for example --name "$CONTAINER" or --name
"${CONTAINER}_dev") instead of "vllm_dev" so the image argument is also used as
the container name and avoids name collisions.

docker run -it --gpus all --network host --ipc=host -e HF_TOKEN --rm -v ${PWD}:/workspace/bionemo vllm /bin/bash
```

or use `launch.sh`.

⚠️ Potential issue | 🟡 Minor

launch.sh mounts to a different container path than the manual docker run command — the README's test instruction will break after using launch.sh --mount_dir.

The manual command (line 14) mounts ${PWD} to /workspace/bionemo, which is the container's WORKDIR, so python test_esm2_golden_values.py works directly. However, launch.sh --mount_dir mounts the project root (two levels up) to /workspace/bionemo-framework, leaving /workspace/bionemo populated only by the image's COPY . . layer. Running python test_esm2_golden_values.py from WORKDIR after using --mount_dir will fail unless the user navigates to the correct subdirectory. The "or use launch.sh" phrasing implies equivalence — either clarify the path difference or update launch.sh to mount to /workspace/bionemo instead of /workspace/bionemo-framework.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/vllm/README.md` at line 17, The README's "or use `launch.sh`"
is misleading because `launch.sh --mount_dir` mounts the repo to
`/workspace/bionemo-framework` while the manual `docker run` mounts to
`/workspace/bionemo`, causing `python test_esm2_golden_values.py` to fail when
run from WORKDIR; fix by either (A) update `launch.sh` to mount the project root
into `/workspace/bionemo` instead of `/workspace/bionemo-framework` (adjust
mount target and any downstream path references in launch.sh), or (B) update
README.md to explicitly document that `launch.sh --mount_dir` mounts at
`/workspace/bionemo-framework` and instruct users to cd into the correct
subdirectory (where `test_esm2_golden_values.py` lives) before running the test;
mention the affected files `launch.sh`, `README.md`, and
`test_esm2_golden_values.py` so reviewers can locate the changes.
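Option (A) might look like the following inside launch.sh. This is a hypothetical sketch: the $MOUNT_DIR flag and $DOCKER_CMD variable are assumptions about the script's structure, not its actual contents.

```shell
#!/bin/sh
# Hypothetical sketch only -- variable names are assumed, not taken from launch.sh.
MOUNT_DIR=true
DOCKER_CMD="docker run -it --gpus all --network host --ipc=host -e HF_TOKEN --rm"
if [ "$MOUNT_DIR" = true ]; then
  # Mount the recipe directory at the image's WORKDIR so that
  # `python test_esm2_golden_values.py` works from the container's default cwd.
  DOCKER_CMD="$DOCKER_CMD -v ${PWD}:/workspace/bionemo"
fi
echo "$DOCKER_CMD"
```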

Comment on lines 21 to 23
```python
python test_esm2_golden_values.py
```

⚠️ Potential issue | 🟡 Minor

Wrong code fence language identifier — should be bash, not python.

The python test_esm2_golden_values.py command is a shell invocation, not Python source code. Using ```python causes syntax highlighters to misparse it.

📝 Proposed fix

-```python
+```bash
 python test_esm2_golden_values.py
 ```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/vllm/README.md` around lines 21 - 23, Change the fenced code
block that currently uses the wrong language tag; locate the block containing
the shell command "python test_esm2_golden_values.py" in the README and replace
the opening fence language identifier from python to bash so the command
is treated as a shell snippet and highlighted correctly.


Comment on lines 147 to 210
if __name__ == "__main__":
print(f"GPUs: {torch.cuda.device_count()}")

# Step 0: fresh export (facebook HF -> our TE format)
print("\n[0/3] Exporting checkpoint ...")
MODEL_ID = fresh_export()

print(f"MODEL_ID: {MODEL_ID}")
print(f"REFERENCE_MODEL_ID: {REFERENCE_MODEL_ID}")
print(f"Sequences: {len(SEQUENCES)}")

# 1) vLLM on exported model
print("\n[1/3] vLLM inference (exported model) ...")
emb_vllm = vllm_embed(MODEL_ID, SEQUENCES)

# 2) HuggingFace on exported model
print("\n[2/3] HuggingFace inference (exported model) ...")
emb_hf_exported = hf_embed(MODEL_ID, SEQUENCES)

# 3) HuggingFace on reference Hub model
print("\n[3/3] HuggingFace inference (reference model) ...")
emb_hf_reference = hf_embed(REFERENCE_MODEL_ID, SEQUENCES)

# ---- Pairwise comparisons ----
pairs = [
("vLLM (exported)", "HF (exported)", emb_vllm, emb_hf_exported),
("vLLM (exported)", "HF (reference)", emb_vllm, emb_hf_reference),
("HF (exported)", "HF (reference)", emb_hf_exported, emb_hf_reference),
]

# ---- Summary table ----
header = f"{'Pair':<35} {'max |diff|':>14} {'mean |diff|':>14} {'cos sim':>12} {'exact':>7}"
sep = "-" * len(header)
print(f"\n{sep}")
print(header)
print(sep)

for name_a, name_b, a, b in pairs:
diffs = np.abs(a.astype(np.float64) - b.astype(np.float64))
label = f"{name_a} vs {name_b}"
exact = np.array_equal(a, b)
print(
f"{label:<35} {diffs.max():>14.8e} {diffs.mean():>14.8e} "
f"{cosine_sim(a, b):>12.10f} {'YES' if exact else 'NO':>7}"
)

print(sep)
print(f"Tolerance: rtol={RTOL}, atol={ATOL} (0 = exact match required)")

# Per-sequence breakdown
short = {"vLLM (exported)": "vllm", "HF (exported)": "hf_exp", "HF (reference)": "hf_ref"}
print("\nPer-sequence max |diff|:")
for i in range(len(SEQUENCES)):
row = f" seq {i}:"
for name_a, name_b, a, b in pairs:
d = float(np.abs(a[i].astype(np.float64) - b[i].astype(np.float64)).max())
row += f" {short[name_a]}_vs_{short[name_b]}={d:.8e}"
print(row)

print(sep)

# Cleanup
if torch.distributed.is_initialized():
torch.distributed.destroy_process_group()

⚠️ Potential issue | 🟠 Major

Test script has no assertions — it will never fail.

This "test" prints a comparison table but never asserts that results are within tolerance. RTOL and ATOL (line 74) are defined but unused. A silent pass regardless of output drift defeats the purpose of a golden-value test.

Add assertions after the comparison loop, for example:

Suggested assertion block
+    all_passed = True
     for name_a, name_b, a, b in pairs:
         diffs = np.abs(a.astype(np.float64) - b.astype(np.float64))
         label = f"{name_a}  vs  {name_b}"
         exact = np.array_equal(a, b)
         print(
             f"{label:<35} {diffs.max():>14.8e} {diffs.mean():>14.8e} "
             f"{cosine_sim(a, b):>12.10f} {'YES' if exact else 'NO':>7}"
         )
+        if not np.allclose(a, b, rtol=RTOL, atol=ATOL):
+            all_passed = False
 
     print(sep)
-    print(f"Tolerance: rtol={RTOL}, atol={ATOL} (0 = exact match required)")
+    print(f"Tolerance: rtol={RTOL}, atol={ATOL}")
+
+    assert all_passed, "Golden value comparison failed — see table above for details."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/vllm/test_esm2_golden_values.py` around lines 147 - 210, The
script currently only prints comparisons and never fails; add assertions that
enforce the golden-value tolerances using RTOL and ATOL: for each pair in pairs
(refer to the pairs list and variables a, b), assert np.allclose(a, b,
rtol=RTOL, atol=ATOL) (or equivalently assert (np.abs(a-b) <= ATOL + RTOL *
np.abs(b)).all()) and fail the test if not, and also assert cosine_sim(a, b)
exceeds a sensible threshold or that exact is True when ATOL/RTOL are zero; add
per-sequence assertions inside the per-sequence loop to ensure each sequence
max-diff <= ATOL + RTOL * max(|b_i|) so the test fails on unacceptable drift.
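The tolerance gate the prompt asks for can be sketched independently of the script. The RTOL/ATOL values below are illustrative, not the ones defined in test_esm2_golden_values.py:

```python
import numpy as np

# Illustrative tolerances -- not the script's actual RTOL/ATOL values.
RTOL, ATOL = 1e-5, 1e-6

def failing_pairs(pairs):
    """Return labels of embedding pairs that exceed tolerance, so the caller
    can assert the list is empty instead of passing silently."""
    return [
        f"{name_a} vs {name_b}"
        for name_a, name_b, a, b in pairs
        if not np.allclose(a, b, rtol=RTOL, atol=ATOL)
    ]

a = np.zeros((2, 4))
b = a + 1e-7  # within atol
c = a + 1e-3  # well outside tolerance
assert failing_pairs([("vLLM", "HF", a, b)]) == []
assert failing_pairs([("vLLM", "HF", a, c)]) == ["vLLM vs HF"]
```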

Comment on lines 77 to 78
# only needed at runtime for FP8 training efficiency; users who train with FP8 pass
# padded_vocab_size explicitly. Keeping vocab_size-sized weights in the checkpoint

users who train with FP8 pass padded_vocab_size explicitly

i'm not sure that's a great assumption 😆. CC @jomitchellnv


yea that's not true, users who train with FP8 don't pass this

Comment on lines 104 to 106
add_pooling_layer: Whether the base model should include a pooling layer. Set to
``False`` for exported checkpoints that are saved from ``NVEsmForMaskedLM``
(which does not use a pooler). This avoids missing-weight errors in vLLM.

why don't we just make the default false and omit the config edit in export.py?

"""NVEsmForMaskedLM is a TransformerEngine-optimized ESM model for masked language modeling."""

_tied_weights_keys: ClassVar[dict[str, str]] = {"lm_head.decoder.weight": "esm.embeddings.word_embeddings.weight"}
_do_not_quantize = ("lm_head.dense", "lm_head.decoder") # Flag for testing that these layers are not quantized.

you're deleting _do_not_quantize? we need that

Comment on lines 514 to 525
with transformer_engine.pytorch.autocast(enabled=False):
with transformer_engine.pytorch.fp8_autocast(enabled=False):

wait, why -- fp8_autocast is deprecated

Comment on lines 486 to 505
with transformer_engine.pytorch.quantized_model_init(enabled=False):
self.dense = transformer_engine.pytorch.Linear(
config.hidden_size,
config.hidden_size,
params_dtype=config.dtype,
device="meta" if torch.get_default_device() == torch.device("meta") else "cuda",
init_method=lambda x: torch.nn.init.normal_(x, mean=0.0, std=config.initializer_range),
)
self.dense = transformer_engine.pytorch.Linear(
config.hidden_size,
config.hidden_size,
params_dtype=config.dtype,
device="meta" if torch.get_default_device() == torch.device("meta") else "cuda",
init_method=lambda x: torch.nn.init.normal_(x, mean=0.0, std=config.initializer_range),
)

with transformer_engine.pytorch.fp8_model_init(enabled=False):

why? this also reverts back to a deprecated context manager, and i think we want that dense layer not to be quantized IIUC

sample_layers = [
model.esm.encoder.layers[0].self_attention.core_attention,
model.esm.encoder.layers[0].self_attention.layernorm_qkv,
model.model.encoder.layers[0].self_attention.core_attention,

why the rename here from esm -> models?

Comment on lines 454 to 466
with transformer_engine.pytorch.autocast(enabled=False):
prediction_scores = self.lm_head(sequence_output)
prediction_scores = self.lm_head(sequence_output)

we can't just remove this, this is important


i don't think we want this folder at all. At the very least it won't work with our current CI

mask_ratio_observed = n_masked_per_seq.float() / src_lengths
scale_factor = (1 - mask_ratio_train) / (1 - mask_ratio_observed)
reshaped_scale_factor = torch.repeat_interleave(scale_factor, src_lengths_padded, dim=0)
reshaped_scale_factor = torch.repeat_interleave(scale_factor, src_lengths, dim=0)

why the change here?
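For context on what this diff changes, here is a numpy analogue of the expansion step (np.repeat mirrors torch.repeat_interleave along dim 0); whether the padded or unpadded lengths are correct depends on whether the token stream being scaled still contains padding:

```python
import numpy as np

# One scale factor per sequence (illustrative values, not from the model).
scale_factor = np.array([1.25, 0.9, 1.1])
# Unpadded token counts per sequence, as in src_lengths.
src_lengths = np.array([3, 2, 4])

# Expand to one scale value per token of a packed, unpadded token stream;
# using padded lengths here would misalign scales with tokens.
per_token_scale = np.repeat(scale_factor, src_lengths)

assert per_token_scale.shape == (src_lengths.sum(),)  # 9 tokens total
assert np.allclose(per_token_scale[:3], 1.25)         # first sequence's tokens
```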

# Keep the last layers of the network in higher precision to avoid numerical instability.
# Please see recipes/fp8_analysis/README.md for more details.
with transformer_engine.pytorch.autocast(enabled=False):
with transformer_engine.pytorch.fp8_autocast(enabled=False):

this is incorrect. autocast is the API now. It covers FP8/FP4 autocast.

self.micro_batch_size = micro_batch_size
self.max_seq_length = max_seq_length
self.attn_mask_type = attn_mask_type
self.add_pooling_layer = add_pooling_layer

why does adding an inference script require us to add a pooling layer?

"""NVEsmForMaskedLM is a TransformerEngine-optimized ESM model for masked language modeling."""

_tied_weights_keys: ClassVar[dict[str, str]] = {"lm_head.decoder.weight": "esm.embeddings.word_embeddings.weight"}
_do_not_quantize = ("lm_head.dense", "lm_head.decoder") # Flag for testing that these layers are not quantized.

we need this

sequence_output = outputs[0]
with transformer_engine.pytorch.autocast(enabled=False):
prediction_scores = self.lm_head(sequence_output)
prediction_scores = self.lm_head(sequence_output)

we still need `with transformer_engine.pytorch.autocast(enabled=False):`


Otherwise, when we're doing FP8, it's going to look up a context block and see that we're in FP8 and it's going to cast these projections to FP8

device="meta" if torch.get_default_device() == torch.device("meta") else "cuda",
init_method=lambda x: torch.nn.init.normal_(x, mean=0.0, std=config.initializer_range),
)
self.dense = transformer_engine.pytorch.Linear(

We still need `with transformer_engine.pytorch.quantized_model_init(enabled=False)`

# Keep the last layers of the network in higher precision to avoid numerical instability.
# Please see recipes/fp8_analysis/README.md for more details.
with transformer_engine.pytorch.autocast(enabled=False):
with transformer_engine.pytorch.fp8_autocast(enabled=False):

not correct. need to keep the older version of this autocast not fp8_autocast

@jomitchellnv left a comment

I am wondering if we can not modify existing files inside bionemo-recipes/models/esm2 and instead just make a new recipe for vLLM stuff and place it in there. Is there some requirement from vLLM that requires us to rename ModuleLists from ESM -> Model?

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 commented Feb 25, 2026

/ok to test a67df14

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 commented Feb 25, 2026

/ok to test 36cdbb2

@gagank1 added the ciflow:all-recipes label (Run tests for all recipes under bionemo-recipes, or enforce running all tests for recipes) Feb 25, 2026
@gagank1 commented Feb 25, 2026

/ok to test 36cdbb2
