Merged
1 change: 0 additions & 1 deletion examples/moe/run_qwen3_30B_A3B_16H100.sh
@@ -81,7 +81,6 @@ python3 -m verl.trainer.main_ppo \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.ref.veomni.optimizer_offload=True \
algorithm.use_kl_in_reward=False \
- trainer.use_legacy_worker_impl=disable \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k_math' \
1 change: 0 additions & 1 deletion examples/moe/run_qwen3_30B_A3B_dapo.sh
@@ -83,7 +83,6 @@ python3 -m verl.trainer.main_ppo \
actor_rollout_ref.rollout.profiler.save_path=$profile_save_path \
actor_rollout_ref.ref.veomni.optimizer_offload=True \
algorithm.use_kl_in_reward=False \
- trainer.use_legacy_worker_impl=disable \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_qwen3moe_dapo' \
1 change: 0 additions & 1 deletion examples/moe/run_qwen3_30B_A3B_reinforce.sh
@@ -94,7 +94,6 @@ RAY_DEDUP_LOGS=0 PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
custom_reward_function.name=compute_math_score \
trainer.project_name=vexact-baseline-math-moe-reinforce \
trainer.experiment_name=vexact-exp-MOE \
- trainer.use_legacy_worker_impl=disable \
trainer.test_freq=20 \
trainer.log_val_generations=20 \
trainer.val_before_train=True \
14 changes: 12 additions & 2 deletions pyproject.toml
@@ -27,7 +27,7 @@ dev = [
"pre-commit"
]
vllm = [
- "vllm==0.18.0",
+ "vllm==0.19.1",
]
verl = [
"verl",
@@ -37,6 +37,11 @@ verl = [
]
veomni = [
"veomni",
+ # VeOmni's default install pins transformers==5.2.0 (via its
+ # `transformers-stable` dependency group). Mirror that pin here so vexact
+ # users picking up the veomni extra resolve to the same version VeOmni
+ # tests/develops against.
+ "transformers==5.2.0",
Comment on lines +40 to +44
medium

The comment here states that the pin only hits users of the veomni extra. However, due to the global override added in tool.uv.override-dependencies (line 162), this version is actually enforced for all uv resolutions in this project. The pin in the extra is effectively redundant for uv users but remains relevant for pip users. Please update the comment to reflect the actual behavior under uv.

Suggested change
- # VeOmni's default install pins transformers==5.2.0 (via its
- # `transformers-stable` dependency group). Mirror that pin here so vexact
- # users picking up the veomni extra resolve to the same version VeOmni
- # tests/develops against.
- "transformers==5.2.0",
+ # VeOmni's default install pins transformers==5.2.0 (via its
+ # `transformers-stable` dependency group). Mirror that pin here for pip
+ # users; note that for uv users, this is enforced project-wide via
+ # override-dependencies to resolve conflicts with vllm.
+ "transformers==5.2.0",

]

[build-system]
@@ -125,7 +130,7 @@ known-third-party = [
verl = { git = "https://github.com/verl-project/verl.git", rev = "61f29997fb026a5a269dafccfe2f3bb800e32ef4" }
# To work on verl locally, point this at `{ path = "./verl", editable = true }`.
# To work on VeOmni locally, point this at `{ path = "./VeOmni", editable = true }`.
- veomni = { git = "https://github.com/ByteDance-Seed/VeOmni.git", rev = "58759e78015ad429507079aa443215e3c515364f" }
+ veomni = { git = "https://github.com/ByteDance-Seed/VeOmni.git", rev = "a4ed599119afb21f5e559f15e95635f0edbbc5c6" }
torch = [
{ index = "pytorch", extra = "gpu" },
]
@@ -150,4 +155,9 @@ no-build-isolation-package = ["flash-attn"]
environments = ["sys_platform == 'linux'"]
override-dependencies = [
"opencv-python-headless<4.13.0",
+ # vllm 0.19.1's metadata still excludes transformers 5.0.*-5.4.* (only
+ # 5.5.1+ is whitelisted), but VeOmni pins transformers==5.2.0. Override
+ # vllm's conservative ceiling so the `vllm` and `veomni` extras can
+ # coexist; vllm 0.19.1 runs fine against transformers 5.2 in practice.
+ "transformers==5.2.0",
Comment on lines +158 to +162
high

Forcing transformers==5.2.0 via override-dependencies to bypass vllm's version constraints is risky. Since vllm explicitly excludes versions 5.0.* through 5.4.*, there may be known incompatibilities or breaking changes in the transformers API that vllm relies on. While the smoke tests passed for the Qwen3-1.7B rollout, this override might cause issues in other vllm features or models.

Additionally, this override makes the pin global for all uv users, which contradicts the PR description's intent to only affect veomni users. If transformers 5.5.1+ is already whitelisted by vllm, consider if VeOmni can be updated to that version to avoid the need for a global override.
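
To make the conflict concrete, here is a small sketch of the range check described in this thread. The excluded range (5.0.* through 5.4.*) is taken from the comment above, not from vllm's actual metadata, and the helper names are hypothetical. It only handles plain `X.Y.Z` version strings.

```python
def parse_version(text):
    """Turn '5.2.0' into (5, 2, 0); pre-release suffixes are not handled."""
    return tuple(int(part) for part in text.split("."))


# Assumption taken from the review thread: transformers 5.0.* through 5.4.*
# are excluded by vllm 0.19.1's metadata.
EXCLUDED_LOW = (5, 0)
EXCLUDED_HIGH = (5, 4)


def in_excluded_range(version_text):
    """True when the major.minor pair falls inside the excluded range."""
    major_minor = parse_version(version_text)[:2]
    return EXCLUDED_LOW <= major_minor <= EXCLUDED_HIGH


print(in_excluded_range("5.2.0"))  # True: the pinned version is excluded
print(in_excluded_range("5.5.1"))  # False: the version described as whitelisted
```

The pinned `5.2.0` lands inside the excluded range, which is why the resolver needs an override for the `vllm` and `veomni` extras to coexist.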

]
70 changes: 48 additions & 22 deletions uv.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 7 additions & 1 deletion vexact/inferencer/model_loader.py
@@ -298,7 +298,13 @@ def create_model(self):
if self._pp_info.pp_size > 1:
self._apply_pp()

- with TorchMemorySaverAdapter.get_instance().region("weights", enable_cpu_backup=False):
+ # ``enable_cpu_backup=True``: torch_memory_saver offloads weights to
+ # CPU on pause and restores them on resume. Without this, pause→resume
+ # leaves the GPU memory uninitialized and the model produces garbage
+ # logits whenever any weight key isn't re-covered by the subsequent
+ # FSDP→rollout sync (notably an issue for MoE archs whose actor-side
+ # state_dict naming evolves across transformers releases).
+ with TorchMemorySaverAdapter.get_instance().region("weights", enable_cpu_backup=True):
init_parameters(self._causal_model, self._config.dtype, self._device)

load_weights_from_weight_path(self._causal_model, self._config, self._model_path)
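
The failure mode described in the new comment can be illustrated with a toy model in plain Python. This is only a sketch: `ToyWeightRegion`, its methods, and the weight key names are hypothetical and do not reflect the torch_memory_saver API or real Qwen3 MoE parameter names. Uninitialized memory is modeled as `None`.

```python
class ToyWeightRegion:
    """Hypothetical stand-in for a pausable weight region."""

    def __init__(self, weights, enable_cpu_backup):
        self.gpu = dict(weights)
        self.enable_cpu_backup = enable_cpu_backup
        self._keys = list(weights)
        self._cpu_copy = None

    def pause(self):
        # Release "GPU" memory; keep a host copy only when backup is enabled.
        self._cpu_copy = dict(self.gpu) if self.enable_cpu_backup else None
        self.gpu = {}

    def resume(self):
        # Re-map the memory. Without a backup the contents are undefined,
        # modeled here as None.
        if self._cpu_copy is not None:
            self.gpu = dict(self._cpu_copy)
        else:
            self.gpu = {key: None for key in self._keys}

    def sync(self, actor_state):
        # The weight sync only overwrites keys that exist on the rollout side.
        for key, value in actor_state.items():
            if key in self.gpu:
                self.gpu[key] = value


def stale_keys(enable_cpu_backup):
    """Return rollout keys left uninitialized after pause, resume, and sync."""
    rollout = {"embed.weight": 1.0, "experts.0.w1": 2.0}
    # The actor exposes the expert weight under a different (made-up) name,
    # so the sync never covers "experts.0.w1".
    actor = {"embed.weight": 1.5, "experts.gate_up_proj": 2.5}
    region = ToyWeightRegion(rollout, enable_cpu_backup)
    region.pause()
    region.resume()
    region.sync(actor)
    return sorted(key for key, value in region.gpu.items() if value is None)


print(stale_keys(False))  # ['experts.0.w1']
print(stale_keys(True))   # []
```

With the backup enabled, a key that the sync misses keeps its previously loaded value instead of undefined contents. That value may still be out of date relative to the actor, so the backup guards against garbage logits rather than fixing the naming mismatch itself.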