[model] feat: bump to transformers v5.2.0 + VeOmni a4ed599 by Luosuu · Pull Request #29 · verl-project/vexact

Luosuu · 2026-05-21T00:31:21Z

Summary

VeOmni's main now defaults to transformers==5.2.0 (PR ByteDance-Seed/VeOmni#751, gated via the transformers-stable dependency group). This PR mirrors that pin in vexact so users on the veomni extra resolve to the same version VeOmni tests/develops against, and bumps the VeOmni pin to current main.

Changes

- veomni rev: 58759e7 → a4ed599. Picks up the v5 default plus the Qwen3-VL/Omni-MoE CPU-sync removals (#762, #764), v5 loader test (#727), MoE router replay hook (#719), and the v4 cleanup (#768).
- vllm: 0.18.0 → 0.19.1 (latest vllm still on torch 2.10; 0.20+ would cascade a torch 2.11 and flash-attn-wheel bump).
- transformers==5.2.0 added to the veomni extra so it only hits users actually doing veomni-based training.
- override-dependencies forces transformers==5.2.0 globally so the vllm extra (whose metadata still excludes transformers 5.0.*-5.4.* until the vllm devs whitelist 5.5.1+) can coexist in the resolution.
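
To make the conflict concrete, here is a minimal sketch of the exclusion described above. It assumes vllm's metadata rejects every transformers release from 5.0.* through 5.4.*; the helper names are hypothetical and the bounds are read off this description, not copied from vllm's actual specifier string.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a plain 'X.Y.Z' release string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def in_vllm_excluded_range(version: str) -> bool:
    """Return True for 5.0.* through 5.4.* (assumed range, per the PR text)."""
    major, minor = parse_version(version)[:2]
    return major == 5 and 0 <= minor <= 4


for candidate in ("5.2.0", "5.4.3", "5.5.1"):
    status = "inside" if in_vllm_excluded_range(candidate) else "outside"
    print(f"transformers=={candidate}: {status} the excluded range")
```

Since 5.2.0 sits inside that range, the veomni pin and the vllm metadata cannot both be satisfied as written, which is the situation the override is meant to work around.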

Smoke verification

1×8 H100, examples/getting_started/run_qwen3_1b7.sh (Qwen3-1.7B + gsm8k), capped at 2 training steps:

| step | `rollout_probs_diff_max` | `rollout_actor_probs_pearson_corr` | `actor/entropy` | `k3_kl` |
| --- | --- | --- | --- | --- |
| 1 | 0.0 | 0.9999999 | 0.180 | 0.0 |
| 2 | 0.0 | 1.0 | 0.176 | 0.0 |

Bitwise actor↔rollout alignment is preserved under transformers v5.2.0.
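
For readers unfamiliar with these metrics, the following is a rough sketch of how such alignment numbers could be computed from per-token log-probabilities. It is not the code verl or vexact uses: the function name `alignment_metrics`, the direction of the ratio in the k3 estimator, and the plain-list inputs are all assumptions for illustration.

```python
import math


def alignment_metrics(actor_logprobs, rollout_logprobs):
    """Compare per-token log-probs from the actor and the rollout engine."""
    actor_probs = [math.exp(lp) for lp in actor_logprobs]
    rollout_probs = [math.exp(lp) for lp in rollout_logprobs]
    n = len(actor_probs)

    # Largest absolute probability gap over all tokens.
    diff_max = max(abs(a - r) for a, r in zip(actor_probs, rollout_probs))

    # Pearson correlation between the two probability vectors
    # (undefined if either vector is constant).
    mean_a = sum(actor_probs) / n
    mean_r = sum(rollout_probs) / n
    cov = sum((a - mean_a) * (r - mean_r) for a, r in zip(actor_probs, rollout_probs))
    var_a = sum((a - mean_a) ** 2 for a in actor_probs)
    var_r = sum((r - mean_r) ** 2 for r in rollout_probs)
    pearson = cov / math.sqrt(var_a * var_r)

    # k3 estimator of KL: mean of (ratio - 1 - log ratio).
    # Every term is non-negative and vanishes only when the log-ratio is 0.
    log_ratios = [a - r for a, r in zip(actor_logprobs, rollout_logprobs)]
    k3 = sum(math.exp(lr) - 1.0 - lr for lr in log_ratios) / n

    return {"probs_diff_max": diff_max, "pearson_corr": pearson, "k3_kl": k3}


identical = alignment_metrics([-0.1, -2.0, -0.7], [-0.1, -2.0, -0.7])
drifted = alignment_metrics([-0.1, -2.0, -0.7], [-0.3, -1.5, -0.7])
print(identical)
print(drifted)
```

Under these definitions, a maximum difference of exactly 0.0 and a k3 value of exactly 0.0 can only occur when every per-token log-prob matches, which is why they serve as a bitwise-alignment check.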

The 30B-A3B B200 recipe was not re-validated on H100 here because its max_cache_blocks=4608 is sized for 192GB HBM and OOMs on 80GB H100; alignment math is shared with the 1.7B path, so v5 risk is contained to the import/dataclass surface (already covered by the 1.7B smoke).

Test plan

- uv lock + uv sync --extra gpu --extra verl --extra veomni resolves (transformers 5.2.0, veomni 0.1.9a5 @ a4ed599, vllm 0.19.1)
- python -c "import vexact.models.qwen3_moe, vexact.models.deepseek_v3" clean under v5.2.0
- Qwen3-1.7B + gsm8k smoke: rollout_probs_diff_max=0.0 step 1, step 2

AI assistance disclosure

This PR was prepared with Claude Code assistance for the dep-resolution sweep (resolving the vllm/transformers/torch ceiling cascade), uv.lock regeneration, and smoke-test orchestration on a remote H100 worker.

VeOmni's main branch now defaults to transformers v5.2.0 (PR #751, gated via the `transformers-stable` dependency group); mirror that pin in vexact so users on the veomni extra resolve to the same version VeOmni tests/develops against.

Changes:
- veomni rev: 58759e7 -> a4ed599. Picks up the v5 default plus Qwen3-VL/Omni-MoE CPU-sync removals (#762, #764), v5 loader test (#727), MoE router replay hook (#719), and the v4 cleanup (#768).
- vllm: 0.18.0 -> 0.19.1 (latest vllm still on torch 2.10; 0.20+ would cascade a torch 2.11 + flash-attn-wheel bump).
- transformers pinned to 5.2.0 inside the `veomni` extra so it only hits users actually doing veomni-based training.
- override-dependencies forces transformers==5.2.0 globally so the `vllm` extra (whose metadata still excludes 5.0.*-5.4.* until the vllm devs whitelist 5.5.1+) can coexist in the resolution.

Smoke verification (1x8 H100, Qwen3-1.7B + gsm8k, 2 training steps):
step:1 rollout_probs_diff_max=0.0 pearson_corr=0.9999999 actor/entropy=0.180 k3_kl=0.0
step:2 rollout_probs_diff_max=0.0 pearson_corr=1.0 actor/entropy=0.176 k3_kl=0.0

Bitwise actor<->rollout alignment is preserved under transformers v5.

The 30B-A3B B200 recipe was not re-validated on H100 here because its `max_cache_blocks=4608` is sized for 192GB HBM and OOMs on 80GB H100; alignment math is shared with the 1.7B path, so v5 risk is contained to the import/dataclass surface (already covered by the 1.7B smoke).

gemini-code-assist

Code Review

This pull request updates vllm to 0.19.1 and veomni to a newer revision, while adding a global override for transformers==5.2.0 to resolve dependency conflicts. Reviewers noted that this override is risky as vllm explicitly excludes this version range, suggesting that veomni should ideally be updated to a whitelisted transformers version (5.5.1+) instead. Feedback also included a request to clarify comments in pyproject.toml regarding the global nature of the uv override compared to its effect on pip users.

gemini-code-assist · 2026-05-21T00:33:34Z

+    # vllm 0.19.1's metadata still excludes transformers 5.0.*-5.4.* (only
+    # 5.5.1+ is whitelisted), but VeOmni pins transformers==5.2.0. Override
+    # vllm's conservative ceiling so the `vllm` and `veomni` extras can
+    # coexist; vllm 0.19.1 runs fine against transformers 5.2 in practice.
+    "transformers==5.2.0",


Forcing transformers==5.2.0 via override-dependencies to bypass vllm's version constraints is risky. Since vllm explicitly excludes versions 5.0.* through 5.4.*, there may be known incompatibilities or breaking changes in the transformers API that vllm relies on. While the smoke tests passed for the Qwen3-1.7B rollout, this override might cause issues in other vllm features or models.

Additionally, this override makes the pin global for all uv users, which contradicts the PR description's intent to only affect veomni users. If transformers 5.5.1+ is already whitelisted by vllm, consider if VeOmni can be updated to that version to avoid the need for a global override.

gemini-code-assist · 2026-05-21T00:33:34Z

+    # VeOmni's default install pins transformers==5.2.0 (via its
+    # `transformers-stable` dependency group). Mirror that pin here so vexact
+    # users picking up the veomni extra resolve to the same version VeOmni
+    # tests/develops against.
+    "transformers==5.2.0",


The comment here states that the pin only hits users of the veomni extra. However, due to the global override added in tool.uv.override-dependencies (line 162), this version is actually enforced for all uv resolutions in this project. The pin in the extra is effectively redundant for uv users but remains relevant for pip users. Please update the comment to reflect the actual behavior under uv.

Suggested change

-    # VeOmni's default install pins transformers==5.2.0 (via its
-    # `transformers-stable` dependency group). Mirror that pin here so vexact
-    # users picking up the veomni extra resolve to the same version VeOmni
-    # tests/develops against.
-    "transformers==5.2.0",
+    # VeOmni's default install pins transformers==5.2.0 (via its
+    # `transformers-stable` dependency group). Mirror that pin here for pip
+    # users; note that for uv users, this is enforced project-wide via
+    # override-dependencies to resolve conflicts with vllm.
+    "transformers==5.2.0",

Without these fixes vexact rollout drifts severely from the actor side on both supported MoE architectures under transformers v5 (``rollout_probs_diff_max`` ≈0.99/0.998), making bitwise-aligned RL training impossible. Five concrete causes; one fix per cause:

1. ``ModelCreator`` allocated the ``"weights"`` TorchMemorySaver region with ``enable_cpu_backup=False``. ``rollout.release()`` pauses every region, so weights were *freed* on pause and came back as uninitialised garbage on resume. Flip the flag to ``True`` so the offload preserves values.
2. VeOmni v5's actor exposes MoE experts as a fused ``mlp.experts.gate_up_proj`` (shape ``[E, 2I, H]``), but vexact stored per-projection ``gate_proj`` / ``up_proj`` tensors. verl's bucketed FSDP→rollout transfer therefore silently dropped every expert key. Move vexact's ``Qwen3MoeExperts`` and ``PatchDeepseekV3NaiveMoe`` to the fused layout, slice into per-projection views at MoE-forward time.
3. Extend the per-expert loader to also accept ``gate_up_proj.weight`` (the bucketed sync ships ``mlp.experts.{idx}.gate_up_proj.weight``) and to know how to write the fused destination from per-projection disk keys.
4. Use VeOmni's ``veomni.ops.fused_moe_forward`` (``fc1_1_2_weight=…`` merged-fc1 path) on the rollout side too so the kernels match the actor's exactly. Initialise ``veomni.distributed.parallel_state`` (non-EP, ``dp_size=world_size``) in ``Worker.__init__`` because the group_gemm/quack/npu kernels read ``get_parallel_state().ep_enabled`` on every forward and would otherwise crash with ``ValueError: product of parallel sizes…`` under PP>1. The kernel binding tolerates CPU-only worker processes (AgentLoopWorker, etc.).
5. For deepseek_v3: VeOmni v5's stock ``DeepseekV3Attention.forward`` pads ``value_states`` to ``qk_head_dim`` when FA is requested, forcing FA4 onto its non-MLA codegen path. vexact uses FA4's MLA-native path (unpadded V=128). Patch VeOmni's actor-side attention / RoPE / RMSNorm modules to use vexact's MLA-native versions so the two sides hit identical kernel call signatures.

The three v4 ``trainer.use_legacy_worker_impl=disable`` Hydra overrides in ``examples/moe/run_qwen3_30B_A3B_*.sh`` are removed; that knob no longer exists in the current verl pin and the recipes errored out immediately when run.

Smoke verification (1x8 H100, transformers 5.2.0, veomni a4ed599):
Qwen3-30B-A3B (qwen3_moe), examples/moe/run_qwen3_30B_A3B_dapo.sh
step:1 rollout_probs_diff_max=0.0 pearson_corr=1.0 entropy=0.124
Moonlight-16B-A3B (deepseek_v3 / MLA), examples/moe/run_moonlight_gsm8k.sh
step:1 rollout_probs_diff_max=0.0 pearson_corr=1.0 entropy=0.036

Both archs now match the dense Qwen3-1.7B baseline.
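
Causes 2 and 3 come down to where each per-expert tensor lands inside the fused `[E, 2I, H]` weight. The sketch below is illustrative only: `fused_row_range` and `split_gate_up` are hypothetical names, nested Python lists stand in for tensors, the per-projection key format is assumed by analogy with the fused key quoted in the commit message, and placing gate in the first `I` rows with up in the last `I` rows is an assumption about the layout.

```python
import re

# The fused key form appears in the commit message; the per-projection
# forms (gate_proj / up_proj) are assumed to follow the same pattern.
_EXPERT_KEY = re.compile(r"mlp\.experts\.(\d+)\.(gate_proj|up_proj|gate_up_proj)\.weight$")


def fused_row_range(key: str, intermediate_size: int) -> tuple[int, int, int]:
    """Map a per-expert key to (expert_idx, row_start, row_end) in a [E, 2I, H] weight."""
    match = _EXPERT_KEY.search(key)
    if match is None:
        raise KeyError(f"not an expert gate/up key: {key}")
    expert_idx, proj = int(match.group(1)), match.group(2)
    if proj == "gate_proj":  # assumed: gate occupies rows [0, I)
        return expert_idx, 0, intermediate_size
    if proj == "up_proj":  # assumed: up occupies rows [I, 2I)
        return expert_idx, intermediate_size, 2 * intermediate_size
    return expert_idx, 0, 2 * intermediate_size  # fused key covers all 2I rows


def split_gate_up(fused_expert: list[list[float]]) -> tuple[list[list[float]], list[list[float]]]:
    """Split one expert's [2I, H] rows into gate [I, H] and up [I, H].

    List slicing copies; with real tensors the same slicing would give views.
    """
    half = len(fused_expert) // 2
    return fused_expert[:half], fused_expert[half:]


# Toy weight with E=1, I=2, H=3.
fused = [[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0], [4.0, 4.0, 4.0]]]
gate, up = split_gate_up(fused[0])
print(fused_row_range("model.layers.0.mlp.experts.0.up_proj.weight", intermediate_size=2))
print(len(gate), len(up))
```

A loader built this way accepts both the per-projection keys found on disk and the fused keys shipped by the bucketed sync, so neither source has its expert weights dropped.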

Luosuu · 2026-05-21T20:09:21Z

Update: MoE rollout alignment under transformers v5 — both archs verified

Pushed e7e4f88 which extends this PR with the vexact-side MoE fixes needed for bitwise-aligned rollout under transformers v5.

Smoke verification on 1×8 H100

| Model | Arch | `pearson_corr` | `actor/entropy` |
| --- | --- | --- | --- |
| Qwen3-1.7B | dense | 1.0 | 0.18 |
| Qwen3-30B-A3B | qwen3_moe | 1.0 | 0.124 |
| Moonlight-16B-A3B | deepseek_v3 (MLA) | 1.0 | 0.036 |

Before this commit (with the pure dep bump alone), rollout_probs_diff_max was ≈0.99 for Qwen3-MoE and ≈0.998 for Moonlight, and entropy was uniform-ish (~7.4/8.4): the rollout was producing garbage logits.

Root causes fixed

1. enable_cpu_backup=False on the weights TorchMemorySaver region. rollout.release() pauses every region; with no CPU backup the GPU memory came back uninitialised on resume. Flipped to True.
2. MoE expert storage mismatch. VeOmni v5 ships fused mlp.experts.gate_up_proj ([E, 2I, H]); vexact stored gate_proj/up_proj separately, so verl's bucketed FSDP→rollout sync silently dropped every expert key. Moved vexact's experts to the fused layout; slice into per-projection views in the MoE forward.
3. Loader didn't know gate_up_proj keys. Extended _EXPERT_PROJS and added per-projection-to-fused write paths for both disk (per-expert separate) and FSDP-sync (per-expert fused) variants.
4. Rollout and actor MoE kernels differed. Switched the rollout to VeOmni's fused_moe_forward (the merged-fc1 path via fc1_1_2_weight) so the kernels match the actor exactly; init veomni.distributed.parallel_state non-EP in Worker.__init__ so the kernel's ep_enabled check doesn't raise under PP>1 (it walks pp*dp*cp*ulysses*tp == world_size).
5. deepseek_v3 attention path divergence. VeOmni v5's actor-side DeepseekV3Attention.forward pads value_states to qk_head_dim when FA is requested, forcing FA4 onto its standard codegen path. vexact's rollout uses FA4's MLA-native path (unpadded V=128). Patched VeOmni's actor-side attention / RoPE / RMSNorm to reuse vexact's MLA-native versions so both sides hit identical kernel call signatures.
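
The first root cause is easy to picture with a toy model of a pausable memory region. This is not the torch_memory_saver API: `ToyRegion` and its methods are hypothetical, and NaN is used only as a visible stand-in for uninitialised memory.

```python
class ToyRegion:
    """Stand-in for a pausable GPU memory region (illustration only)."""

    def __init__(self, values, enable_cpu_backup):
        self.device_values = list(values)
        self.enable_cpu_backup = enable_cpu_backup
        self._size = len(self.device_values)
        self._cpu_copy = None

    def pause(self):
        # Keep a host-side copy only when backup is enabled, then drop device memory.
        if self.enable_cpu_backup:
            self._cpu_copy = list(self.device_values)
        self.device_values = None

    def resume(self):
        if self._cpu_copy is not None:
            # Restore the saved values.
            self.device_values = self._cpu_copy
            self._cpu_copy = None
        else:
            # Nothing was saved: the contents are arbitrary (NaN marks that here).
            self.device_values = [float("nan")] * self._size


for backup in (False, True):
    region = ToyRegion([0.5, -1.25, 3.0], enable_cpu_backup=backup)
    region.pause()
    region.resume()
    print(f"enable_cpu_backup={backup}: {region.device_values}")
```

With the flag off, the weights do not survive a pause/resume cycle, which matches the garbage logits seen before the fix.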

Also cleared the stale trainer.use_legacy_worker_impl=disable Hydra override from the three examples/moe/run_qwen3_30B_A3B_*.sh recipes — that knob no longer exists in the current verl pin and the recipes errored at startup.

Test plan

- uv lock + uv sync --extra gpu --extra verl --extra veomni resolves
- Dense Qwen3-1.7B smoke: rollout_probs_diff_max = 0.0
- Qwen3-30B-A3B (qwen3_moe) smoke on 8×H100 (PP=4): rollout_probs_diff_max = 0.0
- Moonlight (deepseek_v3 / MLA) smoke on 8×H100: rollout_probs_diff_max = 0.0
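
The PP=4 run is the configuration in which the parallel-state size check quoted earlier (pp*dp*cp*ulysses*tp == world_size) matters. Below is a minimal sketch of that invariant, assuming it is a plain product comparison; `check_parallel_sizes` is a hypothetical helper, not VeOmni's function, and its error message is invented for the example.

```python
import math


def check_parallel_sizes(world_size, pp=1, dp=1, cp=1, ulysses=1, tp=1):
    """Raise if the parallel dimensions do not multiply out to world_size."""
    product = math.prod((pp, dp, cp, ulysses, tp))
    if product != world_size:
        raise ValueError(f"parallel sizes multiply to {product}, expected world_size={world_size}")
    return product


# Non-EP setup described in the PR: dp_size equals world_size, all other sizes stay 1.
print(check_parallel_sizes(world_size=8, dp=8))

# An arbitrary inconsistent combination, shown only to illustrate the failure mode.
try:
    check_parallel_sizes(world_size=8, pp=4, dp=8)
except ValueError as err:
    print(f"rejected: {err}")
```

Setting dp_size to world_size and leaving every other dimension at 1 satisfies the product trivially, whatever pipeline layout the rollout itself uses.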

gemini-code-assist Bot reviewed May 21, 2026

Luosuu changed the title ~~[deps] feat: bump to transformers v5.2.0 + VeOmni a4ed599~~ [model] feat: bump to transformers v5.2.0 + VeOmni a4ed599 May 21, 2026

Luosuu added 2 commits May 21, 2026 20:14

[model] fix: wrap long MoE-kernel-binding log lines under 120 cols

103ef48

[model] fix: ruff format vexact deepseek_v3

d5f6ffb

Luosuu merged commit 1832a19 into verl-project:main May 21, 2026
2 checks passed

Luosuu merged 4 commits into verl-project:main from Luosuu:deps/bump-transformers-v5
