
[model] feat: bump to transformers v5.2.0 + VeOmni a4ed599 #29

Merged
Luosuu merged 4 commits into verl-project:main from Luosuu:deps/bump-transformers-v5
May 21, 2026

Conversation

@Luosuu
Collaborator

@Luosuu Luosuu commented May 21, 2026

Summary

VeOmni's main now defaults to transformers==5.2.0 (PR ByteDance-Seed/VeOmni#751, gated via the transformers-stable dependency group). This PR mirrors that pin in vexact so users on the veomni extra resolve to the same version VeOmni tests/develops against, and bumps the VeOmni pin to current main.

Changes

  • veomni rev: 58759e7 → a4ed599. Picks up the v5 default plus the Qwen3-VL/Omni-MoE CPU-sync removals (#762, #764), v5 loader test (#727), MoE router replay hook (#719), and the v4 cleanup (#768).
  • vllm: 0.18.0 → 0.19.1 (latest vllm still on torch 2.10; 0.20+ would cascade a torch 2.11 and flash-attn-wheel bump).
  • transformers==5.2.0 added to the veomni extra so it only hits users actually doing veomni-based training.
  • override-dependencies forces transformers==5.2.0 globally so the vllm extra — whose metadata still excludes transformers 5.0.*-5.4.* until the vllm devs whitelist 5.5.1+ — can coexist in the resolution.
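
To make the conflict in the last bullet concrete, the sketch below hand-rolls the exclusion rule as this PR describes it (transformers 5.0.* through 5.4.* rejected). The function names and the tuple-based parsing are illustrative assumptions; vllm's real metadata is a specifier string evaluated by the resolver, and this sketch does not reproduce it.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a plain dotted release string such as "5.2.0" into (5, 2, 0)."""
    return tuple(int(part) for part in version.split("."))


def excluded_by_described_range(version: str) -> bool:
    """Hypothetical stand-in for the exclusion described in the PR.

    Returns True for any transformers 5.0.* through 5.4.* release.
    """
    major, minor = parse_version(version)[:2]
    return major == 5 and 0 <= minor <= 4


for candidate in ("5.2.0", "5.5.1"):
    status = "excluded" if excluded_by_described_range(candidate) else "allowed"
    print(f"transformers=={candidate}: {status}")
```

Under this rule 5.2.0 is rejected and 5.5.1 is accepted, which is why the two extras cannot resolve together unless the resolver is told to ignore the constraint.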

Smoke verification

1×8 H100, examples/getting_started/run_qwen3_1b7.sh (Qwen3-1.7B + gsm8k), capped at 2 training steps:

| step | rollout_probs_diff_max | rollout_actor_probs_pearson_corr | actor/entropy | k3_kl |
| --- | --- | --- | --- | --- |
| 1 | 0.0 | 0.9999999 | 0.180 | 0.0 |
| 2 | 0.0 | 1.0 | 0.176 | 0.0 |

Bitwise actor↔rollout alignment is preserved under transformers v5.2.0.
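
As a rough illustration of what the metrics above measure, the sketch below computes a maximum absolute probability difference, a Pearson correlation, and a k3-style KL estimate from two lists of per-token log-probabilities. The formulas are assumptions inferred from the metric names (the k3 form is the common `(r - 1) - log r` estimator); verl's actual implementation may differ, and the function name and inputs are hypothetical.

```python
import math


def alignment_metrics(actor_logprobs, rollout_logprobs):
    """Compare per-token log-probs from the actor and the rollout engine.

    Assumed definitions (not taken from verl's source):
      diff_max : max |p_rollout - p_actor| over tokens
      pearson  : Pearson correlation of the two probability vectors
      k3_kl    : mean of (r - 1) - log r with r = p_actor / p_rollout
    Inputs must be equal-length and not constant, or the Pearson
    denominator is zero.
    """
    actor = [math.exp(lp) for lp in actor_logprobs]
    rollout = [math.exp(lp) for lp in rollout_logprobs]
    n = len(actor)

    diff_max = max(abs(r - a) for a, r in zip(actor, rollout))

    mean_a = sum(actor) / n
    mean_r = sum(rollout) / n
    cov = sum((a - mean_a) * (r - mean_r) for a, r in zip(actor, rollout))
    var_a = sum((a - mean_a) ** 2 for a in actor)
    var_r = sum((r - mean_r) ** 2 for r in rollout)
    pearson = cov / math.sqrt(var_a * var_r)

    k3_terms = []
    for lp_a, lp_r in zip(actor_logprobs, rollout_logprobs):
        log_ratio = lp_a - lp_r
        k3_terms.append(math.exp(log_ratio) - 1.0 - log_ratio)
    k3_kl = sum(k3_terms) / n

    return {"diff_max": diff_max, "pearson": pearson, "k3_kl": k3_kl}


logprobs = [-0.1, -1.2, -0.7, -2.3]
print(alignment_metrics(logprobs, logprobs))
```

With identical inputs, as in the usage line, `diff_max` and `k3_kl` are both exactly 0.0, which is the pattern the smoke run reports.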

The 30B-A3B B200 recipe was not re-validated on H100 here because its max_cache_blocks=4608 is sized for 192GB HBM and OOMs on 80GB H100; alignment math is shared with the 1.7B path, so v5 risk is contained to the import/dataclass surface (already covered by the 1.7B smoke).

Test plan

  • uv lock + uv sync --extra gpu --extra verl --extra veomni resolves (transformers 5.2.0, veomni 0.1.9a5 @ a4ed599, vllm 0.19.1)
  • python -c "import vexact.models.qwen3_moe, vexact.models.deepseek_v3" clean under v5.2.0
  • Qwen3-1.7B + gsm8k smoke: rollout_probs_diff_max=0.0 step 1, step 2
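
The first bullet can be spot-checked after `uv sync` with a few lines of standard-library Python. This is a minimal sketch: the `EXPECTED` table copies the versions quoted above, the helper name is hypothetical, and an installed build may report a version string with a local suffix, in which case the exact comparison would need loosening.

```python
from importlib import metadata

# Versions reported in the test plan above.
EXPECTED = {"transformers": "5.2.0", "vllm": "0.19.1"}


def find_mismatches(expected, lookup=metadata.version):
    """Return {package: (expected, found)} for every pin that does not match.

    `lookup` is injectable so the check can be exercised without the
    packages installed; a missing package is reported as found=None.
    """
    mismatches = {}
    for name, want in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(EXPECTED)
    for name, (want, found) in problems.items():
        print(f"{name}: expected {want}, found {found}")
    if not problems:
        print("all pins match")
```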

AI assistance disclosure

This PR was prepared with Claude Code assistance for the dep-resolution sweep (resolving the vllm/transformers/torch ceiling cascade), uv.lock regeneration, and smoke-test orchestration on a remote H100 worker.

VeOmni's main branch now defaults to transformers v5.2.0 (PR #751,
gated via the `transformers-stable` dependency group); mirror that pin
in vexact so users on the veomni extra resolve to the same version
VeOmni tests/develops against.

Changes:

- veomni rev: 58759e7 -> a4ed599. Picks up the v5 default plus
  Qwen3-VL/Omni-MoE CPU-sync removals (#762, #764), v5 loader test
  (#727), MoE router replay hook (#719), and the v4 cleanup (#768).
- vllm: 0.18.0 -> 0.19.1 (latest vllm still on torch 2.10; 0.20+ would
  cascade a torch 2.11 + flash-attn-wheel bump).
- transformers pinned to 5.2.0 inside the `veomni` extra so it only
  hits users actually doing veomni-based training.
- override-dependencies forces transformers==5.2.0 globally so the
  `vllm` extra (whose metadata still excludes 5.0.*-5.4.* until the
  vllm devs whitelist 5.5.1+) can coexist in the resolution.

Smoke verification (1x8 H100, Qwen3-1.7B + gsm8k, 2 training steps):

  step:1  rollout_probs_diff_max=0.0  pearson_corr=0.9999999
          actor/entropy=0.180         k3_kl=0.0
  step:2  rollout_probs_diff_max=0.0  pearson_corr=1.0
          actor/entropy=0.176         k3_kl=0.0

Bitwise actor<->rollout alignment is preserved under transformers v5.

The 30B-A3B B200 recipe was not re-validated on H100 here because its
`max_cache_blocks=4608` is sized for 192GB HBM and OOMs on 80GB
H100; alignment math is shared with the 1.7B path, so v5 risk is
contained to the import/dataclass surface (already covered by the
1.7B smoke).

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates vllm to 0.19.1 and veomni to a newer revision, while adding a global override for transformers==5.2.0 to resolve dependency conflicts. Reviewers noted that this override is risky as vllm explicitly excludes this version range, suggesting that veomni should ideally be updated to a whitelisted transformers version (5.5.1+) instead. Feedback also included a request to clarify comments in pyproject.toml regarding the global nature of the uv override compared to its effect on pip users.

Comment thread pyproject.toml
Comment on lines +158 to +162
# vllm 0.19.1's metadata still excludes transformers 5.0.*-5.4.* (only
# 5.5.1+ is whitelisted), but VeOmni pins transformers==5.2.0. Override
# vllm's conservative ceiling so the `vllm` and `veomni` extras can
# coexist; vllm 0.19.1 runs fine against transformers 5.2 in practice.
"transformers==5.2.0",

high

Forcing transformers==5.2.0 via override-dependencies to bypass vllm's version constraints is risky. Since vllm explicitly excludes versions 5.0.* through 5.4.*, there may be known incompatibilities or breaking changes in the transformers API that vllm relies on. While the smoke tests passed for the Qwen3-1.7B rollout, this override might cause issues in other vllm features or models.

Additionally, this override makes the pin global for all uv users, which contradicts the PR description's intent to only affect veomni users. If transformers 5.5.1+ is already whitelisted by vllm, consider if VeOmni can be updated to that version to avoid the need for a global override.

Comment thread pyproject.toml
Comment on lines +40 to +44
# VeOmni's default install pins transformers==5.2.0 (via its
# `transformers-stable` dependency group). Mirror that pin here so vexact
# users picking up the veomni extra resolve to the same version VeOmni
# tests/develops against.
"transformers==5.2.0",

medium

The comment here states that the pin only hits users of the veomni extra. However, due to the global override added in tool.uv.override-dependencies (line 162), this version is actually enforced for all uv resolutions in this project. The pin in the extra is effectively redundant for uv users but remains relevant for pip users. Please update the comment to reflect the actual behavior under uv.

Suggested change
- # VeOmni's default install pins transformers==5.2.0 (via its
- # `transformers-stable` dependency group). Mirror that pin here so vexact
- # users picking up the veomni extra resolve to the same version VeOmni
- # tests/develops against.
- "transformers==5.2.0",
+ # VeOmni's default install pins transformers==5.2.0 (via its
+ # `transformers-stable` dependency group). Mirror that pin here for pip
+ # users; note that for uv users, this is enforced project-wide via
+ # override-dependencies to resolve conflicts with vllm.
+ "transformers==5.2.0",

Without these fixes vexact rollout drifts severely from the actor side on
both supported MoE architectures under transformers v5
(``rollout_probs_diff_max`` ≈0.99/0.998), making bitwise-aligned RL
training impossible.

Five concrete causes; one fix per cause:

1. ``ModelCreator`` allocated the ``"weights"`` TorchMemorySaver region with
   ``enable_cpu_backup=False``. ``rollout.release()`` pauses every region,
   so weights were *freed* on pause and came back as uninitialised garbage
   on resume. Flip the flag to ``True`` so the offload preserves values.

2. VeOmni v5's actor exposes MoE experts as a fused
   ``mlp.experts.gate_up_proj`` (shape ``[E, 2I, H]``), but vexact stored
   per-projection ``gate_proj`` / ``up_proj`` tensors. verl's bucketed
   FSDP→rollout transfer therefore silently dropped every expert key. Move
   vexact's ``Qwen3MoeExperts`` and ``PatchDeepseekV3NaiveMoe`` to the
   fused layout, slice into per-projection views at MoE-forward time.

3. Extend the per-expert loader to also accept ``gate_up_proj.weight`` (the
   bucketed sync ships ``mlp.experts.{idx}.gate_up_proj.weight``) and to
   know how to write the fused destination from per-projection disk keys.

4. Use VeOmni's ``veomni.ops.fused_moe_forward`` (``fc1_1_2_weight=…``
   merged-fc1 path) on the rollout side too so the kernels match the
   actor's exactly. Initialise ``veomni.distributed.parallel_state``
   (non-EP, ``dp_size=world_size``) in ``Worker.__init__`` because the
   group_gemm/quack/npu kernels read ``get_parallel_state().ep_enabled``
   on every forward and would otherwise crash with
   ``ValueError: product of parallel sizes…`` under PP>1. The kernel
   binding tolerates CPU-only worker processes (AgentLoopWorker, etc.).

5. For deepseek_v3: VeOmni v5's stock ``DeepseekV3Attention.forward``
   pads ``value_states`` to ``qk_head_dim`` when FA is requested,
   forcing FA4 onto its non-MLA codegen path. vexact uses FA4's
   MLA-native path (unpadded V=128). Patch VeOmni's actor-side attention
   / RoPE / RMSNorm modules to use vexact's MLA-native versions so the
   two sides hit identical kernel call signatures.
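
The layout change in cause 2 can be pictured with a toy example. The sketch below uses plain nested lists in place of tensors and assumes that gate rows come first and up rows second inside the fused ``[E, 2I, H]`` block; that ordering and both function names are illustrative assumptions, not taken from vexact or VeOmni.

```python
def fuse_gate_up(gate, up):
    """Stack per-expert gate and up weights into one [E, 2I, H] nested list.

    Assumption for this sketch: gate rows come first, then up rows.
    """
    return [g_rows + u_rows for g_rows, u_rows in zip(gate, up)]


def split_gate_up(fused):
    """Recover per-projection [E, I, H] pieces from the fused layout."""
    gate, up = [], []
    for rows in fused:
        half = len(rows) // 2  # I, the intermediate size
        gate.append(rows[:half])
        up.append(rows[half:])
    return gate, up


# Tiny example: E=2 experts, I=2 intermediate rows, H=3 hidden columns.
gate = [[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]]
up = [[[5, 5, 5], [6, 6, 6]], [[7, 7, 7], [8, 8, 8]]]

fused = fuse_gate_up(gate, up)
assert len(fused) == 2 and len(fused[0]) == 4 and len(fused[0][0]) == 3
assert split_gate_up(fused) == (gate, up)
```

List slicing copies data here; with real tensors, slicing along the second dimension would typically yield views over the fused storage, which is what lets a single fused parameter serve both projections.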

The three v4 ``trainer.use_legacy_worker_impl=disable`` Hydra overrides
in ``examples/moe/run_qwen3_30B_A3B_*.sh`` are removed; that knob no
longer exists in the current verl pin and the recipes errored out
immediately when run.

Smoke verification (1x8 H100, transformers 5.2.0, veomni a4ed599):

  Qwen3-30B-A3B (qwen3_moe), examples/moe/run_qwen3_30B_A3B_dapo.sh
    step:1  rollout_probs_diff_max=0.0  pearson_corr=1.0  entropy=0.124

  Moonlight-16B-A3B (deepseek_v3 / MLA), examples/moe/run_moonlight_gsm8k.sh
    step:1  rollout_probs_diff_max=0.0  pearson_corr=1.0  entropy=0.036

Both archs now match the dense Qwen3-1.7B baseline.
Collaborator Author

Luosuu commented May 21, 2026

Update: MoE rollout alignment under transformers v5 — both archs verified

Pushed e7e4f88 which extends this PR with the vexact-side MoE fixes needed for bitwise-aligned rollout under transformers v5.

Smoke verification on 1×8 H100

| Model | Arch | rollout_probs_diff_max | pearson_corr | actor/entropy |
| --- | --- | --- | --- | --- |
| Qwen3-1.7B | dense | 0.0 | 1.0 | 0.18 |
| Qwen3-30B-A3B | qwen3_moe | 0.0 | 1.0 | 0.124 |
| Moonlight-16B-A3B | deepseek_v3 (MLA) | 0.0 | 1.0 | 0.036 |

Before this commit (with the pure dep bump alone): Qwen3-MoE diverged at 0.99 and Moonlight at 0.998; entropy was uniform-ish (~7.4/8.4) — rollout was producing garbage logits.

Root causes fixed

  1. enable_cpu_backup=False on the weights TorchMemorySaver region. rollout.release() pauses every region; with no CPU backup the GPU memory came back uninitialised on resume. Flipped to True.
  2. MoE expert storage mismatch. VeOmni v5 ships fused mlp.experts.gate_up_proj ([E, 2I, H]); vexact stored gate_proj/up_proj separately, so verl's bucketed FSDP→rollout sync silently dropped every expert key. Moved vexact's experts to the fused layout; slice into per-projection views in the MoE forward.
  3. Loader didn't know gate_up_proj keys. Extended _EXPERT_PROJS and added per-projection-to-fused write paths for both disk (per-expert separate) and FSDP-sync (per-expert fused) variants.
  4. Used VeOmni's fused_moe_forward (merged-fc1_1_2_weight path) on rollout too so the kernels match the actor exactly; init veomni.distributed.parallel_state non-EP in Worker.__init__ so the kernel's ep_enabled check doesn't raise under PP>1 (it walks pp*dp*cp*ulysses*tp == world_size).
  5. deepseek_v3 attention path divergence. VeOmni v5's actor-side DeepseekV3Attention.forward pads value_states to qk_head_dim when FA is requested, forcing FA4 onto its standard codegen path. vexact's rollout uses FA4's MLA-native path (unpadded V=128). Patched VeOmni's actor-side attention / RoPE / RMSNorm to reuse vexact's MLA-native versions so both sides hit identical kernel call signatures.
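
The loader change in item 3 amounts to routing three key shapes to one fused destination. The sketch below is a hypothetical reconstruction: the regex, the function name, the `down_proj` branch, and the gate-first row ordering are all assumptions, and only the `mlp.experts.{idx}.gate_up_proj.weight` key name comes from this PR.

```python
import re

# Hypothetical key pattern, modeled on the
# `mlp.experts.{idx}.gate_up_proj.weight` name quoted in the PR.
_EXPERT_KEY = re.compile(
    r"mlp\.experts\.(?P<idx>\d+)\."
    r"(?P<proj>gate_up_proj|gate_proj|up_proj|down_proj)\.weight$"
)


def route_expert_key(key, intermediate_size):
    """Map a checkpoint key to (expert_idx, destination, row_slice).

    Returns None for non-expert keys. `row_slice` selects rows of one
    expert's fused [2I, H] gate_up_proj slab; it is None for down_proj.
    Assumes gate occupies rows [0, I) and up occupies rows [I, 2I).
    """
    match = _EXPERT_KEY.search(key)
    if match is None:
        return None
    idx = int(match.group("idx"))
    proj = match.group("proj")
    i = intermediate_size
    if proj == "gate_up_proj":  # FSDP-sync variant: already fused
        return idx, "gate_up_proj", slice(0, 2 * i)
    if proj == "gate_proj":  # disk variant: first half of the slab
        return idx, "gate_up_proj", slice(0, i)
    if proj == "up_proj":  # disk variant: second half of the slab
        return idx, "gate_up_proj", slice(i, 2 * i)
    return idx, "down_proj", None


print(route_expert_key("model.layers.0.mlp.experts.3.up_proj.weight", 4))
```

Handling both variants in one routing function keeps the disk-load and FSDP-sync paths writing into the same fused parameter, so neither path can silently skip expert keys.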

Also cleared the stale trainer.use_legacy_worker_impl=disable Hydra override from the three examples/moe/run_qwen3_30B_A3B_*.sh recipes — that knob no longer exists in the current verl pin and the recipes errored at startup.

Test plan

  • uv lock + uv sync --extra gpu --extra verl --extra veomni resolves
  • Dense Qwen3-1.7B smoke: rollout_probs_diff_max = 0.0
  • Qwen3-30B-A3B (qwen3_moe) smoke on 8×H100 (PP=4): rollout_probs_diff_max = 0.0
  • Moonlight (deepseek_v3 / MLA) smoke on 8×H100: rollout_probs_diff_max = 0.0

@Luosuu Luosuu changed the title from "[deps] feat: bump to transformers v5.2.0 + VeOmni a4ed599" to "[model] feat: bump to transformers v5.2.0 + VeOmni a4ed599" May 21, 2026
@Luosuu Luosuu merged commit 1832a19 into verl-project:main May 21, 2026
2 checks passed