[model] feat: bump to transformers v5.2.0 + VeOmni a4ed599#29
Conversation
VeOmni's main branch now defaults to transformers v5.2.0 (PR #751,
gated via the `transformers-stable` dependency group); mirror that pin
in vexact so users on the veomni extra resolve to the same version
VeOmni tests/develops against.
Changes:
- veomni rev: 58759e7 -> a4ed599. Picks up the v5 default plus
Qwen3-VL/Omni-MoE CPU-sync removals (#762, #764), v5 loader test
(#727), MoE router replay hook (#719), and the v4 cleanup (#768).
- vllm: 0.18.0 -> 0.19.1 (latest vllm still on torch 2.10; 0.20+ would
cascade a torch 2.11 + flash-attn-wheel bump).
- transformers pinned to 5.2.0 inside the `veomni` extra so it only
hits users actually doing veomni-based training.
- override-dependencies forces transformers==5.2.0 globally so the
`vllm` extra (whose metadata still excludes 5.0.*-5.4.* until the
vllm devs whitelist 5.5.1+) can coexist in the resolution.
Smoke verification (1x8 H100, Qwen3-1.7B + gsm8k, 2 training steps):
step:1 rollout_probs_diff_max=0.0 pearson_corr=0.9999999
actor/entropy=0.180 k3_kl=0.0
step:2 rollout_probs_diff_max=0.0 pearson_corr=1.0
actor/entropy=0.176 k3_kl=0.0
Bitwise actor<->rollout alignment is preserved under transformers v5.
The 30B-A3B B200 recipe was not re-validated on H100 here because its
`max_cache_blocks=4608` is sized for 192GB HBM and OOMs on 80GB
H100; alignment math is shared with the 1.7B path, so v5 risk is
contained to the import/dataclass surface (already covered by the
1.7B smoke).
There was a problem hiding this comment.
Code Review
This pull request updates vllm to 0.19.1 and veomni to a newer revision, while adding a global override for transformers==5.2.0 to resolve dependency conflicts. Reviewers noted that this override is risky as vllm explicitly excludes this version range, suggesting that veomni should ideally be updated to a whitelisted transformers version (5.5.1+) instead. Feedback also included a request to clarify comments in pyproject.toml regarding the global nature of the uv override compared to its effect on pip users.
| # vllm 0.19.1's metadata still excludes transformers 5.0.*-5.4.* (only | ||
| # 5.5.1+ is whitelisted), but VeOmni pins transformers==5.2.0. Override | ||
| # vllm's conservative ceiling so the `vllm` and `veomni` extras can | ||
| # coexist; vllm 0.19.1 runs fine against transformers 5.2 in practice. | ||
| "transformers==5.2.0", |
There was a problem hiding this comment.
Forcing transformers==5.2.0 via override-dependencies to bypass vllm's version constraints is risky. Since vllm explicitly excludes versions 5.0.* through 5.4.*, there may be known incompatibilities or breaking changes in the transformers API that vllm relies on. While the smoke tests passed for the Qwen3-1.7B rollout, this override might cause issues in other vllm features or models.
Additionally, this override makes the pin global for all uv users, which contradicts the PR description's intent to only affect veomni users. If transformers 5.5.1+ is already whitelisted by vllm, consider if VeOmni can be updated to that version to avoid the need for a global override.
| # VeOmni's default install pins transformers==5.2.0 (via its | ||
| # `transformers-stable` dependency group). Mirror that pin here so vexact | ||
| # users picking up the veomni extra resolve to the same version VeOmni | ||
| # tests/develops against. | ||
| "transformers==5.2.0", |
There was a problem hiding this comment.
The comment here states that the pin only hits users of the veomni extra. However, due to the global override added in tool.uv.override-dependencies (line 162), this version is actually enforced for all uv resolutions in this project. The pin in the extra is effectively redundant for uv users but remains relevant for pip users. Please update the comment to reflect the actual behavior under uv.
| # VeOmni's default install pins transformers==5.2.0 (via its | |
| # `transformers-stable` dependency group). Mirror that pin here so vexact | |
| # users picking up the veomni extra resolve to the same version VeOmni | |
| # tests/develops against. | |
| "transformers==5.2.0", | |
| # VeOmni's default install pins transformers==5.2.0 (via its | |
| # `transformers-stable` dependency group). Mirror that pin here for pip | |
| # users; note that for uv users, this is enforced project-wide via | |
| # override-dependencies to resolve conflicts with vllm. | |
| "transformers==5.2.0", |
Without these fixes vexact rollout drifts severely from the actor side on
both supported MoE architectures under transformers v5
(``rollout_probs_diff_max`` ≈0.99/0.998), making bitwise-aligned RL
training impossible.
Five concrete causes; one fix per cause:
1. ``ModelCreator`` allocated the ``"weights"`` TorchMemorySaver region with
``enable_cpu_backup=False``. ``rollout.release()`` pauses every region,
so weights were *freed* on pause and came back as uninitialised garbage
on resume. Flip the flag to ``True`` so the offload preserves values.
2. VeOmni v5's actor exposes MoE experts as a fused
``mlp.experts.gate_up_proj`` (shape ``[E, 2I, H]``), but vexact stored
per-projection ``gate_proj`` / ``up_proj`` tensors. verl's bucketed
FSDP→rollout transfer therefore silently dropped every expert key. Move
vexact's ``Qwen3MoeExperts`` and ``PatchDeepseekV3NaiveMoe`` to the
fused layout, slice into per-projection views at MoE-forward time.
3. Extend the per-expert loader to also accept ``gate_up_proj.weight`` (the
bucketed sync ships ``mlp.experts.{idx}.gate_up_proj.weight``) and to
know how to write the fused destination from per-projection disk keys.
4. Use VeOmni's ``veomni.ops.fused_moe_forward`` (``fc1_1_2_weight=…``
merged-fc1 path) on the rollout side too so the kernels match the
actor's exactly. Initialise ``veomni.distributed.parallel_state``
(non-EP, ``dp_size=world_size``) in ``Worker.__init__`` because the
group_gemm/quack/npu kernels read ``get_parallel_state().ep_enabled``
on every forward and would otherwise crash with
``ValueError: product of parallel sizes…`` under PP>1. The kernel
binding tolerates CPU-only worker processes (AgentLoopWorker, etc.).
5. For deepseek_v3: VeOmni v5's stock ``DeepseekV3Attention.forward``
pads ``value_states`` to ``qk_head_dim`` when FA is requested,
forcing FA4 onto its non-MLA codegen path. vexact uses FA4's
MLA-native path (unpadded V=128). Patch VeOmni's actor-side attention
/ RoPE / RMSNorm modules to use vexact's MLA-native versions so the
two sides hit identical kernel call signatures.
The three v4 ``trainer.use_legacy_worker_impl=disable`` Hydra overrides
in ``examples/moe/run_qwen3_30B_A3B_*.sh`` are removed; that knob no
longer exists in the current verl pin and the recipes errored out
immediately when run.
Smoke verification (1x8 H100, transformers 5.2.0, veomni a4ed599):
Qwen3-30B-A3B (qwen3_moe), examples/moe/run_qwen3_30B_A3B_dapo.sh
step:1 rollout_probs_diff_max=0.0 pearson_corr=1.0 entropy=0.124
Moonlight-16B-A3B (deepseek_v3 / MLA), examples/moe/run_moonlight_gsm8k.sh
step:1 rollout_probs_diff_max=0.0 pearson_corr=1.0 entropy=0.036
Both archs now match the dense Qwen3-1.7B baseline.
Update: MoE rollout alignment under transformers v5 — both archs verifiedPushed Smoke verification on 1×8 H100
Before this commit (with the pure dep bump alone): Qwen3-MoE diverged at 0.99 and Moonlight at 0.998; entropy was uniform-ish (~7.4/8.4) — rollout was producing garbage logits. Root causes fixed
Also cleared the stale Test plan
|
Summary
VeOmni's
mainnow defaults totransformers==5.2.0(PR ByteDance-Seed/VeOmni#751, gated via thetransformers-stabledependency group). This PR mirrors that pin in vexact so users on theveomniextra resolve to the same version VeOmni tests/develops against, and bumps the VeOmni pin to currentmain.Changes
veomnirev:58759e7→a4ed599. Picks up the v5 default plus the Qwen3-VL/Omni-MoE CPU-sync removals (#762, #764), v5 loader test (#727), MoE router replay hook (#719), and the v4 cleanup (#768).vllm:0.18.0→0.19.1(latest vllm still ontorch 2.10; 0.20+ would cascade atorch 2.11and flash-attn-wheel bump).transformers==5.2.0added to theveomniextra so it only hits users actually doing veomni-based training.override-dependenciesforcestransformers==5.2.0globally so thevllmextra — whose metadata still excludestransformers 5.0.*-5.4.*until the vllm devs whitelist 5.5.1+ — can coexist in the resolution.Smoke verification
1×8 H100,
examples/getting_started/run_qwen3_1b7.sh(Qwen3-1.7B + gsm8k), capped at 2 training steps:rollout_probs_diff_maxrollout_actor_probs_pearson_corractor/entropyk3_klBitwise actor↔rollout alignment is preserved under transformers v5.2.0.
The 30B-A3B B200 recipe was not re-validated on H100 here because its
max_cache_blocks=4608is sized for 192GB HBM and OOMs on 80GB H100; alignment math is shared with the 1.7B path, so v5 risk is contained to the import/dataclass surface (already covered by the 1.7B smoke).Test plan
uv lock+uv sync --extra gpu --extra verl --extra veomniresolves (transformers 5.2.0, veomni 0.1.9a5 @ a4ed599, vllm 0.19.1)python -c "import vexact.models.qwen3_moe, vexact.models.deepseek_v3"clean under v5.2.0rollout_probs_diff_max=0.0step 1, step 2AI assistance disclosure
This PR was prepared with Claude Code assistance for the dep-resolution sweep (resolving the vllm/transformers/torch ceiling cascade), uv.lock regeneration, and smoke-test orchestration on a remote H100 worker.