[megatron] feat: support DeepSeek V4 GRPO by HollowMan6 · Pull Request #6473 · verl-project/verl

HollowMan6 · 2026-05-26T06:11:44Z

What does this PR do?

Adds DeepSeek V4 Flash GRPO support with Megatron-Bridge actor/ref workers, vLLM rollout, FP8/MXFP4 weight transfer handling, and checkpoint save/export verification.

Requires

NVIDIA-NeMo/Megatron-Bridge#3969: Support DeepSeek V4 quantization scales during HF export

Test

End-to-end verification:

Model: DeepSeek-V4-Flash
Engine: Megatron actor/ref + vLLM rollout
Nodes: 16
GRPO dataset: DAPO math
MAX_RESPONSE_LENGTH=10240
Result: COMPLETED 0:0 (Slurm job state and exit code)
Checkpoint save/export result: Success: All tensors from the original checkpoint were written.

Key metrics:

response_length/max=9406
response_length/clip_ratio=0.0
critic/rewards/mean=0.645361
training/rollout_actor_probs_pearson_corr=0.797749
rollout diff mean/std/max: 0.111945 / 0.195764 / 0.999996
rollout_corr/kl=0.346802
timing_s/step=2273.88

API and Usage Example

This PR adds an example script:

bash examples/grpo_trainer/run_deepseek_v4_flash_megatron.sh

Example Slurm usage:

sbatch --nodes=16 \
  --export=ALL,MODEL_PATH=/path/to/DeepSeek-V4-Flash,TRAIN_FILE=/path/to/train.parquet,TEST_FILE=/path/to/test.parquet \
  outputs/deepseek_v4_flash_example_grpo.sbatch

Design & Code Changes

Add DeepSeek V4 Flash GRPO example script and README entry.
Add vLLM FP8/MXFP4 utilities for quantized rollout weight preparation and reload-safe parameter handling.
Add bucketed Megatron-to-vLLM weight transfer coverage for quantized weights.
Normalize DeepSeek V4 vLLM HF overrides, including MTP disablement and Yarn RoPE scaling fields.
Adjust Megatron transformer config handling for disabled MTP layers.
Add model config support needed by DeepSeek V4 Flash.
Add unit coverage for vLLM FP8 utilities, bucketed transfer behavior, and model config handling.
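For readers unfamiliar with MXFP4, the following is a minimal sketch of the general idea, not the code in vllm_fp8_utils.py. It assumes the OCP Microscaling layout (4-bit E2M1 elements sharing one power-of-two scale per block) and uses a block of 8 values for brevity where the format uses 32. The nibble order (low nibble first) and all function names are assumptions made for illustration.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (codes 0..7); bit 3 is the sign.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Return (shared_exponent, codes) for one block of floats."""
    max_abs = max(abs(x) for x in block)
    # Shared power-of-two scale: floor(log2(max)) minus the largest E2M1 exponent (2).
    exp = 0 if max_abs == 0 else math.floor(math.log2(max_abs)) - 2
    scale = 2.0 ** exp
    codes = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # saturate at the largest E2M1 magnitude
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append(idx | (0x8 if x < 0 else 0))
    return exp, codes


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first (an assumed order)."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed):
    out = []
    for byte in packed:
        out.extend((byte & 0xF, byte >> 4))
    return out


def dequantize(exp, codes):
    scale = 2.0 ** exp
    return [(-1.0 if c & 0x8 else 1.0) * E2M1_VALUES[c & 0x7] * scale for c in codes]


block = [0.0, 0.5, 1.0, -1.5, 2.0, 3.0, -4.0, 6.0]
exp, codes = quantize_block(block)
packed = pack_nibbles(codes)
restored = dequantize(exp, unpack_nibbles(packed))
print(exp, packed.hex(), restored == block)  # 0 10b2547e True
```

The example block round-trips exactly only because every value is already representable at the chosen scale; arbitrary weights would lose precision.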

gemini-code-assist

Code Review

This pull request introduces support for the DeepSeek-V4-Flash model in the GRPO trainer, adding a new Megatron training script and extensive utilities for FP8 and MXFP4 quantization weight loading. It also enhances custom chat template resolution, aligns offsets in bucketed weight transfers to respect tensor element sizes, and adds comprehensive unit tests. The review feedback highlights two critical issues: a potential AttributeError in transformer_impl.py when accessing csa_compress_ratios on standard models, and another potential AttributeError in vllm_fp8_utils.py when copying parameter attributes.
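The first issue can be illustrated with a small sketch. It assumes the config is a plain attribute container and that models without CSA simply lack `csa_compress_ratios`; the helper name `read_csa_ratios` is hypothetical and is not taken from transformer_impl.py.

```python
from types import SimpleNamespace


def read_csa_ratios(hf_config):
    """Return the per-layer CSA ratios as a list, or None when the model has none."""
    ratios = getattr(hf_config, "csa_compress_ratios", None)
    return None if ratios is None else list(ratios)


standard_model = SimpleNamespace(num_hidden_layers=2)  # no CSA field at all
dsv4_like = SimpleNamespace(num_hidden_layers=2, csa_compress_ratios=(4, 4, 1))

print(read_csa_ratios(standard_model))  # None
print(read_csa_ratios(dsv4_like))       # [4, 4, 1]
```

Reading the field with `getattr` and a `None` default keeps the code path usable for models that never define it.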

Copilot

Pull request overview

Adds DeepSeek V4 Flash GRPO support for Megatron-Bridge actor/ref workers with vLLM rollout, including FP8/MXFP4 weight reload handling, config overrides, and example + unit coverage.

Changes:

Normalize vLLM HF overrides (MTP disablement + Yarn RoPE scaling) and align Megatron transformer/provider overrides for disabled MTP layers.
Improve vLLM rollout weight sync for quantized models (FP8/MXFP4), including bucketed transfer alignment and reload-safe parameter handling.
Add custom_chat_template resolution (file/env) plus tests and a DeepSeek-V4-Flash Megatron GRPO example script / README entry.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| verl/workers/rollout/vllm_rollout/vllm_async_server.py | Adds HF override normalization for MTP disablement and Yarn RoPE scaling. |
| verl/workers/rollout/vllm_rollout/utils.py | Updates vLLM colocated worker to better handle FP8 reload preparation/post-processing across bucketed weight sync. |
| verl/workers/rollout/vllm_rollout/bucketed_weight_transfer.py | Aligns bucket offsets to tensor element size to support safe dtype views on the receiver side. |
| verl/workers/engine/megatron/transformer_impl.py | Adjusts Megatron provider overrides when MTP is disabled (and trims CSA ratios). |
| verl/workers/config/model.py | Adds `custom_chat_template` support and resolution from file/env at config materialization time. |
| verl/utils/vllm/vllm_fp8_utils.py | Adds MXFP4 quantization + reload-safe param restoration and DeepSeek V4 naming/mapping tweaks. |
| tests/workers/config/test_model_config_on_cpu.py | Adds unit tests for chat template resolution and mutability. |
| tests/utils/test_vllm_fp8_utils.py | Adds targeted unit tests for MXFP4 packing, prequantized detection, and scale-name conventions. |
| tests/utils/test_bucketed_weight_transfer.py | Adds unit test for new bucket offset alignment helper. |
| examples/grpo_trainer/run_deepseek_v4_flash_megatron.sh | Adds runnable DeepSeek-V4-Flash GRPO example (Megatron + vLLM rollout). |
| examples/grpo_trainer/README.md | Documents the new DeepSeek-V4-Flash example entry. |
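The bucket offset alignment mentioned for bucketed_weight_transfer.py can be sketched as follows. This is an illustration under assumptions, not the helper added by the PR: the names `align_offset` and `plan_bucket` are hypothetical, and tensors are modelled as `(name, numel, element_size)` tuples so the example runs without PyTorch. The motivation is that reinterpreting a byte buffer as a wider dtype generally requires the starting byte offset to be a multiple of that dtype's element size.

```python
def align_offset(offset: int, element_size: int) -> int:
    """Round offset up to the next multiple of element_size (in bytes)."""
    if element_size <= 0:
        raise ValueError("element_size must be positive")
    return (offset + element_size - 1) // element_size * element_size


def plan_bucket(tensors):
    """Assign each tensor an aligned byte offset inside one flat bucket."""
    layout = {}
    offset = 0
    for name, numel, element_size in tensors:
        offset = align_offset(offset, element_size)  # pad before wider dtypes
        layout[name] = offset
        offset += numel * element_size
    return layout, offset


tensors = [
    ("scale_u8", 3, 1),     # 3 bytes, ends at offset 3
    ("weight_bf16", 4, 2),  # must start on a 2-byte boundary
    ("bias_fp32", 2, 4),    # must start on a 4-byte boundary
]
layout, total_bytes = plan_bucket(tensors)
print(layout)       # {'scale_u8': 0, 'weight_bf16': 4, 'bias_fp32': 12}
print(total_bytes)  # 20
```

Without the padding step, `weight_bf16` would start at byte 3 and could not be viewed as 2-byte elements on the receiver side.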

Comments suppressed due to low confidence (1)

verl/workers/rollout/vllm_rollout/utils.py:311

The drafter FP8 reload path still calls load_quanted_weights(...) with default prepare_model=True/process_model=True for every received bucket. With bucketed transfer this can repeat non-idempotent post-processing and adds significant overhead. It should mirror the main-model behavior by skipping per-bucket prepare/process when quant_prepared is true, and rely on the once-per-reload processing at the end of update_weights_from_ipc().

                # Keep the draft model in sync when present.
                if self._use_mtp_drafter_weight_sync():
                    load_quanted_weights(weights, self.model_runner, is_drafter=True)
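The suggested behaviour (prepare once, load every bucket, post-process once) can be sketched with a stand-in class. Everything here is hypothetical scaffolding that only counts calls; it borrows the `prepare_model`, `process_model`, and `quant_prepared` names from the comment above but does not reproduce the real `load_quanted_weights` signature or the vLLM worker.

```python
class BucketedReloadSketch:
    """Hypothetical stand-in for a worker that receives weights in buckets."""

    def __init__(self):
        self.prepare_calls = 0
        self.process_calls = 0
        self.loaded = []
        self.quant_prepared = False

    def _prepare(self):
        self.prepare_calls += 1

    def _process(self):
        self.process_calls += 1

    def load_bucket(self, weights, prepare_model=True, process_model=True):
        if prepare_model:
            self._prepare()
        self.loaded.extend(weights)
        if process_model:
            self._process()

    def update_weights(self, buckets):
        # Prepare once up front, then tell every per-bucket load to skip both steps.
        self._prepare()
        self.quant_prepared = True
        for bucket in buckets:
            self.load_bucket(
                bucket,
                prepare_model=not self.quant_prepared,
                process_model=not self.quant_prepared,
            )
        # Post-process exactly once after the last bucket has arrived.
        self._process()
        self.quant_prepared = False


worker = BucketedReloadSketch()
worker.update_weights([["a"], ["b", "c"], ["d"]])
print(worker.prepare_calls, worker.process_calls, worker.loaded)  # 1 1 ['a', 'b', 'c', 'd']
```

With the defaults left at `True`, the same three buckets would trigger three prepare and three process calls, which is the repeated post-processing the comment warns about.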

gemini-code-assist

Code Review

This pull request introduces support for the DeepSeek-V4-Flash model, adding training scripts, custom chat template resolution, and extensive FP8/MXFP4 quantization and weight-loading utilities. It also updates the bucketed weight transfer mechanism to align offsets based on tensor element size and fixes Megatron transformer configuration building when MTP is disabled. The review feedback highlights critical stability improvements: guarding against None values when casting rope_scaling factors to float in vllm_async_server.py, handling cases where packed_modules_mapping is explicitly set to None in vllm_fp8_utils.py, and broadening exception handling during parameter attribute copying to prevent unexpected crashes.
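Two of these guards can be sketched in a few lines. The helper names are hypothetical, and the key names (`factor`, `beta_fast`, `beta_slow`) are the usual Yarn fields in HF configs, used here as an assumption about which values are cast.

```python
from types import SimpleNamespace


def normalize_rope_factors(rope_scaling, keys=("factor", "beta_fast", "beta_slow")):
    """Cast selected rope_scaling entries to float, leaving None values untouched."""
    normalized = dict(rope_scaling or {})
    for key in keys:
        if normalized.get(key) is not None:
            normalized[key] = float(normalized[key])
    return normalized


def packed_mapping(model):
    """Return packed_modules_mapping, treating both a missing and a None value as empty."""
    return getattr(model, "packed_modules_mapping", None) or {}


print(normalize_rope_factors({"factor": "16", "beta_fast": None}))
# {'factor': 16.0, 'beta_fast': None}
print(packed_mapping(SimpleNamespace(packed_modules_mapping=None)))  # {}
```

Calling `float(None)` raises `TypeError`, and `getattr(model, "packed_modules_mapping", {})` still returns `None` when the attribute exists but is set to `None`, which is why both checks are explicit.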

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

gemini-code-assist

Code Review

This pull request introduces support for the DeepSeek-V4-Flash model, featuring a Megatron training script, custom chat template resolution from files or environment variables, and comprehensive FP8/MXFP4 quantization utilities for vLLM rollout. It also addresses alignment issues in bucketed weight transfer by respecting tensor element size and disables MTP layers when MTP is not enabled. A critical feedback item was identified in the _model_type helper function, where an AttributeError could occur if the model's configuration is None; a safe traversal suggestion has been provided to prevent potential crashes.
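A safe traversal of this kind might look like the sketch below. The function name `safe_model_type` is hypothetical and the real `_model_type` helper may differ; the point is only that each hop tolerates `None`.

```python
from types import SimpleNamespace


def safe_model_type(model):
    """Return model.config.model_type, or None if any link in the chain is missing."""
    config = getattr(model, "config", None)
    return getattr(config, "model_type", None)


print(safe_model_type(SimpleNamespace(config=None)))  # None
print(safe_model_type(SimpleNamespace(config=SimpleNamespace(model_type="deepseek_v4"))))  # deepseek_v4
```

`getattr(None, "model_type", None)` returns `None` rather than raising, so no separate `if config is None` branch is needed.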

gemini-code-assist

Code Review

This pull request introduces support for the DeepSeek-V4-Flash model in the GRPO trainer, including a new Megatron training launch script. Key changes include implementing MXFP4 weight quantization and loading utilities, aligning offsets during weight transfer based on tensor element sizes, resolving custom chat templates from files or environment variables, and disabling Multi-Token Prediction (MTP) layers when not enabled. Feedback focuses on avoiding blocking ZMQ socket operations within an asynchronous method by offloading them to a separate thread, and ensuring that rope_scaling configurations are properly converted to dictionaries before processing to handle RoPEScalingConfig objects.
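The thread-offloading suggestion can be illustrated without ZMQ. In this sketch `blocking_recv` is a hypothetical stand-in for a blocking socket call, simulated with `time.sleep`; `asyncio.to_thread` (Python 3.9+) runs it on a worker thread so other coroutines keep making progress.

```python
import asyncio
import time


def blocking_recv():
    """Stand-in for a blocking socket receive (hypothetical; simulated with sleep)."""
    time.sleep(0.2)
    return b"bucket"


async def ticker(ticks):
    """Another coroutine that should keep running while the receive is pending."""
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.monotonic())


async def main():
    ticks = []
    # Offload the blocking call so the event loop stays free for ticker().
    payload, _ = await asyncio.gather(asyncio.to_thread(blocking_recv), ticker(ticks))
    return payload, len(ticks)


payload, tick_count = asyncio.run(main())
print(payload, tick_count)  # b'bucket' 3
```

Calling `blocking_recv()` directly inside `main` would stall the loop for the full receive time before `ticker` could run at all.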

Signed-off-by: Hollow Man <hollowman@opensuse.org>

gemini-code-assist

Code Review

This pull request introduces support for the DeepSeek-V4-Flash model within the GRPO trainer, including a new Megatron training script and updated documentation. It adds robust FP8 and MXFP4 quantization utilities, improves weight loading and reloading mechanisms for vLLM, and enhances custom chat template resolution from files or environment variables. Additionally, it updates the bucketed weight transfer logic to align offsets by tensor element size and use asynchronous ZMQ. There are no review comments, so no further feedback is provided.

Meirtz · 2026-05-28T05:56:09Z

Hey @HollowMan6 — opened #6515 as the symmetric fix for the vanilla_mbridge=True (ISEEKYAN/mbridge) branch in the same _build_tf_config function. The MTP-disable + csa_compress_ratios trim pattern is the same as your change here, just applied at the set_extra_args rebuild point instead of inside provider_overrides.

End-to-end verified on GB200 single GPU through ISEEKYAN/mbridge + DSv4 hybrid attention (forward+backward+optimizer.step), pairing well with your vanilla_mbridge=False path coverage. They're complementary and independent — either can merge first.

I also opened NVIDIA-NeMo/Megatron-Bridge#4003 upstream to give dsa_indexer_loss_coeff a bridge-side default. MLATransformerConfig defaults the field to None, so the getattr(..., 0.0) fallback in csa.py never fires. You're working around it with ++override_transformer_config.dsa_indexer_loss_coeff=0.0 in run_deepseek_v4_flash_megatron.sh; that override stays valid but becomes redundant once the bridge sets a non-None default. Mentioning it in case coordination is useful.
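The `getattr` pitfall described here is easy to reproduce in isolation. The sketch below uses a plain namespace in place of MLATransformerConfig, and `coeff_or_default` is a hypothetical helper, not code from csa.py or the bridge.

```python
from types import SimpleNamespace

# The attribute exists but is None, so getattr's default is never used.
config = SimpleNamespace(dsa_indexer_loss_coeff=None)
naive = getattr(config, "dsa_indexer_loss_coeff", 0.0)
print(naive)  # None


def coeff_or_default(cfg, default=0.0):
    """Fall back to default when the field is missing or explicitly None."""
    value = getattr(cfg, "dsa_indexer_loss_coeff", None)
    return default if value is None else value


print(coeff_or_default(config))             # 0.0
print(coeff_or_default(SimpleNamespace()))  # 0.0
```

`getattr` only substitutes its third argument when the attribute is absent, which is why a field that defaults to `None` needs either an explicit check or a non-None default at the source.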

…on vanilla_mbridge=True path

PR verl-project#6473 added the same fix to the vanilla_mbridge=False (NeMo MB) path of MegatronEngine._build_tf_config. The vanilla_mbridge=True (ISEEKYAN/mbridge) path needs the symmetric treatment: when self.model_config.mtp.enable is False, force mtp_num_layers=0 so the bridge does not build MTP blocks, and trim the per-layer csa_compress_ratios list (DSv4-Flash HF configs pad it for the MTP layer when num_nextn_predict_layers > 0).

mtp_num_layers uses direct assignment (not setdefault) so a disabled-MTP run always forces 0 even if override_transformer_config carried a stale value.

Why not duplicate: verl-project#6473 only modifies the vanilla_mbridge=False branch. This PR modifies the vanilla_mbridge=True branch, a different code path with a complementary fix.

Test plan: validated end-to-end on GB200 (1 GPU) through ISEEKYAN/mbridge + DSv4 hybrid attention. Forward + backward + optimizer.step() with the vanilla=True path produces finite loss / finite grad_norm / update_successful=True.

AI assistance disclosure: developed with AI-assisted coding (Claude); author reviewed every changed line.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>
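The difference between direct assignment and setdefault described in this commit message can be sketched with plain dictionaries. `apply_mtp_disable` is a hypothetical helper, not the code in `_build_tf_config`; it assumes the overrides are a dict and that the ratios list has one trailing entry per MTP layer.

```python
def apply_mtp_disable(overrides, mtp_enabled, num_layers, csa_ratios=None):
    """Return a copy of overrides with MTP forced off and CSA ratios trimmed."""
    result = dict(overrides)
    if not mtp_enabled:
        # Direct assignment: overwrite any stale value carried in the overrides.
        result["mtp_num_layers"] = 0
        if csa_ratios is not None:
            # Keep only the entries for the main decoder layers.
            result["csa_compress_ratios"] = list(csa_ratios)[:num_layers]
    return result


stale = {"mtp_num_layers": 1}

with_setdefault = dict(stale)
with_setdefault.setdefault("mtp_num_layers", 0)
print(with_setdefault)  # {'mtp_num_layers': 1}

fixed = apply_mtp_disable(stale, mtp_enabled=False, num_layers=2, csa_ratios=[4, 4, 1])
print(fixed)  # {'mtp_num_layers': 0, 'csa_compress_ratios': [4, 4]}
```

`setdefault` leaves the stale `1` in place because the key already exists, which is the case the direct assignment is meant to cover.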

[megatron] feat: support DeepSeek V4 GRPO

7058378

Adds DeepSeek V4 Flash GRPO support with Megatron-Bridge actor/ref workers, vLLM rollout, FP8/MXFP4 weight transfer handling, and checkpoint save/export verification. Signed-off-by: Hollow Man <hollowman@opensuse.org>

Copilot AI review requested due to automatic review settings May 26, 2026 06:11

HollowMan6 requested review from ArronHZG, ISEEKYAN, PeterSH6, chenhaiq, ji-huazhong, tardis-key, vermouth1992, wucong25 and wuxibin89 as code owners May 26, 2026 06:11

Copilot started reviewing on behalf of HollowMan6 May 26, 2026 06:11

HollowMan6 mentioned this pull request May 26, 2026

[roadmap] verl 26Q2 roadmap #5836
