[megatron] feat: support DeepSeek V4 GRPO#6473

Open
HollowMan6 wants to merge 2 commits into verl-project:main from HollowMan6:dsv4

Collaborator

@HollowMan6 HollowMan6 commented May 26, 2026

What does this PR do?

Adds DeepSeek V4 Flash GRPO support with Megatron-Bridge actor/ref workers, vLLM rollout, FP8/MXFP4 weight transfer handling, and checkpoint save/export verification.

Need

Checklist Before Starting

Test

End-to-end verification:

  • Model: DeepSeek-V4-Flash
  • Engine: Megatron actor/ref + vLLM rollout
  • Nodes: 16
  • GRPO dataset: DAPO math
  • MAX_RESPONSE_LENGTH=10240
  • Result: COMPLETED 0:0
  • Checkpoint save/export result: Success: All tensors from the original checkpoint were written.

Key metrics:

  • response_length/max=9406
  • response_length/clip_ratio=0.0
  • critic/rewards/mean=0.645361
  • training/rollout_actor_probs_pearson_corr=0.797749
  • rollout diff mean/std/max: 0.111945 / 0.195764 / 0.999996
  • rollout_corr/kl=0.346802
  • timing_s/step=2273.88

API and Usage Example

This PR adds an example script:

bash examples/grpo_trainer/run_deepseek_v4_flash_megatron.sh

Example Slurm usage:

sbatch --nodes=16 \
  --export=ALL,MODEL_PATH=/path/to/DeepSeek-V4-Flash,TRAIN_FILE=/path/to/train.parquet,TEST_FILE=/path/to/test.parquet \
  outputs/deepseek_v4_flash_example_grpo.sbatch

Design & Code Changes

  • Add DeepSeek V4 Flash GRPO example script and README entry.
  • Add vLLM FP8/MXFP4 utilities for quantized rollout weight preparation and reload-safe parameter handling.
  • Add bucketed Megatron-to-vLLM weight transfer coverage for quantized weights.
  • Normalize DeepSeek V4 vLLM HF overrides, including MTP disablement and Yarn RoPE scaling fields.
  • Adjust Megatron transformer config handling for disabled MTP layers.
  • Add model config support needed by DeepSeek V4 Flash.
  • Add unit coverage for vLLM FP8 utilities, bucketed transfer behavior, and model config handling.
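
The MXFP4 utilities listed above are not shown on this page. As a rough illustration of the format itself, the sketch below quantizes one block in pure Python following the OCP Microscaling convention: FP4 E2M1 elements share one power-of-two scale per block (32 elements per block in the spec). The function names, the low-nibble-first packing order, and the simplified tie-breaking are assumptions for illustration, not the code in vllm_fp8_utils.py.

```python
import math

# Magnitudes representable by FP4 E2M1, indexed by their 3-bit code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest E2M1 binade (4.0 == 2**2)


def quantize_block(values):
    """Return (scale_exponent, 4-bit codes) for one block of floats."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Shared power-of-two scale: 2 ** (floor(log2(amax)) - emax).
    scale_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** scale_exp
    codes = []
    for v in values:
        # Clamp to the largest representable magnitude (6.0).
        mag = min(abs(v) / scale, E2M1_MAGNITUDES[-1])
        # Nearest representable magnitude; ties go to the smaller one here
        # (a simplification, real kernels typically round to nearest even).
        idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
        sign = 0x8 if v < 0 else 0x0
        codes.append(sign | idx)
    return scale_exp, codes


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first (assumed order)."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes


def dequantize_block(scale_exp, codes):
    """Map 4-bit codes back to floats using the shared scale."""
    scale = 2.0 ** scale_exp
    return [
        (-1.0 if c & 0x8 else 1.0) * E2M1_MAGNITUDES[c & 0x7] * scale
        for c in codes
    ]
```

Packing halves the byte count relative to one code per byte, which is why a receiver has to know the nibble order before it can reinterpret the buffer.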

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Adds DeepSeek V4 Flash GRPO support with Megatron-Bridge actor/ref workers, vLLM rollout, FP8/MXFP4 weight transfer handling, and checkpoint save/export verification.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for the DeepSeek-V4-Flash model in the GRPO trainer, adding a new Megatron training script and extensive utilities for FP8 and MXFP4 quantization weight loading. It also enhances custom chat template resolution, aligns offsets in bucketed weight transfers to respect tensor element sizes, and adds comprehensive unit tests. The review feedback highlights two critical issues: a potential AttributeError in transformer_impl.py when accessing csa_compress_ratios on standard models, and another potential AttributeError in vllm_fp8_utils.py when copying parameter attributes.

Comment thread verl/workers/engine/megatron/transformer_impl.py Outdated
Comment thread verl/utils/vllm/vllm_fp8_utils.py Outdated
Contributor

Copilot AI left a comment

Pull request overview

Adds DeepSeek V4 Flash GRPO support for Megatron-Bridge actor/ref workers with vLLM rollout, including FP8/MXFP4 weight reload handling, config overrides, and example + unit coverage.

Changes:

  • Normalize vLLM HF overrides (MTP disablement + Yarn RoPE scaling) and align Megatron transformer/provider overrides for disabled MTP layers.
  • Improve vLLM rollout weight sync for quantized models (FP8/MXFP4), including bucketed transfer alignment and reload-safe parameter handling.
  • Add custom_chat_template resolution (file/env) plus tests and a DeepSeek-V4-Flash Megatron GRPO example script / README entry.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| verl/workers/rollout/vllm_rollout/vllm_async_server.py | Adds HF override normalization for MTP disablement and Yarn RoPE scaling. |
| verl/workers/rollout/vllm_rollout/utils.py | Updates vLLM colocated worker to better handle FP8 reload preparation/post-processing across bucketed weight sync. |
| verl/workers/rollout/vllm_rollout/bucketed_weight_transfer.py | Aligns bucket offsets to tensor element size to support safe dtype views on the receiver side. |
| verl/workers/engine/megatron/transformer_impl.py | Adjusts Megatron provider overrides when MTP is disabled (and trims CSA ratios). |
| verl/workers/config/model.py | Adds custom_chat_template support and resolution from file/env at config materialization time. |
| verl/utils/vllm/vllm_fp8_utils.py | Adds MXFP4 quantization + reload-safe param restoration and DeepSeek V4 naming/mapping tweaks. |
| tests/workers/config/test_model_config_on_cpu.py | Adds unit tests for chat template resolution and mutability. |
| tests/utils/test_vllm_fp8_utils.py | Adds targeted unit tests for MXFP4 packing, prequantized detection, and scale-name conventions. |
| tests/utils/test_bucketed_weight_transfer.py | Adds unit test for new bucket offset alignment helper. |
| examples/grpo_trainer/run_deepseek_v4_flash_megatron.sh | Adds runnable DeepSeek-V4-Flash GRPO example (Megatron + vLLM rollout). |
| examples/grpo_trainer/README.md | Documents the new DeepSeek-V4-Flash example entry. |
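
The bucket offset alignment mentioned for bucketed_weight_transfer.py can be pictured with a small pure-Python sketch. When tensors of mixed dtypes are packed into one byte buffer, a slice reinterpreted as a wider dtype generally needs its byte offset to be a multiple of that dtype's element size. The helper names and the (name, num_bytes, element_size) tuple layout below are hypothetical, not the PR's actual helper.

```python
def align_offset(offset: int, element_size: int) -> int:
    """Round offset up to the next multiple of element_size."""
    if element_size <= 0:
        raise ValueError("element_size must be positive")
    return (offset + element_size - 1) // element_size * element_size


def plan_bucket(tensors):
    """Lay tensors out back to back in one byte buffer.

    tensors: iterable of (name, num_bytes, element_size).
    Returns ({name: (byte_offset, num_bytes)}, total_bytes), where every
    byte_offset is a multiple of that tensor's element_size.
    """
    layout = {}
    cursor = 0
    for name, num_bytes, element_size in tensors:
        # Insert padding so the receiver can view this slice as its dtype.
        cursor = align_offset(cursor, element_size)
        layout[name] = (cursor, num_bytes)
        cursor += num_bytes
    return layout, cursor
```

For example, a 3-byte uint8 scale followed by a 2-byte-per-element weight leaves one padding byte so the weight starts at offset 4 instead of 3.
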
Comments suppressed due to low confidence (1)

verl/workers/rollout/vllm_rollout/utils.py:311

  • The drafter FP8 reload path still calls load_quanted_weights(...) with default prepare_model=True/process_model=True for every received bucket. With bucketed transfer this can repeat non-idempotent post-processing and adds significant overhead. It should mirror the main-model behavior by skipping per-bucket prepare/process when quant_prepared is true, and rely on the once-per-reload processing at the end of update_weights_from_ipc().
                # Keep the draft model in sync when present.
                if self._use_mtp_drafter_weight_sync():
                    load_quanted_weights(weights, self.model_runner, is_drafter=True)
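
To make the concern about repeated post-processing concrete, here is a toy sketch. It does not use vLLM or the real load_quanted_weights; ToyModel and both reload functions are hypothetical, and a transpose stands in for any non-idempotent post-processing step.

```python
def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


class ToyModel:
    """Stand-in for a model whose post-processing is not idempotent."""

    def __init__(self):
        self.weights = {}
        self.process_calls = 0

    def load_bucket(self, bucket):
        self.weights.update(bucket)

    def post_process(self):
        # Transposes every loaded weight; running it twice undoes it.
        self.process_calls += 1
        self.weights = {k: transpose(v) for k, v in self.weights.items()}


def reload_per_bucket(model, buckets):
    """Post-process after every bucket (the pattern the review warns about)."""
    for bucket in buckets:
        model.load_bucket(bucket)
        model.post_process()


def reload_once(model, buckets):
    """Load every bucket first, then post-process exactly once."""
    for bucket in buckets:
        model.load_bucket(bucket)
    model.post_process()
```

With two buckets, reload_per_bucket processes the first bucket's weight twice and leaves it in its original, unprocessed layout, while reload_once processes each weight exactly once.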

Comment thread verl/workers/rollout/vllm_rollout/utils.py
Comment thread verl/workers/rollout/vllm_rollout/utils.py
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for the DeepSeek-V4-Flash model, adding training scripts, custom chat template resolution, and extensive FP8/MXFP4 quantization and weight-loading utilities. It also updates the bucketed weight transfer mechanism to align offsets based on tensor element size and fixes Megatron transformer configuration building when MTP is disabled. The review feedback highlights critical stability improvements: guarding against None values when casting rope_scaling factors to float in vllm_async_server.py, handling cases where packed_modules_mapping is explicitly set to None in vllm_fp8_utils.py, and broadening exception handling during parameter attribute copying to prevent unexpected crashes.
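
The None guard around the rope_scaling cast could look roughly like the sketch below. It is a minimal illustration assuming rope_scaling arrives as a mapping; normalize_rope_scaling and the FLOAT_FIELDS tuple are hypothetical names, not the code in vllm_async_server.py.

```python
# Fields that should be floats when present (assumed for illustration).
FLOAT_FIELDS = ("factor",)


def normalize_rope_scaling(rope_scaling):
    """Return a copy of rope_scaling with float fields cast, tolerating None."""
    if rope_scaling is None:
        return None
    normalized = dict(rope_scaling)  # copy so the caller's mapping is untouched
    for key in FLOAT_FIELDS:
        value = normalized.get(key)
        # float(None) raises TypeError, so only cast values that are set.
        if value is not None:
            normalized[key] = float(value)
    return normalized
```

Leaving a None factor as None, instead of substituting a default, keeps the decision about missing values with whatever consumes the config.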

Comment thread verl/workers/rollout/vllm_rollout/vllm_async_server.py
Comment thread verl/utils/vllm/vllm_fp8_utils.py Outdated
Comment thread verl/utils/vllm/vllm_fp8_utils.py
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for the DeepSeek-V4-Flash model, featuring a Megatron training script, custom chat template resolution from files or environment variables, and comprehensive FP8/MXFP4 quantization utilities for vLLM rollout. It also addresses alignment issues in bucketed weight transfer by respecting tensor element size and disables MTP layers when MTP is not enabled. A critical feedback item was identified in the _model_type helper function, where an AttributeError could occur if the model's configuration is None; a safe traversal suggestion has been provided to prevent potential crashes.
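
A safe traversal of this kind can be written with chained getattr calls. The sketch below assumes the model exposes a config attribute that carries model_type; safe_model_type is a hypothetical name, and the real _model_type helper may look in different places.

```python
from types import SimpleNamespace


def safe_model_type(model):
    """Return model.config.model_type, or None if any link is missing."""
    config = getattr(model, "config", None)
    # getattr on None with a default returns the default instead of raising.
    return getattr(config, "model_type", None)


# Toy objects covering the three cases the review describes.
with_type = SimpleNamespace(config=SimpleNamespace(model_type="deepseek_v4"))
config_is_none = SimpleNamespace(config=None)
no_config = SimpleNamespace()
```

Callers then compare the result against an expected string and treat None as "unknown model" without needing a try/except.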

Comment thread verl/utils/vllm/vllm_fp8_utils.py Outdated
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for the DeepSeek-V4-Flash model in the GRPO trainer, including a new Megatron training launch script. Key changes include implementing MXFP4 weight quantization and loading utilities, aligning offsets during weight transfer based on tensor element sizes, resolving custom chat templates from files or environment variables, and disabling Multi-Token Prediction (MTP) layers when not enabled. Feedback focuses on avoiding blocking ZMQ socket operations within an asynchronous method by offloading them to a separate thread, and ensuring that rope_scaling configurations are properly converted to dictionaries before processing to handle RoPEScalingConfig objects.
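
The thread-offloading suggestion can be sketched with asyncio.to_thread (Python 3.9+). The example uses time.sleep as a stand-in for a blocking socket receive, so blocking_recv and the heartbeat task are hypothetical and no ZMQ code is involved.

```python
import asyncio
import time


def blocking_recv():
    """Stand-in for a blocking socket receive."""
    time.sleep(0.05)
    return b"bucket"


async def heartbeat(ticks):
    """Other work that must keep running on the event loop."""
    while True:
        ticks.append(1)
        await asyncio.sleep(0.01)


async def receive_without_blocking():
    ticks = []
    task = asyncio.create_task(heartbeat(ticks))
    # Run the blocking call in a worker thread; the loop stays responsive.
    payload = await asyncio.to_thread(blocking_recv)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return payload, len(ticks)
```

Calling blocking_recv() directly inside the coroutine would stall the heartbeat for the whole receive. An async-native socket API is the other common way to get the same effect.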

Comment thread verl/workers/rollout/vllm_rollout/bucketed_weight_transfer.py
Comment thread verl/workers/rollout/vllm_rollout/vllm_async_server.py
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for the DeepSeek-V4-Flash model within the GRPO trainer, including a new Megatron training script and updated documentation. It adds robust FP8 and MXFP4 quantization utilities, improves weight loading and reloading mechanisms for vLLM, and enhances custom chat template resolution from files or environment variables. Additionally, it updates the bucketed weight transfer logic to align offsets by tensor element size and use asynchronous ZMQ. There are no review comments, so no further feedback is provided.


Meirtz commented May 28, 2026

Hey @HollowMan6 — opened #6515 as the symmetric fix for the vanilla_mbridge=True (ISEEKYAN/mbridge) branch in the same _build_tf_config function. The MTP-disable + csa_compress_ratios trim pattern is the same as your change here, just applied at the set_extra_args rebuild point instead of inside provider_overrides.

Verified end-to-end on a single GB200 GPU through ISEEKYAN/mbridge + DSv4 hybrid attention (forward + backward + optimizer.step), which pairs well with your vanilla_mbridge=False path coverage. The two PRs are complementary and independent, so either can merge first.

I also raised a bridge-side default for dsa_indexer_loss_coeff upstream at NVIDIA-NeMo/Megatron-Bridge#4003. MLATransformerConfig defaults the field to None, so the getattr(..., 0.0) fallback in csa.py never fires. You're working around it via ++override_transformer_config.dsa_indexer_loss_coeff=0.0 in run_deepseek_v4_flash_megatron.sh; that override stays valid but becomes redundant once the bridge sets a non-None default. Mentioning it in case it helps with coordination.
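
The reason the getattr fallback never fires is plain Python behaviour: the default is used only when the attribute is missing, not when it exists with the value None. A small illustration, where ToyConfig is a hypothetical stand-in for the real config class and coeff_or_default is an illustrative helper:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfig:
    # The field exists but defaults to None, as described above.
    dsa_indexer_loss_coeff: Optional[float] = None


cfg = ToyConfig()

# The attribute is present, so the 0.0 default is ignored and None comes back.
via_getattr = getattr(cfg, "dsa_indexer_loss_coeff", 0.0)


def coeff_or_default(config, default=0.0):
    """Treat both a missing attribute and an explicit None as 'use default'."""
    value = getattr(config, "dsa_indexer_loss_coeff", None)
    return default if value is None else value
```

Setting the field explicitly, as the override in the example script does, sidesteps the issue without touching the consuming code.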

Meirtz added a commit to Meirtz/verl that referenced this pull request May 29, 2026
…on vanilla_mbridge=True path

PR verl-project#6473 added the same fix to the vanilla_mbridge=False (NeMo MB) path of
MegatronEngine._build_tf_config. The vanilla_mbridge=True (ISEEKYAN/mbridge)
path needs the symmetric treatment: when self.model_config.mtp.enable is
False, force mtp_num_layers=0 so the bridge does not build MTP blocks, and
trim the per-layer csa_compress_ratios list (DSv4-Flash HF configs pad it for
the MTP layer when num_nextn_predict_layers > 0).

mtp_num_layers uses direct assignment (not setdefault) so a disabled-MTP run
always forces 0 even if override_transformer_config carried a stale value.

Why not duplicate: verl-project#6473 only modifies the vanilla_mbridge=False branch.
This PR modifies the vanilla_mbridge=True branch — different code path,
complementary fix.

Test plan: validated end-to-end on GB200 (1 GPU) through ISEEKYAN/mbridge +
DSv4 hybrid attention — forward + backward + optimizer.step() with the
vanilla=True path produces finite loss / finite grad_norm /
update_successful=True.

AI assistance disclosure: developed with AI-assisted coding (Claude); author
reviewed every changed line.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>
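
The MTP-disable pattern described in this commit message can be sketched on a plain dict. This is an illustration only: apply_mtp_disable, the num_layers argument, and the rule of trimming to the decoder layer count are assumptions, and the real _build_tf_config operates on Megatron/bridge config objects.

```python
def apply_mtp_disable(tf_config, mtp_enabled, num_layers):
    """Return a copy of tf_config adjusted for a run with MTP disabled."""
    config = dict(tf_config)
    if mtp_enabled:
        return config
    # Direct assignment: a stale override must not survive a disabled-MTP run.
    config["mtp_num_layers"] = 0
    # Models without CSA have no such entry, so look it up defensively.
    ratios = config.get("csa_compress_ratios")
    if ratios is not None and len(ratios) > num_layers:
        # Drop the trailing entries that were padded for the MTP layer.
        config["csa_compress_ratios"] = list(ratios[:num_layers])
    return config


# Contrast with setdefault, which keeps an existing (stale) value.
stale = {"mtp_num_layers": 1}
stale.setdefault("mtp_num_layers", 0)
```

The ratio values in any test input are toy numbers; only the list length relative to num_layers matters for the trim.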

3 participants