Zmj/add qwen3.5 npu longcontext #6474
Code Review
This pull request introduces a new launch script for running Qwen 3.5 35B GRPO training using Megatron and vLLM on Ascend NPUs. The feedback points out a critical configuration issue where the expert parallel size (EP) is set to 8 on a single-node 16-GPU setup. Because the calculated data parallel size (DP) is 1, having EP=8 violates Megatron-Core's constraint that EP must be less than or equal to DP, which will cause an assertion failure. It is recommended to set EP to 1.
TP=${TP:-2}
PP=${PP:-2}
CP=${CP:-4}
EP=${EP:-8}
In this single-node 16-GPU configuration (trainer.nnodes=1, trainer.n_gpus_per_node=16), the data parallel size is DP = World_Size / (TP * PP * CP) = 16 / (2 * 2 * 4) = 1. Since Megatron-Core requires the expert model parallel size (EP) to be less than or equal to the data parallel size (DP), setting EP=8 will cause an assertion failure during model parallel initialization. For a single-node run, EP must be set to 1.
Suggested change:
-EP=${EP:-8}
+EP=${EP:-1}
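The arithmetic in the comment above can be reproduced before launch with a small guard. This is a minimal sketch, not part of the PR: the helper names `compute_dp` and `check_ep` are hypothetical, and the `EP <= DP` rule is taken as stated in the review. The exact constraint may differ between Megatron-Core versions, so treat the check as illustrative.

```shell
# Hypothetical helpers (not from the PR's launch script).

# compute_dp WORLD_SIZE TP PP CP
# Prints DP = WORLD_SIZE / (TP * PP * CP); returns 1 if the division is not exact.
compute_dp() {
    world_size=$1
    mp=$(( $2 * $3 * $4 ))
    if [ "$(( world_size % mp ))" -ne 0 ]; then
        echo "world size ${world_size} is not divisible by TP*PP*CP=${mp}" >&2
        return 1
    fi
    echo "$(( world_size / mp ))"
}

# check_ep EP DP
# Returns 1 when EP exceeds DP (the constraint as described in the review comment).
check_ep() {
    if [ "$1" -gt "$2" ]; then
        echo "EP=$1 exceeds DP=$2" >&2
        return 1
    fi
    return 0
}

# Single node, 16 devices, TP=2, PP=2, CP=4 as in the script defaults.
DP=$(compute_dp 16 2 2 4)
echo "DP=${DP}"   # prints DP=1
```

With these defaults, `check_ep 8 "$DP"` fails and `check_ep 1 "$DP"` succeeds, which matches the suggested change.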
Refer to other scripts for naming conventions.
What does this PR do?
Checklist Before Starting
- Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
  - {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
  - Separate multiple modules with commas, like [megatron, fsdp, doc]
  - {type} is in feat, fix, refactor, chore, test
  - For a breaking change, add [BREAKING] to the beginning of the title.
  - Example: [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run the pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- Request CI in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If the PR touches the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.