sync: gitlab/main -> github/main#32
Merged
# 🐛 Bug Fix

## Disable torch.compile around GatedDeltaNet QKV prep

- Wrap `_prepare_qkv_for_gated_delta_rule` with `torch._dynamo.config.patch(disable=True)` in the Megatron patch
- Avoids a torch.compile failure on the Qwen3.6 GatedDeltaNet path

---

# ⭐ Feature

## Generalize unsplit-forward path to text-only Qwen3.6 / Qwen3.5

- Detect `Qwen3VLModel` at model build and set `args.uses_unsplit_forward`; the bridge model does CP+SP splitting internally for both VL and text-only Qwen3.5/3.6, which share the same architecture
- Route unsplit tokens and tp*cp*2-aligned `cu_seqlens` through `forward_only` / `train_one_step` whenever the flag is on, not just for VL inputs
- Propagate the flag through `data.get_batch`, `loss.compute_advantages_and_returns`, `log_rollout_data`, and `stream_dataloader.post_process_rollout_data` so padding stays consistent

## Add Qwen3.6-35B-A3B 8xGPU DAPO-math training script

- New `scripts/training/text/run-qwen36-35B-A3B-8xgpu.sh` for sync GRPO training with TP=2/PP=2/CP=2/EP=4 and partial rollout
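The tp*cp*2 alignment mentioned above can be sketched in plain Python: each sequence length is rounded up to a multiple of `tp * cp * 2` before the cumulative boundaries are built, so CP+SP splitting never lands mid-sequence. The function name and shape here are illustrative, not the actual relax/Megatron helpers.

```python
# Hypothetical sketch of tp*cp*2-aligned cu_seqlens construction.
# Not the real implementation -- names are illustrative only.

def align_cu_seqlens(seq_lens, tp, cp):
    """Return cumulative sequence boundaries padded to tp*cp*2 multiples."""
    multiple = tp * cp * 2
    cu_seqlens = [0]
    for length in seq_lens:
        # round each sequence length up to the next multiple
        padded = -(-length // multiple) * multiple
        cu_seqlens.append(cu_seqlens[-1] + padded)
    return cu_seqlens

# With TP=2, CP=2 the alignment unit is 8:
print(align_cu_seqlens([5, 16, 9], tp=2, cp=2))  # [0, 8, 24, 40]
```

Every boundary being a multiple of the split factor is what lets the bridge model shard the unsplit token stream internally without per-rank ragged edges.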
# 🐛 Bug Fix

## Propagate fp16 to SGLang Mamba conv dtype

- `relax/distributed/ray/genrm.py` and `relax/distributed/ray/rollout.py`: pass `SGLANG_MAMBA_CONV_DTYPE=float16` to the engine env when `--fp16` is set, so Qwen3.6 hybrid-Mamba layers use the matching dtype in rollout/GenRM
- `scripts/training/multimodal/run-qwen3-vl-30B-A3B-8xgpu.sh`: add `--fp16 --use-rollout-routing-replay --use-slime-router`
- `scripts/training/multimodal/run-qwen35-35B-A3B-8xgpu.sh`: add `--use-rollout-routing-replay --use-slime-router`; document why fp16 is intentionally left disabled for Qwen3.5

## Skip routing replay for MTP layers

- `relax/utils/training/routing_replay.py`: MTP routers exist in training, but rollout (sglang) does not run MTP, so there is nothing to record or replay against. Install a pre-hook that clears the global `ROUTING_REPLAY` (so `compute_topk` falls through to the original impl) and skip registration in `all_routing_replays` to keep the per-layer accounting consistent
- Guard `compute_topk` against `ROUTING_REPLAY is None`

## Detect Ray 2.x head node by internal resource

- `relax/utils/utils.py::get_serve_url`: Ray 2.x auto-registers `node:__internal_head__` on the head node; legacy setups also tag it with a custom `head` resource. Accept either when scanning `ray.nodes()` so head-IP discovery works on both

---

# 📝 Documentation

## Announce Qwen3.6 support in README

- `README.md` / `README_zh.md`: add a 05/11/2026 news entry noting Qwen3.6 series (text + VLM) support
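The head-node scan described above can be approximated as follows, run over a mocked `ray.nodes()` payload rather than a live cluster. Ray 2.x adds the internal resource `node:__internal_head__` on the head node, while older setups tag it with a custom `head` resource; accepting either keeps head-IP discovery working on both. The real logic lives in `relax/utils/utils.py::get_serve_url`; this standalone version is a sketch.

```python
# Sketch of head-node detection over ray.nodes()-shaped dicts.
# The mock data below imitates Ray's node records; it is not live output.

def find_head_ip(nodes):
    """Return the IP of the first alive node carrying a head marker."""
    for node in nodes:
        if not node.get("Alive"):
            continue
        resources = node.get("Resources", {})
        # Ray 2.x internal marker, or the legacy custom "head" resource
        if "node:__internal_head__" in resources or "head" in resources:
            return node["NodeManagerAddress"]
    return None

nodes = [
    {"Alive": True, "NodeManagerAddress": "10.0.0.2",
     "Resources": {"CPU": 64.0}},
    {"Alive": True, "NodeManagerAddress": "10.0.0.1",
     "Resources": {"CPU": 64.0, "node:__internal_head__": 1.0}},
]
print(find_head_ip(nodes))  # 10.0.0.1
```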
# ⭐ Feature

## Auto-detect shared-GPU colocate sub-mode for GenRM

- Pick the mode from the GPU allocation: `R+G==A` keeps the existing split layout; `R==G==A` activates the new shared layout where rollout and genrm overlap on the same bundles. Other combinations are now rejected at startup with a clear error.
- Drop the bundle offset for genrm in shared mode so both engines schedule on the same `[0, A)` bundles, and lower the genrm Ray fractional `num_gpus` default from 0.2 to 0.1 to leave room alongside rollout.
- Plumb `mem_fraction_static` through `--genrm-engine-config`; rollout keeps using `--sglang-mem-fraction-static`. The two engines can now split each GPU independently.
- Onload rollout weights and genrm KV in parallel inside `update_weights()` so both engines come back together before the next rollout step.

---

# 📝 Documentation

## Document the new GenRM colocate sub-mode (en + zh)

- Add a second ASCII architecture diagram for the shared layout and a sub-mode auto-detection table.
- Introduce a shared-mode launch example with `mem_fraction_static` settings and a warning to keep the per-GPU sum < 1.0.
- Update Best Practices with sub-mode selection guidance and OOM troubleshooting for shared mode.
- Fix stale defaults in the sampling-config table (temperature 0.2 -> 0.1, max_response_len 1024 -> 4096) and add `ep_size` / `mem_fraction_static` to the engine-config table.
- Update the example launch script to demonstrate shared mode (rollout 0.5 + genrm 0.3, both on 8 GPUs).

---

# 🔩 Chore

## Add py-spy multi-PID dump helper

- `scripts/tools/_pyspy_dump.sh` runs `py-spy dump` over a list of PIDs in one ray-job submission, used by the debug-hang skill to avoid per-PID submission overhead.
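The sub-mode auto-detection rule above reduces to a small arithmetic check on the GPU allocation: rollout GPUs `R`, genrm GPUs `G`, actor GPUs `A`. This is a hedged sketch of that rule, not the actual startup code; the function name and error wording are illustrative.

```python
# Illustrative sketch of GenRM colocate sub-mode detection.
# R + G == A  -> "split"  (disjoint bundle ranges, existing layout)
# R == G == A -> "shared" (both engines on the same [0, A) bundles)
# anything else is rejected at startup.

def detect_genrm_mode(rollout_gpus, genrm_gpus, actor_gpus):
    if rollout_gpus + genrm_gpus == actor_gpus:
        return "split"
    if rollout_gpus == genrm_gpus == actor_gpus:
        return "shared"
    raise ValueError(
        f"unsupported GPU allocation: rollout={rollout_gpus} "
        f"genrm={genrm_gpus} actor={actor_gpus}"
    )

print(detect_genrm_mode(6, 2, 8))  # split
print(detect_genrm_mode(8, 8, 8))  # shared
```

Note the order of the checks matters only at the degenerate point `A == 0`; for any real allocation the two conditions are mutually exclusive, since `R == G == A` implies `R + G == 2A`.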
This commit integrates the glm_moe_dsa model, updates the Megatron backend, adds the corresponding training scripts, and modifies the entrypoint scripts (local.sh, ray-job.sh, spmd-multinode.sh) to support environment-variable overrides while keeping the cleanup logic intact.
# 🔩 Chore

## Update torch_memory_saver dependency

- Switch source repo from `fzyzcjy/torch_memory_saver` to the `redai-infra/torch_memory_saver` fork
- Pin to commit `afc13785c50119048e2dd8ac497cc9e29ec75bd4`
- Set the `TMS_CUDA_MAJOR=12` build-time env var for CUDA 12 compatibility
# ♻️ Refactor
## Split path env vars in launcher scripts
- Introduce `MODEL_DIR` (HF weights / `--ref-load`) and `DATA_DIR`
(`PROMPT_SET` / `--eval-prompt-data`) alongside `EXP_DIR`
(`--load` / `--save`) across 31 training and example scripts
- Each variable is overridable independently; `MODEL_DIR` and
`DATA_DIR` fall back to `EXP_DIR`, while `EXP_DIR` falls back to
`MODEL_DIR` to preserve the legacy `export MODEL_DIR=/root` flow
- Wire `omni-16xgpu-async` defaults (`HF_CHECKPOINT`, `PROMPT_SET`,
`EVAL_PROMPT_DATA`) through the new vars instead of `/path/to/...`
placeholders so the convention is uniform
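The fallback chain described above can be sketched with standard `${var:-default}` parameter expansion. This is a minimal illustration using the variable names from the scripts, not an excerpt of any one launcher; the surrounding script content is omitted.

```shell
# Legacy flow: only MODEL_DIR is exported by the user.
export MODEL_DIR=/root

# Fallback chain as described in the refactor:
# EXP_DIR falls back to MODEL_DIR; MODEL_DIR and DATA_DIR fall back to EXP_DIR.
EXP_DIR=${EXP_DIR:-${MODEL_DIR}}
MODEL_DIR=${MODEL_DIR:-${EXP_DIR}}
DATA_DIR=${DATA_DIR:-${EXP_DIR}}

echo "model=${MODEL_DIR} data=${DATA_DIR} exp=${EXP_DIR}"
# -> model=/root data=/root exp=/root
```

Because each variable only defaults when unset, exporting any subset (e.g. `DATA_DIR` alone alongside `MODEL_DIR`) overrides exactly that path without touching the others.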
---
# 📝 Documentation
## Document the path variable convention
- Add a tip block in `docs/{en,zh}/guide/customize-training.md`
describing the three directories and the fallback chain
- Update the example `--hf-checkpoint` / `--ref-load` snippet to use
`${MODEL_DIR}` instead of `${EXP_DIR}`
Routine internal -> external sync.