sync: gitlab/main -> github/main#32
Merged
# 🐛 Bug Fix

## Disable torch.compile around GatedDeltaNet QKV prep

- Wrap `_prepare_qkv_for_gated_delta_rule` with `torch._dynamo.config.patch(disable=True)` in the Megatron patch
- Avoids a torch.compile failure on the Qwen3.6 GatedDeltaNet path

---

# ⭐ Feature

## Generalize unsplit-forward path to text-only Qwen3.6 / Qwen3.5

- Detect `Qwen3VLModel` at model build and set `args.uses_unsplit_forward`; the bridge model does CP+SP splitting internally for both VL and text-only Qwen3.5/3.6, which share the same architecture
- Route unsplit tokens and tp*cp*2-aligned `cu_seqlens` through `forward_only` / `train_one_step` whenever the flag is on, not just for VL inputs
- Propagate the flag through `data.get_batch`, `loss.compute_advantages_and_returns`, `log_rollout_data`, and `stream_dataloader.post_process_rollout_data` so padding stays consistent

## Add Qwen3.6-35B-A3B 8xGPU DAPO-math training script

- New `scripts/training/text/run-qwen36-35B-A3B-8xgpu.sh` for sync GRPO training with TP=2/PP=2/CP=2/EP=4 and partial rollout
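The tp*cp*2 alignment mentioned above can be sketched in plain Python: each sequence length is rounded up to a multiple of `tp * cp * 2` before the cumulative boundaries are built, so CP+SP splitting never lands mid-sequence. The function name and shape here are illustrative, not the actual relax/Megatron helpers.

```python
# Hypothetical sketch of tp*cp*2-aligned cu_seqlens construction.
# Not the real implementation -- names are illustrative only.

def align_cu_seqlens(seq_lens, tp, cp):
    """Return cumulative sequence boundaries padded to tp*cp*2 multiples."""
    multiple = tp * cp * 2
    cu_seqlens = [0]
    for length in seq_lens:
        # round each sequence length up to the next multiple
        padded = -(-length // multiple) * multiple
        cu_seqlens.append(cu_seqlens[-1] + padded)
    return cu_seqlens

# With TP=2, CP=2 the alignment unit is 8:
print(align_cu_seqlens([5, 16, 9], tp=2, cp=2))  # [0, 8, 24, 40]
```

Every boundary being a multiple of the split factor is what lets the bridge model shard the unsplit token stream internally without per-rank ragged edges.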
# 🐛 Bug Fix

## Propagate fp16 to SGLang Mamba conv dtype

- `relax/distributed/ray/genrm.py` and `relax/distributed/ray/rollout.py`: pass `SGLANG_MAMBA_CONV_DTYPE=float16` to the engine env when `--fp16` is set, so Qwen3.6 hybrid-Mamba layers use the matching dtype in rollout/GenRM
- `scripts/training/multimodal/run-qwen3-vl-30B-A3B-8xgpu.sh`: add `--fp16 --use-rollout-routing-replay --use-slime-router`
- `scripts/training/multimodal/run-qwen35-35B-A3B-8xgpu.sh`: add `--use-rollout-routing-replay --use-slime-router`; document why fp16 is intentionally left disabled for Qwen3.5

## Skip routing replay for MTP layers

- `relax/utils/training/routing_replay.py`: MTP routers exist in training, but rollout (sglang) does not run MTP, so there is nothing to record or replay against. Install a pre-hook that clears the global `ROUTING_REPLAY` (so `compute_topk` falls through to the original impl) and skip registration in `all_routing_replays` to keep the per-layer accounting consistent
- Guard `compute_topk` against `ROUTING_REPLAY is None`

## Detect Ray 2.x head node by internal resource

- `relax/utils/utils.py::get_serve_url`: Ray 2.x auto-registers `node:__internal_head__` on the head node; legacy setups also tag it with a custom `head` resource. Accept either when scanning `ray.nodes()` so head-IP discovery works on both

---

# 📝 Documentation

## Announce Qwen3.6 support in README

- `README.md` / `README_zh.md`: add a 05/11/2026 news entry noting Qwen3.6 series (text + VLM) support
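The head-node scan described above can be approximated as follows, run over a mocked `ray.nodes()` payload rather than a live cluster. Ray 2.x adds the internal resource `node:__internal_head__` on the head node, while older setups tag it with a custom `head` resource; accepting either keeps head-IP discovery working on both. The real logic lives in `relax/utils/utils.py::get_serve_url`; this standalone version is a sketch.

```python
# Sketch of head-node detection over ray.nodes()-shaped dicts.
# The mock data below imitates Ray's node records; it is not live output.

def find_head_ip(nodes):
    """Return the IP of the first alive node carrying a head marker."""
    for node in nodes:
        if not node.get("Alive"):
            continue
        resources = node.get("Resources", {})
        # Ray 2.x internal marker, or the legacy custom "head" resource
        if "node:__internal_head__" in resources or "head" in resources:
            return node["NodeManagerAddress"]
    return None

nodes = [
    {"Alive": True, "NodeManagerAddress": "10.0.0.2",
     "Resources": {"CPU": 64.0}},
    {"Alive": True, "NodeManagerAddress": "10.0.0.1",
     "Resources": {"CPU": 64.0, "node:__internal_head__": 1.0}},
]
print(find_head_ip(nodes))  # 10.0.0.1
```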
# ⭐ Feature

## Auto-detect shared-GPU colocate sub-mode for GenRM

- Pick the mode from the GPU allocation: `R+G==A` keeps the existing split layout; `R==G==A` activates the new shared layout where rollout and genrm overlap on the same bundles. Other combinations are now rejected at startup with a clear error.
- Drop the bundle offset for genrm in shared mode so both engines schedule on the same `[0, A)` bundles, and lower the genrm Ray fractional `num_gpus` default from 0.2 to 0.1 to leave room alongside rollout.
- Plumb `mem_fraction_static` through `--genrm-engine-config`; rollout keeps using `--sglang-mem-fraction-static`. The two engines can now split each GPU independently.
- Onload rollout weights and genrm KV in parallel inside `update_weights()` so both engines come back together before the next rollout step.

---

# 📝 Documentation

## Document the new GenRM colocate sub-mode (en + zh)

- Add a second ASCII architecture diagram for the shared layout and a sub-mode auto-detection table.
- Introduce a shared-mode launch example with `mem_fraction_static` settings and a warning to keep the per-GPU sum < 1.0.
- Update Best Practices with sub-mode selection guidance and OOM troubleshooting for shared mode.
- Fix stale defaults in the sampling-config table (temperature 0.2 -> 0.1, max_response_len 1024 -> 4096) and add `ep_size` / `mem_fraction_static` to the engine-config table.
- Update the example launch script to demonstrate shared mode (rollout 0.5 + genrm 0.3, both on 8 GPUs).

---

# 🔩 Chore

## Add py-spy multi-PID dump helper

- `scripts/tools/_pyspy_dump.sh` runs `py-spy dump` over a list of PIDs in one ray-job submission, used by the debug-hang skill to avoid per-PID submission overhead.
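The sub-mode auto-detection rule above reduces to a small arithmetic check on the GPU allocation: rollout GPUs `R`, genrm GPUs `G`, actor GPUs `A`. This is a hedged sketch of that rule, not the actual startup code; the function name and error wording are illustrative.

```python
# Illustrative sketch of GenRM colocate sub-mode detection.
# R + G == A  -> "split"  (disjoint bundle ranges, existing layout)
# R == G == A -> "shared" (both engines on the same [0, A) bundles)
# anything else is rejected at startup.

def detect_genrm_mode(rollout_gpus, genrm_gpus, actor_gpus):
    if rollout_gpus + genrm_gpus == actor_gpus:
        return "split"
    if rollout_gpus == genrm_gpus == actor_gpus:
        return "shared"
    raise ValueError(
        f"unsupported GPU allocation: rollout={rollout_gpus} "
        f"genrm={genrm_gpus} actor={actor_gpus}"
    )

print(detect_genrm_mode(6, 2, 8))  # split
print(detect_genrm_mode(8, 8, 8))  # shared
```

Note the order of the checks matters only at the degenerate point `A == 0`; for any real allocation the two conditions are mutually exclusive, since `R == G == A` implies `R + G == 2A`.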
This commit integrates the glm_moe_dsa model, updates the Megatron backend, adds the corresponding training scripts, and modifies the entrypoint scripts (local.sh, ray-job.sh, spmd-multinode.sh) to support environment-variable overrides while keeping the cleanup logic intact.
# 🔩 Chore

## Update torch_memory_saver dependency

- Switch source repo from `fzyzcjy/torch_memory_saver` to the `redai-infra/torch_memory_saver` fork
- Pin to commit `afc13785c50119048e2dd8ac497cc9e29ec75bd4`
- Set the `TMS_CUDA_MAJOR=12` build-time env var for CUDA 12 compatibility
# ♻️ Refactor
## Split path env vars in launcher scripts
- Introduce `MODEL_DIR` (HF weights / `--ref-load`) and `DATA_DIR`
(`PROMPT_SET` / `--eval-prompt-data`) alongside `EXP_DIR`
(`--load` / `--save`) across 31 training and example scripts
- Each variable is overridable independently; `MODEL_DIR` and
`DATA_DIR` fall back to `EXP_DIR`, while `EXP_DIR` falls back to
`MODEL_DIR` to preserve the legacy `export MODEL_DIR=/root` flow
- Wire `omni-16xgpu-async` defaults (`HF_CHECKPOINT`, `PROMPT_SET`,
`EVAL_PROMPT_DATA`) through the new vars instead of `/path/to/...`
placeholders so the convention is uniform
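The fallback chain described above can be sketched with standard `${var:-default}` parameter expansion. This is a minimal illustration using the variable names from the scripts, not an excerpt of any one launcher; the surrounding script content is omitted.

```shell
# Legacy flow: only MODEL_DIR is exported by the user.
export MODEL_DIR=/root

# Fallback chain as described in the refactor:
# EXP_DIR falls back to MODEL_DIR; MODEL_DIR and DATA_DIR fall back to EXP_DIR.
EXP_DIR=${EXP_DIR:-${MODEL_DIR}}
MODEL_DIR=${MODEL_DIR:-${EXP_DIR}}
DATA_DIR=${DATA_DIR:-${EXP_DIR}}

echo "model=${MODEL_DIR} data=${DATA_DIR} exp=${EXP_DIR}"
# -> model=/root data=/root exp=/root
```

Because each variable only defaults when unset, exporting any subset (e.g. `DATA_DIR` alone alongside `MODEL_DIR`) overrides exactly that path without touching the others.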
---
# 📝 Documentation
## Document the path variable convention
- Add a tip block in `docs/{en,zh}/guide/customize-training.md`
describing the three directories and the fallback chain
- Update the example `--hf-checkpoint` / `--ref-load` snippet to use
`${MODEL_DIR}` instead of `${EXP_DIR}`
Routine internal -> external sync.