sync: gitlab/main -> github/main #32

Merged
NINGBENZHE merged 7 commits into main from sync/from-gitlab on May 14, 2026
Conversation

@Yangruipis
Collaborator

Routine internal -> external sync.

Yangruipis and others added 7 commits May 14, 2026 11:57

# 🐛 Bug Fix

## Disable torch.compile around GatedDeltaNet QKV prep

- Wrap `_prepare_qkv_for_gated_delta_rule` with `torch._dynamo.config.patch(disable=True)` in the Megatron patch
- Avoids a torch.compile failure on the Qwen3.6 GatedDeltaNet path (see the sketch below)
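A minimal sketch of the wrap in decorator form; the signature and body below are placeholders for the real Megatron implementation:

```python
import torch._dynamo

# `disable=True` turns Dynamo off while the patch is active, so this
# function runs eagerly even when called from a torch.compile'd region.
@torch._dynamo.config.patch(disable=True)
def _prepare_qkv_for_gated_delta_rule(hidden_states, qkv_proj):
    # Placeholder body; the real QKV prep lives in the Megatron patch.
    return qkv_proj(hidden_states)
```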

---

# ⭐ Feature

## Generalize unsplit-forward path to text-only Qwen3.6 / Qwen3.5

- Detect `Qwen3VLModel` at model build and set `args.uses_unsplit_forward`; the bridge model does CP+SP splitting internally for both VL and text-only Qwen3.5/3.6, which share the same architecture
- Route unsplit tokens and tp*cp*2-aligned `cu_seqlens` through `forward_only` / `train_one_step` whenever the flag is on, not just for VL inputs
- Propagate the flag through `data.get_batch`, `loss.compute_advantages_and_returns`, `log_rollout_data`, and `stream_dataloader.post_process_rollout_data` so padding stays consistent (alignment sketched below)
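A minimal sketch of the alignment rule, assuming `cu_seqlens` is a 1-D tensor of cumulative sequence lengths and that padding is absorbed by appending one extra segment (the helper name and padding strategy are assumptions):

```python
import torch

def pad_cu_seqlens(cu_seqlens: torch.Tensor, tp: int, cp: int) -> torch.Tensor:
    """Round the total token count up to a multiple of tp * cp * 2."""
    multiple = tp * cp * 2
    total = int(cu_seqlens[-1])
    padded_total = ((total + multiple - 1) // multiple) * multiple
    if padded_total == total:
        return cu_seqlens
    # One extra "padding" segment covers the filler tokens.
    return torch.cat([cu_seqlens, cu_seqlens.new_tensor([padded_total])])
```

With `tp=2, cp=2` the alignment is 8, so `pad_cu_seqlens(torch.tensor([0, 5, 9]), 2, 2)` pads 9 tokens up to 16.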

## Add Qwen3.6-35B-A3B 8xGPU DAPO-math training script

- New `scripts/training/text/run-qwen36-35B-A3B-8xgpu.sh` for sync GRPO training with TP=2/PP=2/CP=2/EP=4 and partial rollout

# 🐛 Bug Fix

## Propagate fp16 to SGLang Mamba conv dtype

- `relax/distributed/ray/genrm.py` and `relax/distributed/ray/rollout.py`:
  pass `SGLANG_MAMBA_CONV_DTYPE=float16` to the engine env when `--fp16` is
  set, so Qwen3.6 hybrid-Mamba layers use the matching dtype in rollout/GenRM
  (see the sketch after this list)
- `scripts/training/multimodal/run-qwen3-vl-30B-A3B-8xgpu.sh`: add
  `--fp16 --use-rollout-routing-replay --use-slime-router`
- `scripts/training/multimodal/run-qwen35-35B-A3B-8xgpu.sh`: add
  `--use-rollout-routing-replay --use-slime-router`; document why fp16 is
  intentionally left disabled for Qwen3.5
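A minimal sketch of the propagation, assuming the engine env is built from a dict and that `args.fp16` mirrors the `--fp16` flag:

```python
import os

def build_engine_env(args) -> dict:
    env = dict(os.environ)
    if getattr(args, "fp16", False):
        # Keep SGLang's Mamba conv-state dtype in sync with fp16 training so
        # Qwen3.6 hybrid-Mamba layers see a matching dtype in rollout/GenRM.
        env["SGLANG_MAMBA_CONV_DTYPE"] = "float16"
    return env
```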

## Skip routing replay for MTP layers

- `relax/utils/training/routing_replay.py`: MTP routers exist in training,
  but rollout (SGLang) does not run MTP, so there is nothing to record or
  replay against. Install a pre-hook that clears the global
  `ROUTING_REPLAY` (so `compute_topk` falls through to the original
  implementation) and skip registration in `all_routing_replays` to keep
  the per-layer accounting consistent (see the sketch below)
- Guard `compute_topk` against `ROUTING_REPLAY is None`
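A minimal sketch of the hook and guard; the replay object's interface and the `mtp_router` handle are assumptions:

```python
ROUTING_REPLAY = None  # module-level global, as in routing_replay.py

def compute_topk(logits, k, original_impl):
    # Guard: with no active replay, fall through to the original router.
    if ROUTING_REPLAY is None:
        return original_impl(logits, k)
    return ROUTING_REPLAY.replay(logits, k)

def _mtp_clear_replay_pre_hook(module, inputs):
    # Rollout (SGLang) never runs MTP, so nothing was recorded for this
    # router; clearing the global sends compute_topk down the original path.
    global ROUTING_REPLAY
    ROUTING_REPLAY = None

# Registration sketch; the MTP router is also left out of all_routing_replays:
# mtp_router.register_forward_pre_hook(_mtp_clear_replay_pre_hook)
```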

## Detect Ray 2.x head node by internal resource

- `relax/utils/utils.py::get_serve_url`: Ray 2.x auto-registers
  `node:__internal_head__` on the head node; legacy setups also tag it with
  a custom `head` resource. Accept either when scanning `ray.nodes()` so
  head-IP discovery works on both (see the sketch below)
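A minimal sketch of the check; the function name is illustrative, with `get_serve_url` wrapping this discovery:

```python
import ray

def find_head_node_ip() -> str:
    for node in ray.nodes():
        resources = node.get("Resources", {})
        # Ray 2.x auto-registers node:__internal_head__ on the head node;
        # legacy clusters tag it with a custom "head" resource instead.
        if "node:__internal_head__" in resources or "head" in resources:
            return node["NodeManagerAddress"]
    raise RuntimeError("head node not found in ray.nodes()")
```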

---

# 📝 Documentation

## Announce Qwen3.6 support in README

- `README.md` / `README_zh.md`: add 05/11/2026 news entry noting Qwen3.6
  series (text + VLM) support

# ⭐ Feature

## Auto-detect shared-GPU colocate sub-mode for GenRM

- Pick the mode from the GPU allocation: `R+G==A` keeps the existing split layout; `R==G==A` activates the new shared layout where rollout and genrm overlap on the same bundles. Other combinations are now rejected at startup with a clear error (see the sketch after this list).
- Drop the bundle offset for genrm in shared mode so both engines schedule on the same `[0, A)` bundles, and lower genrm Ray fractional `num_gpus` default from 0.2 to 0.1 to leave room alongside rollout.
- Plumb `mem_fraction_static` through `--genrm-engine-config`; rollout keeps using `--sglang-mem-fraction-static`. Two engines can now split each GPU independently.
- Onload rollout weights and genrm KV in parallel inside `update_weights()` so both engines come back together before the next rollout step.
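A minimal sketch of the startup check, where R/G/A from the first bullet map to the rollout, genrm, and total GPU counts (parameter names are assumptions):

```python
def detect_colocate_mode(rollout_gpus: int, genrm_gpus: int, total_gpus: int) -> str:
    if rollout_gpus == genrm_gpus == total_gpus:
        # Shared layout: both engines schedule on the same [0, A) bundles.
        return "shared"
    if rollout_gpus + genrm_gpus == total_gpus:
        # Split layout: disjoint bundle ranges, as before.
        return "split"
    raise ValueError(
        f"unsupported GPU allocation: R={rollout_gpus}, G={genrm_gpus}, "
        f"A={total_gpus}; expected R+G==A (split) or R==G==A (shared)"
    )
```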

---

# 📝 Documentation

## Document the new GenRM colocate sub-mode (en + zh)

- Add a second ASCII architecture diagram for the shared layout and a sub-mode auto-detection table.
- Introduce a Shared-mode launch example with `mem_fraction_static` settings and a warning to keep the per-GPU sum < 1.0.
- Update Best Practices with sub-mode selection guidance and OOM troubleshooting for shared mode.
- Fix stale defaults in the sampling-config table (temperature 0.2 -> 0.1, max_response_len 1024 -> 4096) and add `ep_size` / `mem_fraction_static` to the engine-config table.
- Update the example launch script to demonstrate shared mode (rollout 0.5 + genrm 0.3, both on 8 GPUs); the per-GPU budget is sketched below
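A minimal sketch of the documented constraint; the function is illustrative, not part of the codebase:

```python
def check_gpu_memory_budget(rollout_frac: float, genrm_frac: float) -> None:
    # Shared mode puts both engines on every GPU, so their static memory
    # fractions must sum to strictly less than 1.0 to leave headroom.
    total = rollout_frac + genrm_frac
    if total >= 1.0:
        raise ValueError(f"per-GPU mem_fraction_static sum {total} >= 1.0")

check_gpu_memory_budget(0.5, 0.3)  # values from the example launch script
```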

---

# 🔩 Chore

## Add py-spy multi-PID dump helper

- `scripts/tools/_pyspy_dump.sh` runs `py-spy dump` over a list of PIDs in one ray-job submission; the debug-hang skill uses it to avoid per-PID submission overhead (loop sketched below)
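A minimal Python rendering of what the shell helper does (the real helper is a bash script submitted as a ray job):

```python
import subprocess
import sys

def dump_pids(pids: list[str]) -> None:
    # One pass over every PID inside a single ray-job submission instead of
    # one submission per PID.
    for pid in pids:
        print(f"===== py-spy dump: pid {pid} =====", flush=True)
        subprocess.run(["py-spy", "dump", "--pid", pid], check=False)

if __name__ == "__main__":
    dump_pids(sys.argv[1:])
```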

This commit integrates the glm_moe_dsa model, updates the Megatron backend, adds the corresponding training scripts, and modifies the entrypoint scripts (local.sh, ray-job.sh, spmd-multinode.sh) to support environment-variable overrides while keeping the cleanup logic intact.

# 🔩 Chore

## Update torch_memory_saver dependency

- Switch source repo from `fzyzcjy/torch_memory_saver` to `redai-infra/torch_memory_saver` fork
- Pin to commit `afc13785c50119048e2dd8ac497cc9e29ec75bd4`
- Set `TMS_CUDA_MAJOR=12` build-time env var for CUDA 12 compatibility

# ♻️ Refactor

## Split path env vars in launcher scripts

- Introduce `MODEL_DIR` (HF weights / `--ref-load`) and `DATA_DIR`
  (`PROMPT_SET` / `--eval-prompt-data`) alongside `EXP_DIR`
  (`--load` / `--save`) across 31 training and example scripts
- Each variable can be overridden independently; `MODEL_DIR` and
  `DATA_DIR` fall back to `EXP_DIR`, while `EXP_DIR` falls back to
  `MODEL_DIR` to preserve the legacy `export MODEL_DIR=/root` flow
- Wire the `omni-16xgpu-async` defaults (`HF_CHECKPOINT`, `PROMPT_SET`,
  `EVAL_PROMPT_DATA`) through the new vars instead of `/path/to/...`
  placeholders so the convention is uniform (fallback chain sketched below)
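A minimal Python rendering of the bash fallback chain (the scripts themselves use `${VAR:-default}` expansions):

```python
import os

def resolve_path_dirs():
    exp = os.environ.get("EXP_DIR")
    model = os.environ.get("MODEL_DIR")
    data = os.environ.get("DATA_DIR")
    # EXP_DIR falls back to MODEL_DIR first, preserving the legacy
    # `export MODEL_DIR=/root` flow; MODEL_DIR and DATA_DIR then fall
    # back to the resolved EXP_DIR.
    exp = exp or model
    model = model or exp
    data = data or exp
    return model, data, exp
```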

---

# 📝 Documentation

## Document the path variable convention

- Add a tip block in `docs/{en,zh}/guide/customize-training.md`
  describing the three directories and the fallback chain
- Update the example `--hf-checkpoint` / `--ref-load` snippet to use
  `${MODEL_DIR}` instead of `${EXP_DIR}`
Member

@NINGBENZHE left a comment


/lgtm

@NINGBENZHE merged commit cbb1a82 into main on May 14, 2026
5 checks passed
@Yangruipis deleted the sync/from-gitlab branch on May 14, 2026 at 04:52