sync: gitlab/main -> github/main #11
Merged
# 🐛 Bug Fix
## Remove multimodal recompute-granularity guard in model_provider
- Delete the `is_multimodal` detection block that stripped recompute-related bridge keys when `--recompute-granularity=full` was set
- The upstream Megatron-Bridge now handles multi-axis RoPE list unpacking, making the workaround obsolete
---
# 🐛 Bug Fix
## Remove unsupported group_rm assertion in eval_rollout
- Delete `assert not args.group_rm` that blocked eval rollout when group reward model was enabled
- Allows eval rollout to proceed with group RM configurations
…sted type parsing
# ⭐ Feature
## Support prefixed agentic router passthrough flags
- Add automatic `--sglang-router-*` flag passthrough from RouterArgs
- Keep framework-managed router fields hidden from passthrough
- Preserve explicit policy and request-timeout compatibility flags
# 🐛 Bug Fix
## Throttle metadata polling in DeviceDirectBackend
- Add `time.sleep(0.5)` when the metadata endpoint returns empty
- Prevents tight HTTP GET loop that spammed logs with repeated `httpx` INFO requests during async multimodal training
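The throttling described above can be sketched roughly as follows. This is a minimal illustration, not the actual `DeviceDirectBackend` code: `poll_until_metadata`, its `fetch` callable, and the timeout handling are hypothetical names and behavior introduced here; only the 0.5 s sleep on an empty response comes from the commit message.

```python
import time
from typing import Any, Callable, Optional


def poll_until_metadata(
    fetch: Callable[[], Optional[Any]],
    interval_s: float = 0.5,
    timeout_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Call `fetch` until it returns something non-empty.

    `fetch` stands in for the HTTP GET against the metadata endpoint
    (hypothetical). Sleeping between empty responses avoids a tight
    request loop and the log spam that comes with it.
    """
    deadline = clock() + timeout_s
    while True:
        meta = fetch()
        if meta:  # non-empty payload: done
            return meta
        if clock() >= deadline:
            raise TimeoutError("metadata endpoint stayed empty")
        sleep(interval_s)  # throttle before the next request
```

Injecting `sleep` and `clock` is only a convenience for testing the loop without real delays; the real fix may simply call `time.sleep(0.5)` inline.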
# ⭐ Feature
## Add megatron-bridge integration for weight conversion
- Add as drop-in replacement for
- Use Bridge mapping registry for automatic Megatron-to-HF name/format conversion
- Enabled via flag
- Eliminates need for hand-written per-model converters when Bridge mapping exists
## Bridge task map initialization
- Lazy-init Bridge tasks on first use via
- Supplement missing params (tied embeddings, output_layer) from mapping registry
- Eagerly initialize AutoMapping inner delegates for group patching
- Dynamic task creation for cross-EP-rank expert params with caching
## Process group patching for local-only conversion
- Patch all PP/TP/EP groups to None on all mapping levels (recursive)
- Monkey-patch on all classes in delegation chain
- Restore original groups and methods in finally block
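The patch-then-restore pattern in the list above might look like the sketch below. It is an assumption-laden illustration: the attribute names (`pp_group`, `tp_group`, `ep_group`, `children`) and the `DemoMapping` class are hypothetical stand-ins, not Megatron-Bridge API, and the method monkey-patching step is omitted for brevity.

```python
from contextlib import contextmanager

# Hypothetical attribute names; the real mapping classes may differ.
GROUP_ATTRS = ("pp_group", "tp_group", "ep_group")


class DemoMapping:
    """Toy mapping node used only to exercise the sketch."""

    def __init__(self, children=()):
        self.pp_group = "pp"
        self.tp_group = "tp"
        self.ep_group = "ep"
        self.children = list(children)


@contextmanager
def local_only_groups(mapping, attrs=GROUP_ATTRS, child_attr="children"):
    """Set every group attribute to None, recursively, then restore."""
    saved = []  # (node, attribute name, original value)

    def _patch(node):
        for name in attrs:
            if hasattr(node, name):
                saved.append((node, name, getattr(node, name)))
                setattr(node, name, None)
        for child in getattr(node, child_attr, ()) or ():
            _patch(child)

    try:
        _patch(mapping)
        yield mapping
    finally:
        # Restore in reverse order, even if conversion raised.
        for node, name, value in reversed(saved):
            setattr(node, name, value)
```

Recording the original values before overwriting them is what lets the `finally` block put the groups back regardless of how the conversion exits.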
---
# 🐛 Bug Fix
## Fix recv_weight_meta polling with long-poll support
- Add parameter to coordinator endpoint
- Implement lightweight long-poll loop to reduce idle HTTP polling
- Update client-side to use configurable long-poll timeout
- Handle gracefully with retry
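To illustrate the long-poll idea above, here is an in-process sketch that uses a `threading.Condition` as a stand-in for the coordinator endpoint. The names `MetaStore`, `wait_for_meta`, and `recv_meta`, and the retry count, are hypothetical; the actual change works over HTTP and its parameter names are not shown in this description.

```python
import threading
from typing import Any, Optional


class MetaStore:
    """Stand-in for the coordinator: holds the latest weight metadata."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._meta: Optional[Any] = None

    def publish(self, meta: Any) -> None:
        with self._cond:
            self._meta = meta
            self._cond.notify_all()  # wake any waiting long-polls

    def wait_for_meta(self, timeout_s: float) -> Optional[Any]:
        """Block up to `timeout_s`; return None if nothing arrived."""
        with self._cond:
            self._cond.wait_for(lambda: self._meta is not None, timeout=timeout_s)
            return self._meta


def recv_meta(store: MetaStore, long_poll_timeout_s: float = 1.0,
              max_attempts: int = 5) -> Any:
    """Client side: retry the long-poll when it times out empty."""
    for _ in range(max_attempts):
        meta = store.wait_for_meta(long_poll_timeout_s)
        if meta is not None:
            return meta
    raise TimeoutError("no weight metadata after retries")
```

Compared with fixed-interval polling, the waiter is released as soon as metadata is published, and an idle client issues at most one request per timeout window.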
# 🐛 Bug Fix
## Fix fatal TryReadObjectRefStream crash on repeated global restarts
Root cause: `ray.shutdown()` destroys ObjectRefStreams while the AsyncLoopThread event loop still holds active C++ watchers on them. The watchers attempt to read from destroyed streams, triggering a fatal `RAY_CHECK` failure in Ray core.
- Add `shutdown_async_loop()` to `async_utils.py` that stops the global event loop and **blocks** until its thread fully exits, ensuring no C++ watchers survive into `ray.shutdown()`
- Call `shutdown_async_loop()` in `_global_restart()` Phase 1 (step 1.8) before `serve.shutdown()` / `ray.shutdown()`
- Add `_cancel_pending_tasks()` to force-cancel all tracked ObjectRefs before shutdown, unblocking the main thread from stale `await task_ref` calls
- Track pending ObjectRefs in `_pending_task_refs` with a lock for thread-safe access between main and HealthChecker threads
- Guard `_pending_task_refs` and its lock with `hasattr` to survive `self.__init__()` re-invocation during restart
---
# 🔩 Chore
## Qwen3.5-9B training config adjustments
- Rename script to `run-qwen35-9B-8xgpu-openr1mm-async.sh`
- Reduce `--sglang-mem-fraction-static` from 0.8 to 0.6 to avoid SGLang OOM on hybrid GDN model (large Mamba cache)
- Disable ClearML auto-connect for tensorboard/pytorch to prevent framework conflict
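For the TryReadObjectRefStream fix above, the key property is that shutdown blocks until the loop thread has exited. A minimal sketch of that behavior, using only `asyncio` and `threading`, is shown below. The class layout and the `shutdown` method name are assumptions made for illustration; the real `shutdown_async_loop()` in `async_utils.py` may be structured differently, and no Ray calls are modeled here.

```python
import asyncio
import threading


class AsyncLoopThread:
    """Runs an asyncio event loop on a background thread."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        asyncio.set_event_loop(self.loop)
        try:
            self.loop.run_forever()
        finally:
            self.loop.close()

    def shutdown(self, timeout_s: float = 10.0) -> bool:
        """Stop the loop and block until its thread has exited.

        Returns True once the thread is gone, so the caller knows no
        callback on this loop can still be running afterwards.
        """
        if self._thread.is_alive():
            # loop.stop() must be scheduled from the loop's own thread.
            self.loop.call_soon_threadsafe(self.loop.stop)
            self._thread.join(timeout_s)
        return not self._thread.is_alive()
```

In the restart sequence, a call like this would come before `serve.shutdown()` and `ray.shutdown()`, so that nothing on the loop thread can touch resources those calls tear down.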
# 🐛 Bug Fix
## Fix ~20 GB memory leak on _is_pp_src_rank for MoE models
- Pass `param_weight=None` instead of `param_weight=param` in `WeightConversionTask` constructor within `_convert_to_hf_bridge()`
- The `param_weight` field is only used for HF→Megatron (load) direction, not Megatron→HF (export); storing the EP-gathered tensor in the frozen dataclass kept ~20 GB alive indefinitely on `_is_pp_src_rank` via `self._bridge_task_map` cache
- Add `torch.cuda.empty_cache()` at end of `update_weights_for_rollout()` to release fragmented reserved memory from all_gather + HF-convert buffers
---
# ⚡ Performance
## Enable optimizer CPU offload for Qwen3-30B-A3B async
- Add `--optimizer-cpu-offload` and related flags to reduce GPU memory pressure during training
- Disable MoE aux loss to avoid algorithm performance degradation
- Skip eval before first train step
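The memory leak fix above comes down to a cached object holding a strong reference to a large tensor. The toy example below reproduces that mechanism without PyTorch: `FakeTensor`, `ExportTask`, and `build_cached_task` are hypothetical stand-ins (not the real `WeightConversionTask` or `_bridge_task_map`), and a weak reference is used to observe whether the "tensor" can be freed.

```python
import gc
import weakref
from dataclasses import dataclass
from typing import Any, Dict, Optional


class FakeTensor:
    """Stand-in for a large EP-gathered tensor."""

    def __init__(self, nbytes: int) -> None:
        self.nbytes = nbytes


@dataclass(frozen=True)
class ExportTask:
    """Hypothetical task record kept in a long-lived cache."""

    name: str
    param_weight: Optional[Any] = None  # only needed for the load direction


def build_cached_task(cache: Dict[str, ExportTask], name: str,
                      tensor: FakeTensor, keep_weight: bool) -> ExportTask:
    """Create a task and store it in `cache`, optionally pinning `tensor`."""
    task = ExportTask(name, tensor if keep_weight else None)
    cache[name] = task
    return task


# Leaky variant: the cache keeps the tensor alive after the caller drops it.
leaky_cache: Dict[str, ExportTask] = {}
big = FakeTensor(20 * 2**30)
big_ref = weakref.ref(big)
build_cached_task(leaky_cache, "experts", big, keep_weight=True)
del big
gc.collect()
leaked = big_ref() is not None

# Fixed variant: passing None lets the tensor be collected.
fixed_cache: Dict[str, ExportTask] = {}
big2 = FakeTensor(20 * 2**30)
big2_ref = weakref.ref(big2)
build_cached_task(fixed_cache, "experts", big2, keep_weight=False)
del big2
gc.collect()
freed = big2_ref() is None
```

With real CUDA tensors, dropping the last reference returns the memory to PyTorch's caching allocator rather than to the device, which is presumably why the change also calls `torch.cuda.empty_cache()` afterwards.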
NINGBENZHE
approved these changes
Apr 23, 2026
Routine internal -> external sync.