sync: gitlab/main -> github/main by Yangruipis · Pull Request #11 · redai-infra/Relax

Yangruipis · 2026-04-23T06:35:08Z

Routine internal -> external sync.

# 🐛 Bug Fix

## Remove multimodal recompute-granularity guard in model_provider

- Delete the `is_multimodal` detection block that stripped recompute-related bridge keys when `--recompute-granularity=full` was set
- The upstream Megatron-Bridge now handles multi-axis RoPE list unpacking, making the workaround obsolete

---

# 🐛 Bug Fix

## Remove unsupported group_rm assertion in eval_rollout

- Delete `assert not args.group_rm` that blocked eval rollout when group reward model was enabled
- Allows eval rollout to proceed with group RM configurations

# ⭐ Feature

## Support prefixed agentic router passthrough flags

- add automatic `--sglang-router-*` flag passthrough from RouterArgs
- keep framework-managed router fields hidden from passthrough
- preserve explicit policy and request-timeout compatibility flags
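
As a rough illustration of the prefixed passthrough idea, the sketch below derives `--sglang-router-*` flags from a dataclass and skips fields the framework is assumed to manage itself. The `RouterArgs` class here is a hypothetical stand-in with made-up fields, and `MANAGED_FIELDS` and `add_router_passthrough` are illustrative names, not the project's actual code.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class RouterArgs:
    """Hypothetical stand-in; the real RouterArgs fields may differ."""
    host: str = "127.0.0.1"
    port: int = 30000
    policy: str = "cache_aware"
    max_payload_size: int = 1024


# Assumption: these fields are set by the framework and must not be exposed.
MANAGED_FIELDS = {"host", "port"}


def add_router_passthrough(parser: argparse.ArgumentParser,
                           prefix: str = "sglang-router-") -> None:
    """Register one prefixed CLI flag per non-managed RouterArgs field."""
    for f in fields(RouterArgs):
        if f.name in MANAGED_FIELDS:
            continue
        flag = f"--{prefix}{f.name.replace('_', '-')}"
        # Use the default's type as the parser type; defaults are kept as-is.
        parser.add_argument(flag, type=type(f.default), default=f.default)


parser = argparse.ArgumentParser()
add_router_passthrough(parser)
ns = parser.parse_args(["--sglang-router-max-payload-size", "2048"])
```

Deriving the flags from the dataclass means a field added upstream shows up as a flag without a matching edit in the argument parser, while the managed set keeps framework-owned values out of user reach.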

# 🐛 Bug Fix

## Throttle metadata polling in DeviceDirectBackend

- Add `time.sleep(0.5)` when the metadata endpoint returns empty
- Prevents tight HTTP GET loop that spammed logs with repeated `httpx` INFO requests during async multimodal training
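
The throttling described above can be sketched as a polling loop that sleeps whenever the endpoint returns nothing. This is a minimal sketch, not the `DeviceDirectBackend` code: `poll_metadata`, `fetch`, and `max_attempts` are hypothetical names, and the HTTP call is replaced by a plain callable so the example stays self-contained.

```python
import time
from typing import Callable, Optional


def poll_metadata(fetch: Callable[[], Optional[dict]],
                  idle_sleep: float = 0.5,
                  max_attempts: int = 10,
                  sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Call `fetch` until it returns non-empty metadata.

    An empty reply (None or {}) triggers a pause of `idle_sleep` seconds
    instead of an immediate retry, which is what avoids the tight GET loop.
    """
    for _ in range(max_attempts):
        meta = fetch()
        if meta:
            return meta
        sleep(idle_sleep)
    return None


# Usage with a fake endpoint that is empty twice before returning data.
replies = iter([None, {}, {"version": 3}])
pauses: list = []
result = poll_metadata(lambda: next(replies), sleep=pauses.append)
```

Passing `sleep` as a parameter is only a testing convenience here; it lets the pauses be recorded without actually waiting.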

# ⭐ Feature

## Add megatron-bridge integration for weight conversion

- Add as drop-in replacement for
- Use Bridge mapping registry for automatic Megatron-to-HF name/format conversion
- Enabled via flag
- Eliminates need for hand-written per-model converters when Bridge mapping exists

## Bridge task map initialization

- Lazy-init Bridge tasks on first use via
- Supplement missing params (tied embeddings, output_layer) from mapping registry
- Eagerly initialize AutoMapping inner delegates for group patching
- Dynamic task creation for cross-EP-rank expert params with caching

## Process group patching for local-only conversion

- Patch all PP/TP/EP groups to None on all mapping levels (recursive)
- Monkey-patch on all classes in delegation chain
- Restore original groups and methods in finally block

---

# 🐛 Bug Fix

## Fix recv_weight_meta polling with long-poll support

- Add parameter to coordinator endpoint
- Implement lightweight long-poll loop to reduce idle HTTP polling
- Update client-side to use configurable long-poll timeout
- Handle gracefully with retry
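
The long-poll idea in the fix above can be illustrated with a coordinator-side holder that blocks a request for a bounded time until metadata arrives. The page does not preserve the actual parameter or endpoint names, so `MetaBoard`, `publish`, `recv`, and `wait_s` below are all hypothetical, and a `threading.Condition` stands in for whatever the real coordinator uses.

```python
import threading
from typing import Optional


class MetaBoard:
    """Hypothetical coordinator-side holder for weight metadata."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._meta: Optional[dict] = None

    def publish(self, meta: dict) -> None:
        """Store new metadata and wake every waiting receiver."""
        with self._cond:
            self._meta = meta
            self._cond.notify_all()

    def recv(self, wait_s: float = 0.0) -> Optional[dict]:
        """Return metadata, blocking up to `wait_s` seconds if none is ready.

        With wait_s=0 this degrades to a plain non-blocking poll.
        """
        with self._cond:
            self._cond.wait_for(lambda: self._meta is not None, timeout=wait_s)
            return self._meta


board = MetaBoard()
# A publisher that fires shortly after the receiver starts waiting.
timer = threading.Timer(0.05, board.publish, args=({"step": 1},))
timer.start()
meta = board.recv(wait_s=2.0)  # returns as soon as publish() runs
timer.join()
```

Compared with a fixed-interval poll, the receiver makes one request per wait window while idle and still sees new metadata almost immediately after it is published.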

# 🐛 Bug Fix

## Fix fatal TryReadObjectRefStream crash on repeated global restarts

Root cause: `ray.shutdown()` destroys ObjectRefStreams while the AsyncLoopThread event loop still holds active C++ watchers on them. The watchers attempt to read from destroyed streams, triggering a fatal `RAY_CHECK` failure in Ray core.

- Add `shutdown_async_loop()` to `async_utils.py` that stops the global event loop and **blocks** until its thread fully exits, ensuring no C++ watchers survive into `ray.shutdown()`
- Call `shutdown_async_loop()` in `_global_restart()` Phase 1 (step 1.8) before `serve.shutdown()` / `ray.shutdown()`
- Add `_cancel_pending_tasks()` to force-cancel all tracked ObjectRefs before shutdown, unblocking the main thread from stale `await task_ref` calls
- Track pending ObjectRefs in `_pending_task_refs` with a lock for thread-safe access between main and HealthChecker threads
- Guard `_pending_task_refs` and its lock with `hasattr` to survive `self.__init__()` re-invocation during restart

---

# 🔩 Chore

## Qwen3.5-9B training config adjustments

- Rename script to `run-qwen35-9B-8xgpu-openr1mm-async.sh`
- Reduce `--sglang-mem-fraction-static` from 0.8 to 0.6 to avoid SGLang OOM on hybrid GDN model (large Mamba cache)
- Disable ClearML auto-connect for tensorboard/pytorch to prevent framework conflict
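
The blocking shutdown described in the TryReadObjectRefStream fix above follows a common pattern: ask the loop to stop from another thread, then join the thread before tearing anything else down. The sketch below shows that pattern with the standard library only. It is not the code in `async_utils.py`; the class layout and the `timeout` parameter are assumptions, and no Ray calls are involved.

```python
import asyncio
import threading


class AsyncLoopThread:
    """Hypothetical stand-in for a background thread that runs an event loop."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        asyncio.set_event_loop(self.loop)
        try:
            self.loop.run_forever()
        finally:
            # Close inside the owning thread so nothing outlives it.
            self.loop.close()

    def shutdown(self, timeout: float = 5.0) -> bool:
        """Stop the loop and block until its thread exits.

        Returns True if the thread finished within `timeout` seconds.
        """
        # loop.stop() is not thread-safe, so schedule it on the loop itself.
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join(timeout)
        return not self._thread.is_alive()


worker = AsyncLoopThread()
future = asyncio.run_coroutine_threadsafe(asyncio.sleep(0, result="ok"), worker.loop)
value = future.result(timeout=5)
stopped = worker.shutdown()  # only after this returns is later teardown safe
```

The join is the important part: returning before the thread has exited would leave a window in which loop callbacks can still run against resources that the next shutdown step is about to destroy.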

# 🐛 Bug Fix

## Fix ~20 GB memory leak on _is_pp_src_rank for MoE models

- Pass `param_weight=None` instead of `param_weight=param` in `WeightConversionTask` constructor within `_convert_to_hf_bridge()`
- The `param_weight` field is only used for HF→Megatron (load) direction, not Megatron→HF (export); storing the EP-gathered tensor in the frozen dataclass kept ~20 GB alive indefinitely on `_is_pp_src_rank` via `self._bridge_task_map` cache
- Add `torch.cuda.empty_cache()` at end of `update_weights_for_rollout()` to release fragmented reserved memory from all_gather + HF-convert buffers

---

# ⚡ Performance

## Enable optimizer CPU offload for Qwen3-30B-A3B async

- Add `--optimizer-cpu-offload` and related flags to reduce GPU memory pressure during training
- Disable MoE aux loss to avoid algorithm performance degradation
- Skip eval before first train step
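
The memory leak fix above comes down to reference lifetime: an object stored in a cached dataclass instance stays alive for as long as the cache does. The sketch below reproduces that effect without PyTorch. `BigBuffer`, `Task`, and `build_cache` are hypothetical stand-ins for the gathered tensor, `WeightConversionTask`, and the task-map cache, and a weak reference is used to observe whether the buffer was freed.

```python
import gc
import weakref
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


class BigBuffer:
    """Stand-in for a large gathered tensor."""


@dataclass(frozen=True)
class Task:
    """Stand-in for a cached conversion task."""
    name: str
    param_weight: Optional[BigBuffer] = None


def build_cache(keep_weight: bool) -> Tuple[Dict[str, Task], weakref.ref]:
    """Create a one-entry cache and report whether the buffer survives it."""
    buf = BigBuffer()
    probe = weakref.ref(buf)  # becomes dead once the buffer is freed
    cache = {"w": Task("w", buf if keep_weight else None)}
    del buf       # drop the local reference, as leaving the function would
    gc.collect()
    return cache, probe


leaky_cache, leaky_probe = build_cache(keep_weight=True)
clean_cache, clean_probe = build_cache(keep_weight=False)
buffer_still_alive = leaky_probe() is not None   # held by the cached Task
buffer_released = clean_probe() is None          # nothing references it
```

A frozen dataclass makes the problem easy to miss, since the field cannot be cleared after construction; not storing the reference in the first place is the simplest fix when the field is unused in that direction.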

NINGBENZHE and others added 8 commits April 23, 2026 14:32

fix(scripts,data): migrate ref-load to bridge mode and fix pyarrow nested type parsing

d6d3bcc

feat: add opd overlap ratio

e525635

Yangruipis requested review from Aurelius84, NINGBENZHE and yxyOo as code owners April 23, 2026 06:35

fix: skip gpu related test on github

3365182

NINGBENZHE approved these changes Apr 23, 2026

Yangruipis merged commit 5a9d977 into main Apr 23, 2026
5 checks passed

Yangruipis deleted the sync/from-gitlab branch April 23, 2026 06:53

Yangruipis merged 9 commits into main from sync/from-gitlab
