sync: gitlab/main -> github/main #11

Merged: Yangruipis merged 9 commits into main from sync/from-gitlab on Apr 23, 2026

@Yangruipis
Collaborator

Routine internal -> external sync.

NINGBENZHE and others added 8 commits April 23, 2026 14:32
# 🐛 Bug Fix

## Remove multimodal recompute-granularity guard in model_provider

- Delete the `is_multimodal` detection block that stripped recompute-related
  bridge keys when `--recompute-granularity=full` was set
- The upstream Megatron-Bridge now handles multi-axis RoPE list unpacking,
  making the workaround obsolete

---

# 🐛 Bug Fix

## Remove unsupported group_rm assertion in eval_rollout

- Delete `assert not args.group_rm` that blocked eval rollout when
  group reward model was enabled
- Allows eval rollout to proceed with group RM configurations
# ⭐ Feature

## Support prefixed agentic router passthrough flags

- Add automatic `--sglang-router-*` flag passthrough from RouterArgs
- Keep framework-managed router fields hidden from passthrough
- Preserve explicit policy and request-timeout compatibility flags
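
A minimal sketch of how prefixed passthrough flags could be generated from a dataclass of router options. The `RouterArgs` class below is a hypothetical stand-in rather than the real router class, and the `FRAMEWORK_MANAGED` set and both helper functions are assumptions made for illustration only.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class RouterArgs:
    """Hypothetical stand-in for the router's option class."""
    host: str = "127.0.0.1"
    port: int = 30000
    policy: str = "cache_aware"
    max_payload_size: int = 4 * 1024 * 1024


# Assumed set of fields the framework sets itself and therefore hides.
FRAMEWORK_MANAGED = {"host", "port"}


def add_router_passthrough(parser, prefix="sglang-router-"):
    """Register one prefixed CLI flag per non-managed RouterArgs field."""
    for f in fields(RouterArgs):
        if f.name in FRAMEWORK_MANAGED:
            continue
        flag = f"--{prefix}{f.name.replace('_', '-')}"
        parser.add_argument(flag, type=f.type, default=f.default)
    return parser


def collect_router_kwargs(args, prefix="sglang_router_"):
    """Strip the prefix again so the values can be handed to RouterArgs."""
    return {
        key[len(prefix):]: value
        for key, value in vars(args).items()
        if key.startswith(prefix)
    }
```

With this layout, a new field on the option class becomes a CLI flag without further parser changes, while the hidden fields stay under framework control.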
# 🐛 Bug Fix

## Throttle metadata polling in DeviceDirectBackend

- Add `time.sleep(0.5)` when the metadata endpoint returns empty
- Prevents tight HTTP GET loop that spammed logs with repeated
  `httpx` INFO requests during async multimodal training
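
The throttling described above can be sketched as a polling loop that sleeps whenever the reply is empty. The function name `poll_metadata`, the `fetch` callable, and the `max_attempts` and `sleep` parameters are assumptions for illustration; only the 0.5 second pause comes from the commit message.

```python
import time


def poll_metadata(fetch, idle_sleep=0.5, max_attempts=None, sleep=time.sleep):
    """Call fetch() until it returns a non-empty result.

    After every empty reply the loop pauses for idle_sleep seconds, so an
    idle endpoint is hit about twice per second instead of in a tight loop.
    Returns None if max_attempts is reached without a non-empty reply.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        meta = fetch()
        if meta:
            return meta
        sleep(idle_sleep)  # back off before the next request
    return None
```

Injecting `sleep` keeps the loop testable without real delays.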
# ⭐ Feature

## Add megatron-bridge integration for weight conversion

- Add  as drop-in replacement for
- Use Bridge mapping registry for automatic Megatron-to-HF name/format conversion
- Enabled via  flag
- Eliminates need for hand-written per-model converters when Bridge mapping exists

## Bridge task map initialization

- Lazy-init Bridge tasks on first use via
- Supplement missing params (tied embeddings, output_layer) from mapping registry
- Eagerly initialize AutoMapping inner delegates for group patching
- Dynamic task creation for cross-EP-rank expert params with caching

## Process group patching for local-only conversion

- Patch all PP/TP/EP groups to None on all mapping levels (recursive)
- Monkey-patch  on all classes in delegation chain
- Restore original groups and methods in finally block
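
The patch-and-restore pattern for process groups could look roughly like the sketch below. It covers only the group attributes, not the method monkey-patching. The attribute names in `GROUP_ATTRS` and the `inner` delegate attribute are hypothetical and do not come from the Megatron-Bridge source.

```python
from contextlib import contextmanager

# Hypothetical attribute names for the PP/TP/EP process groups.
GROUP_ATTRS = ("pp_group", "tp_group", "ep_group")


def iter_mappings(mapping):
    """Yield a mapping and each nested delegate reached via `inner`."""
    seen = set()
    while mapping is not None and id(mapping) not in seen:
        seen.add(id(mapping))
        yield mapping
        mapping = getattr(mapping, "inner", None)


@contextmanager
def local_only(mapping):
    """Temporarily set every process group on every level to None."""
    saved = []
    try:
        for level in iter_mappings(mapping):
            for name in GROUP_ATTRS:
                if hasattr(level, name):
                    saved.append((level, name, getattr(level, name)))
                    setattr(level, name, None)
        yield mapping
    finally:
        # Runs on success and on error, so the groups are always restored.
        for level, name, value in reversed(saved):
            setattr(level, name, value)
```

Recording each original value before overwriting it means a failure partway through patching still restores exactly what was changed.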

---

# 🐛 Bug Fix

## Fix recv_weight_meta polling with long-poll support

- Add  parameter to coordinator  endpoint
- Implement lightweight long-poll loop to reduce idle HTTP polling
- Update client-side  to use configurable long-poll timeout
- Handle  gracefully with retry
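
The general idea behind long-polling is that the server holds a request open until data arrives or a timeout elapses, which a small in-process sketch can illustrate. This version uses a `threading.Condition` in place of the polling loop the commit describes, and `WeightMetaStore`, `publish`, `get`, and `wait_seconds` are all hypothetical names, not the coordinator's actual API.

```python
import threading


class WeightMetaStore:
    """Hypothetical holder for the latest weight metadata."""

    def __init__(self):
        self._cond = threading.Condition()
        self._meta = None

    def publish(self, meta):
        """Store new metadata and wake every waiting reader."""
        with self._cond:
            self._meta = meta
            self._cond.notify_all()

    def get(self, wait_seconds=0.0):
        """Return metadata, waiting up to wait_seconds for it to appear.

        With wait_seconds=0 this behaves like a plain poll. A positive
        value lets one request cover the whole idle period. Returns None
        on timeout, which the caller can treat as a signal to retry.
        """
        with self._cond:
            self._cond.wait_for(
                lambda: self._meta is not None, timeout=wait_seconds
            )
            return self._meta
```

A client would call `get` again after a `None` result, so each idle interval costs one request instead of many.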
# 🐛 Bug Fix

## Fix fatal TryReadObjectRefStream crash on repeated global restarts

Root cause: `ray.shutdown()` destroys ObjectRefStreams while the
AsyncLoopThread event loop still holds active C++ watchers on them.
The watchers attempt to read from destroyed streams, triggering a
fatal `RAY_CHECK` failure in Ray core.

- Add `shutdown_async_loop()` to `async_utils.py` that stops the
  global event loop and **blocks** until its thread fully exits,
  ensuring no C++ watchers survive into `ray.shutdown()`
- Call `shutdown_async_loop()` in `_global_restart()` Phase 1
  (step 1.8) before `serve.shutdown()` / `ray.shutdown()`
- Add `_cancel_pending_tasks()` to force-cancel all tracked
  ObjectRefs before shutdown, unblocking the main thread from
  stale `await task_ref` calls
- Track pending ObjectRefs in `_pending_task_refs` with a lock
  for thread-safe access between main and HealthChecker threads
- Guard `_pending_task_refs` and its lock with `hasattr` to
  survive `self.__init__()` re-invocation during restart
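
A minimal sketch of the blocking shutdown described above, assuming the event loop runs `run_forever()` on a dedicated thread. The class layout and the `shutdown` method are hypothetical; the actual `shutdown_async_loop()` in `async_utils.py` may be structured differently.

```python
import asyncio
import threading


class AsyncLoopThread:
    """Hypothetical minimal event loop running on its own thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        asyncio.set_event_loop(self.loop)
        try:
            self.loop.run_forever()
        finally:
            self.loop.close()

    def shutdown(self, timeout=10.0):
        """Stop the loop and block until its thread has exited.

        Returns True if the thread is gone when the call returns.
        """
        if not self.thread.is_alive():
            return True
        # loop.stop() must be scheduled from outside the loop thread.
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join(timeout)
        return not self.thread.is_alive()
```

The `join` is what makes the call blocking. Without it, the caller could move on to `ray.shutdown()` while the loop thread is still running.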

---

# 🔩 Chore

## Qwen3.5-9B training config adjustments

- Rename script to `run-qwen35-9B-8xgpu-openr1mm-async.sh`
- Reduce `--sglang-mem-fraction-static` from 0.8 to 0.6 to
  avoid SGLang OOM on hybrid GDN model (large Mamba cache)
- Disable ClearML auto-connect for tensorboard/pytorch to
  prevent framework conflict
# 🐛 Bug Fix

## Fix ~20 GB memory leak on _is_pp_src_rank for MoE models

- Pass `param_weight=None` instead of `param_weight=param` in
  `WeightConversionTask` constructor within `_convert_to_hf_bridge()`
- The `param_weight` field is only used for HF→Megatron (load)
  direction, not Megatron→HF (export); storing the EP-gathered
  tensor in the frozen dataclass kept ~20 GB alive indefinitely
  on `_is_pp_src_rank` via `self._bridge_task_map` cache
- Add `torch.cuda.empty_cache()` at end of
  `update_weights_for_rollout()` to release fragmented reserved
  memory from all_gather + HF-convert buffers
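
The leak mechanism can be reproduced without a GPU: a cached frozen dataclass that stores a large buffer keeps that buffer alive for as long as the cache entry exists. In this sketch `FakeTensor`, `ConversionTask`, `task_cache`, and `build_task` are hypothetical stand-ins for the EP-gathered tensor, `WeightConversionTask`, and `self._bridge_task_map`.

```python
import gc
import weakref
from dataclasses import dataclass
from typing import Optional


class FakeTensor:
    """Stand-in for a large gathered tensor."""

    def __init__(self, nbytes):
        self.data = bytearray(nbytes)


@dataclass(frozen=True)
class ConversionTask:
    """Hypothetical task record; param_weight is optional on export."""
    name: str
    param_weight: Optional[FakeTensor] = None


task_cache = {}


def build_task(name, gathered, keep_weight):
    """Cache a task, optionally storing the gathered buffer inside it."""
    task = ConversionTask(name, gathered if keep_weight else None)
    task_cache[name] = task
    return task


def survives_after_release(keep_weight):
    """Report whether the buffer is still alive once the caller drops it."""
    buf = FakeTensor(1024)
    ref = weakref.ref(buf)
    build_task(f"w-{keep_weight}", buf, keep_weight)
    del buf
    gc.collect()
    return ref() is not None
```

Here `survives_after_release(True)` reports that the buffer is still alive, while `survives_after_release(False)` reports that it was freed, which matches the effect of passing `param_weight=None`.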

---

# ⚡ Performance

## Enable optimizer CPU offload for Qwen3-30B-A3B async

- Add `--optimizer-cpu-offload` and related flags to reduce GPU
  memory pressure during training
- Disable MoE aux loss to avoid algorithm performance degradation
- Skip eval before first train step
@Yangruipis Yangruipis merged commit 5a9d977 into main Apr 23, 2026
5 checks passed
@Yangruipis Yangruipis deleted the sync/from-gitlab branch April 23, 2026 06:53