
sync: gitlab/main -> github/main #29

Merged
Yangruipis merged 7 commits into main from sync/from-gitlab on May 9, 2026
Conversation

@Yangruipis
Collaborator

Routine internal -> external sync.

dirtyDan0 and others added 7 commits May 9, 2026 18:35
# 🔩 Chore

## Remove dead helpers from data module

- Delete the unreferenced `filter_long_prompt` helper from `relax/utils/data/data.py`
- Delete the dead `_build_messages` helper that was shadowed by `relax/utils/data/data_utils.py`
- Delete the unused `process_rollout_data` helper and the imports it required
# 🐛 Bug Fix

## Keep multimodal prompt building non-destructive

- Build multimodal message content without mutating cached prompt rows
- Prevent reused raw samples from carrying expanded message content into later reads
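A minimal sketch of the non-destructive pattern described above. The function and row layout (`build_mm_messages`, a `messages` list with an `<image>` placeholder) are illustrative assumptions, not the repository's actual API; the point is that the cached row is deep-copied before any content expansion.

```python
import copy

# Hypothetical helper: names and the row schema are illustrative assumptions.
def build_mm_messages(row: dict) -> list[dict]:
    """Expand image placeholders into structured multimodal content
    without mutating the cached prompt row."""
    messages = copy.deepcopy(row["messages"])  # never touch the cached row
    for msg in messages:
        if isinstance(msg["content"], str) and "<image>" in msg["content"]:
            text = msg["content"].replace("<image>", "")
            msg["content"] = [
                {"type": "image"},
                {"type": "text", "text": text},
            ]
    return messages

row = {"messages": [{"role": "user", "content": "<image>Describe this."}]}
expanded = build_mm_messages(row)
# The cached row is unchanged, so a later read sees the raw string again.
assert row["messages"][0]["content"] == "<image>Describe this."
```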

## Support sliced eager dataset paths

- Parse per-file generalized slice syntax in eager file readers
- Keep multi-file eager path behavior aligned with streaming path semantics
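A sketch of what per-file slice parsing could look like, assuming a `path@[start:stop]` spelling; the actual grammar accepted by the eager readers may differ, and the parser name is ours.

```python
import re

# Illustrative parser; the real slice grammar in the repo may differ.
_SLICE_RE = re.compile(r"^(?P<path>.+?)@\[(?P<start>-?\d*):(?P<stop>-?\d*)\]$")

def parse_sliced_path(spec: str) -> tuple[str, slice]:
    """Split 'train.jsonl@[0:1000]' into ('train.jsonl', slice(0, 1000)).
    A bare path means the whole file."""
    m = _SLICE_RE.match(spec)
    if m is None:
        return spec, slice(None)
    start = int(m["start"]) if m["start"] else None
    stop = int(m["stop"]) if m["stop"] else None
    return m["path"], slice(start, stop)

path, sl = parse_sliced_path("train.jsonl@[0:1000]")
# rows = load_rows(path)[sl]  # eager reader applies the slice after loading
```

Returning a `slice` object lets the eager path apply exactly the same subsetting semantics the streaming path implements with skip/take.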

## Await the final evaluation before rollout shutdown

The rollout component exits its main loop on the final training step, leaving the eval handler un-awaited. This caused a race in which the controller's atexit shutdown tore down SGLang engines mid-flight. The fix blocks until the final evaluation finishes at the end of training.
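The shutdown-ordering fix can be sketched with asyncio; `rollout_loop` and `run_eval` are illustrative names, not the component's real API.

```python
import asyncio

done = []

async def _fake_eval(step):
    """Stand-in for the real evaluation handler."""
    done.append(step)

async def rollout_loop(total_steps, run_eval):
    pending_eval = None
    for step in range(total_steps):
        # ... generate rollouts for this step ...
        if step == total_steps - 1:
            pending_eval = asyncio.create_task(run_eval(step))
    # Block until the final evaluation finishes before returning, so an
    # atexit shutdown cannot tear down inference engines mid-flight.
    if pending_eval is not None:
        await pending_eval

asyncio.run(rollout_loop(total_steps=3, run_eval=_fake_eval))
# done == [2]: the last-step eval completed before the loop returned
```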
# ⭐ Feature

## Migrate from Megatron-LM to Megatron-Bridge

- Replace direct Megatron-LM checkout with Megatron-Bridge (commit 2faedbf6) in Dockerfile
- Upgrade transformer_engine from 2.10.0 to 2.14.1
- Archive old megatron patch (3714d81d) and add new patch for 20260506-85bced0ae

## Adapt Relax backend to Megatron-Bridge API changes

- Update vocab_size_with_padding import with fallback for new module path
- Rename enable_gloo_process_groups to use_gloo_process_groups
- Rename norm_epsilon to layernorm_epsilon in HF config validation
- Accept **kwargs in wrapped_provider for new model_provider signature
- Relax partition_stride assertion for GLU/SwiGLU linear_fc1 layers (stride=2)
- Guard checkpoint_write_patch against removed write_preloaded_data_multiproc
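The `vocab_size_with_padding` fallback follows the usual moved-module pattern. A generic sketch (the helper below is ours, not the repository's; the real code imports the specific Megatron symbol directly):

```python
import importlib

def import_with_fallback(primary: str, fallback: str, name: str):
    """Import `name` from `primary`, falling back to `fallback` when the
    symbol moved between library versions (as in the Megatron-LM ->
    Megatron-Bridge migration)."""
    try:
        module = importlib.import_module(primary)
    except ImportError:
        module = importlib.import_module(fallback)
    return getattr(module, name)

# Demo with stdlib modules: the first path does not exist, so the
# fallback is used and we get math.sqrt.
sqrt = import_with_fallback("nonexistent_new_module", "math", "sqrt")
```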
# ⭐ Feature

## Add Qwen3.6 model support with automatic expert format detection

- Add Qwen3.6-35B-A3B model configuration script with MoE parameters (256 experts, 8-way routing)
- Implement MTP MoE expert weight format detection in Qwen35VL bridge
  - Qwen3.5: per-expert storage (gate_proj/up_proj/down_proj per expert)
  - Qwen3.6: packed format (gate_up_proj/down_proj shared tensor)
- Add training script for Qwen3.6-35B-A3B 8xGPU colocate mode with multimodal support
- Extend Megatron bridge patch with format-aware weight mappings
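The format detection described above can be sketched as a scan over checkpoint keys. The key names are modeled on the gate_up_proj/gate_proj distinction stated in the bullets and are assumptions about the exact checkpoint layout:

```python
# Illustrative detection of the two MoE expert weight layouts; the exact
# key paths in real checkpoints are assumptions here.
def detect_expert_format(state_dict_keys) -> str:
    """Return 'packed' for the Qwen3.6-style shared gate_up_proj tensor,
    'per_expert' for the Qwen3.5-style per-expert projections."""
    for key in state_dict_keys:
        if "experts" in key and "gate_up_proj" in key:
            return "packed"
    return "per_expert"

fmt = detect_expert_format(["model.layers.0.mlp.experts.gate_up_proj"])
# fmt == "packed"
```

Keying the decision off the checkpoint itself means one bridge handles both generations without a model-version flag.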

---

# 🐛 Bug Fix

## Fix multimodal data counting and training script paths

- Fix remain_data counter for pre-structured multimodal content (was skipping already-processed items)
- Remove invalid dataset slice notation (`@[0:1000]`) from the PROMPT_SET path in the training script
# ⭐ Feature

## Add rollout reward field metrics

- Aggregate numeric fields from reward dictionaries during rollout logging
- Skip the primary reward key and raw_reward to preserve existing reward metrics
- Reuse the shared helper from the SGLang rollout metrics path
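A hedged sketch of the aggregation: the key names `reward` and `raw_reward` come from the description above, while the helper name and averaging choice are ours.

```python
# Illustrative helper; the shared helper in the repo may differ in shape.
def collect_reward_field_metrics(reward_dicts, primary_key="reward"):
    """Average every numeric side-field across reward dicts, skipping the
    primary reward key and raw_reward so existing metrics are untouched."""
    sums, counts = {}, {}
    for rd in reward_dicts:
        for key, value in rd.items():
            if key in (primary_key, "raw_reward"):
                continue
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                sums[key] = sums.get(key, 0.0) + value
                counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

metrics = collect_reward_field_metrics(
    [{"reward": 1.0, "format_score": 0.5}, {"reward": 0.0, "format_score": 1.0}]
)
# metrics == {"format_score": 0.75}
```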
# 🐛 Bug Fix

## Strip consecutive `<|image_pad|>` tokens in pre-tokenized prompts

- Add `QwenVLImageProcessor._strip_image_token` static helper that collapses
  runs of `<|image_pad|>` (token id 151655) into a single placeholder while
  leaving the surrounding `<|vision_start|>`/`<|vision_end|>` markers intact.
- Apply the helper to `prompt` before calling `load_mm_data` in
  `process_mm_data_async`, so pre-tokenized `input_ids` (where each image is
  already expanded to N image-pad tokens, one per visual patch) no longer
  collide with `load_mm_data` re-expanding the placeholder itself. Without
  the collapse, the pipeline saw `N x M` image-pad tokens and miscounted
  positions, breaking mrope bookkeeping.
- Raw text (`str`) prompts are passed through unchanged.
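On token-id sequences the collapse reduces to dropping repeats within a run. A sketch (the real `_strip_image_token` may operate on strings rather than ids; the vision-marker ids in the demo are standard Qwen-VL special tokens but are not quoted from this PR):

```python
IMAGE_PAD_ID = 151655  # <|image_pad|>, per the description above

def strip_image_tokens(input_ids):
    """Collapse each run of consecutive image-pad tokens into a single
    placeholder so load_mm_data re-expands it exactly once (avoiding the
    N x M miscount described above)."""
    out = []
    for tok in input_ids:
        if tok == IMAGE_PAD_ID and out and out[-1] == IMAGE_PAD_ID:
            continue  # drop repeats inside a run
        out.append(tok)
    return out

# <|vision_start|> pad pad pad <|vision_end|>  ->  start, one pad, end
collapsed = strip_image_tokens([151652, 151655, 151655, 151655, 151653])
# collapsed == [151652, 151655, 151653]
```

Only adjacent pads are merged, so two separate images (pads separated by vision markers) each keep their own placeholder.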
Yangruipis merged commit 2100c15 into main on May 9, 2026
5 checks passed
Yangruipis deleted the sync/from-gitlab branch on May 9, 2026 at 10:48

4 participants