sync: gitlab/main -> github/main #21

Merged: Yangruipis merged 6 commits into main from sync/from-gitlab on Apr 28, 2026
@Yangruipis (Collaborator) commented:

Routine internal -> external sync.

NINGBENZHE and others added 6 commits April 28, 2026 11:08
# 🐛 Bug Fix

## Correct judge prompt spelling in DeepEyes reward

- replace `Judgement` and `Judement` with `Judgment` in few-shot examples and prompt text
- update the prompt suffix to request `Judgment:` consistently
- keep response parsing backward compatible with both `Judgment:` and `Judgement:` labels
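
A minimal sketch of the backward-compatible label parsing described above. The function name `extract_judgment` and the regular expression are illustrative assumptions, not the DeepEyes reward code; the only behavior taken from the commit message is that both `Judgment:` and `Judgement:` are accepted.

```python
import re

# Hypothetical pattern: the optional "e" accepts both "Judgment:" and the
# legacy "Judgement:" spelling. The misspelling "Judement" is not matched.
_JUDGE_LABEL = re.compile(r"Judge?ment:\s*(.*)")


def extract_judgment(response: str):
    """Return the text after the last judgment label, or None if no label is found."""
    matches = _JUDGE_LABEL.findall(response)
    return matches[-1].strip() if matches else None


print(extract_judgment("The answers agree.\nJudgment: 1"))
print(extract_judgment("Judgement: 0"))
```

Taking the last match is a design choice for this sketch: it ignores any labels echoed from few-shot examples earlier in the response.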
# ⭐ Feature

## Add unified device abstraction layer (`relax/utils/device.py`)

- Introduce `AcceleratorType` enum: CUDA, NPU, XPU, PPU, ROCM, CPU
- Auto-detect hardware via `_detect_accelerator()` with priority-based probing
- Support `RELAX_DEVICE_TYPE` env var override for debugging
- Provide 25+ thin-wrapper APIs: `current_device()`, `set_device()`, `synchronize()`,
  `empty_cache()`, `Stream()`, `Event()`, `stream_context()`, `is_initialized()`, etc.
- Map distributed backends: CUDA→nccl, NPU→hccl, XPU→xccl, PPU→eccl
- Map Ray resource names: CUDA/ROCm→GPU, NPU→NPU, XPU→XPU
- Map visible-devices env vars per accelerator type
- Abstract NUMA affinity with graceful degradation for non-CUDA backends

---

# ♻️ Refactor

## Replace hardcoded `torch.cuda.*` calls across 20+ files

- Replace `torch.cuda.current_device()` → `device_utils.current_device()`
- Replace `torch.cuda.set_device()` → `device_utils.set_device()`
- Replace `torch.cuda.synchronize()` → `device_utils.synchronize()`
- Replace `torch.cuda.empty_cache()` → `device_utils.empty_cache()`
- Replace `torch.cuda.Stream/Event` → `device_utils.Stream()/Event()`
- Replace `torch.cuda.mem_get_info()` → `device_utils.mem_get_info()`
- Replace `torch.cuda.device_count()` → `device_utils.device_count()`
- Replace `torch.device("cuda:...")` → `device_utils.make_current_torch_device()`
- Replace `device="cuda"` → `device=device_utils.get_device_name()`
- Replace `"nccl"` backend → `device_utils.get_dist_backend()`
- Replace `"GPU"` Ray resource → `device_utils.get_ray_accelerator_name()`
- Replace `CUDA_VISIBLE_DEVICES` → `device_utils.get_visible_devices_env_var()`
- Wrap CUDA-specific memory profiling APIs with `hasattr` guards
- Add CUDA-only annotation to `int4_qat/setup.py` kernel build script
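
The `hasattr` guard pattern mentioned above might look roughly like this. `guarded_call` is a hypothetical helper, and the two `SimpleNamespace` objects are stand-ins for backend modules so the sketch runs without torch; which profiling APIs the refactor actually guards is not stated in the commit message.

```python
from types import SimpleNamespace


def guarded_call(backend, name, *args, default=None, **kwargs):
    """Call backend.<name>(...) only if the backend exposes it, else return default."""
    if not hasattr(backend, name):
        return default
    return getattr(backend, name)(*args, **kwargs)


# Stand-in for a backend that has a memory profiling hook.
cuda_like = SimpleNamespace(memory_snapshot=lambda: ["segment"])
# Stand-in for a backend that lacks it.
other_like = SimpleNamespace()

print(guarded_call(cuda_like, "memory_snapshot"))
print(guarded_call(other_like, "memory_snapshot", default=[]))
```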
# ⭐ Feature

## Add Qwen3.5-9B single-node fully-async training script

- Add `run_deepeyes_qwen35_9B_async.sh` for 8xGPU fully-async DeepEyes training
- Resource layout: actor(4) + rollout(2) + reference(1) + actor_fwd(1)
- Use `--use-dynamic-batch-size` and `--no-rope-fusion` per latest Qwen3.5 conventions

---

# 🐛 Bug Fix

## Add 0-1000 normalized bbox coordinate conversion

- Qwen-VL/Qwen2-VL/Qwen3-VL output 0-1000 normalized coords but `_maybe_resize_bbox` treated them as absolute pixels
- Add coordinate conversion step before clamping in `_maybe_resize_bbox`
- Add `normalize_bbox` parameter to `DeepeyesEnv` (default True) for model-specific control
- Qwen2.5-VL users can set `normalize_bbox: false` since it outputs absolute pixel coords
- Wire `normalize_bbox` through `build_env` from custom config
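
A minimal sketch of the conversion order described above: scale 0-1000 coordinates to pixels first, then clamp to the image bounds. `to_pixel_bbox` and its signature are assumptions for illustration; the actual `_maybe_resize_bbox` may also resize or validate boxes in ways not shown here.

```python
def to_pixel_bbox(bbox, width, height, normalize_bbox=True, scale=1000):
    """Convert an [x1, y1, x2, y2] box to pixel coordinates, then clamp it."""
    x1, y1, x2, y2 = bbox
    if normalize_bbox:
        # Model emitted 0-1000 normalized coordinates: rescale to the image size.
        x1, x2 = x1 * width / scale, x2 * width / scale
        y1, y2 = y1 * height / scale, y2 * height / scale
    # Clamp only after conversion, so normalized values are never treated as pixels.
    x1, x2 = min(max(x1, 0), width), min(max(x2, 0), width)
    y1, y2 = min(max(y1, 0), height), min(max(y2, 0), height)
    return [x1, y1, x2, y2]


# Normalized input on a 640x480 image.
print(to_pixel_bbox([250, 500, 750, 1000], 640, 480))
# Absolute pixel input (for example with `normalize_bbox: false`): only clamped.
print(to_pixel_bbox([250, 500, 750, 1000], 640, 480, normalize_bbox=False))
```

Without the conversion step, the first box would be clamped as if it were already in pixels, which is the behavior the fix addresses.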

# 🐛 Bug Fix

## Fix IndexError on 1D tensor transpose in fully-async weight sync

- Qwen3.5 Bridge outputs expert gate_up_proj as 2D [2*H, D] (cat, no transpose),
  unlike Qwen3-VL which outputs 3D [2, D_out, D_in] (stack + transpose)
- Qwen3.5 ExpertMLPDownProjMapping inherits AutoMapping (no transpose),
  unlike Qwen3-VL which overrides megatron_to_hf with transpose
- Add ndim-based branching in _convert_to_hf_bridge post-processing:
  3D → Qwen3-VL path (undo transpose + index), 2D → Qwen3.5 path (chunk)
- Detect bridge_expert_transposes_down at init time via __dict__ introspection
  to decide whether down_proj needs un-transpose
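
The ndim-based branching can be sketched as follows. Plain nested lists stand in for tensors so the example runs without torch, and all helper names are hypothetical. The exact axes and the order of "undo transpose" and "index" in `_convert_to_hf_bridge` are assumptions here.

```python
def ndim(x):
    """Nesting depth of a rectangular nested list (stand-in for tensor.ndim)."""
    depth = 0
    while isinstance(x, list):
        depth += 1
        x = x[0]
    return depth


def transpose2d(m):
    """Transpose a 2D nested list."""
    return [list(col) for col in zip(*m)]


def split_gate_up(w):
    """Split a fused gate_up_proj weight into (gate, up) based on its rank."""
    n = ndim(w)
    if n == 3:
        # Stacked [2, D_out, D_in] layout: index each half and undo the transpose.
        return transpose2d(w[0]), transpose2d(w[1])
    if n == 2:
        # Concatenated [2*H, D] layout: chunk the rows into two halves, no transpose.
        half = len(w) // 2
        return w[:half], w[half:]
    raise ValueError(f"unexpected gate_up_proj rank: {n}")


def convert_down_proj(w, bridge_expert_transposes_down):
    """Un-transpose down_proj only when the bridge mapping transposed it."""
    return transpose2d(w) if bridge_expert_transposes_down else w


gate, up = split_gate_up([[1, 2], [3, 4], [5, 6], [7, 8]])
print(gate, up)
```

Branching on rank avoids indexing a 2D tensor down to 1D and then transposing it, which is the kind of sequence that can raise the `IndexError` named in the title.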

---

# ✅ Tests

## Add Qwen3.5 Bridge expert weight conversion tests

- Add TestQwen35BridgeMappingOutput: verify 2D cat output, no-transpose
  down_proj, and megatron_to_hf override detection
- Add TestQwen35PostProcessingCorrectness: end-to-end gate_up split and
  down_proj passthrough correctness
- Update _apply_expert_postprocessing helper to accept bridge_expert_transposes_down param
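
One way the `megatron_to_hf` override detection could work is to inspect class `__dict__` entries along the MRO. The classes below are hypothetical stand-ins, not the real bridge mappings, and the actual check may inspect different objects.

```python
class BaseMapping:
    """Stand-in for a base mapping that converts weights without a transpose."""

    def megatron_to_hf(self, w):
        return w


class PlainDownProj(BaseMapping):
    """Inherits the base conversion unchanged."""


class TransposingDownProj(BaseMapping):
    """Overrides the conversion to transpose the weight."""

    def megatron_to_hf(self, w):
        return [list(col) for col in zip(*w)]


def overrides_megatron_to_hf(mapping_cls, base=BaseMapping):
    """True if mapping_cls or an ancestor below `base` defines megatron_to_hf."""
    for klass in mapping_cls.__mro__:
        if klass is base:
            return False
        if "megatron_to_hf" in klass.__dict__:
            return True
    return False


print(overrides_megatron_to_hf(PlainDownProj))
print(overrides_megatron_to_hf(TransposingDownProj))
```

Checking `__dict__` rather than `hasattr` matters because an inherited method is always present as an attribute; only the class that defines it carries it in its own `__dict__`.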

@Yangruipis (Collaborator, Author) commented:

Closes #18 #17

Yangruipis merged commit 207ace5 into main on Apr 28, 2026 (5 checks passed)
Yangruipis deleted the sync/from-gitlab branch on April 28, 2026 07:22