[BUG] Error when training Qwen3.5 with Megatron #1317

@leekum2018

Description

Packages: AReaL 1.04, Transformers 5.8, megatron-core 0.17, mbridge 0.15.1
Model initialization fails with an AssertionError while loading the HF weights. The full error output is below:

`qwen3.5 model --- mtp_args:{'mtp_num_layers': 1, 'mtp_loss_scaling_factor': 0.1}
20260509-13:26:32.821 [MegatronEngine Rank 0] INFO: Using mbridge to create models and hf model save/load in MegatronEngine.

number of parameters on (tensor, pipeline) model parallel rank (0, 0): 4659865088
Traceback (most recent call last):
File "/mnt/projectsareal_peojects/areal/infra/rpc/guard/engine_blueprint.py", line 538, in execute_in_engine_thread
result = method(*args_bcast, **kwargs_bcast)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/projectsareal_peojects/areal/engine/megatron_engine.py", line 388, in initialize
self._load_model_from_hf(self.config.path)
File "/mnt/projectsareal_peojects/areal/engine/megatron_engine.py", line 1910, in _load_model_from_hf
load_weights_from_hf_with_mbridge_fast(
File "/mnt/projectsareal_peojects/areal/models/mcore/hf_load.py", line 738, in load_weights_from_hf_with_mbridge_fast
for _ in results:
^^^^^^^
File "/mnt/users_home/local/xw/miniconda3/envs/areal/lib/python3.12/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/users_home/local/xw/miniconda3/envs/areal/lib/python3.12/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/mnt/users_home/local/xw/miniconda3/envs/areal/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/mnt/users_home/local/xw/miniconda3/envs/areal/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/mnt/users_home/local/xw/miniconda3/envs/areal/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/projectsareal_peojects/areal/models/mcore/hf_load.py", line 735, in <lambda>
lambda kwargs: _load_weight_with_bridge_worker(**kwargs), worker_args
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/projectsareal_peojects/areal/models/mcore/hf_load.py", line 537, in _load_weight_with_bridge_worker
param_to_load = _weight_to_mcore_tp(
^^^^^^^^^^^^^^^^^^^^
File "/mnt/projectsareal_peojects/areal/models/mcore/hf_load.py", line 444, in _weight_to_mcore_tp
res = _slice_generic_weight(
^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/projectsareal_peojects/areal/models/mcore/hf_load.py", line 254, in _slice_generic_weight
assert len(hf_weights_safe_slice) == 1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
(AReaL) 20260509-13:26:35.696 EngineBP ERROR: Engine method 'initialize' failed:
Traceback (most recent call last):
File "/mnt/projectsareal_peojects/areal/infra/rpc/guard/engine_blueprint.py", line 538, in execute_in_engine_thread
result = method(*args_bcast, **kwargs_bcast)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/projectsareal_peojects/areal/engine/megatron_engine.py", line 388, in initialize
self._load_model_from_hf(self.config.path)
File "/mnt/projectsareal_peojects/areal/engine/megatron_engine.py", line 1910, in _load_model_from_hf
load_weights_from_hf_with_mbridge_fast(
File "/mnt/projectsareal_peojects/areal/models/mcore/hf_load.py", line 738, in load_weights_from_hf_with_mbridge_fast
for _ in results:
^^^^^^^
File "/mnt/users_home/local/xw/miniconda3/envs/areal/lib/python3.12/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/users_home/local/xw/miniconda3/envs/areal/lib/python3.12/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/mnt/users_home/local/xw/miniconda3/envs/areal/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/mnt/users_home/local/xw/miniconda3/envs/areal/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/mnt/users_home/local/xw/miniconda3/envs/areal/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/projectsareal_peojects/areal/models/mcore/hf_load.py", line 735, in <lambda>
lambda kwargs: _load_weight_with_bridge_worker(**kwargs), worker_args
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/projectsareal_peojects/areal/models/mcore/hf_load.py", line 537, in _load_weight_with_bridge_worker
param_to_load = _weight_to_mcore_tp(
^^^^^^^^^^^^^^^^^^^^
File "/mnt/projectsareal_peojects/areal/models/mcore/hf_load.py", line 444, in _weight_to_mcore_tp
res = _slice_generic_weight(
^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/projectsareal_peojects/areal/models/mcore/hf_load.py", line 254, in _slice_generic_weight
assert len(hf_weights_safe_slice) == 1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError`
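
A possible way to narrow this down: the failing check in `_slice_generic_weight` (`assert len(hf_weights_safe_slice) == 1`) expects exactly one HF tensor for the parameter being loaded, and the log shows MTP enabled (`mtp_num_layers: 1`). Whether the MTP weights are actually the ones tripping the assertion is only a guess. The sketch below lists tensor names in the checkpoint's `model.safetensors.index.json` that contain a keyword, so they can be compared against what the loader expects. It assumes a sharded safetensors checkpoint with the usual `weight_map` entry; the helper name `find_weights` and the `"mtp"` keyword are hypothetical, not part of AReaL or mbridge.

```python
import json


def find_weights(index_path, keyword):
    """Return {tensor name: shard file} for names that contain `keyword`.

    Assumes `index_path` points at a sharded safetensors index file,
    which stores a {tensor name: shard file} mapping under "weight_map".
    """
    with open(index_path, encoding="utf-8") as f:
        index = json.load(f)
    weight_map = index["weight_map"]
    # Sort by name so related tensors (same layer or prefix) appear together.
    return {
        name: shard
        for name, shard in sorted(weight_map.items())
        if keyword in name
    }
```

For example, `find_weights("<checkpoint dir>/model.safetensors.index.json", "mtp")` would return any tensors whose names contain `mtp`. An empty result would suggest the checkpoint names those weights differently, or does not ship them at all.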

Labels: bug, stale
