Conversation
Summary of Changes

Hello @yueming-yuan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands model support by integrating the GLM-5 architecture, which features a novel Multi-Latent Attention mechanism. It introduces the necessary infrastructure, including a custom Docker environment, specific patches for external libraries, and new weight-conversion logic to correctly handle GLM-5's unique components. It also enhances distributed training with advanced context-parallelism strategies and provides comprehensive scripts for training and evaluating GLM-5 models across different scales and configurations.
Code Review
The pull request introduces support for the GLM-5 model, including a new model implementation, custom CUDA kernels via TileLang, and integration with the training pipeline. However, a security audit identified medium-severity vulnerabilities, specifically command injection in the training script scripts/run_glm5_744b_a40b.py due to unsanitized CLI arguments, and potential sensitive information leakage in miles/utils/typer_utils.py which prints all CLI arguments, risking exposure of secrets like W&B API keys. Additionally, critical issues were found with hardcoded dimensions in the CUDA kernels and weight conversion scripts that will cause failures when running with the GLM-5 configuration, and a bug in the autograd function for the indexer will lead to runtime errors during the backward pass.
```python
def backward(ctx, grad_scores, grad_indices):
    index_q, index_k, weights, cu_seqlen_ks, cu_seqlen_ke, topk_indices = ctx.saved_tensors
    grad_q, grad_w, grad_k = indexer_bwd_interface(index_q, weights, index_k, topk_indices, grad_scores)
    return grad_q, grad_k, grad_w, None, None, None, None, None, None, None
```
The backward method returns 10 values, but the forward method only receives 7 arguments (excluding ctx): index_q, index_k, weights, cu_seqlen_ks, cu_seqlen_ke, topk, and topk_indices. In PyTorch, the backward method must return exactly one gradient for each input to forward. This mismatch will cause a RuntimeError during training.
```diff
-    return grad_q, grad_k, grad_w, None, None, None, None, None, None, None
+    return grad_q, grad_k, grad_w, None, None, None, None
```
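The arity rule can be checked mechanically. Below is a minimal sketch using plain-function stand-ins (not the real autograd `Function`) whose signatures mirror the snippet above: `forward` takes 7 inputs after `ctx`, so the corrected `backward` returns exactly 7 values.

```python
import inspect

# Stand-in mirroring the forward signature: ctx plus 7 inputs.
def forward(ctx, index_q, index_k, weights, cu_seqlen_ks, cu_seqlen_ke, topk, topk_indices):
    pass

# Corrected backward: exactly one return value per forward input (excluding ctx).
def backward_fixed(ctx, grad_scores, grad_indices):
    grad_q = grad_k = grad_w = None  # placeholders for the real gradients
    return grad_q, grad_k, grad_w, None, None, None, None

num_inputs = len(inspect.signature(forward).parameters) - 1  # drop ctx
num_grads = len(backward_fixed(None, None, None))
assert num_grads == num_inputs == 7
```

With the original 10-value return, `num_grads` would be 10 and PyTorch would raise a `RuntimeError` during the backward pass.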
```python
assert kv.shape[-1] == dim_plus_tail_dim
assert kv.shape[0] == B
# dim should be assigned
D = 512
```
```python
batch, seq_len, heads, dim_plus_tail_dim = q.shape
_, seq_len_kv, kv_group, _ = kv.shape

assert dim_plus_tail_dim == 576, "you should assign dim otherwise"
```
```python
U.convert_checkpoint(
    model_name=args.model_name,
    megatron_model_type=args.megatron_model_type,
    num_gpus_per_node=num_gpus_per_node,
    multinode=multinode,
    num_nodes=num_nodes,
    extra_args=extra_args,
    dir_dst=args.model_dir,
    megatron_path=args.megatron_path,
)
```
```python
U.execute_train(
    train_args=train_args,
    config=args,
    num_gpus_per_node=args.num_gpus_per_node,
    megatron_model_type=args.megatron_model_type,
    extra_env_vars={
        **sglang_extra_env_vars,
        "INDEXER_ROPE_NEOX_STYLE": "0",
        "NVSHMEM_DISABLE_NCCL": "1",
    },
    megatron_path=args.megatron_path,
)
```
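The `extra_env_vars` dict above overlays launch-specific settings on a base set. An illustrative sketch of that merge-and-launch pattern (variable names here are stand-ins, not the real miles internals): the child process inherits the parent environment, with the merged overrides applied last.

```python
import os
import subprocess
import sys

# Stand-in for sglang_extra_env_vars; real keys come from the launcher.
base_env = {"EXAMPLE_SGLANG_FLAG": "1"}
extra_env = {
    **base_env,
    "INDEXER_ROPE_NEOX_STYLE": "0",
    "NVSHMEM_DISABLE_NCCL": "1",
}

# Later dict entries win, so launch-specific overrides take precedence.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['NVSHMEM_DISABLE_NCCL'])"],
    env={**os.environ, **extra_env},  # inherit parent env, then override
    capture_output=True,
    text=True,
)
assert result.stdout.strip() == "1"
```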
```python
def _prepare_download(args: ScriptArgs):
    U.exec_command(f"mkdir -p {args.model_dir} {args.data_dir}")
```
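This f-string interpolation is the command-injection pattern flagged in the review: a directory argument containing shell metacharacters becomes part of the command. A hypothetical hardening sketch using `shlex.quote` (`safe_mkdir_command` is an illustrative helper, not part of the codebase):

```python
import shlex

# Quote user-controlled paths before interpolating into a shell string,
# so a value like "data; rm -rf /" cannot inject a second command.
def safe_mkdir_command(model_dir: str, data_dir: str) -> str:
    return f"mkdir -p {shlex.quote(model_dir)} {shlex.quote(data_dir)}"

cmd = safe_mkdir_command("models", "data; rm -rf /")
assert cmd == "mkdir -p models 'data; rm -rf /'"
```

Passing an argument list to `subprocess.run` without `shell=True` would avoid the issue entirely, if `U.exec_command` can accept one.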
```python
def fuse_rope(q, cu_seqlens, gathered=False):
    # worse precision than apex.
    # from megatron.core.extensions.transformer_engine import fused_apply_rotary_pos_emb_thd
    from apex.transformer.functional import fused_apply_rotary_pos_emb_thd
```
```python
    tp_group=self.pg_collection.tp,
)

self.index_topk = 2048
```
```python
hidden_size=self.config.index_head_dim,
config=self.config,
# The layernorm eps is hardcoded at the moment
eps=1e-6,
```
```python
if "self_attention.wq_b.weight" in mcore_weights_name:
    hf_names = self._weight_name_mapping_mcore_to_hf(mcore_weights_name)
    wq_b = mcore_weights
    wq_b = wq_b.view(-1, 128, wq_b.shape[-1])  # hard code 128
```
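The hardcoded 128 here is a per-head dim that could instead come from the model config. A pure-Python sketch of the reshape arithmetic behind `wq_b.view(-1, head_dim, last)` (the helper and the config field it stands in for are hypothetical): the inferred `-1` dimension is `total_elements // (head_dim * last_dim)`, and the divisibility assert catches a mismatched head dim early.

```python
# Compute the shape that view(-1, head_dim, last) would produce; a
# config-derived head_dim (e.g. a qk_head_dim field) would replace 128.
def inferred_view_shape(shape, head_dim):
    total = 1
    for d in shape:
        total *= d
    last = shape[-1]
    assert total % (head_dim * last) == 0, "tensor not divisible into heads"
    return (total // (head_dim * last), head_dim, last)

# A (512, 16) weight split into 128-dim heads yields 4 head rows.
assert inferred_view_shape((512, 16), 128) == (4, 128, 16)
```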
```
# Conflicts:
#   miles_plugins/mbridge/__init__.py
```
Acknowledgement
Many thanks to the GLM-5 team and to @zhuzilin and @lilei199908 for their great work and open-source efforts.
THUDM/slime#1599
Supported/verified features:
Known issues:
Usage
- Docker image: `radixark/miles:glm5`
- See `miles/scripts/run_glm5_744b_a40b.py` for script usage