
ML-GSAI/Width-Depth-muP


Spectral Condition for μP under Width–Depth Scaling

This is the official implementation for our paper "Spectral Condition for μP under Width–Depth Scaling". In this work, we:

  1. Establish a unified spectral condition for μP under joint width–depth scaling.

  2. Derive a general implementation recipe that translates spectral constraints into concrete hyperparameter (HP) parameterizations (e.g., learning rate, weight decay) across a broad class of optimizers.

  3. Validate the theory empirically on nanoGPT trained with Muon-Kimi, showing that the resulting μP formulation achieves scale-invariant feature learning and robust hyperparameter transfer.

In this repository, we provide the code and instructions to reproduce our experiments training nanoGPT with Muon-Kimi & AdamW (Fig. 1).

[Figure 1]

Dependencies and Dataset

Install dependencies:

pip install torch==2.1.0 transformers==4.33.0 datasets==3.6.0 tiktoken==0.9.0 numpy==1.26.4 wandb

Prepare the OpenWebText dataset following nanoGPT. This script downloads and tokenizes the dataset, creating a train.bin and a val.bin that hold the GPT-2 BPE token ids in one long sequence, stored as raw uint16 values. Then we're ready to kick off training.

python data/openwebtext/prepare.py
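For reference, the .bin files are flat arrays of uint16 token ids with no header. A minimal sketch of writing and reading that format using only the standard library (the file name sample.bin and the token ids are illustrative, not real OpenWebText data):

```python
import array

# GPT-2 BPE token ids fit in uint16 (vocab size 50257 < 65536).
tokens = [15496, 11, 995, 0]          # example ids, not real data

# Write them the way prepare.py does: raw native-endian uint16, no header.
buf = array.array("H", tokens)        # "H" = unsigned 16-bit
with open("sample.bin", "wb") as f:
    buf.tofile(f)

# Read them back as one flat sequence.
loaded = array.array("H")
with open("sample.bin", "rb") as f:
    loaded.fromfile(f, len(tokens))

print(list(loaded))                   # [15496, 11, 995, 0]
```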

Implementation of μP

To make everything easily searchable, each critical change needed to implement μP for Muon-Kimi & AdamW from the paper is marked with

### Begin muP code ###
<code for muP change>
### End muP code ###
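As a schematic illustration of the kind of change these markers wrap (a hypothetical helper, not the repo's actual code): under μP, the learning rate of hidden matrix-like parameters is rescaled by the width multiplier, while vector-like parameters keep the base rate. The exact exponents for Muon-Kimi and AdamW follow from the spectral condition derived in the paper; the 1/width rule below is the standard AdamW-style case, shown for intuition only.

```python
def mup_scaled_lr(base_lr: float, width_multiplier: float, param_kind: str) -> float:
    """Illustrative muP learning-rate rule (hypothetical, not from this repo).

    param_kind: "matrix" for hidden weight matrices, "vector" for biases,
    layernorm gains, and embeddings.
    """
    if param_kind == "matrix":
        # ### Begin muP code ###
        # Hidden weights: shrink the step size as width grows.
        return base_lr / width_multiplier
        # ### End muP code ###
    return base_lr

# Doubling width (width_multiplier = 2.0) halves the hidden-weight LR:
print(mup_scaled_lr(0.01, 2.0, "matrix"))  # 0.005
print(mup_scaled_lr(0.01, 2.0, "vector"))  # 0.01
```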

Coordinate-Check for Feature Learning

The "mup_examples" folder contains code to reproduce data points in Fig. 1 (a,b).

bash mup_examples/coord_check_moonshot_width/mup/run.sh
bash mup_examples/coord_check_moonshot_width/sp/run.sh
bash mup_examples/coord_check_moonshot_depth/mup/run.sh
bash mup_examples/coord_check_moonshot_depth/sp/run.sh

After that, use "coord_check.ipynb" to plot the figures.
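The coordinate check itself is conceptually simple: push an O(1)-sized input through a layer at several widths and verify that the typical activation size stays O(1). A toy standard-library version with μP-style 1/fan_in-variance initialization (a sketch for intuition, not the repo's script):

```python
import math, random

def rms(xs):
    """Root-mean-square coordinate size of a vector."""
    return math.sqrt(sum(v * v for v in xs) / len(xs))

def hidden_layer_output_rms(width, seed=0):
    """RMS of y = W x, where W is width x width with muP-style Var[W_ij] = 1/width."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]  # O(1)-sized input
    y = []
    for _ in range(width):
        row = [rng.gauss(0.0, 1.0 / math.sqrt(width)) for _ in range(width)]
        y.append(sum(w * v for w, v in zip(row, x)))
    return rms(y)

# The output RMS stays near 1 as width grows: the coordinate check passes.
for width in (128, 512):
    print(width, round(hidden_layer_output_rms(width), 2))
```

Under standard parameterization (Var[W_ij] = 1/fan_in missing, or a mis-scaled learning rate), the same plot blows up or vanishes with width, which is exactly what Fig. 1 (a, b) visualizes.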

HP Transfer

Width-wise Transfer (Fig. 1c)

SP:

# n_head and learning_rate are swept; n_embd = 64 x n_head.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_transfer_wf.py \
    --out_dir=out/transfer_moonshot_wf/h4_lr-6 \
    --wandb_run_name=transfer_moonshot_wf_h4_lr-6 \
    --n_head=4 \
    --n_embd=256 \
    --learning_rate=0.015625

μP:

# n_head and learning_rate are swept; n_embd = 64 x n_head, mup_width_multiplier = n_head / 4.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_mup_transfer_wf.py \
    --out_dir=out/transfer_moonshot_mup_wf/h4_lr-6 \
    --wandb_run_name=transfer_moonshot_mup_wf_h4_lr-6 \
    --n_head=4 \
    --n_embd=256 \
    --mup_width_multiplier=1.0 \
    --learning_rate=0.015625
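The swept flags above vary jointly while the others are derived from n_head. A small helper (hypothetical, mirroring the command above; the sweep values shown are placeholders, not the paper's grid) that generates the full flag grid:

```python
from itertools import product

def width_sweep_flags(heads=(4, 8, 16), lrs=(2**-6, 2**-5, 2**-4)):
    """Yield derived flags for each (n_head, learning_rate) sweep point.

    Derivations follow the command above:
    n_embd = 64 * n_head, mup_width_multiplier = n_head / 4.
    """
    for n_head, lr in product(heads, lrs):
        yield {
            "n_head": n_head,
            "n_embd": 64 * n_head,
            "mup_width_multiplier": n_head / 4,
            "learning_rate": lr,
        }

for cfg in width_sweep_flags(heads=(4,), lrs=(2**-6,)):
    print(cfg)  # {'n_head': 4, 'n_embd': 256, 'mup_width_multiplier': 1.0, 'learning_rate': 0.015625}
```

Note that 2^-6 = 0.015625, which is where the lr-6 suffix in the output directory names comes from.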

Depth-wise Transfer (Fig. 1d)

SP:

# n_layer and learning_rate are swept.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_transfer_df.py \
    --out_dir=out/transfer_moonshot_df/L4_lr-5 \
    --wandb_run_name=transfer_moonshot_df_L4_lr-5 \
    --n_layer=4 \
    --learning_rate=0.03125

μP:

# n_layer and learning_rate are swept; depth_multiplier = n_layer / 4.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_mup_transfer_df.py \
    --out_dir=out/transfer_moonshot_mup_df/L4_lr-5 \
    --wandb_run_name=transfer_moonshot_mup_df_L4_lr-5 \
    --n_layer=4 \
    --depth_multiplier=1.0 \
    --learning_rate=0.03125

Depth-wise Transfer without LayerNorm (Fig. 2 in paper)

SP:

# n_layer and learning_rate are swept.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_transfer_df.py \
    --wandb_project=Nanogpt-muP-noln-mooshot-transfer-df \
    --out_dir=out/transfer_noln_moonshot_sp_df/L4_lr-5 \
    --wandb_run_name=transfer_noln_moonshot_sp_df_L4_lr-5 \
    --layernorm=False \
    --n_layer=4 \
    --learning_rate=0.03125

μP:

# n_layer and learning_rate are swept; depth_multiplier = n_layer / 4.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_mup_transfer_df.py \
    --wandb_project=Nanogpt-muP-noln-mooshot-transfer-df \
    --out_dir=out/transfer_noln_moonshot_mup_df/L4_lr-5 \
    --wandb_run_name=transfer_noln_moonshot_mup_df_L4_lr-5 \
    --layernorm=False \
    --n_layer=4 \
    --depth_multiplier=1.0 \
    --learning_rate=0.03125

Acknowledgement

This project is based on the remarkable nanoGPT and CompleteP repos. Thanks for their great work!

Citation

If our paper "Spectral Condition for μP under Width–Depth Scaling" or this repository was useful to you, please cite:

TODO
