This is the official implementation of our paper "Spectral Condition for μP under Width–Depth Scaling". In this work, we:

- Establish a unified spectral condition for μP under joint width–depth scaling.
- Derive a general implementation recipe that translates spectral constraints into concrete hyperparameter (HP) parameterizations (e.g., learning rate, weight decay) across a broad class of optimizers.
- Validate the theory empirically on nanoGPT trained with Muon-Kimi, showing that the resulting μP formulation achieves scale-invariant feature learning and robust hyperparameter transfer.
This repository provides the code and instructions to reproduce our experiments training nanoGPT with Muon-Kimi & AdamW (Fig. 1).
Install dependencies:

```sh
pip install torch==2.1.0 transformers==4.33.0 datasets==3.6.0 tiktoken==0.9.0 numpy==1.26.4 wandb
```

Prepare the OpenWebText dataset following nanoGPT. This downloads and tokenizes the dataset, creating a `train.bin` and `val.bin` that hold the GPT-2 BPE token ids in one long sequence, stored as raw uint16 values. Then we're ready to kick off training.

```sh
python data/openwebtext/prepare.py
```

To make everything easily searchable, each of the critical changes implementing μP for Muon-Kimi & AdamW from the paper is marked with:
```
### Begin muP code ###
<code for muP change>
### End muP code ###
```

The `mup_examples` folder contains code to reproduce the data points in Fig. 1 (a, b).
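As a toy illustration of what one such marked change might look like (hypothetical names and a simplified scaling rule, not the repo's actual code; the exact exponents used in `train_moonshot.py` follow the paper's recipe):

```python
# Hypothetical sketch of a marked μP change: rescaling the hidden-matrix
# learning rate by the width multiplier so that per-step feature updates
# stay the same size at every width. `base_lr` and `mup_width_multiplier`
# are illustrative stand-ins, not the repo's actual variables.
base_lr = 0.015625          # learning rate tuned at the base width
mup_width_multiplier = 2.0  # e.g. n_head / 4 when width is doubled

### Begin muP code ###
# Shrink the hidden-layer learning rate as width grows.
hidden_lr = base_lr / mup_width_multiplier
### End muP code ###

print(hidden_lr)  # 0.0078125
```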
```sh
bash mup_examples/coord_check_moonshot_width/mup/run.sh
bash mup_examples/coord_check_moonshot_width/sp/run.sh
bash mup_examples/coord_check_moonshot_depth/mup/run.sh
bash mup_examples/coord_check_moonshot_depth/sp/run.sh
```

After that, use `coord_check.ipynb` to plot the figures.
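The transfer commands below tie several flags together: `n_embd = 64 × n_head`, `mup_width_multiplier = n_head / 4`, and `depth_multiplier = n_layer / 4` (the base shape is 4 heads, 4 layers, head dimension 64). A small sketch of that flag arithmetic (hypothetical helpers, not part of the repo):

```python
# Hypothetical helpers mirroring the flag arithmetic in the sweep commands.
def width_flags(n_head: int, base_n_head: int = 4, head_dim: int = 64) -> dict:
    """Flags for one point of the width-transfer sweep."""
    return {
        "n_head": n_head,
        "n_embd": head_dim * n_head,                  # 64 x n_head
        "mup_width_multiplier": n_head / base_n_head, # n_head / 4
    }

def depth_flags(n_layer: int, base_n_layer: int = 4) -> dict:
    """Flags for one point of the depth-transfer sweep."""
    return {
        "n_layer": n_layer,
        "depth_multiplier": n_layer / base_n_layer,   # n_layer / 4
    }

print(width_flags(4))   # {'n_head': 4, 'n_embd': 256, 'mup_width_multiplier': 1.0}
print(depth_flags(16))  # {'n_layer': 16, 'depth_multiplier': 4.0}
```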
SP (width transfer):

```sh
# --n_head and --learning_rate are swept; --n_embd is fixed to 64 * n_head.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_transfer_wf.py \
    --out_dir=out/transfer_moonshot_wf/h4_lr-6 \
    --wandb_run_name=transfer_moonshot_wf_h4_lr-6 \
    --n_head=4 \
    --n_embd=256 \
    --learning_rate=0.015625
```

μP:
```sh
# --n_head and --learning_rate are swept; --n_embd = 64 * n_head,
# --mup_width_multiplier = n_head / 4.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_mup_transfer_wf.py \
    --out_dir=out/transfer_moonshot_mup_wf/h4_lr-6 \
    --wandb_run_name=transfer_moonshot_mup_wf_h4_lr-6 \
    --n_head=4 \
    --n_embd=256 \
    --mup_width_multiplier=1.0 \
    --learning_rate=0.015625
```

SP (depth transfer):
```sh
# --n_layer and --learning_rate are swept.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_transfer_df.py \
    --out_dir=out/transfer_moonshot_df/L4_lr-5 \
    --wandb_run_name=transfer_moonshot_df_L4_lr-5 \
    --n_layer=4 \
    --learning_rate=0.03125
```

μP:
```sh
# --n_layer and --learning_rate are swept; --depth_multiplier = n_layer / 4.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_mup_transfer_df.py \
    --out_dir=out/transfer_moonshot_mup_df/L4_lr-5 \
    --wandb_run_name=transfer_moonshot_mup_df_L4_lr-5 \
    --n_layer=4 \
    --depth_multiplier=1.0 \
    --learning_rate=0.03125
```

SP (depth transfer, no LayerNorm):
```sh
# --n_layer and --learning_rate are swept; --layernorm=False removes LayerNorm.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_transfer_df.py \
    --wandb_project=Nanogpt-muP-noln-mooshot-transfer-df \
    --out_dir=out/transfer_noln_moonshot_sp_df/L4_lr-5 \
    --wandb_run_name=transfer_noln_moonshot_sp_df_L4_lr-5 \
    --layernorm=False \
    --n_layer=4 \
    --learning_rate=0.03125
```

μP:
```sh
# --n_layer and --learning_rate are swept; --depth_multiplier = n_layer / 4,
# --layernorm=False removes LayerNorm.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_mup_transfer_df.py \
    --wandb_project=Nanogpt-muP-noln-mooshot-transfer-df \
    --out_dir=out/transfer_noln_moonshot_mup_df/L4_lr-5 \
    --wandb_run_name=transfer_noln_moonshot_mup_df_L4_lr-5 \
    --layernorm=False \
    --n_layer=4 \
    --depth_multiplier=1.0 \
    --learning_rate=0.03125
```

This project builds on the remarkable nanoGPT and CompleteP repositories. Thanks for their great work!
If our paper "Spectral Condition for μP under Width–Depth Scaling" or this repository was useful to you, please cite:
TODO