This is the official implementation of our paper "Spectral Condition for μP under Width–Depth Scaling". In this work, we:

- Establish a unified spectral condition for μP under joint width–depth scaling.
- Derive a general implementation recipe that translates spectral constraints into concrete hyperparameter (HP) parameterizations (e.g., learning rate, weight decay) across a broad class of optimizers.
- Validate the theory empirically on nanoGPT trained with Muon-Kimi, showing that the resulting μP formulation achieves scale-invariant feature learning and robust hyperparameter transfer.
This repository provides the code and instructions to reproduce our experiments training nanoGPT with Muon-Kimi & AdamW (Fig. 1).
Install dependencies:

```sh
pip install torch==2.1.0 transformers==4.33.0 datasets==3.6.0 tiktoken==0.9.0 numpy==1.26.4 wandb
```

Prepare the OpenWebText dataset following nanoGPT. This downloads and tokenizes the dataset, creating a `train.bin` and `val.bin` that hold the GPT-2 BPE token ids in one long sequence, stored as raw uint16 values. Then we're ready to kick off training.

```sh
python data/openwebtext/prepare.py
```

To make everything easily searchable, each of the critical changes implementing μP for Muon-Kimi & AdamW from the paper is marked with:
```
### Begin muP code ###
<code for muP change>
### End muP code ###
```

The `mup_examples` folder contains code to reproduce the data points in Fig. 1 (a, b).
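As a toy illustration of what one such marked change might look like (hypothetical names and a simplified scaling rule, not the repo's actual code; the exact exponents used in `train_moonshot.py` follow the paper's recipe):

```python
# Hypothetical sketch of a marked μP change: rescaling the hidden-matrix
# learning rate by the width multiplier so that per-step feature updates
# stay the same size at every width. `base_lr` and `mup_width_multiplier`
# are illustrative stand-ins, not the repo's actual variables.
base_lr = 0.015625          # learning rate tuned at the base width
mup_width_multiplier = 2.0  # e.g. n_head / 4 when width is doubled

### Begin muP code ###
# Shrink the hidden-layer learning rate as width grows.
hidden_lr = base_lr / mup_width_multiplier
### End muP code ###

print(hidden_lr)  # 0.0078125
```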
```sh
bash mup_examples/coord_check_moonshot_width/mup/run.sh
bash mup_examples/coord_check_moonshot_width/sp/run.sh
bash mup_examples/coord_check_moonshot_depth/mup/run.sh
bash mup_examples/coord_check_moonshot_depth/sp/run.sh
```

After that, use `coord_check.ipynb` to plot the figures.
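The transfer commands below tie several flags together: `n_embd = 64 × n_head`, `mup_width_multiplier = n_head / 4`, and `depth_multiplier = n_layer / 4` (the base shape is 4 heads, 4 layers, head dimension 64). A small sketch of that flag arithmetic (hypothetical helpers, not part of the repo):

```python
# Hypothetical helpers mirroring the flag arithmetic in the sweep commands.
def width_flags(n_head: int, base_n_head: int = 4, head_dim: int = 64) -> dict:
    """Flags for one point of the width-transfer sweep."""
    return {
        "n_head": n_head,
        "n_embd": head_dim * n_head,                  # 64 x n_head
        "mup_width_multiplier": n_head / base_n_head, # n_head / 4
    }

def depth_flags(n_layer: int, base_n_layer: int = 4) -> dict:
    """Flags for one point of the depth-transfer sweep."""
    return {
        "n_layer": n_layer,
        "depth_multiplier": n_layer / base_n_layer,   # n_layer / 4
    }

print(width_flags(4))   # {'n_head': 4, 'n_embd': 256, 'mup_width_multiplier': 1.0}
print(depth_flags(16))  # {'n_layer': 16, 'depth_multiplier': 4.0}
```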
SP (width transfer):

```sh
# --n_head and --learning_rate are swept; --n_embd is fixed to 64 * n_head.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_transfer_wf.py \
    --out_dir=out/transfer_moonshot_wf/h4_lr-6 \
    --wandb_run_name=transfer_moonshot_wf_h4_lr-6 \
    --n_head=4 \
    --n_embd=256 \
    --learning_rate=0.015625
```

μP:
```sh
# --n_head and --learning_rate are swept; --n_embd = 64 * n_head,
# --mup_width_multiplier = n_head / 4.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_mup_transfer_wf.py \
    --out_dir=out/transfer_moonshot_mup_wf/h4_lr-6 \
    --wandb_run_name=transfer_moonshot_mup_wf_h4_lr-6 \
    --n_head=4 \
    --n_embd=256 \
    --mup_width_multiplier=1.0 \
    --learning_rate=0.015625
```

SP (depth transfer):
```sh
# --n_layer and --learning_rate are swept.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_transfer_df.py \
    --out_dir=out/transfer_moonshot_df/L4_lr-5 \
    --wandb_run_name=transfer_moonshot_df_L4_lr-5 \
    --n_layer=4 \
    --learning_rate=0.03125
```

μP:
```sh
# --n_layer and --learning_rate are swept; --depth_multiplier = n_layer / 4.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_mup_transfer_df.py \
    --out_dir=out/transfer_moonshot_mup_df/L4_lr-5 \
    --wandb_run_name=transfer_moonshot_mup_df_L4_lr-5 \
    --n_layer=4 \
    --depth_multiplier=1.0 \
    --learning_rate=0.03125
```

SP (depth transfer, no LayerNorm):
```sh
# --n_layer and --learning_rate are swept; --layernorm=False removes LayerNorm.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_transfer_df.py \
    --wandb_project=Nanogpt-muP-noln-mooshot-transfer-df \
    --out_dir=out/transfer_noln_moonshot_sp_df/L4_lr-5 \
    --wandb_run_name=transfer_noln_moonshot_sp_df_L4_lr-5 \
    --layernorm=False \
    --n_layer=4 \
    --learning_rate=0.03125
```

μP:
```sh
# --n_layer and --learning_rate are swept; --depth_multiplier = n_layer / 4,
# --layernorm=False removes LayerNorm.
torchrun --standalone --master_port=${MASTER_PORT} \
    --master_addr=${MASTER_ADDR} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    train_moonshot.py \
    config/train_gpt2_moonshot_mup_transfer_df.py \
    --wandb_project=Nanogpt-muP-noln-mooshot-transfer-df \
    --out_dir=out/transfer_noln_moonshot_mup_df/L4_lr-5 \
    --wandb_run_name=transfer_noln_moonshot_mup_df_L4_lr-5 \
    --layernorm=False \
    --n_layer=4 \
    --depth_multiplier=1.0 \
    --learning_rate=0.03125
```

This project builds on the remarkable nanoGPT and CompleteP repositories. Thanks for their great work!
If our paper "Spectral Condition for μP under Width–Depth Scaling" or this repository was useful to you, please cite:
TODO