This codebase helps Snellius users quickly set up their LLM pretraining tasks.
A few Snellius-specific pointers:
- GPU: NVIDIA H100 SXM5 94GB (4 GPUs per node)
- Interconnect: InfiniBand HDR200
- Persistent storage is provided upon grant agreement under /projects/0/prjsXXXX
- Temporary storage (/scratch-shared/$USER/) for model checkpointing and log saving
- Operating system: Red Hat Enterprise Linux 9.4 (Plow)
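As a rough illustration of how these resources map onto a batch job, a minimal job header might look like the following sketch. The partition name `gpu_h100` and the exact flag values are assumptions; consult the Snellius documentation and `sinfo` for the authoritative values for your account.

```shell
#!/bin/bash
# Minimal Snellius batch-job header (a sketch; partition name and flag
# values are assumptions -- check the Snellius docs for your account).
#SBATCH --job-name=llm-pretrain
#SBATCH --partition=gpu_h100       # assumed H100 partition name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4        # one task per GPU
#SBATCH --gpus-per-node=4          # each node has 4 H100s
#SBATCH --time=01:00:00

# Write checkpoints and logs to temporary scratch storage, as advised above.
CHECKPOINT_DIR=/scratch-shared/$USER/checkpoints
echo "Checkpoints will be written to $CHECKPOINT_DIR"
```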
To run this tutorial, clone this repository into your home directory:

```shell
git clone https://github.com/SURF-ML/Megatron-LM-Snellius
```

Please ensure that you obtain the following file hierarchy:
```
root (you are here)/
├── 0_build_container/ --- build the container
├── 1_download_data/ ----- download the dataset
├── 2_tokenization/ ------ prepare the tokens
├── 3_train/ ------------- train the model
└── Megatron-LM/ --------- Megatron-LM codebase submodule
```
If the Megatron-LM directory is empty, you can fetch the code with:

```shell
git submodule update --init
```

Megatron-LM is actively developed, and breaking changes might be introduced in the library. To avoid those changes, you can check out a particular commit of the library. If you encounter any issues with the Megatron library, try the following steps:
```shell
cd Megatron-LM
git checkout 1f6cde85d23ff0c307a47bbdd8bfd778b95a161f
cd ..
```

Most of the scripts in this tutorial will ask you to specify the path to your project space. This path can be set in the bash files as:
```shell
export PROJECT_SPACE=/projects/0/prjsXXXX
```

You can also export this path in your `.bashrc` so the project space persists across sessions and you can always change directory easily with `cd $PROJECT_SPACE`:
```shell
echo 'export PROJECT_SPACE=/projects/0/prjsXXXX' >> ~/.bashrc
source ~/.bashrc
```
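Because later scripts rely on this variable, it can help to fail fast when it is missing. A small guard along these lines (a hypothetical helper, not part of the repo) could sit at the top of your bash scripts:

```shell
# Hypothetical guard (not part of this repo): verify that PROJECT_SPACE
# is set and points to an existing directory before doing any work.
require_project_space() {
    if [ -z "${PROJECT_SPACE:-}" ]; then
        echo "PROJECT_SPACE is not set; run: export PROJECT_SPACE=/projects/0/prjsXXXX" >&2
        return 1
    fi
    if [ ! -d "$PROJECT_SPACE" ]; then
        echo "PROJECT_SPACE ($PROJECT_SPACE) is not a directory" >&2
        return 1
    fi
    echo "Using project space: $PROJECT_SPACE"
}

# Example: succeeds when the variable points at an existing directory.
PROJECT_SPACE=/tmp require_project_space
```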
The tutorial consists of 4 parts:
1. `0_build_container/`: build the container
2. `1_download_data/`: download the dataset
3. `2_tokenization/`: prepare the tokens
4. `3_train/`: train the model
Thanks to @spyysalo, whose original LUMI Megatron-LM guide tremendously helped this one.