What does it take to train massive foundation LLMs (>1B parameters)?
We are essentially distilling trillions of bytes of information into roughly 0.001 times the storage. This requires a rather sophisticated compression algorithm to retain the important aspects of the data! What gets retained and what gets lost? I've had ample experience fine-tuning with and without adapters. That procedure, however, operates at a stage where the model has already absorbed basic language modelling capabilities such as grammar, typography, and sentence structure. This project will explore the fundamentals of pretraining, and of training at scale in general. I will start by exploring the methodologies that make training at scale possible.
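The ~0.001 compression figure can be sanity-checked with a back-of-the-envelope calculation. The numbers below (corpus size, tokenizer byte rate, model scale) are illustrative assumptions, not measurements from this project:

```python
# Back-of-the-envelope compression ratio for LLM pretraining.
# All figures are illustrative assumptions, not measurements.

corpus_tokens = 10e12     # assumed: ~10T training tokens
bytes_per_token = 4       # assumed: rough average bytes per BPE token
corpus_bytes = corpus_tokens * bytes_per_token  # ~40 TB of raw text

params = 7e9              # assumed: a 7B-parameter model
bytes_per_param = 2       # fp16/bf16 weights
model_bytes = params * bytes_per_param          # ~14 GB of weights

ratio = model_bytes / corpus_bytes
print(f"corpus: {corpus_bytes / 1e12:.0f} TB, model: {model_bytes / 1e9:.0f} GB")
print(f"compression ratio: {ratio:.1e}")
```

Depending on the assumed corpus and model sizes, the ratio lands somewhere between 1e-4 and 1e-3, consistent with the rough 0.001 figure above.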
[DISCLAIMER] Some projects in this repository, such as MLA, are not used directly during training, but I believe they are still useful for scaling these models.