What does it take to train massive foundation LLMs (>1B parameters)?
We are essentially distilling trillions of bytes of information into roughly 0.001 times the storage. This requires a rather sophisticated compression algorithm to retain the important aspects of the data! What gets retained and what gets lost? I've had ample experience fine-tuning with and without adapters. That procedure, however, operates at a stage where the model has already absorbed basic language modelling capabilities such as grammar, typography, and sentence structure. This project will explore the fundamentals of pretraining, and of training at scale in general. I will start by exploring the methodologies that make training at scale possible.
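The ~0.001 compression figure can be sanity-checked with a back-of-the-envelope calculation. The numbers below (corpus size, tokenizer byte rate, model scale) are illustrative assumptions, not measurements from this project:

```python
# Back-of-the-envelope compression ratio for LLM pretraining.
# All figures are illustrative assumptions, not measurements.

corpus_tokens = 10e12     # assumed: ~10T training tokens
bytes_per_token = 4       # assumed: rough average bytes per BPE token
corpus_bytes = corpus_tokens * bytes_per_token  # ~40 TB of raw text

params = 7e9              # assumed: a 7B-parameter model
bytes_per_param = 2       # fp16/bf16 weights
model_bytes = params * bytes_per_param          # ~14 GB of weights

ratio = model_bytes / corpus_bytes
print(f"corpus: {corpus_bytes / 1e12:.0f} TB, model: {model_bytes / 1e9:.0f} GB")
print(f"compression ratio: {ratio:.1e}")
```

Depending on the assumed corpus and model sizes, the ratio lands somewhere between 1e-4 and 1e-3, consistent with the rough 0.001 figure above.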
[DISCLAIMER] Some projects in this repository, such as MLA, are not used directly during training, but I believe they are still useful for scaling these models.