LM-from-scratch

An end-to-end PyTorch pipeline for training GPT-style transformers—from data preparation and tokenization through distributed pretraining (DDP, FSDP1/FSDP2) to supervised fine-tuning—with minimal dependencies. Optimized for GPUs with 20–30 GB VRAM.

Overview

LM-from-scratch offers a hands-on, minimal-dependency training workflow using only PyTorch and Transformers. Ideal for developers who want full control and transparency, it runs efficiently on modest multi-GPU setups and is ready for scaling up.

Features

🎯 Multiple Training Modes
Pretrain with DDP, FSDP1, or FSDP2, or run supervised fine-tuning (SFT), all without high-level training wrappers.
🔄 Adaptive Batching by Token Count
Employs bucketed batching that balances memory usage across context lengths.
💾 Flexible Checkpointing
Implements two distributed checkpointing strategies.
🧰 Custom Collation & Masked Loss
Special-token-aware collator for instruction-tuning corpora with masked loss on assistant tokens.
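
As a rough illustration of the masked-loss idea, the sketch below builds labels so that only tokens inside assistant spans contribute to the loss. It is not the repository's collator: it reads "masked loss on assistant tokens" as "loss computed only on assistant tokens", and the function names (`build_labels`, `collate`) and the marker token ids are hypothetical. It works on plain Python lists; converting to tensors and any next-token shift are left to the training code.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(input_ids, assistant_start_id, assistant_end_id):
    """Copy input_ids into labels, keeping only tokens inside assistant spans.

    assistant_start_id / assistant_end_id are hypothetical special-token ids
    that delimit an assistant turn.
    """
    labels = []
    in_assistant = False
    for tok in input_ids:
        if tok == assistant_start_id:
            in_assistant = True
            labels.append(IGNORE_INDEX)  # the start marker itself is not a target
        elif tok == assistant_end_id and in_assistant:
            labels.append(tok)  # keep the end marker so the model learns to stop
            in_assistant = False
        else:
            labels.append(tok if in_assistant else IGNORE_INDEX)
    return labels


def collate(batch, pad_id, assistant_start_id, assistant_end_id):
    """Right-pad a batch of token-id lists and build matching masked labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, labels, attention_mask = [], [], []
    for ids in batch:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        masked = build_labels(ids, assistant_start_id, assistant_end_id)
        labels.append(masked + [IGNORE_INDEX] * pad)  # padding never counts
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "labels": labels, "attention_mask": attention_mask}
```

Positions labeled `-100` are skipped by a cross-entropy loss that uses that value as its ignore index, so prompt tokens, padding, and the start marker produce no gradient.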

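The token-count batching listed among the features can be pictured with a minimal sketch like the one below. It assumes a simple scheme in which samples are grouped into length buckets and each bucket's batch size is chosen so that batch size times the bucket's maximum length stays within a fixed token budget. The function name `bucket_batches`, the bucket bounds, and the budget are all hypothetical; the repository's sampler may work differently.

```python
from collections import defaultdict


def bucket_batches(lengths, bucket_bounds, max_tokens):
    """Group sample indices into batches whose padded size fits a token budget.

    lengths: token count of each sample.
    bucket_bounds: ascending upper bounds on length, one per bucket (assumed values).
    max_tokens: budget for batch_size * bucket upper bound (assumed value).
    """
    buckets = defaultdict(list)
    for idx, n in enumerate(lengths):
        for bound in bucket_bounds:
            if n <= bound:
                buckets[bound].append(idx)  # smallest bucket that fits the sample
                break
        else:
            raise ValueError(f"sample {idx} has {n} tokens, above the largest bucket")

    batches = []
    for bound in bucket_bounds:
        # Longer contexts get fewer samples per batch; never drop below one.
        per_batch = max(1, max_tokens // bound)
        members = buckets.get(bound, [])
        for start in range(0, len(members), per_batch):
            batches.append(members[start:start + per_batch])
    return batches
```

For example, `bucket_batches([100, 120, 500, 90, 1000, 480], [128, 512, 1024], 1024)` puts the three short samples in one batch, the two mid-length samples in another, and the long sample alone. A sampler used for real training would also shuffle within buckets and shard batches across ranks, which this sketch omits.
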
Quick Start

git clone https://github.com/dastin359/LM-from-scratch.git
cd LM-from-scratch
pip install -r requirements.txt

# Pretrain using DDP on 4 GPUs
torchrun --nproc_per_node=4 ddp_pretrain.py \
  --save_dir=my_run --log_dir=my_log

# Pretrain using FSDP1 on 4 GPUs
torchrun --nproc_per_node=4 fsdp1_pretrain.py \
  --save_dir=my_run --log_dir=my_log

# Pretrain using FSDP2 on 4 GPUs
torchrun --nproc_per_node=4 fsdp2_pretrain.py \
  --save_dir=my_run --log_dir=my_log

# Run supervised fine-tuning with FSDP2 on 4 GPUs
torchrun --nproc_per_node=4 fsdp2_sft.py \
  --save_dir=sft_run --log_dir=sft_log
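
Each process started by `torchrun` receives its position in the job through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below shows one way a script launched with the commands above might read those values together with the `--save_dir` and `--log_dir` flags. It assumes `argparse` is used and introduces a hypothetical helper `parse_launch`; the actual scripts may parse their arguments differently.

```python
import argparse
import os


def parse_launch(argv=None, env=None):
    """Read the CLI flags from the commands above plus torchrun's rank variables.

    argv and env are injectable for testing; by default the real command line
    and os.environ are used.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--save_dir", required=True)
    parser.add_argument("--log_dir", required=True)
    args = parser.parse_args(argv)
    return {
        "save_dir": args.save_dir,
        "log_dir": args.log_dir,
        # Fall back to single-process values when not launched by torchrun.
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }
```

With `--nproc_per_node=4` on a single node, the four processes would see `WORLD_SIZE=4` and ranks 0 through 3. A common pattern is to let only rank 0 write logs to the log directory.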

