A small decoder-only transformer built from scratch in PyTorch, trained on Edgar Allan Poe's complete works.
4.8 million parameters. Runs on a laptop. Produces text that sounds like gothic horror written by someone having a stroke.
- scrape.py - Scrapes 131 works (stories + poems) from poemuseum.org
- tokenizer.py - Character-level and BPE tokenizers, no libraries
- model.py - GPT-style transformer (multi-head attention, pre-norm, weight tying)
- train.py - Training loop with AdamW, cosine schedule, checkpointing
- generate.py - Load a model and generate text with temperature/top-k sampling
- data.py - Corpus loader
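The "no libraries" BPE tokenizer works by repeatedly merging the most frequent adjacent pair of symbols until the vocabulary reaches its target size. A minimal sketch of that training loop (names and structure assumed, not taken from `tokenizer.py`):

```python
# Hedged sketch of a byte-pair-encoding training loop, stdlib only.
# This is an illustration of the technique, not the repo's actual code.
from collections import Counter

def train_bpe(text, vocab_size):
    """Learn merge rules until the vocab reaches vocab_size symbols."""
    tokens = list(text)                      # start from single characters
    vocab = set(tokens)
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break                            # nothing left to merge
        (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
        merged = a + b
        merges.append((a, b))
        vocab.add(merged)
        out, i = [], 0                       # apply the merge everywhere
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merges, vocab

merges, vocab = train_bpe("the raven never flitting still is sitting", 40)
```

Encoding a new string then just replays `merges` in order; with `--bpe-vocab-size 512` the same loop runs over the full Poe corpus instead of one sentence.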
```sh
python3 -m venv .venv && source .venv/bin/activate
pip install torch numpy requests beautifulsoup4

# Scrape the corpus
python scrape.py

# Train (char-level)
python train.py --tokenizer char --epochs 5

# Train (BPE)
python train.py --tokenizer bpe --bpe-vocab-size 512 --epochs 5

# Generate
python generate.py --prompt "Once upon a midnight dreary"
python generate.py --checkpoint checkpoints/bpe_best.pt \
    --tokenizer-path checkpoints/bpe_tokenizer.json \
    --tokenizer-type bpe \
    --prompt "I found myself in a dark chamber"
```

Char-level, epoch 5:
The death of a beautiful woman, through the crowd, through seventy
thousand ways to shall examine the blood-room in the pitcher bere
intended to transmiss of it the very first tenef o'clock, through a
shadow and hard, rumors and his parcel, into the innermost regions
of impetuous apparatus
BPE, epoch 4:
In this stage of my beeting I became aware of a dull, sullen
glow-satisfied with a fashion of great genius, pursues of the kingdom
BPE, epoch 5:
Once upon a midnight dreary upon the lips of the axis. It was then,
fully the musical inclined atmosphere — a small portion of the main
drift to the glittering of the night.
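The temperature/top-k sampling that produced these samples can be sketched in a few lines: divide the logits by the temperature, optionally zero out everything outside the k most likely tokens, then draw from the resulting softmax. This is a stdlib-only illustration with assumed names, not the code in `generate.py`:

```python
# Hedged sketch of temperature + top-k sampling over raw logits.
import math
import random

def sample(logits, temperature=1.0, top_k=None):
    """Pick a token index from raw logits."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Keep only the top_k largest logits; mask the rest out.
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [l if l >= cutoff else float("-inf") for l in scaled]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.random()                        # inverse-CDF draw
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

idx = sample([2.0, 1.0, 0.1, -3.0], temperature=0.8, top_k=2)
```

Lower temperatures sharpen the distribution toward the argmax; `top_k` caps how far down the tail the sampler can reach, which is what keeps the gothic word salad at least locally coherent.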
| | Char | BPE |
|---|---|---|
| Vocab size | 146 | 512 |
| d_model | 256 | 256 |
| Heads | 8 | 8 |
| Layers | 6 | 6 |
| Params | 4,809,216 | 4,902,912 |
| Training time (5 epochs, MPS) | ~7.5h | ~3.5h |
Both models overfit after the first epoch: validation loss bottoms out at epoch 1 and climbs from there. With 4.8M parameters and only ~1.9M characters of training data, the model has more capacity than data.
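One sanity check on the table above: with weight tying, the only parameters that depend on vocab size are the embedding rows, so the two models should differ by exactly `(512 - 146) * d_model` parameters. The arithmetic checks out:

```python
# Verify the param-count gap in the table is purely the extra
# embedding rows for the larger BPE vocab (tied embeddings).
d_model = 256
char_params, bpe_params = 4_809_216, 4_902_912
extra_rows = 512 - 146                      # extra vocab entries for BPE
assert bpe_params - char_params == extra_rows * d_model  # 93,696
```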
Written up at danieljohnmorris.com/writing/building-a-tiny-llm-from-scratch.