A from-scratch implementation of the original transformer from Vaswani et al., 2017 ("Attention Is All You Need"). Encoder-decoder architecture trained on English-to-French translation.
11.5 million parameters. Runs on a laptop. Translates simple sentences with surprising accuracy.
- model.py - Encoder-decoder transformer (multi-head attention, sinusoidal positional encoding, cross-attention); a positional-encoding sketch follows this list
- tokenizer.py - Word-level tokenizers for source and target languages
- data.py - Parallel corpus loader and batching
- train.py - Training loop with Noam LR schedule and label smoothing (schedule sketched after the settings table below)
- translate.py - Greedy decoding with interactive mode (decode loop sketched after the sample translations)
- download_data.py - Downloads Tatoeba English-French sentence pairs
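The positional encoding in model.py is the paper's fixed sinusoid: even dimensions get sin(pos / 10000^(2i/d_model)), odd dimensions get the matching cos. A minimal sketch of the idea (the function name and `max_len` are illustrative, not necessarily what model.py uses):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
    # 1 / 10000^(2i/d_model), computed in log space for numerical stability
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even indices
    pe[:, 1::2] = torch.cos(position * div_term)  # odd indices
    return pe  # added to the token embeddings; contributes no learned parameters
```

Because the table is fixed, it adds nothing to the 11.5M parameter count; that budget is all embeddings and attention/FFN weights.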
```bash
python3 -m venv .venv && source .venv/bin/activate
pip install torch numpy

# Download the data (240k English-French sentence pairs from Tatoeba)
python download_data.py

# Train (uses 50k pairs, ~40 minutes on Apple MPS)
python train.py --epochs 20

# Translate
python translate.py --sentence "The cat is on the table."
python translate.py --interactive
```

Sample translations:

```text
EN: The cat is on the table. FR: le chat est sur la table.
EN: I love you. FR: je t'aime.
EN: She has a beautiful house. FR: elle a une belle maison.
EN: Where is the train station? FR: où est la gare?
EN: I don't understand. FR: je ne comprends pas.
EN: He likes to read books. FR: il aime lire des livres.
EN: It is raining. FR: il pleut.
EN: We are happy. FR: nous sommes heureux.
```
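translate.py's greedy decoder picks the argmax token at each step until end-of-sequence. A sketch under assumed interfaces: `model.encode`, `model.decode`, and the `bos_id`/`eos_id` arguments are hypothetical names, not necessarily the repo's API:

```python
import torch

@torch.no_grad()
def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
    """Grow the target sequence one argmax token at a time."""
    memory = model.encode(src_ids)                   # (1, src_len, d_model)
    ys = torch.tensor([[bos_id]], dtype=torch.long)  # start with BOS
    for _ in range(max_len - 1):
        logits = model.decode(ys, memory)            # (1, tgt_len, vocab)
        next_id = logits[0, -1].argmax().item()      # most likely next token
        ys = torch.cat([ys, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:                        # stop at end-of-sequence
            break
    return ys.squeeze(0).tolist()
```

Greedy decoding is the simplest option and holds up well on short Tatoeba-style sentences like the ones above; beam search would be the usual next step.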
| Setting | Value |
|---|---|
| d_model | 256 |
| Heads | 8 |
| Encoder layers | 4 |
| Decoder layers | 4 |
| d_ff | 512 |
| Source vocab | 6,033 |
| Target vocab | 9,201 |
| Parameters | 11,536,113 |
| Training time (20 epochs, MPS) | ~40 min |
| Best val loss | 2.6820 (epoch 20) |
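The schedule behind the ~40-minute run is the Noam schedule from the paper: lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5), i.e. linear warmup followed by inverse-square-root decay. A sketch of how it plugs into PyTorch (warmup = 4000 and the Adam betas are the paper's values; whether train.py uses exactly these is an assumption):

```python
import torch

def noam_lr(step: int, d_model: int = 256, warmup: int = 4000) -> float:
    """lr = d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)."""
    step = max(step, 1)  # LambdaLR calls with step 0 first; avoid 0**-0.5
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, noam_lr)
# (base lr is 1.0, so the lambda's return value is the actual learning rate)
```

For the label-smoothing half, PyTorch's built-in `nn.CrossEntropyLoss(label_smoothing=0.1)` covers the common case (0.1 is the paper's value); whether train.py uses that or a hand-rolled KL-divergence loss isn't shown here.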
Unlike decoder-only models trained on small corpora, this model doesn't overfit aggressively. With 50k sentence pairs and 11.5M parameters, validation loss keeps improving through all 20 epochs.
Written up at danieljohnmorris.com/writing/attention-is-all-you-need.