Attention Is All You Need

A from-scratch implementation of the original transformer from Vaswani et al., 2017. Encoder-decoder architecture trained on English-to-French translation.

11.5 million parameters. Runs on a laptop. Translates simple sentences with surprising accuracy.

What's in here

  • model.py - Encoder-decoder transformer (multi-head attention, sinusoidal positional encoding, cross-attention); the positional encoding is sketched after this list
  • tokenizer.py - Word-level tokenizers for source and target languages
  • data.py - Parallel corpus loader and batching
  • train.py - Training loop with Noam LR schedule and label smoothing
  • translate.py - Greedy decoding with interactive mode
  • download_data.py - Downloads Tatoeba English-French sentence pairs

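The positional encoding injects order information by adding fixed sine and cosine waves to the token embeddings. A minimal sketch of the paper's formula, assuming an even d_model; model.py's actual implementation may differ in details:

import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)    # (max_len, 1)
    inv_freq = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))              # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * inv_freq)   # even dimensions get sine
    pe[:, 1::2] = torch.cos(position * inv_freq)   # odd dimensions get cosine
    return pe   # added to token embeddings; fixed, not learned
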
Quick start

python3 -m venv .venv && source .venv/bin/activate
pip install torch numpy

# Download the data (240k English-French sentence pairs from Tatoeba)
python download_data.py

# Train (uses 50k pairs, ~40 minutes on Apple MPS)
python train.py --epochs 20

# Translate
python translate.py --sentence "The cat is on the table."
python translate.py --interactive
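
Greedy decoding in translate.py means picking the single highest-probability token at each step and feeding the growing prefix back into the decoder. A minimal sketch of the loop; the encode/decode method names and the bos_id/eos_id arguments are assumptions, not the repo's actual API:

import torch

def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
    # src_ids: (1, src_len) tensor of source token ids
    model.eval()
    with torch.no_grad():
        memory = model.encode(src_ids)               # hypothetical encoder call
        ys = torch.tensor([[bos_id]], device=src_ids.device)
        for _ in range(max_len - 1):
            logits = model.decode(ys, memory)        # hypothetical decoder call -> (1, len, vocab)
            next_id = logits[0, -1].argmax().item()  # greedy: take the argmax token
            ys = torch.cat([ys, torch.tensor([[next_id]], device=src_ids.device)], dim=1)
            if next_id == eos_id:                    # stop at end-of-sequence
                break
    return ys[0].tolist()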

Sample output (epoch 20)

EN: The cat is on the table.     FR: le chat est sur la table.
EN: I love you.                  FR: je t'aime.
EN: She has a beautiful house.   FR: elle a une belle maison.
EN: Where is the train station?  FR: où est la gare?
EN: I don't understand.          FR: je ne comprends pas.
EN: He likes to read books.      FR: il aime lire des livres.
EN: It is raining.               FR: il pleut.
EN: We are happy.                FR: nous sommes heureux.

Model config

Setting                          Value
d_model                          256
Heads                            8
Encoder layers                   4
Decoder layers                   4
d_ff                             512
Source vocab                     6,033
Target vocab                     9,201
Parameters                       11,536,113
Training time (20 epochs, MPS)   ~40 min
Best val loss                    2.6820 (epoch 20)
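
For scale, these numbers map almost one-to-one onto the constructor of torch.nn.Transformer. The repo builds its own modules, so this is illustrative only; with the two embedding tables and the output projection it should land near the 11.5M parameter count above:

import torch.nn as nn

transformer = nn.Transformer(
    d_model=256,             # embedding width
    nhead=8,                 # attention heads
    num_encoder_layers=4,
    num_decoder_layers=4,
    dim_feedforward=512,     # d_ff
    batch_first=True,
)
src_embed = nn.Embedding(6033, 256)   # source vocabulary
tgt_embed = nn.Embedding(9201, 256)   # target vocabulary
generator = nn.Linear(256, 9201)      # decoder output -> target vocab logits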

Unlike decoder-only models trained on small corpora, this model doesn't overfit aggressively. With 50k sentence pairs and 11.5M parameters, validation loss keeps improving through all 20 epochs.
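
Two train.py ingredients help with that. Label smoothing spreads a little probability mass away from the gold token, and the Noam schedule warms the learning rate up linearly before decaying it with the inverse square root of the step count. A minimal sketch; warmup=4000 is the paper's value and PAD_ID is an assumption, the repo's settings may differ:

import torch.nn as nn

def noam_lr(step: int, d_model: int = 256, warmup: int = 4000) -> float:
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

PAD_ID = 0   # assumption: the tokenizer's pad token id
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=PAD_ID)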

Blog post

Written up at danieljohnmorris.com/writing/attention-is-all-you-need.
