A working, from-scratch PyTorch implementation of a Byte Latent Transformer (BLT) (Meta FAIR, Pagnoni et al. 2024 — "Patches Scale Better Than Tokens", arXiv:2412.09871), trained end-to-end on TinyStories on a single 6 GB consumer GPU.
BLT is tokenizer-free: it reads raw bytes (0–255) and dynamically groups them into patches, spending compute where the next byte is hard to predict and skimming where it's easy. No BPE, no vocab file — just bytes.
Trained model: 🤗 sssssaud/blt-llm-tinystories-55m
| Metric | Value |
|---|---|
| Held-out bits-per-byte (BPB) | 0.71 (best 0.7078) |
| Untrained baseline | 8.0 (= log₂256) |
| Main model | 55.4M params |
| Entropy patcher (frozen) | 1.7M params |
| Trained on | TinyStoriesV2-GPT4 train split (~2.2B bytes), 20k steps, ~2.4 h on an RTX 3050 6 GB |
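
The BPB figures in the table can be related to an ordinary cross-entropy loss with a short conversion. This is a minimal sketch, assuming the training loop reports mean per-byte cross-entropy in nats (the PyTorch default); the function names are illustrative and not taken from the repository.

```python
import math

def bits_per_byte(mean_nll_nats: float) -> float:
    """Convert a mean per-byte cross-entropy in nats to bits per byte."""
    return mean_nll_nats / math.log(2)

def nats_from_bpb(bpb: float) -> float:
    """Inverse conversion: bits per byte back to nats per byte."""
    return bpb * math.log(2)

# A model that assigns probability 1/256 to every byte has loss ln(256) nats,
# which is the 8.0 BPB "untrained baseline" row of the table.
uniform_bpb = bits_per_byte(math.log(256))

# The reported 0.71 BPB corresponds to roughly 0.49 nats per byte.
reported_nats = nats_from_bpb(0.71)
```

Because every byte is a prediction target, BPB needs no tokenizer-dependent normalization, which makes it comparable across byte-level and token-level models.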
Sample (prompt "Once upon a time"):
Once upon a time, in a small house, there was a boy named Tim. One day, Tim went to the store with his mom. They needed to buy a toy… Tim said to his mom, "Mom, can I give t…
git clone https://github.com/shaikh-saud705/blt-llm.git
cd blt-llm
git lfs pull # fetch the 222 MB trained weights (needs git-lfs)
pip install -r requirements.txt
# generate text from the trained model:
python -m blt.generate --prompt "Once upon a time" --max-new 300 --temperature 0.7
Generation needs three files in checkpoints/ (all included): blt_model_weights.pt (the model), entropy_model.pt (frozen patcher), and patcher_threshold.json (θ). It runs on CPU if you have no GPU, just more slowly.
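
What the `--temperature` flag typically controls can be shown with a small sampler over the 256 next-byte logits. This is a sketch in plain Python, not the code in `blt.generate`; the function name `sample_byte` and the greedy fallback at temperature 0 are assumptions made for illustration.

```python
import math
import random

def sample_byte(logits, temperature=0.7, rng=random):
    """Sample one byte value from next-byte logits after temperature scaling."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy (argmax) decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Toy logits: byte 65 ("A") is overwhelmingly preferred.
logits = [0.0] * 256
logits[65] = 1000.0
next_byte = sample_byte(logits, temperature=0.7)
```

Temperatures below 1 sharpen the distribution toward the most likely byte; values above 1 flatten it.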
Three modules joined by two cross-attentions; all attention masks are derived from
patch_ids (variable-length patches — no fixed reshape):
bytes → Local Encoder (1×256, windowed) ──┐
├─ encoder cross-attn: pool bytes → patches (seq shortens)
Latent Global Transformer (6×768, block-causal over patches) ← holds the bulk
├─ decoder cross-attn: expand patches → bytes (seq grows)
bytes ← Local Decoder (4×256, windowed) ──┘ → 256-way next-byte logits
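
To illustrate how attention masks can be derived from `patch_ids` without a fixed reshape, here is a sketch using plain Python lists. The function names are hypothetical, and the exact rules are one reading of the description in this README: each patch pools only its own bytes, the global transformer is causal over patch positions, and each byte reads only the output of the previous patch. How bytes in the first patch are handled (they have no previous patch) is left open here.

```python
def encoder_pool_mask(patch_ids):
    """mask[p][i] is True when patch p may pool byte i (byte i belongs to patch p)."""
    n_patches = max(patch_ids) + 1
    return [[pid == p for pid in patch_ids] for p in range(n_patches)]

def global_causal_mask(n_patches):
    """mask[q][k] is True when patch q may attend to patch k (k not in the future)."""
    return [[k <= q for k in range(n_patches)] for q in range(n_patches)]

def decoder_expand_mask(patch_ids):
    """mask[i][p] is True when byte i may read patch p, here only the previous patch."""
    n_patches = max(patch_ids) + 1
    return [[p == pid - 1 for p in range(n_patches)] for pid in patch_ids]

# Nine bytes grouped into three variable-length patches.
patch_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]
enc = encoder_pool_mask(patch_ids)    # shape: 3 patches x 9 bytes
glob = global_causal_mask(3)          # shape: 3 patches x 3 patches
dec = decoder_expand_mask(patch_ids)  # shape: 9 bytes x 3 patches
```

Since a byte in patch p reads only the representation of patch p-1, which was pooled from bytes that all precede it, no information about the current or later bytes can reach its prediction.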
- Dynamic patching by a separate, frozen entropy byte-LM that scores next-byte Shannon entropy; a byte starts a new patch when H > θ (θ=1.09 → avg ≈ 4.4 bytes/patch). It's causal, so it works during generation.
- Shared style: RMSNorm, SwiGLU, RoPE (base θ=500000) in self-attention only, tied byte embedding/output. Hash n-gram embeddings (n=3..8). k = hG/hE = 768/256 = 3.
- Strictly causal: the decoder cross-attn has each byte attend to the previous patch's output, and the implementation passes a no-future-leakage gate (blt/tests/test_causality.py: gradient + perturbation tests, exact-zero leakage).
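
The entropy-threshold rule in the first bullet can be sketched as follows. This is an illustration under stated assumptions rather than the code in `patcher.py`: the function names are hypothetical, entropy is computed in bits (the README does not say whether θ=1.09 is in bits or nats), and `entropies[i]` is taken to be the entropy of the patcher's prediction for byte i given only the bytes before it.

```python
import math

def shannon_entropy_bits(probs):
    """Shannon entropy (in bits) of one next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def patch_ids_from_entropy(entropies, theta):
    """Assign a patch index to every byte: a new patch starts when H > theta."""
    ids = []
    current = -1
    for i, h in enumerate(entropies):
        # The first byte always opens a patch; later bytes open one when the
        # patcher was uncertain about them.
        if i == 0 or h > theta:
            current += 1
        ids.append(current)
    return ids

# Toy per-byte entropies (made-up values, not model output).
entropies = [2.0, 0.3, 0.2, 1.5, 0.4, 1.2, 0.1, 0.1, 0.5]
ids = patch_ids_from_entropy(entropies, theta=1.09)
print(ids)                       # [0, 0, 0, 1, 1, 2, 2, 2, 2]
print(len(ids) / (ids[-1] + 1))  # 3.0 bytes per patch on this toy input
```

Raising θ yields fewer, longer patches (less global-transformer compute); lowering it yields more, shorter patches.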
blt/
config.py # tiny + training configs
data.py # TinyStories loading + byte batching
entropy_model.py # the small frozen byte-LM (patcher)
patcher.py # entropy → patch boundaries; threshold tuning
ngram_hash.py # RollPolyHash n-gram embeddings
encoder.py # local encoder + encoder cross-attention (pooling)
global_transformer.py# latent global transformer
decoder.py # local decoder + decoder cross-attention (expanding)
model.py # assembles the BLT
train.py # end-to-end training loop (BPB, checkpoints, resume)
generate.py # autoregressive byte generation (causal patching)
tests/ # shapes, overfit, real-patcher, causality gate
checkpoints/ # trained weights (blt_model_weights.pt via git-LFS) + patcher + θ
BLT_LLM.md # the full build brief / spec
PROGRESS.md # build log (what's done, how to reproduce)
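
For readers unfamiliar with the hash n-gram embeddings that `ngram_hash.py` provides, the idea can be sketched as a polynomial hash of the n bytes ending at each position, reduced to an index into an embedding table. The function name, multiplier, and table size below are illustrative assumptions, not the repository's values, and the hash is recomputed per position for clarity rather than updated in a rolling fashion.

```python
def ngram_hash_ids(data: bytes, n: int, table_size: int, base: int = 1_000_003):
    """Hash the n-gram ending at each byte position into [0, table_size).

    Positions with fewer than n bytes of history get None.
    """
    ids = []
    for i in range(len(data)):
        if i + 1 < n:
            ids.append(None)
            continue
        h = 0
        # Polynomial hash over the n bytes ending at position i (inclusive).
        for b in data[i - n + 1 : i + 1]:
            h = (h * base + b) % table_size
        ids.append(h)
    return ids

# One id list per n-gram size, matching the n=3..8 range used by the model.
text = b"the cat and the hat"
ids_by_n = {n: ngram_hash_ids(text, n, table_size=50_000) for n in range(3, 9)}
```

Each id would index a separate embedding table per n, with the looked-up vectors added to the byte embedding. Because the n-gram ending at position i only covers bytes up to i, the feature stays causal.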
python -m blt.train_entropy --steps 8000 # train the entropy patcher (Phase 0)
python -m blt.tests.test_causality # no-future-leakage HARD GATE
python -m blt.train --data train --steps 20000 --batch-size 8 --seq-len 512 --grad-accum 2
Phases 0–3 complete (architecture proven end-to-end). Phase 4 = scale to ~1.5B on a rented GPU (≥32 GB); see BLT_LLM.md.
MIT. The BLT architecture is from Meta FAIR's paper (linked above); this is an independent from-scratch reimplementation.