BLT-LLM — a Byte Latent Transformer, from scratch

A working, from-scratch PyTorch implementation of the Byte Latent Transformer (BLT; Pagnoni et al., Meta FAIR, 2024, "Byte Latent Transformer: Patches Scale Better Than Tokens", arXiv:2412.09871), trained end-to-end on TinyStories on a single 6 GB consumer GPU.

BLT is tokenizer-free: it reads raw bytes (0–255) and dynamically groups them into patches, spending compute where the next byte is hard to predict and skimming where it's easy. No BPE, no vocab file — just bytes.

Trained model: 🤗 sssssaud/blt-llm-tinystories-55m

Result

Metric	Value
Held-out bits-per-byte (BPB)	0.71 (best 0.7078)
Untrained baseline	8.0 (= log₂256)
Main model	55.4M params
Entropy patcher (frozen)	1.7M params
Trained on	TinyStoriesV2-GPT4 train split (~2.2B bytes), 20k steps, ~2.4 h on an RTX 3050 6 GB
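
The BPB numbers in the table can be related to a training loss with a one-line conversion. The sketch below assumes the loss is a mean next-byte cross-entropy measured in nats (the PyTorch default), which this README does not state; the function names and the example loss value are hypothetical.

```python
import math

def nats_to_bpb(loss_nats: float) -> float:
    """Convert a mean per-byte cross-entropy in nats to bits per byte."""
    return loss_nats / math.log(2)

def uniform_bpb(vocab_size: int = 256) -> float:
    """BPB of a predictor that gives every byte value equal probability."""
    return math.log2(vocab_size)

# A uniform guess over 256 byte values costs log2(256) = 8 bits per byte,
# which is the "untrained baseline" row of the table.
baseline = uniform_bpb()

# Hypothetical loss in nats, chosen so that it maps to the reported 0.71 BPB.
example_loss_nats = 0.71 * math.log(2)
example_bpb = nats_to_bpb(example_loss_nats)
```

Because the model predicts bytes directly, no tokens-per-byte correction is needed, unlike BPB computed for a tokenizer-based model.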

Sample (prompt "Once upon a time"):

Once upon a time, in a small house, there was a boy named Tim. One day, Tim went to the store with his mom. They needed to buy a toy… Tim said to his mom, "Mom, can I give t…

Quickstart

git clone https://github.com/shaikh-saud705/blt-llm.git
cd blt-llm
git lfs pull                       # fetch the 222 MB trained weights (needs git-lfs)
pip install -r requirements.txt

# generate text from the trained model:
python -m blt.generate --prompt "Once upon a time" --max-new 300 --temperature 0.7

Generation needs three files in checkpoints/ (all included): blt_model_weights.pt (the model), entropy_model.pt (frozen patcher), patcher_threshold.json (θ). Runs on CPU if you have no GPU (just slower).
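
If generation fails to load a checkpoint, a quick check of the three files listed above can help. This is a minimal sketch, not part of the repo: the helper name `checkpoint_problems` is hypothetical, and the only outside fact it relies on is that an unfetched git-LFS file is a small text pointer beginning with `version https://git-lfs.github.com/spec/`.

```python
from pathlib import Path

# File names taken from the Quickstart text above.
REQUIRED = ("blt_model_weights.pt", "entropy_model.pt", "patcher_threshold.json")
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/"

def checkpoint_problems(ckpt_dir="checkpoints"):
    """Return {filename: problem}; an empty dict means all three files look usable."""
    problems = {}
    for name in REQUIRED:
        path = Path(ckpt_dir) / name
        if not path.is_file():
            problems[name] = "missing"
            continue
        # An LFS pointer is a tiny text stub left behind when `git lfs pull` was skipped.
        with path.open("rb") as fh:
            head = fh.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            problems[name] = "git-lfs pointer, run `git lfs pull`"
    return problems

if __name__ == "__main__":
    print(checkpoint_problems() or "all checkpoint files present")
```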

Architecture

Three modules joined by two cross-attentions; all attention masks are derived from patch_ids (variable-length patches — no fixed reshape):

bytes → Local Encoder (1×256, windowed) ──┐
                                          ├─ encoder cross-attn: pool bytes → patches (seq shortens)
        Latent Global Transformer (6×768, block-causal over patches)  ← holds the bulk
                                          ├─ decoder cross-attn: expand patches → bytes (seq grows)
bytes ← Local Decoder (4×256, windowed) ──┘ → 256-way next-byte logits
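
To make "masks derived from patch_ids" concrete, here is a small pure-Python sketch of how per-byte patch indices can drive both cross-attentions. It is an illustration under assumptions, not the repo's code: the function names are hypothetical, the pooling rule (each patch reads only its own bytes) is assumed, and how bytes of the very first patch are handled in the decoder is not specified in this README, so they are marked with -1.

```python
def patch_ids_from_starts(starts):
    """Turn per-byte flags (True = this byte opens a patch) into patch indices."""
    ids, current = [], -1
    for i, is_start in enumerate(starts):
        if is_start or i == 0:  # the first byte always opens patch 0
            current += 1
        ids.append(current)
    return ids

def pooling_mask(patch_ids):
    """mask[p][i] is True when patch query p may read byte i (its own bytes only)."""
    n_patches = patch_ids[-1] + 1
    return [[pid == p for pid in patch_ids] for p in range(n_patches)]

def decoder_source(patch_ids):
    """For each byte, the index of the patch output it reads: the previous patch.

    -1 marks bytes of the first patch, which have no previous patch.
    """
    return [pid - 1 for pid in patch_ids]

starts = [True, False, False, True, False, True]
patch_ids = patch_ids_from_starts(starts)  # [0, 0, 0, 1, 1, 2]
pool = pooling_mask(patch_ids)             # 3 patches x 6 bytes
dec_src = decoder_source(patch_ids)        # [-1, -1, -1, 0, 0, 1]
```

Because everything is computed from `patch_ids`, patches of different lengths need no padding to a fixed size or reshape, which is the point made above.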

Dynamic patching is driven by a separate, frozen entropy byte-LM that scores the Shannon entropy H of each next-byte prediction; a byte starts a new patch when H > θ (θ=1.09 → avg ≈ 4.4 bytes/patch). The patcher is causal, so the same rule works during generation.
Shared style: RMSNorm, SwiGLU, RoPE (base θ=500000, distinct from the patching threshold) in self-attention only, tied byte embedding/output. Hash n-gram embeddings (n=3..8). k = hG/hE = 3 (global width 768 over local width 256).
Strictly causal: in the decoder cross-attn each byte attends to the previous patch's output, and the implementation passes a no-future-leakage gate (blt/tests/test_causality.py: gradient and perturbation tests, exact-zero leakage).
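
The entropy-threshold rule can be sketched in a few lines. This is a minimal illustration rather than the code in blt/patcher.py: the function names and the example entropy values are made up, and the unit of θ (bits here) is an assumption, since the README does not say whether 1.09 is in bits or nats.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of one next-byte distribution (256 probabilities)."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def patch_starts(entropies, theta):
    """Flag byte i as a patch start when the entropy of its prediction exceeds theta.

    entropies[i] is the entropy of the patcher's distribution for byte i given
    bytes 0..i-1, so the decision never looks at future bytes.
    """
    return [i == 0 or h > theta for i, h in enumerate(entropies)]

# Hypothetical per-byte entropies: high at "hard" positions, low inside easy runs.
entropies = [3.0, 0.2, 0.1, 2.5, 0.4, 0.3, 1.5, 0.05]
starts = patch_starts(entropies, theta=1.09)
avg_len = len(starts) / sum(starts)  # average bytes per patch in this toy example
```

Raising θ yields fewer, longer patches (less global-transformer compute per byte); lowering it yields more, shorter ones.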

Repo layout

blt/
  config.py            # tiny + training configs
  data.py              # TinyStories loading + byte batching
  entropy_model.py     # the small frozen byte-LM (patcher)
  patcher.py           # entropy → patch boundaries; threshold tuning
  ngram_hash.py        # RollPolyHash n-gram embeddings
  encoder.py           # local encoder + encoder cross-attention (pooling)
  global_transformer.py # latent global transformer
  decoder.py           # local decoder + decoder cross-attention (expanding)
  model.py             # assembles the BLT
  train.py             # end-to-end training loop (BPB, checkpoints, resume)
  generate.py          # autoregressive byte generation (causal patching)
  tests/               # shapes, overfit, real-patcher, causality gate
checkpoints/           # trained weights (blt_model_weights.pt via git-LFS) + patcher + θ
BLT_LLM.md             # the full build brief / spec
PROGRESS.md            # build log (what's done, how to reproduce)

Reproduce

python -m blt.train_entropy --steps 8000      # train the entropy patcher (Phase 0)
python -m blt.tests.test_causality            # no-future-leakage HARD GATE
python -m blt.train --data train --steps 20000 --batch-size 8 --seq-len 512 --grad-accum 2
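
For a rough sense of the training budget implied by that command, the arithmetic below multiplies out the flags. It assumes `--seq-len` counts bytes and `--batch-size` is the per-micro-batch size, neither of which is stated here; the corpus size is the approximate ~2.2B bytes quoted in the Result table.

```python
# Flag values copied from the training command above.
batch_size, seq_len, grad_accum, steps = 8, 512, 2, 20_000

# Assumption: --seq-len counts bytes and --batch-size is per micro-batch.
bytes_per_step = batch_size * seq_len * grad_accum
total_bytes = bytes_per_step * steps

# Approximate train-split size quoted in the Result table.
corpus_bytes = 2.2e9
fraction_of_corpus = total_bytes / corpus_bytes

print(f"{bytes_per_step} bytes/step, {total_bytes:,} bytes total, "
      f"{fraction_of_corpus:.1%} of the corpus")
```

Under these assumptions the run processes about 164M bytes, a small fraction of one pass over the train split.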

Status

Phases 0–3 are complete (architecture proven end-to-end). Phase 4 is scaling to ~1.5B parameters on a rented GPU (≥32 GB); see BLT_LLM.md.

License

MIT. The BLT architecture is from Meta FAIR's paper (arXiv:2412.09871, cited above); this is an independent from-scratch reimplementation.
