sssssaud/blt-llm


BLT-LLM — a Byte Latent Transformer, from scratch

A working, from-scratch PyTorch implementation of a Byte Latent Transformer (BLT) (Meta FAIR, Pagnoni et al. 2024 — "Patches Scale Better Than Tokens", arXiv:2412.09871), trained end-to-end on TinyStories on a single 6 GB consumer GPU.

BLT is tokenizer-free: it reads raw bytes (0–255) and dynamically groups them into patches, spending compute where the next byte is hard to predict and skimming where it's easy. No BPE, no vocab file — just bytes.
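As a minimal sketch of what "just bytes" means in practice, the snippet below turns text into the integer sequence a byte-level model consumes and back again. The helper names are hypothetical and not taken from the repo; its data pipeline may be organised differently.

```python
# Hypothetical helpers for illustration; not the repo's API.
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer in [0, 255] per byte."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Invert text_to_byte_ids; malformed sequences decode to U+FFFD instead of raising."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = text_to_byte_ids("café")
print(ids)                    # [99, 97, 102, 195, 169]  ("é" takes two bytes)
print(byte_ids_to_text(ids))  # café
```

The "vocabulary" is fixed at 256 values regardless of language or script, which is why no vocab file is needed.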

Trained model: 🤗 sssssaud/blt-llm-tinystories-55m

Result

Metric                         Value
Held-out bits-per-byte (BPB)   0.71 (best 0.7078)
Untrained baseline             8.0 (= log₂256)
Main model                     55.4M params
Entropy patcher (frozen)       1.7M params
Trained on                     TinyStoriesV2-GPT4 train split (~2.2B bytes), 20k steps, ~2.4 h on an RTX 3050 6 GB
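For readers unfamiliar with BPB, here is a small sketch of how the metric relates to a cross-entropy loss. It assumes the loss is a mean over bytes measured in nats, which is what PyTorch's cross-entropy returns by default; how blt/train.py actually computes its reported number is not shown here, and `nats_to_bpb` is a hypothetical name.

```python
import math


def nats_to_bpb(loss_nats_per_byte: float) -> float:
    """Convert a mean cross-entropy in nats per byte to bits per byte."""
    return loss_nats_per_byte / math.log(2)


# A uniform guess over 256 byte values costs ln(256) nats, i.e. 8 bits per byte,
# which is the "untrained baseline" row in the table.
uniform_loss = math.log(256)
print(round(nats_to_bpb(uniform_loss), 6))

# Going the other way, 0.71 BPB corresponds to roughly 0.49 nats per byte.
print(round(0.71 * math.log(2), 2))
```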

Sample (prompt "Once upon a time"):

Once upon a time, in a small house, there was a boy named Tim. One day, Tim went to the store with his mom. They needed to buy a toy… Tim said to his mom, "Mom, can I give t…

Quickstart

git clone https://github.com/shaikh-saud705/blt-llm.git
cd blt-llm
git lfs pull                       # fetch the ~222 MB of trained weights (needs git-lfs)
pip install -r requirements.txt

# generate text from the trained model:
python -m blt.generate --prompt "Once upon a time" --max-new 300 --temperature 0.7

Generation needs three files in checkpoints/ (all included): blt_model_weights.pt (the model), entropy_model.pt (the frozen patcher), and patcher_threshold.json (θ). It also runs on CPU if no GPU is available, only slower.

Architecture

Three modules joined by two cross-attentions; all attention masks are derived from patch_ids (variable-length patches — no fixed reshape):

bytes → Local Encoder (1×256, windowed) ──┐
                                          ├─ encoder cross-attn: pool bytes → patches (seq shortens)
        Latent Global Transformer (6×768, block-causal over patches)  ← holds the bulk
                                          ├─ decoder cross-attn: expand patches → bytes (seq grows)
bytes ← Local Decoder (4×256, windowed) ──┘ → 256-way next-byte logits
  • Dynamic patching by a separate, frozen entropy byte-LM that scores next-byte Shannon entropy; a byte starts a new patch when H > θ (θ=1.09 → avg ≈ 4.4 bytes/patch). It's causal, so it works during generation.
  • Shared style: RMSNorm, SwiGLU, RoPE (base θ=500000, unrelated to the patching threshold θ) in self-attention only, tied byte embedding/output. Hash n-gram embeddings (n=3..8). Width ratio k = hG/hE = 768/256 = 3.
  • Strictly causal: the decoder cross-attn has each byte attend to the previous patch's output, and the implementation passes a no-future-leakage gate (blt/tests/test_causality.py: gradient + perturbation tests, exact-zero leakage).
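The patching rule and the patch_ids-derived masks described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code in blt/patcher.py: every function name is hypothetical, the entropies are made-up numbers rather than outputs of the entropy model, and the README does not say whether θ is measured in bits or nats (the entropy helper below uses bits).

```python
import math


def shannon_entropy_bits(probs):
    """Shannon entropy, in bits, of one next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def patch_ids_from_entropy(entropies, theta):
    """Assign a patch index to every byte position.

    entropies[t] is the uncertainty about byte t given only bytes before t,
    so the boundary decision never looks at the future. Position 0 always
    opens patch 0.
    """
    patch_ids, current = [], 0
    for t, h in enumerate(entropies):
        if t > 0 and h > theta:
            current += 1  # hard-to-predict byte: it starts a new patch
        patch_ids.append(current)
    return patch_ids


def encoder_pool_mask(patch_ids):
    """pool[p][t] is True when byte t belongs to patch p (variable-length patches)."""
    n_patches = patch_ids[-1] + 1
    return [[pid == p for pid in patch_ids] for p in range(n_patches)]


def block_causal_mask(n_patches):
    """mask[p][q] is True when patch p may attend to patch q, i.e. q <= p."""
    return [[q <= p for q in range(n_patches)] for p in range(n_patches)]


print(shannon_entropy_bits([0.25] * 4))  # 2.0

entropies = [3.0, 0.2, 0.1, 1.5, 0.3, 0.4, 2.0, 0.1]  # made-up values
patch_ids = patch_ids_from_entropy(entropies, theta=1.09)
print(patch_ids)  # [0, 0, 0, 1, 1, 1, 2, 2]

pool = encoder_pool_mask(patch_ids)
causal = block_causal_mask(len(pool))
print(len(patch_ids) / len(pool))  # average bytes per patch for this toy input
```

Because both masks are built by comparing patch indices, patches of any length are handled without reshaping the byte sequence into fixed-size blocks.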

Repo layout

blt/
  config.py            # tiny + training configs
  data.py              # TinyStories loading + byte batching
  entropy_model.py     # the small frozen byte-LM (patcher)
  patcher.py           # entropy → patch boundaries; threshold tuning
  ngram_hash.py        # RollPolyHash n-gram embeddings
  encoder.py           # local encoder + encoder cross-attention (pooling)
  global_transformer.py  # latent global transformer
  decoder.py           # local decoder + decoder cross-attention (expanding)
  model.py             # assembles the BLT
  train.py             # end-to-end training loop (BPB, checkpoints, resume)
  generate.py          # autoregressive byte generation (causal patching)
  tests/               # shapes, overfit, real-patcher, causality gate
checkpoints/           # trained weights (blt_model_weights.pt via git-LFS) + patcher + θ
BLT_LLM.md             # the full build brief / spec
PROGRESS.md            # build log (what's done, how to reproduce)

Reproduce

python -m blt.train_entropy --steps 8000      # train the entropy patcher (Phase 0)
python -m blt.tests.test_causality            # no-future-leakage HARD GATE
python -m blt.train --data train --steps 20000 --batch-size 8 --seq-len 512 --grad-accum 2
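As a rough worked example of what those training flags add up to, the arithmetic below estimates the bytes seen during the run. It assumes that --seq-len is measured in bytes and that --steps counts optimizer steps, each consuming --grad-accum micro-batches; neither is confirmed here, and if steps count micro-batches the totals halve.

```python
# Flag values copied from the training command; the interpretation is an assumption.
batch_size, seq_len, grad_accum, steps = 8, 512, 2, 20_000

bytes_per_step = batch_size * seq_len * grad_accum  # bytes per optimizer step
total_bytes = bytes_per_step * steps                # bytes seen over the whole run

dataset_bytes = 2.2e9        # "~2.2B bytes" from the Result table
wall_seconds = 2.4 * 3600    # "~2.4 h" from the Result table

print(bytes_per_step)                          # 8192
print(total_bytes)                             # 163840000
print(round(total_bytes / dataset_bytes, 3))   # fraction of one pass over the data
print(round(total_bytes / wall_seconds))       # approximate bytes per second
```

Under these assumptions the run covers well under one pass over the training split.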

Status

Phases 0–3 complete (architecture proven end-to-end). Phase 4 = scale to ~1.5B on a rented GPU (≥32 GB) — see BLT_LLM.md.

License

MIT. The BLT architecture is from Meta FAIR's paper (linked above); this is an independent from-scratch reimplementation.

About

Byte Latent Transformer (BLT) LLM built from scratch in PyTorch — tokenizer-free, byte-level, trained end-to-end on TinyStories to 0.71 BPB. Clone & run.
