
SUBSTRATE

Build Your Own GPT From Scratch

The Complete Guide to Transformer Architecture

JavaScript • Educational • MIT License

Pure JavaScript • No Dependencies • Learn by Building


The Complete Architecture Flow

[Architecture flow diagram: the complete data flow through a Transformer model.]


What is SUBSTRATE?

SUBSTRATE is a complete learning system for understanding Large Language Models. You build every component from scratch - no magic frameworks hiding the details.

What you'll build:

  • Tokenizer: text to numbers
  • Embeddings: numbers to vectors
  • Attention: the breakthrough mechanism
  • Transformer blocks: core units
  • Full LLM: complete GPT model
  • Training: learning algorithm
  • Inference: text generation
  • Visualizations: understanding the model

Project Structure

substrate/
├── 01_tokenizer/
├── 02_embeddings/
├── 03_attention/
├── 04_transformer_block/
├── 05_full_model/
├── 06_inference/
├── 07_visualizations/
├── notebooks/
└── README.md

Quick Start

  1. Clone the repository

     git clone https://github.com/codewithfourtix/substrate.git
     cd substrate

  2. Start with Module 1

     Open 01_tokenizer/README.md

  3. Run the complete simulation

     Open notebooks/01_full_simulation.md

Learning Modules

Module  Topic              Duration  Difficulty
01      Tokenization       15 min    Beginner
02      Embeddings         20 min    Beginner
03      Attention          45 min    Intermediate
04      Transformer Block  30 min    Intermediate
05      Full Model         25 min    Intermediate
06      Inference          20 min    Beginner
07      Visualizations     15 min    Beginner

Core Concepts

Tokenization

Convert text into numbers that models can process.

"hello" → [7, 2, 3, 3, 4]

Key concepts:

  • Vocabulary: Dictionary of all tokens
  • Token IDs: Unique numbers for each token
  • Padding: Make sequences same length
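The idea can be sketched as a tiny character-level tokenizer. The helper names below are illustrative, not the actual API from 01_tokenizer:

```javascript
// Build a character-level vocabulary from sample text, then encode/decode.
function buildVocab(text) {
  const chars = [...new Set(text)].sort();
  const stoi = {}, itos = {};        // string-to-ID and ID-to-string maps
  chars.forEach((ch, i) => { stoi[ch] = i; itos[i] = ch; });
  return { stoi, itos };
}

// Encode: each character becomes its token ID.
function encode(text, stoi) {
  return [...text].map(ch => stoi[ch]);
}

// Decode: token IDs back to text.
function decode(ids, itos) {
  return ids.map(i => itos[i]).join("");
}

const { stoi, itos } = buildVocab("hello world");
const ids = encode("hello", stoi);
console.log(ids);                // one token ID per character
console.log(decode(ids, itos));  // "hello"
```

Encoding followed by decoding always round-trips, which is a handy sanity check when building your own tokenizer.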

Embeddings

Convert token IDs into dense vectors that capture meaning.

Token 5 → [0.2, -0.1, 0.5, 0.3, ...]

Plus positional encoding: Tell the model word position using sine/cosine.
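Both steps can be sketched in a few lines. The dimensions and helper names here are made up for readability; the embedding values are random stand-ins for learned weights:

```javascript
const dModel = 4; // tiny embedding dimension for illustration

// Random embedding table: one dModel-vector per token ID (learned in practice).
function makeEmbeddingTable(vocabSize, dim) {
  return Array.from({ length: vocabSize }, () =>
    Array.from({ length: dim }, () => Math.random() * 0.2 - 0.1));
}

// Classic sine/cosine positional encoding: even indices use sin, odd use cos.
function positionalEncoding(pos, dim) {
  const pe = new Array(dim);
  for (let i = 0; i < dim; i += 2) {
    const freq = Math.pow(10000, -i / dim);
    pe[i] = Math.sin(pos * freq);
    if (i + 1 < dim) pe[i + 1] = Math.cos(pos * freq);
  }
  return pe;
}

// Final input vector = token embedding + positional encoding.
function embed(tokenIds, table) {
  return tokenIds.map((id, pos) => {
    const pe = positionalEncoding(pos, dModel);
    return table[id].map((x, j) => x + pe[j]);
  });
}

const table = makeEmbeddingTable(10, dModel);
const vectors = embed([5, 2, 7], table);
console.log(vectors.length, vectors[0].length); // 3 tokens, each a 4-dim vector
```

The addition is what lets the model distinguish "dog bites man" from "man bites dog": the same token gets a different final vector at each position.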

Attention

The breakthrough mechanism. Tokens look at other tokens and learn relationships.

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

Multi-head attention: 8+ heads learning different patterns simultaneously.

Causal masking: during generation, each token can attend only to earlier tokens, never to future ones.
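The formula and the causal mask fit in a short single-head sketch (illustrative code; the real modules add learned Q/K/V projections and batching):

```javascript
// Numerically stable softmax over an array of scores.
function softmax(xs) {
  const m = Math.max(...xs);
  const exps = xs.map(x => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);

// Q, K, V: arrays of vectors, one row per token.
function attention(Q, K, V, causal = true) {
  const dk = K[0].length;
  return Q.map((q, i) => {
    // Score each query against every key, scaled by sqrt(d_k);
    // future positions (j > i) are masked out with -Infinity.
    const scores = K.map((k, j) =>
      causal && j > i ? -Infinity : dot(q, k) / Math.sqrt(dk));
    const weights = softmax(scores);
    // Output = attention-weighted sum of the value vectors.
    return V[0].map((_, c) =>
      weights.reduce((s, w, j) => s + w * V[j][c], 0));
  });
}

const X = [[1, 0], [0, 1]];
const out = attention(X, X, X);
console.log(out); // token 0 attends only to itself; token 1 mixes both tokens
```

Note how the mask makes token 0's output exactly its own value vector: softmax turns the `-Infinity` score into a weight of zero.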

Transformer Block

Stack of: Attention → Add & LayerNorm → FeedForward → Add & LayerNorm, with a residual connection wrapped around each sub-layer.

Repeated N times (2 for small models, 96 for GPT-3).
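The wiring can be sketched with stubbed sub-layers; the placeholders below stand in for the real attention and feed-forward code, so the residual-plus-normalize pattern is the point:

```javascript
// Layer normalization: rescale a vector to mean 0, variance 1.
function layerNorm(x) {
  const mean = x.reduce((a, b) => a + b, 0) / x.length;
  const variance = x.reduce((a, b) => a + (b - mean) ** 2, 0) / x.length;
  return x.map(v => (v - mean) / Math.sqrt(variance + 1e-5));
}

// Stand-ins for the real sub-layers (see the attention and block modules).
const attention = x => x.map(v => v * 0.5);    // placeholder
const feedForward = x => x.map(v => v + 0.1);  // placeholder

function transformerBlock(x) {
  // Residual connection #1: x + Attention(x), then normalize.
  const h = layerNorm(x.map((v, i) => v + attention(x)[i]));
  // Residual connection #2: h + FeedForward(h), then normalize.
  return layerNorm(h.map((v, i) => v + feedForward(h)[i]));
}

const out = transformerBlock([1, 2, 3, 4]);
console.log(out); // same length as the input, normalized to mean ~0
```

The residual connections are what let gradients flow through 96 stacked blocks without vanishing.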

Training

The model predicts the next token at every position. The cross-entropy loss between its predictions and the actual next tokens decreases as it learns.
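In miniature, the training signal is just cross-entropy on the next token. The probabilities below are toy numbers, not real model output:

```javascript
// Cross-entropy loss: -log of the probability assigned to the correct token.
function crossEntropy(probs, targetId) {
  return -Math.log(probs[targetId]);
}

// Model output: a probability distribution over a 4-token vocabulary.
const badGuess  = [0.25, 0.25, 0.25, 0.25]; // uniform: the model knows nothing
const goodGuess = [0.05, 0.85, 0.05, 0.05]; // confident in the right token

const target = 1; // the token that actually came next in the training text
console.log(crossEntropy(badGuess, target).toFixed(3));  // 1.386
console.log(crossEntropy(goodGuess, target).toFixed(3)); // 0.163
```

Training is just nudging the weights so the second number keeps shrinking, averaged over every position in the data.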

Inference

Generate text one token at a time using:

  • Greedy decoding: fastest, deterministic
  • Beam search: higher quality, slower
  • Temperature sampling: tunable creativity
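These strategies differ only in how they pick from the model's output distribution. A sketch of greedy picking versus temperature sampling, with made-up logits:

```javascript
// Softmax with temperature: low temperature sharpens the distribution,
// high temperature flattens it.
function softmax(logits, temperature = 1.0) {
  const scaled = logits.map(l => l / temperature);
  const m = Math.max(...scaled);
  const exps = scaled.map(x => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Greedy decoding: always take the highest-scoring token.
const argmax = xs => xs.indexOf(Math.max(...xs));

// Temperature sampling: draw a token index from the distribution.
function sample(probs) {
  let r = Math.random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1;
}

const logits = [2.0, 1.0, 0.5, 0.1]; // toy model output for 4 tokens
console.log(argmax(logits));               // greedy always picks token 0
console.log(sample(softmax(logits, 0.5))); // low temperature: almost always 0
console.log(sample(softmax(logits, 2.0))); // high temperature: more variety
```

Beam search works differently: it keeps the top-k partial sequences at each step rather than committing to one token at a time.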

Real-World Context

The Transformer architecture in SUBSTRATE is the same core design used by:

  • GPT-2 (1.5B parameters)
  • GPT-3 (175B parameters)
  • BERT (an encoder-only variant)
  • ChatGPT
  • LLaMA

The main differences are scale: parameter count, training data, and hardware.


Getting Started

Option 1: Follow the Path

  1. Read 01_tokenizer/README.md
  2. Study the code
  3. Run notebook 01_full_simulation.md
  4. Build your own model

Option 2: Dive into Notebooks

  • 01_full_simulation.md - Complete walkthrough
  • 02_attention_deep_dive.md - Understand attention deeply
  • 03_train_on_tiny_shakespeare.md - Train on real data

FAQ

How long does this take? Roughly 3 hours to work through all seven modules.

Do I need a GPU? No, everything runs on the CPU.

Can I train on my own data? Absolutely, just replace the training data.

Is this production-ready? No, it's for learning. Real systems use PyTorch/TensorFlow.

What's the hardest part? The attention mechanism. But once you understand it, everything clicks.


What You'll Learn

  • How tokenization works
  • How embeddings capture meaning
  • How attention lets tokens communicate
  • How transformers process text
  • How to train an LLM
  • How generation works
  • How to debug LLM issues

Ready to Build Your Own LLM?

Start with Module 01

Built with care for learning.

MIT License
