The Complete Guide to Transformer Architecture
Pure JavaScript • No Dependencies • Learn by Building
This diagram shows the complete data flow through a Transformer model.
SUBSTRATE is a complete learning system for understanding Large Language Models. You build every component from scratch: no magic frameworks hiding the details.
What you'll build:
- Tokenizer: text to numbers
- Embeddings: numbers to vectors
- Attention: the breakthrough mechanism
- Transformer blocks: core units
- Full LLM: complete GPT model
- Training: learning algorithm
- Inference: text generation
- Visualizations: understanding the model
substrate/
├── 01_tokenizer/
├── 02_embeddings/
├── 03_attention/
├── 04_transformer_block/
├── 05_full_model/
├── 06_inference/
├── 07_visualizations/
├── notebooks/
└── README.md
- Clone the repository
  git clone https://github.com/codewithfourtix/substrate.git
  cd substrate
- Start with Module 1
  Open 01_tokenizer/README.md
- Run the complete simulation
  Open notebooks/01_full_simulation.md

| Module | Topic | Duration | Difficulty |
|---|---|---|---|
| 01 | Tokenization | 15 min | Beginner |
| 02 | Embeddings | 20 min | Beginner |
| 03 | Attention | 45 min | Intermediate |
| 04 | Transformer Block | 30 min | Intermediate |
| 05 | Full Model | 25 min | Intermediate |
| 06 | Inference | 20 min | Beginner |
| 07 | Visualizations | 15 min | Beginner |
Convert text into numbers that models can process.
"hello" → [7, 2, 3, 3, 4]
Key concepts:
- Vocabulary: Dictionary of all tokens
- Token IDs: Unique numbers for each token
- Padding: Make sequences same length
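The three concepts above fit in a few lines. Here is a minimal character-level sketch; the function names (buildVocab, encode, pad) are illustrative, not SUBSTRATE's actual API:

```javascript
// Vocabulary: assign each new character the next available token ID.
function buildVocab(text) {
  const vocab = {};
  for (const ch of text) {
    if (!(ch in vocab)) vocab[ch] = Object.keys(vocab).length;
  }
  return vocab;
}

// Token IDs: map each character to its number.
function encode(text, vocab) {
  return [...text].map(ch => vocab[ch]);
}

// Padding: extend a sequence to a fixed length with a pad token ID.
function pad(ids, length, padId = 0) {
  return ids.concat(Array(Math.max(0, length - ids.length)).fill(padId));
}

const vocab = buildVocab("hello world");
const ids = encode("hello", vocab); // [0, 1, 2, 2, 3] with this vocabulary
```

Note how the repeated "l" maps to the same ID twice, just like the [7, 2, 3, 3, 4] example above.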
Convert token IDs into dense vectors that capture meaning.
Token 5 → [0.2, -0.1, 0.5, 0.3, ...]
Plus positional encoding: sine/cosine signals that tell the model where each token sits in the sequence.
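A sketch of both steps together. The embedding table is random here for illustration (a real model learns it during training); the sine/cosine scheme is the one from "Attention Is All You Need":

```javascript
const dModel = 8;
const vocabSize = 16;

// Embedding table: one dense vector per token ID (random placeholder values).
const embeddingTable = Array.from({ length: vocabSize }, () =>
  Array.from({ length: dModel }, () => Math.random() * 0.2 - 0.1)
);

// Sinusoidal positional encoding for one position.
function positionalEncoding(pos, d) {
  const pe = new Array(d);
  for (let i = 0; i < d; i += 2) {
    const angle = pos / Math.pow(10000, i / d);
    pe[i] = Math.sin(angle);     // even dimensions: sine
    pe[i + 1] = Math.cos(angle); // odd dimensions: cosine
  }
  return pe;
}

// Token 5 at position 0 becomes a dense dModel-length vector.
function embed(tokenIds) {
  return tokenIds.map((id, pos) => {
    const pe = positionalEncoding(pos, dModel);
    return embeddingTable[id].map((x, i) => x + pe[i]);
  });
}

const vectors = embed([5, 2, 7]);
```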
The breakthrough mechanism. Tokens look at other tokens and learn relationships.
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
Multi-head attention: 8+ heads learning different patterns simultaneously.
Causal masking: each token can attend only to itself and earlier tokens, so generation never peeks at the future.
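The formula above runs directly as code. A single-head sketch, with the causal mask as an option (Q, K, V are arrays of row vectors):

```javascript
// Numerically stable softmax over one row of scores.
function softmax(row) {
  const m = Math.max(...row);
  const exps = row.map(x => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Scaled dot-product attention: softmax(Q * K^T / sqrt(d_k)) * V.
function attention(Q, K, V, causal = false) {
  const dk = K[0].length;
  return Q.map((q, i) => {
    // Score query i against every key, scaled by sqrt(d_k).
    let scores = K.map(k => q.reduce((s, x, d) => s + x * k[d], 0) / Math.sqrt(dk));
    // Causal mask: -Infinity on future positions becomes 0 after softmax.
    if (causal) scores = scores.map((s, j) => (j > i ? -Infinity : s));
    const weights = softmax(scores);
    // Output i is the attention-weighted sum of value vectors.
    return V[0].map((_, d) => weights.reduce((s, w, j) => s + w * V[j][d], 0));
  });
}
```

With the causal mask on, the first token can attend only to itself, so its output is exactly its own value vector.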
Stack of: Attention → LayerNorm → FeedForward → LayerNorm
Repeated N times (2 for small models, 96 for GPT-3).
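The wiring can be sketched on its own. Here attn and ffn are passed in as placeholders (the test uses identity attention and a zero feed-forward), standing in for the real sublayers built in earlier modules; only the residual-plus-LayerNorm structure and the N-times stacking are shown:

```javascript
// LayerNorm over one vector: zero mean, unit variance, small eps for stability.
function layerNorm(v, eps = 1e-5) {
  const mean = v.reduce((a, b) => a + b, 0) / v.length;
  const variance = v.reduce((a, b) => a + (b - mean) ** 2, 0) / v.length;
  return v.map(x => (x - mean) / Math.sqrt(variance + eps));
}

// One block in the order listed above:
// Attention -> add residual -> LayerNorm -> FeedForward -> add residual -> LayerNorm.
function transformerBlock(x, attn, ffn) {
  const a = attn(x); // sequence-wide self-attention
  const h = x.map((row, i) => layerNorm(row.map((v, d) => v + a[i][d])));
  return h.map(row => layerNorm(row.map((v, d) => v + ffn(row)[d])));
}

// Stacking N blocks is just repeated application.
function stack(x, n, attn, ffn) {
  for (let i = 0; i < n; i++) x = transformerBlock(x, attn, ffn);
  return x;
}
```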
The model learns by predicting the next token; the loss decreases as its predictions improve.
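Concretely, next-token prediction is scored with cross-entropy loss. A sketch with made-up probability vectors:

```javascript
// Cross-entropy for one prediction: -log(probability assigned to the correct
// next token). Low when the model is confidently right, high when it is wrong.
function crossEntropy(probs, targetId) {
  return -Math.log(probs[targetId]);
}

// Toy distributions over a 3-token vocabulary; the correct next token is ID 1.
const confident = crossEntropy([0.05, 0.9, 0.05], 1); // ≈ 0.105
const uncertain = crossEntropy([0.8, 0.1, 0.1], 1);   // ≈ 2.303
```

Training averages this loss over every position in every sequence and nudges the weights to reduce it.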
Generate text one token at a time using:
- Greedy decoding: fastest
- Beam search: better quality
- Temperature sampling: creative
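The strategies differ only in how they pick from the model's next-token scores. A sketch of greedy decoding and temperature sampling over made-up logits (beam search omitted for brevity):

```javascript
// Greedy decoding: always take the highest-scoring token.
function greedy(logits) {
  return logits.indexOf(Math.max(...logits));
}

// Temperature sampling: divide logits by the temperature before softmax.
// T < 1 sharpens the distribution (safer); T > 1 flattens it (more creative).
function sampleWithTemperature(logits, temperature) {
  const scaled = logits.map(l => l / temperature);
  const m = Math.max(...scaled);
  const exps = scaled.map(l => Math.exp(l - m));
  const total = exps.reduce((a, b) => a + b, 0);
  // Draw a token in proportion to its probability mass.
  let r = Math.random() * total;
  for (let i = 0; i < exps.length; i++) {
    r -= exps[i];
    if (r <= 0) return i;
  }
  return exps.length - 1;
}

const logits = [2.0, 1.0, 0.1];
```

Generation loops this: pick a token, append it to the context, run the model again.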
The architecture in SUBSTRATE is the same decoder-only transformer design used by:
- GPT-2 (1.5B parameters)
- GPT-3 (175B parameters)
- ChatGPT
- LLaMA
BERT uses the same transformer blocks in an encoder-only configuration.
The main differences are size, training data, and hardware.
- Read 01_tokenizer/README.md
- Study the code
- Run notebook 01_full_simulation.md
- Build your own model
- 01_full_simulation.md - Complete walkthrough
- 02_attention_deep_dive.md - Understand attention deeply
- 03_train_on_tiny_shakespeare.md - Train on real data
How long does this take? About 3 hours for a complete pass.
Do I need a GPU? No, everything runs on a CPU.
Can I train on my own data? Absolutely, just replace the training data.
Is this production-ready? No, it's for learning. Real systems use PyTorch/TensorFlow.
What's the hardest part? The attention mechanism. But once you understand it, everything clicks.
What you'll learn:
- How tokenization works
- How embeddings capture meaning
- How attention lets tokens communicate
- How transformers process text
- How to train an LLM
- How generation works
- How to debug LLM issues

