Learn how the Attention mechanism works and how ChatGPT was built — step by step, from scratch.
This project by HERE AND NOW AI is a hands-on tutorial based on the foundational paper "Attention Is All You Need" (Vaswani et al., 2017). Every concept is explained in plain English with analogies, and every Python script is self-contained and runnable.
- Python 3.8+
- NumPy (the only dependency for the core chapters; bonus demos need extras from requirements.txt)
- No GPU required — everything runs on CPU in seconds
- No prior deep learning knowledge needed
# 1. Clone or navigate to this folder
cd transformers-simplified
# 2. Install dependencies
pip install -r requirements.txt

Keywords: Transformers Explained, Self-Attention mechanism, LLM from scratch, Generative AI tutorial, GPT architecture, Neural Networks, AI Education, HERE AND NOW AI.
Follow the chapters in order. Each chapter has:
- 📖 concepts.md — Read this first (theory + analogies)
- 🐍 *.py — Then run the code (hands-on demo)
All scripts are designed by HERE AND NOW AI for local execution and maximum educational clarity.
"AI is Good"
START HERE
│
▼
┌──────────────────────────────────────────────────────────┐
│ Chapter 1: Introduction │
│ 📖 concepts.md → Why Transformers replaced RNNs │
│ 🐍 why_transformers.py → Sequential vs parallel demo │
└──────────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Chapter 2: Word Embeddings │
│ 📖 concepts.md → Turning words into numbers │
│ 🐍 word_embeddings.py → Embeddings & similarity demo │
└──────────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Chapter 3: Positional Encoding │
│ 📖 concepts.md → How the model knows word order │
│ 🐍 positional_encoding.py → Sin/cos wave visualization │
└──────────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Chapter 4: Self-Attention ⭐ (THE core concept) │
│ 📖 concepts.md → Query, Key, Value explained │
│ 🐍 self_attention.py → Attention computed step-by-step │
└──────────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Chapter 5: Multi-Head Attention │
│ 📖 concepts.md → Multiple perspectives in parallel │
│ 🐍 multi_head_attention.py → Multi-head from scratch │
└──────────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Chapter 6: Transformer Block │
│ 📖 concepts.md → LayerNorm, FFN, residual connections │
│ 🐍 transformer_block.py → Complete encoder block │
└──────────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Chapter 7: Full Transformer │
│ 📖 concepts.md → Encoder-Decoder + masked attention │
│ 🐍 mini_transformer.py → End-to-end mini Transformer │
└──────────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Chapter 8: GPT & ChatGPT 🔥 │
│ 📖 concepts.md → GPT-1→GPT-4, RLHF, AI revolution │
│ 🐍 mini_gpt.py → Decoder-only text generator │
└──────────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Chapter 9: Bonus — Real-World AI Demos 🎁 │
│ 🐍 simple_chatbot.py → Chat with a local LLM │
│ 🐍 get_word_vector.py → Fetch word embeddings │
│ 🐍 visualize_embeddings.py→ 2D embedding visualization │
│ 🐍 word_distance.py → Cosine similarity & analogy │
│ 🐍 kokoro_tts.py → Text-to-Speech with Kokoro │
│ 🐍 whisper_stt.py → Speech-to-Text with Whisper │
│ 🐍 text_to_image.py → Image gen with SDXL Turbo │
└──────────────────────────────────────────────────────────┘
▼
🎉 DONE!
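As a taste of what the chapters build up to, Chapter 3's sin/cos positional encoding can be sketched in a few lines of NumPy (a minimal sketch following the formula in the original paper; the function name and sizes here are ours, not from the chapter scripts):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]     # (seq_len, 1)
    dims = np.arange(d_model)[None, :]          # (1, d_model)
    # Each pair of dimensions gets a wave of a different frequency
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates            # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])       # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
```

Each position gets a unique fingerprint of sines and cosines, which is simply added to the word embeddings so the model can tell "first word" from "fifth word".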
transformer_simplified/
│
├── README.md ← You are here
├── requirements.txt ← pip install -r requirements.txt
├── attention_is_all_you_need.pdf ← The original paper (reference)
│
├── 01_introduction/
│ ├── concepts.md ← Why Transformers changed everything
│ └── why_transformers.py ← Sequential vs parallel processing
│
├── 02_word_embeddings/
│ ├── concepts.md ← Words → vectors, cosine similarity
│ └── word_embeddings.py ← Build embeddings, word arithmetic
│
├── 03_positional_encoding/
│ ├── concepts.md ← Sin/cos position signals
│ └── positional_encoding.py ← Compute & visualize encodings
│
├── 04_self_attention/
│ ├── concepts.md ← Q, K, V — the core mechanism
│ └── self_attention.py ← Attention from scratch
│
├── 05_multi_head_attention/
│ ├── concepts.md ← Parallel attention heads
│ └── multi_head_attention.py ← Multi-head demo
│
├── 06_transformer_block/
│ ├── concepts.md ← Residuals, LayerNorm, FFN
│ └── transformer_block.py ← Complete encoder block
│
├── 07_full_transformer/
│ ├── concepts.md ← Encoder + Decoder architecture
│ └── mini_transformer.py ← Masked & cross-attention
│
├── 08_gpt_and_chatgpt/
│ ├── concepts.md ← GPT timeline, RLHF, ChatGPT
│ └── mini_gpt.py ← Decoder-only text generation
│
└── 09_bonus/
    ├── words.md ← Example words for embedding demos
    ├── simple_chatbot.py ← Interactive chatbot using Ollama
    ├── get_word_vector.py ← Fetch word embeddings from Ollama
    ├── visualize_embeddings.py ← PCA visualization of word vectors
    ├── word_distance.py ← Cosine similarity & analogy tool
    ├── kokoro_tts.py ← Text-to-Speech with Kokoro ONNX
    ├── whisper_stt.py ← Speech-to-Text with Whisper
    └── text_to_image.py ← Image generation with SDXL Turbo
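The cosine-similarity and word-arithmetic demos (Chapters 2 and 9) rest on one small formula. Here is a minimal sketch with made-up 3-dimensional vectors, purely for illustration — real embeddings have hundreds of dimensions and learned values:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "embeddings" (hypothetical numbers chosen by hand for this example)
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.9, 0.1, 0.8])
man   = np.array([0.5, 0.9, 0.0])
woman = np.array([0.5, 0.2, 0.7])

# The classic analogy: king - man + woman ≈ queen
analogy = king - man + woman
print(cosine_similarity(analogy, queen))  # close to 1.0
```

With real embeddings the analogy vector lands *near* queen rather than exactly on it; the toy numbers above are rigged so the effect is visible in 3 dimensions.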
# Run each chapter in order
python 01_introduction/why_transformers.py
python 02_word_embeddings/word_embeddings.py
python 03_positional_encoding/positional_encoding.py
python 04_self_attention/self_attention.py
python 05_multi_head_attention/multi_head_attention.py
python 06_transformer_block/transformer_block.py
python 07_full_transformer/mini_transformer.py
python 08_gpt_and_chatgpt/mini_gpt.py
# Bonus demos (require additional dependencies — see requirements.txt)
python 09_bonus/simple_chatbot.py
python 09_bonus/get_word_vector.py
python 09_bonus/visualize_embeddings.py
python 09_bonus/word_distance.py
python 09_bonus/kokoro_tts.py
python 09_bonus/whisper_stt.py
python 09_bonus/text_to_image.py

Each script prints a clean, annotated walkthrough explaining what it's computing at every step.
| Concept | One-Line Explanation |
|---|---|
| Embedding | Convert words to vectors so computers can process them |
| Positional Encoding | Add position info using sin/cos waves |
| Self-Attention | Let every word look at every other word to understand context |
| Q, K, V | Query (what I want), Key (what I offer), Value (my content) |
| Multi-Head | Run several attentions in parallel for richer understanding |
| LayerNorm | Normalize values for stable training |
| Residual Connection | Add input to output so information isn't lost |
| Feed-Forward Network | Two linear layers with ReLU — "thinking" time for each word |
| Masked Attention | Prevent the model from seeing future words during generation |
| Cross-Attention | Decoder reads the encoder's output |
| GPT | Decoder-only Transformer trained to predict the next word |
| RLHF | Use human feedback to align the model with human preferences |
This tutorial is based on:
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762.
The included attention_is_all_you_need.pdf is the original paper for reference.
Stay updated with the latest in AI and machine learning.
- Website: hereandnowai.com
- Email: info@hereandnowai.com
- Phone: +91 996 296 1000
This project is for educational purposes. Feel free to use, modify, and share.
"AI is Good" — HERE AND NOW AI
