Small Language Model (SLM) from Scratch

This repository contains a Python implementation of a small language model (SLM) built from scratch using PyTorch. The project aims to demonstrate the core components of a transformer-based language model, including data preparation, model architecture, training, and text generation.

Introduction

This project is an educational exercise in building a small language model (SLM) without relying on high-level libraries such as Hugging Face Transformers for the model definition. The goal is to understand the fundamental building blocks of modern language models, such as self-attention, feed-forward networks, and positional embeddings. The model is deliberately small, targeting 10 to 15 million parameters, which makes it suitable for training on consumer-grade hardware and for quick experimentation.

The model is trained on the TinyStories dataset, a synthetic dataset specifically designed for training small language models. It consists of short stories generated by larger language models (GPT-3.5 and GPT-4) using a vocabulary limited to words typically understood by 3- to 4-year-olds. This constraint makes it well suited to training compact models and observing whether they can generate coherent, creative text within a simplified linguistic context.

Features

Custom Transformer Implementation: A complete implementation of a transformer block, including multi-head causal self-attention and a feed-forward network.
Layer Normalization: Custom implementation of Layer Normalization for better control and understanding of the normalization process.
Weight Tying: Implementation of weight tying between the token embedding layer and the language model head, a common practice in modern language models to reduce parameters and improve performance.
Data Tokenization and Preparation: Uses tiktoken for efficient tokenization (GPT-2 BPE encoding) and the datasets library for loading and processing the TinyStories dataset. Tokenized data is stored in memory-mapped binary files (.bin) so that large datasets can be read without loading them fully into memory.
Flexible Training Loop: A configurable training loop with features like learning rate warm-up, cosine annealing decay, gradient accumulation, and mixed-precision training (using torch.amp.autocast).
Model Checkpointing: Saves the best-performing model based on validation loss.
Text Generation: Includes a generate method for sampling new text from the trained model, supporting temperature and top-k sampling.
Configuration Management: Centralized configuration for model architecture and training parameters, making it easy to experiment with different settings.
Reproducibility: Includes utilities for setting random seeds to ensure reproducible training runs.
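
The learning rate warm-up and cosine decay mentioned in the list above can be sketched as a single function of the training step. This is a minimal illustration, not the repository's actual scheduler: the function name `lr_at_step` and all default values (`max_lr`, `min_lr`, `warmup_steps`, `total_steps`) are assumptions chosen for the example.

```python
import math


def lr_at_step(step, max_lr=1e-3, min_lr=1e-4, warmup_steps=100, total_steps=1000):
    """Linear warm-up to max_lr, then cosine decay to min_lr (hypothetical values)."""
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warm-up step.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # After the schedule ends, hold the floor value.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return min_lr + cosine * (max_lr - min_lr)


for s in (0, 50, 99, 100, 550, 1000):
    print(s, f"{lr_at_step(s):.6f}")
```

In a training loop, a value like this would typically be written into each optimizer parameter group's `lr` entry before the optimizer step.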

Dataset

The model is trained on the TinyStories dataset. This dataset is available through the Hugging Face datasets library. It contains short stories with a limited vocabulary, making it suitable for training small language models. The data preparation pipeline tokenizes the stories and stores them in binary files (train.bin and validation.bin) for efficient access during training.
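
As a rough illustration of the binary-file approach, the sketch below writes token IDs to a memory-mapped file with NumPy and then samples input/target windows from it. It makes several assumptions not stated above: tokens are stored as `uint16` (sufficient for the GPT-2 vocabulary of 50,257 entries), the token IDs are stand-in values instead of real tokenizer output, and the file lives in a temporary directory.

```python
import os
import tempfile

import numpy as np

# Stand-in token IDs; in the real pipeline these would come from the tokenizer.
token_ids = np.arange(1000, dtype=np.uint16) % 500

path = os.path.join(tempfile.mkdtemp(), "train.bin")

# Write: allocate the file on disk and copy the tokens into it.
out = np.memmap(path, dtype=np.uint16, mode="w+", shape=(len(token_ids),))
out[:] = token_ids
out.flush()
del out

# Read: map the file and slice random windows without loading it all into RAM.
data = np.memmap(path, dtype=np.uint16, mode="r")
block_size, batch_size = 8, 4
rng = np.random.default_rng(0)
starts = rng.integers(0, len(data) - block_size - 1, size=batch_size)

# Inputs x and targets y; y is x shifted one token to the right.
x = np.stack([data[s : s + block_size] for s in starts]).astype(np.int64)
y = np.stack([data[s + 1 : s + 1 + block_size] for s in starts]).astype(np.int64)
print(x.shape, y.shape)
```

The arrays would then be converted to tensors and moved to the training device; that step is omitted here.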

Model Architecture

The model architecture is a decoder-only transformer, similar to the GPT series of models. It consists of:

Token Embeddings: Converts input token IDs into dense vector representations.
Positional Embeddings: Adds positional information to the token embeddings, allowing the model to understand the order of words.
Transformer Blocks: Multiple layers of custom-implemented transformer blocks. Each block contains:
- Layer Normalization: Applied before attention and MLP layers.
- Causal Self-Attention: A multi-head self-attention mechanism with a causal mask, so each token can attend only to itself and earlier tokens in the sequence. Flash Attention is used if available; otherwise the model falls back to a standard PyTorch implementation.
- MLP (Feed-Forward Network): A two-layer neural network with a GELU activation function.
Language Model Head: A linear layer that projects the output of the transformer blocks to the vocabulary size, predicting the next token.
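
The causal attention step and its fallback path can be sketched as follows. This is an illustrative version under stated assumptions, not the repository's code: the function names are hypothetical, the fast path relies on `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch 2.0 and later, which may dispatch to a Flash Attention kernel on supported GPUs), and dropout is left out.

```python
import math

import torch
import torch.nn.functional as F


def manual_causal_attention(q, k, v):
    """Reference implementation. q, k, v: (batch, heads, seq_len, head_dim)."""
    T = q.size(-2)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    # Lower-triangular mask: position i may see positions 0..i only.
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def causal_attention(q, k, v):
    """Use the fused kernel when this PyTorch build provides it."""
    if hasattr(F, "scaled_dot_product_attention"):
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return manual_causal_attention(q, k, v)


torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 6, 8) for _ in range(3))
fast = causal_attention(q, k, v)
slow = manual_causal_attention(q, k, v)
print(torch.allclose(fast, slow, atol=1e-5))
```

Both paths compute the same result; the fused call mainly saves memory and time because it avoids materializing the full attention matrix.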

Weight tying is used between the token embedding layer and the language model head, meaning they share the same weight matrix. This reduces the number of parameters and often improves performance.
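
To see how much weight tying saves, the parameter count of a GPT-style model can be estimated by hand. The helper below is a back-of-the-envelope sketch: `count_params` is a hypothetical function, it assumes a 4x expansion in the MLP (as in GPT-2), and the configuration values in the example call are illustrative and are not taken from this repository's defaults.

```python
def count_params(vocab_size, block_size, n_layer, n_embd, bias=False, tie_weights=True):
    """Estimate trainable parameters of a decoder-only transformer."""
    d = n_embd
    embeddings = vocab_size * d + block_size * d   # token + positional tables
    attn = d * 3 * d + d * d                       # QKV projection + output projection
    mlp = d * 4 * d + 4 * d * d                    # expand to 4d, project back to d
    norms = 2 * d                                  # two LayerNorm gains per block
    per_block = attn + mlp + norms
    if bias:
        # Biases of the four linear layers plus the two LayerNorm biases.
        per_block += 3 * d + d + 4 * d + d + 2 * d
    final_norm = d + (d if bias else 0)
    head = 0 if tie_weights else vocab_size * d    # a tied head reuses the embedding matrix
    return embeddings + n_layer * per_block + final_norm + head


# Hypothetical configuration, chosen only to illustrate the arithmetic.
tied = count_params(vocab_size=50257, block_size=128, n_layer=6, n_embd=192)
untied = count_params(vocab_size=50257, block_size=128, n_layer=6, n_embd=192, tie_weights=False)
print(f"tied: {tied:,}  untied: {untied:,}  saved: {untied - tied:,}")
```

With a GPT-2 sized vocabulary, the embedding table dominates the total at this scale, so sharing it with the output head removes a large fraction of the parameters.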

Installation

To set up the development environment, follow these steps:

Clone the repository:

git clone https://github.com/waheebedrees/Small-Language-Model.git
cd Small-Language-Model

Create a virtual environment (recommended):

python3 -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

Install dependencies:
```
pip install -r requirements.txt
```

Usage

Training the Model

To train the model, run the train.py script:

python train.py

Optional arguments:

--data-dir: Directory to store processed data (default: ./data)
--output-dir: Directory to save models and plots (default: ./models)
--force-reload: Force re-download and re-tokenize the dataset.
--no-train: Skip training and only perform setup and text generation (useful for testing the pipeline).

Example:

python train.py --output-dir ./my_trained_models --force-reload

During training, the script periodically prints the training and validation losses. The best model (based on validation loss) is saved as best_model.pt in the output directory (models/best_model.pt by default).
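
The "keep the best model" behaviour can be sketched in a few lines. This is an assumption about how such logic is commonly written, not a copy of the repository's Trainer: the model is a small stand-in layer, the validation losses are made-up numbers, and the checkpoint goes to a temporary directory.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the GPT model
ckpt_path = os.path.join(tempfile.mkdtemp(), "best_model.pt")

best_val_loss = float("inf")
val_losses = [2.9, 2.4, 2.6, 2.1, 2.3]  # made-up values for illustration
saved_at = []

for step, val_loss in enumerate(val_losses):
    # A real loop would train here, then evaluate on held-out batches.
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), ckpt_path)  # overwrite the previous best
        saved_at.append(step)

# Restore the best weights, for example before generating text.
model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
print(saved_at, best_val_loss)
```

Saving only the `state_dict` keeps the checkpoint independent of the surrounding training code, at the cost of having to rebuild the model with the same configuration before loading.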

Generating Text

After training, the script will automatically load the best saved model and demonstrate text generation with a few example prompts. You can also use the Trainer class programmatically to generate text.

Example of programmatic generation:

import torch
from src.model import GPT, GPTConfig
from src.data_loader import DataLoader
from src.trainer import Trainer
from src.config import ModelConfig, TrainingConfig, get_default_configs  # get_default_configs is assumed to be defined in src.config

# Load configurations
model_config, training_config = get_default_configs()

# Initialize data loader
data_loader = DataLoader(
    batch_size=training_config.batch_size,
    block_size=training_config.block_size,
    device=training_config.device
)

# Create model
gpt_config = GPTConfig(
    vocab_size=model_config.vocab_size,
    block_size=model_config.block_size,
    n_layer=model_config.n_layer,
    n_head=model_config.n_head,
    n_embd=model_config.n_embd,
    dropout=model_config.dropout,
    bias=model_config.bias
)
model = GPT(gpt_config)

# Initialize trainer (only for loading model and generation)
trainer = Trainer(model, data_loader, training_config.__dict__)

# Load the trained model
trainer.load_best_model("models/best_model.pt")

# Generate text
prompt = "The little bear was very"
generated_text = trainer.generate_text(prompt, max_new_tokens=100, temperature=0.7, top_k=50)
print(f"Prompt: {prompt}")
print(f"Generated: {generated_text}")
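
The temperature and top_k arguments used above control how the next token is drawn from the model's output scores. The sketch below shows one common way to implement that single sampling step. The function `sample_next_token` is hypothetical and the logits are hand-written; the repository's generate method may differ in detail.

```python
import torch
import torch.nn.functional as F


def sample_next_token(logits, temperature=1.0, top_k=None):
    """logits: (batch, vocab_size) next-token scores. Returns (batch, 1) token IDs."""
    logits = logits / temperature  # below 1 sharpens, above 1 flattens the distribution
    if top_k is not None:
        k = min(top_k, logits.size(-1))
        kth_value = torch.topk(logits, k).values[:, [-1]]  # smallest logit still kept
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)


torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])
samples = torch.cat(
    [sample_next_token(logits, temperature=0.7, top_k=2) for _ in range(200)]
)
print(sorted(samples.unique().tolist()))  # only indices 0 and 1 can appear
```

Generation repeats this step in a loop: run the model on the current sequence, take the logits at the last position, sample one token, and append it to the sequence.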

References

[1] Ronen Eldan, Yuanzhi Li. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherently? arXiv. https://arxiv.org/abs/2305.07759

[2] Andrej Karpathy. nanoGPT. GitHub repository. https://github.com/karpathy/nanoGPT

[3] Hugging Face datasets library. https://huggingface.co/docs/datasets/index

[4] OpenAI tiktoken library. https://github.com/openai/tiktoken

License

This project is licensed under the MIT License.
