
SmallLM

A small, GPT-like language model


This project provides a minimal, modular, and extensible framework for training a transformer-based, GPT-like language model and generating text with it. The model is small, at roughly 14M parameters.
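
For a rough sense of where ~14M parameters come from, here is a back-of-the-envelope estimate for a GPT-style decoder. The hyperparameter values below are illustrative assumptions, not the repository's actual configuration:

# Rough GPT-style parameter count (illustrative numbers, not the repo's config).
n_layer, d_model, vocab_size = 6, 384, 8000  # assumed hyperparameters

# Each transformer block holds roughly 4*d^2 attention weights plus
# 8*d^2 MLP weights (4x hidden expansion), i.e. ~12*d^2 per block.
block_params = 12 * n_layer * d_model**2
# Token embedding table (often weight-tied with the output head).
embed_params = vocab_size * d_model

print(f"~{(block_params + embed_params) / 1e6:.1f}M parameters")  # ~13.7M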

(Figures: model architecture and BPE tokenization)

Installation

  1. Clone the repository:

    git clone https://github.com/d1pankarmedhi/smallLM.git
    cd smallLM
  2. Install dependencies:

    pip install -r requirements.txt

Usage

1. Prepare Your Dataset

  • Place your training text file (e.g., data/data.txt) in the project directory.
  • Update train_data_path in smalllm/config/config.py if needed; a sketch of such a config follows this list.
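
The exact contents of smalllm/config/config.py are not reproduced here; a minimal config along these lines would cover steps 1 and 2. Every field besides train_data_path is an assumption:

# Hypothetical sketch of smalllm/config/config.py; only train_data_path is
# confirmed by this README, the remaining fields are assumptions.
from dataclasses import dataclass

@dataclass
class Config:
    train_data_path: str = "data/data.txt"        # training text file (step 1)
    checkpoint_path: str = "checkpoints/best.pt"  # where the best model is saved
    block_size: int = 256                         # context length in tokens
    batch_size: int = 64
    learning_rate: float = 3e-4
    max_iters: int = 5000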

2. Train the Model

python main.py train
  • Training progress and losses are logged and plotted.
  • The best model checkpoint is saved to the path specified in your config; a sketch of such a training loop follows.
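
The trainer lives in smalllm/trainer.py; its exact interface is not shown here, but a minimal sketch of what such a loop typically does (the (logits, loss) model interface, function names, and Config fields are assumptions) looks like:

# Minimal training-loop sketch; names and interfaces are assumed, not the repo's exact code.
import torch

@torch.no_grad()
def estimate_val_loss(model, val_loader, device):
    model.eval()
    losses = [model(x.to(device), y.to(device))[1].item() for x, y in val_loader]
    model.train()
    return sum(losses) / len(losses)

def train(model, train_loader, val_loader, config, device="cpu"):
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
    best_val_loss = float("inf")

    for step, (x, y) in enumerate(train_loader):
        x, y = x.to(device), y.to(device)
        _, loss = model(x, y)                  # forward pass returns (logits, loss)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        if step % 500 == 0:
            val_loss = estimate_val_loss(model, val_loader, device)
            if val_loss < best_val_loss:       # keep only the best checkpoint
                best_val_loss = val_loss
                torch.save(model.state_dict(), config.checkpoint_path)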

3. Generate Text

python main.py generate --query "Once upon a time" --max_new_tokens 100 --temperature 1.2
  • --query: the prompt to start generation from.
  • --max_new_tokens: number of tokens to generate (default: 100).
  • --temperature: sampling temperature (default: 1.5); lower values make output more predictable, higher values more varied. A sampling sketch follows this list.
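
Temperature divides the logits before sampling, so values below 1.0 sharpen the distribution and values above 1.0 flatten it. A minimal sketch of GPT-style autoregressive sampling, again assuming the model returns (logits, loss):

# Minimal temperature-sampling sketch; the model interface is an assumption.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens=100, temperature=1.5, block_size=256):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]           # crop to the context window
        logits, _ = model(idx_cond)               # (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature   # scale the final-step logits
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1) # append and continue
    return idx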

Project Structure

smallLM/
│
├── main.py                      # CLI entry point
├── README.md
├── requirements.txt
│
└── smalllm/
    ├── config/
    │   └── config.py            # Configuration class
    ├── dataset/
    │   └── text_dataset.py      # Dataset and DataLoader utilities
    ├── logger.py                # Logger utility
    ├── model/
    │   ├── __init__.py
    │   └── lm.py                # LanguageModel definition
    ├── tokenizer.py             # Tokenizer class
    └── trainer.py               # Training and plotting utilities

Customization

  • Model: edit smalllm/model/lm.py or adjust hyperparameters in smalllm/config/config.py.
  • Tokenizer: swap out or extend smalllm/tokenizer.py for a different tokenization strategy; see the sketch after this list.
  • Dataset: use any plain-text file; the loader handles splitting and batching.
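
For instance, a character-level tokenizer could stand in for the BPE one, provided it keeps the same encode/decode surface (the interface itself is an assumption based on typical GPT-style tokenizers, not taken from the repo):

# Hypothetical drop-in alternative to smalllm/tokenizer.py; the encode/decode
# interface is assumed.
class CharTokenizer:
    def __init__(self, text: str):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}
        self.vocab_size = len(chars)

    def encode(self, s: str) -> list[int]:
        return [self.stoi[c] for c in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)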

License

MIT License

Acknowledgements

Inspired by the GPT and nanoGPT projects.
