This project provides a minimal, modular, and extensible framework for training and generating text with a transformer-based, GPT-like language model. The model is small, with roughly 14M parameters.
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/smallLM.git
  cd smallLM
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Data Preparation

- Place your training text file (e.g., `data/data.txt`) in the project directory.
- Update `train_data_path` in `smalllm/config/config.py` if needed (see the config sketch below).
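For orientation, here is a minimal sketch of what `smalllm/config/config.py` might contain. Only `train_data_path` is referenced in this README; every other field name and default value below is an illustrative assumption, not the project's actual configuration.

```python
# Illustrative sketch of smalllm/config/config.py -- field names other than
# train_data_path are assumptions and may differ from the real project.
from dataclasses import dataclass


@dataclass
class Config:
    train_data_path: str = "data/data.txt"        # training text file
    checkpoint_path: str = "checkpoints/best.pt"  # hypothetical: where the best model is saved
    # Hypothetical hyperparameters for a small (~14M parameter) GPT-like model:
    block_size: int = 256        # context window length
    n_layer: int = 6             # transformer blocks
    n_head: int = 6              # attention heads per block
    n_embd: int = 384            # embedding dimension
    batch_size: int = 64
    learning_rate: float = 3e-4
    max_iters: int = 5000
```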
## Training

```bash
python main.py train
```

- Training progress and losses will be logged and plotted.
- The best model checkpoint will be saved to the path specified in your config.
## Text Generation

```bash
python main.py generate --query "Once upon a time" --max_new_tokens 100 --temperature 1.2
```

- `--query`: The prompt to start generation.
- `--max_new_tokens`: Number of tokens to generate (default: 100).
- `--temperature`: Sampling temperature (default: 1.5); see the sketch below for how it affects sampling.
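To make the `--temperature` flag concrete, the sketch below shows the standard temperature-scaled sampling step: logits are divided by the temperature before the softmax, so values above 1 flatten the distribution (more diverse output) and values below 1 sharpen it (closer to greedy decoding). This is a generic illustration, not the project's exact generation loop.

```python
import torch
import torch.nn.functional as F


def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Sample a single token id from unnormalized logits, scaled by temperature."""
    probs = F.softmax(logits / temperature, dim=-1)  # temperature > 1 flattens, < 1 sharpens
    return torch.multinomial(probs, num_samples=1).item()


# The same logits behave very differently at different temperatures:
logits = torch.tensor([2.0, 1.0, 0.1])
print(sample_next_token(logits, temperature=0.5))  # almost always index 0
print(sample_next_token(logits, temperature=1.5))  # noticeably more varied
```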
## Project Structure

```text
smallLM/
│
├── main.py                  # CLI entry point
├── README.md
├── requirements.txt
│
└── smalllm/
    ├── config/
    │   └── config.py        # Configuration class
    ├── dataset/
    │   └── text_dataset.py  # Dataset and DataLoader utilities
    ├── logger.py            # Logger utility
    ├── model/
    │   ├── __init__.py
    │   └── lm.py            # LanguageModel definition
    ├── tokenizer.py         # Tokenizer class
    └── trainer.py           # Training and plotting utilities
```
## Customization

- Model: Edit `smalllm/model/lm.py` or adjust hyperparameters in `smalllm/config/config.py`.
- Tokenizer: Swap out or extend `smalllm/tokenizer.py` for different tokenization strategies (see the sketch below).
- Dataset: Use any plain text file; the loader will handle splitting and batching.
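As an example of what a swapped-in tokenizer could look like, here is a minimal character-level tokenizer with an `encode`/`decode` interface. The class name and methods are assumptions for illustration; the real `smalllm/tokenizer.py` may expose a different API.

```python
class CharTokenizer:
    """Minimal character-level tokenizer (illustrative sketch, not the project's class)."""

    def __init__(self, text: str):
        # Vocabulary = the set of unique characters seen in the training text.
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for i, ch in enumerate(chars)}
        self.vocab_size = len(chars)

    def encode(self, s: str) -> list[int]:
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


# Usage: build the vocabulary from the training file and round-trip a sample.
with open("data/data.txt", encoding="utf-8") as f:
    text = f.read()
tok = CharTokenizer(text)
sample = text[:32]
assert tok.decode(tok.encode(sample)) == sample
```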
## License

MIT License.
## Acknowledgements

Inspired by the GPT and nanoGPT projects.

