
SmallLM

A small, GPT-like language model


This project provides a minimal, modular, and extensible framework for training a transformer-based, GPT-like language model and generating text with it. The model is small, at roughly 14M parameters.
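
For a rough sense of where ~14M parameters come from, here is a back-of-the-envelope estimate for a GPT-style decoder. The hyperparameter values below are illustrative assumptions, not the repository's actual configuration:

# Rough GPT-style parameter count (illustrative numbers, not the repo's config).
n_layer, d_model, vocab_size = 6, 384, 8000  # assumed hyperparameters

# Each transformer block holds roughly 4*d^2 attention weights plus
# 8*d^2 MLP weights (4x hidden expansion), i.e. ~12*d^2 per block.
block_params = 12 * n_layer * d_model**2
# Token embedding table (often weight-tied with the output head).
embed_params = vocab_size * d_model

print(f"~{(block_params + embed_params) / 1e6:.1f}M parameters")  # ~13.7M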

(Figures: model architecture and BPE tokenization)

Installation

  1. Clone the repository:

    git clone https://github.com/d1pankarmedhi/smallLM.git
    cd smallLM
  2. Install dependencies:

    pip install -r requirements.txt

Usage

1. Prepare Your Dataset

  • Place your training text file (e.g., data/data.txt) in the project directory.
  • Update train_data_path in smalllm/config/config.py if needed; a sketch of such a config follows this list.
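
The exact contents of smalllm/config/config.py are not reproduced here; a minimal config along these lines would cover steps 1 and 2. Every field besides train_data_path is an assumption:

# Hypothetical sketch of smalllm/config/config.py; only train_data_path is
# confirmed by this README, the remaining fields are assumptions.
from dataclasses import dataclass

@dataclass
class Config:
    train_data_path: str = "data/data.txt"        # training text file (step 1)
    checkpoint_path: str = "checkpoints/best.pt"  # where the best model is saved
    block_size: int = 256                         # context length in tokens
    batch_size: int = 64
    learning_rate: float = 3e-4
    max_iters: int = 5000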

2. Train the Model

python main.py train
  • Training progress and losses are logged and plotted.
  • The best model checkpoint is saved to the path specified in your config; a sketch of such a training loop follows.
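
The trainer lives in smalllm/trainer.py; its exact interface is not shown here, but a minimal sketch of what such a loop typically does (the (logits, loss) model interface, function names, and Config fields are assumptions) looks like:

# Minimal training-loop sketch; names and interfaces are assumed, not the repo's exact code.
import torch

@torch.no_grad()
def estimate_val_loss(model, val_loader, device):
    model.eval()
    losses = [model(x.to(device), y.to(device))[1].item() for x, y in val_loader]
    model.train()
    return sum(losses) / len(losses)

def train(model, train_loader, val_loader, config, device="cpu"):
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
    best_val_loss = float("inf")

    for step, (x, y) in enumerate(train_loader):
        x, y = x.to(device), y.to(device)
        _, loss = model(x, y)                  # forward pass returns (logits, loss)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        if step % 500 == 0:
            val_loss = estimate_val_loss(model, val_loader, device)
            if val_loss < best_val_loss:       # keep only the best checkpoint
                best_val_loss = val_loss
                torch.save(model.state_dict(), config.checkpoint_path)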

3. Generate Text

python main.py generate --query "Once upon a time" --max_new_tokens 100 --temperature 1.2
  • --query: the prompt to start generation from.
  • --max_new_tokens: number of tokens to generate (default: 100).
  • --temperature: sampling temperature (default: 1.5); lower values make output more predictable, higher values more varied. A sampling sketch follows this list.
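
Temperature divides the logits before sampling, so values below 1.0 sharpen the distribution and values above 1.0 flatten it. A minimal sketch of GPT-style autoregressive sampling, again assuming the model returns (logits, loss):

# Minimal temperature-sampling sketch; the model interface is an assumption.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens=100, temperature=1.5, block_size=256):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]           # crop to the context window
        logits, _ = model(idx_cond)               # (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature   # scale the final-step logits
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1) # append and continue
    return idx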

Project Structure

smallLM/
│
├── main.py                      # CLI entry point
├── README.md
├── requirements.txt
│
└── smalllm/
    ├── config/
    │   └── config.py            # Configuration class
    ├── dataset/
    │   └── text_dataset.py      # Dataset and DataLoader utilities
    ├── logger.py                # Logger utility
    ├── model/
    │   ├── __init__.py
    │   └── lm.py                # LanguageModel definition
    ├── tokenizer.py             # Tokenizer class
    └── trainer.py               # Training and plotting utilities

Customization

  • Model: edit smalllm/model/lm.py or adjust hyperparameters in smalllm/config/config.py.
  • Tokenizer: swap out or extend smalllm/tokenizer.py for a different tokenization strategy; see the sketch after this list.
  • Dataset: use any plain-text file; the loader handles splitting and batching.
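
For instance, a character-level tokenizer could stand in for the BPE one, provided it keeps the same encode/decode surface (the interface itself is an assumption based on typical GPT-style tokenizers, not taken from the repo):

# Hypothetical drop-in alternative to smalllm/tokenizer.py; the encode/decode
# interface is assumed.
class CharTokenizer:
    def __init__(self, text: str):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}
        self.vocab_size = len(chars)

    def encode(self, s: str) -> list[int]:
        return [self.stoi[c] for c in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)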

License

MIT License

Acknowledgements

Inspired by the GPT and nanoGPT projects.
