> **Note**
> pyLLM is an experimental Large Language Model (LLM) framework implemented from scratch using NumPy. Expect bugs.
## Features

- **Custom Tokenizer**: Implements word-based tokenization with `<PAD>`, `<UNK>`, and `<EOS>` tokens.
- **Transformer Model**: Supports multi-head attention, feed-forward networks, and positional encoding.
- **Efficient Training Pipeline**: Uses `numba`-optimized softmax and cross-entropy loss for faster training (see the sketch after this list).
- **Text Generation**: Implements token sampling with temperature scaling and a repetition penalty.
- **Minimal Dependencies**: Uses `numpy` and `numba` for efficient numerical computation.
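The training-pipeline bullet above refers to code in `training.py` that is not reproduced in this README. The sketch below shows one way a `numba`-JIT-compiled softmax and cross-entropy could be written; the function names, signatures, and the `(batch, vocab)` logits shape are illustrative assumptions, not the project's actual API.

```python
import numpy as np
from numba import njit

@njit(cache=True)
def softmax(logits):
    # Row-wise softmax over a (batch, vocab) array, shifted by the row max
    # for numerical stability.
    out = np.empty_like(logits)
    for i in range(logits.shape[0]):
        row = logits[i] - np.max(logits[i])
        exps = np.exp(row)
        out[i] = exps / exps.sum()
    return out

@njit(cache=True)
def cross_entropy(logits, targets):
    # Mean negative log-likelihood of the target token ids.
    probs = softmax(logits)
    loss = 0.0
    for i in range(targets.shape[0]):
        loss -= np.log(probs[i, targets[i]] + 1e-9)
    return loss / targets.shape[0]
```

Writing the loops explicitly lets `numba` compile them to machine code, which is the kind of optimization the feature list refers to.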
## Requirements

- Python 3.8 or higher
- `pip` package manager
## Installation

```bash
git clone https://github.com/rustyspottedcatt/pyLLM.git
cd pyLLM
pip install -r requirements.txt
```

## Usage

### Tokenization

```python
from tokenizer import Tokenizer
vocab = {"hello": 0, "world": 1, "<UNK>": 2, "<PAD>": 3, "<EOS>": 4}
tokenizer = Tokenizer(vocab)
text = "hello world!"
tokens = tokenizer.tokenize(text)
print(tokens) # Output: ['hello', 'world', '!']
token_ids = tokenizer.encode(tokens)
print(token_ids) # Output: [0, 1, 2]
decoded_text = tokenizer.decode(token_ids)
print(decoded_text) # Output: "hello world <UNK>"from vocab import build_vocab
### Building a Vocabulary

```python
from vocab import build_vocab

corpus = "Hello world! This is a simple LLM implementation."
vocab = build_vocab(corpus, vocab_size=5000)
print(f"Vocabulary Size: {len(vocab)}")from main import train_model, Transformer
### Training the Model

```python
from main import train_model, Transformer
import numpy as np
# Define Model Parameters
vocab_size = 5000
embed_dim = 512
max_len = 128
num_heads = 8
num_layers = 6
hidden_dim = 1024
# Initialize Model
model = Transformer(vocab_size, embed_dim, max_len, num_heads, num_layers, hidden_dim)
# Dummy Training Data
data = [(np.array([1, 2, 3]), np.array([2, 3, 4]))]
# Train the Model
train_model(model, data, vocab_size, epochs=10, lr=0.001, debug=True)
```
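Note that the dummy pair above is simply the input sequence shifted one token to the right, which is the standard next-token-prediction setup. Below is a small sketch of how a longer stream of token ids could be cut into such `(input, target)` pairs; the helper name and window size are illustrative, not part of the project.

```python
import numpy as np

def make_training_pairs(token_ids, seq_len=128):
    # Slice a long id sequence into (input, target) pairs where the target
    # is the input shifted one position to the right.
    pairs = []
    for start in range(0, len(token_ids) - seq_len, seq_len):
        chunk = np.array(token_ids[start:start + seq_len + 1])
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs
```

Pairs built this way could be passed to `train_model` in place of the dummy `data` list above.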
### Generating Text

```python
from tokenizer import Tokenizer
from transformer import Transformer
# Load Tokenizer and Model
tokenizer = Tokenizer(vocab)
model = Transformer(vocab_size, embed_dim, max_len, num_heads, num_layers, hidden_dim)
# Generate Text
prompt = "Hello world"
prompt_tokens = tokenizer.tokenize(prompt)
prompt_ids = tokenizer.encode(prompt_tokens)
generated_ids = model.generate(prompt_ids, max_len=20, tokenizer=tokenizer, temperature=1.0)
generated_text = tokenizer.decode(generated_ids)
print("Generated Text:", generated_text)-
## Project Structure

- **Tokenizer (`tokenizer.py`)**: Implements text tokenization, encoding, and decoding.
- **Vocabulary Builder (`vocab.py`)**: Creates a vocabulary from a given corpus.
- **Transformer Model (`transformer.py`)**: Implements a Transformer with multi-head attention and feed-forward networks.
- **Training Pipeline (`training.py`)**: Uses `numba` to optimize softmax and cross-entropy loss calculations.
- **Main Script (`main.py`)**: Loads the dataset, preprocesses text, initializes the model, and runs training.
## Dependencies

```toml
numpy = "^1.21.0"
numba = "^0.54.0"
datasets = "^2.0.0"
```

## License

Distributed under the MIT License.