L1 is a transformer-based large language model implementation built from scratch using PyTorch. This project provides a complete pipeline for training, fine-tuning, and deploying your own language model with comprehensive documentation and best practices.
- Custom Transformer Architecture: Complete implementation with multi-head attention, feed-forward networks, and positional embeddings
- GPU Accelerated Training: Full CUDA support with RTX 50.. Series optimization and mixed precision training
- Advanced Checkpointing: Automatic saves every 1000 steps with intelligent cleanup and seamless resume capability
- Memory Optimization: Gradient checkpointing, model compilation, and memory-efficient training for high-end GPUs
- BPE Tokenization: Byte Pair Encoding implementation from scratch for intelligent text understanding
- Intelligence-Optimized: 32k vocabulary with subword tokenization for coherent reasoning
- Flexible Training Pipeline: Support for pretraining, fine-tuning, and distributed training
- Model Serving: REST API for text generation and inference
- Configuration Management: YAML-based configuration system for easy experimentation
- Comprehensive Logging: Training metrics, tensorboard integration, and monitoring
- Production Ready: Optimized for both research and deployment
L1/
βββ π οΈ tools/ # User-facing command-line tools
β βββ train.py # Main training script (GPU-optimized)
β βββ generate.py # Text generation and inference
β βββ demo.py # Interactive model demonstration
β βββ validate.py # Setup validation and testing
βββ π data_tools/ # Dataset management utilities
β βββ add_dataset.py # Dataset adding and preset management
β βββ prepare_dataset.py # Dataset preparation with BPE tokenization
β βββ download_preset.py # Automated dataset downloads
β βββ download_wikipedia.py # Wikipedia dataset downloader
β βββ fix_tokenizer.py # Tokenizer repair and optimization
βββ π§ utils/ # Project utilities and helpers
β βββ dataset_manager.py # Dataset management functions
β βββ warning_manager.py # Warning and error management
βββ π src/ # Core library source code
β βββ models/ # Model architectures (transformer, config, embeddings)
β βββ training/ # Training pipeline (trainer, optimizer, loss)
β βββ data/ # Data processing (tokenizer, dataset, preprocessing)
β βββ utils/ # Core utilities (logging, device management)
βββ βοΈ configs/ # Configuration files (YAML)
βββ π scripts/ # Legacy development scripts
βββ π data/ # Dataset storage (raw and processed)
βββ π§ͺ tests/ # Unit tests and validation
βββ ποΈ models/ # Trained model storage
βββ π docs/ # Documentation and guides
βββ train_minimal.py # Minimal training example (educational)
βββ quick_setup.bat # Windows quick setup script
βββ requirements.txt # Python dependencies
- Python 3.8+
- CUDA 12.8+ (for GPU training)
- 16GB+ RAM (32GB recommended for large models)
- Modern GPU (RTX 4060+, tested on RTX 5060 Ti 16GB )
-
Clone the repository:
git clone https://github.com/juliuspleunes4/L1 cd L1 -
Create virtual environment:
python -m venv l1_env l1_env\Scripts\activate # Windows # source l1_env/bin/activate # Linux/Mac
-
Install PyTorch with CUDA support:
# For CUDA 12.1+ (RTX 5060+ Optimised, also works on the 40+ series) pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # Or for CPU-only training (slower) pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-
Install remaining dependencies:
pip install -r requirements.txt
-
Verify installation (optional):
# Test data preparation (should work immediately) python data_tools/add_dataset.py --help # Test GPU setup (requires PyTorch) python -c "import torch; print(f'CUDA Available: {torch.cuda.is_available()}')" python -c "import torch; print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"None\"}')"
-
Run validation test:
- Note: The test will most likely fail 3 out of the 5, because of CUDA problems. This can be safely ignored.
python tools/validate.py
Ready to train your own intelligent language model? Here's the fastest way:
# Step 1: Get high-quality training data (500k samples)
python data_tools/add_dataset.py --preset advanced
# Step 2: Prepare the dataset with BPE tokenization (for intelligence)
python data_tools/prepare_dataset.py data/raw/combined_dataset.txt --vocab-size 32000
# Step 3: Start GPU training (resume-capable)
python tools/train.py
# Step 4: Generate text with your trained model
python tools/generate.py --model_path models/l1-gpu-compatible --prompt "The future of AI is"That's it! The preset automatically downloads Wikipedia + ArXiv papers, and the BPE tokenization creates a 32k vocabulary for intelligent text understanding and generation.
- BPE Tokenization: 32,000 subword tokens with special tokens for intelligent text understanding
- Stable Architecture: 12 layers optimized for reliability (134M parameters)
- Coherent Generation: Produces meaningful sentences and coherent text
- Lightning Fast: 108x speed improvement (18minβ10sec per 100 steps)
- Auto-Resume: Training automatically resumes from latest checkpoint
- Demo the Project: Run
python tools/demo.pyto test all components - Customize Training: Edit
configs/train_config_gpu.yamlfor your hardware - Add Custom Data: See the Data Preparation section below for advanced options
- Monitor Training: Use
tail -f models/l1-gpu-compatible/training.log
L1 implements a decoder-only transformer architecture with:
- Multi-Head Self-Attention: Configurable number of attention heads with causal masking
- Feed-Forward Networks: Position-wise fully connected layers with GELU activation
- Layer Normalization: Pre-norm architecture for training stability
- Positional Encoding: Learnable positional embeddings
- Residual Connections: Skip connections for gradient flow
- BPE Tokenization: Byte Pair Encoding implementation from scratch
| Model | Layers | Heads | Embedding | Parameters | GPU Memory | Use Case | Config File |
|---|---|---|---|---|---|---|---|
| Small | 6 | 8 | 512 | ~25M | 4GB | Experiments | train_config.yaml |
| L1 Stable | 12 | 12 | 768 | ~134M | 8GB | Stable Training | train_config_gpu.yaml |
| L1 Large | 16 | 16 | 1024 | ~220M | 12GB | Advanced (experimental) | Custom config |
Note: L1 Stable model uses BPE tokenization (32k vocab) with conservative settings to prevent system freezing. Still intelligent but prioritizes stability.
The model and training parameters are configured via YAML files in the configs/ directory:
model:
vocab_size: 32000 # BPE tokenization for intelligence (overwritten by actual tokenizer size during training)
max_seq_length: 512 # Conservative for system stability
n_layers: 12 # Balanced depth for stable training
n_heads: 12 # Sufficient attention heads
n_embd: 768 # Good embedding size
n_inner: 3072 # 4x embedding size
dropout: 0.1
training:
num_epochs: 10 # Reasonable training duration
batch_size: 4 # Conservative batch size
learning_rate: 0.0001 # Stable learning rate
mixed_precision: true # Memory efficiency
checkpoint_every_steps: 1000 # Every ~2 minutes
max_checkpoints_to_keep: 5 # Auto-cleanup
gradient_accumulation_steps: 4 # Effective batch size 16model:
vocab_size: 50257
max_seq_length: 512
n_layers: 6
n_heads: 8
n_embd: 512
dropout: 0.1
training:
num_epochs: 3
batch_size: 4
learning_rate: 5e-4L1 includes a powerful dataset management system that makes adding datasets incredibly easy. You have 15+ pre-configured datasets ready to use, plus simple ways to add your own.
Choose from curated datasets in datasets.yaml:
# Advanced: Comprehensive training (recommended)
python data_tools/add_dataset.py --preset advanced
# β Wikipedia Simple + ArXiv Papers (500k samples)
# Beginner: Quick training with high-quality data
python data_tools/add_dataset.py --preset beginner
# β Wikipedia Simple + News (50k samples)
# Intermediate: Balanced training
python data_tools/add_dataset.py --preset intermediate
# β Wikipedia + Books + News (150k samples)
# Specialized presets
python data_tools/add_dataset.py --preset conversational # Reddit + Twitter + Wikipedia
python data_tools/add_dataset.py --preset technical # GitHub + Stack Overflow + Papers
python data_tools/add_dataset.py --preset knowledge # Full Wikipedia + Papers + BooksWhat happens when you run a preset:
- π Downloads the specified datasets automatically
- π Combines them into a single training file
- β
Processes and saves to
data/processed/for training
| Dataset | Samples | Quality | Topics | Use Case |
|---|---|---|---|---|
| Wikipedia Simple | 100k | High | Encyclopedia | Current default |
| Wikipedia Full | 500k | Very High | Comprehensive | Large-scale training |
| ArXiv Papers | 150k | Very High | Scientific | Technical knowledge |
| Project Gutenberg | 80k | Very High | Literature | Creative writing |
| Stack Overflow | 100k | High | Programming | Code understanding |
| Reddit Comments | 200k | Medium | Conversation | Chat/dialogue |
| News Articles | 50k | High | Current events | Factual knowledge |
| OpenWebText | 500k | High | General web | GPT-style training |
- Find your Kaggle dataset: Go to kaggle.com, find your dataset
- Edit
datasets.yaml: Add your dataset configuration
# Example: Adding a new dataset
your_awesome_dataset:
name: "Your Dataset Name"
description: "What this dataset contains"
download_method: "kagglehub"
kagglehub_path: "username/dataset-name" # From Kaggle URL
auto_detect_format: true
recommended_samples: 100000
recommended_vocab: 20000
quality: "high" # high, very_high, medium
topics: ["your", "topic", "tags"]
# Add to a preset
presets:
your_preset:
name: "Your Custom Training"
recommended_datasets: ["your_awesome_dataset", "wikipedia_simple"]
max_samples: 150000
vocab_size: 25000
description: "Your custom training mix"- Use your dataset:
# Use a specific dataset
python data_tools/add_dataset.py --dataset-id your_awesome_dataset \
--name "Your Dataset" \
--description "Description" \
--method kagglehub \
--path "username/dataset-name"
# Or use in a preset (edit datasets.yaml first)
python data_tools/add_dataset.py --preset your_presetimport kagglehub
# Download any Kaggle dataset directly
dataset_path = kagglehub.dataset_download("huggingface/squad")
dataset_path = kagglehub.dataset_download("Cornell-University/arxiv")
dataset_path = kagglehub.dataset_download("your-username/your-dataset")
# Then process with L1
python data_tools/prepare_dataset.py "path/to/downloaded/dataset.txt"# Setup Kaggle API (one time)
pip install kaggle
# Add your kaggle.json credentials to ~/.kaggle/
# Download dataset
kaggle datasets download username/dataset-name -p data/raw/
unzip data/raw/dataset-name.zip -d data/raw/
# Process with L1
python data_tools/prepare_dataset.py data/raw/your-extracted-file.txt# Prepare your own text files
python data_tools/prepare_dataset.py data/raw/your_text.txt
# Or use the scripts directory
python scripts/prepare_data.py \
--input data/raw/your_text.txt \
--output data/processed/For beginners:
- Start with
wikipedia_simple(current default) - high quality, manageable size - Add
news_allfor current events knowledge
For specific use cases:
- Conversational AI:
reddit_comments+twitter_sentiment - Technical/Code:
code_stackoverflow+papers_arxiv - Creative Writing:
books_gutenberg+books_openlib - Scientific:
papers_arxiv+papers_pubmed
For production models:
- Combine multiple high-quality sources
- Use
openwebtextfor general knowledge - Include domain-specific data for your use case
After adding a dataset, verify it's working:
# Check dataset info
python utils/dataset_manager.py --info your_dataset
# Preview samples
python utils/dataset_manager.py --preview your_dataset --samples 5
# Validate format
python utils/dataset_manager.py --validate your_datasetTrain with GPU acceleration and advanced optimizations:
# RTX 50.. Series optimized training
python tools/train.py
# Monitor progress
tail -f models/l1-gpu-compatible/training.logGPU Training Features:
- β Mixed Precision: Automatic FP16 for 2x speed improvement
- β Lightning Fast: 10+ steps/second
- β Gradient Checkpointing: Memory-efficient training for large models
- β Smart Checkpointing: Save every 1000 steps (~2 minutes) with auto-cleanup
- β Best Model Tracking: Automatically keeps the best checkpoint based on average loss
- β Automatic Resume: Seamless training continuation from interruptions
- β Optimized Architecture: BPE tokenization + stable 12-layer configuration
For systems without CUDA support:
python train_minimal.pyL1 provides comprehensive training monitoring with detailed logging:
# Real-time training metrics
π Training Configuration:
βββ Epochs: 10
βββ Total steps: 112,500
βββ Checkpoint every: 1000 steps (~2 minutes)
βββ Keep checkpoints: 5 latest
βββ Mixed precision: True
βββ Optimizer: AdamW
# Progress tracking with best model detection
π NEW BEST LOSS! Saving best checkpoint at step 5000 (loss: 1.2341, best: 1.2341)
πΎ Saving progress checkpoint at step 6000 (loss: 1.2456, best: 1.2341)π Comprehensive Training Log (training.log):
2025-08-12 18:06:24 | INFO | TRAINING SESSION STARTED
2025-08-12 18:06:24 | INFO | Model: 12 layers, 12 heads, Vocabulary: 32000
2025-08-12 18:06:25 | INFO | BEST | Epoch: 1 | Step: 1000 | Loss: 2.543200 | Avg_Loss: 2.521000 | Best: 2.543200 | LR: 1.00e-04
2025-08-12 18:06:26 | INFO | CHECKPOINT | Epoch: 1 | Step: 2000 | Loss: 2.456789 | Avg_Loss: 2.445000 | Best: 2.543200 | LR: 1.00e-04
2025-08-12 18:06:27 | INFO | EPOCH 1 COMPLETED | Loss: 2.456789 | Avg_Loss: 2.445000 | LR: 1.00e-04 | EPOCH_COMPLETE
Training Log Features:
- π Detailed Metrics: Timestamp, epoch, step, loss, avg_loss, learning rate for every checkpoint
- π Best Tracking: Clear marking of new best checkpoints with "BEST" indicator
- π Dual Loss Tracking: Both instantaneous loss and running average loss
- π Resume Compatible: Logs continue seamlessly across training sessions
- π Progress Analysis: Easy to track loss trends and training progress over time
- πΎ Persistent: All training history preserved in
models/[model-name]/training.log
Best Model Tracking:
- π Smart Best Detection: Tracks the best model at every checkpoint (1000 steps)
- π Automatic Updates:
best_checkpoint.ptis updated whenever loss improves - π― Generation Ready: Best checkpoint is automatically saved in
pytorch_model.binformat - π Resume Compatible: Best loss tracking continues across training sessions
Training automatically resumes from the last checkpoint:
# Same command detects and resumes automatically
python tools/train.py
# Output:
π₯ Loading checkpoint from models/l1-gpu-compatible/latest_checkpoint.pt
β
Resumed from epoch 2, step 1847, loss: 2.1432π Monitor Training Progress:
# Watch live training log
tail -f models/l1-gpu-compatible/training.log
# Check recent progress (last 20 lines)
tail -20 models/l1-gpu-compatible/training.log
# Search for best checkpoints
grep "BEST" models/l1-gpu-compatible/training.log
# Track loss progression
grep "Step:" models/l1-gpu-compatible/training.log | tail -10L1 supports various generation strategies with the simple generation script:
python tools/generate.py --prompt "The future of AI"python tools/generate.py \
--prompt "The future of artificial intelligence" \
--max_new_tokens 100 \
--temperature 0.8 \
--model_path models/l1-gpu-compatible/best_checkpoint.ptUsing the Best Checkpoint:
# Use the automatically tracked best model
python tools/generate.py --model_path models/l1-gpu-compatible --prompt "Your prompt here"
# The script automatically uses pytorch_model.bin (best checkpoint format)
# No need to specify best_checkpoint.pt directly- Temperature: Control randomness (0.1 = conservative, 1.0 = creative)
- Max New Tokens: Maximum number of tokens to generate (use
--max_new_tokens) - Model Path: Path to trained model checkpoint
Input: "The future of artificial intelligence"
Generated: "The future of artificial intelligence will be shaped by advances in
machine learning, neural networks, and computational power. These technologies
will enable more sophisticated reasoning..."
Run the test suite:
python -m pytest tests/ -vRun the demo script:
python tools/demo.py- Architecture: Detailed L1 transformer architecture
- GPU Training Guide: RTX 5060 Ti setup and optimization
- Dataset Setup: Comprehensive data preparation
- Wikipedia Setup: Wikipedia Simple English dataset guide
- Easy Datasets: Quick dataset options
- Training Guide: Advanced training techniques
- Mixed Precision: Enabled by default (
mixed_precision: true) - Lightning Fast Training: 108x speed improvement since its original design
- Gradient Checkpointing: Memory-efficient training for large models
- Conservative Settings: Stable batch size and sequence length for reliability
- BPE Tokenization: Smart 32k vocabulary for efficient learning
- Automatic Checkpointing: Saves every 1000 steps with cleanup
- GPU Cache Clearing: Automatic CUDA cache management
- Gradient Accumulation: Simulate larger batch sizes (4 steps)
- Pin Memory: Enabled for faster GPU data transfer
- Resume Capability: Automatic recovery from interruptions
- Checkpoint Cleanup: Keeps only 5 most recent checkpoints
- Error Handling: Graceful fallback to CPU on CUDA errors
- Progress Tracking: Detailed logging and monitoring
# Optimal settings for RTX 5060 Ti 16GB
training:
batch_size: 4 # Conservative for stability
mixed_precision: true # Memory efficiency
gradient_accumulation_steps: 4 # Effective batch size 16
checkpoint_every_steps: 1000 # Reasonable saves (~2 minutes)
max_checkpoints_to_keep: 5 # Auto-cleanupRun the comprehensive test suite:
python -m pytest tests/ -vTest model functionality:
python tests/test_model.pyRun the demo script:
python tools/demo.pyIf you're seeing excessive <unk> tokens or garbled output during text generation:
# Fix existing tokenizers (adds missing essential tokens)
python data_tools/fix_tokenizer.py
# Then test generation
python tools/generate.py --prompt "The future of AI is"This commonly happens with older trained models where the tokenizer was missing basic punctuation and space tokens.
Run this to check if everything is set up correctly:
python tools/validate.py1. ModuleNotFoundError: No module named 'kagglehub' or 'tokenizers'
pip install kagglehub pandas tokenizers2. --preset argument not recognized
Make sure you're using the latest version with the fixed add_dataset.py
3. Dataset download fails Some Kaggle datasets require authentication or have access restrictions. The preset will continue with available datasets.
4. PyTorch DLL load failed (Windows) Reinstall PyTorch with the correct CUDA version:
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu1215. Unicode encoding errors (Windows) The scripts are designed to handle Windows encoding. If you see Unicode errors, try running in Windows Terminal or PowerShell.
6. Text generation produces excessive <unk> tokens or garbled output
# Fix tokenizer (adds missing essential tokens like spaces and punctuation)
python data_tools/fix_tokenizer.py
# Then test
python tools/generate.py --prompt "Hello world"7. GPU out of memory
Reduce batch size in configs/train_config_gpu.yaml:
training:
batch_size: 4 # Reduce from 8 to 48. Training seems stuck or very slow
Reduce batch size in configs/train_config_gpu.yaml:
training:
batch_size: 4 # Reduce from 8 to 49. Vocabulary size mismatch errors (RuntimeError about embeddings) This happens when config files don't match the actual tokenizer vocabulary:
# Check actual tokenizer size
python -c "import json; data=json.load(open('data/processed/tokenizer.json')); print(f'Actual vocab: {len(data[\"vocab\"])}')"
# Quick fix: Add missing vocab_size field to tokenizer
python -c "
import json
with open('data/processed/tokenizer.json', 'r', encoding='utf-8') as f:
data = json.load(f)
data['vocab_size'] = len(data['vocab'])
with open('data/processed/tokenizer.json', 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=2)
print('β
Fixed tokenizer file')
"
# The training script automatically handles this, but you can manually update configs:
# Edit configs/train_config_gpu.yaml and models/*/config.json
# Set vocab_size to match the actual tokenizer size (typically 32,000 for BPE tokenization)L1 is actively being trained and improved:
- β GPU Compatibility: Full RTX 5060 Ti support with CUDA 12.8
- β Model Architecture: 140M parameter transformer (12 layers, 12 heads)
- β BPE Tokenization: 32k vocabulary for intelligent text understanding
- β Training Pipeline: Lightning-fast checkpointing every 1000 steps (~2 minutes)
- β Dataset: Wikipedia + ArXiv (500,000 samples)
- β Optimization: Mixed precision, gradient checkpointing, stable configuration
- π Current Training: Ultra-fast training with automatic resume capability
- π Performance: Excellent loss reduction and 108x speed improvement
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Make your changes
- Add tests for new functionality
- Ensure all tests pass (
python -m pytest tests/) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- "Attention Is All You Need" - Vaswani et al. (The Transformer paper)
- GPT and BERT architectures - OpenAI and Google Research
- PyTorch community - For the excellent deep learning framework
- Hugging Face - For transformers library inspiration
For questions and support:
- Issues: Open an issue on GitHub
- Discussions: Use GitHub Discussions
- Documentation: Check the
docs/directory
βThe limits of my language mean the limits of my world.β
β Ludwig Wittgenstein
Built with β€οΈ by Julius Pleunes
