Atlas is a complete pipeline for building, training, and deploying decoder-only transformer language models. The project focuses on clarity, modularity, and the ability to export trained models to GGUF format for efficient inference.
- Clean Implementation: Decoder-only transformer architecture built from scratch with PyTorch
- Complete Pipeline: Training, evaluation, inference, and export all in one place
- Memory Efficient: 8-bit optimizer support (75% memory reduction) for large models on consumer GPUs
- Reliable Checkpointing: Built-in mid-epoch checkpoint saving every N steps
- GGUF Export: Convert trained models to GGUF format for use with llama.cpp
- Modular Design: Well-organized codebase with clear separation of concerns
- Comprehensive Testing: 326 tests covering all major components
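One way to read the 75% figure for the 8-bit optimizer: assuming an Adam-style optimizer that keeps two state values per parameter, storing those values in 8 bits instead of 32 cuts optimizer-state memory by three quarters. The helper below is a hypothetical back-of-the-envelope calculation, not Atlas code; it ignores weights, gradients, activations, and the small bookkeeping overhead of quantization.

```python
def adam_state_bytes(num_params: int, bits_per_value: int) -> int:
    """Bytes needed for Adam's two per-parameter state tensors (assumed layout)."""
    states_per_param = 2  # first and second moment estimates
    return num_params * states_per_param * bits_per_value // 8


num_params = 350_000_000  # roughly the size of the DEFAULT config
fp32_bytes = adam_state_bytes(num_params, 32)
int8_bytes = adam_state_bytes(num_params, 8)
reduction = 1 - int8_bytes / fp32_bytes

print(f"32-bit optimizer states: {fp32_bytes / 1e9:.1f} GB")
print(f"8-bit optimizer states:  {int8_bytes / 1e9:.1f} GB")
print(f"Reduction:               {reduction:.0%}")
```

The saving applies to optimizer state only, so the drop in total VRAM during training is smaller than 75%.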
Under active development
Atlas is currently in early development. See docs/ROADMAP.md for the full development plan.
Completed:
- Phase 0: Project Foundation
- Phase 1: Configuration System (32 tests)
- Phase 2: Tokenizer Integration (27 tests)
- Phase 3: Model Architecture (51 tests - +9 gradient checkpointing tests)
- Phase 4: Data Pipeline (72 tests)
- Phase 5: Training Loop (62 tests - +6 auto-resume tests)
- Phase 5.5: Training Script (13 tests)
- Phase 6: Inference & Generation (21 tests)
- Phase 6.3: Inference Script (12 tests)
- Phase 7: GGUF Export (17 tests)
Skipped (for now):
- Phase 8: End-to-End Integration (individual components thoroughly tested)
Upcoming:
- Phase 9-10: Advanced features and optimization
Total: 326 passing tests
Atlas implements a decoder-only transformer architecture (GPT-style).
Key Components:
- Token & Positional Embeddings: Convert input tokens to dense vectors with positional information
- Transformer Blocks: Multi-head self-attention + feed-forward networks with residual connections
- Layer Normalization: Pre-norm architecture for training stability
- Causal Masking: Autoregressive generation (model only sees previous tokens)
- Language Model Head: Projects hidden states back to vocabulary for next-token prediction
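To make the causal masking component concrete, here is a minimal pure-Python sketch, independent of the Atlas codebase. The function names are illustrative assumptions; a real implementation would apply the same idea to PyTorch tensors inside the attention layer.

```python
import math


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring masked-out positions."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
# Attention weights for the second token: it can see tokens 0 and 1 only.
weights = masked_softmax([0.5, 1.0, 2.0, 3.0], mask[1])
print(weights)
```

Future positions receive a weight of exactly zero, which is what lets the model be trained on next-token prediction without seeing the answer.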
For detailed architecture documentation, see docs/ARCHITECTURE.md.
Atlas provides multiple pre-configured model sizes optimized for different hardware capabilities:
| Config | Parameters | Layers | Hidden Size | Heads | Sequence Length | Batch Size | VRAM Usage | Best For |
|---|---|---|---|---|---|---|---|---|
| TINY | ~40M | 8 | 512 | 8 | 512 | 4 (×4 accum) | 4-6 GB | Testing, debugging, low-end GPUs |
| SMALL | ~124M | 12 | 768 | 12 | 1024 | 24 (×2 accum) | 6-8 GB | Quick experiments, prototyping |
| DEFAULT | ~350M | 24 | 1024 | 16 | 1024 | 16 (×2 accum) | 12-14 GB | Recommended for most users |
| LARGE | ~500M | 30 | 1280 | 20 | 1024 | 8 (×4 accum) | 14-15 GB | Maximum quality, high-end GPUs |
| XLARGE | ~500M | 30 | 1280 | 20 | 1024 | 2 (×16 accum) | 8-10 GB | Max params with memory safety |
| ULTRA | ~650M | 30 | 1280 | 20 | 256 | 1 (×16 accum) | 3-5 GB | Cool & Quiet - Low GPU temp |
Configuration Details:
- TINY (configs/tiny.yaml): Ultra-lightweight for testing and development
  - Effective batch size: 16 (4 × 4 gradient accumulation)
  - Training time: ~1-2 days on RTX 3080 for 10K steps
  - Good for verifying pipeline, debugging, or running on older GPUs
- SMALL (configs/small.yaml): GPT-2 Small equivalent
  - Effective batch size: 48 (24 × 2 gradient accumulation)
  - Training time: ~3-5 days on RTX 3080 for 20K steps
  - Capable of basic text generation and learning patterns
- DEFAULT (configs/default.yaml): GPT-2 Medium equivalent (Recommended)
  - Effective batch size: 32 (16 × 2 gradient accumulation)
  - Training time: ~1-2 weeks on RTX 3080 for 50K steps
  - Optimized for RTX 5060 Ti 16GB with safe memory margin
  - Best balance of quality and training time
- LARGE (configs/large.yaml): Maximum quality
  - Effective batch size: 32 (8 × 4 gradient accumulation)
  - Training time: ~2-3 weeks on RTX 3080 for 80K steps
  - Close to 16GB VRAM limit - ensure good cooling
- XLARGE (configs/xlarge.yaml): Memory-optimized maximum size
  - Effective batch size: 32 (2 × 16 gradient accumulation)
  - Same 500M parameters as LARGE but uses 40% less VRAM
  - Training time: Similar to LARGE (~2-3 weeks for 80K steps)
  - Best choice for maximizing model size while staying within GPU limits
- ULTRA (configs/ultra.yaml): Extreme low-temperature optimization
  - Effective batch size: 64 (1 × 64 gradient accumulation)
  - Same 500M parameters, shorter sequences (256 tokens)
  - Uses absolute minimum VRAM (batch_size=1) + gradient checkpointing
  - Runs COOLEST of all configs - minimal GPU load/temperature
  - Training time: Slower than XLARGE due to extreme accumulation
  - Best for: Maximum parameters while keeping GPU cool and quiet
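The effective batch sizes quoted above are the per-step batch size multiplied by the number of gradient accumulation steps. A small sketch of that arithmetic follows, using values from the table; the function and dictionary names are illustrative and are not taken from the Atlas config schema.

```python
def effective_batch_size(batch_size: int, accumulation_steps: int) -> int:
    """Number of sequences that contribute to a single optimizer update."""
    return batch_size * accumulation_steps


# (batch size, accumulation steps, sequence length) as listed in the table.
configs = {
    "TINY": (4, 4, 512),
    "SMALL": (24, 2, 1024),
    "DEFAULT": (16, 2, 1024),
    "LARGE": (8, 4, 1024),
    "XLARGE": (2, 16, 1024),
}

for name, (batch, accum, seq_len) in configs.items():
    sequences = effective_batch_size(batch, accum)
    tokens = sequences * seq_len  # upper bound on tokens seen per optimizer update
    print(f"{name:8s} {sequences:3d} sequences/update, up to {tokens} tokens/update")
```

LARGE and XLARGE reach the same effective batch size of 32; XLARGE gets there with a smaller per-step batch, which is where its lower VRAM usage comes from.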
Choosing a Configuration:
Use the automated training script to select interactively:
.\scripts\run_pipeline.ps1 # Windows
./scripts/run_pipeline.sh # Linux/Mac
Or specify directly:
python scripts/train.py --config configs/tiny.yaml --train-data data/processed/wikipedia
Prerequisites:
- Python 3.8 or higher
- pip (Python package manager)
- Git
Installation:
- Clone the repository
git clone https://github.com/juliuspleunes4/Atlas.git
cd Atlas
- Create a virtual environment
On Windows:
python -m venv venv
.\venv\Scripts\Activate.ps1
On macOS/Linux:
python3 -m venv venv
source venv/bin/activate
- Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
- Install Atlas as a package (development mode)
pip install -e .
- Verify installation
pytest tests/ -v
The easiest way to get started - fully automated and interactive!
git clone https://github.com/juliuspleunes4/Atlas.git
cd Atlas
Get the Wikipedia SimpleEnglish dataset from Kaggle (171 MB zip file) and save it as:
Atlas/data/raw/archive.zip
# Windows
.\scripts\run_pipeline.ps1
# Linux/Mac
chmod +x scripts/run_pipeline.sh
./scripts/run_pipeline.sh
That's it! The script will:
- Check Python and create virtual environment
- Install all dependencies automatically
- Prepare the dataset (249K articles)
- Show you an interactive menu to choose model size
- Start training with your selected configuration
The script handles everything else automatically and guides you through each step!
Atlas automatically detects existing checkpoints and asks if you want to resume training:
┌────────────────────────────────────────────────────┐
│ Checkpoint found: checkpoints/atlas_step_500.pt    │
│ Step: 500                                          │
│ Epoch: 1                                           │
│ Loss: 3.456                                        │
└────────────────────────────────────────────────────┘
Resume from checkpoint? (y/n):
- Choose "y": Continue training from the checkpoint
  - Restores model weights, optimizer state, learning rate scheduler
  - Continues from the exact step and epoch where training stopped
  - Progress bar shows: Training: 0%| 500/80000 (global steps)
- Choose "n": Start a fresh training session
  - Existing checkpoints remain untouched (won't be deleted)
  - Training starts from step 0, epoch 1
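For readers curious how checkpoint detection of this kind can work, the sketch below picks the checkpoint with the highest step number from a directory. It is a hypothetical helper, not the actual Atlas implementation; the `atlas_step_<N>.pt` naming pattern is taken from the example prompt above, and the function name is an assumption.

```python
import re
from pathlib import Path
from typing import Optional

# Filename pattern assumed from the example prompt: atlas_step_500.pt
STEP_PATTERN = re.compile(r"atlas_step_(\d+)\.pt$")


def find_latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint file with the highest step number, or None."""
    best_step, best_path = -1, None
    for path in Path(checkpoint_dir).glob("atlas_step_*.pt"):
        match = STEP_PATTERN.search(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


latest = find_latest_checkpoint("checkpoints")
print(latest if latest else "No checkpoint found, starting fresh")
```

Comparing the parsed step as an integer matters: sorting filenames as strings would rank `atlas_step_900.pt` above `atlas_step_1000.pt`.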
Progress Tracking:
- Training progress displays global steps throughout the entire session
- Example: Training: 1%| 1250/80000 [06:15<8:16:32, 2.64s/it]
  - 1250/80000: Current step out of max_steps
  - 2.64s/it: Seconds per global step (1 step = gradient_accumulation_steps batches)
- Progress bar continues across epochs without resetting
This works in:
- Interactive pipeline scripts (run_pipeline.ps1, run_pipeline.sh)
- Direct training script (python scripts/train.py)
To bypass the prompt and force resumption:
python scripts/train.py --config configs/default.yaml --resume checkpoints/atlas_step_500.pt
To skip the prompt and start fresh:
python scripts/train.py --config configs/default.yaml --no-resume
If you need more control over data preparation:
# Basic usage
python scripts/prepare_data.py --input data/raw/archive.zip
# Custom output directory
python scripts/prepare_data.py --input data/raw/archive.zip --output data/processed/my_wiki
# List prepared datasets
python scripts/prepare_data.py --list
This extracts and organizes 249K articles (~400MB) into the processed directory.
# Basic training with a configuration
python scripts/train.py --config configs/default.yaml --train-data data/processed/wikipedia
# Train with validation data
python scripts/train.py \
--config configs/default.yaml \
--train-data data/processed/wikipedia \
--val-data data/processed/validation
# Resume training from a checkpoint
python scripts/train.py \
--config configs/default.yaml \
--train-data data/processed/wikipedia \
--resume checkpoints/atlas_step_5000.pt
# Override config parameters from CLI
python scripts/train.py \
--config configs/default.yaml \
--train-data data/processed/wikipedia \
--max-steps 100000 \
--save-interval 500 \
--eval-interval 1000
Available training arguments:
- --config: Path to YAML configuration file (required)
- --train-data: Path to training data directory (required)
- --val-data: Path to validation data (optional)
- --output-dir: Checkpoint directory (default: ./checkpoints)
- --resume: Resume from checkpoint path
- --max-steps: Override max training steps
- --save-interval: Save checkpoint every N steps (default: 1000)
- --eval-interval: Evaluate every N steps (default: 1000)
- --device: Device to use (default: auto-detect CUDA)
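As an illustration of how these arguments and defaults fit together, here is a minimal `argparse` sketch that mirrors the list above. It is not the source of `scripts/train.py`; the parser layout and help strings are assumptions, and only the flag names and defaults come from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser matching the documented training arguments."""
    parser = argparse.ArgumentParser(description="Illustrative training CLI")
    parser.add_argument("--config", required=True, help="Path to YAML configuration file")
    parser.add_argument("--train-data", required=True, help="Path to training data directory")
    parser.add_argument("--val-data", default=None, help="Path to validation data")
    parser.add_argument("--output-dir", default="./checkpoints", help="Checkpoint directory")
    parser.add_argument("--resume", default=None, help="Resume from checkpoint path")
    parser.add_argument("--max-steps", type=int, default=None, help="Override max training steps")
    parser.add_argument("--save-interval", type=int, default=1000, help="Save every N steps")
    parser.add_argument("--eval-interval", type=int, default=1000, help="Evaluate every N steps")
    parser.add_argument("--device", default=None, help="Device to use; auto-detect when omitted")
    return parser


# CLI values override the defaults; untouched options keep their documented defaults.
args = build_parser().parse_args([
    "--config", "configs/default.yaml",
    "--train-data", "data/processed/wikipedia",
    "--save-interval", "500",
])
print(args.save_interval, args.eval_interval, args.output_dir)
```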
Generate text from a trained model:
# Single prompt
python scripts/infer.py \
--checkpoint checkpoints/my_model/checkpoint_step_10000.pt \
--prompt "Once upon a time" \
--max-new-tokens 100 \
--temperature 0.8 \
--top-k 40 \
--do-sample
Interactive mode:
python scripts/infer.py \
--checkpoint checkpoints/my_model/best_model.pt \
--interactive
Batch generation from file:
python scripts/infer.py \
--checkpoint checkpoints/my_model/best_model.pt \
--prompts-file prompts.txt \
--output-file generated.txt \
--temperature 0.9 \
--top-p 0.95 \
--do-sample
Export trained model to GGUF format:
# Export with float32 (default)
python scripts/export_gguf.py \
--checkpoint checkpoints/my_model/best_model.pt \
--output models/atlas_model.gguf \
--quantization f32
Export with float16 for smaller file size:
python scripts/export_gguf.py \
--checkpoint checkpoints/my_model/best_model.pt \
--output models/atlas_model_f16.gguf \
--quantization f16 \
--tokenizer gpt2
Atlas/
├── atlas/            # Main package
│   ├── model/        # Model architecture (embeddings, attention, transformer blocks)
│   ├── tokenizer/    # Tokenization and vocabulary
│   ├── training/     # Training loop and optimization
│   ├── data/         # Dataset loading and preprocessing
│   ├── config/       # Configuration management
│   ├── inference/    # Text generation and sampling
│   ├── utils/        # Logging, checkpointing, metrics
│   └── export/       # Model export (GGUF, etc.)
├── tests/            # Test suite
├── scripts/          # CLI scripts
├── docs/             # Documentation
│   ├── ROADMAP.md    # Development roadmap
│   └── CHANGELOG.md  # Change log
└── README.md         # This file
Atlas implements a decoder-only transformer architecture (GPT-style):
- Multi-head self-attention with causal masking
- Feed-forward networks (MLP) with configurable activation (GELU/SiLU)
- Layer normalization (pre-norm architecture)
- Learned positional embeddings
- Weight tying between input embeddings and output projection
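The components listed above determine the parameter count almost entirely. The following sketch estimates it under stated assumptions that this README does not confirm for every config: a GPT-2 vocabulary of 50,257 tokens, a feed-forward width of 4× the hidden size, bias terms on all linear layers, and a final layer norm. With weight tying, the output projection adds no parameters of its own.

```python
def estimate_parameters(num_layers: int, hidden_size: int, max_seq_len: int,
                        vocab_size: int = 50257) -> int:
    """Rough parameter count for a GPT-2-style decoder (assumptions in the text above)."""
    d = hidden_size
    embeddings = vocab_size * d + max_seq_len * d  # token + learned positional embeddings
    attention = 4 * d * d + 4 * d                  # Q, K, V and output projections, with biases
    mlp = 8 * d * d + 5 * d                        # d -> 4d -> d feed-forward, with biases
    layer_norms = 4 * d                            # two layer norms per block (scale + shift)
    final_norm = 2 * d
    # Weight tying: the LM head reuses the token embedding matrix, so nothing is added for it.
    return embeddings + num_layers * (attention + mlp + layer_norms) + final_norm


small = estimate_parameters(num_layers=12, hidden_size=768, max_seq_len=1024)
default = estimate_parameters(num_layers=24, hidden_size=1024, max_seq_len=1024)
print(f"SMALL-like:   {small / 1e6:.0f}M parameters")
print(f"DEFAULT-like: {default / 1e6:.0f}M parameters")
```

Treat the result as an estimate: a config that uses a different vocabulary, feed-forward width, or bias setting will differ from these numbers.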
Run all tests:
pytest tests/ -v
Run specific test file:
pytest tests/test_config.py -v
Run with coverage:
pytest tests/ --cov=atlas --cov-report=html
Format code:
black atlas/ tests/ scripts/
Lint code:
flake8 atlas/ tests/ scripts/
Type checking:
mypy atlas/
Run all quality checks:
black atlas/ tests/ scripts/ && flake8 atlas/ tests/ scripts/ && mypy atlas/ && pytest tests/ -v
See docs/ROADMAP.md for the complete development plan, which includes:
- Phase 0-1: Project foundation and configuration system
- Phase 2: Tokenizer integration
- Phase 3: Model architecture implementation
- Phase 4: Data pipeline
- Phase 5: Training loop
- Phase 6: Inference and generation
- Phase 7: GGUF export
- Phase 8: End-to-end integration
- Phase 9: Optimization and refinement
- Phase 10: Documentation and release
We welcome contributions to Atlas! Whether you're fixing bugs, adding features, improving documentation, or writing tests, your help is appreciated.
- Fork the repository on GitHub
- Clone your fork locally:
git clone https://github.com/<your-username>/Atlas.git
cd Atlas
- Create a virtual environment and install dependencies (see Installation section)
- Create a new branch for your changes:
git checkout -b feature/your-feature-name
- Format your code with black:
black atlas/ tests/ scripts/
- Lint your code with flake8:
flake8 atlas/ tests/ scripts/
- Type check with mypy:
mypy atlas/
- Write tests for all new features and bug fixes
- Run the test suite to ensure nothing breaks:
pytest tests/ -v
- Check test coverage:
pytest tests/ --cov=atlas --cov-report=html
- Aim for >80% coverage on new code
- Write clear, descriptive commit messages
- Use conventional commit format:
  - feat: for new features
  - fix: for bug fixes
  - docs: for documentation changes
  - test: for test additions/changes
  - refactor: for code refactoring
  - chore: for maintenance tasks
- Update docs/CHANGELOG.md with your changes
- Add docstrings to all public functions and classes
- Update README.md if adding user-facing features
- Ensure all tests pass and code is formatted
- Commit your changes:
git add .
git commit -m "feat: add amazing new feature"
- Push to your fork:
git push origin feature/your-feature-name
- Create a Pull Request on GitHub
- Address review feedback if requested
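The conventional commit prefixes listed in the steps above can be checked mechanically before pushing. The snippet below is an optional, hypothetical helper and is not part of the Atlas tooling; it accepts only the six prefixes named in this guide and does not handle the optional scope syntax such as `feat(model):`.

```python
import re

COMMIT_TYPES = ("feat", "fix", "docs", "test", "refactor", "chore")
# A type from the list, a colon, one space, then a non-empty description.
COMMIT_PATTERN = re.compile(rf"^({'|'.join(COMMIT_TYPES)}): \S.*")


def is_conventional(message: str) -> bool:
    """Check whether the first line of a commit message uses a listed prefix."""
    first_line = message.splitlines()[0] if message else ""
    return bool(COMMIT_PATTERN.match(first_line))


print(is_conventional("feat: add amazing new feature"))
print(is_conventional("added some stuff"))
```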
Check out the roadmap for areas that need work:
- Core Features: Model architecture, training loop, inference
- Documentation: Tutorials, examples, API docs
- Tests: Increase coverage, add edge cases
- Bug Fixes: Check issues for known bugs
- Optimizations: Performance improvements, memory efficiency
- Be respectful and inclusive
- Provide constructive feedback
- Focus on the code, not the person
- Help others learn and grow
See LICENSE for details.
To activate the virtual environment:
Windows (PowerShell):
.\venv\Scripts\Activate.ps1
Windows (Command Prompt):
venv\Scripts\activate.bat
macOS/Linux:
source venv/bin/activate
To deactivate the virtual environment, simply run:
deactivate
If tests are failing:
- Ensure you've activated the virtual environment
- Make sure all dependencies are installed:
pip install -r requirements.txt
- Try running a specific test file to isolate the issue
- Check if you're using Python 3.8+:
python --version
- Check the roadmap for tasks marked as [ ] (not started)
- Look for TODO comments in the codebase
- Improve test coverage in existing modules
- Add documentation and examples
- Inspired by modern transformer architectures (GPT, LLaMA)
- GGUF format from ggerganov/llama.cpp
- Built with PyTorch