juliuspleunes4/Atlas

🌍 Atlas

Version Python 3.8+ License: MIT Code style: black Tests

A from-scratch language model implementation with GGUF export.

Atlas is a complete pipeline for building, training, and deploying decoder-only transformer language models. The project focuses on clarity, modularity, and the ability to export trained models to GGUF format for efficient inference.

✨ Features

  • 🎯 Clean Implementation: Decoder-only transformer architecture built from scratch with PyTorch
  • 🔄 Complete Pipeline: Training, evaluation, inference, and export all in one place
  • 💾 Memory Efficient: 8-bit optimizer support (about 75% less optimizer-state memory) for training large models on consumer GPUs
  • ⚡ Reliable Checkpointing: Built-in mid-epoch checkpoint saving every N steps
  • 📦 GGUF Export: Convert trained models to GGUF format for use with llama.cpp
  • 🧩 Modular Design: Well-organized codebase with clear separation of concerns
  • ✅ Comprehensive Testing: 326 tests covering all major components
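
The memory figure can be checked with some quick arithmetic. The sketch below assumes the 75% refers to optimizer state for an Adam-style optimizer, which keeps two moment estimates per parameter. The function name and the 350M parameter count are illustrative, and the small per-block quantization constants that real 8-bit optimizers store are ignored.

```python
def adam_state_bytes(num_params: int, bytes_per_value: int) -> int:
    """Bytes used by Adam-style optimizer state (two moment estimates per parameter)."""
    moments_per_param = 2  # first and second moment
    return num_params * moments_per_param * bytes_per_value


num_params = 350_000_000  # assumption: roughly the size of the DEFAULT config
fp32_state = adam_state_bytes(num_params, 4)  # 32-bit optimizer state
int8_state = adam_state_bytes(num_params, 1)  # 8-bit optimizer state
reduction = 1 - int8_state / fp32_state

print(f"32-bit state: {fp32_state / 1e9:.1f} GB")  # 2.8 GB
print(f"8-bit state:  {int8_state / 1e9:.1f} GB")  # 0.7 GB
print(f"reduction:    {reduction:.0%}")            # 75%
```

Weights, gradients, and activations are not affected by this saving, so the reduction in total VRAM use is smaller than 75%.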

📊 Project Status

🚧 Under active development 🚧

Atlas is currently in early development. See docs/ROADMAP.md for the full development plan.

Completed:

  • ✅ Phase 0: Project Foundation
  • ✅ Phase 1: Configuration System (32 tests)
  • ✅ Phase 2: Tokenizer Integration (27 tests)
  • ✅ Phase 3: Model Architecture (51 tests, including 9 gradient checkpointing tests)
  • ✅ Phase 4: Data Pipeline (72 tests)
  • ✅ Phase 5: Training Loop (62 tests, including 6 auto-resume tests)
  • ✅ Phase 5.5: Training Script (13 tests)
  • ✅ Phase 6: Inference & Generation (21 tests)
  • ✅ Phase 6.3: Inference Script (12 tests)
  • ✅ Phase 7: GGUF Export (17 tests)

Skipped (for now):

  • ⏭️ Phase 8: End-to-End Integration (individual components thoroughly tested)

Upcoming:

  • Phase 9-10: Advanced features and optimization

Total: 326 passing tests ✨

🏗️ Architecture

Atlas implements a decoder-only transformer architecture (GPT-style) with the following flow:

Atlas Architecture Diagram

Key Components:

  • Token & Positional Embeddings: Convert input tokens to dense vectors with positional information
  • Transformer Blocks: Multi-head self-attention + feed-forward networks with residual connections
  • Layer Normalization: Pre-norm architecture for training stability
  • Causal Masking: Autoregressive generation (model only sees previous tokens)
  • Language Model Head: Projects hidden states back to vocabulary for next-token prediction

For detailed architecture documentation, see docs/ARCHITECTURE.md.
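
To illustrate the Causal Masking component, here is a minimal pure-Python sketch of masked attention weights. It is not Atlas's implementation (which is built on PyTorch); the function name and the toy scores are made up for illustration.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax where position i may only attend to positions 0..i."""
    n = len(scores)
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # future positions are masked out
        peak = max(visible)     # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


# Toy example: uniform scores for a 3-token sequence.
for row in causal_attention_weights([[0.0, 0.0, 0.0]] * 3):
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, the second splits its attention over two positions, and so on. This is what makes autoregressive generation consistent with training.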

⚙️ Model Configurations

Atlas provides multiple pre-configured model sizes optimized for different hardware capabilities:

| Config | Parameters | Layers | Hidden Size | Heads | Sequence Length | Batch Size | VRAM Usage | Best For |
|---|---|---|---|---|---|---|---|---|
| TINY | ~40M | 8 | 512 | 8 | 512 | 4 (×4 accum) | 4-6 GB | Testing, debugging, low-end GPUs |
| SMALL | ~124M | 12 | 768 | 12 | 1024 | 24 (×2 accum) | 6-8 GB | Quick experiments, prototyping |
| DEFAULT | ~350M | 24 | 1024 | 16 | 1024 | 16 (×2 accum) | 12-14 GB | Recommended for most users |
| LARGE | ~500M | 30 | 1280 | 20 | 1024 | 8 (×4 accum) | 14-15 GB | Maximum quality, high-end GPUs |
| XLARGE | ~500M | 30 | 1280 | 20 | 1024 | 2 (×16 accum) | 8-10 GB | Max params with memory safety |
| ULTRA | ~650M | 30 | 1280 | 20 | 256 | 1 (×16 accum) | 3-5 GB | Cool & Quiet - Low GPU temp |

Configuration Details:

  • TINY (configs/tiny.yaml): Ultra-lightweight for testing and development
    • Effective batch size: 16 (4 × 4 gradient accumulation)
    • Training time: ~1-2 days on RTX 3080 for 10K steps
    • Good for verifying pipeline, debugging, or running on older GPUs
  • SMALL (configs/small.yaml): GPT-2 Small equivalent
    • Effective batch size: 48 (24 × 2 gradient accumulation)
    • Training time: ~3-5 days on RTX 3080 for 20K steps
    • Capable of basic text generation and learning patterns
  • DEFAULT (configs/default.yaml): GPT-2 Medium equivalent (Recommended)
    • Effective batch size: 32 (16 × 2 gradient accumulation)
    • Training time: ~1-2 weeks on RTX 3080 for 50K steps
    • Optimized for RTX 5060 Ti 16GB with safe memory margin
    • Best balance of quality and training time
  • LARGE (configs/large.yaml): Maximum quality
    • Effective batch size: 32 (8 × 4 gradient accumulation)
    • Training time: ~2-3 weeks on RTX 3080 for 80K steps
    • Close to 16GB VRAM limit - ensure good cooling
  • XLARGE (configs/xlarge.yaml): Memory-optimized maximum size
    • Effective batch size: 32 (2 × 16 gradient accumulation)
    • Same 500M parameters as LARGE but uses 40% less VRAM
    • Training time: Similar to LARGE (~2-3 weeks for 80K steps)
    • Best choice for maximizing model size while staying within GPU limits
  • ULTRA (configs/ultra.yaml): Extreme low-temperature optimization
    • Effective batch size: 64 (1 × 64 gradient accumulation)
    • Same 500M parameters, shorter sequences (256 tokens)
    • Uses absolute minimum VRAM (batch_size=1) + gradient checkpointing
    • Runs COOLEST of all configs - minimal GPU load/temperature
    • Training time: Slower than XLARGE due to extreme accumulation
    • Best for: Maximum parameters while keeping GPU cool and quiet
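
The effective batch sizes listed above are the per-device batch size multiplied by the number of gradient accumulation steps. The sketch below reproduces that arithmetic for a few of the configurations and also derives the tokens seen per optimizer update. The `BatchSetup` class and its field names are illustrative and are not taken from the Atlas config schema.

```python
from typing import NamedTuple


class BatchSetup(NamedTuple):
    batch_size: int          # sequences per forward/backward pass
    accumulation_steps: int  # micro-batches accumulated per optimizer update
    sequence_length: int     # tokens per sequence

    @property
    def effective_batch_size(self) -> int:
        return self.batch_size * self.accumulation_steps

    @property
    def tokens_per_step(self) -> int:
        return self.effective_batch_size * self.sequence_length


# Values copied from the configuration details in this README.
setups = {
    "TINY": BatchSetup(4, 4, 512),
    "SMALL": BatchSetup(24, 2, 1024),
    "DEFAULT": BatchSetup(16, 2, 1024),
    "XLARGE": BatchSetup(2, 16, 1024),
}

for name, setup in setups.items():
    print(f"{name:8s} effective batch {setup.effective_batch_size:3d}, "
          f"{setup.tokens_per_step} tokens per step")
```

Lowering `batch_size` while raising `accumulation_steps` keeps the effective batch size fixed and reduces peak activation memory, at the cost of more forward/backward passes per update. This is the trade-off between LARGE and XLARGE.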

Choosing a Configuration:

Use the automated training script to select interactively:

.\scripts\run_pipeline.ps1  # Windows
./scripts/run_pipeline.sh   # Linux/Mac

Or specify directly:

python scripts/train.py --config configs/tiny.yaml --train-data data/processed/wikipedia

🚀 Installation

📋 Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)
  • Git

🔧 Setup Instructions

  1. Clone the repository

    git clone https://github.com/juliuspleunes4/Atlas.git
    cd Atlas

  2. Create a virtual environment

    On Windows:

    python -m venv venv
    .\venv\Scripts\Activate.ps1

    On macOS/Linux:

    python3 -m venv venv
    source venv/bin/activate

  3. Install dependencies

    pip install --upgrade pip
    pip install -r requirements.txt

  4. Install Atlas as a package (development mode)

    pip install -e .

  5. Verify installation

    pytest tests/ -v

🎯 Quick Start (Recommended)

The easiest way to get started - fully automated and interactive!

Step 1: Clone the repository

git clone https://github.com/juliuspleunes4/Atlas.git
cd Atlas

Step 2: Download training data

Get the Wikipedia SimpleEnglish dataset from Kaggle (171 MB zip file)

Step 3: Place the zip file

Atlas/data/raw/archive.zip

Step 4: Run the interactive training pipeline

# Windows
.\scripts\run_pipeline.ps1

# Linux/Mac
chmod +x scripts/run_pipeline.sh
./scripts/run_pipeline.sh

That's it! 🎉 The script will:

  • ✅ Check Python and create virtual environment
  • ✅ Install all dependencies automatically
  • ✅ Prepare the dataset (249K articles)
  • ✅ Show you an interactive menu to choose model size
  • ✅ Start training with your selected configuration

The script handles everything else automatically and guides you through each step!

🔄 Checkpoint Auto-Resume

Atlas automatically detects existing checkpoints and asks if you want to resume training:

┌──────────────────────────────────────────────────────┐
│  Checkpoint found: checkpoints/atlas_step_500.pt     │
│  Step: 500                                           │
│  Epoch: 1                                            │
│  Loss: 3.456                                         │
└──────────────────────────────────────────────────────┘

Resume from checkpoint? (y/n):
  • Choose "y": Continue training from the checkpoint
    • Restores model weights, optimizer state, learning rate scheduler
    • Continues from the exact step and epoch where training stopped
    • Progress bar shows: Training: 0%| 500/80000 (global steps)
  • Choose "n": Start a fresh training session
    • Existing checkpoints remain untouched (won't be deleted)
    • Training starts from step 0, epoch 1
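
One way such detection could work is sketched below. It assumes checkpoints follow the `atlas_step_<N>.pt` naming shown in the prompt above; the `find_latest_checkpoint` helper is hypothetical and is not Atlas's actual code.

```python
import re
from typing import Iterable, Optional

# Assumption: checkpoint files are named atlas_step_<N>.pt, as in the example prompt.
STEP_PATTERN = re.compile(r"atlas_step_(\d+)\.pt$")


def find_latest_checkpoint(paths: Iterable[str]) -> Optional[str]:
    """Return the path with the highest step number, or None if there is none."""
    best_path, best_step = None, -1
    for path in paths:
        match = STEP_PATTERN.search(path)
        if match is None:
            continue  # not a step checkpoint
        step = int(match.group(1))  # compare numerically, not alphabetically
        if step > best_step:
            best_path, best_step = path, step
    return best_path


candidates = [
    "checkpoints/atlas_step_500.pt",
    "checkpoints/atlas_step_5000.pt",
    "checkpoints/notes.txt",
]
print(find_latest_checkpoint(candidates))
```

Comparing step numbers as integers matters here: a plain string sort would rank `atlas_step_900.pt` after `atlas_step_5000.pt`.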

Progress Tracking:

  • Training progress displays global steps throughout the entire session
  • Example: Training: 1%| 1250/80000 [06:15<8:16:32, 2.64s/it]
    • 1250/80000: Current step out of max_steps
    • 2.64s/it: Seconds per global step (1 step = gradient_accumulation_steps batches)
    • Progress bar continues across epochs without resetting
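
The relationship between batches and global steps can be sketched as a simple counter. This is an illustration of the "1 step = gradient_accumulation_steps batches" rule above, not the Atlas training loop; the function name is made up, and the forward, backward, and optimizer calls are left as comments.

```python
def count_global_steps(num_micro_batches: int, gradient_accumulation_steps: int) -> int:
    """Return how many global (optimizer) steps a number of micro-batches produces."""
    global_step = 0
    for micro_batch in range(1, num_micro_batches + 1):
        # A real loop would run the forward and backward pass here,
        # accumulating gradients across micro-batches.
        if micro_batch % gradient_accumulation_steps == 0:
            # A real loop would apply the optimizer update and clear gradients here.
            global_step += 1  # the progress bar advances once per update
    return global_step


print(count_global_steps(64, 16))  # 4 global steps from 64 micro-batches
```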

This works in:

  • Interactive pipeline scripts (run_pipeline.ps1, run_pipeline.sh)
  • Direct training script (python scripts/train.py)

To bypass the prompt and force resumption:

python scripts/train.py --config configs/default.yaml --resume checkpoints/atlas_step_500.pt

To skip the prompt and start fresh:

python scripts/train.py --config configs/default.yaml --no-resume

🔧 Advanced Usage

📦 Manual Data Preparation

If you need more control over data preparation:

# Basic usage
python scripts/prepare_data.py --input data/raw/archive.zip

# Custom output directory
python scripts/prepare_data.py --input data/raw/archive.zip --output data/processed/my_wiki

# List prepared datasets
python scripts/prepare_data.py --list

This extracts and organizes 249K articles (~400MB) into the processed directory.

🏋️ Manual Training

# Basic training with a configuration
python scripts/train.py --config configs/default.yaml --train-data data/processed/wikipedia

# Train with validation data
python scripts/train.py \
  --config configs/default.yaml \
  --train-data data/processed/wikipedia \
  --val-data data/processed/validation

# Resume training from a checkpoint
python scripts/train.py \
  --config configs/default.yaml \
  --train-data data/processed/wikipedia \
  --resume checkpoints/atlas_step_5000.pt

# Override config parameters from CLI
python scripts/train.py \
  --config configs/default.yaml \
  --train-data data/processed/wikipedia \
  --max-steps 100000 \
  --save-interval 500 \
  --eval-interval 1000

Available training arguments:

  • --config: Path to YAML configuration file (required)
  • --train-data: Path to training data directory (required)
  • --val-data: Path to validation data (optional)
  • --output-dir: Checkpoint directory (default: ./checkpoints)
  • --resume: Resume from checkpoint path
  • --max-steps: Override max training steps
  • --save-interval: Save checkpoint every N steps (default: 1000)
  • --eval-interval: Evaluate every N steps (default: 1000)
  • --device: Device to use (default: auto-detect CUDA)
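
For readers who want to script around these flags, the argument list above maps naturally onto `argparse`. The parser below is a sketch built only from the flags and defaults documented here. It is not a copy of `scripts/train.py`, and the help strings and `None` defaults for optional values are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented training arguments."""
    parser = argparse.ArgumentParser(description="Atlas training (illustrative sketch)")
    parser.add_argument("--config", required=True, help="Path to YAML configuration file")
    parser.add_argument("--train-data", required=True, help="Path to training data directory")
    parser.add_argument("--val-data", default=None, help="Path to validation data")
    parser.add_argument("--output-dir", default="./checkpoints", help="Checkpoint directory")
    parser.add_argument("--resume", default=None, help="Resume from checkpoint path")
    parser.add_argument("--max-steps", type=int, default=None, help="Override max training steps")
    parser.add_argument("--save-interval", type=int, default=1000, help="Save every N steps")
    parser.add_argument("--eval-interval", type=int, default=1000, help="Evaluate every N steps")
    parser.add_argument("--device", default=None, help="Device to use (None = auto-detect)")
    return parser


args = build_parser().parse_args([
    "--config", "configs/default.yaml",
    "--train-data", "data/processed/wikipedia",
    "--save-interval", "500",
])
print(args.save_interval, args.eval_interval, args.output_dir)
```

Note that `argparse` converts dashes to underscores, so `--train-data` is read back as `args.train_data`.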

💬 Inference

Generate text from a trained model:

# Single prompt
python scripts/infer.py \
  --checkpoint checkpoints/my_model/checkpoint_step_10000.pt \
  --prompt "Once upon a time" \
  --max-new-tokens 100 \
  --temperature 0.8 \
  --top-k 40 \
  --do-sample

Interactive mode:

python scripts/infer.py \
  --checkpoint checkpoints/my_model/best_model.pt \
  --interactive

Batch generation from file:

python scripts/infer.py \
  --checkpoint checkpoints/my_model/best_model.pt \
  --prompts-file prompts.txt \
  --output-file generated.txt \
  --temperature 0.9 \
  --top-p 0.95 \
  --do-sample
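
The `--temperature`, `--top-k`, and `--top-p` flags correspond to standard sampling techniques. The sketch below shows the usual way they are combined, in pure Python for readability. It is not Atlas's generation code, and the function name and the order in which the filters are applied are assumptions.

```python
import math
import random
from typing import List, Optional


def sample_next_token(
    logits: List[float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    rng=random,
) -> int:
    """Sample a token index using temperature scaling, then top-k, then top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate token indices, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:  # keep the smallest prefix whose mass reaches top_p
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalizes the remaining weights.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]


print(sample_next_token([0.1, 2.0, -1.0], temperature=0.8, top_k=1))  # always 1
```

With `top_k=1` the result is deterministic (greedy decoding); larger `top_k` or `top_p` values admit more candidates and produce more varied text.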

📦 Export to GGUF

Export trained model to GGUF format:

# Export with float32 (default)
python scripts/export_gguf.py \
  --checkpoint checkpoints/my_model/best_model.pt \
  --output models/atlas_model.gguf \
  --quantization f32

Export with float16 for smaller file size:

python scripts/export_gguf.py \
  --checkpoint checkpoints/my_model/best_model.pt \
  --output models/atlas_model_f16.gguf \
  --quantization f16 \
  --tokenizer gpt2
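
The size difference between the two options follows from the bytes stored per weight. The estimate below is a rough sketch that counts tensor data only and ignores GGUF metadata and the embedded tokenizer. The helper name is illustrative, and the 124M parameter count is the approximate SMALL figure from the configuration table.

```python
BYTES_PER_VALUE = {"f32": 4, "f16": 2}  # the two --quantization options shown above


def estimate_weight_bytes(num_params: int, quantization: str) -> int:
    """Approximate size of the weight tensors alone (no metadata or tokenizer)."""
    return num_params * BYTES_PER_VALUE[quantization]


num_params = 124_000_000  # assumption: roughly the SMALL config
for quantization in ("f32", "f16"):
    size_mb = estimate_weight_bytes(num_params, quantization) / 1e6
    print(f"{quantization}: ~{size_mb:.0f} MB")
```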

📁 Project Structure

Atlas/
├── atlas/              # Main package
│   ├── model/         # Model architecture (embeddings, attention, transformer blocks)
│   ├── tokenizer/     # Tokenization and vocabulary
│   ├── training/      # Training loop and optimization
│   ├── data/          # Dataset loading and preprocessing
│   ├── config/        # Configuration management
│   ├── inference/     # Text generation and sampling
│   ├── utils/         # Logging, checkpointing, metrics
│   └── export/        # Model export (GGUF, etc.)
├── tests/             # Test suite
├── scripts/           # CLI scripts
├── docs/              # Documentation
│   ├── ROADMAP.md    # Development roadmap
│   └── CHANGELOG.md  # Change log
└── README.md          # This file

🏗️ Architecture

Atlas implements a decoder-only transformer architecture (GPT-style):

  • Multi-head self-attention with causal masking
  • Feed-forward networks (MLP) with configurable activation (GELU/SiLU)
  • Layer normalization (pre-norm architecture)
  • Learned positional embeddings
  • Weight tying between input embeddings and output projection
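
These choices determine the parameter count. The sketch below estimates it for a GPT-2-style layout under several assumptions that this README does not confirm for Atlas: a vocabulary of 50,257 tokens (the GPT-2 tokenizer), a feed-forward width of 4× the hidden size, and biases on every linear layer. With weight tying the output projection adds no parameters of its own.

```python
def count_parameters(
    n_layers: int,
    hidden: int,
    vocab_size: int = 50_257,  # assumption: GPT-2 vocabulary
    max_seq_len: int = 1024,
    ffn_multiplier: int = 4,   # assumption: feed-forward width is 4x hidden
) -> int:
    """Estimate parameters of a GPT-2-style decoder with tied embeddings."""
    token_embeddings = vocab_size * hidden      # shared with the LM head (weight tying)
    position_embeddings = max_seq_len * hidden  # learned positional embeddings
    attention = 4 * hidden * hidden + 4 * hidden  # Q, K, V, output projections + biases
    ffn_hidden = ffn_multiplier * hidden
    mlp = 2 * hidden * ffn_hidden + ffn_hidden + hidden  # two linear layers + biases
    layer_norms = 2 * 2 * hidden  # two norms per block, scale and shift each
    per_block = attention + mlp + layer_norms
    final_norm = 2 * hidden
    return token_embeddings + position_embeddings + n_layers * per_block + final_norm


small = count_parameters(n_layers=12, hidden=768)
print(f"SMALL-like model: {small:,} parameters (~{small / 1e6:.0f}M)")
```

Under these assumptions a 12-layer, 768-wide model comes out at about 124M parameters, which matches the figure given for the SMALL configuration.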

🛠️ Development

🧪 Running Tests

Run all tests:

pytest tests/ -v

Run specific test file:

pytest tests/test_config.py -v

Run with coverage:

pytest tests/ --cov=atlas --cov-report=html

💎 Code Quality

Format code:

black atlas/ tests/ scripts/

Lint code:

flake8 atlas/ tests/ scripts/

Type checking:

mypy atlas/

Run all quality checks:

black atlas/ tests/ scripts/ && flake8 atlas/ tests/ scripts/ && mypy atlas/ && pytest tests/ -v

🗺️ Roadmap

See docs/ROADMAP.md for the complete development plan, which includes:

  • Phase 0-1: Project foundation and configuration system
  • Phase 2: Tokenizer integration
  • Phase 3: Model architecture implementation
  • Phase 4: Data pipeline
  • Phase 5: Training loop
  • Phase 6: Inference and generation
  • Phase 7: GGUF export
  • Phase 8: End-to-end integration
  • Phase 9: Optimization and refinement
  • Phase 10: Documentation and release

🤝 Contributing

We welcome contributions to Atlas! Whether you're fixing bugs, adding features, improving documentation, or writing tests, your help is appreciated.

🎬 Getting Started

  1. Fork the repository on GitHub
  2. Clone your fork locally:
    git clone https://github.com/<your-username>/Atlas.git
    cd Atlas
  3. Create a virtual environment and install dependencies (see Installation section)
  4. Create a new branch for your changes:
    git checkout -b feature/your-feature-name

📝 Development Guidelines

💎 Code Quality

  • Format your code with black:
    black atlas/ tests/ scripts/
  • Lint your code with flake8:
    flake8 atlas/ tests/ scripts/
  • Type check with mypy:
    mypy atlas/

🧪 Testing

  • Write tests for all new features and bug fixes
  • Run the test suite to ensure nothing breaks:
    pytest tests/ -v
  • Check test coverage:
    pytest tests/ --cov=atlas --cov-report=html
  • Aim for >80% coverage on new code

📝 Commit Guidelines

  • Write clear, descriptive commit messages
  • Use conventional commit format:
    • feat: for new features
    • fix: for bug fixes
    • docs: for documentation changes
    • test: for test additions/changes
    • refactor: for code refactoring
    • chore: for maintenance tasks
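
A commit message can be checked against this format before pushing. The helper below is a hypothetical sketch, not part of Atlas's tooling. It accepts the six prefixes listed above and, as an assumption borrowed from the Conventional Commits convention, an optional scope in parentheses.

```python
import re

COMMIT_TYPES = ("feat", "fix", "docs", "test", "refactor", "chore")

# <type>[(scope)]: <description>   (the scope is optional and an assumption here)
COMMIT_PATTERN = re.compile(r"^(" + "|".join(COMMIT_TYPES) + r")(\([\w-]+\))?: \S.*")


def is_conventional(message: str) -> bool:
    """Check whether the first line of a commit message follows the format."""
    first_line = message.splitlines()[0] if message else ""
    return COMMIT_PATTERN.match(first_line) is not None


for message in ("feat: add amazing new feature", "fix(tokenizer): handle empty input", "added stuff"):
    print(f"{message!r}: {is_conventional(message)}")
```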

📚 Documentation

  • Update docs/CHANGELOG.md with your changes
  • Add docstrings to all public functions and classes
  • Update README.md if adding user-facing features

🚀 Submitting Changes

  1. Ensure all tests pass and code is formatted
  2. Commit your changes:
    git add .
    git commit -m "feat: add amazing new feature"
  3. Push to your fork:
    git push origin feature/your-feature-name
  4. Create a Pull Request on GitHub
  5. Address review feedback if requested

💡 What to Contribute

Check out the roadmap for areas that need work:

  • 🔧 Core Features: Model architecture, training loop, inference
  • 📝 Documentation: Tutorials, examples, API docs
  • 🧪 Tests: Increase coverage, add edge cases
  • 🐛 Bug Fixes: Check issues for known bugs
  • ✨ Optimizations: Performance improvements, memory efficiency

🌟 Code of Conduct

  • Be respectful and inclusive
  • Provide constructive feedback
  • Focus on the code, not the person
  • Help others learn and grow

📄 License

See LICENSE for details.

❓ FAQ

How do I activate the virtual environment?

Windows (PowerShell):

.\venv\Scripts\Activate.ps1

Windows (Command Prompt):

venv\Scripts\activate.bat

macOS/Linux:

source venv/bin/activate

How do I deactivate the virtual environment?

Simply run:

deactivate

Tests are failing, what should I do?

  1. Ensure you've activated the virtual environment
  2. Make sure all dependencies are installed: pip install -r requirements.txt
  3. Try running a specific test file to isolate the issue
  4. Check if you're using Python 3.8+: python --version

Where should I start contributing?

  1. Check the roadmap for tasks marked as [ ] (not started)
  2. Look for TODO comments in the codebase
  3. Improve test coverage in existing modules
  4. Add documentation and examples

🙏 Acknowledgments
