
Local LLM Fine-Tuning Framework


A memory-efficient framework for fine-tuning large language models on consumer GPUs using QLoRA. Designed for NVIDIA RTX 5090 (Blackwell architecture) but compatible with RTX 30/40 series.

Features

  • Memory Efficient: Fine-tune 7B parameter models with only ~16GB VRAM using 4-bit quantization
  • QLoRA Training: Parameter-efficient fine-tuning with Low-Rank Adaptation
  • RTX 5090 Ready: Full support for Blackwell architecture (sm_120) via PyTorch nightly
  • Flexible: Works with Mistral, Llama, Qwen, and other HuggingFace models
  • Easy to Use: Simple CLI interface for training and inference

Quick Start

Prerequisites

  • Python 3.10+
  • NVIDIA GPU with 16GB+ VRAM
  • CUDA 12.8+ (for RTX 50 series) or CUDA 12.1+ (for RTX 30/40 series)

Installation

```bash
git clone https://github.com/amyanger/local-llm-project.git
cd local-llm-project

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

# Install PyTorch (RTX 5090 / Blackwell)
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

# Install dependencies
pip install -r requirements.txt
```

Training

```bash
# Fine-tune OpenHermes Mistral 7B on UltraChat dataset (default)
python src/train.py

# Or customize the training
python src/train.py \
    --model teknium/OpenHermes-2.5-Mistral-7B \
    --dataset HuggingFaceH4/ultrachat_200k \
    --epochs 1 \
    --max-samples 10000

# Use your own dataset
python src/train.py \
    --model teknium/OpenHermes-2.5-Mistral-7B \
    --dataset data/raw/your_dataset.jsonl \
    --epochs 3
```

Inference

```bash
# Interactive chat mode
python src/inference.py --model models/openhermes-chat

# Single prompt
python src/inference.py --model models/openhermes-chat --prompt "Explain quantum computing"
```

Web UI

```bash
# Launch Gradio chat interface (opens at http://localhost:7860)
python src/app.py

# With custom options
python src/app.py --model models/openhermes-chat --port 7860 --share
```

The web UI provides:

  • Clean chat interface with conversation history
  • Adjustable generation parameters (temperature, max tokens, etc.)
  • Custom system prompts
  • Retry/regenerate functionality

Project Structure

```
local-llm-project/
├── src/
│   ├── train.py          # QLoRA fine-tuning script
│   ├── inference.py      # Model inference and chat
│   └── app.py            # Gradio web UI
├── data/
│   ├── raw/              # Training datasets
│   └── processed/        # Preprocessed data
├── models/
│   └── checkpoints/      # Saved model weights
├── config/               # Training configurations
├── requirements.txt      # Python dependencies
└── README.md
```

Supported Models

| Model | Parameters | VRAM Required | Recommended For |
|-------|------------|---------------|-----------------|
| Mistral 7B | 7B | ~16GB | General tasks, instruction-following |
| Llama 3.1 8B | 8B | ~18GB | General purpose, large ecosystem |
| CodeLlama 7B | 7B | ~16GB | Code generation |
| Qwen 2.5 Coder 7B | 7B | ~16GB | Code + reasoning |

Dataset Format

Training data should be in JSONL format with instruction-response pairs:

```jsonl
{"instruction": "What is machine learning?", "response": "Machine learning is a subset of artificial intelligence..."}
{"instruction": "Write a Python function to sort a list", "response": "def sort_list(lst):\n    return sorted(lst)"}
```

Alternatively, pass a HuggingFace dataset name to `--dataset`; datasets that expose a text field are used directly.
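If you are preparing your own JSONL file, a small loader like the following can catch malformed lines before training starts. This is an illustrative helper assuming the instruction/response schema shown above, not a function that ships with the repo:

```python
import json

def load_jsonl(path):
    """Load instruction/response pairs from a JSONL file, validating each line."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)  # raises JSONDecodeError with the bad line's text
            if "instruction" not in record or "response" not in record:
                raise ValueError(f"line {lineno}: missing 'instruction' or 'response' key")
            pairs.append(record)
    return pairs
```

Running it over your dataset before a long training job is a cheap way to surface encoding or schema errors early.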

Training Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model` | `teknium/OpenHermes-2.5-Mistral-7B` | Base model from HuggingFace |
| `--dataset` | `HuggingFaceH4/ultrachat_200k` | JSONL file path or HuggingFace dataset name |
| `--epochs` | `1` | Number of training epochs |
| `--batch-size` | `2` | Per-device batch size |
| `--lr` | `2e-5` | Learning rate |
| `--max-samples` | `10000` | Max training samples (`0` for all) |
| `--output` | `models/openhermes-chat` | Output directory |

Technical Details

QLoRA Configuration

  • LoRA Rank (r): 16
  • LoRA Alpha: 16
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Quantization: 4-bit NF4 with double quantization
  • Compute dtype: BFloat16

Stack

  • Transformers: Model loading and tokenization
  • PEFT: Parameter-efficient fine-tuning (LoRA)
  • BitsAndBytes: 4-bit quantization
  • TRL: Supervised fine-tuning trainer
  • Datasets: Data loading and processing

Hardware Requirements

Minimum

  • NVIDIA GPU with 16GB VRAM (RTX 4080, 3090, etc.)
  • 32GB System RAM
  • 50GB Storage

Recommended

  • NVIDIA RTX 5090 (32GB VRAM)
  • 64GB System RAM
  • 100GB+ NVMe Storage

RTX 5090 Notes

The RTX 5090 uses Blackwell architecture (sm_120) which requires:

  • PyTorch nightly build with CUDA 12.8
  • Latest NVIDIA drivers (560+)
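To confirm your install actually sees the Blackwell GPU, you can map the compute-capability tuple PyTorch reports to its `sm_` code. The helper below is illustrative, not part of the repo:

```python
def sm_code(capability):
    """Format a (major, minor) compute-capability tuple as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"

# With PyTorch installed, check the detected architecture:
#   import torch
#   print(sm_code(torch.cuda.get_device_capability()))  # RTX 5090 -> "sm_120"
```

If this prints an architecture your installed PyTorch build was not compiled for, kernels will fail to launch; that is the symptom the CUDA 12.8 nightly build fixes for `sm_120`.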

Results

Fine-tuning Mistral 7B on 10K instruction pairs:

| Metric | Before | After |
|--------|--------|-------|
| Training Loss | 2.1 | 0.8 |
| Perplexity | 8.2 | 2.2 |

Results vary based on dataset quality and training duration.
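The two rows are directly related: perplexity is the exponential of the mean cross-entropy loss (assuming the loss is reported in nats per token), which you can verify against the table:

```python
import math

def perplexity(mean_ce_loss):
    """Perplexity is exp(mean cross-entropy loss), with the loss in nats/token."""
    return math.exp(mean_ce_loss)

print(round(perplexity(2.1), 1))  # 8.2 (before fine-tuning)
print(round(perplexity(0.8), 1))  # 2.2 (after fine-tuning)
```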

Roadmap

  • Add DPO (Direct Preference Optimization) support
  • Implement evaluation benchmarks
  • Add multi-GPU training with DeepSpeed
  • Support for vision-language models

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


Built for the AI/ML community. Star this repo if you find it useful!
