A memory-efficient framework for fine-tuning large language models on consumer GPUs using QLoRA. Designed for NVIDIA RTX 5090 (Blackwell architecture) but compatible with RTX 30/40 series.
- Memory Efficient: Fine-tune 7B parameter models with only ~16GB VRAM using 4-bit quantization
- QLoRA Training: Parameter-efficient fine-tuning with Low-Rank Adaptation
- RTX 5090 Ready: Full support for Blackwell architecture (sm_120) via PyTorch nightly
- Flexible: Works with Mistral, Llama, Qwen, and other HuggingFace models
- Easy to Use: Simple CLI interface for training and inference
- Python 3.10+
- NVIDIA GPU with 16GB+ VRAM
- CUDA 12.8+ (for RTX 50 series) or CUDA 12.1+ (for RTX 30/40 series)
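To see why 4-bit quantization brings a 7B model within a 16GB budget, here is a back-of-envelope estimate. All figures (LoRA fraction, activation overhead) are rough illustrative assumptions, not measurements from this project:

```python
# Back-of-envelope VRAM estimate for 4-bit QLoRA fine-tuning.
# The LoRA fraction and activation overhead are rough assumptions;
# real usage depends on sequence length, batch size, and framework overhead.

def qlora_vram_estimate_gb(n_params: float, lora_frac: float = 0.01) -> float:
    """Approximate peak VRAM in GB for 4-bit QLoRA fine-tuning."""
    base_weights = n_params * 0.5 / 1e9              # 4-bit = 0.5 bytes/param
    lora_weights = n_params * lora_frac * 2 / 1e9    # LoRA adapters in bf16 (2 bytes)
    optimizer = lora_weights * 2 * 2                 # Adam keeps two states per trainable param
    activations_overhead = 6.0                       # activations, gradients, CUDA context (rough)
    return base_weights + lora_weights + optimizer + activations_overhead

print(f"{qlora_vram_estimate_gb(7e9):.1f} GB")  # roughly 10 GB of the ~16 GB budget
```

Only the frozen 4-bit base weights dominate; the trainable LoRA parameters and their optimizer states are comparatively tiny, which is what makes the technique fit on consumer GPUs.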
git clone https://github.com/amyanger/local-llm-project.git
cd local-llm-project
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
# Install PyTorch (RTX 5090 / Blackwell)
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
# Install dependencies
pip install -r requirements.txt

# Fine-tune OpenHermes Mistral 7B on UltraChat dataset (default)
python src/train.py
# Or customize the training
python src/train.py \
--model teknium/OpenHermes-2.5-Mistral-7B \
--dataset HuggingFaceH4/ultrachat_200k \
--epochs 1 \
--max-samples 10000
# Use your own dataset
python src/train.py \
--model teknium/OpenHermes-2.5-Mistral-7B \
--dataset data/raw/your_dataset.jsonl \
--epochs 3

# Interactive chat mode
python src/inference.py --model models/openhermes-chat
# Single prompt
python src/inference.py --model models/openhermes-chat --prompt "Explain quantum computing"

# Launch Gradio chat interface (opens at http://localhost:7860)
python src/app.py
# With custom options
python src/app.py --model models/openhermes-chat --port 7860 --share

The web UI provides:
- Clean chat interface with conversation history
- Adjustable generation parameters (temperature, max tokens, etc.)
- Custom system prompts
- Retry/regenerate functionality
local-llm-project/
├── src/
│ ├── train.py # QLoRA fine-tuning script
│ ├── inference.py # Model inference and chat
│ └── app.py # Gradio web UI
├── data/
│ ├── raw/ # Training datasets
│ └── processed/ # Preprocessed data
├── models/
│ └── checkpoints/ # Saved model weights
├── config/ # Training configurations
├── requirements.txt # Python dependencies
└── README.md
| Model | Parameters | VRAM Required | Recommended For |
|---|---|---|---|
| Mistral 7B | 7B | ~16GB | General tasks, instruction-following |
| Llama 3.1 8B | 8B | ~18GB | General purpose, large ecosystem |
| CodeLlama 7B | 7B | ~16GB | Code generation |
| Qwen 2.5 Coder 7B | 7B | ~16GB | Code + reasoning |
Training data should be in JSONL format with instruction-response pairs:
{"instruction": "What is machine learning?", "response": "Machine learning is a subset of artificial intelligence..."}
{"instruction": "Write a Python function to sort a list", "response": "def sort_list(lst):\n return sorted(lst)"}

Or use HuggingFace datasets with a text field directly.
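A minimal stdlib-only sketch for checking that a custom dataset file matches this schema before training; the field names follow the examples above:

```python
import json
from pathlib import Path

def validate_jsonl(path: str) -> int:
    """Check each line is valid JSON with non-empty instruction/response fields.

    Returns the number of valid records; raises on a malformed line.
    """
    count = 0
    for lineno, line in enumerate(Path(path).read_text(encoding="utf-8").splitlines(), 1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)  # raises json.JSONDecodeError on invalid JSON
        for field in ("instruction", "response"):
            if not str(record.get(field, "")).strip():
                raise ValueError(f"line {lineno}: missing or empty {field!r}")
        count += 1
    return count
```

Running this over `data/raw/your_dataset.jsonl` before launching training surfaces formatting problems early instead of partway through a long run.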
| Parameter | Default | Description |
|---|---|---|
| --model | teknium/OpenHermes-2.5-Mistral-7B | Base model from HuggingFace |
| --dataset | HuggingFaceH4/ultrachat_200k | Path to JSONL or HuggingFace dataset |
| --epochs | 1 | Number of training epochs |
| --batch-size | 2 | Per-device batch size |
| --lr | 2e-5 | Learning rate |
| --max-samples | 10000 | Max training samples (0 for all) |
| --output | models/openhermes-chat | Output directory |
- LoRA Rank (r): 16
- LoRA Alpha: 16
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Quantization: 4-bit NF4 with double quantization
- Compute dtype: BFloat16
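The configuration above maps onto the standard `peft` and `transformers` objects roughly as follows. This is a sketch, not the project's actual code: the argument names are those of `transformers.BitsAndBytesConfig` and `peft.LoraConfig`, and the dropout value is an illustrative assumption not stated in this README:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on all attention and MLP projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,   # assumption: not specified in this README
    bias="none",
    task_type="CAUSAL_LM",
)
```

`bnb_config` is passed to the model's `from_pretrained` call and `lora_config` to the trainer; the base weights stay frozen in 4-bit while only the LoRA adapters train in bf16.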
- Transformers: Model loading and tokenization
- PEFT: Parameter-efficient fine-tuning (LoRA)
- BitsAndBytes: 4-bit quantization
- TRL: Supervised fine-tuning trainer
- Datasets: Data loading and processing
- NVIDIA GPU with 16GB VRAM (RTX 4080, 3090, etc.)
- 32GB System RAM
- 50GB Storage
- NVIDIA RTX 5090 (32GB VRAM)
- 64GB System RAM
- 100GB+ NVMe Storage
The RTX 5090 uses Blackwell architecture (sm_120) which requires:
- PyTorch nightly build with CUDA 12.8
- Latest NVIDIA drivers (560+)
Fine-tuning Mistral 7B on 10K instruction pairs:
| Metric | Before | After |
|---|---|---|
| Training Loss | 2.1 | 0.8 |
| Perplexity | 8.2 | 2.2 |
Results vary based on dataset quality and training duration.
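The two rows are internally consistent: perplexity is simply the exponential of the cross-entropy training loss, which is easy to verify:

```python
import math

# Perplexity = exp(cross-entropy loss), so the table's loss and
# perplexity columns should agree up to rounding.
for loss in (2.1, 0.8):
    print(f"loss {loss} -> perplexity {math.exp(loss):.1f}")
# loss 2.1 -> perplexity 8.2
# loss 0.8 -> perplexity 2.2
```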
- Add DPO (Direct Preference Optimization) support
- Implement evaluation benchmarks
- Add multi-GPU training with DeepSpeed
- Create web UI for inference
- Support for vision-language models
MIT License - see LICENSE for details.
Contributions are welcome! Please feel free to submit a Pull Request.
Built for the AI/ML community. Star this repo if you find it useful!