mwheeler235/tune-llm

LLM Fine-Tuning with LoRA

Can a Small Language Model (SLM) be tuned to "Explain it to me like I'm 5?"

A minimal implementation for fine-tuning small language models using LoRA (Low-Rank Adaptation) on custom datasets. Because training ran on a single Apple Silicon machine using MPS, with no distributed computing, a small language model (SLM) was the practical choice. I explored 'EleutherAI/gpt-neo-1.3B' but settled on 'TinyLlama/TinyLlama-1.1B-Chat-v1.0' for faster training, faster inference, and better tuning results. I learned many nuances during this project. Some of my key takeaways:

  1. SLMs are not well suited to being fine-tuned for multi-step answers. I initially wanted to tune the model to become my Running Coach, able to plan out 16 to 20 runs over 4 weeks. I tried several SLMs and many parameter configurations but got total nonsense on every attempt, so I moved on to the simpler ELI5 use case. I would love to revisit my initial use case with a much larger model hosted on cloud compute.

  2. Finding the right LoRA parameters turned out to be crucial for getting sensible output. This is akin to asking "how far do you want to move the model away from its original behavior?" LoRA keeps the original weights frozen and trains small low-rank adapter matrices on top of them. The larger the adapters (higher rank, more adapted layers), the longer training takes and the higher the risk of overfitting to your specific examples. Getting these settings right lets the model learn from your training examples without overfitting to them.

  3. I also have some inference takeaways, which are listed in a separate section below.
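
To make the second takeaway concrete, the arithmetic below sketches how the LoRA rank and alpha control the size and strength of the update. It assumes the standard LoRA formulation, in which a frozen weight matrix of shape d_out x d_in receives an added low-rank update scaled by alpha / rank. The 2048 x 2048 layer size and the helper names are illustrative assumptions, not values taken from the notebook.

```python
def lora_added_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The update is B @ A, with B of shape (d_out, rank) and A of shape (rank, d_in).
    """
    return rank * (d_out + d_in)


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to B @ A before it is added to the frozen weight."""
    return alpha / rank


# Hypothetical square projection layer; the size is an assumption for illustration.
d = 2048
full_params = d * d
for rank in (4, 16, 64):
    added = lora_added_params(d, d, rank)
    print(f"rank={rank:>2}: {added:,} trainable params "
          f"({added / full_params:.2%} of the full matrix)")

print("scaling for rank=16, alpha=32:", lora_scaling(32, 16))
```

Raising the rank grows the number of trainable parameters linearly, while the alpha / rank ratio sets how strongly the learned update is weighted against the frozen weights.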

Overview

This project demonstrates how to fine-tune models like TinyLlama for specific tasks using Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters. Currently configured for ELI5-style explanations.

Setup

  1. Install dependencies:

    pip install -r requirements.txt
  2. Environment configuration: Create a .env file with any required API keys or configuration variables.

Usage

Quick Start

Open and run llm_model_tuning.ipynb to:

  1. Load a pre-trained model (TinyLlama/TinyLlama-1.1B-Chat-v1.0)
  2. Configure LoRA adapters for efficient fine-tuning
  3. Train on your custom dataset (examples with question and answer)
  4. Evaluate the fine-tuned model

Key Components

  • Model: TinyLlama-1.1B-Chat (switchable to other models)
  • Method: LoRA fine-tuning with rank=16, alpha=32
  • Training: AdamW optimizer with gradient clipping
  • Device: Apple Silicon MPS support + CPU fallback
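
The gradient clipping mentioned above can be illustrated in plain Python. This is a minimal sketch that assumes clipping by global L2 norm (the behavior PyTorch provides through `torch.nn.utils.clip_grad_norm_`); the README does not say which clipping variant or threshold the notebook uses, so the function name and `max_norm=1.0` are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


# Toy gradient vector with norm 5; the values are made up for illustration.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(f"norm before: {norm_before:.3f}, norm after: {norm_after:.3f}")
```

Clipping keeps the direction of the update and only shrinks its length, which helps when a handful of training examples produce an unusually large gradient.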

Training Data Format

training_examples = [
    "### Question: [Your question] ### Answer: [Expected response]",
    # Add more examples...
]
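
If the questions and answers are stored as separate fields, a small helper can produce strings in the format above. This is a sketch only: `format_example` and the sample pairs are hypothetical and do not come from the notebook.

```python
def format_example(question: str, answer: str) -> str:
    """Join a question/answer pair into the single-string training format."""
    return f"### Question: {question.strip()} ### Answer: {answer.strip()}"


# Hypothetical ELI5-style pairs, made up for illustration.
qa_pairs = [
    ("Why is the sky blue?",
     "Sunlight bounces off the air, and blue light bounces the most."),
    ("What is rain?",
     "Tiny drops of water that fall from clouds when they get too heavy."),
]

training_examples = [format_example(q, a) for q, a in qa_pairs]
for example in training_examples:
    print(example)
```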

Key Inference Takeaways

1. Memory Optimization is Critical

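import torch

# Clear the MPS cache (when MPS is available) and disable gradients for inference
if torch.backends.mps.is_available():
    torch.mps.empty_cache()
torch.set_grad_enabled(False)
model.eval()  # model is the fine-tuned model loaded earlier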

2. Generation Parameters Significantly Impact Quality

  • max_new_tokens: Keep small (25-50 tokens) for focused ELI5 responses
  • temperature: 0.6-0.7 works best for ELI5 - balances creativity with coherence
  • repetition_penalty: 1.05-1.1 prevents loops without being too restrictive
  • no_repeat_ngram_size: 1-2 prevents exact repetitions (note that a value of 1 stops any token from appearing twice, which is very restrictive)
  • top_p: 0.9-0.95 provides good nucleus sampling for natural responses
  • Many more parameters affect the quality of the output responses. I'm always learning!
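
To show what temperature and top_p do to the next-token distribution, here is a pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It illustrates the idea only and is not the Hugging Face implementation; the function names and the toy logits are assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


# Toy logits for a four-token vocabulary, made up for illustration.
logits = [2.0, 1.0, 0.0, -1.0]
probs = softmax_with_temperature(logits, temperature=0.7)
filtered = nucleus_filter(probs, top_p=0.9)
print([round(p, 3) for p in probs])
print([round(p, 3) for p in filtered])
```

Tokens outside the nucleus get probability zero, so the unlikely tail can never be sampled, while temperature controls how evenly probability is spread among the tokens that remain.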

3. Format Consistency is Essential

Your inference prompts must exactly match your training format:

# Training format
"### Question: [question] ### Answer: [answer]"

# Inference format (identical up to and including the answer marker)
"### Question: [question] ### Answer:"

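One way to keep the two formats in sync is to build every inference prompt from a single template and to trim the decoded output back to just the answer. The sketch below assumes the decoded text echoes the prompt before the completion, as is typical for decoder-only models; `build_prompt` and `extract_answer` are hypothetical helper names.

```python
PROMPT_TEMPLATE = "### Question: {question} ### Answer:"


def build_prompt(question: str) -> str:
    """Build the inference prompt from the same template used for training."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def extract_answer(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt and anything after a run-on '###' marker."""
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    return completion.split("###")[0].strip()


prompt = build_prompt("What is rain?")

# A training example in the README format starts with exactly this prompt.
training_example = "### Question: What is rain? ### Answer: Water falling from clouds."
print(training_example.startswith(prompt))

# Made-up model output that runs on into a second question.
raw_output = prompt + " Water falling from clouds. ### Question: What is snow?"
print(extract_answer(raw_output, prompt))
```
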
4. Speed vs. Quality Trade-offs

  • Faster inference: Lower max_new_tokens, disable do_sample, use use_cache=True
  • Better quality: Higher temperature, enable do_sample, more tokens
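  • Apple Silicon: TF32 is an NVIDIA CUDA feature (torch.backends.cuda.matmul.allow_tf32) with no MPS equivalent, so setting torch.backends.mps.allow_tf32 = True has no effect; on MPS, speed comes mainly from the settings above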
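
The trade-off above can be captured as two presets of generation keyword arguments. This is a sketch: the values are picked from the ranges given in this README, the keys are standard Hugging Face `generate` arguments, and the preset and function names are hypothetical.

```python
# Greedy decoding with a short output: fastest and deterministic.
FAST_GENERATION = {
    "max_new_tokens": 25,
    "do_sample": False,
    "use_cache": True,
}

# Sampling with a longer output: slower, but more natural and varied.
QUALITY_GENERATION = {
    "max_new_tokens": 50,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "use_cache": True,
}


def generation_kwargs(prefer_speed: bool) -> dict:
    """Return a copy of the preset so callers can tweak it safely."""
    return dict(FAST_GENERATION if prefer_speed else QUALITY_GENERATION)


print(generation_kwargs(prefer_speed=True))
print(generation_kwargs(prefer_speed=False))
```

Assuming a Hugging Face model and tokenized `inputs`, the chosen preset would be passed as `model.generate(**inputs, **generation_kwargs(prefer_speed=True))`. The fast preset leaves out temperature and top_p because they only apply when sampling is enabled.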

5. Model Compilation Benefits

import torch

# torch.compile may speed up inference, but MPS support is still limited; benchmark before relying on it
if hasattr(torch, 'compile'):
    model = torch.compile(model, mode='reduce-overhead')
