LLM Fine-Tuning with LoRA

Can a Small Language Model (SLM) be tuned to "Explain it to me like I'm 5?"

A minimal implementation for fine-tuning small language models using LoRA (Low-Rank Adaptation) on custom datasets. Because everything runs on Apple MPS on a single machine (no distributed computing), I needed a small language model (SLM). I explored 'EleutherAI/gpt-neo-1.3B' but settled on 'TinyLlama/TinyLlama-1.1B-Chat-v1.0' for faster training, faster inference, and better tuning results. I learned many nuances during this project. Some of my key takeaways:

In my experience, SLMs are not well suited to being fine-tuned for multi-step answers. I initially wanted to tune the model to become my Running Coach, able to plan out 16 to 20 runs over 4 weeks. I tried several SLMs and many parameter configurations, but got total nonsense with each attempt. Hence, I moved on to the simpler ELI5 use case. I would love to circle back to my initial use case with a much larger model hosted on cloud compute.
Finding the right LoRA parameters turned out to be crucial for getting proper output responses. Roughly, the rank and alpha settings answer the question "how much do you want to adjust the original LLM weights?" A higher rank means more trainable parameters, which lengthens training (and inference too, if the adapters are not merged into the base weights) and raises the risk of overfitting to your specific examples. Getting these values right lets the model learn from your training examples without drifting too far from the base model.
I also have some inference takeaways, which are listed in a separate section below.
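
To make the "how much do you want to adjust the original weights?" question concrete, here is a minimal pure-Python sketch of what a LoRA adapter computes. The frozen weight W is never modified; only two small matrices A and B are trained, and their product is scaled by alpha / rank. The toy matrices and the helper names (`matvec`, `lora_forward`) are illustrative assumptions, not code from the notebook.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Return W x + (alpha / rank) * B (A x), the LoRA-adapted output."""
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank trainable path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes (assumed for illustration): 3x3 frozen weight, rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
A = [[0.5, -0.5, 1.0]]                 # rank x d_in
B_init = [[0.0], [0.0], [0.0]]         # d_out x rank, zero at the start of training
B_trained = [[0.5], [0.0], [-0.25]]    # made-up values standing in for a trained B
x = [1.0, 2.0, 3.0]

# With B at zero the adapter changes nothing; after training it shifts the output.
print(lora_forward(W, A, B_init, x, alpha=2, rank=1))
print(lora_forward(W, A, B_trained, x, alpha=2, rank=1))
```

Because B starts at zero, the adapted model initially behaves exactly like the base model, and the alpha / rank factor controls how strongly the learned update is applied.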

Overview

This project demonstrates how to fine-tune models like TinyLlama for specific tasks using Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters. Currently configured for ELI5-style explanations.

Setup

Install dependencies:
```
pip install -r requirements.txt
```
Environment configuration: Create a .env file with any required API keys or configuration variables.

Usage

Quick Start

Open and run llm_model_tuning.ipynb to:

Load a pre-trained model (TinyLlama/TinyLlama-1.1B-Chat-v1.0)
Configure LoRA adapters for efficient fine-tuning
Train on your custom dataset (examples with question and answer)
Evaluate the fine-tuned model

Key Components

Model: TinyLlama-1.1B-Chat (switchable to other models)
Method: LoRA fine-tuning with rank=16, alpha=32
Training: AdamW optimizer with gradient clipping
Device: Apple Silicon MPS support + CPU fallback
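
As a rough sense of scale for rank=16 and alpha=32, the sketch below counts the trainable parameters LoRA adds to a single weight matrix. The 2048 x 2048 matrix size is an assumption chosen for illustration (check the model config for the real shapes), and `lora_param_count` is a hypothetical helper, not part of PEFT.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in the two LoRA matrices: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


rank, alpha = 16, 32
d_in = d_out = 2048  # assumed projection size, for illustration only

full_params = d_in * d_out
added_params = lora_param_count(d_in, d_out, rank)
scale = alpha / rank  # factor applied to the low-rank update

print(f"full matrix:  {full_params:,} parameters")
print(f"LoRA adapter: {added_params:,} parameters "
      f"({added_params / full_params:.2%} of the full matrix)")
print(f"update scale: {scale}")
```

Under these assumptions the adapter trains only a small fraction of the parameters in each adapted matrix, which is what makes training feasible on a single Apple Silicon machine.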

Training Data Format

training_examples = [
    "### Question: [Your question] ### Answer: [Expected response]",
    # Add more examples...
]
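
If your data starts out as separate questions and answers, a small helper can produce strings in this format. This is a minimal sketch: the `TEMPLATE` constant, the `build_training_examples` function, and the sample pairs are made up for illustration and are not taken from the notebook.

```python
TEMPLATE = "### Question: {question} ### Answer: {answer}"


def build_training_examples(pairs):
    """Turn a list of {"question": ..., "answer": ...} dicts into training strings."""
    examples = []
    for pair in pairs:
        question = pair["question"].strip()
        answer = pair["answer"].strip()
        if not question or not answer:
            raise ValueError(f"Empty question or answer in pair: {pair!r}")
        examples.append(TEMPLATE.format(question=question, answer=answer))
    return examples


# Illustrative ELI5-style pairs (made up for this sketch).
qa_pairs = [
    {"question": "Why is the sky blue?",
     "answer": "Sunlight bounces around in the air, and blue light bounces the most."},
    {"question": "What is rain?",
     "answer": "Tiny drops of water in clouds get heavy and fall down."},
]

training_examples = build_training_examples(qa_pairs)
for example in training_examples:
    print(example)
```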

Key Inference Takeaways

1. Memory Optimization is Critical

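import torch

# Always free cached memory and disable gradients before inference
if torch.backends.mps.is_available():
    torch.mps.empty_cache()      # release unused cached MPS memory
torch.set_grad_enabled(False)    # skip autograd bookkeeping during generation
model.eval()                     # turn off dropout and other training-only behavior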

2. Generation Parameters Significantly Impact Quality

max_new_tokens: Keep small (25-50 tokens) for focused ELI5 responses
temperature: 0.6-0.7 works best for ELI5 - balances creativity with coherence
repetition_penalty: 1.05-1.1 prevents loops without being too restrictive
no_repeat_ngram_size: 1-2 prevents exact repetitions (note that 1 stops any token from appearing twice, so 2 is the safer choice for full sentences)
top_p: 0.9-0.95 provides good nucleus sampling for natural responses
Many more parameters affect the quality of the output responses. I'm always learning!
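
To show what temperature and top_p actually do to the next-token distribution, here is a small pure-Python sketch. It is a simplified illustration of temperature scaling and nucleus (top-p) filtering, not the Hugging Face implementation; the function names and the four example logits are assumptions made up for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(filtered, rng):
    """Draw one token index from the filtered distribution."""
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
for temperature in (1.0, 0.7):
    probs = softmax_with_temperature(logits, temperature)
    filtered = top_p_filter(probs, top_p=0.9)
    print(temperature, sorted(filtered), sample_token(filtered, random.Random(0)))
```

With these example logits, lowering the temperature concentrates probability on the top token, so fewer candidates survive the same top_p cutoff. That is one way to see why a moderate temperature keeps short ELI5 answers coherent.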

3. Format Consistency is Essential

Your inference prompts must exactly match your training format:

# Training format
"### Question: [question] ### Answer: [answer]"

# Inference format (same template, ending right after "### Answer:")
"### Question: [question] ### Answer:"

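One way to keep the two formats in sync is to build every inference prompt from a single template and to strip the prompt back off the generated text. The sketch below assumes the decoded output contains the prompt followed by the continuation; `build_prompt` and `extract_answer` are hypothetical helper names, not functions from the notebook.

```python
PROMPT_TEMPLATE = "### Question: {question} ### Answer:"
ANSWER_MARKER = "### Answer:"


def build_prompt(question):
    """Format a question exactly like the training data, stopping after the answer marker."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def extract_answer(generated_text):
    """Return the text after the first answer marker, cut at any following '###' section.

    Returns an empty string if the marker is missing.
    """
    _, _, tail = generated_text.partition(ANSWER_MARKER)
    answer = tail.split("###", 1)[0]  # drop a follow-up question the model may start
    return answer.strip()


prompt = build_prompt("  What is rain? ")
# Stand-in for decoded model output (made up for this sketch).
fake_output = prompt + " Water drops falling from clouds. ### Question: What is snow?"
print(prompt)
print(extract_answer(fake_output))
```
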
4. Speed vs. Quality Trade-offs

Faster inference: Lower max_new_tokens, disable do_sample, use use_cache=True
Better quality: Higher temperature, enable do_sample, more tokens
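Apple Silicon: MPS has no TF32 switch (TF32 is an NVIDIA GPU feature, set via torch.backends.cuda.matmul.allow_tf32); loading the model in float16 can reduce memory use and may speed things up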
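
The trade-off can be captured as two presets of keyword arguments for `model.generate`. This is a sketch under assumptions: the preset names and the `generation_kwargs` helper are hypothetical, and the values are simply picked from the ranges listed in the takeaways above.

```python
# Greedy decoding with a short output budget: fastest, most deterministic.
FAST_GENERATION = {
    "max_new_tokens": 25,
    "do_sample": False,
    "use_cache": True,
}

# Sampling with a larger budget: slower, more natural-sounding answers.
QUALITY_GENERATION = {
    "max_new_tokens": 50,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "no_repeat_ngram_size": 2,
    "use_cache": True,
}


def generation_kwargs(prefer_speed):
    """Return a copy of the preset so callers can tweak it without side effects."""
    return dict(FAST_GENERATION if prefer_speed else QUALITY_GENERATION)


print(generation_kwargs(prefer_speed=True))
print(generation_kwargs(prefer_speed=False))

# Hypothetical usage with an already loaded model and tokenized inputs:
# output_ids = model.generate(**inputs, **generation_kwargs(prefer_speed=True))
```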

5. Model Compilation Benefits

# Can speed up inference, but torch.compile support on MPS is limited, so benchmark on your M1/M2 chip before relying on it
if hasattr(torch, 'compile'):
    model = torch.compile(model, mode='reduce-overhead')
