A minimal implementation for fine-tuning small language models using LoRA (Low-Rank Adaptation) on custom datasets. Because everything runs on Apple MPS without distributed computing, I needed a small language model (SLM). I explored 'EleutherAI/gpt-neo-1.3B' but settled on 'TinyLlama/TinyLlama-1.1B-Chat-v1.0' for faster training, faster inference, and better tuning results. I learned many nuances during this project. Some of my key takeaways:
- SLMs are not well suited to being fine-tuned for multi-step answers. I initially wanted to tune the model to become my Running Coach, able to plan out 16 to 20 runs over 4 weeks. I tried several SLMs and many parameter configurations, but got total nonsense with each attempt. Hence, I moved on to the simpler ELI5 use case. I would love to circle back to my initial use case with a much larger model hosted on cloud compute.
- Finding the right LoRA parameters turned out to be crucial for getting proper output responses. This is akin to asking "how much do you want to fine-tune the original LLM weights?" Generally, the more you update the weights, the longer training and inference take, and the higher the probability of overfitting to your specific examples. Getting the parameters right ensures the weights are tuned just enough to learn from your training examples.
- I also have some inference takeaways, which are listed in a separate section below.
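To make the "how much do you want to fine-tune" question more concrete, here is a minimal sketch of how the LoRA rank and alpha control the size and strength of the update. It assumes the standard LoRA formulation, where a frozen weight matrix W receives an update (alpha / rank) * B @ A. The 2048 x 2048 layer size and the helper names are illustrative assumptions, not values taken from the notebook.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return alpha / rank


# Hypothetical square projection layer (size chosen for illustration only).
d = 2048
full_matrix_params = d * d

for rank in (4, 16, 64):
    added = lora_trainable_params(d, d, rank)
    share = added / full_matrix_params
    print(f"rank={rank:>2}: {added:,} trainable params ({share:.2%} of the full matrix)")

# With rank=16 and alpha=32, the low-rank update is scaled by 2.0.
print("scaling:", lora_scaling(alpha=32, rank=16))
```

A higher rank gives the adapter more capacity (and more room to overfit a small dataset), while alpha relative to rank sets how strongly the learned update is applied.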
This project demonstrates how to fine-tune models like TinyLlama for specific tasks using Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters. Currently configured for ELI5-style explanations.
- Install dependencies:
  pip install -r requirements.txt
- Environment configuration: Create a .env file with any required API keys or configuration variables.
Open and run llm_model_tuning.ipynb to:
- Load a pre-trained model (TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- Configure LoRA adapters for efficient fine-tuning
- Train on your custom dataset (examples with question and answer)
- Evaluate the fine-tuned model
Configuration:
- Model: TinyLlama-1.1B-Chat (switchable to other models)
- Method: LoRA fine-tuning with rank=16, alpha=32
- Training: AdamW optimizer with gradient clipping
- Device: Apple Silicon MPS support + CPU fallback
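The settings above could be wired together roughly as follows. This is a sketch rather than the notebook's actual code: rank 16 and alpha 32 come from the list, while the dropout, target modules, learning rate, and clipping norm are assumptions chosen for illustration, and `build_model_and_optimizer` and `train_step` are hypothetical helper names. It relies on the `torch`, `transformers`, and `peft` packages.

```python
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Values from the configuration list above.
LORA_RANK = 16
LORA_ALPHA = 32

# Assumptions for illustration; not taken from the notebook.
LORA_DROPOUT = 0.05
TARGET_MODULES = ["q_proj", "v_proj"]  # attention projections in Llama-style models
LEARNING_RATE = 2e-4
MAX_GRAD_NORM = 1.0


def build_model_and_optimizer():
    """Load the base model, attach LoRA adapters, and create an AdamW optimizer."""
    # Heavy dependencies are imported lazily so the settings above can be
    # inspected without loading torch.
    import torch
    from peft import LoraConfig, TaskType, get_peft_model
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Prefer Apple Silicon MPS, fall back to CPU.
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    lora_config = LoraConfig(
        r=LORA_RANK,
        lora_alpha=LORA_ALPHA,
        lora_dropout=LORA_DROPOUT,
        target_modules=TARGET_MODULES,
        task_type=TaskType.CAUSAL_LM,
    )
    model = get_peft_model(base_model, lora_config).to(device)

    # Only the adapter weights require gradients; the base weights stay frozen.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=LEARNING_RATE)
    return model, tokenizer, optimizer, device


def train_step(model, tokenizer, optimizer, device, text):
    """Run one optimization step on a single training example and return the loss."""
    import torch

    model.train()
    batch = tokenizer(text, return_tensors="pt").to(device)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    # Gradient clipping keeps a single bad example from destabilizing training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Calling `build_model_and_optimizer()` downloads the base model on first use, so expect the first run to take a while.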
training_examples = [
    "### Question: [Your question] ### Answer: [Expected response]",
    # Add more examples...
]

Inference takeaways:

# Always clear cache and disable gradients for inference
torch.mps.empty_cache()
torch.set_grad_enabled(False)
model.eval()

Generation parameters:
- max_new_tokens: Keep small (25-50 tokens) for focused ELI5 responses
- temperature: 0.6-0.7 works best for ELI5; it balances creativity with coherence
- repetition_penalty: 1.05-1.1 prevents loops without being too restrictive
- no_repeat_ngram_size: 1-2 prevents exact repetitions (a value of 1 stops any token from appearing twice, so 2 is the gentler setting)
- top_p: 0.9-0.95 provides good nucleus sampling for natural responses
- Many more parameters affect the quality of the output responses. I'm always learning!
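The parameter ranges above can be collected into one dictionary and passed to `generate`. The sketch below picks one value from each range for illustration. `generate_answer` is a hypothetical helper, and it assumes a Hugging Face causal LM and tokenizer that are already loaded.

```python
# One value from each recommended range; the exact picks are illustrative.
GENERATION_KWARGS = {
    "max_new_tokens": 50,
    "do_sample": True,  # temperature and top_p only take effect when sampling
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "no_repeat_ngram_size": 2,
}


def generate_answer(model, tokenizer, prompt, device="cpu"):
    """Generate a short answer for an already-formatted prompt."""
    import torch  # imported lazily; only needed when generating

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, **GENERATION_KWARGS)

    # Drop the prompt tokens so only the newly generated text is returned.
    prompt_length = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_length:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Keeping the settings in a single dictionary makes it easier to compare runs when tuning one parameter at a time.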
Your inference prompts must exactly match your training format:
# Training format
"### Question: [question] ### Answer: [answer]"
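# Inference format (same template, with the answer left for the model to generate)
"### Question: [question] ### Answer:"

- Faster inference: Lower max_new_tokens, disable do_sample, use use_cache=True
- Better quality: Higher temperature, enable do_sample, more tokens
- Apple Silicon: torch.backends.mps has no allow_tf32 setting. TF32 is an NVIDIA GPU feature (torch.backends.cuda.matmul.allow_tf32), so setting it does not speed up MPS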
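# torch.compile (PyTorch 2.0+) may speed up inference on M1/M2 chips. MPS support
# is limited, so benchmark before and after and keep the uncompiled model as a fallback.
if hasattr(torch, 'compile'):
    model = torch.compile(model, mode='reduce-overhead')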
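Since inference prompts must follow the same template as the training examples, one way to keep the two in sync is to build both from a single helper. The sketch below is an assumption about how that could look. `format_prompt` and `extract_answer` are hypothetical names, and the sample question is made up for illustration.

```python
from typing import Optional

QUESTION_TAG = "### Question:"
ANSWER_TAG = "### Answer:"


def format_prompt(question: str, answer: Optional[str] = None) -> str:
    """Build a training example (answer given) or an inference prompt (answer omitted)."""
    prompt = f"{QUESTION_TAG} {question.strip()} {ANSWER_TAG}"
    if answer is not None:
        prompt += f" {answer.strip()}"
    return prompt


def extract_answer(generated_text: str) -> str:
    """Return the text after the answer tag, cut off if the model starts a new question."""
    answer = generated_text.split(ANSWER_TAG, 1)[-1]
    return answer.split(QUESTION_TAG, 1)[0].strip()


# The training example and the inference prompt share the same prefix.
training_example = format_prompt(
    "Why is the sky blue?",
    "Sunlight bounces off the air, and blue light bounces the most.",
)
inference_prompt = format_prompt("Why is the sky blue?")
assert training_example.startswith(inference_prompt)
```

Building `training_examples` with `format_prompt(question, answer)` and calling `format_prompt(question)` at inference time removes the risk of the two templates drifting apart.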