A complete, reproducible LoRA fine-tuning pipeline that distills the decision-making patterns of verified One-Person Company (OPC) practitioners into a deployable AI advisor. Built with Qwen3-14B on Apple Silicon using MLX.
- Motivation
- Architecture
- Quick Start
- Dataset
- Training
- Results & Evaluation
- Failure Analysis & Lessons Learned
- Methodology: A Reproducible Framework
- Deployment
- Project Structure
- Roadmap
- License
Solo entrepreneurs (One-Person Companies, or OPCs) face a unique set of challenges that general-purpose LLMs are poorly equipped to handle. Out of the box, these assistants tend to be agreeable and generic, and they lack the specific judgment frameworks that experienced OPC practitioners rely on daily.
OPC-Codex addresses this gap by fine-tuning a large language model on high-quality conversational data from verified OPC practitioners. The goal is not to replace human judgment, but to create an AI advisor that internalizes the thinking patterns, frameworks, and decision-making heuristics of those who have already succeeded in the OPC space.
| Approach | Strengths | Weaknesses |
|---|---|---|
| RAG (Retrieval-Augmented Generation) | Accurate factual recall | Style inconsistency; retrieval failures expose base model |
| Prompt Engineering / Skills | Easy to iterate | Template-like responses; limited depth |
| Fine-Tuning (this project) | Consistent style; transferable reasoning | Higher deployment cost; requires quality data |
Each approach has its place. Fine-tuning excels when you need consistent persona, transferable reasoning, and style that doesn't degrade over long conversations. RAG and Skills excel when you need factual accuracy and easy iteration. The ideal production system combines all three.
┌─────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────────┐
│  Raw Data   │────▶│   Data Eng   │────▶│ LoRA Train  │────▶│   Merge &    │
│ Collection  │     │    & QC      │     │   (MLX)     │     │  Dequantize  │
└─────────────┘     └──────────────┘     └─────────────┘     └──────┬───────┘
                                                                    │
                                                                    ▼
┌─────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────────┐
│   Ollama    │◀────│   Quantize   │◀────│   Convert   │◀────│   GGUF F16   │
│   Deploy    │     │   (Q4_K_M)   │     │ (llama.cpp) │     │    Export    │
└─────────────┘     └──────────────┘     └─────────────┘     └──────────────┘
| Component | Technology | Version / Details |
|---|---|---|
| Base Model | Qwen3-14B | 14.8B params |
| Quantized Base | mlx-community/Qwen3-14B-4bit | 4-bit MLX |
| Training Framework | MLX | 0.31.3 |
| Fine-Tuning Method | LoRA (rank=32, alpha=64) | 8 trainable layers |
| GGUF Conversion | llama.cpp | latest |
| Quantization | Q4_K_M | ~8.5GB final model |
| Deployment | Ollama | latest |
| Hardware | Apple M4 Max (36GB) | macOS |
- macOS with Apple Silicon (M1/M2/M3/M4)
- 32GB+ unified memory
- Python 3.13+
- Ollama installed
git clone https://github.com/YOUR_USERNAME/opc-codex.git
cd opc-codex
pip install mlx-lm --break-system-packages
chmod +x scripts/retrain_v2.sh
./scripts/retrain_v2.sh

Training takes ~1-2 hours on M4 Max. Peak memory usage: ~10.6GB.

chmod +x scripts/convert_v2_to_gguf.sh
./scripts/convert_v2_to_gguf.sh
ollama run opc-codex

| Metric | Value |
|---|---|
| Total samples | 221 |
| Format | JSONL (OpenAI chat format) |
| Language | Chinese |
| Quality score | 88+/100 (5-layer review) |
| Train/Val split | 209 / 12 (95%/5%) |
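
The 209 / 12 split in the table is what you get by truncating 221 × 0.95 to 209 training samples. Below is a minimal sketch of such a split, assuming a seeded shuffle. The function name `split_train_val` and the seed value are illustrative and are not taken from the repository's scripts.

```python
import random


def split_train_val(samples, train_fraction=0.95, seed=42):
    """Shuffle a copy of `samples` and split it into (train, val) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_fraction)  # truncates: 221 -> 209
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in records; in practice each item would be one parsed JSONL line.
train, val = split_train_val(range(221))
print(len(train), len(val))
```

Fixing the seed keeps the validation set identical across training runs, so validation losses from different versions stay comparable.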
- Source identification: Verified OPC practitioners with trackable results
- Content extraction: Video transcripts, articles, podcasts, social media posts
- Conversation formatting: Structured as multi-turn dialogues (system + user + assistant)
- Quality scoring: 5-dimension rubric (relevance, specificity, framework usage, uniqueness, actionability)
- Iterative refinement: V1 → V2 → V3 → V4, progressively improving quality
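
The rubric names five dimensions but not their weights. As a rough sketch, assume each dimension is scored 0-20 and the five scores are summed to a 0-100 total. The equal weighting, the identifier names, and the cutoff of 88 (borrowed from the dataset table) are all assumptions made for illustration.

```python
RUBRIC = ("relevance", "specificity", "framework_usage", "uniqueness", "actionability")
MAX_PER_DIMENSION = 20  # assumed: equal weighting, 5 x 20 = 100


def total_score(scores):
    """Sum per-dimension scores after checking all five are present and in range."""
    missing = [dim for dim in RUBRIC if dim not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for dim in RUBRIC:
        if not 0 <= scores[dim] <= MAX_PER_DIMENSION:
            raise ValueError(f"{dim} must be between 0 and {MAX_PER_DIMENSION}")
    return sum(scores[dim] for dim in RUBRIC)


def passes(scores, threshold=88):
    """Keep a sample only if its total reaches the (assumed) threshold."""
    return total_score(scores) >= threshold


# Hypothetical scores for a single sample.
example = {"relevance": 19, "specificity": 18, "framework_usage": 17,
           "uniqueness": 16, "actionability": 19}
print(total_score(example), passes(example))
```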
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

See data/sample_data.jsonl for 5 anonymized examples.
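
Before training, it can help to check that every line of the JSONL file matches this shape. The following validation sketch uses only the standard library. The function names and the specific rules enforced (system message first if present, at least one user and one assistant turn, non-empty string content, final turn from the assistant) are assumptions about what "well-formed" means here, not checks taken from the project's scripts.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line):
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty content")

    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if "system" in roles[1:]:
        problems.append("system message must come first")
    if "user" not in roles or "assistant" not in roles:
        problems.append("need at least one user and one assistant turn")
    if roles and roles[-1] != "assistant":
        problems.append("last message should be the assistant reply")
    return problems


def validate_file(path):
    """Map 1-based line numbers to problems for every bad line in a JSONL file."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():
                issues = validate_record(line)
                if issues:
                    report[number] = issues
    return report
```

Calling `validate_file("data/sample_data.jsonl")` returns an empty dict when every line passes.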
| Parameter | V1 (Failed) | V2 (Current) | Reason for Change |
|---|---|---|---|
| Base model | Qwen3-14B-8bit | Qwen3-14B-4bit | Avoid quantization loss |
| LoRA rank | 8 | 32 | Too small for 14B model |
| LoRA alpha | 16 | 64 | 2:1 ratio with rank |
| Trainable layers | 16 | 8 | OOM on 36GB Mac |
| Batch size | 4 | 1 | Memory constraint |
| Grad accumulation | 4 | 4 | Effective batch = 4 |
| Learning rate | 5e-5 | 2e-5 | Lower for higher rank |
| Max seq length | 2048 | 2048 | Sufficient for most responses |
| Gradient checkpoint | No | Yes | Critical for memory |
| Total iterations | 220 | 1000 | ~3 epochs over 209 samples |
# configs/lora_config_v2.yaml
lora_parameters:
  rank: 32
  alpha: 64
  dropout: 0.05
  scale: 20.0
batch_size: 2
grad_accumulation_steps: 2
learning_rate: 2.0e-5
optimizer: "adam"

| Iter | Train Loss | Val Loss | Tokens/sec | Peak Mem | Note |
|---|---|---|---|---|---|
| 1 | 3.244 | 3.507 | 198 | 10.2 GB | |
| 100 | 1.803 | 2.151 | 160 | 10.6 GB | |
| 200 | 1.770 | 2.151 | 153 | 10.6 GB | |
| 300 | 1.765 | 2.119 | 165 | 10.6 GB | |
| 400 | 1.695 | 2.119 | 165 | 10.6 GB | Best val loss |
| 500 | 1.203 | 2.209 | 165 | 10.6 GB | |
| 600 | 1.024 | 2.209 | 162 | 10.6 GB | |
| 700 | 0.925 | 2.543 | 162 | 10.6 GB | |
| 800 | 0.782 | 2.543 | 155 | 10.6 GB | |
| 900 | 0.395 | 2.773 | 164 | 10.6 GB | |
| 1000 | 0.371 | 2.773 | 166 | 10.6 GB | Final |
Key observation: Training loss decreased consistently (3.24 → 0.37), but validation loss bottomed out at 2.119 (iters 300-400) and rose from iter 500 onward, indicating overfitting. The best checkpoint is around iter 400.
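
Picking the checkpoint can be automated from the logged validation losses. The sketch below copies the Val Loss column from the log above and applies two simple rules: choose the lowest validation loss (ties go to the later iteration), and report where a patience-based early-stopping rule would have halted. The helper names and the `patience` value are illustrative assumptions, and the logic runs outside the training loop rather than relying on any built-in training feature.

```python
from typing import NamedTuple


class EvalPoint(NamedTuple):
    iteration: int
    val_loss: float


# Validation losses copied from the training log above.
LOG = [
    EvalPoint(1, 3.507), EvalPoint(100, 2.151), EvalPoint(200, 2.151),
    EvalPoint(300, 2.119), EvalPoint(400, 2.119), EvalPoint(500, 2.209),
    EvalPoint(600, 2.209), EvalPoint(700, 2.543), EvalPoint(800, 2.543),
    EvalPoint(900, 2.773), EvalPoint(1000, 2.773),
]


def best_checkpoint(points):
    """Lowest validation loss wins; ties go to the later iteration."""
    return min(points, key=lambda p: (p.val_loss, -p.iteration))


def early_stop_iteration(points, patience=2):
    """Return the iteration where `patience` evaluations in a row failed to improve."""
    best = float("inf")
    stale = 0
    for point in points:
        if point.val_loss < best:
            best, stale = point.val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                return point.iteration
    return None  # never triggered


print(best_checkpoint(LOG).iteration)  # 400
print(early_stop_iteration(LOG))       # 500
```

With a patience of 2, the rule would have stopped at iter 500, saving half of the run while still covering the best checkpoint.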
We designed a 9-question progressive test to evaluate persona fidelity:
| Level | Question Type | What It Tests |
|---|---|---|
| 1 | Style recognition | Does the model adopt the right tone? |
| 2 | Methodology activation | Does it use specific frameworks? |
| 3 | Deep reasoning transfer | Can it think in novel ways? |
| 4 | Stress test | Does it maintain persona under pressure? |
| Version | Score | Style | Content Recall | Repetition |
|---|---|---|---|---|
| Base (Qwen3-14B) | — | Generic | N/A | None |
| V1 (rank=8) | D | Surface imitation | 0/9 | Severe |
| V2 (rank=32, final) | C+ | Improved | 1/9 | Moderate |
| Skills/Prompt baseline | B+ | Good | 5/9 | None |
The fine-tuned model successfully captures the tone and attitude (direct, contrarian, confident) but struggles with content recall (specific case studies, named frameworks, exact methodologies). This is expected given:
- Only 0.17% of parameters are trainable (LoRA rank=32 on 14B model)
- 221 samples is insufficient for both style and content learning
- Overfitting after iter 400 suggests the model is memorizing surface patterns rather than developing deep understanding
For production use, we recommend a hybrid approach: fine-tuning for style + RAG for content accuracy + Skills for framework enforcement.
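
The content-recall and repetition columns in the results table could be approximated automatically. In this rough sketch, recall is the number of answers that mention at least one expected term, and repetition is the share of duplicated character n-grams. Both definitions and the function names are assumptions, not the scoring procedure behind the grades above. Character n-grams are used so the metric also works for Chinese text, which has no spaces between words.

```python
from collections import Counter


def content_recall(answers, expected_terms):
    """Count answers containing at least one of their expected terms (case-insensitive)."""
    hits = 0
    for answer, terms in zip(answers, expected_terms):
        lowered = answer.lower()
        if any(term.lower() in lowered for term in terms):
            hits += 1
    return hits


def repetition_ratio(text, n=4):
    """Fraction of character n-grams that repeat an n-gram seen earlier in the text."""
    chars = "".join(text.split())  # drop whitespace so line breaks do not matter
    ngrams = [chars[i:i + n] for i in range(len(chars) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)
```

A ratio near 0 means almost no repeated phrasing. Any threshold separating "Moderate" from "Severe" repetition would need to be calibrated against human judgments.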
This is the most valuable section of this project. Every failure is documented to save you time.
| # | Issue | Symptom | Fix |
|---|---|---|---|
| 1 | LoRA rank too low (8) | Model couldn't learn style or content | Increased to 32 |
| 2 | Insufficient epochs (~1) | Most training data never seen | Increased to ~3 |
| 3 | 8-bit quantized base model | Precision loss compounded during GGUF conversion | Switched to 4-bit MLX base |
| 4 | Wrong mlx-lm CLI syntax | Training failed to start | Updated to 0.31.x API |
| 5 | No gradient checkpointing | OOM on 36GB Mac | Added --grad-checkpoint |
| # | Issue | Symptom | Potential Fix |
|---|---|---|---|
| 1 | Overfitting | Val loss increased after iter 400 | Use early stopping; best checkpoint at iter 400 |
| 2 | Content recall gap | Specific cases/frameworks not reproduced | Increase data to 500+ samples |
| 3 | Thinking mode leak | Qwen3 generates `<think>` blocks | Add stop tokens in Modelfile |
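
As a complement to stop tokens, leaked reasoning can also be removed after generation. The sketch below is a post-processing filter, which is a different technique from the Modelfile fix in the table. It assumes the leaked blocks are wrapped in `<think>...</think>` tags, which is how Qwen3 normally delimits its thinking output. The function name is illustrative.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
# Matches an unterminated block left behind when generation was cut off.
_OPEN_THINK = re.compile(r"<think>.*\Z", re.DOTALL)


def strip_thinking(text):
    """Remove Qwen3-style reasoning blocks and return only the visible answer."""
    text = _THINK_BLOCK.sub("", text)
    text = _OPEN_THINK.sub("", text)
    return text.strip()
```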
- Always check mlx-lm version compatibility — CLI arguments change between minor versions
- Start with gradient checkpointing enabled — it's free insurance against OOM
- Monitor validation loss, not just training loss — divergence means overfitting
- Save checkpoints frequently — the best model may not be the final one
- GGUF conversion requires dequantization first — MLX quantized weights are not directly compatible with llama.cpp
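
The first lesson can be turned into a quick pre-flight check. This sketch reads the installed package version with the standard library and compares its major.minor part against an expected prefix. The helper names are illustrative, and the expected prefix should be whatever version the training scripts were written against.

```python
from importlib.metadata import PackageNotFoundError, version


def same_minor(installed, expected_prefix):
    """True if `installed` (e.g. "0.31.3") has the major.minor given in `expected_prefix`."""
    return installed.split(".")[:2] == expected_prefix.split(".")[:2]


def check_package(name, expected_prefix):
    """Return a short status string for one installed package."""
    try:
        found = version(name)
    except PackageNotFoundError:
        return f"{name}: not installed"
    status = "ok" if same_minor(found, expected_prefix) else f"expected {expected_prefix}.x"
    return f"{name}: {found} ({status})"
```

For example, `print(check_package("mlx-lm", "0.31"))` uses the 0.31.x series named in the failure table above. Run it before starting a long training job.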
Based on our experience, here is a 5-step framework for fine-tuning a persona-specific LLM from small data:
- Step 1: Define the persona
  - Identify 3-5 core traits (e.g., "direct", "framework-driven", "contrarian")
  - List 5-10 signature frameworks/methodologies
  - Collect 10+ representative examples of desired output
- Step 2: Prepare the data
  - Minimum 200 high-quality samples (500+ recommended)
  - Use a structured quality rubric (5 dimensions)
  - Format as multi-turn conversations
  - Split 95/5 for train/validation
- Step 3: Train
  - Use 4-bit quantized base to fit on consumer hardware
  - LoRA rank = 32-64 (higher for smaller base models)
  - Learning rate = 1e-5 to 3e-5
  - Enable gradient checkpointing
  - Monitor validation loss for early stopping
- Step 4: Evaluate
  - Design 9+ questions across 4 difficulty levels
  - Compare against base model and prompt-only baseline
  - Score on style, content recall, and repetition
- Step 5: Deploy and iterate
  - Convert to GGUF via dequantization → F16 → quantization
  - Deploy with Ollama for easy testing
  - Collect user feedback for next training iteration
# After running convert_v2_to_gguf.sh
ollama create opc-codex -f Modelfile.opc_codex_v2
ollama run opc-codex

./llama.cpp/build/bin/llama-server \
  -m opc_codex_v2_14b_q4_k_m.gguf \
  -c 4096 \
  --temp 0.7 \
  --top-p 0.8

Import the .gguf file directly into any GGUF-compatible client.
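
Once `llama-server` is running, it can be queried over its OpenAI-compatible chat endpoint. The client sketch below uses only the standard library. It assumes the server listens on its default address (127.0.0.1:8080, since the command above sets no `--host` or `--port`) and reuses the same sampling values. The function names are illustrative.

```python
import json
import urllib.request

# Assumed address: llama-server's default when --host/--port are not given.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_payload(question, system_prompt=None, temperature=0.7, top_p=0.8):
    """Assemble an OpenAI-style chat request body."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})
    return {"messages": messages, "temperature": temperature, "top_p": top_p}


def ask(question, system_prompt=None, url=SERVER_URL, timeout=120):
    """Send one question to the server and return the assistant's reply text."""
    body = json.dumps(build_payload(question, system_prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

`ask("...")` raises a `urllib.error.URLError` if the server is not reachable, which makes it usable as a simple smoke test after deployment.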
opc-codex/
├── README.md # This file (English)
├── README_zh.md # Chinese documentation
├── LICENSE # MIT License
├── .gitignore # Git ignore rules
│
├── configs/
│ ├── lora_config_v2.yaml # LoRA training configuration
│ └── training_params.md # Hyperparameter documentation
│
├── data/
│ ├── README.md # Data documentation
│ └── sample_data.jsonl # 5 anonymized examples
│
├── scripts/
│ ├── retrain_v2.sh # One-click training script
│ ├── convert_v2_to_gguf.sh # MLX → GGUF conversion
│ ├── convert_to_gguf.sh # V1 conversion (legacy)
│ ├── run_finetune.sh # V1 training (legacy)
│ └── run_finetune_4b.sh # Mobile variant (4B model)
│
├── docs/
│ ├── methodology.md # 5-step fine-tuning framework
│ ├── failure_analysis.md # Detailed failure analysis
│ └── evaluation.md # Evaluation framework & results
│
├── Modelfile.opc_codex_v2 # Ollama deployment configuration
│
└── .github/
└── ISSUE_TEMPLATE/
└── bug_report.md
- Data expansion: Increase to 500+ high-quality samples
- Full fine-tuning: Experiment with DoRA or full-parameter tuning on cloud GPU
- Hybrid architecture: Combine fine-tuning (style) + RAG (content) + Skills (frameworks)
- Mobile variant: Optimize Qwen3-4B version for on-device deployment
- Evaluation benchmark: Build automated persona fidelity scoring
- Multi-persona support: Fine-tune multiple OPC practitioners as switchable personas
- Hugging Face upload: Publish model with proper Model Card
This project is licensed under the MIT License. See LICENSE for details.