OPC-Codex: Distilling Verified OPC Judgment into a Fine-Tuned LLM

A complete, reproducible LoRA fine-tuning pipeline that distills the decision-making patterns of verified One-Person Company (OPC) practitioners into a deployable AI advisor. Built with Qwen3-14B on Apple Silicon using MLX.

Motivation

Solo entrepreneurs (One-Person Companies, or OPCs) face challenges that general-purpose LLMs handle poorly. Off-the-shelf AI assistants tend to be agreeable and bland, and they lack the specific judgment frameworks that experienced OPC practitioners rely on daily.

OPC-Codex addresses this gap by fine-tuning a large language model on high-quality conversational data from verified OPC practitioners. The goal is not to replace human judgment, but to create an AI advisor that internalizes the thinking patterns, frameworks, and decision-making heuristics of those who have already succeeded in the OPC space.

Why Fine-Tuning Instead of RAG or Prompt Engineering?

Approach	Strengths	Weaknesses
RAG (Retrieval-Augmented Generation)	Accurate factual recall	Style inconsistency; retrieval failures expose base model
Prompt Engineering / Skills	Easy to iterate	Template-like responses; limited depth
Fine-Tuning (this project)	Consistent style; transferable reasoning	Higher deployment cost; requires quality data

Each approach has its place. Fine-tuning excels when you need consistent persona, transferable reasoning, and style that doesn't degrade over long conversations. RAG and Skills excel when you need factual accuracy and easy iteration. The ideal production system combines all three.

Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────────┐
│  Raw Data   │────▶│  Data Eng    │────▶│  LoRA Train │────▶│  Merge &     │
│  Collection │     │  & QC        │     │  (MLX)      │     │  Dequantize  │
└─────────────┘     └──────────────┘     └─────────────┘     └──────┬───────┘
                                                                    │
                                                                    ▼
┌─────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────────┐
│   Ollama    │◀────│  Quantize    │◀────│  Convert    │◀────│  GGUF F16    │
│   Deploy    │     │  (Q4_K_M)    │     │  (llama.cpp)│     │  Export      │
└─────────────┘     └──────────────┘     └─────────────┘     └──────────────┘

Tech Stack

Component	Technology	Version / Details
Base Model	Qwen3-14B	14.8B params
Quantized Base	mlx-community/Qwen3-14B-4bit	4-bit MLX
Training Framework	MLX	0.31.3
Fine-Tuning Method	LoRA (rank=32, alpha=64)	8 trainable layers
GGUF Conversion	llama.cpp	latest
Quantization	Q4_K_M	~8.5GB final model
Deployment	Ollama	latest
Hardware	Apple M4 Max (36GB)	macOS

Quick Start

Prerequisites

macOS with Apple Silicon (M1/M2/M3/M4)
32GB+ unified memory
Python 3.13+
Ollama installed

1. Clone & Setup

git clone https://github.com/YOUR_USERNAME/opc-codex.git
cd opc-codex
pip install mlx-lm --break-system-packages

2. Train (Optional — pre-trained weights available)

chmod +x scripts/retrain_v2.sh
./scripts/retrain_v2.sh

Training takes ~1-2 hours on M4 Max. Peak memory usage: ~10.6GB.

3. Convert to GGUF & Deploy

chmod +x scripts/convert_v2_to_gguf.sh
./scripts/convert_v2_to_gguf.sh

4. Run

ollama run opc-codex

Dataset

Overview

Metric	Value
Total samples	221
Format	JSONL (OpenAI chat format)
Language	Chinese
Quality score	88+/100 (5-layer review)
Train/Val split	209 / 12 (95%/5%)

Data Collection Pipeline

Source identification: Verified OPC practitioners with trackable results
Content extraction: Video transcripts, articles, podcasts, social media posts
Conversation formatting: Structured as multi-turn dialogues (system + user + assistant)
Quality scoring: 5-dimension rubric (relevance, specificity, framework usage, uniqueness, actionability)
Iterative refinement: V1 → V2 → V3 → V4, progressively improving quality
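
The quality-scoring step can be pictured as a simple filter. The sketch below is a hypothetical scorer, not the project's actual one: it assumes each of the five rubric dimensions is rated 0-20 so that the total is out of 100, and it keeps samples that reach 88. The equal weighting and the names `score_sample` and `passes` are illustrative assumptions.

```python
# Hypothetical rubric scorer: equal 0-20 weighting per dimension is an assumption.
DIMENSIONS = ("relevance", "specificity", "framework_usage", "uniqueness", "actionability")
MAX_PER_DIMENSION = 20  # assumed so that five dimensions sum to 100


def score_sample(ratings: dict) -> int:
    """Sum the five dimension ratings after validating names and ranges."""
    missing = set(DIMENSIONS) - set(ratings)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 0 <= ratings[name] <= MAX_PER_DIMENSION:
            raise ValueError(f"{name} must be in 0..{MAX_PER_DIMENSION}")
    return sum(ratings[name] for name in DIMENSIONS)


def passes(ratings: dict, threshold: int = 88) -> bool:
    """Keep a sample only if its total reaches the threshold."""
    return score_sample(ratings) >= threshold
```

A sample rated 18 on every dimension totals 90 and is kept; one rated 17 everywhere totals 85 and is dropped.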

Data Format

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
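
Before training, it can help to check every line of the JSONL file against this shape. The following is a minimal sketch using only the standard library; the function name `validate_record` and the rule that the last message should come from the assistant are assumptions made for illustration, not checks taken from the project's scripts.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unknown role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i} has empty or non-string content")
    # Assumed rule: a training example should end with the assistant's reply.
    if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems
```

Running it over each line of a training file and printing the non-empty results gives a quick report of malformed samples.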

Sample Data

See data/sample_data.jsonl for 5 anonymized examples.

Training

Hyperparameters

Parameter	V1 (Failed)	V2 (Current)	Reason for Change
Base model	Qwen3-14B-8bit	Qwen3-14B-4bit	8-bit precision loss compounded during GGUF conversion
LoRA rank	8	32	Too small for 14B model
LoRA alpha	16	64	2:1 ratio with rank
Trainable layers	16	8	OOM on 36GB Mac
Batch size	4	1	Memory constraint
Grad accumulation	4	4	Effective batch = 4
Learning rate	5e-5	2e-5	Lower for higher rank
Max seq length	2048	2048	Sufficient for most responses
Gradient checkpoint	No	Yes	Critical for memory
Total iterations	220	1000	~3 epochs over 209 samples
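
The derived quantities in the table follow from simple arithmetic, sketched below. The helper names are illustrative. How many samples one iteration consumes (a single micro-batch or a full accumulated batch) depends on the trainer, so `epochs_seen` takes it as a parameter rather than assuming a value.

```python
def effective_batch(batch_size: int, grad_accumulation: int) -> int:
    """Samples contributing to each optimizer update."""
    return batch_size * grad_accumulation


def alpha_to_rank_ratio(alpha: float, rank: int) -> float:
    """The alpha/rank ratio that conventionally scales the LoRA update."""
    return alpha / rank


def epochs_seen(iterations: int, samples_per_iteration: int, train_samples: int) -> float:
    """Passes over the training set; samples_per_iteration depends on the trainer."""
    return iterations * samples_per_iteration / train_samples


# V2 column of the table above
print(effective_batch(1, 4))        # 4
print(alpha_to_rank_ratio(64, 32))  # 2.0
```

When estimating epochs for a new run, it is worth checking the result against the trainer's own log, since the iteration count alone does not fix the number of passes over the data.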

Training Configuration

# configs/lora_config_v2.yaml
lora_parameters:
  rank: 32
  alpha: 64
  dropout: 0.05
  scale: 20.0
batch_size: 2
grad_accumulation_steps: 2
learning_rate: 2.0e-5
optimizer: "adam"

Training Results

Iter    Train Loss    Val Loss    Tokens/sec    Peak Mem
  1      3.244         3.507      198           10.2 GB
100      1.803         2.151      160           10.6 GB
200      1.770         2.151      153           10.6 GB
300      1.765         2.119      165           10.6 GB
400      1.695         2.119      165           10.6 GB  ← Best val loss
500      1.203         2.209      165           10.6 GB
600      1.024         2.209      162           10.6 GB
700      0.925         2.543      162           10.6 GB
800      0.782         2.543      155           10.6 GB
900      0.395         2.773      164           10.6 GB
1000     0.371         2.773      166           10.6 GB  ← Final

Key observation: Training loss decreased consistently (3.24 → 0.37), but validation loss bottomed out at 2.119 (iters 300-400) and rose from iter 500 onward, indicating overfitting. The best checkpoint is at iter 400.
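
Checkpoint selection from a log like this can be automated. The sketch below copies the table's values into a list and shows two illustrative helpers (names are assumptions): one picks the checkpoint with the lowest validation loss, breaking ties toward the later iteration, and one simulates patience-based early stopping.

```python
# (iteration, train loss, val loss) copied from the results table
LOG = [
    (1, 3.244, 3.507), (100, 1.803, 2.151), (200, 1.770, 2.151),
    (300, 1.765, 2.119), (400, 1.695, 2.119), (500, 1.203, 2.209),
    (600, 1.024, 2.209), (700, 0.925, 2.543), (800, 0.782, 2.543),
    (900, 0.395, 2.773), (1000, 0.371, 2.773),
]


def best_checkpoint(log):
    """Latest iteration that reaches the minimum validation loss."""
    best_val = min(val for _, _, val in log)
    return max(it for it, _, val in log if val == best_val)


def early_stop_iter(log, patience=2):
    """First iteration at which val loss has failed to improve `patience` times in a row."""
    best, stale = float("inf"), 0
    for it, _, val in log:
        if val < best:
            best, stale = val, 0
        else:
            stale += 1
            if stale >= patience:
                return it
    return None


print(best_checkpoint(LOG))   # 400
print(early_stop_iter(LOG))   # 500
```

With a patience of 2, training would have stopped at iter 500, one evaluation after the best checkpoint, instead of running to iter 1000.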

Results & Evaluation

Evaluation Framework

We designed a 9-question progressive test to evaluate persona fidelity:

Level	Question Type	What It Tests
1	Style recognition	Does the model adopt the right tone?
2	Methodology activation	Does it use specific frameworks?
3	Deep reasoning transfer	Can it think in novel ways?
4	Stress test	Does it maintain persona under pressure?

Results Summary

Version	Score	Style	Content Recall	Repetition
Base (Qwen3-14B)	—	Generic	N/A	None
V1 (rank=8)	D	Surface imitation	0/9	Severe
V2 (rank=32, final)	C+	Improved	1/9	Moderate
Skills/Prompt baseline	B+	Good	5/9	None
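
The table does not say how repetition was graded. One common proxy, sketched here as an assumption rather than the project's method, is the share of character n-grams that duplicate an earlier n-gram in the same response. Working on characters instead of whitespace-separated words keeps it usable for Chinese text.

```python
def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of character n-grams that duplicate an earlier one (0.0 = no repetition)."""
    chars = [c for c in text if not c.isspace()]
    ngrams = [tuple(chars[i:i + n]) for i in range(len(chars) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

A response that loops on the same phrase scores close to 1.0, while varied text scores close to 0.0; any cutoff for "moderate" or "severe" would need to be calibrated by hand.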

Honest Assessment

The fine-tuned model successfully captures the tone and attitude (direct, contrarian, confident) but struggles with content recall (specific case studies, named frameworks, exact methodologies). This is expected given:

Only 0.17% of parameters are trainable (LoRA rank=32 on 14B model)
221 samples is insufficient for both style and content learning
Overfitting after iter 400 suggests the model is memorizing surface patterns rather than developing deep understanding
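
The 0.17% figure can be sanity-checked with a short calculation. The sketch below assumes the layer shapes of the published Qwen3-14B configuration (hidden size 5120, MLP size 17408, 40 query heads and 8 key/value heads of dimension 128) and assumes that all seven linear projections in each of the 8 trained layers carry an adapter. Neither assumption is stated in this README.

```python
# Assumed Qwen3-14B shapes (from the published model config, not from this README).
HIDDEN = 5120
INTERMEDIATE = 17408
Q_OUT = 40 * 128   # attention heads x head dim
KV_OUT = 8 * 128   # key/value heads x head dim

# (input dim, output dim) of each linear layer assumed to carry a LoRA adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int, layers: int) -> int:
    """A LoRA pair adds rank * (d_in + d_out) weights per adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * layers


trainable = lora_params(rank=32, layers=8)
print(trainable)                    # 25690112
print(f"{trainable / 14.8e9:.2%}")  # 0.17%
```

Under these assumptions roughly 25.7M weights are trainable out of 14.8B, which agrees with the figure quoted above.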

For production use, we recommend a hybrid approach: fine-tuning for style + RAG for content accuracy + Skills for framework enforcement.

Failure Analysis & Lessons Learned

This is the most valuable section of this project. Every failure is documented to save you time.

V1: Five Root Causes of Failure

#	Issue	Symptom	Fix
1	LoRA rank too low (8)	Model couldn't learn style or content	Increased to 32
2	Insufficient epochs (~1)	Most training data never seen	Increased to ~3
3	8-bit quantized base model	Precision loss compounded during GGUF conversion	Switched to 4-bit MLX base
4	Wrong mlx-lm CLI syntax	Training failed to start	Updated to 0.31.x API
5	No gradient checkpointing	OOM on 36GB Mac	Added `--grad-checkpoint`

V2: Three Remaining Issues

#	Issue	Symptom	Potential Fix
1	Overfitting	Val loss increased after iter 400	Use early stopping; best checkpoint at iter 400
2	Content recall gap	Specific cases/frameworks not reproduced	Increase data to 500+ samples
3	Thinking mode leak	Qwen3 generates `<think>` blocks	Add stop tokens in Modelfile
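
Qwen3 wraps its reasoning in `<think>`…`</think>` tags. If leaked reasoning still reaches the client, it can also be removed in post-processing. The helper below is a minimal sketch of that alternative (the name `strip_thinking` is illustrative); it complements rather than replaces the Modelfile fix listed above.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove Qwen3-style reasoning blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()
```

Text without any reasoning block passes through unchanged apart from trimmed leading and trailing whitespace.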

Lessons Learned

Always check mlx-lm version compatibility — CLI arguments change between minor versions
Start with gradient checkpointing enabled — it's free insurance against OOM
Monitor validation loss, not just training loss — divergence means overfitting
Save checkpoints frequently — the best model may not be the final one
GGUF conversion requires dequantization first — MLX quantized weights are not directly compatible with llama.cpp

Methodology: A Reproducible Framework

Based on our experience, here is a 5-step framework for fine-tuning a persona-specific LLM from small data:

Step 1: Define the Persona

Identify 3-5 core traits (e.g., "direct", "framework-driven", "contrarian")
List 5-10 signature frameworks/methodologies
Collect 10+ representative examples of desired output

Step 2: Collect & Curate Data

Minimum 200 high-quality samples (500+ recommended)
Use a structured quality rubric (5 dimensions)
Format as multi-turn conversations
Split 95/5 for train/validation
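
A reproducible split takes only a few lines. This sketch shuffles with a fixed seed and rounds the validation count up, which is an assumption that happens to reproduce the 209 / 12 split reported for 221 samples; the function name and seed are illustrative.

```python
import math
import random


def split_train_val(samples: list, val_fraction: float = 0.05, seed: int = 0):
    """Shuffle deterministically, then hold out ceil(n * val_fraction) samples."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = math.ceil(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


train, val = split_train_val(list(range(221)))
print(len(train), len(val))  # 209 12
```

Fixing the seed keeps the validation set identical across retraining runs, so validation losses stay comparable between versions.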

Step 3: Train with Conservative Parameters

Use 4-bit quantized base to fit on consumer hardware
LoRA rank = 32-64 (higher for smaller base models)
Learning rate = 1e-5 to 3e-5
Enable gradient checkpointing
Monitor validation loss for early stopping

Step 4: Evaluate with Progressive Tests

Design 9+ questions across 4 difficulty levels
Compare against base model and prompt-only baseline
Score on style, content recall, and repetition

Step 5: Deploy & Iterate

Convert to GGUF via dequantization → F16 → quantization
Deploy with Ollama for easy testing
Collect user feedback for next training iteration

Deployment

Option 1: Ollama (Recommended)

# After running convert_v2_to_gguf.sh
ollama create opc-codex -f Modelfile.opc_codex_v2
ollama run opc-codex
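
Once the model is registered, it can also be queried programmatically through Ollama's local HTTP API. The sketch below uses only the standard library and assumes the default endpoint `http://localhost:11434/api/chat`; the helper names and the timeout are illustrative choices.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_chat_request(prompt: str, model: str = "opc-codex") -> urllib.request.Request:
    """Build a non-streaming chat request for the local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        return json.load(resp)["message"]["content"]


# Example (requires a running Ollama server with the model created):
# print(ask("Should I hire my first contractor or automate instead?"))
```

Setting `stream` to false returns the whole reply as one JSON object, which keeps the client simple at the cost of waiting for the full generation.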

Option 2: llama.cpp Server

./llama.cpp/build/bin/llama-server \
    -m opc_codex_v2_14b_q4_k_m.gguf \
    -c 4096 \
    --temp 0.7 \
    --top-p 0.8

Option 3: LM Studio / GPT4All

Import the .gguf file directly into any GGUF-compatible client.

Project Structure

opc-codex/
├── README.md                    # This file (English)
├── README_zh.md                 # Chinese documentation
├── LICENSE                      # MIT License
├── .gitignore                   # Git ignore rules
│
├── configs/
│   ├── lora_config_v2.yaml      # LoRA training configuration
│   └── training_params.md       # Hyperparameter documentation
│
├── data/
│   ├── README.md                # Data documentation
│   └── sample_data.jsonl        # 5 anonymized examples
│
├── scripts/
│   ├── retrain_v2.sh            # One-click training script
│   ├── convert_v2_to_gguf.sh    # MLX → GGUF conversion
│   ├── convert_to_gguf.sh       # V1 conversion (legacy)
│   ├── run_finetune.sh          # V1 training (legacy)
│   └── run_finetune_4b.sh       # Mobile variant (4B model)
│
├── docs/
│   ├── methodology.md           # 5-step fine-tuning framework
│   ├── failure_analysis.md      # Detailed failure analysis
│   └── evaluation.md            # Evaluation framework & results
│
├── Modelfile.opc_codex_v2      # Ollama deployment configuration
│
└── .github/
    └── ISSUE_TEMPLATE/
        └── bug_report.md

Roadmap

Data expansion: Increase to 500+ high-quality samples
Full fine-tuning: Experiment with DoRA or full-parameter tuning on cloud GPU
Hybrid architecture: Combine fine-tuning (style) + RAG (content) + Skills (frameworks)
Mobile variant: Optimize Qwen3-4B version for on-device deployment
Evaluation benchmark: Build automated persona fidelity scoring
Multi-persona support: Fine-tune multiple OPC practitioners as switchable personas
Hugging Face upload: Publish model with proper Model Card

License

This project is licensed under the MIT License. See LICENSE for details.

Acknowledgments

Qwen Team for the Qwen3 base model
MLX for the Apple Silicon training framework
llama.cpp for GGUF conversion tools
Ollama for easy local deployment
