A fine-tuned clinical AI auditing model built on Microsoft Phi-3-mini-4k-instruct using LoRA (Low-Rank Adaptation) with 8-bit quantization. Given a JSON report of clinical ML model performance metrics, the model classifies the model's health and generates a human-readable explanation.
- Overview
- Project Structure
- How It Works
- Dataset
- Model Details
- Training
- Inference
- Evaluation
- Merging LoRA Weights
- Requirements
Clinical ML models deployed in hospital settings need continuous monitoring. This project automates that auditing process by fine-tuning Phi-3-mini to act as a clinical AI auditor — it reads a model's performance metrics (AUC, ECE, drift, label shift, etc.) and returns a structured audit verdict with a category and explanation.
Example output:
Category: Calibration Failure
Explanation: High calibration error (ECE 0.2781) despite reasonable discrimination (AUC 0.863).
Recommend recalibration (Platt scaling / isotonic) and threshold review.
Hospital-Audit-Trained-Model/
│
├── data/
│ ├── raw/
│ │ └── audit_dataset_v2_5000.json # Raw dataset (5000 samples)
│ ├── processed/
│ │ ├── train.jsonl # 80% training split
│ │ └── test.jsonl # 20% test split
│ └── templates/
│ ├── prompt_template.txt # Prompt format template
│ └── response_template.txt # Response format template
│
├── training/
│ ├── dataset_builder.py # Builds train/test JSONL from raw data
│ ├── lora_config.py # LoRA hyperparameter configuration
│ └── train.py # Main training script
│
├── inference/
│ ├── run_inference.py # Single sample inference
│ └── Metrics_Test.py # Batch evaluation on test set
│
├── models/
│ └── phi3-auditor-lora-8bit/ # Saved LoRA adapter + tokenizer
│ ├── adapter_config.json
│ ├── adapter_model.safetensors
│ ├── tokenizer files...
│ ├── checkpoint-500/
│ └── checkpoint-675/
│
└── merge_lora.py # Merges LoRA adapter into base model
- Input: A JSON object containing clinical ML model metrics (AUC, accuracy, precision, recall, F1, ECE, Brier score, drift, missing rate, label shift, etc.)
- Prompt: The metrics are injected into a structured prompt with a system instruction for the auditor role.
- Output: The model returns a `Category:` classification and an `Explanation:` with actionable recommendations.
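The Prompt step can be sketched as a small helper that serializes the metrics and wraps them in the chat markers. This is a minimal sketch, not the project's actual code: `build_prompt`, `SYSTEM_MESSAGE`, and `INSTRUCTION` are hypothetical names, the strings are taken from this README's prompt format, and the exact whitespace between sections is an assumption because `prompt_template.txt` is not reproduced here.

```python
import json

# Strings taken from this README's prompt format; constant names are hypothetical.
SYSTEM_MESSAGE = "You are an AI auditor analyzing clinical model performance reports."
INSTRUCTION = "Analyze the clinical model report and classify its health."


def build_prompt(metrics: dict) -> str:
    """Render a metrics report into the system/user/assistant layout."""
    report = json.dumps(metrics, indent=2)
    return (
        f"<|system|>\n{SYSTEM_MESSAGE}\n"
        f"<|user|>\nInstruction: {INSTRUCTION}\nReport:\n{report}\n"
        f"<|assistant|>\n"  # generation continues from here
    )


# Illustrative subset of the metrics listed under Input.
example_metrics = {"auc": 0.863, "ece": 0.278, "drift": 0.03}
prompt = build_prompt(example_metrics)
print(prompt)
```

Ending the prompt on the assistant marker means everything the model generates afterwards is the audit verdict itself.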
Prompt format:
<|system|>
You are an AI auditor analyzing clinical model performance reports.
<|user|>
Instruction: Analyze the clinical model report and classify its health.
Report:
{ ...metrics JSON... }
<|assistant|>
- Source: `audit_dataset_v2_5000.json`, a synthetic dataset of 5,000 clinical model audit reports generated on 2025-11-17.
- Fields per record: `metrics` (JSON object), `audit_label` (category string), `explanation` (natural language justification).
- Split: 80/20 train/test using `sklearn.model_selection.train_test_split` with `random_state=42`.
- Processed format: JSONL with three fields: `instruction`, `input` (metrics JSON), `output` (category + explanation).
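The mapping from a raw record to the processed format might look like the sketch below. It is an illustration under stated assumptions, not the contents of `dataset_builder.py`: `to_example` and `to_jsonl` are hypothetical helpers, storing `input` as a JSON string (rather than a nested object) is an assumption, and the sample record is made up for the example. The 80/20 split is left out to keep the sketch dependency-free.

```python
import json

INSTRUCTION = "Analyze the clinical model report and classify its health."


def to_example(record: dict) -> dict:
    """Map one raw record to the instruction/input/output layout."""
    return {
        "instruction": INSTRUCTION,
        # Assumption: the metrics are stored as a JSON string.
        "input": json.dumps(record["metrics"]),
        "output": (
            f"Category: {record['audit_label']}\n"
            f"Explanation: {record['explanation']}"
        ),
    }


def to_jsonl(records: list) -> str:
    """Serialize examples as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(to_example(r)) for r in records)


# Hypothetical record for illustration only; not taken from the dataset.
raw = [{
    "metrics": {"auc": 0.863, "ece": 0.2781},
    "audit_label": "Calibration Failure",
    "explanation": "High calibration error despite reasonable discrimination.",
}]
jsonl = to_jsonl(raw)
print(jsonl)
```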
To rebuild the processed dataset:
cd training
python dataset_builder.py

| Property | Value |
|---|---|
| Base Model | microsoft/Phi-3-mini-4k-instruct |
| Fine-tuning Method | LoRA (PEFT) |
| Training Quantization | 8-bit (BitsAndBytesConfig) |
| Merged Model Precision | FP16 (F16 safetensors) |
| Model Size | ~3.8B parameters |
| Model Weight Files | 2 shards (model-00001-of-00002.safetensors + model-00002-of-00002.safetensors) |
| Total Repo Size | 7.65 GB |
| PEFT Version | 0.18.0 |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj |
| Task Type | CAUSAL_LM |
| HuggingFace Model | PhantomAjusshi/phi3-auditor-merged |
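For orientation, the LoRA and quantization rows of the table could be expressed in code roughly as follows. This is a sketch, not the contents of `lora_config.py`: `LORA_KWARGS` and `build_configs` are hypothetical names, and only the values come from the table. The library imports are deferred into the function so the constants can be inspected without the GPU stack installed.

```python
# Values copied from the Model Details table; the dict name is hypothetical.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    # Names as listed in the table; verify they match module names in the loaded model.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}

# With PEFT's default scaling, the low-rank update is multiplied by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]


def build_configs():
    """Build the PEFT and 8-bit quantization config objects."""
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    lora = LoraConfig(**LORA_KWARGS)
    quant = BitsAndBytesConfig(load_in_8bit=True)
    return lora, quant


print(f"LoRA scaling factor: {scaling}")
```

With rank 16 and alpha 32, the adapter update is scaled by 2.0 before being added to the frozen weights.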
Hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 4 |
| Gradient Accumulation Steps | 4 |
| Effective Batch Size | 16 |
| Learning Rate | 1e-4 |
| Warmup Ratio | 0.1 |
| Max Sequence Length | 512 |
| Precision | FP16 (if CUDA available) |
| Checkpointing | Every 500 steps |
Training loss progression:
| Step | Epoch | Loss |
|---|---|---|
| 50 | 0.22 | 1.6228 |
| 100 | 0.44 | 0.6565 |
| 150 | 0.67 | 0.4436 |
| 500 | 2.22 | 0.4109 |
| 675 | 3.00 | ~0.410 |
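The step and epoch columns can be cross-checked against the hyperparameters with a little arithmetic. The sketch below assumes training on a single device (so the effective batch size is batch size times gradient accumulation steps) and that warmup steps are rounded up; variable names are illustrative.

```python
import math

per_device_batch = 4
grad_accum_steps = 4
epochs = 3
total_steps = 675      # final step in the loss table
warmup_ratio = 0.1

# One optimizer step consumes one effective batch.
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = total_steps // epochs
approx_examples_per_epoch = steps_per_epoch * effective_batch
warmup_steps = math.ceil(warmup_ratio * total_steps)

print(effective_batch, steps_per_epoch, approx_examples_per_epoch, warmup_steps)
```

Under these assumptions the table implies 225 optimizer steps per epoch, or roughly 3,600 examples seen per epoch, which is consistent with step 50 falling at epoch 0.22. That figure is somewhat below the 4,000 records an 80% split of 5,000 would give; the README does not state the reason.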
To train from scratch:
cd training
python train.py

Note: Requires a CUDA-enabled GPU with at least 8 GB VRAM for 8-bit training.
Run a single inference using the merged model from HuggingFace:
python inference/run_inference.py

The script loads the model directly from HuggingFace at PhantomAjusshi/phi3-auditor-merged. You can also use it programmatically:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
MODEL_PATH = "PhantomAjusshi/phi3-auditor-merged"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None,
    trust_remote_code=True,  # required for Phi-3 custom modeling code
)
report = """{
"auc": 0.863,
"accuracy": 0.83,
"precision": 0.79,
"recall": 0.69,
"f1": 0.79,
"ece": 0.278,
"brier": 0.263,
"drift": 0.03,
"missing_rate": 0.003,
"label_shift": 0.06,
"pos_rate": 0.10,
"data_integrity_issues": 0
}"""
prompt = (
    f"<|system|> You are a clinical AI auditor model.\n"
    f"<|user|>\nInstruction: Analyze the clinical model report and classify its health.\n\nReport:\n{report}\n"
    f"<|assistant|>\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=400, temperature=0.7, top_p=0.9, repetition_penalty=1.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note: `trust_remote_code=True` is required because the HuggingFace repo includes custom Phi-3 modeling files (modeling_phi3.py, configuration_phi3.py).
Generation settings: max_new_tokens=400, temperature=0.7, top_p=0.9, repetition_penalty=1.2
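Because the decoded text is free-form, downstream code usually needs to pull the verdict back out of it. The following is a minimal sketch of one way to do that with a regular expression; `parse_audit` is a hypothetical helper, and it assumes the model follows the `Category:` / `Explanation:` layout shown in the example output.

```python
import re

# Category on one line, followed by an Explanation that may span several lines.
AUDIT_PATTERN = re.compile(
    r"Category:\s*(.+?)\s*\n\s*Explanation:\s*(.+)", re.DOTALL
)


def parse_audit(text: str) -> dict:
    """Extract the category and explanation from generated text."""
    match = AUDIT_PATTERN.search(text)
    if match is None:
        # The model did not follow the expected layout.
        return {"category": None, "explanation": None}
    category, explanation = match.groups()
    return {
        "category": category.strip(),
        # Collapse line breaks inside the explanation into single spaces.
        "explanation": " ".join(explanation.split()),
    }


sample = (
    "Category: Calibration Failure\n"
    "Explanation: High calibration error (ECE 0.2781) despite reasonable "
    "discrimination (AUC 0.863).\n"
    "Recommend recalibration (Platt scaling / isotonic) and threshold review."
)
result = parse_audit(sample)
print(result["category"])
```

Returning `None` fields for unparseable text lets a caller count malformed generations separately from wrong classifications.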
Run batch evaluation on the test set to get classification metrics:
python inference/Metrics_Test.py

This generates predictions for all records in data/processed/test.jsonl, computes weighted precision, recall, F1, and accuracy against the ground-truth `Category:` labels, and saves full results to inference_results.json.
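To make "weighted" concrete, the sketch below computes support-weighted precision, recall, and F1 in plain Python: each class's score is weighted by its share of the ground-truth labels. It is an illustration of the metric, not the code in `Metrics_Test.py`, which may simply call scikit-learn; `weighted_scores` is a hypothetical helper, and the "Healthy" and "Data Drift" labels are invented for the example.

```python
from collections import Counter


def weighted_scores(y_true: list, y_pred: list) -> dict:
    """Accuracy plus support-weighted precision, recall, and F1."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        p_c = tp / n_pred if n_pred else 0.0
        r_c = tp / n_true
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n_true / total  # class share of the ground truth
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical labels for illustration; only "Calibration Failure" appears in this README.
y_true = ["Healthy", "Healthy", "Calibration Failure", "Data Drift"]
y_pred = ["Healthy", "Calibration Failure", "Calibration Failure", "Data Drift"]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

With support weighting, weighted recall always equals accuracy, so the two columns in the results should match.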
To merge the LoRA adapter into the base model for standalone deployment (no PEFT dependency at inference time):
python merge_lora.py

Output is saved to models/phi3-auditor-merged/. The merged model is published on HuggingFace as PhantomAjusshi/phi3-auditor-merged and consists of:
- model-00001-of-00002.safetensors (4.97 GB)
- model-00002-of-00002.safetensors (2.67 GB)
- modeling_phi3.py + configuration_phi3.py (custom Phi-3 code, required for `trust_remote_code=True`)
- Tokenizer files (tokenizer.json, tokenizer.model, tokenizer_config.json, etc.)

Note: Large model files (`.safetensors`, `.bin`, `.pt`) are excluded from this repository via `.gitignore`. Use the HuggingFace hosted weights directly or retrain and merge locally.
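For readers who want to merge locally, the general shape of such a merge with PEFT is sketched below. This is not the contents of `merge_lora.py`: `merge_adapter` and `MERGE_ARGS` are hypothetical names, the paths are taken from the project tree, and loading the base model in FP16 (rather than 8-bit) before merging is an assumption chosen to match the stated merged precision.

```python
def merge_adapter(base_id: str, adapter_dir: str, output_dir: str) -> None:
    """Fold LoRA weights into the base model and save a standalone copy."""
    # Heavy dependencies are imported lazily, only when a merge is requested.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: merge into an unquantized FP16 copy of the base model.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()  # plain model with adapter deltas folded in
    merged.save_pretrained(output_dir, safe_serialization=True)
    # The adapter directory also holds the tokenizer files.
    AutoTokenizer.from_pretrained(adapter_dir).save_pretrained(output_dir)


MERGE_ARGS = {
    "base_id": "microsoft/Phi-3-mini-4k-instruct",
    "adapter_dir": "models/phi3-auditor-lora-8bit",
    "output_dir": "models/phi3-auditor-merged",
}

# Uncomment to run the merge (downloads the base model, needs several GB of memory):
# merge_adapter(**MERGE_ARGS)
```

After merging, the saved directory can be loaded with `AutoModelForCausalLM.from_pretrained` alone, which is what removes the PEFT dependency at inference time.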
pip install torch transformers peft bitsandbytes datasets scikit-learn accelerate

| Package | Role |
|---|---|
| `transformers` | Base model loading, tokenizer, training |
| `peft` | LoRA adapter configuration and application |
| `bitsandbytes` | 8-bit quantization |
| `datasets` | JSONL dataset loading |
| `scikit-learn` | Train/test split and evaluation metrics |
| `accelerate` | Device mapping and mixed precision |
This project is for research and educational purposes. The base model is subject to Microsoft's Phi-3 license.