Skip to content

Ashx098/sft-play

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

22 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

SFT-Play (QLoRA-ready, 8-GB Friendly) - QLoRA-ready SFT starter kit

License: MIT Python 3.8+ PyTorch Transformers Code style: black

Plug-and-play Supervised Fine-Tuning on small GPUs. Single config, QLoRA/LoRA/Full switches, bitsandbytes/Unsloth backends, Jinja chat templating, TensorBoard live UI, and lean checkpoints (save adapters, not full models).


🎯 Who Should Use SFT-Play?

  • AI hobbyists β€” fine-tune models on your own dataset without cloud GPUs
  • Researchers β€” run small-scale experiments before scaling to larger infrastructure
  • Educators β€” teach LLM fine-tuning with a minimal, reproducible setup
  • Open-source contributors β€” build datasets + share fine-tuned models efficiently
  • Developers β€” prototype AI features with custom models on local hardware

✨ Features

  • Runs on any single GPU (8 GB+) β€” VRAM probe auto-tunes batch/grad-accum.
  • Two-config UX β€” config_base.yaml (defaults) + backend-specific configs (run_bnb.yaml, run_unsloth.yaml).
  • Tuning modes β€” qlora | lora | full (config switch).
  • Backends β€” bitsandbytes (default) or unsloth (optional; auto-fallback to bnb).
  • Data pipeline β€” raw β†’ structured chat (system,user,assistant) β†’ Jinja render on-the-fly.
  • UI β€” TensorBoard only (loss/metrics/LR; optional GPU stats).
  • Model Caching - Automatically download models from Hugging Face Hub and cache them locally.
  • Flexible GPU scaling β€” uses one GPU by default, or many via torchrun/accelerate.
  • Tiny checkpoints β€” LoRA adapters only (~50-200 MB vs. full model's multiple GB).
  • Complete automation β€” Makefile + workflows for zero-config setup.

πŸ—‚οΈ Repo Layout

sft-play/
β”œβ”€ configs/
β”‚  β”œβ”€ config_base.yaml          # reusable defaults (rarely change)
β”‚  β”œβ”€ run_bnb.yaml              # BitsAndBytes backend config (default)
β”‚  └─ run_unsloth.yaml          # Unsloth backend config (optional)
β”œβ”€ data/
β”‚  β”œβ”€ raw/                      # input sources (json/csv/jsonl)
β”‚  β”œβ”€ processed/                # structured chat (system,user,assistant)
β”‚  β”œβ”€ processed_with_style/     # optional: after style injection
β”‚  └─ rendered/                 # optional: materialized seq2seq (input,target)
β”œβ”€ chat_templates/
β”‚  └─ default.jinja             # single Jinja template
β”œβ”€ scripts/
β”‚  β”œβ”€ process_data.py           # raw β†’ structured chat + split
β”‚  β”œβ”€ style_prompt.py           # inject/override system/style rule
β”‚  β”œβ”€ render_template.py        # (optional) Jinja β†’ seq2seq jsonl
β”‚  β”œβ”€ train.py                  # QLoRA/LoRA/Full; bnb/Unsloth; TB logging
β”‚  β”œβ”€ eval.py                   # ROUGE-L/SARI/Exact-Match (+ schema checks)
β”‚  β”œβ”€ infer.py                  # batch/interactive inference (same template)
β”‚  β”œβ”€ merge_lora.py             # merge adapters β†’ single FP16 model (optional)
β”‚  └─ utils/
β”‚     └─ model_store.py        # handles model downloading/caching
β”œβ”€ env/
β”‚  └─ accelerate_config.yaml    # fp16, single-GPU defaults
β”œβ”€ outputs/                     # TB logs, metrics, sample preds
β”œβ”€ adapters/                    # LoRA adapter checkpoints
β”œβ”€ workflows/                   # automation scripts
β”‚  β”œβ”€ quick_start.sh            # interactive setup with sample data
β”‚  └─ batch_process.sh          # batch processing automation
β”œβ”€ Makefile                     # complete automation commands
β”œβ”€ requirements.txt
└─ README.md

βš™οΈ Configs

configs/config_base.yaml (defaults)

  • Training: epochs, warmup, weight_decay, fp16, gradient_checkpointing
  • Checkpoint/eval: save_strategy, save_steps, eval_strategy, save_total_limit, metric_for_best_model, load_best_model_at_end
  • Data: format: chat, template_path, split ratios (train/val/test)
  • Logging: backend: tensorboard, log_interval

Backend-Specific Configs (Recommended)

For BitsAndBytes (Stable, Broad Compatibility):

# configs/run_bnb.yaml
include: configs/config_base.yaml

tuning:
  backend: bnb             # BitsAndBytes backend
  mode: qlora

train:
  bf16: false              # BitsAndBytes works best with fp16
  fp16: true
  output_dir: outputs/run-bnb

For Unsloth (Faster, Requires Compatible CUDA):

# configs/run_unsloth.yaml
include: configs/config_base.yaml

tuning:
  backend: unsloth         # Unsloth backend
  mode: qlora

train:
  bf16: true               # Unsloth works better with bfloat16
  fp16: false
  output_dir: outputs/run-unsloth

Data Format Support

The training script supports both processed and rendered data formats:

Processed Format (Default):

{"system": "You are a helpful assistant.", "user": "What is 2+2?", "assistant": "2+2 equals 4."}

Rendered Format (Optional):

{"input": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant", "target": "2+2 equals 4."}

To switch to rendered data:

  1. Generate rendered data: make render
  2. Create config pointing to rendered paths:
data:
  train_path: data/rendered/train.jsonl
  val_path:   data/rendered/val.jsonl
  test_path:  data/rendered/test.jsonl

Fallback behavior when no chat template: If chat_templates/default.jinja is missing, the system automatically creates simple format:

System: You are a helpful assistant.
User: What is 2+2?
Assistant:2+2 equals 4.

Backend Safety Features

Automatic Backend Stamping:

  • Each training run creates outputs/<run>/backend.json with backend info
  • Prevents accidental resume across different backends
  • Validates configuration consistency on resume

XFormers Safety:

  • Unsloth automatically disables XFormers to prevent compatibility issues
  • BitsAndBytes uses standard PyTorch attention mechanisms
  • Automatic fallback from Unsloth to BitsAndBytes if import fails

Precision Auto-Detection:

  • Auto-enables bf16 on Ada GPUs (RTX 40xx series) when neither precision is set
  • Auto-enables fp16 on other GPUs as fallback
  • Prevents both bf16 and fp16 being enabled simultaneously

Recommended Usage: Use configs/run_bnb.yaml for stability or configs/run_unsloth.yaml for speed. The backend-specific configs ensure optimal settings and prevent configuration conflicts.


πŸ€— Hugging Face Token

To download models from the Hugging Face Hub, you need to provide an access token. You can do this in one of three ways:

  1. CLI Login (Recommended):

    huggingface-cli login

    This will store your token securely on your machine.

  2. Environment Variable:

    export HUGGINGFACE_HUB_TOKEN=hf_...

    You can add this to your shell profile (e.g., .bashrc, .zshrc) or a .env file.

  3. Offline Mode: After downloading a model once, you can work offline:

    export HF_HUB_OFFLINE=1

πŸ“ˆ TensorBoard

We log to a fixed path: outputs/tb.

Automatic TensorBoard (Recommended):

make train-bnb-tb           # BitsAndBytes training with TensorBoard auto-start
make train-unsloth-tb       # Unsloth training with TensorBoard auto-start

Manual TensorBoard:

make tensorboard            # uses port 6006
make tensorboard TB_PORT=6007

Stop it:

make tb-stop

How it works:

  • The -tb training targets automatically start TensorBoard in the background before training
  • TensorBoard runs on port 6006 by default (configurable with TB_PORT)
  • After training completes, TensorBoard continues running for you to review results
  • Use make tb-stop to stop TensorBoard when you're done

If TB shows "No dashboards…" check you're pointing at the absolute path:

tensorboard --logdir "$(pwd)/outputs/tb" --port 6006

TensorBoard Screenshots

Training Progress Monitoring: TensorBoard Training

Loss and Learning Rate Tracking: Loss and LR Tracking

Evaluation Metrics: Evaluation Metrics

πŸš€ Quickstart

Option 1: Automated Setup (Recommended)

Complete setup in one command:

./workflows/quick_start.sh

This interactive script will:

  • Install dependencies (auto-detects uv or pip)
  • Create all necessary directories
  • Generate sample data if none exists
  • Process data through the complete pipeline
  • Guide you to training

Setup and Training Screenshots: Setup Process

Training Configuration

Or use individual Makefile commands:

make help                    # See all available commands
make install                 # Install dependencies
make setup-dirs             # Create directories
make full-pipeline          # Complete data processing
make check                  # Validate setup before training
make train-with-tb          # Train with TensorBoard monitoring

You can also pre-download models using the Makefile:

make download-model MODEL=Qwen/Qwen2.5-3B-Instruct

Option 2: Manual Step-by-Step

0) Install

pip install -r requirements.txt
# or
# uv venv && uv pip install -e .

(Optional) Configure Accelerate:

accelerate config  # or use env/accelerate_config.yaml

1) Add raw data

Create data/raw/raw.jsonl or raw.json. Example (JSONL):

{"system":"You are a helpful assistant.","user":"What is machine learning?","assistant":"Machine learning is..."}

(Your process_data.py also supports simple dicts like {"question":"...","answer":"..."}.)

2) Process raw β†’ structured chat

python scripts/process_data.py --config configs/run_bnb.yaml --raw_path data/raw/raw.jsonl
# writes data/processed/{train,val,test}.jsonl

3) (Optional) Inject style/system rule

python scripts/style_prompt.py --config configs/run_bnb.yaml \
  --style "Answer in ≀2 concise sentences. No markdown." \
  --in data/processed/train.jsonl \
  --out data/processed_with_style/train.jsonl
# repeat for val/test if desired and update config data paths

4) Validate setup

make check                   # Comprehensive sanity check

5) Train (TensorBoard logs)

python scripts/train.py --config configs/run_bnb.yaml
tensorboard --logdir outputs/
  • See train/loss, eval/loss, train/lr, and eval/rougeL live.
  • Checkpoints saved every save_steps (adapters + trainer state only).

6) Evaluate

python scripts/eval.py --config configs/run_bnb.yaml --split val
# writes outputs/metrics.json and outputs/samples.jsonl

7) Inference

Interactive:

python scripts/infer.py --config configs/run_bnb.yaml

Batch:

echo "Explain QLoRA in two lines." > demo_inputs.txt
python scripts/infer.py --config configs/run_bnb.yaml --mode batch --input_file demo_inputs.txt --output_file outputs/preds.txt
Inference Quality Improvements

The inference script includes several optimizations to ensure high-quality, single-turn responses:

1. Proper Stopping Conditions

  • Forces stop at EOS tokens using eos_token_id=tokenizer.eos_token_id
  • Adds custom stop tokens for chat template boundaries (<|user|>, </|assistant|>)
  • Prevents multi-turn generation where the model continues beyond the assistant's response

2. Template Boundary Parsing

  • Extracts only the assistant's response from the full generation
  • Strips everything after the first <|assistant|> β†’ <|user|> boundary
  • Handles both </|assistant|> end tags and natural conversation boundaries

3. Template Consistency

  • Loads the same Jinja chat template used during training
  • Ensures inference format exactly matches training format
  • Prevents template mismatches that can cause poor generation quality

Example of clean output extraction:

# Raw generation might include:
# "<|system|>You are helpful</|system|><|user|>Hello</|user|><|assistant|>Hi there!<|user|>..."

# Cleaned output extracts only:
# "Hi there!"

These improvements ensure that:

  • Models stop generating at appropriate conversation boundaries
  • Output is clean and contains only the intended assistant response
  • Template consistency is maintained between training and inference
  • Multi-turn conversations don't bleed into single responses

Inference in Action: Interactive Inference

8) (Optional) Merge adapters β†’ FP16 model

python scripts/merge_lora.py --config configs/run_bnb.yaml \
  --adapters adapters/last \
  --out outputs/merged_fp16 \
  --dtype fp16

πŸ› οΈ Automation Commands

Complete Makefile Reference

# Setup
make install                 # Install dependencies (auto-detects uv/pip)
make setup-dirs             # Create all necessary directories

# Data Pipeline
make process                # Process raw data to structured chat
make style                  # Apply style prompts to all splits
make render                 # Render chat templates
make full-pipeline          # Complete data processing pipeline

# Training & Evaluation
make train                  # Start training with current config
make train-bnb              # Start training with BitsAndBytes backend
make train-unsloth          # Start training with Unsloth backend
make train-with-tb          # Train with TensorBoard monitoring
make train-bnb-tb           # BitsAndBytes training with TensorBoard
make train-unsloth-tb       # Unsloth training with TensorBoard
make eval                   # Evaluate on validation set
make eval-test              # Evaluate on test set
make eval-quick             # Quick evaluation (200 samples)
make eval-full              # Full evaluation (no limit)

# Inference
make infer                  # Interactive inference (chat mode)
make infer-batch            # Batch inference from file
make infer-interactive      # Interactive inference (explicit)

# Model Management
make download-model         # Download a model from Hugging Face Hub
make merge                  # Merge LoRA adapters to FP16 model
make merge-bf16             # Merge LoRA adapters to BF16 model
make merge-test             # Test merged model loading

# Monitoring
make tensorboard            # Start TensorBoard on outputs/tb
make tb-stop                # Kill any running TensorBoard
make tb-clean               # Remove TB event files
make tb-open                # Print exact path & suggest URL

# Utilities
make check                  # Validate project setup
make clean                  # Clean generated files
make help                   # Show all commands

Workflow Scripts

# Interactive setup with sample data
./workflows/quick_start.sh

# Batch processing for multiple datasets
./workflows/batch_process.sh

Customization

# Custom style prompts
make style STYLE="Answer in JSON format only"

# Custom configuration
make train CONFIG=configs/my_config.yaml

# Custom workflows
make process && make style && make train

🧠 Design Notes

  • Memory-efficient checkpoints: we save only LoRA adapters + trainer state. Result: tiny checkpoints, fast resume. Merge at the end only if you need a single FP16 folder.

  • VRAM-aware: when batch_size/grad_accum are auto, training probes free VRAM and picks safe values (starts at bs=1, increases accumulation).

  • Template flexibility: training renders Jinja on-the-fly, so you can change chat_templates/default.jinja without reprocessing.

    {{ system }}
    User: {{ user }}
    Assistant: {{ assistant }}
  • Backends: set tuning.backend: unsloth if installed; otherwise it auto-falls back to bnb with a warning.

  • Complete automation: Makefile provides 20+ commands for every aspect of the pipeline.

  • VRAM efficiency: Qwen2.5-3B + QLoRA + bnb β†’ ~6.5 GB VRAM at seq_len=512

  • Modes:

    • qlora β†’ 4-bit base + LoRA (best for 8 GB on 7B/3B causal LMs)
    • lora β†’ fp16/bf16 base + LoRA (fine for 1–3B, or enough VRAM)
    • full β†’ full fine-tune (use for small seq2seq, e.g., FLAN-T5-base)

πŸ§ͺ Troubleshooting

Configuration Issues

  • Training Arguments Mismatch Error

    ValueError: --load_best_model_at_end requires the save and eval strategy to match
    

    Solution: This was fixed in the training script. The issue occurred when evaluation_strategy and save_strategy didn't match. The script now properly handles both evaluation_strategy and eval_strategy keys from config files.

  • Backend Switching: BitsAndBytes ↔ Unsloth

    Use the provided backend-specific configs:

    # For BitsAndBytes (stable, broad compatibility)
    make train CONFIG=configs/run_bnb.yaml
    
    # For Unsloth (faster, requires compatible CUDA)
    make train CONFIG=configs/run_unsloth.yaml

    Or create custom config:

    # For Unsloth backend
    include: configs/config_base.yaml
    tuning:
      backend: unsloth
    train:
      bf16: true    # Unsloth works better with bfloat16
      fp16: false
    
    # For BitsAndBytes backend  
    include: configs/config_base.yaml
    tuning:
      backend: bnb
    train:
      bf16: false   # BitsAndBytes is more stable with float16
      fp16: true

    Key differences:

    • Unsloth: Faster training, requires specific CUDA versions, works best with bf16: true
    • BitsAndBytes: More stable, broader compatibility, works best with fp16: true
    • Auto-fallback: If Unsloth fails to import, the system automatically falls back to BitsAndBytes

Memory and Performance Issues

  • CUDA OOM

    • Lower model.max_seq_len (e.g., 512 β†’ 384).
    • Keep mode: qlora, backend: bnb, batch_size: auto, grad_accum: auto.
    • Ensure TensorBoard isn't eating VRAM on the same GPU (runs on CPU, but double-check).
  • Unsloth import fails

    • Use backend: bnb (default).
    • If you insist on Unsloth, match its CUDA/PTX requirements.
  • XFormers compatibility issues

    NotImplementedError: No operator found for `memory_efficient_attention_backward`
    

    Solution: This is a known compatibility issue with Unsloth + XFormers on certain GPU/CUDA configurations. The system now automatically falls back to BitsAndBytes when Unsloth is requested:

    Automatic Fallback: When you use configs/run_unsloth.yaml, the system detects XFormers issues and automatically switches to BitsAndBytes with a warning message.

    Recommended Approach: Use BitsAndBytes directly for maximum stability:

    make train-bnb        # Direct BitsAndBytes training
    make train-bnb-tb     # BitsAndBytes with TensorBoard

    Manual Configuration: If you want to force BitsAndBytes:

    tuning:
      backend: bnb
    train:
      bf16: false
      fp16: true

Data and Training Issues

  • Weird formatting in generations

    • Check chat_templates/default.jinja and your style prompt.
    • Remember causal LMs may echo the prompt; infer.py strips assistant tags heuristically.
  • Metrics too low

    • Increase epochs to 3–5.
    • Tune LoRA r (16β†’32) or LR (2e-4 β†’ 1e-4).
    • Ensure your processed data is clean and task-consistent.
  • ROUGE metrics showing 0.0

    [train] Warning: Could not compute ROUGE metrics: argument 'ids': 'list' object cannot be interpreted as an integer
    

    Note: This is a known issue with the ROUGE evaluation library and doesn't affect training. The model is still learning (check the decreasing loss values).

Setup Issues

  • Setup issues

    • Run make check to validate your setup
    • Use ./workflows/quick_start.sh for guided setup
    • Check AUTOMATION_GUIDE.md for detailed automation docs

Configuration Validation

  • Before training, always validate your config:

    make check                    # Comprehensive validation
    python scripts/train.py --config configs/run_bnb.yaml --help  # Check arguments
  • Common config mistakes:

    • Mismatched evaluation_strategy and save_strategy (now auto-fixed)
    • Wrong precision settings for your backend (bf16 vs fp16)
    • Missing data files or incorrect paths
    • Incompatible model settings with available VRAM

βœ… Definition of Done (v0.1)

  • End-to-end run on Qwen2.5-3B (QLoRA+bnb) on an 8 GB GPU without OOM.
  • Live TensorBoard charts.
  • outputs/metrics.json with ROUGE-L (and others).
  • infer.py produces sensible answers.
  • (Optional) outputs/merged_fp16 exists and loads with HF.
  • Complete automation with Makefile and workflow scripts.
  • Sanity checking with make check validation.

πŸ“š Documentation

  • AUTOMATION_GUIDE.md - Detailed automation system documentation
  • SETUP_DOCUMENTATION.md - Complete project setup guide
  • LICENSE - MIT License for open source use

🎯 Quick Examples

Try fine-tuning on EduGen Small Q&A (Kaggle link). Takes ~10 min on an RTX 4060.

πŸš€ Real Results: 10-Minute Fine-Tune Success

Tried a 10-minute QLoRA fine-tune on Qwen2.5-1.5B with 240 Q&As (EduGen style):

  • Baseline ROUGE-L: 0.17 β†’ After SFT: 0.33 (~95% improvement!)
  • SARI: 40 β†’ 55 (+15 points)

That's ~95% growth in ROUGE-L and +15 points in SARI in just 10 minutes! Even with tiny data, the model became step-by-step and precise.

Proof that fine-tuning isn't just for labs β€” you can try it in a few minutes on consumer GPUs.

πŸ“Š Metric Improvements (Visual Proof)

ROUGE-L Score Improvement: ROUGE-L Results

SARI Score Improvement: SARI Results

Usage Examples

Complete beginner workflow:

./workflows/quick_start.sh     # One command setup
make check                     # Validate everything
make train-bnb-tb             # Train with TensorBoard monitoring

Advanced user workflow:

make install && make setup-dirs
make process
make style STYLE="Be concise and professional"
make check
make train CONFIG=configs/my_qlora.yaml
make eval-test
make merge && make merge-test

Development workflow:

make full-pipeline            # Process all data
make eval-quick               # Fast validation
make infer                    # Test interactively

GPU scaling: one or many

scripts/train.py uses Hugging Face's Trainer. It runs on a single GPU out of the box and can scale to multiple GPUs when launched with standard PyTorch tools.

Single GPU (default)

python scripts/train.py --config configs/run_bnb.yaml
# or pick a specific device
CUDA_VISIBLE_DEVICES=1 python scripts/train.py --config configs/run_bnb.yaml

If only one GPU is visible, the script uses it automatically. When no GPU is available, it falls back to CPU (slow).

Multiple GPUs (data parallel)

# Two GPUs on one machine
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 scripts/train.py --config configs/run_bnb.yaml

# Four processes via Accelerate
accelerate launch --num_processes 4 scripts/train.py --config configs/run_bnb.yaml

The script assigns each rank to its own GPU, disables device_map="auto" sharding, and adjusts gradient accumulation so the effective batch size stays consistent across ranks. To limit GPUs, set CUDA_VISIBLE_DEVICES or --nproc_per_node=1.

Makefile helpers

# Torchrun backend
make train-multi NPROC=2 CONFIG=configs/run_bnb.yaml

# Accelerate backend
make train-accelerate NPROC=4 CONFIG=configs/run_bnb.yaml

NPROC defaults to the number of visible GPUs. Override CONFIG to pick a different training config.

That's it! The automation system makes SFT-Play truly plug-and-play. Run make help to see all available commands, or start with ./workflows/quick_start.sh for a guided experience.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors