Colab guide for fine-tuning LiquidAI/LFM2-VL-1.6B with LoRA (PEFT) and 4-bit quantization.
- Click Open in Colab (button above).
- In Colab: Runtime → Change runtime type → GPU.
- Run cells top-to-bottom. When prompted, upload your
train_data.zip. - After training, download
lfm2_adapter.zip.
Optional: uploadprobe.7z(or any.7zwith images, e.g.probe.7z/prob.7z, or upload images directly) to test and getpreds.jsonl.
Latest release: https://github.com/Absurd7550/lfm2-vl-finetune-guide/releases/tag/v1.0
- A small LoRA adapter you can share:
out/final_adapter/(folder)lfm2_adapter.zip(archive)
- A working inference recipe:
- Base model:
LiquidAI/LFM2-VL-1.6B -
- your adapter
- Run on new images
- Base model:
- Google Colab with GPU (recommended).
- Your dataset packed as a ZIP archive (
train_data.zip) in the format below.
Create train_data.zip with this structure:
dataset/
images/
00001.jpg
00002.jpg
...
metadata.jsonl
metadata.jsonl must be JSON Lines (one JSON object per line):
{"file_name":"00001.jpg","text":"Your target text / caption / answer"}
{"file_name":"00002.jpg","text":"..."}
Notes:
file_name must exist inside dataset/images/.
text is what you want the model to learn (captions, instructions, structured JSON answers, etc.).
Start small (200–1000 examples) and iterate.
A minimal example is provided in dataset_example/.
Why this notebook is stable
This guide uses a training setup discovered through trial-and-error that avoids common pitfalls:
Correct model class for training: Lfm2VlForConditionalGeneration (not AutoModel).
Avoids fragile vision patches (no SigLIP2 hacks).
Forces single-crop by resizing images to 384×384 before passing them to the processor.
This reduces multi-crop shape issues that can break training.
Trainer configuration uses remove_unused_columns=False (important for multimodal batches).
W&B is disabled to avoid interactive prompts in Colab.
Quick start overview (what the notebook does)
The notebook train_colab.ipynb contains:
Install dependencies
Upload + unzip train_data.zip
Load LiquidAI/LFM2-VL-1.6B in 4-bit
Attach LoRA and fine-tune on your dataset
Save the adapter to lfm2_adapter.zip
Optional: test on new images / probe.7z and save results to preds.jsonl
Common issues
The model repeats the prompt or returns empty output during inference
Use:
processor.apply_chat_template(..., add_generation_prompt=True)
Decode only generated tokens: out[0][input_len:]
W&B asks interactive questions in Colab
Disable it:
WANDB_DISABLED=true
report_to="none"
Multi-crop / vision shape errors
Use forcing single-crop:
img.resize((384,384))
Publishing your adapter
You can publish lfm2_adapter.zip via:
GitHub Releases (simple)
Hugging Face Hub (recommended for adapters)
Note about Raspberry Pi 5
Fine-tuning should be done on GPU (Colab/desktop/server).
Running a 1.6B vision-language model on Raspberry Pi 5 is usually slow; Pi is best used as an edge capture/control device, while inference runs on a stronger machine or a heavily quantized runtime.
Attribution / license
Base model: LiquidAI/LFM2-VL-1.6B (see the model card/license on Hugging Face).
This repository provides training code and guidance; users are responsible for respecting dataset/model licenses.