Setup - Quickstart - CLI - Backends - Datasets - GRPO - Evaluation - GGUF - Contributing
LEAP-Finetune is a minimal fine-tuning repo for LFM2. It handles dataset formatting, validation, distributed orchestration, checkpointing, and export for local GPU nodes, SLURM clusters, Modal, and Kubernetes/KubeRay.
For feature requests or custom infrastructure support, reach out to support@liquid.ai with your setup.
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | shClone the repo:
git clone <repository-url>
cd leap_finetuneCUDA / NVIDIA clusters use the default dependency groups:
uv syncAMD / ROCm clusters should use the ROCm group instead:
uv sync --no-group cuda --group rocmThe ROCm group is lockfile-managed and uses vLLM's ROCm wheel index for vLLM
plus the matching torch, torchvision, torchaudio, flash-attn, and
triton stack. The pinned ROCm vLLM wheels are Python 3.12 Linux wheels, so
use the repo's .python-version when creating AMD environments.
If flash-attn was built against a different Torch/CUDA ABI, errors such as
flash_attn_2_cuda... undefined symbol usually mean the environment needs to
be rebuilt:
uv cache clean flash-attn
rm -rf .venv
MAX_JOBS=1 uv syncRun this on a machine with a CUDA toolkit and enough build memory available if
uv needs to rebuild flash-attn from source.
Create a YAML config file or copy one from job_configs/:
project_name: "my_sft_project"
model_name: "LFM2-1.2B"
training_type: "sft"
dataset:
path: "HuggingFaceTB/smoltalk"
type: "sft"
limit: 1000
test_size: 0.2
subset: "all"
training_config:
extends: "DEFAULT_SFT"
num_train_epochs: 3
per_device_train_batch_size: 2
learning_rate: 2e-5
peft_config:
extends: "DEFAULT_LORA"
use_peft: truetraining_config.extends inherits from a base config such as DEFAULT_SFT,
DEFAULT_DPO, or DEFAULT_VLM_SFT; fields in your YAML override the base.
peft_config.extends works the same way for LoRA defaults such as
DEFAULT_LORA and DEFAULT_VLM_LORA.
Launch training:
uv run leap-finetune job_configs/sft_example.yamlTraining uses Ray Train and Accelerate for distributed execution. Unless
output_dir is set, results are written to
outputs/{project_name}/{run_name}/. Each run gets a unique name based on the
model, dataset, learning rate, and timestamp.
Useful starter configs:
| Mode | Config |
|---|---|
| SFT | job_configs/sft_example.yaml |
| SFT + LoRA | job_configs/sft_with_lora_example.yaml |
| DPO | job_configs/dpo_example.yaml |
| VLM SFT | job_configs/vlm_sft_example.yaml |
| VLM DPO | job_configs/vlm_dpo_example.yaml |
| GRPO | job_configs/grpo_example.yaml |
| VLM GRPO | job_configs/vlm_grpo_grounding_example.yaml |
| MoE SFT | job_configs/moe_sft_example.yaml |
| MoE DPO | job_configs/moe_dpo_example.yaml |
| Expert-parallel MoE SFT | job_configs/moe_ep_sft_example.yaml |
| Standalone eval | job_configs/eval_standalone_example.yaml |
During development, prefer the repo environment so you get the lockfile-managed CUDA/vLLM stack:
uv run leap-finetune job_configs/sft_example.yaml
uv run leap-finetune run job_configs/sft_example.yaml
uv run leap-finetune job_configs/eval_standalone_example.yaml
uv run leap-finetune eval job_configs/eval_standalone_example.yaml --output results.jsonInstall the command as a reusable tool from a checkout:
uv tool install --editable . --force
leap-finetune /absolute/path/to/config.yaml
leap-finetune slurm /absolute/path/to/config.yaml --output-dir /absolute/path/to/slurms
leap-finetune /absolute/path/to/eval_config.yaml
leap-finetune eval /absolute/path/to/eval_config.yaml --output /absolute/path/to/results.jsonuv tool install creates an isolated tool environment. Use explicit config
paths when invoking the command outside the repo; bare names like
sft_example.yaml resolve from the current directory's job_configs/ first,
then from the installed package's LEAP_FINETUNE_DIR.
For one-off execution without installing the command:
uvx --from . leap-finetune /absolute/path/to/config.yamlYou can also start a run from Python. This uses the same backend dispatch as
the CLI: configs with slurm, modal, or kuberay submit remotely; other
configs run local Ray training and require visible CUDA devices.
from leap_finetune import run_config
run_config("/absolute/path/to/config.yaml")Standalone evals use the same entry point:
from leap_finetune import run_config
metrics = run_config("/absolute/path/to/eval_config.yaml")Run that file inside an environment where leap-finetune is installed:
uv run --with-editable . python launch_training.pyConfigs without a remote backend section run in the current environment through Ray Train. Add one of the backend blocks below to submit the same config to a remote runtime.
Modal lets you run training jobs on serverless GPUs from a laptop or Mac. No local GPU is required.
One-time setup:
huggingface-cli login
modal setupAdd a modal: section:
modal:
gpu: "H100:4"
timeout: 86400
output_volume: "leap-finetune"
output_dir: "/outputs"
detach: falseRun:
uv run leap-finetune job_configs/sft_example_modal.yamlIn attached mode (detach: false), the CLI builds the container image,
auto-creates a huggingface-secret from your local HF token, streams logs, and
saves checkpoints to a Modal Volume.
Retrieve checkpoints:
modal volume ls leap-finetune
modal volume get leap-finetune <checkpoint-name> ./local-outputsSet detach: true to submit and disconnect. The CLI prints the Modal app ID
plus commands to monitor or stop it:
modal app logs ap-...
modal app stop ap-...Use the printed ap-... app ID for detached logs. See
job_configs/sft_example_modal.yaml.
If your config includes a slurm: section, leap-finetune auto-generates and
submits a SLURM script:
uv run leap-finetune job_configs/sft_example_with_slurm.yamlGenerate a SLURM script without submitting it:
uv run leap-finetune slurm <path_to_config.yaml>Monitor your SLURM jobs in a TUI:
uv run turm --meIf your config includes a kuberay: section, leap-finetune submits a KubeRay
RayJob instead of launching local training. You need a configured Kubernetes
context, KubeRay CRDs installed, and a container image that already contains
this repo plus its Python environment.
kuberay:
image: "registry.example.com/leap-finetune:latest"
namespace: "training"
worker_replicas: 2
gpus_per_worker: 4
output_dir: "/outputs"
output_pvc: "leap-finetune-outputs"
env:
HF_HOME: "/outputs/hf-cache"Run the same command as local training:
uv run leap-finetune path/to/config.yamlThe CLI creates a ConfigMap for the training config, submits a RayJob, and
prints kubectl commands for status and logs. The product of
worker_replicas and gpus_per_worker becomes ray.num_workers unless you
set it explicitly.
Add tracker to training_config:
training_config:
tracker: "trackio" # or "wandb"Trackio logs to a HuggingFace Space.
trackio_space_id is auto-created if needed, and Modal injects the HF token
automatically:
training_config:
tracker: "trackio"
trackio_space_id: "username/my-dashboard"Weights & Biases uses WANDB_API_KEY:
training_config:
tracker: "wandb"Set the key locally or add it as a Modal secret:
export WANDB_API_KEY=your_keymodal secret create wandb-secret WANDB_API_KEY=your_keymodal:
secrets:
- "wandb-secret"When training is done, bundle your output checkpoint with leap-bundle to use
it directly within LEAP. See the
LEAP bundle quickstart.
The dataset.path field accepts local files, HuggingFace Hub IDs, and cloud
storage URIs:
| Source | Example path |
|---|---|
| Local file | /path/to/data.jsonl, /path/to/data.parquet |
| HuggingFace Hub | HuggingFaceTB/smoltalk |
| S3 | s3://bucket/path/to/data.parquet |
| GCS | gs://bucket/path/to/data.parquet |
| Azure | az://container/path/to/data.parquet |
Cloud storage requires appropriate AWS, GCP, or Azure credentials. Use subset
for HuggingFace datasets with multiple configs, split for HF split
expressions such as train+validation, and limit to cap samples for quick
testing.
{
"messages": [
{ "role": "user", "content": "What is the capital of France?" },
{ "role": "assistant", "content": "The capital of France is Paris." }
]
}DPO uses preference columns instead of a top-level messages field. For
single-turn data, prompt, chosen, and rejected can be plain strings:
{
"prompt": "What is the capital of France?",
"chosen": "The capital of France is Paris.",
"rejected": "The capital of France is London."
}For multi-turn DPO, make prompt the shared conversation history and make
chosen / rejected the preferred and rejected assistant completions:
{
"prompt": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "I am planning a trip to France." },
{ "role": "assistant", "content": "What would you like to know?" },
{ "role": "user", "content": "What is the capital?" }
],
"chosen": [
{ "role": "assistant", "content": "The capital of France is Paris." }
],
"rejected": [
{ "role": "assistant", "content": "The capital of France is London." }
]
}Rows without prompt are also accepted if chosen and rejected are full
conversations with the same shared prefix; the tokenizer extracts that prefix
as the prompt. Prefer the explicit prompt shape above when writing new data.
{
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an image-based assistant. Answer questions based on the provided image."
}
]
},
{
"role": "user",
"content": [
{ "type": "image", "image": "/path/to/image.jpg" },
{ "type": "text", "text": "What do you see in this image?" }
]
},
{
"role": "assistant",
"content": [{ "type": "text", "text": "I see a car in the image." }]
}
]
}VLM datasets often store image paths in a separate column. Merge those image
references into each row's messages content before training.
Use training_type: "vlm_dpo" for multimodal preference data. Each row should
provide prompt, chosen, and rejected message lists plus either an image
or images column:
{
"image": "images/chart.png",
"prompt": [
{
"role": "user",
"content": [{ "type": "text", "text": "Which trend is most important?" }]
}
],
"chosen": [
{
"role": "assistant",
"content": "Revenue grows fastest in Q4."
}
],
"rejected": [
{
"role": "assistant",
"content": "The chart is inconclusive."
}
]
}Use images instead of image for multi-image rows. Relative image paths are
resolved against dataset.image_root when set. See
job_configs/vlm_dpo_example.yaml for a
complete LoRA config. freeze_vision_encoder and
optimizer_type: "adamw_8bit" are optional knobs, not defaults.
GRPO can reuse the SFT/VLM SFT messages format. The loader turns each row
into prompt and solution for online reward computation, and any extra
columns are forwarded to reward functions. See GRPO for the full
dataset, reward, and vLLM rollout contract.
Tool calls use LFM bracket notation in the assistant content field. Tool
definitions go in the system prompt, and tool responses use role: "tool".
{
"messages": [
{
"role": "system",
"content": "List of tools: [{\"type\":\"function\",\"function\":{\"name\":\"get_weather\",\"description\":\"Get weather for a city\",\"parameters\":{\"type\":\"object\",\"properties\":{\"location\":{\"type\":\"string\"}},\"required\":[\"location\"]}}}]"
},
{ "role": "user", "content": "What's the weather in Boston?" },
{
"role": "assistant",
"content": "<|tool_call_start|>[get_weather(location=\"Boston\")]<|tool_call_end|>"
},
{
"role": "tool",
"content": "{\"temperature\": 72, \"condition\": \"sunny\"}"
},
{ "role": "assistant", "content": "It's 72 F and sunny in Boston." }
]
}- Tool calls must be pre-baked in
contentusing<|tool_call_start|>[func(args)]<|tool_call_end|>bracket notation. - Structured
tool_callsfields in OpenAI format are auto-converted when present. - Foreign formats such as
<tool_call>XML are rejected with an actionable error. - Do not include
<|tool_response_start|>or<|tool_response_end|>markers inrole: "tool"messages. The LFM2 chat template adds them during tokenization. - LFM2 models expect
<|tool_list_start|>and<|tool_list_end|>around tool definitions in the system prompt. Include them for LFM2 and omit them for LFM2.5. The pipeline warns on mismatches and auto-strips<|tool_list_start|>when training LFM2.5.
Resume interrupted runs from the last checkpoint with optimizer state, LR schedule, training step, RNG state, and tracker continuity intact:
training_config:
resume_from_checkpoint: "latest"latest finds the most recent run directory under outputs/{project_name}/
and resumes from its latest checkpoint. To resume from a specific checkpoint:
training_config:
resume_from_checkpoint: "/path/to/outputs/my_project/run_name/checkpoint-step-8000"To resume a run, save_only_model must be False. The wandb run ID is saved
to <run_dir>/.wandb_run_id; resumed runs reuse the same wandb run, and fresh
runs create a new one.
GRPO runs online RL with TRL v1's GRPOTrainer. Use training_type: "grpo"
for text models and training_type: "vlm_grpo" for vision-language models.
Both modes use the same YAML entrypoint as SFT/DPO, the same Ray Train
launcher, and vLLM rollouts by default.
Text GRPO can reuse the SFT messages format: the loader builds prompt from
non-assistant turns and solution from the last assistant message. Native
prompt / solution columns also work. VLM GRPO uses the same multimodal
messages shape as VLM SFT. Extra dataset columns are forwarded to reward
functions as keyword arguments.
The rewards: block resolves plain Python callables and task recipes from
rewards/. Shipped primitive functions can be
referenced by function name:
rewards:
funcs:
- "accuracy_reward"
- "length_reward"
weights: [1.0, 0.1]Task recipes bundle multiple reward functions and their default weights:
rewards:
recipe: "tasks/vlm_grounding/recipe.py::VLMGroundingIoURecipe"If you combine recipe: and funcs:, recipe rewards are ordered first. A
weights: override must match the expanded reward list. Absolute paths and
./rewards/file.py::function_name specs work for custom rewards.
Add rewards.judge when the reward signal should come from an LLM grader.
Without base_url, the driver starts a local trl vllm-serve judge server
before Ray initializes and exports the endpoint to workers:
rewards:
judge:
model: "LFM2-1.2B"
weight: 1.0
prompt_template: |
Prompt:
{prompt}
Assistant response:
{completion}
Reference answer or rubric:
{solution}
Return only JSON: {"score": 0.0}
grpo_rollout:
judge_gpus: 1For an external judge, set rewards.judge.base_url and omit judge_gpus.
Judge scores are parsed from JSON or the first number in the response and
normalized from min_score/max_score to [0, 1].
DEFAULT_GRPO and DEFAULT_VLM_GRPO set use_vllm: true and
vllm_mode: "colocate". Colocate mode runs vLLM inside each training worker.
Server mode starts trl vllm-serve on driver GPUs before Ray initializes.
Configure GPU counts, not device ids:
grpo_rollout:
server_gpus: 1 # reserve 1 local GPU for vLLM; training gets the rest
judge_gpus: 1 # optional: reserve 1 local GPU for rewards.judge
# training_gpus: 3 # or set only this and vLLM gets the remaining GPUs
tensor_parallel_size: 1
dtype: "bfloat16"
gpu_memory_utilization: 0.9
training_config:
extends: "DEFAULT_GRPO"
vllm_mode: "server"
vllm_server_host: "auto"
vllm_server_port: 8000Local server partitioning is single-node only. For multi-node GRPO, use
colocate mode or point vllm_server_base_url at an externally managed vLLM
server without setting server_gpus or training_gpus.
Example configs:
job_configs/grpo_example.yaml: text GRPO quickstart with the GSM8K recipe.job_configs/grpo_server_mode_example.yaml: text GRPO with a localtrl vllm-serverollout server.job_configs/vlm_grpo_grounding_example.yaml: VLM GRPO with the visual-grounding recipe.
Launch the same way as SFT/DPO:
uv run leap-finetune job_configs/grpo_example.yamlFor tasks where the environment state evolves from agent actions, such as
browsing, tool use, game simulators, or stateful multi-turn tasks,
leap-finetune supports OpenEnv
via an optional rl_env: block:
uv sync --extra rl-envSee
src/leap_finetune/rl/environments/README.md.
For anything scorable by a pure Python function, prefer the rewards: path; it
is simpler and faster.
Run benchmarks during training at every eval_steps by adding an evals:
section. The legacy benchmarks: alias still parses.
evals:
max_new_tokens: 128
benchmarks:
- name: "mmmu_val"
path: "/data/mmmu_val.jsonl"
metric: "short_answer"
- name: "imagenette"
path: "/data/imagenette_eval.jsonl"
metric: "logprob_zero_shot"Benchmark data uses the same HF messages schema as training data. Available
metrics include short_answer, grounding_iou, mcq_gen, and
logprob_zero_shot. Results are logged to wandb at
benchmark/{name}/score.
See the Evaluation Guide for data format examples, YAML reference, and custom metrics.
Run the same eval suite without training:
uv run leap-finetune job_configs/eval_standalone_example.yamlStandalone eval configs use model_name or checkpoint, evals:, and an
optional backend: block. They do not include dataset, training_type,
training_config, or async_eval. Text evals default to modality: text;
set modality: vlm only for standalone VLM evals.
Use the explicit eval subcommand when you want CLI-only eval options such as
writing metrics to JSON:
uv run leap-finetune eval job_configs/eval_standalone_example.yaml --output results.jsonThe same path is available from Python:
from leap_finetune import run_config
metrics = run_config("job_configs/eval_standalone_example.yaml")By default, every eval_steps blocks training until benchmarks finish. For
large generation suites this dominates wall-clock time. Add an async_eval
block to run benchmarks without blocking training, using vLLM for the
actual generation. Results are logged to wandb with benchmark/step and
train/global_step fields so dashboards can align benchmark metrics to the
training step that triggered them.
Three modes (default is sync = today's behavior):
| Mode | Engine | Pauses training? | GPUs reserved | Latency | Multi-node training | Best for |
|---|---|---|---|---|---|---|
sync |
HF transformers | Yes | None | Immediate | ✓ | Small/fast eval suites; default |
sidecar |
vLLM | No | None (slurm-scheduled per cycle) | Slurm queue + eval time | ✓ | Tight clusters; eval should be free of training cost |
reserved |
vLLM | No | N throughout the run | ~30–60s respawn per cycle | Single-node only | Spare GPUs on one node, want predictable eval latency |
reserved mode carves its GPUs off the same SLURM allocation as
training via the driver's CUDA_VISIBLE_DEVICES, which only affects
the head node. Multi-node training will raise NotImplementedError
at startup — use sidecar instead, which scales to any node count.
Async sidecar mode serves generation through vLLM and falls back to an HF model
inside the sidecar for benchmark types vLLM cannot serve, such as logprob
scoring. Reserved mode keeps a persistent vLLM server and should be used for
generation benchmarks; use sync or sidecar for logprob suites.
# Opt in by adding this block. See job_configs/sft_with_async_eval_example.yaml
async_eval:
mode: sidecar # sync (default) | sidecar | reserved
vllm_gpus: 1
tensor_parallel_size: 1
gpu_memory_utilization: 0.9
# mode=sidecar: short sbatch job per eval_steps
sbatch:
time: "00:30:00"
# partition / account default to inheriting from the parent job
# mode=reserved: long-running vllm-serve on dedicated GPUs (single-node only for v1)
reserved:
weight_reload: respawn
server_port: 8100Failures are isolated: if eval crashes or sbatch is rejected, training continues. After failure.max_consecutive consecutive failures the callback disables itself for the rest of the run. See job_configs/sft_with_async_eval_example.yaml for a full example.
For standard VLM benchmarks such as MMMU, OCRBench, RefCOCO, and POPE, use an
environment that includes the private Liquid4All lmms-eval fork with LFM2
model support.
Evaluate a fine-tuned checkpoint:
python -m lmms_eval \
--model lfm2_vl \
--model_args pretrained=/path/to/checkpoint \
--tasks mmmu_val,ocrbench,pope \
--batch_size 1Multi-GPU:
torchrun --nproc-per-node=4 -m lmms_eval \
--model lfm2_vl \
--model_args pretrained=/path/to/checkpoint \
--tasks mmmu_val,ocrbench,pope \
--batch_size 1For faster evaluation with the vLLM backend:
python -m lmms_eval \
--model lfm2_vl_vllm \
--model_args pretrained=/path/to/checkpoint,tensor_parallel_size=1,gpu_memory_utilization=0.85 \
--tasks mmmu_val,ocrbench,pope \
--batch_size 64The lmms-eval and vLLM packages are sourced from private Liquid4All forks
with LFM2 model support, so SSH access to those repos is required.
Export a HuggingFace checkpoint or PEFT adapter to GGUF with
leap-export-gguf:
uv run leap-export-gguf /path/to/checkpoint --quant F16 --output-dir /lambdafs/ggufRepeat --quant to produce multiple outputs:
uv run leap-export-gguf /path/to/checkpoint \
--quant F16 \
--quant Q4_K_M \
--output-dir /lambdafs/gguf \
--llama-cpp-dir /path/to/llama.cppF16, BF16, F32, and Q8_0 are exported directly with the bundled
llama.cpp conversion scripts. K-quants such as Q4_K_M, Q5_K_M, and Q6_K
require a built llama.cpp checkout containing build/bin/llama-quantize; pass
--llama-cpp-dir or set LLAMA_CPP_DIR.
PEFT adapter directories can be exported with F16, BF16, F32, or Q8_0:
uv run leap-export-gguf /path/to/adapter \
--base-model-path /path/to/base-model \
--quant F16 \
--output-dir /lambdafs/ggufFor adapter K-quants, merge the adapter into the base model first, then export the merged checkpoint.
Default base configs live in
src/leap_finetune/training/default_configs/
and are auto-discovered. New configs added to those files are immediately
available via extends in YAML.
Liger Kernel is installed by the
default CUDA group. Enable it with use_liger_kernel: true in
training_config.
LoRA is configured through peft_config. Continued LoRA training can load an
existing adapter with adapter_path; VLM configs can optionally freeze the
vision encoder with freeze_vision_encoder and use bitsandbytes 8-bit AdamW
with optimizer_type: "adamw_8bit".
Pinned chat templates live in
job_configs/chat_templates/. Non-local
LiquidAI LFM2.5 and LFM2-24B models select the pinned LFM2.5 template by
default.
The test suite is intentionally scoped to four buckets: config parsing, e2e,
RL, and MoE. See tests/README.md for the current layout.
E2E fixtures and SLURM launchers live under tests/e2e/.
Run the normal tests:
uv run pytest tests/config tests/rl tests/moe -qGPU e2e tests require an appropriate GPU or cluster backend; see
tests/e2e/ for launchers and fixtures.
Install pre-commit hooks:
uv run pre-commit installRun hooks manually:
uv run pre-commit run --all-filesOpen a PR with a clear description of the behavior changed, tests run, and any known limitations.
