cookbook: visual-grounding SFT→GRPO recipe for LFM2.5-VL-1.6B by Rouzbehat78 · Pull Request #27 · Liquid4All/leap-finetune

Rouzbehat78 · 2026-06-04T20:13:11Z

Summary

End-to-end public cookbook teaching customers how to fine-tune LFM2.5-VL-1.6B for visual grounding — predicting normalized [x1, y1, x2, y2] bounding boxes from text queries — via a two-phase SFT → GRPO recipe. Trains in <2 days on 2× 8-GPU nodes and lands at parity-or-better than published baselines on RefCOCO/+/g.

The cookbook is the customer-facing artifact, but the underlying PR also ships library-level GRPO fixes for LFM2-VL multi-image inputs that were blocking the recipe.

Model + data


Model	`LiquidAI/LFM2.5-VL-1.6B` (~1.6B parameters, native multi-image VLM)
Training data	`Michael4933/MGrounding-630k` — 630K grounding samples covering single-image referring expressions, group grounding, and multi-image object tracking
Eval benchmarks	RefCOCO val · RefCOCO+ val · RefCOCOg val · mgrounding_test (held-out 10% of MGrounding-630k)
Output format	`[{"label": "red car", "bbox": [0.12, 0.34, 0.58, 0.71]}]` — normalized `[0, 1]` coords

Recipe (cookbook/visual-grounding/)

Step 0 — Setup

uv sync && huggingface-cli login && wandb login   # 5 min

Step 1 — Data prep (~1 hr CPU)

sbatch cookbook/visual-grounding/configs/prepare_data.sh
sbatch cookbook/visual-grounding/configs/prepare_evals.sh

prepare_data.py downloads MGrounding-630k from HF, walks each multi-turn conversation (including the multi-image Object_Tracking subset), normalizes 0-1000 → [0, 1] coords, canonicalizes every variant into {label, bbox} JSON, and writes a deterministic 3-way disjoint split: 72% SFT / 18% GRPO / 10% held-out test.

prepare_evals.py pulls RefCOCO/+/g val splits from HF and emits one jsonl per benchmark.

Step 2 — Phase 1: SFT (~10–12 hr on 2 × 8 GPUs)

sbatch cookbook/visual-grounding/configs/sft_grounding_multinode.sh

1 epoch on the 72% SFT split, lr=5e-5 cosine, vision encoder at 0.1× base LR, do_image_splitting=false (one tile per image). Async sidecar eval fires every eval_steps=1000 against the 4 benchmarks so training never pauses for scoring.

Step 3 — Phase 2: GRPO (~24 hr on 2 × 8 GPUs)

# Point grpo_grounding.yaml at your Phase 1 checkpoint, then:
sbatch cookbook/visual-grounding/configs/grpo_grounding_multinode.sh

1 epoch on the 18% GRPO split with beta=0.01 KL brake, dapo loss, num_generations=4, vLLM colocate rollouts. Reward = VLMGroundingCIoURecipe:

strict_format (weight 0.1) — completion must parse as a JSON list of {label, bbox} dicts
ciou_f1 (weight 1.0) — Hungarian-matched multi-bbox F1 over CIoUs (degrades to single CIoU on 1-vs-1)

Async eval / sidecar mode

Both phases use async_eval: mode=sidecar (the library feature shipped in #25 feature/async-eval). Each eval_steps boundary stages a checkpoint and fires an sbatch running vLLM in-process against all 4 benchmarks; the score lands on the training wandb run page back-filled to the originating training step. Training never blocks.

Features exercised by this PR

Feature	What gets tested
Multi-image LFM2-VL GRPO	TRL split/unsplit patches for per-image `pixel_values`/`spatial_shapes`; image-lift on the data path
force-no-split vLLM rollout	Disables vLLM's internal `_is_image_too_large` tile-split for any image >724×724 that would otherwise mismatch the HF processor's tile count and CUDA-assert on `masked_scatter`
Per-sample image-token reconciler	Replaces the prior global aggregate (which masked cross-sample cancellation) with a per-sample preflight check — trims surplus tokens, pads underflows, zero-weights bad rows
CIoU-F1 multi-bbox reward	Hungarian matcher with scipy-or-greedy fallback; reduces to single CIoU; correct abstention on zero-box prompts
Async sidecar eval	Per-step in-flight markers, retry-on-sbatch-failure with exponential backoff, sacct-based stale-marker sweep, auto-disable after N consecutive failures
HF → leap-finetune data conversion	Educational `prepare_data.py` showing the canonicalization pattern for any custom grounding dataset

Results

Validated end-to-end on a 26 hr training run (one full epoch over 12 604 GRPO steps from the Phase 1 SFT checkpoint).

Final benchmark scores (after GRPO):

Benchmark	Metric	Score
RefCOCO val	IoU@0.5	0.710
RefCOCOg val	IoU@0.5	0.733
RefCOCO+ val	IoU@0.5	0.558
mgrounding_test (in-dist)	CIoU-F1	0.714

Improvement from GRPO over the SFT checkpoint (step 5000 → step 12000):

Benchmark	SFT baseline	After GRPO	Δ
RefCOCO val	0.688	0.710	+2.2
RefCOCOg val	0.703	0.733	+3.0
RefCOCO+ val	0.537	0.558	+2.1
mgrounding_test	0.713	0.714	+0.1 (already saturated by SFT)

The CIoU-F1 reward was continuously rising over the full epoch with no collapse (beta=0.01 KL brake holds; v1 with beta=0 had a step-4000 collapse).

Dependency note for reviewers

This PR depends on #25 feature/async-eval — specifically the grounding_iou_f1 metric used by the mgrounding_test benchmark. The cross-PR test (tests/test_grounding_metric_reward_parity.py) skips cleanly until that metric lands, then auto-re-enables on rebase.

Suggested merge order: #25 first → rebase this branch → merge.

Test plan

pytest tests/ — 320 passed / 12 skipped / 0 failed (full repo green on this branch)
No personal cluster paths leak into shipped configs/scripts (/lambdafs, /home/rouzbeh, hardcoded job IDs, etc. — all scrubbed)
Cookbook YAMLs parse cleanly via parse_job_config
Full E2E recipe ran successfully — 26 hr training, 4 benchmarks land on wandb
Smoke-test the data-prep on a customer-fresh uv sync (recommended manual check)
Eyeball the README customer narrative for clarity (recommended manual review)

TRL v1.0 adds production-grade GRPOTrainer with native rollout_func, vLLM colocate/server modes, and async rollouts. Also adds openenv-core as an optional extra ('uv sync --extra rl-env').

Adds rewards/ (top-level, sibling of job_configs/) for plain Python reward functions referenced from YAML by path. Recipe class bundles multiple rewards + weights per task; subclass to extend.

Adds grpo and vlm_grpo training types with colocate (default) and server-mode vLLM rollouts. VLM GRPO preserves the 0.1x vision encoder LR via a shared helper also used by VLM SFT. Ray Train passes the full dataset to every worker for GRPO since TRL's RepeatSampler handles per-rank distribution.

Example YAMLs for text GRPO (colocate + server modes) and VLM grounding. Unit tests for the reward loader, recipes, and config parser; e2e smoke tests + SLURM launchers for 1 GPU runs.

- DPO tokenizer: prompt_input_ids → prompt_ids (TRL v1 collator change) - config_parser: pre-resolve reward paths to absolute on driver so Ray workers can find them from their sandbox CWD - e2e fixtures: use_vllm=false until vllm supports transformers 5.x - test assertions updated for new column names and absolute paths

…ment of multi-node in ray, works iwth gRPO colocate, SFT, dpo

… need to change the dataset format, SFT converts to prompt,solution pair for the GRPO

…wn, reward Hub where recipes for differnet tasks accumulates for re-use. Each task contains a recipe that is essentially a bundle of rewards + weights for each reward. Combine rewards and recipes to construct your ideal reward functions

… the images, bad parses, no gradietns for image tokens: vlm_grpo trainer: lift images, alias spatial_shapes, VLM-aware logps LFMVLMGRPOTrainer patches three gaps in TRL's multimodal data path so LFM2-VL actually gets gradient through the vision tower during GRPO: - Lift images from prompt message content into the top-level key TRL inspects, so the multimodal branch fires and pixel_values reach the training forward pass (without it, generation still sees images but training silently runs with pixel_values=None). - Alias the processor's output to via a context-scoped __class__ swap, letting the tensor ride TRL's fixed multimodal kwarg whitelist from data prep through _compute_loss. - Override _get_per_token_logps_and_entropies to rename back to spatial_shapes at the model-forward boundary, filter to kwargs the model accepts, and skip TRL's per-sample pixel_values chunking (LFM2-VL returns patch-concatenated pixels, not (B, C, H, W)).

… celanup, few exampels for tedt and VLM

…nv with reward lenght for both VLM and LLM

Introduce a narrow Protocol (generate + logprobs) so benchmarks can dispatch through the same code path whether they're talking to an in-process HF model, an in-process vLLM engine, or a remote vLLM server. Ships with HFBackend (in-process HF) used by the sync path and as a logprob fallback for the vLLM backends. Benchmark base class gains an additive evaluate_with_backend default (raises NotImplementedError) so subclasses can opt in incrementally. Zero behavior change for existing sync callers.

LLM and VLM generation + logprob benchmarks gain evaluate_with_backend implementations that build batched GenerateRequest / LogprobRequest lists, dispatch to an InferenceBackend, and score the responses with the existing per-sample scoring logic. Sync path is unchanged. Export the new backend symbols from leap_finetune.evaluation.

AsyncEvalConfig parses + validates the async_eval: YAML block (sync / sidecar / reserved) with sub-blocks for sbatch settings, reserved server settings, and failure handling. make_eval_callback dispatches to BenchmarkEvalCallback (sync), SidecarEvalCallback (sidecar), or ReservedEvalCallback (reserved). Sidecar and reserved imports are lazy so sync users don't pay the import cost.

SidecarEvalCallback (rank 0) stages a checkpoint, renders an sbatch script, and submits at every eval_steps. The sbatch job loads vLLM on whatever GPU SLURM assigns it, runs every configured benchmark, and back-fills the training run's wandb log at the originating step. Training never pauses on eval. A .in_flight marker enforces on_overlap policy (skip / queue); the sbatch clears the marker on EXIT so a crashed runner can't block the callback. After failure.max_consecutive failures the callback disables itself. When eval_on_start is true the step-0 sidecar runs synchronously (callback polls sacct until the job is terminal) so wandb's step counter stays aligned for the baseline metrics.

ReservedEvalCallback owns a daemon helper thread (rank 0) that hosts a persistent vLLM OpenAI server on the dedicated eval GPUs carved off the training pool. On each eval_steps the thread respawns the server against the latest checkpoint, runs every benchmark via VLLMServerBackend, and pushes results back to a queue. on_log drains the queue and back-fills wandb at the originating training step. on_train_end drains any in-flight cycles before teardown so results aren't dropped. Helper-thread exceptions never propagate to training. Single-node only; weight_reload=respawn only (in_place rejected with a clear error). Driver-side GPU carving lands in the next commit.

For mode=reserved, the driver carves vllm_gpus off the training pool at job start, sets CUDA_VISIBLE_DEVICES for the trainer accordingly, and hands the worker the eval server URL + carved GPU ids through train_loop_config. The worker (rank 0) launches its own vllm-serve subprocess inside the helper thread so it owns the lifetime and can respawn on weight reload. Runs AFTER any GRPO server-mode carve so the two modes can coexist. Multi-node is rejected with a clear error.

config_parser validates the async_eval YAML block on the driver (misconfig errors surface before a Ray worker is spawned). The raw dict is forwarded into train_loop_config. Each of the 5 training loops (sft, dpo, grpo, vlm_sft, vlm_grpo) replaces its direct BenchmarkEvalCallback registration with the make_eval_callback dispatch helper. Same call shape across loops; the dispatcher picks sync / sidecar / reserved based on the YAML.

Unit tests for AsyncEvalConfig parsing, make_eval_callback dispatch, sidecar marker lifecycle, and FakeBackend round-trip through the benchmark dispatchers. Toy fixtures (sidecar.sh / reserved.sh + matching YAMLs) exercise each mode end-to-end on a single GPU against a tiny QA benchmark under SLURM. job_configs/sft_with_async_eval_example.yaml is the copy-paste starting point users land on from the README.

Top-level overview of the three modes (sync / sidecar / reserved): when to pick each, the trade-offs (training pause vs reserved GPUs vs queue latency), and the YAML schema. Points users at the example config + the per-mode behavior contracts.

No behavior change. Wraps long argparse / log-format lines, normalizes frozenset literal layout, drops an unused import, and prettier-aligns the async eval table + YAML snippet in the README.

End-to-end Phase 1 of the visual-grounding cookbook on LFM2.5-VL: * prepare_data.py: streams Michael4933/MGrounding-630k from HuggingFace and converts the custom token-tagged format to leap-finetune messages parquet. Deterministic 3-way split (SFT / GRPO holdout / test) so the GRPO phase trains on rows the SFT run never saw. Skips Object_Tracking (different output shape). * prepare_evals.py: builds the canonical RefCOCO/RefCOCO+/RefCOCOg val jsonls from the jxu124 datasets, paired with COCO 2014 train images. * fix_test_hint.py: one-off to canonicalize the format hint on the in-distribution test parquet so it matches the RefCOCO trio. * prompt_templates.py: 8 format-hint variants + 17 REC task prompts; EVAL_FORMAT_HINT pins the canonical eval phrasing. * configs/sft_grounding.{yaml,sh}: Phase 1 SFT config + SLURM launcher. Uses async eval (sidecar mode) so training never pauses on eval. * configs/{prepare_data,prepare_evals,fix_test_hint}.sh: CPU-only SLURM wrappers for the one-time data + eval-set generation. Validated end-to-end on a 12h SFT run hitting refcoco_val 0.620, refcoco_plus 0.483, refcocog 0.686, mgrounding_test 0.751 at the peak checkpoint.

Phase 2 GRPO config + launcher. Resumes from the Phase 1 SFT checkpoint (model_name placeholder needs to be filled in by the user) and trains on the held-out 30% slice via the shipped VLMGroundingIoURecipe (strict format 0.1 + Hungarian-matched IoU-F1 1.0). Same four async-eval benchmarks as Phase 1. README extended with a Phase 2 section describing the reward, the checkpoint pointing step, and the launch command.

Switch the GRPO YAML from VLMGroundingIoURecipe to VLMGroundingCIoURecipe so the matcher itself runs on CIoU (IoU minus center-distance and aspect-ratio penalties) and the F1 reward scores the matched CIoU values. This rewards center-aligned + same-shape pairs even when raw overlap ties, and the F1 wrapping keeps FP/FN penalization. README's Phase 2 section updated accordingly.

Make multi-image (MGrounding-style) VLM GRPO work end-to-end on LFM2-VL, which TRL + upstream vLLM don't handle out of the box: 1. vLLM rollout: inject mm_processor_kwargs (do_image_splitting=False, single-tile) on multi-image prompts so vLLM 0.19's LFM2-VL preprocessor doesn't crash on empty spatial_shapes. 2. TRL split/unsplit_pixel_values_by_grid: patch for LFM2-VL's per-image pixel_values layout (Qwen-style patch-concat assumption breaks it). Split every per-image tensor by num_images and re-merge. 3. _get_per_token_logps_and_entropies: single full-batch forward (no per-sample slicing through concatenated multi-image patches), spatial_shapes<->image_sizes aliasing, and completion-region sanitization of stray image-placeholder tokens sampled during rollout. Validated: GRPO from the SFT checkpoint trains stably and all four benchmarks (refcoco/+/g + mgrounding) climb monotonically above the SFT baseline through step 4000. A residual off-by-one in masked_scatter still surfaces rarely — hardened in the follow-up commit.

The completion-only sanitization reduced the stray-image-token off-by-one but didn't eliminate it — a residual surplus still reached LFM2-VL's masked_scatter and triggered an unrecoverable CUDA assert (device-side, no clean traceback) ~step 4600 on a beta>0 run. Replace it with _reconcile_image_tokens: before every forward, compute the exact expected feature count from spatial_shapes (Σ floor(h/df)·floor(w/df), df=downsample_factor — matches the model's pixel_unshuffle output) and compare against the image-placeholder count in input_ids. Trim any surplus image tokens (scanning from the completion end, so structural prompt placeholders are last to go) so placeholders == features always. image_token_id is resolved from the processor (reliable) rather than the wrapped model's config. Underflow (placeholders < features) is logged as a data/collation bug rather than silently trimmed.

… fixes Cookbook (cookbook/visual-grounding/): - prepare_data.py: HF→leap-finetune conversion for MGrounding-630k incl. Object_Tracking (multi-image), 80/20 SFT/GRPO + 10% test holdout, all canonicalized to [{"label","bbox"}] JSON. - prepare_evals.py: RefCOCO/+/g val → jsonl per benchmark. - Single + multi-node SLURM launchers for SFT and GRPO. - YAMLs use relative ./data and ./outputs; no cluster paths or secrets. - README explains the data flow and how to swap in a custom dataset. GRPO training-loop fixes (src/leap_finetune/training_loops/vlm_grpo_run.py): - Force-no-split vLLM rollout: inject mm_processor_kwargs {do_image_splitting=False, min_tiles=1, max_tiles=1} for every image-bearing prompt (was multi-image only). vLLM 0.19's LFM2-VL has an internal image-size threshold that re-enables tile splitting independently of the HF flag; this disables it. - Per-sample image-token/feature mismatch handler in the HF training forward — trims surplus tokens or pads underflow rows, zero-weights bad samples, so a malformed row never crashes the whole batch. - Multi-image pixel_values / spatial_shapes plumbing for LFM2-VL. Reward (rewards/tasks/vlm_grounding/recipe.py): Hungarian-matched CIoU-F1 with strict-format gate. Multi-bbox aware, degrades to single CIoU when ground truth is a single box. Ancillary: - dataset_loader.py: pyarrow fallback on flaky parquet reads. - logging_utils.py: bump wandb init_timeout 90s→300s for slow nodes. - pyproject.toml: pin transformers 5.3 + vLLM 0.19, openenv-core extra. - New test: rewards/metric parity on grounding samples. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The cookbook branch had 3 tests that failed against current code: 1. ``test_sidecar_defaults`` — asserted ``sbatch.time == "00:30:00"`` but the schema's default is ``None`` (partition default; no cap). 2. ``test_failure_disables_after_max_consecutive`` — called ``cb.on_evaluate(...)`` but ``SidecarEvalCallback`` only hooks ``on_step_end`` (with ``control.should_evaluate=True``). ``on_evaluate`` inherits from the base class as a no-op so the submission path was never triggered and the disable assertion stayed False. 3. ``test_rename_back_in_get_per_token_logps`` — broke after PR A's ``_check_image_token_mismatch`` was added: that helper accesses ``self.processing_class`` directly, and the test constructs the trainer via ``__new__`` without setting it. Switched to ``getattr(self, "processing_class", None)`` so the helper is safe on bare instances (matches the rest of the helper which already uses ``getattr`` for ``image_token_id``). 4. ``test_grounding_metric_reward_parity.py`` — imports ``score_grounding_iou_f1`` which only exists on feature/async-eval. Wrapped the import in ``try/except ImportError`` with ``pytest.skip(..., allow_module_level=True)`` so PR A is pytest-clean standalone. Auto-re-enables on rebase once PR B's metric is on main. 320 passed / 12 skipped / 0 failed — full repo green on this branch.

…ismatch Codex flagged that my previous fix was half-done: ``getattr(self, "processing_class", None)`` made the lookup safe, but ``proc=None`` then falls through to ``proc.image_processor`` on the next line and crashes on ``AttributeError`` if ``image_token_id`` happened to come from ``model.config`` instead of the processor. Fix: - Return None early when ``proc is None``. Without the processor's ``image_processor`` we can't read ``downsample_factor`` so we can't do per-sample reconciliation — skipping the preflight is the right default (the model forward still runs; we just don't pre-screen). - Also use getattr on ``image_processor`` itself so a partially-built processor doesn't crash either. Tests pass — same regression test (``test_rename_back_in_get_per_token_logps``) still passes, plus all other test_grpo_data tests.

Rouzbehat78 added 30 commits April 9, 2026 23:17

chore(deps): bump trl to v1.0, transformers to 5.3

47bbb18

TRL v1.0 adds production-grade GRPOTrainer with native rollout_func, vLLM colocate/server modes, and async rollouts. Also adds openenv-core as an optional extra ('uv sync --extra rl-env').

feat(grpo): reward recipes + example library

59e09f5

Adds rewards/ (top-level, sibling of job_configs/) for plain Python reward functions referenced from YAML by path. Recipe class bundles multiple rewards + weights per task; subclass to extend.

feat(grpo): example configs, tests, and README

c341846

Example YAMLs for text GRPO (colocate + server modes) and VLM grounding. Unit tests for the reward loader, recipes, and config parser; e2e smoke tests + SLURM launchers for 1 GPU runs.

vLLM with new trl and override transformers

f66612c

GRPO needs to do Repeatsampler so deactivate Ray sharding, and enable…

2ee36e2

…ment of multi-node in ray, works iwth gRPO colocate, SFT, dpo

validator to handle GRPO formatted data from same SFT data format, no…

ca0852a

… need to change the dataset format, SFT converts to prompt,solution pair for the GRPO

gsm8k math metrics for Eval hook

a8bd545

automated tests for GRPO, e2e grpo trianing and multi-ndoe testing

4a2b255

recipe and loader for rewards and tasks

1264f38

Mulit-node helper Slurm script to setup the Ray init for hte cluster,…

8824339

… celanup, few exampels for tedt and VLM

adding support for OpenEnv, Beta verison, has been tested with Echo e…

9f4f550

…nv with reward lenght for both VLM and LLM

Linting

c137d28

docs: async eval README section

9c06548

Top-level overview of the three modes (sync / sidecar / reserved): when to pick each, the trade-offs (training pause vs reserved GPUs vs queue latency), and the YAML schema. Points users at the example config + the per-mode behavior contracts.

style: ruff + prettier auto-format

b953009

No behavior change. Wraps long argparse / log-format lines, normalizes frozenset literal layout, drops an unused import, and prettier-aligns the async eval table + YAML snippet in the README.

Rouzbehat78 and others added 4 commits May 27, 2026 19:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cookbook: visual-grounding SFT→GRPO recipe for LFM2.5-VL-1.6B#27

cookbook: visual-grounding SFT→GRPO recipe for LFM2.5-VL-1.6B#27
Rouzbehat78 wants to merge 34 commits into
mainfrom
cookbook/visual-grounding-grpo

Rouzbehat78 commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Rouzbehat78 commented Jun 4, 2026

Summary

Model + data

Recipe (cookbook/visual-grounding/)

Step 0 — Setup

Step 1 — Data prep (~1 hr CPU)

Step 2 — Phase 1: SFT (~10–12 hr on 2 × 8 GPUs)

Step 3 — Phase 2: GRPO (~24 hr on 2 × 8 GPUs)

Async eval / sidecar mode

Features exercised by this PR

Results

Dependency note for reviewers

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant