cookbook: visual-grounding SFT→GRPO recipe for LFM2.5-VL-1.6B#27

Open
Rouzbehat78 wants to merge 34 commits into main from cookbook/visual-grounding-grpo
Summary

End-to-end public cookbook teaching customers how to fine-tune LFM2.5-VL-1.6B for visual grounding (predicting normalized [x1, y1, x2, y2] bounding boxes from text queries) via a two-phase SFT → GRPO recipe. Trains in under 2 days on 2× 8-GPU nodes and lands at parity with or better than published baselines on RefCOCO/+/g.

The cookbook is the customer-facing artifact, but the underlying PR also ships library-level GRPO fixes for LFM2-VL multi-image inputs that were blocking the recipe.

Model + data

| | |
| --- | --- |
| Model | LiquidAI/LFM2.5-VL-1.6B (~1.6B parameters, native multi-image VLM) |
| Training data | Michael4933/MGrounding-630k: 630K grounding samples covering single-image referring expressions, group grounding, and multi-image object tracking |
| Eval benchmarks | RefCOCO val · RefCOCO+ val · RefCOCOg val · mgrounding_test (held-out 10% of MGrounding-630k) |
| Output format | `[{"label": "red car", "bbox": [0.12, 0.34, 0.58, 0.71]}]` with normalized [0, 1] coords |

Recipe (cookbook/visual-grounding/)

Step 0 — Setup

uv sync && huggingface-cli login && wandb login   # 5 min

Step 1 — Data prep (~1 hr CPU)

sbatch cookbook/visual-grounding/configs/prepare_data.sh
sbatch cookbook/visual-grounding/configs/prepare_evals.sh

prepare_data.py downloads MGrounding-630k from HF, walks each multi-turn conversation (including the multi-image Object_Tracking subset), normalizes 0-1000 → [0, 1] coords, canonicalizes every variant into {label, bbox} JSON, and writes a deterministic 3-way disjoint split: 72% SFT / 18% GRPO / 10% held-out test.
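
The normalization and split steps described above can be sketched as follows. This is a minimal illustration, not the shipped prepare_data.py: the function names, the SHA-256 bucketing, and the rounding to 4 decimals are assumptions chosen to make the split deterministic and disjoint.

```python
import hashlib


def normalize_bbox(bbox_0_1000):
    """Map [x1, y1, x2, y2] from the 0-1000 grid to [0, 1], clamped and ordered."""
    x1, y1, x2, y2 = (min(max(v / 1000.0, 0.0), 1.0) for v in bbox_0_1000)
    return [
        round(min(x1, x2), 4),
        round(min(y1, y2), 4),
        round(max(x1, x2), 4),
        round(max(y1, y2), 4),
    ]


def canonicalize(label, bbox_0_1000):
    """Emit one entry of the {label, bbox} output format."""
    return {"label": label, "bbox": normalize_bbox(bbox_0_1000)}


def assign_split(sample_id):
    """Deterministic 72/18/10 split keyed on a stable hash of the sample id.

    Hypothetical scheme: each id lands in exactly one bucket, so the three
    splits are disjoint and reproducible across runs and machines.
    """
    bucket = int(hashlib.sha256(sample_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 72:
        return "sft"
    if bucket < 90:
        return "grpo"
    return "test"
```

Hashing the id (rather than shuffling with a seeded RNG) keeps the assignment stable even if rows are added or reordered upstream.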

prepare_evals.py pulls RefCOCO/+/g val splits from HF and emits one jsonl per benchmark.

Step 2 — Phase 1: SFT (~10–12 hr on 2 × 8 GPUs)

sbatch cookbook/visual-grounding/configs/sft_grounding_multinode.sh

1 epoch on the 72% SFT split, lr=5e-5 cosine, vision encoder at 0.1× base LR, do_image_splitting=false (one tile per image). Async sidecar eval fires every eval_steps=1000 against the 4 benchmarks so training never pauses for scoring.
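
One way the "vision encoder at 0.1× base LR" setting could be expressed is with optimizer parameter groups. The sketch below is an assumption about the mechanism, not the library's shared helper: `build_param_groups` is a hypothetical name, and selecting vision parameters by the substring "vision" in their names is a guess that would need checking against the actual module names.

```python
def build_param_groups(named_params, base_lr=5e-5, vision_lr_scale=0.1,
                       vision_marker="vision"):
    """Split (name, param) pairs into a base-LR group and a scaled vision group.

    `vision_marker` is a hypothetical substring used to spot vision-tower
    parameters; adjust it to the real module naming.
    """
    vision, other = [], []
    for name, param in named_params:
        (vision if vision_marker in name else other).append(param)
    groups = [
        {"params": other, "lr": base_lr},
        {"params": vision, "lr": base_lr * vision_lr_scale},
    ]
    # Drop empty groups so the optimizer only sees populated ones.
    return [g for g in groups if g["params"]]
```

With PyTorch, the result can be passed directly as the first argument of an optimizer, for example `torch.optim.AdamW(build_param_groups(model.named_parameters()))`.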

Step 3 — Phase 2: GRPO (~24 hr on 2 × 8 GPUs)

# Point grpo_grounding.yaml at your Phase 1 checkpoint, then:
sbatch cookbook/visual-grounding/configs/grpo_grounding_multinode.sh

1 epoch on the 18% GRPO split with beta=0.01 KL brake, dapo loss, num_generations=4, vLLM colocate rollouts. Reward = VLMGroundingCIoURecipe:

  • strict_format (weight 0.1) — completion must parse as a JSON list of {label, bbox} dicts
  • ciou_f1 (weight 1.0) — Hungarian-matched multi-bbox F1 over CIoUs (degrades to single CIoU on 1-vs-1)
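
A minimal sketch of how the two reward terms might combine, assuming a weighted sum in which a failed format check zeroes the whole reward (the real VLMGroundingCIoURecipe may differ). All function names are hypothetical. To stay dependency-free, plain IoU stands in for CIoU and a greedy matcher stands in for the Hungarian one.

```python
import json


def parse_strict(completion):
    """Return the parsed list of {label, bbox} dicts, or None on any format violation."""
    try:
        data = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, list):
        return None
    for item in data:
        if not (isinstance(item, dict) and set(item) == {"label", "bbox"}):
            return None
        bbox = item["bbox"]
        if not (isinstance(item["label"], str) and isinstance(bbox, list) and len(bbox) == 4):
            return None
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in bbox):
            return None
    return data


def iou(a, b):
    """Plain IoU of two [x1, y1, x2, y2] boxes (stand-in for CIoU)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(pred, gt, score_fn):
    """Greedy one-to-one matching: repeatedly take the best remaining pair."""
    pairs = sorted(
        ((score_fn(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt, matched = set(), set(), []
    for score, i, j in pairs:
        if i not in used_pred and j not in used_gt:
            used_pred.add(i)
            used_gt.add(j)
            matched.append(score)
    return matched


def soft_f1(matched_scores, n_pred, n_gt):
    """F1 where each matched pair contributes its score as a fractional true positive."""
    if n_pred == 0 and n_gt == 0:
        return 1.0  # correct abstention on a zero-box prompt
    if n_pred == 0 or n_gt == 0:
        return 0.0
    tp = sum(max(0.0, s) for s in matched_scores)
    precision, recall = tp / n_pred, tp / n_gt
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def grounding_reward(completion, gt_boxes, score_fn=iou, w_format=0.1, w_box=1.0):
    parsed = parse_strict(completion)
    if parsed is None:
        return 0.0  # assumed gate: no box credit without valid format
    pred = [item["bbox"] for item in parsed]
    matched = greedy_match(pred, gt_boxes, score_fn)
    return w_format + w_box * soft_f1(matched, len(pred), len(gt_boxes))
```

With one predicted and one ground-truth box, precision and recall both equal the pair score, so the F1 term reduces to that single score, which matches the "degrades to single CIoU on 1-vs-1" behavior described above.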

Async eval / sidecar mode

Both phases use async_eval: mode=sidecar (the library feature shipped in #25 feature/async-eval). Each eval_steps boundary stages a checkpoint and fires an sbatch running vLLM in-process against all 4 benchmarks; the score lands on the training wandb run page back-filled to the originating training step. Training never blocks.

Features exercised by this PR

| Feature | What gets tested |
| --- | --- |
| Multi-image LFM2-VL GRPO | TRL split/unsplit patches for per-image pixel_values/spatial_shapes; image-lift on the data path |
| force-no-split vLLM rollout | Disables vLLM's internal `_is_image_too_large` tile-split for any image >724×724 that would otherwise mismatch the HF processor's tile count and CUDA-assert on `masked_scatter` |
| Per-sample image-token reconciler | Replaces the prior global aggregate (which masked cross-sample cancellation) with a per-sample preflight check: trims surplus tokens, pads underflows, zero-weights bad rows |
| CIoU-F1 multi-bbox reward | Hungarian matcher with scipy-or-greedy fallback; reduces to single CIoU; correct abstention on zero-box prompts |
| Async sidecar eval | Per-step in-flight markers, retry-on-sbatch-failure with exponential backoff, sacct-based stale-marker sweep, auto-disable after N consecutive failures |
| HF → leap-finetune data conversion | Educational prepare_data.py showing the canonicalization pattern for any custom grounding dataset |
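
The retry-on-sbatch-failure behavior listed for async sidecar eval could look roughly like the sketch below. It is illustrative only: `submit_with_backoff` and `sbatch_submit` are hypothetical names, and the retry count, base delay, and set of retried exceptions are assumptions rather than the library's actual settings.

```python
import subprocess
import time


def submit_with_backoff(submit, max_retries=3, base_delay=1.0,
                        retry_on=(subprocess.CalledProcessError, OSError),
                        sleep=time.sleep):
    """Call submit() until it succeeds, waiting base_delay * 2**attempt between tries.

    Re-raises the last error once max_retries retries have been used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return submit()
        except retry_on:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


def sbatch_submit(script_path):
    """Submit a script with sbatch and return its stdout (raises on non-zero exit)."""
    result = subprocess.run(
        ["sbatch", script_path], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()
```

A caller would wrap the submission as `submit_with_backoff(lambda: sbatch_submit("eval.sh"))`; injecting `sleep` makes the backoff schedule easy to unit-test without waiting.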

Results

Validated end-to-end on a 26 hr training run (one full epoch over 12 604 GRPO steps from the Phase 1 SFT checkpoint).

Final benchmark scores (after GRPO):

| Benchmark | Metric | Score |
| --- | --- | --- |
| RefCOCO val | IoU@0.5 | 0.710 |
| RefCOCOg val | IoU@0.5 | 0.733 |
| RefCOCO+ val | IoU@0.5 | 0.558 |
| mgrounding_test (in-dist) | CIoU-F1 | 0.714 |

Improvement from GRPO over the SFT checkpoint (step 5000 → step 12000):

| Benchmark | SFT baseline | After GRPO | Δ (points) |
| --- | --- | --- | --- |
| RefCOCO val | 0.688 | 0.710 | +2.2 |
| RefCOCOg val | 0.703 | 0.733 | +3.0 |
| RefCOCO+ val | 0.537 | 0.558 | +2.1 |
| mgrounding_test | 0.713 | 0.714 | +0.1 (already saturated by SFT) |

The CIoU-F1 reward rose continuously over the full epoch with no collapse: the beta=0.01 KL brake held, whereas v1 with beta=0 collapsed at step 4000.

Dependency note for reviewers

This PR depends on #25 feature/async-eval — specifically the grounding_iou_f1 metric used by the mgrounding_test benchmark. The cross-PR test (tests/test_grounding_metric_reward_parity.py) skips cleanly until that metric lands, then auto-re-enables on rebase.

Suggested merge order: #25 first → rebase this branch → merge.

Test plan

  • pytest tests/: 320 passed / 12 skipped / 0 failed (full repo green on this branch)
  • No personal cluster paths leak into shipped configs/scripts (/lambdafs, /home/rouzbeh, hardcoded job IDs, etc. — all scrubbed)
  • Cookbook YAMLs parse cleanly via parse_job_config
  • Full E2E recipe ran successfully — 26 hr training, 4 benchmarks land on wandb
  • Smoke-test the data-prep on a customer-fresh uv sync (recommended manual check)
  • Eyeball the README customer narrative for clarity (recommended manual review)

TRL v1.0 adds production-grade GRPOTrainer with native rollout_func,
vLLM colocate/server modes, and async rollouts. Also adds openenv-core
as an optional extra ('uv sync --extra rl-env').
Adds rewards/ (top-level, sibling of job_configs/) for plain Python
reward functions referenced from YAML by path. Recipe class bundles
multiple rewards + weights per task; subclass to extend.
Adds grpo and vlm_grpo training types with colocate (default) and
server-mode vLLM rollouts. VLM GRPO preserves the 0.1x vision encoder
LR via a shared helper also used by VLM SFT. Ray Train passes the full
dataset to every worker for GRPO since TRL's RepeatSampler handles
per-rank distribution.
Example YAMLs for text GRPO (colocate + server modes) and VLM
grounding. Unit tests for the reward loader, recipes, and config
parser; e2e smoke tests + SLURM launchers for 1 GPU runs.
- DPO tokenizer: prompt_input_ids → prompt_ids (TRL v1 collator change)
- config_parser: pre-resolve reward paths to absolute on driver so Ray
  workers can find them from their sandbox CWD
- e2e fixtures: use_vllm=false until vllm supports transformers 5.x
- test assertions updated for new column names and absolute paths
…ment of multi-node in Ray, works with GRPO colocate, SFT, DPO
… need to change the dataset format, SFT converts to a prompt, solution pair for GRPO
…wn, reward Hub where recipes for different tasks accumulate for re-use. Each task contains a recipe that is essentially a bundle of rewards + weights for each reward. Combine rewards and recipes to construct your ideal reward functions
… the images, bad parses, no gradients for image tokens: vlm_grpo trainer: lift images, alias spatial_shapes, VLM-aware logps

  LFMVLMGRPOTrainer patches three gaps in TRL's multimodal data path so
  LFM2-VL actually gets gradient through the vision tower during GRPO:

  - Lift images from prompt message content into the top-level
    key TRL inspects, so the multimodal branch fires and pixel_values
    reach the training forward pass (without it, generation still sees
    images but training silently runs with pixel_values=None).
  - Alias the processor's spatial_shapes output to image_sizes via a
    context-scoped __class__ swap, letting the tensor ride TRL's fixed
    multimodal kwarg whitelist from data prep through _compute_loss.
  - Override _get_per_token_logps_and_entropies to rename back to
    spatial_shapes at the model-forward boundary, filter to kwargs the
    model accepts, and skip TRL's per-sample pixel_values chunking
    (LFM2-VL returns patch-concatenated pixels, not (B, C, H, W)).
Introduce a narrow Protocol (generate + logprobs) so benchmarks can
dispatch through the same code path whether they're talking to an
in-process HF model, an in-process vLLM engine, or a remote vLLM
server. Ships with HFBackend (in-process HF) used by the sync path
and as a logprob fallback for the vLLM backends.

Benchmark base class gains an additive evaluate_with_backend default
(raises NotImplementedError) so subclasses can opt in incrementally.
Zero behavior change for existing sync callers.
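
A narrow backend Protocol of this kind might be sketched as below. The names InferenceBackend, GenerateRequest, and LogprobRequest come from the commit messages, but the fields, method signatures, and the toy `EchoBackend` and `exact_match_accuracy` helpers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence, runtime_checkable


@dataclass
class GenerateRequest:
    prompt: str
    max_tokens: int = 64  # hypothetical field


@dataclass
class LogprobRequest:
    prompt: str
    continuation: str  # hypothetical field


@runtime_checkable
class InferenceBackend(Protocol):
    """Anything with these two methods can serve a benchmark."""

    def generate(self, requests: Sequence[GenerateRequest]) -> list[str]: ...

    def logprobs(self, requests: Sequence[LogprobRequest]) -> list[float]: ...


class EchoBackend:
    """Toy backend for tests: upper-cases prompts, scores by continuation length."""

    def generate(self, requests):
        return [r.prompt.upper() for r in requests]

    def logprobs(self, requests):
        return [-float(len(r.continuation)) for r in requests]


def exact_match_accuracy(backend: InferenceBackend, samples):
    """Batch all prompts through the backend, then score each response."""
    requests = [GenerateRequest(prompt=question) for question, _ in samples]
    outputs = backend.generate(requests)
    hits = sum(out == answer for out, (_, answer) in zip(outputs, samples))
    return hits / len(samples)
```

Because the Protocol is structural, an in-process HF model, an in-process vLLM engine, or a remote server client can each satisfy it without sharing a base class, and a fake backend like the one above is enough to unit-test the scoring path.
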
LLM and VLM generation + logprob benchmarks gain evaluate_with_backend
implementations that build batched GenerateRequest / LogprobRequest
lists, dispatch to an InferenceBackend, and score the responses with
the existing per-sample scoring logic. Sync path is unchanged.

Export the new backend symbols from leap_finetune.evaluation.
AsyncEvalConfig parses + validates the async_eval: YAML block (sync /
sidecar / reserved) with sub-blocks for sbatch settings, reserved
server settings, and failure handling.

make_eval_callback dispatches to BenchmarkEvalCallback (sync),
SidecarEvalCallback (sidecar), or ReservedEvalCallback (reserved).
Sidecar and reserved imports are lazy so sync users don't pay the
import cost.
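
The validate-then-dispatch pattern described here can be sketched as follows. This is a simplified stand-in, not the real AsyncEvalConfig or make_eval_callback: `AsyncEvalSettings`, `parse_async_eval`, `make_eval_callback_sketch`, and the default of 3 consecutive failures are all hypothetical.

```python
from dataclasses import dataclass

VALID_MODES = ("sync", "sidecar", "reserved")


@dataclass(frozen=True)
class AsyncEvalSettings:
    mode: str = "sync"
    max_consecutive_failures: int = 3  # hypothetical default


def parse_async_eval(block):
    """Validate a raw async_eval mapping (as loaded from YAML) on the driver."""
    block = dict(block or {})
    mode = block.get("mode", "sync")
    if mode not in VALID_MODES:
        raise ValueError(f"async_eval.mode must be one of {VALID_MODES}, got {mode!r}")
    failure = block.get("failure") or {}
    return AsyncEvalSettings(
        mode=mode,
        max_consecutive_failures=int(failure.get("max_consecutive", 3)),
    )


def make_eval_callback_sketch(settings, factories):
    """Build only the callback for the selected mode.

    `factories` maps mode -> zero-argument callable. Each callable can do its
    own import inside its body, so modes that are not selected never pay the
    import cost.
    """
    return factories[settings.mode]()
```

Failing fast in `parse_async_eval` mirrors the goal of surfacing misconfiguration before any worker is spawned.
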
SidecarEvalCallback (rank 0) stages a checkpoint, renders an sbatch
script, and submits at every eval_steps. The sbatch job loads vLLM
on whatever GPU SLURM assigns it, runs every configured benchmark,
and back-fills the training run's wandb log at the originating step.
Training never pauses on eval.

A .in_flight marker enforces on_overlap policy (skip / queue); the
sbatch clears the marker on EXIT so a crashed runner can't block the
callback. After failure.max_consecutive failures the callback
disables itself.
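
The marker lifecycle above might be modeled like this. It is a sketch under stated assumptions: `InFlightGate` is a hypothetical class, `release()` stands in for the sbatch EXIT trap that clears the marker, and the exact semantics of the skip and queue policies in the real callback may differ.

```python
import tempfile
from pathlib import Path


class InFlightGate:
    """One marker file guards against overlapping eval jobs (hypothetical helper)."""

    def __init__(self, marker, on_overlap="skip", max_consecutive=3):
        if on_overlap not in ("skip", "queue"):
            raise ValueError(f"unknown on_overlap policy: {on_overlap!r}")
        self.marker = Path(marker)
        self.on_overlap = on_overlap
        self.max_consecutive = max_consecutive
        self.failures = 0
        self.disabled = False
        self.pending = []  # steps deferred under the "queue" policy

    def try_acquire(self, step):
        """Return True if an eval for `step` may be submitted now."""
        if self.disabled:
            return False
        if self.marker.exists():
            if self.on_overlap == "queue":
                self.pending.append(step)
            return False
        self.marker.write_text(str(step))
        return True

    def release(self):
        """Clear the marker; safe to call even if it is already gone."""
        self.marker.unlink(missing_ok=True)

    def record(self, ok):
        """Track consecutive failures and disable the gate at the limit."""
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.max_consecutive:
            self.disabled = True


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        gate = InFlightGate(Path(tmp) / ".in_flight", on_overlap="queue", max_consecutive=2)
        first = gate.try_acquire(1000)   # no marker yet, so this submits
        second = gate.try_acquire(2000)  # marker present, so this is queued
        gate.release()                   # the finished (or crashed) job clears it
        third = gate.try_acquire(3000)
        gate.record(False)
        gate.record(False)               # second consecutive failure disables
        return first, second, third, gate.pending, gate.disabled
```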

When eval_on_start is true the step-0 sidecar runs synchronously
(callback polls sacct until the job is terminal) so wandb's step
counter stays aligned for the baseline metrics.
ReservedEvalCallback owns a daemon helper thread (rank 0) that hosts
a persistent vLLM OpenAI server on the dedicated eval GPUs carved off
the training pool. On each eval_steps the thread respawns the server
against the latest checkpoint, runs every benchmark via
VLLMServerBackend, and pushes results back to a queue.

on_log drains the queue and back-fills wandb at the originating
training step. on_train_end drains any in-flight cycles before
teardown so results aren't dropped.

Helper-thread exceptions never propagate to training. Single-node
only; weight_reload=respawn only (in_place rejected with a clear
error). Driver-side GPU carving lands in the next commit.
For mode=reserved, the driver carves vllm_gpus off the training pool
at job start, sets CUDA_VISIBLE_DEVICES for the trainer accordingly,
and hands the worker the eval server URL + carved GPU ids through
train_loop_config. The worker (rank 0) launches its own vllm-serve
subprocess inside the helper thread so it owns the lifetime and can
respawn on weight reload.

Runs AFTER any GRPO server-mode carve so the two modes can coexist.
Multi-node is rejected with a clear error.
config_parser validates the async_eval YAML block on the driver
(misconfig errors surface before a Ray worker is spawned). The raw
dict is forwarded into train_loop_config.

Each of the 5 training loops (sft, dpo, grpo, vlm_sft, vlm_grpo)
replaces its direct BenchmarkEvalCallback registration with the
make_eval_callback dispatch helper. Same call shape across loops;
the dispatcher picks sync / sidecar / reserved based on the YAML.
Unit tests for AsyncEvalConfig parsing, make_eval_callback dispatch,
sidecar marker lifecycle, and FakeBackend round-trip through the
benchmark dispatchers.

Toy fixtures (sidecar.sh / reserved.sh + matching YAMLs) exercise
each mode end-to-end on a single GPU against a tiny QA benchmark
under SLURM. job_configs/sft_with_async_eval_example.yaml is the
copy-paste starting point users land on from the README.
Top-level overview of the three modes (sync / sidecar / reserved):
when to pick each, the trade-offs (training pause vs reserved GPUs vs
queue latency), and the YAML schema. Points users at the example
config + the per-mode behavior contracts.
No behavior change. Wraps long argparse / log-format lines, normalizes
frozenset literal layout, drops an unused import, and prettier-aligns
the async eval table + YAML snippet in the README.
End-to-end Phase 1 of the visual-grounding cookbook on LFM2.5-VL:

* prepare_data.py: streams Michael4933/MGrounding-630k from HuggingFace
  and converts the custom token-tagged format to leap-finetune
  messages parquet. Deterministic 3-way split (SFT / GRPO holdout /
  test) so the GRPO phase trains on rows the SFT run never saw. Skips
  Object_Tracking (different output shape).
* prepare_evals.py: builds the canonical RefCOCO/RefCOCO+/RefCOCOg val
  jsonls from the jxu124 datasets, paired with COCO 2014 train images.
* fix_test_hint.py: one-off to canonicalize the format hint on the
  in-distribution test parquet so it matches the RefCOCO trio.
* prompt_templates.py: 8 format-hint variants + 17 REC task prompts;
  EVAL_FORMAT_HINT pins the canonical eval phrasing.
* configs/sft_grounding.{yaml,sh}: Phase 1 SFT config + SLURM launcher.
  Uses async eval (sidecar mode) so training never pauses on eval.
* configs/{prepare_data,prepare_evals,fix_test_hint}.sh: CPU-only
  SLURM wrappers for the one-time data + eval-set generation.

Validated end-to-end on a 12h SFT run hitting refcoco_val 0.620,
refcoco_plus 0.483, refcocog 0.686, mgrounding_test 0.751 at the peak
checkpoint.
Phase 2 GRPO config + launcher. Resumes from the Phase 1 SFT
checkpoint (model_name placeholder needs to be filled in by the user)
and trains on the held-out 30% slice via the shipped
VLMGroundingIoURecipe (strict format 0.1 + Hungarian-matched IoU-F1
1.0). Same four async-eval benchmarks as Phase 1.

README extended with a Phase 2 section describing the reward, the
checkpoint pointing step, and the launch command.
Switch the GRPO YAML from VLMGroundingIoURecipe to
VLMGroundingCIoURecipe so the matcher itself runs on CIoU (IoU minus
center-distance and aspect-ratio penalties) and the F1 reward scores
the matched CIoU values. This rewards center-aligned + same-shape
pairs even when raw overlap ties, and the F1 wrapping keeps FP/FN
penalization. README's Phase 2 section updated accordingly.
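
For reference, the standard CIoU score (IoU minus a normalized center-distance term and an aspect-ratio term) can be written as below. This is a plain-Python sketch of the published formula, not the code in VLMGroundingCIoURecipe; the `eps` guard is an assumption added for numerical safety.

```python
import math


def ciou(a, b, eps=1e-9):
    """Complete IoU between two [x1, y1, x2, y2] boxes.

    Close to 1.0 for identical boxes; can drop below 0 for distant ones.
    """
    aw, ah = a[2] - a[0], a[3] - a[1]
    bw, bh = b[2] - b[0], b[3] - b[1]

    # Plain IoU.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    iou = inter / (union + eps)

    # Squared distance between box centers.
    rho2 = ((a[0] + a[2]) - (b[0] + b[2])) ** 2 / 4 + ((a[1] + a[3]) - (b[1] + b[3])) ** 2 / 4

    # Squared diagonal of the smallest box enclosing both.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(bw / (bh + eps)) - math.atan(aw / (ah + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - rho2 / c2 - alpha * v
```

Two predictions with identical raw IoU against the same target get different CIoU values when one is centered on the target and the other is shifted, which is the tie-breaking behavior the commit message relies on.
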
Make multi-image (MGrounding-style) VLM GRPO work end-to-end on
LFM2-VL, which TRL + upstream vLLM don't handle out of the box:

1. vLLM rollout: inject mm_processor_kwargs (do_image_splitting=False,
   single-tile) on multi-image prompts so vLLM 0.19's LFM2-VL
   preprocessor doesn't crash on empty spatial_shapes.
2. TRL split/unsplit_pixel_values_by_grid: patch for LFM2-VL's
   per-image pixel_values layout (Qwen-style patch-concat assumption
   breaks it). Split every per-image tensor by num_images and re-merge.
3. _get_per_token_logps_and_entropies: single full-batch forward
   (no per-sample slicing through concatenated multi-image patches),
   spatial_shapes<->image_sizes aliasing, and completion-region
   sanitization of stray image-placeholder tokens sampled during
   rollout.

Validated: GRPO from the SFT checkpoint trains stably and all four
benchmarks (refcoco/+/g + mgrounding) climb monotonically above the
SFT baseline through step 4000. A residual off-by-one in masked_scatter
still surfaces rarely — hardened in the follow-up commit.
Rouzbehat78 and others added 4 commits May 27, 2026 19:49
The completion-only sanitization reduced the stray-image-token
off-by-one but didn't eliminate it — a residual surplus still reached
LFM2-VL's masked_scatter and triggered an unrecoverable CUDA assert
(device-side, no clean traceback) ~step 4600 on a beta>0 run.

Replace it with _reconcile_image_tokens: before every forward, compute
the exact expected feature count from spatial_shapes
(Σ floor(h/df)·floor(w/df), df=downsample_factor — matches the model's
pixel_unshuffle output) and compare against the image-placeholder count
in input_ids. Trim any surplus image tokens (scanning from the
completion end, so structural prompt placeholders are last to go) so
placeholders == features always. image_token_id is resolved from the
processor (reliable) rather than the wrapped model's config.

Underflow (placeholders < features) is logged as a data/collation bug
rather than silently trimmed.
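
The counting and trimming logic can be sketched on plain Python lists. This is an illustration of the idea, not the trainer's _reconcile_image_tokens: the function names are hypothetical, and "trimming" is modeled here by overwriting surplus placeholders with a pad id so the sequence length is unchanged, which is an assumption about how the real code removes them.

```python
import logging

logger = logging.getLogger(__name__)


def expected_feature_count(spatial_shapes, downsample_factor):
    """Sum of floor(h/df) * floor(w/df) over all images in the sample."""
    return sum(
        (h // downsample_factor) * (w // downsample_factor) for h, w in spatial_shapes
    )


def reconcile_image_tokens(input_ids, spatial_shapes, image_token_id,
                           pad_token_id, downsample_factor):
    """Return a copy of input_ids whose placeholder count matches the feature count."""
    ids = list(input_ids)
    expected = expected_feature_count(spatial_shapes, downsample_factor)
    surplus = ids.count(image_token_id) - expected

    if surplus < 0:
        # Underflow points at a data/collation bug; report it, do not trim.
        logger.warning("image-token underflow: %d placeholders missing", -surplus)
        return ids

    # Scan from the end so stray tokens sampled in the completion go first
    # and the structural prompt placeholders are the last to be touched.
    for i in range(len(ids) - 1, -1, -1):
        if surplus == 0:
            break
        if ids[i] == image_token_id:
            ids[i] = pad_token_id
            surplus -= 1
    return ids
```
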
… fixes

Cookbook (cookbook/visual-grounding/):
- prepare_data.py: HF→leap-finetune conversion for MGrounding-630k incl.
  Object_Tracking (multi-image), 80/20 SFT/GRPO + 10% test holdout, all
  canonicalized to [{"label","bbox"}] JSON.
- prepare_evals.py: RefCOCO/+/g val → jsonl per benchmark.
- Single + multi-node SLURM launchers for SFT and GRPO.
- YAMLs use relative ./data and ./outputs; no cluster paths or secrets.
- README explains the data flow and how to swap in a custom dataset.

GRPO training-loop fixes (src/leap_finetune/training_loops/vlm_grpo_run.py):
- Force-no-split vLLM rollout: inject mm_processor_kwargs
  {do_image_splitting=False, min_tiles=1, max_tiles=1} for every
  image-bearing prompt (was multi-image only). vLLM 0.19's LFM2-VL has
  an internal image-size threshold that re-enables tile splitting
  independently of the HF flag; this disables it.
- Per-sample image-token/feature mismatch handler in the HF training
  forward — trims surplus tokens or pads underflow rows, zero-weights
  bad samples, so a malformed row never crashes the whole batch.
- Multi-image pixel_values / spatial_shapes plumbing for LFM2-VL.
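
The force-no-split injection could be expressed as a small transform over rollout prompts. The sketch assumes each prompt is a dict with optional `multi_modal_data` and `mm_processor_kwargs` keys, in the style of vLLM prompt inputs; `force_no_split` is a hypothetical helper, and only the three kwargs named above are taken from the commit message.

```python
NO_SPLIT_KWARGS = {"do_image_splitting": False, "min_tiles": 1, "max_tiles": 1}


def force_no_split(prompts):
    """Return copies of the prompts with no-split kwargs on every image-bearing one."""
    patched = []
    for prompt in prompts:
        prompt = dict(prompt)  # shallow copy so the caller's dict is untouched
        images = (prompt.get("multi_modal_data") or {}).get("image")
        is_empty_list = isinstance(images, (list, tuple)) and len(images) == 0
        if images is not None and not is_empty_list:
            merged = dict(prompt.get("mm_processor_kwargs") or {})
            merged.update(NO_SPLIT_KWARGS)  # no-split settings win over existing ones
            prompt["mm_processor_kwargs"] = merged
        patched.append(prompt)
    return patched
```

Text-only prompts pass through unchanged, so the same call can be applied to a mixed batch.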

Reward (rewards/tasks/vlm_grounding/recipe.py): Hungarian-matched
CIoU-F1 with strict-format gate. Multi-bbox aware, degrades to single
CIoU when ground truth is a single box.

Ancillary:
- dataset_loader.py: pyarrow fallback on flaky parquet reads.
- logging_utils.py: bump wandb init_timeout 90s→300s for slow nodes.
- pyproject.toml: pin transformers 5.3 + vLLM 0.19, openenv-core extra.
- New test: rewards/metric parity on grounding samples.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cookbook branch had 4 test failures against current code:

1. ``test_sidecar_defaults`` — asserted ``sbatch.time == "00:30:00"``
   but the schema's default is ``None`` (partition default; no cap).

2. ``test_failure_disables_after_max_consecutive`` — called
   ``cb.on_evaluate(...)`` but ``SidecarEvalCallback`` only hooks
   ``on_step_end`` (with ``control.should_evaluate=True``).
   ``on_evaluate`` inherits from the base class as a no-op so the
   submission path was never triggered and the disable assertion
   stayed False.

3. ``test_rename_back_in_get_per_token_logps`` — broke after PR A's
   ``_check_image_token_mismatch`` was added: that helper accesses
   ``self.processing_class`` directly, and the test constructs the
   trainer via ``__new__`` without setting it. Switched to
   ``getattr(self, "processing_class", None)`` so the helper is safe
   on bare instances (matches the rest of the helper which already
   uses ``getattr`` for ``image_token_id``).

4. ``test_grounding_metric_reward_parity.py`` — imports
   ``score_grounding_iou_f1`` which only exists on feature/async-eval.
   Wrapped the import in ``try/except ImportError`` with
   ``pytest.skip(..., allow_module_level=True)`` so PR A is
   pytest-clean standalone. Auto-re-enables on rebase once PR B's
   metric is on main.

320 passed / 12 skipped / 0 failed — full repo green on this branch.
…ismatch

Codex flagged that my previous fix was half-done: ``getattr(self,
"processing_class", None)`` made the lookup safe, but ``proc=None``
then falls through to ``proc.image_processor`` on the next line and
crashes on ``AttributeError`` if ``image_token_id`` happened to come
from ``model.config`` instead of the processor.

Fix:
- Return None early when ``proc is None``. Without the processor's
  ``image_processor`` we can't read ``downsample_factor`` so we can't
  do per-sample reconciliation — skipping the preflight is the right
  default (the model forward still runs; we just don't pre-screen).
- Also use getattr on ``image_processor`` itself so a partially-built
  processor doesn't crash either.

Tests pass — same regression test (``test_rename_back_in_get_per_token_logps``)
still passes, plus all other test_grpo_data tests.