cookbook: visual-grounding SFT→GRPO recipe for LFM2.5-VL-1.6B #27
Open
Rouzbehat78 wants to merge 34 commits into
TRL v1.0 adds production-grade GRPOTrainer with native rollout_func,
vLLM colocate/server modes, and async rollouts. Also adds openenv-core
as an optional extra ('uv sync --extra rl-env').
Adds rewards/ (top-level, sibling of job_configs/) for plain Python reward functions referenced from YAML by path. Recipe class bundles multiple rewards + weights per task; subclass to extend.
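A minimal sketch of the "bundle of rewards + weights" idea, for orientation only. The `Recipe` fields, the `score` method, and the two toy reward functions are hypothetical names chosen for illustration; the actual class in `rewards/` may look different.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# A reward function maps (completion, reference) to a float score.
RewardFn = Callable[[str, str], float]


@dataclass
class Recipe:
    """Hypothetical recipe: named reward functions plus one weight each."""

    rewards: Dict[str, RewardFn] = field(default_factory=dict)
    weights: Dict[str, float] = field(default_factory=dict)

    def score(self, completion: str, reference: str) -> float:
        # Weighted sum; a reward without an explicit weight counts as 1.0.
        return sum(
            self.weights.get(name, 1.0) * fn(completion, reference)
            for name, fn in self.rewards.items()
        )


def exact_match(completion: str, reference: str) -> float:
    """Toy reward: 1.0 when the stripped strings are identical."""
    return 1.0 if completion.strip() == reference.strip() else 0.0


def is_short(completion: str, reference: str) -> float:
    """Toy reward: 1.0 when the completion is at most 64 characters."""
    return 1.0 if len(completion) <= 64 else 0.0


class ShortAnswerRecipe(Recipe):
    """Subclass that pre-bundles two rewards, in the spirit of 'subclass to extend'."""

    def __init__(self) -> None:
        super().__init__(
            rewards={"exact": exact_match, "short": is_short},
            weights={"exact": 1.0, "short": 0.1},
        )
```

Under these assumptions, `ShortAnswerRecipe().score("42", "42")` combines both rewards with weights 1.0 and 0.1.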
Adds grpo and vlm_grpo training types with colocate (default) and server-mode vLLM rollouts. VLM GRPO preserves the 0.1x vision encoder LR via a shared helper also used by VLM SFT. Ray Train passes the full dataset to every worker for GRPO since TRL's RepeatSampler handles per-rank distribution.
Example YAMLs for text GRPO (colocate + server modes) and VLM grounding. Unit tests for the reward loader, recipes, and config parser; e2e smoke tests + SLURM launchers for 1 GPU runs.
- DPO tokenizer: prompt_input_ids → prompt_ids (TRL v1 collator change)
- config_parser: pre-resolve reward paths to absolute on driver so Ray workers can find them from their sandbox CWD
- e2e fixtures: use_vllm=false until vllm supports transformers 5.x
- test assertions updated for new column names and absolute paths
…ment of multi-node in Ray, works with GRPO colocate, SFT, DPO
… need to change the dataset format, SFT converts to prompt/solution pairs for GRPO
…wn, reward hub where recipes for different tasks accumulate for re-use. Each task contains a recipe that is essentially a bundle of rewards plus a weight for each reward. Combine rewards and recipes to construct your ideal reward functions
… the images, bad parses, no gradients for image tokens: vlm_grpo trainer: lift images, alias spatial_shapes, VLM-aware logps
LFMVLMGRPOTrainer patches three gaps in TRL's multimodal data path so
LFM2-VL actually gets gradient through the vision tower during GRPO:
- Lift images from prompt message content into the top-level
key TRL inspects, so the multimodal branch fires and pixel_values
reach the training forward pass (without it, generation still sees
images but training silently runs with pixel_values=None).
- Alias the processor's spatial_shapes output to image_sizes via a
context-scoped __class__ swap, letting the tensor ride TRL's fixed
multimodal kwarg whitelist from data prep through _compute_loss.
- Override _get_per_token_logps_and_entropies to rename image_sizes back to
spatial_shapes at the model-forward boundary, filter to kwargs the
model accepts, and skip TRL's per-sample pixel_values chunking
(LFM2-VL returns patch-concatenated pixels, not (B, C, H, W)).
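The first gap (lifting images out of the prompt messages) can be sketched as a small pure function. This is an illustration under assumptions, not the trainer's code: the `prompt` field layout, the `{"type": "image", "image": ...}` part shape, and the `images` key name are all assumed here.

```python
from typing import Any, Dict, List


def lift_images(example: Dict[str, Any], key: str = "images") -> Dict[str, Any]:
    """Copy image payloads out of prompt message content into a top-level key.

    Assumes each message's content is either a plain string or a list of
    parts shaped like {"type": "image", "image": <payload>}. The key name
    "images" is an assumption of this sketch.
    """
    images: List[Any] = []
    for message in example.get("prompt", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-text message, nothing to lift
        for part in content:
            if isinstance(part, dict) and part.get("type") == "image" and "image" in part:
                images.append(part["image"])

    lifted = dict(example)  # shallow copy so the input row is left untouched
    if images:
        lifted[key] = images
    return lifted
```

Text-only rows pass through without the extra key, so a multimodal branch keyed on its presence would stay off for them.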
… cleanup, few examples for text and VLM
…nv with reward length for both VLM and LLM
Introduce a narrow Protocol (generate + logprobs) so benchmarks can dispatch through the same code path whether they're talking to an in-process HF model, an in-process vLLM engine, or a remote vLLM server. Ships with HFBackend (in-process HF) used by the sync path and as a logprob fallback for the vLLM backends. Benchmark base class gains an additive evaluate_with_backend default (raises NotImplementedError) so subclasses can opt in incrementally. Zero behavior change for existing sync callers.
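A rough sketch of what a two-method backend protocol could look like. The names `InferenceBackend`, `GenerateRequest`, and `LogprobRequest` appear in the commit messages, but their fields and signatures below are assumptions, and `EchoBackend` is a hypothetical toy implementation.

```python
from dataclasses import dataclass
from typing import List, Protocol, runtime_checkable


@dataclass
class GenerateRequest:
    prompt: str
    max_new_tokens: int = 32  # assumed field, for illustration


@dataclass
class LogprobRequest:
    prompt: str
    continuation: str


@runtime_checkable
class InferenceBackend(Protocol):
    """Narrow interface: anything with generate + logprobs qualifies."""

    def generate(self, requests: List[GenerateRequest]) -> List[str]: ...

    def logprobs(self, requests: List[LogprobRequest]) -> List[float]: ...


class EchoBackend:
    """Toy backend: upper-cases prompts and returns fake log-probabilities."""

    def generate(self, requests: List[GenerateRequest]) -> List[str]:
        return [r.prompt.upper() for r in requests]

    def logprobs(self, requests: List[LogprobRequest]) -> List[float]:
        return [-float(len(r.continuation)) for r in requests]


def run_generation_benchmark(backend: InferenceBackend, prompts: List[str]) -> List[str]:
    # The benchmark only sees the protocol, never the concrete backend type.
    return backend.generate([GenerateRequest(p) for p in prompts])
```

Because the protocol is structural, `EchoBackend` needs no inheritance to be accepted wherever an `InferenceBackend` is expected.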
LLM and VLM generation + logprob benchmarks gain evaluate_with_backend implementations that build batched GenerateRequest / LogprobRequest lists, dispatch to an InferenceBackend, and score the responses with the existing per-sample scoring logic. Sync path is unchanged. Export the new backend symbols from leap_finetune.evaluation.
AsyncEvalConfig parses + validates the async_eval: YAML block (sync / sidecar / reserved) with sub-blocks for sbatch settings, reserved server settings, and failure handling. make_eval_callback dispatches to BenchmarkEvalCallback (sync), SidecarEvalCallback (sidecar), or ReservedEvalCallback (reserved). Sidecar and reserved imports are lazy so sync users don't pay the import cost.
SidecarEvalCallback (rank 0) stages a checkpoint, renders an sbatch script, and submits at every eval_steps. The sbatch job loads vLLM on whatever GPU SLURM assigns it, runs every configured benchmark, and back-fills the training run's wandb log at the originating step. Training never pauses on eval. A .in_flight marker enforces on_overlap policy (skip / queue); the sbatch clears the marker on EXIT so a crashed runner can't block the callback. After failure.max_consecutive failures the callback disables itself. When eval_on_start is true the step-0 sidecar runs synchronously (callback polls sacct until the job is terminal) so wandb's step counter stays aligned for the baseline metrics.
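The marker-based overlap policy described above could be sketched as follows. This is a simplified illustration: `try_submit` and `clear_marker` are hypothetical helpers, the real callback renders and submits an sbatch script instead of returning a string, and the marker is cleared by the job's EXIT handling, not by Python.

```python
import tempfile
from pathlib import Path
from typing import List


def try_submit(marker: Path, step: int, queue: List[int], on_overlap: str = "skip") -> str:
    """Decide what to do at an eval boundary, given an in-flight marker file."""
    if marker.exists():
        # A previous sidecar job is still running.
        if on_overlap == "queue":
            queue.append(step)
            return "queued"
        return "skipped"
    marker.write_text(str(step))  # claim the slot before submitting
    return "submitted"


def clear_marker(marker: Path) -> None:
    """Stand-in for the cleanup the sbatch job performs on EXIT."""
    marker.unlink(missing_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / ".in_flight"
        pending: List[int] = []
        print(try_submit(marker, 1000, pending))  # first boundary claims the slot
        print(try_submit(marker, 2000, pending))  # overlaps with the running job
        clear_marker(marker)
        print(try_submit(marker, 3000, pending))  # slot is free again
```

Clearing the marker on exit, whether the job succeeded or crashed, is what keeps a failed runner from blocking later submissions.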
ReservedEvalCallback owns a daemon helper thread (rank 0) that hosts a persistent vLLM OpenAI server on the dedicated eval GPUs carved off the training pool. On each eval_steps the thread respawns the server against the latest checkpoint, runs every benchmark via VLLMServerBackend, and pushes results back to a queue. on_log drains the queue and back-fills wandb at the originating training step. on_train_end drains any in-flight cycles before teardown so results aren't dropped. Helper-thread exceptions never propagate to training. Single-node only; weight_reload=respawn only (in_place rejected with a clear error). Driver-side GPU carving lands in the next commit.
For mode=reserved, the driver carves vllm_gpus off the training pool at job start, sets CUDA_VISIBLE_DEVICES for the trainer accordingly, and hands the worker the eval server URL + carved GPU ids through train_loop_config. The worker (rank 0) launches its own vllm-serve subprocess inside the helper thread so it owns the lifetime and can respawn on weight reload. Runs AFTER any GRPO server-mode carve so the two modes can coexist. Multi-node is rejected with a clear error.
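As an illustration of the carving step, the sketch below splits a list of visible GPU ids into a training set and an eval set. Taking the eval GPUs from the tail of the list and the helper names are assumptions; the driver may choose ids differently.

```python
from typing import List, Tuple


def carve_eval_gpus(visible: List[int], vllm_gpus: int) -> Tuple[List[int], List[int]]:
    """Split visible GPU ids into (training ids, eval ids).

    The eval GPUs are taken from the tail of the list, which is an
    assumption of this sketch.
    """
    if vllm_gpus <= 0:
        raise ValueError("vllm_gpus must be positive")
    if vllm_gpus >= len(visible):
        raise ValueError("carving would leave no GPUs for training")
    return visible[:-vllm_gpus], visible[-vllm_gpus:]


def cuda_visible_devices(ids: List[int]) -> str:
    """Format GPU ids the way CUDA_VISIBLE_DEVICES expects them."""
    return ",".join(str(i) for i in ids)
```

For example, carving 2 GPUs from an 8-GPU node leaves ids 0 through 5 for the trainer and hands 6 and 7 to the eval server.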
config_parser validates the async_eval YAML block on the driver (misconfig errors surface before a Ray worker is spawned). The raw dict is forwarded into train_loop_config. Each of the 5 training loops (sft, dpo, grpo, vlm_sft, vlm_grpo) replaces its direct BenchmarkEvalCallback registration with the make_eval_callback dispatch helper. Same call shape across loops; the dispatcher picks sync / sidecar / reserved based on the YAML.
Unit tests for AsyncEvalConfig parsing, make_eval_callback dispatch, sidecar marker lifecycle, and FakeBackend round-trip through the benchmark dispatchers. Toy fixtures (sidecar.sh / reserved.sh + matching YAMLs) exercise each mode end-to-end on a single GPU against a tiny QA benchmark under SLURM. job_configs/sft_with_async_eval_example.yaml is the copy-paste starting point users land on from the README.
Top-level overview of the three modes (sync / sidecar / reserved): when to pick each, the trade-offs (training pause vs reserved GPUs vs queue latency), and the YAML schema. Points users at the example config + the per-mode behavior contracts.
No behavior change. Wraps long argparse / log-format lines, normalizes frozenset literal layout, drops an unused import, and prettier-aligns the async eval table + YAML snippet in the README.
End-to-end Phase 1 of the visual-grounding cookbook on LFM2.5-VL:
* prepare_data.py: streams Michael4933/MGrounding-630k from HuggingFace
and converts the custom token-tagged format to leap-finetune
messages parquet. Deterministic 3-way split (SFT / GRPO holdout /
test) so the GRPO phase trains on rows the SFT run never saw. Skips
Object_Tracking (different output shape).
* prepare_evals.py: builds the canonical RefCOCO/RefCOCO+/RefCOCOg val
jsonls from the jxu124 datasets, paired with COCO 2014 train images.
* fix_test_hint.py: one-off to canonicalize the format hint on the
in-distribution test parquet so it matches the RefCOCO trio.
* prompt_templates.py: 8 format-hint variants + 17 REC task prompts;
EVAL_FORMAT_HINT pins the canonical eval phrasing.
* configs/sft_grounding.{yaml,sh}: Phase 1 SFT config + SLURM launcher.
Uses async eval (sidecar mode) so training never pauses on eval.
* configs/{prepare_data,prepare_evals,fix_test_hint}.sh: CPU-only
SLURM wrappers for the one-time data + eval-set generation.
Validated end-to-end on a 12h SFT run hitting refcoco_val 0.620,
refcoco_plus 0.483, refcocog 0.686, mgrounding_test 0.751 at the peak
checkpoint.
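One common way to get a deterministic, disjoint 3-way split is to hash a stable row id into [0, 1) and bucket it. The sketch below is an assumption about the approach, not a copy of prepare_data.py; the default fractions mirror the 72% / 18% / 10% split quoted in the PR summary, and `assign_split` is a hypothetical name.

```python
import hashlib


def assign_split(row_id: str, sft_frac: float = 0.72, grpo_frac: float = 0.18) -> str:
    """Map a stable row id to 'sft', 'grpo', or 'test' without any RNG state."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < sft_frac:
        return "sft"
    if u < sft_frac + grpo_frac:
        return "grpo"
    return "test"
```

Because the bucket depends only on the id, re-running the script or streaming rows in a different order yields the same assignment, which is what keeps the GRPO slice unseen by the SFT run.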
Phase 2 GRPO config + launcher. Resumes from the Phase 1 SFT checkpoint (model_name placeholder needs to be filled in by the user) and trains on the held-out 30% slice via the shipped VLMGroundingIoURecipe (strict format 0.1 + Hungarian-matched IoU-F1 1.0). Same four async-eval benchmarks as Phase 1. README extended with a Phase 2 section describing the reward, the checkpoint pointing step, and the launch command.
Switch the GRPO YAML from VLMGroundingIoURecipe to VLMGroundingCIoURecipe so the matcher itself runs on CIoU (IoU minus center-distance and aspect-ratio penalties) and the F1 reward scores the matched CIoU values. This rewards center-aligned + same-shape pairs even when raw overlap ties, and the F1 wrapping keeps FP/FN penalization. README's Phase 2 section updated accordingly.
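For readers unfamiliar with CIoU, here is a small sketch of the standard formulation (IoU minus a normalized center-distance term and an aspect-ratio term) on [x1, y1, x2, y2] boxes. It is written from the published definition, not taken from VLMGroundingCIoURecipe; whether the recipe clamps negative values is not stated in the source.

```python
import math
from typing import Sequence


def ciou(pred: Sequence[float], gt: Sequence[float], eps: float = 1e-9) -> float:
    """Complete IoU between two [x1, y1, x2, y2] boxes."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Plain IoU.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)

    # Squared distance between box centers.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi**2) * (
        math.atan((gx2 - gx1) / (gy2 - gy1 + eps)) - math.atan((px2 - px1) / (py2 - py1 + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - rho2 / c2 - alpha * v
```

Two predictions with the same raw IoU against a ground-truth box score differently here when one of them is center-aligned, which is the tie-breaking behavior the commit describes.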
Make multi-image (MGrounding-style) VLM GRPO work end-to-end on LFM2-VL, which TRL + upstream vLLM don't handle out of the box:
1. vLLM rollout: inject mm_processor_kwargs (do_image_splitting=False, single-tile) on multi-image prompts so vLLM 0.19's LFM2-VL preprocessor doesn't crash on empty spatial_shapes.
2. TRL split/unsplit_pixel_values_by_grid: patch for LFM2-VL's per-image pixel_values layout (Qwen-style patch-concat assumption breaks it). Split every per-image tensor by num_images and re-merge.
3. _get_per_token_logps_and_entropies: single full-batch forward (no per-sample slicing through concatenated multi-image patches), spatial_shapes<->image_sizes aliasing, and completion-region sanitization of stray image-placeholder tokens sampled during rollout.
Validated: GRPO from the SFT checkpoint trains stably and all four benchmarks (refcoco/+/g + mgrounding) climb monotonically above the SFT baseline through step 4000. A residual off-by-one in masked_scatter still surfaces rarely; it is hardened in the follow-up commit.
The completion-only sanitization reduced the stray-image-token off-by-one but didn't eliminate it — a residual surplus still reached LFM2-VL's masked_scatter and triggered an unrecoverable CUDA assert (device-side, no clean traceback) ~step 4600 on a beta>0 run. Replace it with _reconcile_image_tokens: before every forward, compute the exact expected feature count from spatial_shapes (Σ floor(h/df)·floor(w/df), df=downsample_factor — matches the model's pixel_unshuffle output) and compare against the image-placeholder count in input_ids. Trim any surplus image tokens (scanning from the completion end, so structural prompt placeholders are last to go) so placeholders == features always. image_token_id is resolved from the processor (reliable) rather than the wrapped model's config. Underflow (placeholders < features) is logged as a data/collation bug rather than silently trimmed.
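The reconciliation arithmetic can be sketched on plain Python lists. This is an illustration only: the real `_reconcile_image_tokens` works on tensors, and two details below are assumptions of the sketch, namely replacing surplus placeholders with a pad id (the commit only says "trim") and raising on underflow where the commit logs it.

```python
from typing import List, Sequence, Tuple


def expected_feature_count(spatial_shapes: Sequence[Tuple[int, int]], downsample_factor: int) -> int:
    """Sum of floor(h/df) * floor(w/df) over all images in the sample."""
    return sum((h // downsample_factor) * (w // downsample_factor) for h, w in spatial_shapes)


def trim_surplus_image_tokens(
    input_ids: List[int], image_token_id: int, expected: int, pad_token_id: int
) -> List[int]:
    """Neutralize surplus image placeholders, scanning from the end of the sequence.

    Scanning backwards means placeholders in the completion are removed
    before the structural ones in the prompt.
    """
    out = list(input_ids)
    surplus = out.count(image_token_id) - expected
    if surplus < 0:
        # Fewer placeholders than features: a data or collation bug, not trimmed here.
        raise ValueError(f"underflow: {-surplus} placeholders missing")
    for i in range(len(out) - 1, -1, -1):
        if surplus == 0:
            break
        if out[i] == image_token_id:
            out[i] = pad_token_id
            surplus -= 1
    return out
```

With `spatial_shapes = [(32, 32), (16, 30)]` and a downsample factor of 2, the expected count is 16·16 + 8·15 = 376 placeholders.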
… fixes
Cookbook (cookbook/visual-grounding/):
- prepare_data.py: HF→leap-finetune conversion for MGrounding-630k incl.
Object_Tracking (multi-image), 80/20 SFT/GRPO + 10% test holdout, all
canonicalized to [{"label","bbox"}] JSON.
- prepare_evals.py: RefCOCO/+/g val → jsonl per benchmark.
- Single + multi-node SLURM launchers for SFT and GRPO.
- YAMLs use relative ./data and ./outputs; no cluster paths or secrets.
- README explains the data flow and how to swap in a custom dataset.
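The canonical target format can be illustrated with a short sketch. It assumes source boxes arrive as four numbers on a 0-1000 grid (as the PR summary describes); the clamping, corner ordering, and 4-decimal rounding are choices made for this example, and the function names are hypothetical.

```python
import json
from typing import Dict, List, Sequence, Tuple


def canonicalize(label: str, box_0_1000: Sequence[float]) -> Dict[str, object]:
    """Convert one (label, box) pair on a 0-1000 grid to {"label", "bbox"}."""
    # Scale to [0, 1] and clamp anything that falls outside the image.
    x1, y1, x2, y2 = (min(max(v / 1000.0, 0.0), 1.0) for v in box_0_1000)
    # Make sure the first corner is the top-left one.
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))
    return {"label": label.strip(), "bbox": [round(c, 4) for c in (x1, y1, x2, y2)]}


def to_target(pairs: List[Tuple[str, Sequence[float]]]) -> str:
    """Serialize all boxes for one sample as the JSON list the model is trained to emit."""
    return json.dumps([canonicalize(label, box) for label, box in pairs])
```

A custom dataset would only need to produce `(label, box)` pairs in some known coordinate system to reuse the same target shape.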
GRPO training-loop fixes (src/leap_finetune/training_loops/vlm_grpo_run.py):
- Force-no-split vLLM rollout: inject mm_processor_kwargs
{do_image_splitting=False, min_tiles=1, max_tiles=1} for every
image-bearing prompt (was multi-image only). vLLM 0.19's LFM2-VL has
an internal image-size threshold that re-enables tile splitting
independently of the HF flag; this disables it.
- Per-sample image-token/feature mismatch handler in the HF training
forward — trims surplus tokens or pads underflow rows, zero-weights
bad samples, so a malformed row never crashes the whole batch.
- Multi-image pixel_values / spatial_shapes plumbing for LFM2-VL.
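The force-no-split injection amounts to attaching one fixed kwargs dict to every image-bearing rollout prompt. The sketch below works on plain prompt dicts with `multi_modal_data` and `mm_processor_kwargs` keys, as accepted by vLLM's dict-style prompts; the helper name and the merge behavior are assumptions, while the three kwargs are the ones named in the commit.

```python
from typing import Any, Dict, List

# Kwargs named in the commit message for single-tile preprocessing.
NO_SPLIT_KWARGS: Dict[str, Any] = {"do_image_splitting": False, "min_tiles": 1, "max_tiles": 1}


def force_single_tile(prompts: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the prompts with no-split kwargs on every image-bearing one."""
    patched = []
    for prompt in prompts:
        prompt = dict(prompt)  # shallow copy, leave the caller's dict alone
        mm_data = prompt.get("multi_modal_data") or {}
        if mm_data.get("image") is not None:
            merged = dict(prompt.get("mm_processor_kwargs") or {})
            merged.update(NO_SPLIT_KWARGS)  # our settings win over earlier ones
            prompt["mm_processor_kwargs"] = merged
        patched.append(prompt)
    return patched
```

Text-only prompts are passed through unchanged, so the same rollout path can serve mixed batches.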
Reward (rewards/tasks/vlm_grounding/recipe.py): Hungarian-matched
CIoU-F1 with strict-format gate. Multi-bbox aware, degrades to single
CIoU when ground truth is a single box.
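A sketch of how a matched-F1 reward over boxes can be computed. Two simplifications are made and should be read as assumptions: the optimal one-to-one assignment is found by brute force over permutations (a stand-in for the Hungarian algorithm that gives the same optimum for small box counts), and plain IoU is used as the pair score where the recipe uses CIoU. The "soft" precision/recall definition is also an assumption about how matched scores feed into F1.

```python
from itertools import permutations
from typing import Callable, List, Sequence

Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Plain IoU between two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def matched_f1(preds: List[Box], gts: List[Box], score: Callable[[Box, Box], float] = iou) -> float:
    """Soft F1 over the best one-to-one matching of predictions to ground truth."""
    if not preds or not gts:
        return 0.0  # choice made in this sketch for empty inputs
    best = 0.0
    if len(preds) <= len(gts):
        # Assign every prediction to a distinct ground-truth box.
        for perm in permutations(range(len(gts)), len(preds)):
            best = max(best, sum(score(preds[i], gts[j]) for i, j in enumerate(perm)))
    else:
        # More predictions than targets: assign every target to a distinct prediction.
        for perm in permutations(range(len(preds)), len(gts)):
            best = max(best, sum(score(preds[i], gts[j]) for j, i in enumerate(perm)))
    precision = best / len(preds)  # unmatched predictions act as false positives
    recall = best / len(gts)       # unmatched targets act as false negatives
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0
```

With one ground-truth box and one prediction this reduces to the pair score itself, matching the "degrades to single CIoU" behavior described above when the scorer is CIoU.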
Ancillary:
- dataset_loader.py: pyarrow fallback on flaky parquet reads.
- logging_utils.py: bump wandb init_timeout 90s→300s for slow nodes.
- pyproject.toml: pin transformers 5.3 + vLLM 0.19, openenv-core extra.
- New test: rewards/metric parity on grounding samples.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cookbook branch had 3 tests that failed against current code, plus one test module that failed at import:
1. ``test_sidecar_defaults``: asserted ``sbatch.time == "00:30:00"`` but the schema's default is ``None`` (partition default; no cap).
2. ``test_failure_disables_after_max_consecutive``: called ``cb.on_evaluate(...)`` but ``SidecarEvalCallback`` only hooks ``on_step_end`` (with ``control.should_evaluate=True``). ``on_evaluate`` inherits from the base class as a no-op, so the submission path was never triggered and the disable assertion stayed False.
3. ``test_rename_back_in_get_per_token_logps``: broke after PR A's ``_check_image_token_mismatch`` was added: that helper accesses ``self.processing_class`` directly, and the test constructs the trainer via ``__new__`` without setting it. Switched to ``getattr(self, "processing_class", None)`` so the helper is safe on bare instances (matches the rest of the helper, which already uses ``getattr`` for ``image_token_id``).
4. ``test_grounding_metric_reward_parity.py``: imports ``score_grounding_iou_f1``, which only exists on feature/async-eval. Wrapped the import in ``try/except ImportError`` with ``pytest.skip(..., allow_module_level=True)`` so PR A is pytest-clean standalone. Auto-re-enables on rebase once PR B's metric is on main.
320 passed / 12 skipped / 0 failed: full repo green on this branch.
…ismatch
Codex flagged that my previous fix was half-done: ``getattr(self, "processing_class", None)`` made the lookup safe, but ``proc=None`` then falls through to ``proc.image_processor`` on the next line and crashes on ``AttributeError`` if ``image_token_id`` happened to come from ``model.config`` instead of the processor.
Fix:
- Return None early when ``proc is None``. Without the processor's ``image_processor`` we can't read ``downsample_factor``, so we can't do per-sample reconciliation; skipping the preflight is the right default (the model forward still runs; we just don't pre-screen).
- Also use getattr on ``image_processor`` itself so a partially-built processor doesn't crash either.
Tests pass: the same regression test (``test_rename_back_in_get_per_token_logps``) still passes, plus all other test_grpo_data tests.
Summary
End-to-end public cookbook teaching customers how to fine-tune LFM2.5-VL-1.6B for visual grounding (predicting normalized [x1, y1, x2, y2] bounding boxes from text queries) via a two-phase SFT → GRPO recipe. Trains in <2 days on 2× 8-GPU nodes and lands at parity with or better than published baselines on RefCOCO/+/g.
The cookbook is the customer-facing artifact, but the underlying PR also ships library-level GRPO fixes for LFM2-VL multi-image inputs that were blocking the recipe.
Model + data
- LiquidAI/LFM2.5-VL-1.6B (~1.6B parameters, native multi-image VLM)
- Michael4933/MGrounding-630k: 630K grounding samples covering single-image referring expressions, group grounding, and multi-image object tracking
- [{"label": "red car", "bbox": [0.12, 0.34, 0.58, 0.71]}]: normalized [0, 1] coords
Recipe (cookbook/visual-grounding/)
Step 0 — Setup
Step 1 — Data prep (~1 hr CPU)
prepare_data.py downloads MGrounding-630k from HF, walks each multi-turn conversation (including the multi-image Object_Tracking subset), normalizes 0-1000 → [0, 1] coords, canonicalizes every variant into {label, bbox} JSON, and writes a deterministic 3-way disjoint split: 72% SFT / 18% GRPO / 10% held-out test.
prepare_evals.py pulls RefCOCO/+/g val splits from HF and emits one jsonl per benchmark.
Step 2 — Phase 1: SFT (~10–12 hr on 2 × 8 GPUs)
1 epoch on the 72% SFT split, lr=5e-5 cosine, vision encoder at 0.1× base LR, do_image_splitting=false (one tile per image). Async sidecar eval fires every eval_steps=1000 against the 4 benchmarks so training never pauses for scoring.
Step 3 — Phase 2: GRPO (~24 hr on 2 × 8 GPUs)
    # Point grpo_grounding.yaml at your Phase 1 checkpoint, then:
    sbatch cookbook/visual-grounding/configs/grpo_grounding_multinode.sh
1 epoch on the 18% GRPO split with beta=0.01 KL brake, dapo loss, num_generations=4, vLLM colocate rollouts. Reward = VLMGroundingCIoURecipe, computed over the predicted {label, bbox} dicts.
Async eval / sidecar mode
Both phases use async_eval: mode=sidecar (the library feature shipped in #25 feature/async-eval). Each eval_steps boundary stages a checkpoint and fires an sbatch running vLLM in-process against all 4 benchmarks; the score lands on the training wandb run page back-filled to the originating training step. Training never blocks.
Features exercised by this PR
- pixel_values / spatial_shapes; image-lift on the data path
- _is_image_too_large tile-split for any image >724×724 that would otherwise mismatch the HF processor's tile count and CUDA-assert on masked_scatter
- prepare_data.py showing the canonicalization pattern for any custom grounding dataset
Results
Validated end-to-end on a 26 hr training run (one full epoch over 12 604 GRPO steps from the Phase 1 SFT checkpoint).
Final benchmark scores (after GRPO):
Improvement from GRPO over the SFT checkpoint (step 5000 → step 12000):
The CIoU-F1 reward rose continuously over the full epoch with no collapse (the beta=0.01 KL brake holds; v1 with beta=0 collapsed at step 4000).
Dependency note for reviewers
This PR depends on #25 feature/async-eval, specifically the grounding_iou_f1 metric used by the mgrounding_test benchmark. The cross-PR test (tests/test_grounding_metric_reward_parity.py) skips cleanly until that metric lands, then auto-re-enables on rebase.
Suggested merge order: #25 first → rebase this branch → merge.
Test plan
- pytest tests/: 320 passed / 12 skipped / 0 failed (full repo green on this branch)
- (/lambdafs, /home/rouzbeh, hardcoded job IDs, etc.: all scrubbed)
- parse_job_config
- uv sync (recommended manual check)