Fix VLM preprocessing and add mRoPE position handling in target head #527
liusy58 wants to merge 2 commits into sgl-project:main from
Conversation
Code Review
This pull request updates VLM conversation preprocessing to support custom system prompts and ensures vision-related data fields are preserved during processing. It also introduces a get_rope_index method for calculating multimodal rotary position embeddings. Feedback includes addressing a potential IndexError in conversation handling, optimizing token search efficiency, and refactoring duplicated logic in the position ID calculation.
```python
for i, image in enumerate(examples["image"]):
    source = examples["conversations"][i]
    messages = [{"role": "system", "content": system_prompt}]
    if source[0]["role"] == "system":
```
There's a potential IndexError here. If source (which is examples["conversations"][i]) is an empty list, accessing source[0] will raise an IndexError. The check for an empty source on line 224 happens after this access.
To prevent this, add a check that source is non-empty before accessing its first element.
```diff
-if source[0]["role"] == "system":
+if source and source[0]["role"] == "system":
```
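For illustration, the short-circuit guard never touches `source[0]` on an empty conversation; a minimal sketch (the helper name and data shapes are hypothetical, mirroring the `{"role": ..., "content": ...}` turns above):

```python
# Minimal check of the short-circuited empty-list guard.
def starts_with_system(source):
    # `source and ...` short-circuits, so source[0] is never evaluated
    # when the conversation list is empty -- no IndexError.
    return bool(source) and source[0]["role"] == "system"

print(starts_with_system([]))                                     # False, no IndexError
print(starts_with_system([{"role": "system", "content": "hi"}]))  # True
```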
```python
ed_image = input_tokens.index(image_token_id, st) if image_token_id in input_tokens[st:] and remain_images > 0 else len(input_tokens) + 1
ed_video = input_tokens.index(video_token_id, st) if video_token_id in input_tokens[st:] and remain_videos > 0 else len(input_tokens) + 1
```
The current implementation for finding ed_image and ed_video uses an in check followed by .index(), which scans the input_tokens list twice per token type in each iteration; the input_tokens[st:] slice also allocates a new list each time. This can be inefficient for long sequences.
A more efficient and Pythonic approach is to use a try-except block to handle cases where a token is not found. This avoids the redundant scan and the slice copy.
```diff
-ed_image = input_tokens.index(image_token_id, st) if image_token_id in input_tokens[st:] and remain_images > 0 else len(input_tokens) + 1
-ed_video = input_tokens.index(video_token_id, st) if video_token_id in input_tokens[st:] and remain_videos > 0 else len(input_tokens) + 1
+ed_image = len(input_tokens) + 1
+if remain_images > 0:
+    try:
+        ed_image = input_tokens.index(image_token_id, st)
+    except ValueError:
+        pass
+ed_video = len(input_tokens) + 1
+if remain_videos > 0:
+    try:
+        ed_video = input_tokens.index(video_token_id, st)
+    except ValueError:
+        pass
```
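The single-scan pattern can be sketched standalone; `find_next` is a hypothetical helper name, and the `len(tokens) + 1` sentinel matches the "not found" value used above:

```python
def find_next(tokens, token_id, start):
    """Return the index of token_id at or after `start`, or len(tokens) + 1
    if absent (the sentinel used above).

    list.index accepts a start offset, so this is a single scan with no
    tokens[start:] slice copy; `in` + `.index` would scan twice.
    """
    try:
        return tokens.index(token_id, start)
    except ValueError:
        return len(tokens) + 1

tokens = [1, 2, 7, 2, 7]
print(find_next(tokens, 7, 0))  # 2
print(find_next(tokens, 7, 3))  # 4
print(find_next(tokens, 9, 0))  # 6 (len + 1 sentinel: token absent)
```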
```python
if st < len(input_tokens):
    st_idx = llm_pos_ids_list[-1].max() + 1 if len(llm_pos_ids_list) > 0 else 0
    text_len = len(input_tokens) - st
    llm_pos_ids_list.append(torch.arange(text_len).view(1, -1).expand(3, -1) + st_idx)
```
The logic in this block to handle trailing text is very similar to the logic for handling text between vision tokens (lines 163-166). Specifically, calculating st_idx and appending text position IDs is duplicated.
To improve maintainability and reduce redundancy, consider refactoring this repeated logic. You could potentially merge this trailing text handling into the main loop or use a helper function to encapsulate the logic for appending text positions.
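One possible shape for such a helper, as a sketch only (the name `append_text_positions` is hypothetical; it assumes the 3-row mRoPE layout, where a text token shares one index across the temporal/height/width axes):

```python
import torch

def append_text_positions(llm_pos_ids_list, text_len):
    """Append 3D position ids for a plain-text span of `text_len` tokens.

    Hypothetical helper: text tokens use the same index on all three mRoPE
    axes, continuing from the largest position emitted so far.
    """
    st_idx = llm_pos_ids_list[-1].max() + 1 if llm_pos_ids_list else 0
    llm_pos_ids_list.append(
        torch.arange(text_len).view(1, -1).expand(3, -1) + st_idx
    )

pos_list = []
append_text_positions(pos_list, 4)  # positions 0..3 on all 3 axes
append_text_positions(pos_list, 2)  # positions 4..5 continue the sequence
print(pos_list[1])                  # a (3, 2) tensor, rows identical
```

Both the in-loop text segments and the trailing-text block could then call this helper instead of repeating the `st_idx` computation.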
Motivation
This PR improves VLM training support in two areas:
Modifications
Related Issues
Accuracy Test
Benchmark & Profiling
Checklist