Fix multimodal hidden-state preparation for Qwen3-VL models#526

Open
liusy58 wants to merge 2 commits into sgl-project:main from liusy58:fix_prepare_hidden_states

Conversation


@liusy58 liusy58 commented Apr 8, 2026

Motivation

This PR updates scripts/prepare_hidden_states.py to better support multimodal target models during hidden-state preparation.

Modifications

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for vision-language models (VLMs) by adding pixel_values and image_grid_thw to the DataPoint class and the data generation pipeline. It also makes the torch_dtype resolution logic more flexible. Feedback focuses on preventing KeyError and TypeError exceptions when these new fields are missing in non-VLM contexts, and on simplifying the conditional logic for determining the model's data type.
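As a hedged sketch of what the summary describes (the actual DataPoint definition in scripts/prepare_hidden_states.py may differ; the text fields here are illustrative), the new multimodal fields would plausibly be optional so text-only targets keep working:

```python
from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class DataPoint:
    # Existing text field (name is illustrative, not taken from the script).
    input_ids: List[int]
    # New multimodal fields from the review; None for non-VLM targets.
    pixel_values: Optional[Any] = None
    image_grid_thw: Optional[Any] = None

text_only = DataPoint(input_ids=[1, 2, 3])
multimodal = DataPoint(input_ids=[1, 2], pixel_values="fake_tensor",
                       image_grid_thw=(1, 2, 2))
```

Defaulting both fields to None lets downstream code distinguish the two cases with a simple `is not None` check instead of try/except.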

Comment on lines +496 to +497
pixel_values = batch["pixel_values"][valid_indices_in_batch]
image_grid_thw = batch["image_grid_thw"][valid_indices_in_batch]

Severity: high

This implementation will raise a KeyError for non-VLM models because pixel_values and image_grid_thw are missing from the batch. Additionally, these tensors must be added to filtered_batch so they are correctly moved to the GPU and passed to the model's extend method during hidden state generation.

            pixel_values = batch.get("pixel_values")
            if pixel_values is not None:
                pixel_values = pixel_values[valid_indices_in_batch]
                filtered_batch["pixel_values"] = pixel_values
            image_grid_thw = batch.get("image_grid_thw")
            if image_grid_thw is not None:
                image_grid_thw = image_grid_thw[valid_indices_in_batch]
                filtered_batch["image_grid_thw"] = image_grid_thw
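The difference between direct indexing and dict.get() can be seen with a toy batch (plain lists stand in for tensors; the key names are the ones from the review):

```python
# Toy batches: a VLM batch carries image tensors, a text-only batch does not.
vlm_batch = {"input_ids": [[1, 2], [3, 4]],
             "pixel_values": ["img0", "img1"],
             "image_grid_thw": [(1, 2, 2), (1, 4, 4)]}
text_batch = {"input_ids": [[5, 6]]}

def filter_multimodal(batch, valid_indices_in_batch):
    """Copy optional multimodal fields into the filtered batch only when present."""
    filtered = {}
    pixel_values = batch.get("pixel_values")  # None for text-only batches
    if pixel_values is not None:
        filtered["pixel_values"] = [pixel_values[i] for i in valid_indices_in_batch]
    image_grid_thw = batch.get("image_grid_thw")
    if image_grid_thw is not None:
        filtered["image_grid_thw"] = [image_grid_thw[i] for i in valid_indices_in_batch]
    return filtered

out_vlm = filter_multimodal(vlm_batch, [0])    # both keys present
out_text = filter_multimodal(text_batch, [0])  # empty: no KeyError raised
```

With `batch["pixel_values"]` the text-only case would raise KeyError immediately; `.get()` returns None and the guard skips the field entirely.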

Comment on lines +552 to +553
pixel_values=pixel_values[i].clone(),
image_grid_thw=image_grid_thw[i].clone(),

Severity: high

Accessing pixel_values[i] and image_grid_thw[i] will cause a TypeError or AttributeError if the variables are None (which occurs for non-VLM models).

Suggested change:
-    pixel_values=pixel_values[i].clone(),
-    image_grid_thw=image_grid_thw[i].clone(),
+    pixel_values=pixel_values[i].clone() if pixel_values is not None else None,
+    image_grid_thw=image_grid_thw[i].clone() if image_grid_thw is not None else None,
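Why the guard matters can be shown with plain Python (no torch needed; indexing None mimics the non-VLM case where the variable was never populated):

```python
pixel_values = None  # what a text-only (non-VLM) batch yields
i = 0

# Unguarded indexing on None raises TypeError:
try:
    pixel_values[i]
    raised = False
except TypeError:
    raised = True

# The guarded conditional expression degrades gracefully to None instead:
safe = pixel_values[i] if pixel_values is not None else None
```

The same conditional-expression pattern applies to `.clone()` in the suggestion: the attribute access is only reached when the tensor actually exists.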

Comment on lines +192 to +197
if hasattr(model_config, "dtype") and model_config.dtype is not None:
torch_dtype = model_config.dtype
elif hasattr(model_config, "text_config") and hasattr(model_config.text_config, "dtype"):
torch_dtype = model_config.text_config.dtype
else:
torch_dtype = getattr(model_config, "torch_dtype", "bfloat16")

Severity: medium

The logic for determining torch_dtype can be simplified and made more robust by using getattr with defaults and explicitly checking for None values in nested configurations.

Suggested change:
-    if hasattr(model_config, "dtype") and model_config.dtype is not None:
-        torch_dtype = model_config.dtype
-    elif hasattr(model_config, "text_config") and hasattr(model_config.text_config, "dtype"):
-        torch_dtype = model_config.text_config.dtype
-    else:
-        torch_dtype = getattr(model_config, "torch_dtype", "bfloat16")
+    torch_dtype = getattr(model_config, "dtype", None)
+    if torch_dtype is None and hasattr(model_config, "text_config"):
+        torch_dtype = getattr(model_config.text_config, "dtype", None)
+    if torch_dtype is None:
+        torch_dtype = getattr(model_config, "torch_dtype", "bfloat16")
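A minimal, runnable sketch of the suggested getattr chain (attribute names are taken from the review; SimpleNamespace stands in for a real Hugging Face config object):

```python
from types import SimpleNamespace

def resolve_dtype(model_config, default="bfloat16"):
    """Fall back in order: top-level dtype, nested text_config.dtype,
    legacy torch_dtype attribute, then a hard default."""
    torch_dtype = getattr(model_config, "dtype", None)
    if torch_dtype is None and hasattr(model_config, "text_config"):
        torch_dtype = getattr(model_config.text_config, "dtype", None)
    if torch_dtype is None:
        torch_dtype = getattr(model_config, "torch_dtype", default)
    return torch_dtype

# VLM-style config: dtype lives on the nested text_config.
vlm_cfg = SimpleNamespace(dtype=None, text_config=SimpleNamespace(dtype="float16"))
# Plain config exposing only the legacy torch_dtype attribute.
legacy_cfg = SimpleNamespace(torch_dtype="float32")
```

Unlike the original hasattr/elif chain, this version also handles a text_config whose dtype attribute exists but is None, falling through to the later branches instead of returning None.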

@liusy58 changed the title from "fix" to "Fix multimodal hidden-state preparation for Qwen3-VL models" on Apr 8, 2026