[Examples] Add Gemma 4 E4B NVFP4A16 quantization example #2561
2imi9 wants to merge 6 commits into vllm-project:main
Conversation
Add NVFP4A16 weight-only quantization example for google/gemma-4-E4B-it. Includes a Dockerfile, since Gemma 4 requires transformers from git main, which is newer than the version currently pinned by llmcompressor. The ignore list skips the vision_tower, audio_tower, embed_vision, and embed_audio modules, which are specific to Gemma 4's multimodal architecture. Uses AutoModelForImageTextToText and AutoProcessor, as required by the Gemma 4 model class.

Tested end-to-end: quantization, sample generation, and model saving all complete successfully.

Signed-off-by: Ziming <frankziming26@outlook.com>
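For readers unfamiliar with the scheme: NVFP4A16 stores weights in 4-bit floating point (E2M1) with a shared scale per small block, while activations stay in 16-bit. A toy sketch of the per-block idea in pure Python — illustrative only; the real format uses FP8 block scales and packing details defined by the NVFP4 spec, and the grid/scale choices below are simplifying assumptions:

```python
# Positive magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Fake-quantize one block of weights to an E2M1 grid with a shared scale."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0  # map the largest magnitude onto the largest FP4 value
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable FP4 value,
        # then rescale back; sign is carried separately.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale * (1 if x >= 0 else -1))
    return out

print(quantize_block([0.1, -0.4, 0.75, -1.5]))  # [0.125, -0.375, 0.75, -1.5]
```

Note how values near the block maximum survive almost exactly while small values are coarsely rounded — which is why per-block (rather than per-tensor) scales matter for 4-bit formats.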
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request introduces support for Gemma 4 quantization by adding a dedicated Dockerfile and an example script using the NVFP4A16 scheme. The review feedback suggests explicitly passing the processor to the oneshot function to ensure correct calibration and using the standard torch_dtype parameter instead of dtype when loading the model to maintain consistency with the transformers library.
This PR is ready for review. Could a maintainer please add the ready label?
vLLM serving update: Both quantized models have been tested with vLLM nightly. A minor weight loading fix was needed for the audio tower's `input_max` parameter. Once that merges, both models can be served with:

```shell
vllm serve 2imi9/gemma-4-E2B-it-NVFP4A16 --trust-remote-code
vllm serve 2imi9/gemma-4-E4B-it-NVFP4A16 --trust-remote-code
```
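Once served, the model is reachable through vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. A minimal sketch of the request body — the URL assumes the default port of the `vllm serve` command above, and the actual network call is left commented out:

```python
import json

# Request body for the OpenAI-compatible chat completions API.
# Model name matches the `vllm serve` command above.
payload = {
    "model": "2imi9/gemma-4-E4B-it-NVFP4A16",
    "messages": [{"role": "user", "content": "What is Helium?"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 1000,
}
body = json.dumps(payload).encode("utf-8")

# To actually send it (requires the server to be running):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```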
kylesayrs left a comment
Looks great! @brian-dellabetta could you validate that this runs locally?
Force-pushed 6b9a2bb to 786812d
Per review feedback, remove the standalone Dockerfile and add install instructions as comments in the example script. Signed-off-by: Ziming <frankziming26@outlook.com>
Force-pushed 786812d to eeb751b
brian-dellabetta left a comment
Thanks for preparing! A couple of nits.
Hi @2imi9, I am able to run your script and run the model in vllm without your corresponding PR vllm-project/vllm#38875.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL_ID = "/path/to/gemma-4-E4B-it-NVFP4A16"

messages = [{"role": "user", "content": "What is Helium?"}]

# Create a sampling params object
sampling_params = SamplingParams(
    temperature=1.0, top_p=0.95, top_k=64, max_tokens=1000, min_tokens=500
)

if __name__ == "__main__":
    llm = LLM(
        MODEL_ID,
        tensor_parallel_size=2,
        max_model_len=4096,
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    output = llm.generate(prompt, sampling_params)
    for out in output:
        print(out.outputs[0].text)
```

Sample output (truncated):

> Helium 🧪 Scientific Definition: Helium is a chemical element with the symbol $\text{He}$ and atomic number 2.
> ⭐ Key Properties of Helium: Helium is prized for its unique combination of physical properties:
> 🧊 Unique States and Behavior: Helium exhibits unique properties that are leveraged in many high-tech applications:
…ers>=5.5 Signed-off-by: Ziming <frankziming26@outlook.com>
Thanks for confirming. The `input_max` loading error only occurs on the TransformersMultiModalForCausalLM fallback path (vLLM 0.18.1). Since main now has native Gemma4ForConditionalGeneration via #38826, this fix is no longer needed. I'm closing it.
brian-dellabetta left a comment
Thanks for the contribution! Example and model checkpoint work for me (posted sample output above).
The quality checks have failed. Please run the quality checks locally and push the fixes.
@2imi9 please run quality checks on the branch in order to pass CI.
Signed-off-by: Ziming <frankziming26@outlook.com>
Fixed the line length lint error. Quality checks pass now.
Summary
Details
Key differences from other examples:
- `AutoModelForImageTextToText` + `AutoProcessor` (multimodal model)
- Ignore list skips `vision_tower`, `audio_tower`, `embed_vision`, `embed_audio` modules (Gemma 4 architecture, different from Gemma 3's `multi_modal_projector`)

Quantized model: 2imi9/gemma-4-E4B-it-NVFP4A16
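To make the effect of the ignore list concrete, here is a small pure-Python sketch of how such patterns partition a model's modules into quantized and skipped sets. The module names are hypothetical stand-ins for what `model.named_modules()` would return, and including `lm_head` in the ignore set is an assumption borrowed from common quantization-example practice, not from this PR:

```python
from fnmatch import fnmatch

# Hypothetical module names; real names come from model.named_modules().
MODULES = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.vision_tower.encoder.layers.0.attn.qkv",
    "model.audio_tower.conformer.0.ffw_layer_1",
    "model.embed_vision.embedding_projection",
    "model.embed_audio.embedding_projection",
    "lm_head",
]

# Glob patterns mirroring the example's ignore list, plus lm_head (assumed).
IGNORE_PATTERNS = [
    "*vision_tower*",
    "*audio_tower*",
    "*embed_vision*",
    "*embed_audio*",
    "lm_head",
]

def is_ignored(name: str) -> bool:
    """True if any ignore pattern matches the module name."""
    return any(fnmatch(name, pat) for pat in IGNORE_PATTERNS)

quantized = [m for m in MODULES if not is_ignored(m)]
print(quantized)  # only the language-model projection survives
```

This is why the multimodal towers stay in full precision: every module under them matches an ignore pattern, leaving only the text-decoder linear layers eligible for NVFP4A16.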
Test plan