
[Examples] Add Gemma 4 E4B NVFP4A16 quantization example#2561

Open
2imi9 wants to merge 6 commits into vllm-project:main from 2imi9:add-gemma4-nvfp4-example

Conversation

@2imi9
Contributor

@2imi9 2imi9 commented Apr 3, 2026

Summary

  • Add NVFP4A16 weight-only quantization example for Google Gemma 4 E4B-it (multimodal)

Details

Key differences from other examples:

  • Uses AutoModelForImageTextToText + AutoProcessor (multimodal model)
  • Ignores vision_tower, audio_tower, embed_vision, embed_audio modules (Gemma 4 architecture, different from Gemma 3's multi_modal_projector)

Quantized model: 2imi9/gemma-4-E4B-it-NVFP4A16

Test plan

  • Ran quantization end-to-end on NVIDIA GPU
  • Verified sample generation produces coherent output after quantization
  • Uploaded quantized model to HuggingFace

Add NVFP4A16 weight-only quantization example for google/gemma-4-E4B-it.
Includes a Dockerfile since Gemma 4 requires transformers from git main,
which is newer than the version currently pinned by llmcompressor.

The ignore list skips vision_tower, audio_tower, embed_vision, and
embed_audio modules which are specific to Gemma 4's multimodal
architecture. Uses AutoModelForImageTextToText and AutoProcessor
as required by the Gemma 4 model class.

Tested end-to-end: quantization, sample generation, and model saving
all complete successfully.

Signed-off-by: Ziming <frankziming26@outlook.com>
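The ignore-list behavior described above can be sketched in a few lines. This is a minimal, self-contained illustration of how pattern-based ignores decide which submodules are left unquantized; the module names below are hypothetical examples, not names read from the actual gemma-4-E4B-it checkpoint, and the real filtering is done inside llmcompressor's recipe handling rather than by this helper.

```python
# Hypothetical sketch: how an ignore list like the one in this example
# decides which Linear submodules are skipped during weight-only quantization.
IGNORE = ["vision_tower", "audio_tower", "embed_vision", "embed_audio", "lm_head"]


def should_quantize(name: str) -> bool:
    """Quantize a module only if no ignore pattern appears in its dotted name."""
    return not any(pattern in name for pattern in IGNORE)


# Illustrative module names (not taken from the real checkpoint):
modules = [
    "model.language_model.layers.0.self_attn.q_proj",  # quantized
    "model.vision_tower.encoder.layers.0.mlp.fc1",     # skipped
    "model.audio_tower.conformer.0.ffw_layer_1",       # skipped
    "model.embed_vision.embedding_projection",         # skipped
]
kept = [m for m in modules if should_quantize(m)]
print(kept)  # ['model.language_model.layers.0.self_attn.q_proj']
```

The effect is that only the language-model Linear layers receive NVFP4A16 weights, while the vision and audio towers stay in their original precision.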
@github-actions

github-actions bot commented Apr 3, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: this label is required to complete the testing suite; please only add it once the PR is code complete and local testing has been performed.

@mergify mergify bot added the "documentation" label (Improvements or additions to documentation) on Apr 3, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for Gemma 4 quantization by adding a dedicated Dockerfile and an example script using the NVFP4A16 scheme. The review feedback suggests explicitly passing the processor to the oneshot function to ensure correct calibration and using the standard torch_dtype parameter instead of dtype when loading the model to maintain consistency with the transformers library.
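The two review suggestions can be shown as a call-shape sketch. The `oneshot` below is a local stand-in for llmcompressor's entrypoint (the real function runs calibration and is not invoked here), and the keyword names are assumed from the review text rather than verified against the PR diff.

```python
# Stand-in for llmcompressor's oneshot entrypoint, used only to show which
# keyword arguments the review feedback wants supplied.
def oneshot(**kwargs):
    # Record which keyword arguments were passed.
    return sorted(kwargs)


# Point 1: pass the processor explicitly so multimodal calibration samples
# are preprocessed by the model's own tokenizer/image processor.
# Point 2: load the model with transformers' standard `torch_dtype` argument,
# e.g. AutoModelForImageTextToText.from_pretrained(..., torch_dtype="auto"),
# rather than a nonstandard `dtype` keyword.
passed = oneshot(
    model="model",          # AutoModelForImageTextToText instance in the script
    processor="processor",  # AutoProcessor instance in the script
    recipe="recipe",        # NVFP4A16 quantization recipe
    dataset="dataset",
)
print(passed)  # ['dataset', 'model', 'processor', 'recipe']
```

Without an explicit processor, oneshot would have to infer one from the model path, which is less reliable for multimodal checkpoints.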

@2imi9
Contributor Author

2imi9 commented Apr 3, 2026

This PR is ready for review. Could a maintainer please add the ready label?

@2imi9
Contributor Author

2imi9 commented Apr 3, 2026

vLLM serving update: Both quantized models have been tested with vLLM nightly. A minor weight loading fix was needed for the audio tower's Gemma4ClippableLinear layers — PR: vllm-project/vllm#38875.

Once that merges, both models can be served with:

vllm serve 2imi9/gemma-4-E2B-it-NVFP4A16 --trust-remote-code
vllm serve 2imi9/gemma-4-E4B-it-NVFP4A16 --trust-remote-code

Collaborator

@kylesayrs kylesayrs left a comment


Looks great! @brian-dellabetta could you validate that this runs locally?

@2imi9 2imi9 force-pushed the add-gemma4-nvfp4-example branch from 6b9a2bb to 786812d on April 3, 2026 06:11
Per review feedback, remove the standalone Dockerfile and add
install instructions as comments in the example script.

Signed-off-by: Ziming <frankziming26@outlook.com>
@2imi9 2imi9 force-pushed the add-gemma4-nvfp4-example branch from 786812d to eeb751b on April 3, 2026 06:12
Collaborator

@brian-dellabetta brian-dellabetta left a comment


Thanks for preparing! A couple of nits.

@brian-dellabetta
Collaborator

Looks great! @brian-dellabetta could you validate that this runs locally?

Hi @2imi9, I was able to run your script and serve the model in vLLM even without your corresponding PR vllm-project/vllm#38875.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL_ID = "/path/to/gemma-4-E4B-it-NVFP4A16"

messages = [{"role": "user", "content": "What is Helium?"}]

# Sampling parameters: top-p/top-k sampling at temperature 1.0 (not greedy)
sampling_params = SamplingParams(
    temperature=1.0, top_p=0.95, top_k=64, max_tokens=1000, min_tokens=500
)
if __name__ == "__main__":
    llm = LLM(
        MODEL_ID,
        tensor_parallel_size=2,
        max_model_len=4096,
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    output = llm.generate(prompt, sampling_params)
    for out in output:
        print(out.outputs[0].text)

Helium ($\text{He}$) is a fascinating substance with many ways to describe it, depending on the context. Here is a comprehensive answer covering what it is scientifically, its properties, uses, and why it's so unique.


🧪 Scientific Definition

Helium is a chemical element with the symbol $\text{He}$ and atomic number 2$.

  • Atomic Structure: It is a noble gas, meaning it is found in its own group on the periodic table (Group 18). A neutral helium atom has two protons and two electrons, which are arranged in the lowest energy configuration.
  • Electron Configuration: The atom has a stable, full outermost electron shell (a valence shell), which accounts for its inertness under normal conditions.
  • ** Isotopes:** The most common isotope is Helium-4 ($\text{}^4\text{He}$), which has 2 protons, 2 electrons, and is overwhelmingly the isotope used commercially.

⭐ Key Properties of Helium

Helium is prized for its unique combination of physical properties:

  1. ** Inertness (Noble Gas):** It is largely unreactive with other elements under standard conditions. It does not readily form compounds, which is why it is so stable.
  2. ** Density:** It is extremely light (below that of hydrogen, though heavier than hydrogen isotopes).
  3. ** Electrical Conductivity:** It is generally considered a poor electrical conductor in its gaseous state, though this changes drastically when solidified.
  4. ** Thermal Conductivity:** It is an excellent conductor of heat at very low temperatures.
  5. ** High Transparency:** It passes through various materials largely unhindered, though this is generally only relevant in high-vacuum applications.

🧊 Unique States and Behavior

Helium exhibits unique properties that are leveraged in many high-tech applications:

  • Liquefaction: It can be easily liquefied by cooling it to extremely low temperatures (below its boiling point of about 4.2 K or $-269^\circ \text{C}$) under relatively low pressure.
  • ** Superfluidity:** When cooled to incredibly low temperatures, helium enters a state known as Helium Beta Transition, where it exhibits properties of a superfluid below the $\text{2.17 K}$ point (a state where viscosity drops to zero). This is highly relevant in advanced cryogenics.

…ers>=5.5

Signed-off-by: Ziming <frankziming26@outlook.com>
@2imi9
Contributor Author

2imi9 commented Apr 3, 2026

Thanks for confirming. The input_max loading error only occurs on the TransformersMultiModalForCausalLM fallback path (vLLM 0.18.1). Since main now has native Gemma4ForConditionalGeneration support via #38826, this fix is no longer needed. I'm closing it.

@brian-dellabetta brian-dellabetta added the "ready" label (When a PR is ready for review) on Apr 3, 2026
Collaborator

@brian-dellabetta brian-dellabetta left a comment


Thanks for the contribution! The example and model checkpoint work for me (sample output posted above).

@mergify
Contributor

mergify bot commented Apr 3, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need the dev
optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@brian-dellabetta
Collaborator

@2imi9 please run quality checks on the branch in order to pass CI:

The quality checks have failed. Please run make style and make quality under the root directory to address the lint failures. You will need the dev optional dependencies to get the required linting packages: https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Signed-off-by: Ziming <frankziming26@outlook.com>
@2imi9
Contributor Author

2imi9 commented Apr 3, 2026

Fixed the line length lint error. Quality checks pass now.

@mergify mergify bot removed the quality-failed label Apr 3, 2026
@2imi9 2imi9 requested a review from kylesayrs April 4, 2026 02:52