[Examples] Add Gemma 4 E4B NVFP4A16 quantization example #2561
2imi9 wants to merge 6 commits into vllm-project:main
Conversation
Add NVFP4A16 weight-only quantization example for google/gemma-4-E4B-it. Includes a Dockerfile, since Gemma 4 requires transformers from git main, which is newer than the version currently pinned by llmcompressor. The ignore list skips the vision_tower, audio_tower, embed_vision, and embed_audio modules, which are specific to Gemma 4's multimodal architecture. Uses AutoModelForImageTextToText and AutoProcessor, as required by the Gemma 4 model class.

Tested end-to-end: quantization, sample generation, and model saving all complete successfully.

Signed-off-by: Ziming <frankziming26@outlook.com>
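For readers unfamiliar with the scheme: NVFP4A16 stores weights in 4-bit floating point (E2M1) with a shared scale per small block, while activations stay in 16-bit. A toy sketch of the per-block idea in pure Python — illustrative only; the real format uses FP8 block scales and packing details defined by the NVFP4 spec, and the grid/scale choices below are simplifying assumptions:

```python
# Positive magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Fake-quantize one block of weights to an E2M1 grid with a shared scale."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0  # map the largest magnitude onto the largest FP4 value
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable FP4 value,
        # then rescale back; sign is carried separately.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale * (1 if x >= 0 else -1))
    return out

print(quantize_block([0.1, -0.4, 0.75, -1.5]))  # [0.125, -0.375, 0.75, -1.5]
```

Note how values near the block maximum survive almost exactly while small values are coarsely rounded — which is why per-block (rather than per-tensor) scales matter for 4-bit formats.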
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request introduces support for Gemma 4 quantization by adding a dedicated Dockerfile and an example script using the NVFP4A16 scheme. The review feedback suggests explicitly passing the processor to the oneshot function to ensure correct calibration and using the standard torch_dtype parameter instead of dtype when loading the model to maintain consistency with the transformers library.
This PR is ready for review. Could a maintainer please add the ready label?
vLLM serving update: Both quantized models have been tested with vLLM nightly. A minor weight loading fix was needed for the audio tower's `input_max` parameter. Once that merges, both models can be served with:

```shell
vllm serve 2imi9/gemma-4-E2B-it-NVFP4A16 --trust-remote-code
vllm serve 2imi9/gemma-4-E4B-it-NVFP4A16 --trust-remote-code
```
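Once served, the model is reachable through vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. A minimal sketch of the request body — the URL assumes the default port of the `vllm serve` command above, and the actual network call is left commented out:

```python
import json

# Request body for the OpenAI-compatible chat completions API.
# Model name matches the `vllm serve` command above.
payload = {
    "model": "2imi9/gemma-4-E4B-it-NVFP4A16",
    "messages": [{"role": "user", "content": "What is Helium?"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 1000,
}
body = json.dumps(payload).encode("utf-8")

# To actually send it (requires the server to be running):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```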
kylesayrs left a comment
Looks great! @brian-dellabetta could you validate that this runs locally?
Force-pushed 6b9a2bb to 786812d
Per review feedback, remove the standalone Dockerfile and add install instructions as comments in the example script. Signed-off-by: Ziming <frankziming26@outlook.com>
Force-pushed 786812d to eeb751b
brian-dellabetta left a comment
Thanks for preparing! A couple of nits.
Hi @2imi9, I am able to run your script and run the model in vllm without your corresponding PR vllm-project/vllm#38875.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL_ID = "/path/to/gemma-4-E4B-it-NVFP4A16"

messages = [{"role": "user", "content": "What is Helium?"}]

# Create a sampling params object
sampling_params = SamplingParams(
    temperature=1.0, top_p=0.95, top_k=64, max_tokens=1000, min_tokens=500
)

if __name__ == "__main__":
    llm = LLM(
        MODEL_ID,
        tensor_parallel_size=2,
        max_model_len=4096,
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    output = llm.generate(prompt, sampling_params)
    for out in output:
        print(out.outputs[0].text)
```

Sample output (truncated):

> Helium 🧪 Scientific Definition: Helium is a chemical element with the symbol $\text{He}$ and atomic number 2.
> ⭐ Key Properties of Helium: Helium is prized for its unique combination of physical properties:
> 🧊 Unique States and Behavior: Helium exhibits unique properties that are leveraged in many high-tech applications:
…ers>=5.5 Signed-off-by: Ziming <frankziming26@outlook.com>
Thanks for confirming. The `input_max` loading error only occurs on the TransformersMultiModalForCausalLM fallback path (vLLM 0.18.1). Since main now has native Gemma4ForConditionalGeneration via #38826, this fix is no longer needed. I'm closing it.
brian-dellabetta left a comment
Thanks for the contribution! Example and model checkpoint work for me (posted sample output above).
The quality checks have failed. Please run the quality checks locally and push the fixes.
@2imi9 please run quality checks on the branch in order to pass CI.
Signed-off-by: Ziming <frankziming26@outlook.com>
Fixed the line length lint error. Quality checks pass now.
Summary
Details
Key differences from other examples:
- `AutoModelForImageTextToText` + `AutoProcessor` (multimodal model)
- Ignore list skips `vision_tower`, `audio_tower`, `embed_vision`, `embed_audio` modules (Gemma 4 architecture, different from Gemma 3's `multi_modal_projector`)

Quantized model: 2imi9/gemma-4-E4B-it-NVFP4A16
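To make the effect of the ignore list concrete, here is a small pure-Python sketch of how such patterns partition a model's modules into quantized and skipped sets. The module names are hypothetical stand-ins for what `model.named_modules()` would return, and including `lm_head` in the ignore set is an assumption borrowed from common quantization-example practice, not from this PR:

```python
from fnmatch import fnmatch

# Hypothetical module names; real names come from model.named_modules().
MODULES = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.vision_tower.encoder.layers.0.attn.qkv",
    "model.audio_tower.conformer.0.ffw_layer_1",
    "model.embed_vision.embedding_projection",
    "model.embed_audio.embedding_projection",
    "lm_head",
]

# Glob patterns mirroring the example's ignore list, plus lm_head (assumed).
IGNORE_PATTERNS = [
    "*vision_tower*",
    "*audio_tower*",
    "*embed_vision*",
    "*embed_audio*",
    "lm_head",
]

def is_ignored(name: str) -> bool:
    """True if any ignore pattern matches the module name."""
    return any(fnmatch(name, pat) for pat in IGNORE_PATTERNS)

quantized = [m for m in MODULES if not is_ignored(m)]
print(quantized)  # only the language-model projection survives
```

This is why the multimodal towers stay in full precision: every module under them matches an ignore pattern, leaving only the text-decoder linear layers eligible for NVFP4A16.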
Test plan