fix: extend LoRA merge path to pre-quantized models and fix evaluator div-by-zero #151

cpagac wants to merge 7 commits into p-e-w:master
Conversation
Summary of Changes

Hello @cpagac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the application by integrating support for FP8/NVFP4 pre-quantized models: it introduces the necessary configuration, model loading adjustments, and merge strategy considerations to work with these models. Additionally, a bug fix addresses a division-by-zero scenario in the evaluation logic, improving the robustness of the system.
Activity
Code Review
This pull request effectively introduces support for FP8/NVFP4 quantized models and adds an optional dependency for it. The changes are logical and well-integrated. The PR also includes a small but important fix for a potential division-by-zero error in the evaluator.
While the repository's style guide suggests that pull requests should contain only a single semantic change (rule #9), the included bug fix is minor and improves the codebase. For future contributions, it would be ideal to separate features and bug fixes into distinct pull requests to adhere to the guidelines.
I've added a couple of suggestions in src/heretic/model.py to refactor some duplicated code, which will improve the maintainability of the model loading logic.
Add support for loading FP8/NVFP4 pre-quantized models by introducing an "fp8" dtype token. Since "fp8" is not a valid PyTorch dtype, the loader uses torch_dtype=torch.bfloat16 for compute while letting HuggingFace auto-detect the model's built-in quantization config.

Changes:
- Add FP8 enum value to QuantizationMethod
- Branch on FP8 dtype in model loading, reset, and merge paths
- Extend obtain_merge_strategy to handle FP8 quantized models
- Fix division-by-zero in evaluator when base_refusals is 0
- Add fp8 optional dependency (fp-quant) in pyproject.toml
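A minimal sketch of what that loading branch could look like; the function and argument names are illustrative, not heretic's actual code:

```python
import torch
from transformers import AutoModelForCausalLM

def load_model(model_id: str, dtype_token: str):
    """Hypothetical loader sketch for the "fp8" dtype token described above."""
    if dtype_token == "fp8":
        # "fp8" is not a real torch dtype: use bfloat16 as the compute dtype and
        # let transformers pick up the quantization_config baked into the
        # checkpoint's config.json, so the FP8/NVFP4 weights stay quantized.
        return AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
    # Every other token maps directly onto a torch dtype (or "auto").
    torch_dtype = dtype_token if dtype_token == "auto" else getattr(torch, dtype_token)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        device_map="auto",
    )
```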
Force-pushed from 5803c07 to 957a31c
What about MXFP4 for older hardware?
The implementation here is actually format-agnostic — the "fp8" dtype path simply uses torch_dtype=torch.bfloat16 and lets HuggingFace auto-detect the model's built-in quantization_config. So if an MXFP4 model is published on HF with the appropriate config, this same loading path should handle it without changes. The naming is admittedly NVFP4-centric since that's what was tested against, but the mechanism itself isn't tied to any specific sub-format. Open to renaming the token to something more general that conveys this better, if that makes sense.
Force-pushed from da75297 to 3925d5e
Please explain in a bit more detail what exactly is going on here. My understanding is that this loads models with their built-in quantization, just as is already supported for MXFP4. If so, what do we need the extra
Fair point. Looking into this more and testing it, dtype="auto" already handles FP8 pre-quantized models correctly: HuggingFace auto-detects the quantization_config from the model's config.json the same way it does for MXFP4. (I confirmed this by loading nvidia/Llama-3.1-8B-Instruct-FP8 with just dtype="auto" and no special handling, and it loads and generates fine.) My original thinking was that FP8 needed an explicit opt-in, as BNB_4BIT does, where the user tells heretic to quantize and heretic applies it at load time, so I created an "fp8" dtype token following the same pattern. I was originally coding this fork for Nemotron support and ran into an issue with nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8. I can rework the PR to remove the "fp8" token, the torch_dtype workaround, and the fp-quant dependency.

The one thing I think is still relevant is get_merged_model(). Right now it only has a special CPU reload path for BNB_4BIT models, since quantized weights can't have LoRA adapters merged into them directly. Pre-quantized models (FP8, MXFP4, etc.) share the same limitation: you would need to reload the base model in full precision on the CPU, apply the LoRA weights, and then merge. Without that, I think the merge step would either fail or produce a corrupt model.
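For reference, the dtype="auto" behaviour described above is just the standard loading call — transformers reads the quantization_config from the checkpoint's config.json on its own. A sketch, assuming hardware and kernels that support the checkpoint's FP8 format:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"  # FP8 checkpoint mentioned above

# No fp8-specific handling: the built-in quantization_config is auto-detected.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

inputs = tokenizer("Hello there.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```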
Ok. Please trim this PR down to the necessary stuff, then ping me for another review.
- Remove fp8 dtype token, QuantizationMethod.FP8, and fp-quant dependency
- Remove torch_dtype workaround and kwargs dict pattern
- Fix evaluator div-by-zero: return refusals when base_refusals is 0
- Extend quantized model detection in get_merged_model() and obtain_merge_strategy() to cover pre-quantized models (FP8, MXFP4, GPTQ, etc.) via quantization_config on the model config, not just BNB_4BIT
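The evaluator change listed above amounts to guarding the denominator; a rough sketch in which the function name is hypothetical and only the base_refusals == 0 behaviour mirrors the described fix:

```python
def refusal_score(refusals: int, base_refusals: int) -> float:
    # Dividing by base_refusals would raise ZeroDivisionError for a base model
    # that never refuses; fall back to the raw refusal count in that case.
    if base_refusals == 0:
        return float(refusals)
    return refusals / base_refusals
```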
Done. Removed the "fp8" dtype token, QuantizationMethod.FP8, the fp-quant dependency, the torch_dtype workaround, and the kwargs dict pattern. Two things are left: the evaluator div-by-zero fix, and an extension to get_merged_model() and obtain_merge_strategy() that detects pre-quantized models (FP8, MXFP4, GPTQ, etc.) via model.config.quantization_config rather than just checking for BNB_4BIT. Without that, the CPU reload path for LoRA merging wouldn't trigger for pre-quantized models, which have the same limitation. Ready for re-review @p-e-w
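A rough sketch of the CPU reload path being described, using PEFT; the paths are placeholders, it assumes a full-precision copy of the base weights is available, and heretic's actual get_merged_model() will differ in the details:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Quantized weights (BNB 4-bit, FP8, MXFP4, GPTQ, ...) can't have LoRA deltas
# merged into them directly, so reload the base model in full precision on the CPU.
base = AutoModelForCausalLM.from_pretrained(
    "path/to/base-model",          # placeholder
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)

# Attach the trained LoRA adapter, fold it into the full-precision weights,
# and save the resulting standalone model.
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder
model = model.merge_and_unload()
model.save_pretrained("path/to/merged-model")  # placeholder
```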
| """ | ||
|
|
||
| if settings.quantization == QuantizationMethod.BNB_4BIT: | ||
| is_quantized = getattr(model.model.config, "quantization_config", None) is not None |
Does this work for models quantized on-the-fly by Heretic, i.e., with bitsandbytes?
src/heretic/main.py (outdated)
-       if settings.quantization == QuantizationMethod.NONE
-       else " (requires sufficient RAM)"
-   ),
+   title="Merge LoRA into full model (requires sufficient RAM)",
Please undo this change; the code is the way it is for a good reason (#152).
Which models have you tested this with? Can you link to Hugging Face uploads made with this PR?
Any update?
Yes, sorry. I'm currently a student, so time can be tight; I've been stretched thin with homework, but I'm working on testing with a few more models and should be done by this weekend.
Summary

- Fix evaluator div-by-zero when base_refusals is 0
- Extend quantized model detection to pre-quantized models via quantization_config; without this, the CPU reload path for LoRA merging wouldn't trigger for pre-quantized models, which have the same limitation
Test plan