
fix: extend LoRA merge path to pre-quantized models and fix evaluator div-by-zero#151

Open
cpagac wants to merge 7 commits into p-e-w:master from venomx-pentester:fp8-quantization-support

Conversation

@cpagac
Contributor

@cpagac cpagac commented Feb 14, 2026

Summary

  • Fixes a division-by-zero in the evaluator when base_refusals is 0 (returns refusals rather than 0, so ablation that introduces new refusals is penalized correctly)
  • Extends get_merged_model() and obtain_merge_strategy() to detect pre-quantized models (FP8, MXFP4, GPTQ, etc.) via model.config.quantization_config, not just BNB_4BIT — without this, the CPU reload
    path for LoRA merging wouldn't trigger for pre-quantized models, which have the same limitation
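
As a rough sketch of the evaluator fix (illustrative names only; the actual change lives in src/heretic/evaluator.py):

```python
def refusals_score(refusals: int, base_refusals: int) -> float:
    """Sketch of the div-by-zero fix; names are illustrative.

    Previously, dividing by base_refusals raised ZeroDivisionError
    whenever the base model produced zero refusals. Returning the raw
    refusal count in that case means any *new* refusals introduced by
    ablation still push the score up (worse) instead of crashing.
    """
    if base_refusals == 0:
        return float(refusals)
    return refusals / base_refusals

print(refusals_score(3, 0))  # new refusals with a zero-refusal base
print(refusals_score(2, 4))  # normal ratio case
```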

Test plan

  • Verify BNB 4-bit quantization and LoRA merge/export still works
  • Verify LoRA merge/export works for a pre-quantized model (FP8 or MXFP4) - untested; the NVIDIA FP8 format requires vLLM/TensorRT-LLM and can't run inference via standard HuggingFace transformers
  • Verify evaluator handles models with 0 base refusals without crashing

@gemini-code-assist
Contributor

Summary of Changes

Hello @cpagac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the application's capability by integrating comprehensive support for FP8/NVFP4 pre-quantized models. It introduces the necessary configuration, model loading adjustments, and merge strategy considerations to seamlessly work with these models. Additionally, a critical bug fix addresses a division-by-zero scenario in the evaluation logic, improving the robustness of the system.

Highlights

  • FP8/NVFP4 Quantization Support: Introduced support for FP8/NVFP4 pre-quantized models, including a new QuantizationMethod.FP8 enum value and a specific _FP8_DTYPE_TOKEN.
  • Model Loading Logic: Implemented specific model loading logic for FP8 models, using torch.bfloat16 for compute and allowing HuggingFace to auto-detect the model's built-in quantization_config.
  • Merge Strategy Enhancement: Extended the obtain_merge_strategy function to correctly identify and handle FP8 quantized models during the CPU reload path for merging.
  • Evaluator Robustness: Resolved a potential division-by-zero error in the evaluator's get_score method when base_refusals is zero.
  • Optional Dependency: Added fp-quant as a new optional dependency (pip install heretic-llm[fp8]) for FP8 functionality.


Changelog
  • pyproject.toml
    • Added fp8 as an optional dependency, requiring fp-quant>=0.1.0.
  • src/heretic/config.py
    • Added FP8 to the QuantizationMethod enum.
    • Updated the description for the quantization setting to include "fp8 (FP8/NVFP4 on-the-fly quantization)".
  • src/heretic/evaluator.py
    • Modified the refusals_score calculation in get_score to prevent division by zero if self.base_refusals is 0.
  • src/heretic/main.py
    • Imported _FP8_DTYPE_TOKEN from model.
    • Modified obtain_merge_strategy to accept a model argument.
    • Updated obtain_merge_strategy to check for both BNB_4BIT and _FP8_DTYPE_TOKEN when determining if a model is quantized, affecting merge strategy prompts.
    • Updated calls to obtain_merge_strategy to pass the model object.
  • src/heretic/model.py
    • Defined _FP8_DTYPE_TOKEN constant.
    • Added _loaded_dtype attribute to the Model class to track the loaded dtype.
    • Modified the __init__ method's model loading logic to specifically handle _FP8_DTYPE_TOKEN by using torch_dtype=torch.bfloat16 instead of dtype= and letting HuggingFace auto-detect quantization.
    • Updated the success message for FP8 models during loading.
    • Modified _get_quantization_config to consider _FP8_DTYPE_TOKEN when setting compute_dtype for BitsAndBytesConfig.
    • Extended get_merged_model to include _FP8_DTYPE_TOKEN in the check for quantized models requiring special handling.
    • Adjusted reset_model to apply the same FP8-specific loading logic as in __init__.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively introduces support for FP8/NVFP4 quantized models and adds an optional dependency for it. The changes are logical and well-integrated. The PR also includes a small but important fix for a potential division-by-zero error in the evaluator.

While the repository's style guide suggests that pull requests should contain only a single semantic change (rule #9), the included bug fix is minor and improves the codebase. For future contributions, it would be ideal to separate features and bug fixes into distinct pull requests to adhere to the guidelines.

I've added a couple of suggestions in src/heretic/model.py to refactor some duplicated code, which will improve the maintainability of the model loading logic.

Add support for loading FP8/NVFP4 pre-quantized models by introducing
an "fp8" dtype token. Since "fp8" is not a valid PyTorch dtype, the
loader uses torch_dtype=torch.bfloat16 for compute while letting
HuggingFace auto-detect the model's built-in quantization config.

Changes:
- Add FP8 enum value to QuantizationMethod
- Branch on FP8 dtype in model loading, reset, and merge paths
- Extend obtain_merge_strategy to handle FP8 quantized models
- Fix division-by-zero in evaluator when base_refusals is 0
- Add fp8 optional dependency (fp-quant) in pyproject.toml
@cpagac cpagac force-pushed the fp8-quantization-support branch from 5803c07 to 957a31c on February 14, 2026
@kabachuha

NVFP4

What about mxfp4 for older hardware?

@cpagac
Contributor Author

cpagac commented Feb 14, 2026

NVFP4

What about mxfp4 for older hardware?

The implementation here is actually format-agnostic — the "fp8" dtype path simply uses torch_dtype=torch.bfloat16 and lets HuggingFace auto-detect the model's built-in quantization_config.

So if an MXFP4 model is published on HF with the appropriate config, this same loading path should handle it without changes. The naming is admittedly NVFP4-centric since that's what was tested against, but the mechanism itself isn't tied to any specific sub-format. I'm open to renaming the token to something more general if that makes sense.

@cpagac cpagac force-pushed the fp8-quantization-support branch from da75297 to 3925d5e on February 16, 2026
@p-e-w
Owner

p-e-w commented Feb 16, 2026

Please explain in a bit more detail what exactly is going on here. My understanding is that this loads models with their built-in quantization, just like is already supported for MXFP4. If so, what do we need the extra dtype for? If the model tensors are quantized, we always want to load them in that quantized format, no questions asked.

@cpagac
Contributor Author

cpagac commented Feb 16, 2026

Please explain in a bit more detail what exactly is going on here. My understanding is that this loads models with their built-in quantization, just like is already supported for MXFP4. If so, what do we need the extra dtype for? If the model tensors are quantized, we always want to load them in that quantized format, no questions asked.

Fair point. Looking into this more and testing it, dtype="auto" already handles FP8 pre-quantized models correctly: HuggingFace auto-detects the quantization_config from the model's config.json the same way it does for MXFP4. (I confirmed this by loading nvidia/Llama-3.1-8B-Instruct-FP8 with just dtype="auto" and no special handling; it loads and generates fine.)
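
For reference, the auto-detection boils down to metadata already present in the checkpoint. The snippet below simulates it with an illustrative config.json excerpt (field values are made up for illustration; the from_pretrained call is shown in a comment since it needs the actual weights):

```python
import json

# Simulated excerpt of a pre-quantized checkpoint's config.json
# (structure as on a repo like nvidia/Llama-3.1-8B-Instruct-FP8;
# the exact field values here are illustrative).
config_json = json.loads("""
{
  "model_type": "llama",
  "torch_dtype": "bfloat16",
  "quantization_config": {
    "quant_method": "fp8",
    "activation_scheme": "static"
  }
}
""")

# With dtype="auto", transformers reads torch_dtype from the config and
# picks up quantization_config automatically -- no "fp8" dtype token
# is needed:
#
#     model = AutoModelForCausalLM.from_pretrained(repo_id, dtype="auto")
#
# The only check heretic then needs is whether the metadata exists:
is_prequantized = config_json.get("quantization_config") is not None
print(is_prequantized)
```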

My original thinking was that FP8 needed an explicit opt-in, as BNB_4BIT does, where the user tells heretic to quantize, and heretic applies it at load time. As such, I created an "fp8" dtype token following the same pattern. I was originally coding this fork for Nemotron support and ran into an issue on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8.

I can rework the PR to remove the "fp8" token, the torch_dtype workaround, and the fp-quant dependency. The one thing I think is still relevant is get_merged_model(). Right now, it only has a special CPU reload path for BNB_4BIT models, since quantized weights can't have LoRA adapters merged into them directly. Pre-quantized models (FP8, MXFP4, etc.) share the same limitation: you would need to reload the base model in full precision on the CPU, apply the LoRA weights, and then merge. Without that, I think the merge step would either fail or produce a corrupt model.

@p-e-w
Owner

p-e-w commented Feb 17, 2026

Ok. Please trim this PR down to the necessary stuff, then ping me for another review.

- Remove fp8 dtype token, QuantizationMethod.FP8, and fp-quant dependency
- Remove torch_dtype workaround and kwargs dict pattern
- Fix evaluator div-by-zero: return refusals when base_refusals is 0
- Extend quantized model detection in get_merged_model() and
  obtain_merge_strategy() to cover pre-quantized models (FP8, MXFP4,
  GPTQ, etc.) via quantization_config on the model config, not just BNB_4BIT
@cpagac
Contributor Author

cpagac commented Feb 19, 2026

Done. Removed the "fp8" dtype token, QuantizationMethod.FP8, the fp-quant dependency, the torch_dtype workaround, and the kwargs dict pattern.

Two things left: the evaluator div-by-zero fix, and an extension to get_merged_model() and obtain_merge_strategy() that detects pre-quantized models (FP8, MXFP4, GPTQ, etc.) via model.config.quantization_config rather than just checking for BNB_4BIT. Without that, the CPU reload path for LoRA merging wouldn't trigger for pre-quantized models, which have the same limitation.
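
As a sketch, that detection is a single attribute check (SimpleNamespace stands in for a transformers PretrainedConfig here; the helper name is illustrative):

```python
from types import SimpleNamespace

def is_quantized_model(config) -> bool:
    """Sketch of the detection used in get_merged_model() /
    obtain_merge_strategy(). Any pre-quantized checkpoint (FP8, MXFP4,
    GPTQ, AWQ, ...) carries a quantization_config on its model config,
    so one attribute check covers them all instead of special-casing
    BNB_4BIT. (Illustrative; the real code inspects the config of the
    loaded model object.)"""
    return getattr(config, "quantization_config", None) is not None

# A pre-quantized model's config carries quantization metadata:
fp8_config = SimpleNamespace(quantization_config={"quant_method": "fp8"})
# A full-precision model's config has no such attribute:
fp16_config = SimpleNamespace()

print(is_quantized_model(fp8_config), is_quantized_model(fp16_config))
```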

Ready for re-review @p-e-w

@cpagac cpagac changed the title from "feat: add FP8/NVFP4 quantization support" to "fix: extend LoRA merge path to pre-quantized models and fix evaluator div-by-zero" on Feb 19, 2026
"""

if settings.quantization == QuantizationMethod.BNB_4BIT:
is_quantized = getattr(model.model.config, "quantization_config", None) is not None
@p-e-w commented on this diff:

Does this work for models quantized on-the-fly by Heretic, i.e., with bitsandbytes?

if settings.quantization == QuantizationMethod.NONE
else " (requires sufficient RAM)"
),
title="Merge LoRA into full model (requires sufficient RAM)",
@p-e-w commented on this diff:
Please undo this change; the code is the way it is for a good reason (#152).

@p-e-w
Owner

p-e-w commented Feb 20, 2026

Which models have you tested this with? Can you link to Hugging Face uploads made with this PR?

@p-e-w
Owner

p-e-w commented Mar 6, 2026

Any update?

@cpagac
Contributor Author

cpagac commented Mar 7, 2026

Any update?

Yes, sorry. I'm currently a student so time can be tight, and I've been stretched thin with homework, but I'm working on testing with a few more models and should be done by this weekend.
