
fix: extend LoRA merge path to pre-quantized models and fix evaluator div-by-zero#151

Open
cpagac wants to merge 7 commits into p-e-w:master from venomx-pentester:fp8-quantization-support

Conversation

@cpagac
Contributor

@cpagac cpagac commented Feb 14, 2026

Summary

  • Fixes a division-by-zero in the evaluator when base_refusals is 0 (returns refusals rather than 0, so ablation that introduces new refusals is penalized correctly)
  • Extends get_merged_model() and obtain_merge_strategy() to detect pre-quantized models (FP8, MXFP4, GPTQ, etc.) via model.config.quantization_config, not just BNB_4BIT — without this, the CPU reload
    path for LoRA merging wouldn't trigger for pre-quantized models, which have the same limitation
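
As a rough sketch of the evaluator fix (illustrative names only; the actual change lives in src/heretic/evaluator.py):

```python
def refusals_score(refusals: int, base_refusals: int) -> float:
    """Sketch of the div-by-zero fix; names are illustrative.

    Previously, dividing by base_refusals raised ZeroDivisionError
    whenever the base model produced zero refusals. Returning the raw
    refusal count in that case means any *new* refusals introduced by
    ablation still push the score up (worse) instead of crashing.
    """
    if base_refusals == 0:
        return float(refusals)
    return refusals / base_refusals

print(refusals_score(3, 0))  # new refusals with a zero-refusal base
print(refusals_score(2, 4))  # normal ratio case
```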

Test plan

  • Verify BNB 4-bit quantization and LoRA merge/export still works
  • Verify LoRA merge/export works for a pre-quantized model (FP8 or MXFP4) - untested; the NVIDIA FP8 format requires vLLM/TensorRT-LLM and can't run inference via standard HuggingFace transformers
  • Verify evaluator handles models with 0 base refusals without crashing

@gemini-code-assist
Contributor

Summary of Changes

Hello @cpagac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the application's capability by integrating comprehensive support for FP8/NVFP4 pre-quantized models. It introduces the necessary configuration, model loading adjustments, and merge strategy considerations to seamlessly work with these models. Additionally, a critical bug fix addresses a division-by-zero scenario in the evaluation logic, improving the robustness of the system.

Highlights

  • FP8/NVFP4 Quantization Support: Introduced support for FP8/NVFP4 pre-quantized models, including a new QuantizationMethod.FP8 enum value and a specific _FP8_DTYPE_TOKEN.
  • Model Loading Logic: Implemented specific model loading logic for FP8 models, using torch.bfloat16 for compute and allowing HuggingFace to auto-detect the model's built-in quantization_config.
  • Merge Strategy Enhancement: Extended the obtain_merge_strategy function to correctly identify and handle FP8 quantized models during the CPU reload path for merging.
  • Evaluator Robustness: Resolved a potential division-by-zero error in the evaluator's get_score method when base_refusals is zero.
  • Optional Dependency: Added fp-quant as a new optional dependency (pip install heretic-llm[fp8]) for FP8 functionality.


Changelog
  • pyproject.toml
    • Added fp8 as an optional dependency, requiring fp-quant>=0.1.0.
  • src/heretic/config.py
    • Added FP8 to the QuantizationMethod enum.
    • Updated the description for the quantization setting to include "fp8 (FP8/NVFP4 on-the-fly quantization)".
  • src/heretic/evaluator.py
    • Modified the refusals_score calculation in get_score to prevent division by zero if self.base_refusals is 0.
  • src/heretic/main.py
    • Imported _FP8_DTYPE_TOKEN from model.
    • Modified obtain_merge_strategy to accept a model argument.
    • Updated obtain_merge_strategy to check for both BNB_4BIT and _FP8_DTYPE_TOKEN when determining if a model is quantized, affecting merge strategy prompts.
    • Updated calls to obtain_merge_strategy to pass the model object.
  • src/heretic/model.py
    • Defined _FP8_DTYPE_TOKEN constant.
    • Added _loaded_dtype attribute to the Model class to track the loaded dtype.
    • Modified the __init__ method's model loading logic to specifically handle _FP8_DTYPE_TOKEN by using torch_dtype=torch.bfloat16 instead of dtype= and letting HuggingFace auto-detect quantization.
    • Updated the success message for FP8 models during loading.
    • Modified _get_quantization_config to consider _FP8_DTYPE_TOKEN when setting compute_dtype for BitsAndBytesConfig.
    • Extended get_merged_model to include _FP8_DTYPE_TOKEN in the check for quantized models requiring special handling.
    • Adjusted reset_model to apply the same FP8-specific loading logic as in __init__.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively introduces support for FP8/NVFP4 quantized models and adds an optional dependency for it. The changes are logical and well-integrated. The PR also includes a small but important fix for a potential division-by-zero error in the evaluator.

While the repository's style guide suggests that pull requests should contain only a single semantic change (rule #9), the included bug fix is minor and improves the codebase. For future contributions, it would be ideal to separate features and bug fixes into distinct pull requests to adhere to the guidelines.

I've added a couple of suggestions in src/heretic/model.py to refactor some duplicated code, which will improve the maintainability of the model loading logic.

Add support for loading FP8/NVFP4 pre-quantized models by introducing
an "fp8" dtype token. Since "fp8" is not a valid PyTorch dtype, the
loader uses torch_dtype=torch.bfloat16 for compute while letting
HuggingFace auto-detect the model's built-in quantization config.

Changes:
- Add FP8 enum value to QuantizationMethod
- Branch on FP8 dtype in model loading, reset, and merge paths
- Extend obtain_merge_strategy to handle FP8 quantized models
- Fix division-by-zero in evaluator when base_refusals is 0
- Add fp8 optional dependency (fp-quant) in pyproject.toml
@cpagac cpagac force-pushed the fp8-quantization-support branch from 5803c07 to 957a31c on February 14, 2026
@kabachuha

NVFP4

What about mxfp4 for older hardware?

@cpagac
Contributor Author

cpagac commented Feb 14, 2026

NVFP4

What about mxfp4 for older hardware?

The implementation here is actually format-agnostic — the "fp8" dtype path simply uses torch_dtype=torch.bfloat16 and lets HuggingFace auto-detect the model's built-in quantization_config.

So if an MXFP4 model is published on HF with the appropriate config, this same loading path should handle it without changes. The naming is admittedly NVFP4-centric since that's what was tested against, but the mechanism itself isn't tied to any specific sub-format. I'm open to renaming the token to something more general if that makes sense.

@cpagac cpagac force-pushed the fp8-quantization-support branch from da75297 to 3925d5e on February 16, 2026
@p-e-w
Owner

p-e-w commented Feb 16, 2026

Please explain in a bit more detail what exactly is going on here. My understanding is that this loads models with their built-in quantization, just like is already supported for MXFP4. If so, what do we need the extra dtype for? If the model tensors are quantized, we always want to load them in that quantized format, no questions asked.

@cpagac
Contributor Author

cpagac commented Feb 16, 2026

Please explain in a bit more detail what exactly is going on here. My understanding is that this loads models with their built-in quantization, just like is already supported for MXFP4. If so, what do we need the extra dtype for? If the model tensors are quantized, we always want to load them in that quantized format, no questions asked.

Fair point. Looking into this more and testing it, dtype="auto" already handles FP8 pre-quantized models correctly: HuggingFace auto-detects the quantization_config from the model's config.json the same way it does for MXFP4. (I confirmed this by loading nvidia/Llama-3.1-8B-Instruct-FP8 with just dtype="auto" and no special handling; it loads and generates fine.)
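
For reference, the auto-detection boils down to metadata already present in the checkpoint. The snippet below simulates it with an illustrative config.json excerpt (field values are made up for illustration; the from_pretrained call is shown in a comment since it needs the actual weights):

```python
import json

# Simulated excerpt of a pre-quantized checkpoint's config.json
# (structure as on a repo like nvidia/Llama-3.1-8B-Instruct-FP8;
# the exact field values here are illustrative).
config_json = json.loads("""
{
  "model_type": "llama",
  "torch_dtype": "bfloat16",
  "quantization_config": {
    "quant_method": "fp8",
    "activation_scheme": "static"
  }
}
""")

# With dtype="auto", transformers reads torch_dtype from the config and
# picks up quantization_config automatically -- no "fp8" dtype token
# is needed:
#
#     model = AutoModelForCausalLM.from_pretrained(repo_id, dtype="auto")
#
# The only check heretic then needs is whether the metadata exists:
is_prequantized = config_json.get("quantization_config") is not None
print(is_prequantized)
```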

My original thinking was that FP8 needed an explicit opt-in, as BNB_4BIT does, where the user tells heretic to quantize, and heretic applies it at load time. As such, I created an "fp8" dtype token following the same pattern. I was originally coding this fork for Nemotron support and ran into an issue on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8.

I can rework the PR to remove the "fp8" token, the torch_dtype workaround, and the fp-quant dependency. The one thing I think is still relevant is get_merged_model(). Right now, it only has a special CPU reload path for BNB_4BIT models, since quantized weights can't have LoRA adapters merged into them directly. Pre-quantized models (FP8, MXFP4, etc.) share the same limitation: you would need to reload the base model in full precision on the CPU, apply the LoRA weights, and then merge. Without that, I think the merge step would either fail or produce a corrupt model.

@p-e-w
Owner

p-e-w commented Feb 17, 2026

Ok. Please trim this PR down to the necessary stuff, then ping me for another review.

- Remove fp8 dtype token, QuantizationMethod.FP8, and fp-quant dependency
- Remove torch_dtype workaround and kwargs dict pattern
- Fix evaluator div-by-zero: return refusals when base_refusals is 0
- Extend quantized model detection in get_merged_model() and
  obtain_merge_strategy() to cover pre-quantized models (FP8, MXFP4,
  GPTQ, etc.) via quantization_config on the model config, not just BNB_4BIT
@cpagac
Contributor Author

cpagac commented Feb 19, 2026

Done. Removed the "fp8" dtype token, QuantizationMethod.FP8, the fp-quant dependency, the torch_dtype workaround, and the kwargs dict pattern.

Two things left: the evaluator div-by-zero fix, and an extension to get_merged_model() and obtain_merge_strategy() that detects pre-quantized models (FP8, MXFP4, GPTQ, etc.) via model.config.quantization_config rather than just checking for BNB_4BIT. Without that, the CPU reload path for LoRA merging wouldn't trigger for pre-quantized models, which have the same limitation.
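
As a sketch, that detection is a single attribute check (SimpleNamespace stands in for a transformers PretrainedConfig here; the helper name is illustrative):

```python
from types import SimpleNamespace

def is_quantized_model(config) -> bool:
    """Sketch of the detection used in get_merged_model() /
    obtain_merge_strategy(). Any pre-quantized checkpoint (FP8, MXFP4,
    GPTQ, AWQ, ...) carries a quantization_config on its model config,
    so one attribute check covers them all instead of special-casing
    BNB_4BIT. (Illustrative; the real code inspects the config of the
    loaded model object.)"""
    return getattr(config, "quantization_config", None) is not None

# A pre-quantized model's config carries quantization metadata:
fp8_config = SimpleNamespace(quantization_config={"quant_method": "fp8"})
# A full-precision model's config has no such attribute:
fp16_config = SimpleNamespace()

print(is_quantized_model(fp8_config), is_quantized_model(fp16_config))
```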

Ready for re-review @p-e-w

@cpagac cpagac changed the title from "feat: add FP8/NVFP4 quantization support" to "fix: extend LoRA merge path to pre-quantized models and fix evaluator div-by-zero" on Feb 19, 2026
"""

if settings.quantization == QuantizationMethod.BNB_4BIT:
is_quantized = getattr(model.model.config, "quantization_config", None) is not None
@p-e-w commented on this diff:

Does this work for models quantized on-the-fly by Heretic, i.e., with bitsandbytes?

if settings.quantization == QuantizationMethod.NONE
else " (requires sufficient RAM)"
),
title="Merge LoRA into full model (requires sufficient RAM)",
@p-e-w commented on this diff:
Please undo this change; the code is the way it is for a good reason (#152).

@p-e-w
Owner

p-e-w commented Feb 20, 2026

Which models have you tested this with? Can you link to Hugging Face uploads made with this PR?

@p-e-w
Owner

p-e-w commented Mar 6, 2026

Any update?

@cpagac
Contributor Author

cpagac commented Mar 7, 2026

Any update?

Yes, sorry. I'm currently a student so time can be tight, and I've been stretched thin with homework, but I'm working on testing with a few more models and should be done by this weekend.
