Conversation
Summary of Changes

Hello @cpagac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the model compatibility of Heretic by introducing support for FP8/NVFP4 pre-quantized models and the NemotronH hybrid architecture. It refines the model loading and processing pipeline to handle diverse quantization schemes and novel layer structures, ensuring stability and accurate abliteration across a broader range of large language models.
Code Review
This pull request introduces significant enhancements to support the NemotronH architecture and FP8/NVFP4 pre-quantized models. The changes are extensive and well-implemented, including robust fallbacks for layer discovery and hidden state extraction, as well as safety checks for multi-GPU and quantized model handling. The addition of detailed documentation in FP8_NVFP4_CHANGES.md is particularly helpful.
My review focuses on a couple of areas for improvement:
- A performance optimization opportunity in the model initialization logic.
- A minor documentation inconsistency.
Overall, this is a high-quality contribution that greatly expands the capabilities of the project.
Force-pushed from c1ec534 to af4384f.
PR body and description have been overhauled to accurately reflect the full diff.
/gemini review
Code Review
This pull request introduces significant new functionality to support hybrid Mamba/SSM architectures like NemotronH, along with several correctness fixes and user experience improvements. The changes are comprehensive and well-structured, particularly the VRAM calibration for multi-GPU setups and the interactive prompts for trust_remote_code and missing dependencies.
My review focuses on improving code maintainability and ensuring adherence to the repository's style guide. I've pointed out a few areas where the code could be refactored for clarity, a magic number that should be a constant, and several comments that need to be updated to match the project's coding conventions. Overall, this is a great contribution that significantly expands the tool's capabilities.
Diff excerpt (new version shown):

```python
for dtype in settings.dtypes:
    if abort:
        break
    print(f"* Trying dtype [bold]{dtype}[/]... ", end="")

    while True:
        try:
            quantization_config = self._get_quantization_config(dtype)

            extra_kwargs = {}
            # Only include quantization_config if it's not None
            # (some models like gpt-oss have issues with explicit None).
            if quantization_config is not None:
                extra_kwargs["quantization_config"] = quantization_config

            # Pass trust_remote_code=False (not None) when trust hasn't been
            # established yet. This prevents HF from showing its own interactive
            # prompt; we handle that ourselves below with clearer context.
            self.model = get_model_class(settings.model).from_pretrained(
                settings.model,
                dtype=dtype,
                device_map=self.device_map,
                max_memory=self.max_memory,
                trust_remote_code=self.trusted_models.get(settings.model)
                or False,
                **extra_kwargs,
            )

            # If we reach this point and the model requires trust_remote_code,
            # either the user accepted, or settings.trust_remote_code is True.
            if self.trusted_models.get(settings.model) is None:
                self.trusted_models[settings.model] = True

            # A test run can reveal dtype-related problems such as the infamous
            # "RuntimeError: probability tensor contains either `inf`, `nan` or element < 0"
            # (https://github.com/meta-llama/llama/issues/380).
            self.generate(
                [
                    Prompt(
                        system=settings.system_prompt,
                        user="What is 1+1?",
                    )
                ],
                max_new_tokens=1,
            )

            # After a successful load and warmup on multi-GPU systems, check
            # whether each GPU has enough free VRAM for batch inference. If not,
            # compute corrected per-GPU caps from the actual measured allocations
            # and reload once. This handles architectures (e.g. NemotronH) where
            # SSM workspace and other one-time allocations during the first
            # forward pass leave insufficient headroom for batched inference.
            # Only applies to hybrid SSM models — regular transformers don't
            # allocate persistent inference workspace on top of model weights.
            if (
                not _vram_calibrated
                and torch.cuda.is_available()
                and torch.cuda.device_count() > 1
                and self._has_mamba_layers()
            ):
                _HEADROOM = 6 * 1024**3  # 6 GiB minimum free per GPU
                gpu_count = torch.cuda.device_count()
                min_free = min(
                    torch.cuda.mem_get_info(i)[0] for i in range(gpu_count)
                )
                if min_free < _HEADROOM:
                    print()
                    print(
                        f"[yellow]Only {min_free / (1024**3):.1f} GiB free on "
                        "most-loaded GPU — recalibrating layout for batch inference...[/]"
                    )
                    # Identify overloaded GPUs before releasing the model.
                    overloaded = {
                        i
                        for i in range(gpu_count)
                        if torch.cuda.mem_get_info(i)[0] < _HEADROOM
                    }

                    # Release model so we can measure true available VRAM.
                    self.model = None  # ty:ignore[invalid-assignment]
                    empty_cache()

                    max_mem: dict[int | str, str] = {}
                    for i in range(gpu_count):
                        free_i, _ = torch.cuda.mem_get_info(i)
                        # Reserve headroom for inference working memory
                        # (SSM workspace, KV cache, activations, etc.).
                        usable = max(free_i - _HEADROOM, 0)
                        if i in overloaded:
                            # Apply correction to prevent Accelerate from
                            # overloading this GPU again due to layer-size
                            # underestimation (~30% on hybrid architectures).
                            stated_gib = max(int(usable / (1024**3) * 0.7), 1)
                        else:
                            # Full budget — this GPU absorbs displaced layers.
                            stated_gib = max(int(usable / (1024**3)), 1)
                        max_mem[i] = f"{stated_gib}GiB"
                    caps = ", ".join(
                        f"GPU {k}: {v}" for k, v in max_mem.items()
                    )
                    print(f"  [dim]Corrected caps: {caps}[/]")
                    self.max_memory = max_mem
                    _vram_calibrated = True
                    print(
                        f"* Retrying dtype [bold]{dtype}[/] with corrected caps... ",
                        end="",
                    )
                    continue  # reload this dtype with corrected max_memory
        except Exception as error:
            self.model = None  # ty:ignore[invalid-assignment]
            empty_cache()
            print(f"[red]Failed[/] ({error})")

            error_str = str(error).lower()

            if "trust_remote_code" in error_str:
                if self.trusted_models.get(settings.model) is None:
                    # Model requires custom code — explain and ask once.
                    print()
                    print(
                        "[yellow](This is expected — the model requires permission to run custom code.)[/]"
                    )
                    print(
                        f"[yellow][bold]{settings.model}[/bold] ships custom architecture "
                        "code that must be executed to load this model. "
                        f"You can inspect the repository at "
                        f"https://huggingface.co/{settings.model}[/]"
                    )
                    print()
                    if questionary.confirm(
                        "Trust and run this model's custom code?",
                        default=True,
                    ).ask():
                        self.trusted_models[settings.model] = True
                        print(f"* Retrying dtype [bold]{dtype}[/]... ", end="")
                        continue  # retry this dtype with trust granted
                    else:
                        self.trusted_models[settings.model] = False
                        abort = True
                break  # trust already decided; move to next dtype or abort

            if "mamba-ssm" in error_str:
                # Missing dependency — retrying other dtypes won't help.
                print()
                print(
                    f"[bold red]mamba-ssm is required to load [cyan]{settings.model}[/cyan].[/]"
                )
                print()
                if questionary.confirm(
                    "Install mamba-ssm now? (this may take several minutes)",
                    default=True,
                ).ask():
                    try:
                        subprocess.check_call(
                            [
                                sys.executable,
                                "-m",
                                "pip",
                                "install",
                                "mamba-ssm",
                            ]
                        )
                    except subprocess.CalledProcessError:
                        print()
                        print("[bold red]Auto-install failed.[/]")
                        print(
                            "[yellow]mamba-ssm requires the CUDA toolkit (nvcc) to build. "
                            "Install nvcc, then run:[/] pip install mamba-ssm"
                        )
                        print("[yellow]To install nvcc:[/]")
                        print(
                            "  sudo apt install nvidia-cuda-toolkit  [dim]# Ubuntu/Debian[/]"
                        )
                        print(
                            "  conda install -c nvidia cuda-nvcc  [dim]# Conda[/]"
                        )
                        raise SystemExit(1)
                    print()
                    print(
                        "[green]Installation complete. Retrying model load...[/]"
                    )
                    print()
                    continue  # retry this dtype after install
                abort = True
                break

            # For all other errors, update trust cache if needed and try next dtype.
            if self.trusted_models.get(settings.model) is None:
                self.trusted_models[settings.model] = True
            break
        else:
            # Load and test generate succeeded — exit the retry loop.
            break
```
The __init__ method has grown quite large and complex. To improve maintainability and readability, consider refactoring parts of this logic into smaller, dedicated helper methods. For example, the VRAM calibration logic (lines 145-206), the trust_remote_code prompt handling, and the mamba-ssm installation could each be extracted into their own methods.
Additionally, there are a few other areas for improvement:
- Magic Number: The correction factor `0.7` on line 191 is a magic number. It should be defined as a named constant with a descriptive name (e.g., `_VRAM_CORRECTION_FACTOR`).
- Style Guide: Several inline comments do not adhere to the repository's style guide (Rule 4), which requires comments to start with a capital letter and end with a period. Please update the following comments to be compliant:
  - Line 159: `# 6 GiB minimum free per GPU`
  - Line 206: `# reload this dtype with corrected max_memory`
  - Line 234: `# retry this dtype with trust granted`
  - Line 238: `# trust already decided; move to next dtype or abort`
  - Line 281: `# retry this dtype after install`
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
Agreed on the style violations, I'll fix those.
On the 0.7 correction factor: this was tuned empirically against observed NemotronH Mamba2 SSM workspace allocation. Accelerate tends to underestimate layer sizes by around 30% on hybrid architectures, so the factor compensates for this. Determining how to distribute layers across GPUs without overcommitting VRAM is genuinely one of the harder problems here, since the only reliable signal is a live OOM.
To add a bit more context on why there is 6 GiB of headroom: it was similarly derived from observed post-warmup free VRAM on the test system — enough to cover the SSM workspace plus KV cache and activations for batch inference.
Both could be worth promoting to named constants with a comment explaining the rationale, which I can do if that would be helpful.
The broader refactoring into helper methods is tied to the hybrid.py question below.
This comment does not end with a period, which violates the repository's style guide (Rule 4).

```python
# Quantized models need special handling - we must reload the base model
# in full precision to merge the LoRA adapters
```

Suggested change:

```python
# Quantized models need special handling - we must reload the base model
# in full precision to merge the LoRA adapters.
```
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
```python
# NemotronH hybrid layers - all use a unified `mixer` attribute.
# Attention layers have mixer.o_proj.
with suppress(Exception):
    try_add("attn.o_proj", layer.mixer.o_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH simple MLP layers have mixer.down_proj.
with suppress(Exception):
    try_add("mlp.down_proj", layer.mixer.down_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH MoE layers have mixer.experts (per-expert) and mixer.shared_experts.
# Following heretic's standard pattern for MoE models (Qwen3, Phi-3.5-MoE, Granite):
# include all expert down_proj modules. Optuna will optimize the weight.
with suppress(Exception):
    for expert in layer.mixer.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
        try_add("mlp.down_proj", expert.down_proj)  # ty:ignore[possibly-missing-attribute]
```
Some comments in this section violate the repository's style guide (Rule 4):
- Line 572: Does not end with a period.
- Line 582: Ends with a colon instead of a period.
- Line 583: Starts with a lowercase letter.
Please correct these comments to follow the style guide.
```python
# NemotronH hybrid layers - all use a unified `mixer` attribute.
# Attention layers have mixer.o_proj.
with suppress(Exception):
    try_add("attn.o_proj", layer.mixer.o_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH simple MLP layers have mixer.down_proj.
with suppress(Exception):
    try_add("mlp.down_proj", layer.mixer.down_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH MoE layers have mixer.experts (per-expert) and mixer.shared_experts.
# Following heretic's standard pattern for MoE models (Qwen3, Phi-3.5-MoE, Granite):
# include all expert down_proj modules. Optuna will optimize the weight.
with suppress(Exception):
    for expert in layer.mixer.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
        try_add("mlp.down_proj", expert.down_proj)  # ty:ignore[possibly-missing-attribute]
```

Suggested change:

```python
# NemotronH hybrid layers - all use a unified `mixer` attribute.
# Attention layers have mixer.o_proj.
with suppress(Exception):
    try_add("attn.o_proj", layer.mixer.o_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH simple MLP layers have mixer.down_proj.
with suppress(Exception):
    try_add("mlp.down_proj", layer.mixer.down_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH MoE layers have mixer.experts (per-expert) and mixer.shared_experts.
# Following heretic's standard pattern for MoE models (Qwen3, Phi-3.5-MoE, Granite).
# Include all expert down_proj modules. Optuna will optimize the weight.
with suppress(Exception):
    for expert in layer.mixer.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
        try_add("mlp.down_proj", expert.down_proj)  # ty:ignore[possibly-missing-attribute]
```
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
@p-e-w |
Force-pushed from af4384f to f2d66c8.
When testing with a model that would not fit within VRAM constraints, the calibration would still trigger: it released the model, computed tiny caps from whatever free VRAM remained, and reloaded with those caps, making disk offloading worse rather than better. The fix checks whether any model parameters are on the meta device before deciding to recalibrate. If disk offloading is detected, calibration is skipped entirely, since rebalancing the GPU distribution cannot address a capacity problem. This was fixed in the most recent commit.
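The meta-device check described here might look roughly like the following duck-typed sketch. The helper name is hypothetical; the real code presumably iterates the model's `named_parameters()`:

```python
def any_params_on_meta(named_params) -> bool:
    """Return True if any parameter sits on the meta device, meaning
    Accelerate spilled part of the model to disk and VRAM recalibration
    should be skipped (it cannot fix a capacity problem)."""
    return any(p.device.type == "meta" for _, p in named_params)
```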
NemotronH (Mamba2 SSM + MoE + Attention) requires several changes to load and abliterate correctly on multi-GPU systems.

Architecture support (model.py):
- Add backbone.layers fallback in get_layers() for NemotronH's model.backbone.layers structure
- Add get_layer_modules() patterns for NemotronH's unified mixer attribute: mixer.out_proj (Mamba2), mixer.o_proj (attention), mixer.down_proj / mixer.experts[*].down_proj / mixer.shared_experts.down_proj (MoE)
- Scan all layers in get_abliterable_components() instead of only layer 0, to discover the full union of component types in hybrid architectures
- Add _get_hidden_states_via_hooks() fallback for models that don't return hidden_states through generate() (NemotronH returns a tuple of Nones); use forward hooks on each layer with device-aware stacking for multi-GPU compatibility
- Skip meta-device and NaN-weight modules in abliterate() to prevent NaN corruption when layers are CPU-offloaded by Accelerate
- Add _has_mamba_layers() to detect hybrid SSM architectures

Multi-GPU VRAM calibration (model.py):
- After inference warmup on multi-GPU systems, check if any GPU has less than 6 GiB free; if so, release the model, measure actual free VRAM per GPU, and reload once with corrected per-GPU caps
- Overloaded GPUs get a 0.7 correction factor for Accelerate's layer-size underestimation; other GPUs get the full budget to absorb displaced layers; gated to hybrid SSM models via _has_mamba_layers() so regular transformers are unaffected

User experience:
- Show a trust_remote_code explanation with the model repo link before prompting, replacing the bare HuggingFace error message
- Auto-install mamba-ssm when required, with clear nvcc/CUDA toolkit guidance on build failure
- Suggest installing causal-conv1d and mamba-ssm after loading any model with Mamba layers when fast kernels are missing

Other fixes:
- Sum VRAM across all GPUs in print_memory_usage() (utils.py)
- Show total and per-GPU VRAM in startup output (main.py)
- Fix division by zero in evaluator when base_refusals is 0
- Add mamba optional dependency group to pyproject.toml
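The division-by-zero fix in the evaluator can be illustrated with a guard along these lines. The function and parameter names are hypothetical; the evaluator's actual code isn't shown in this thread:

```python
def refusal_ratio(base_refusals: int, new_refusals: int) -> float:
    """Fraction of baseline refusals remaining after abliteration.

    Returns 0.0 when the baseline produced no refusals at all, which
    would otherwise raise ZeroDivisionError.
    """
    if base_refusals == 0:
        return 0.0
    return new_refusals / base_refusals
```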
Force-pushed from f2d66c8 to ecaf645.
It's cool that you implemented this, but that's way more code than I'm willing to support in the future for a single model architecture.
I don't quite understand what the problem with the built-in Transformers prompt is. It's already working, as far as I can tell. This logic is obviously extremely security-critical, and I'd rather not home-cook this. Either way, if there is indeed a problem here this needs to be a separate PR. I can't review monster PRs that mix multiple issues.
Is there a reason not to simply add those packages to the default dependencies?
Thanks for the feedback.

VRAM calibration — the underlying problem isn't NemotronH-specific: any model that allocates significant runtime memory on top of its weights (SSM workspace, KV cache pre-allocation, etc.) will OOM on multi-GPU setups even when Accelerate's device map looks balanced, because Accelerate can only account for static weight size at load time. NemotronH is just the model that exposed the gap. My thought is that as hybrid architectures become more common, this will only come up more frequently. The _has_mamba_layers() gate is the only NemotronH-specific piece; the calibration logic itself is general. That said, extracting it to a hybrid.py module might be the right way to keep it isolated if code size is the concern.

The built-in Transformers prompt works fine, you are 100% correct; it was more of a UX improvement in my eyes. NemotronH requires trust_remote_code=True, and the default HuggingFace prompt gives no context about why, which I thought could be confusing to some. Happy to revert it, and if there's a genuine case for improved UX around it, that can be a separate PR.

Fast kernels — causal-conv1d and mamba-ssm require nvcc to build from source, so they'd fail on macOS, CPU-only systems, and most CI environments. An optional dep group was considered, but it caused CI failures when uv sync --all-extras tried to build them on GitHub Actions runners without a CUDA compiler. Hence the suggestion at runtime rather than a hard dependency.
Hmm. This seems like a problem for Accelerate to fix; the promise of automatic device mapping is that exactly this kind of placement is handled for you. The ecosystem is not perfect, but that doesn't mean we should fix every issue downstream. The complexity of this logic is extremely high, and it is riddled with magic numbers (headroom, "0.7" factor, etc.). Every single change that touches the model loading logic in the future will have to take this stuff into account. With the development resources currently available to this project, I consider this unmaintainable.
But it's always the same reason: The model has custom code. This is the only context in which remote code comes into play with Transformers.
I see. This sucks. Does that mean those models are slow as molasses on non-NVIDIA systems?
The gist of what I'm saying is that hybrid models with external memory allocation, requiring custom code and custom dependencies, are at the very edge of what is possible to do "fully automatically" today. We're not yet at a point where we can support such models the same way we do Llama 3, at least not without unacceptable complexity cost. Increasing the size of the codebase by 10% for a single model is not going to happen. I'm really sorry I have to take this stance, especially given that you already implemented so much of this, but this is a maintenance burden I simply cannot shoulder.
Totally fair. It would be nice for Accelerate to take this on long-term, but the main issue is that Accelerate can't predict memory that doesn't exist yet at load time. The SSM workspace is allocated inside custom model code during the first forward pass, so it's fundamentally invisible to device_map="auto". That's by no means a trivial fix on their end either, but ultimately probably their problem to solve. I also feel that as hybrid architectures become more common, the gap between static weight size and actual runtime allocation is only going to grow.

I knew the magic numbers were a weak point; the 0.7 factor was tuned against one system and would need adjusting over time, which, as you said, would take resources for the foreseeable future.

Yep, pretty much — I saw about 0 tok/s on batch 1, then around 3 on batch 4, and honestly I just quit while testing, as it was nearly unusable. So hybrid models are technically functional without them; you just have to give up days of your life waiting for them.
Genuinely, it's all good. I had originally built this because I noticed Heretic didn't have support for a specific model that I wanted to use for a project. So I built it and thought, why not see if the main repo would want some of the core logic. I'll try to keep it maintained and make a separate, cleaner fork for it in case anyone wants to use it, but all good if the addition ends here.
There's got to be a way to solve this in the library ecosystem, though. Otherwise every piece of inference software has to run hacky trial-and-error loops to guess the right size. That will never work reliably.
Summary
Adds support for hybrid Mamba/SSM architectures (e.g. nvidia/NVIDIA-Nemotron-Nano-9B-v2), with several correctness fixes along the way.
Hybrid model support (model.py)
- Softened the standard attention-module lookup from a hard assert to a suppressed exception, so layers without standard self-attention don't abort discovery
- Added module patterns for NemotronH's unified mixer structures. Logs a warning for any layers with no recognized modules
- Hidden-state extraction via forward hooks moves each layer's output to a common device before stacking, required on multi-GPU setups
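The suppressed-exception lookup pattern above can be sketched as a small helper. The function name and return shape are illustrative; the `self_attn`/`mixer` attribute paths follow the patterns described in this PR:

```python
from contextlib import suppress


def discover_layer_modules(layer) -> dict:
    """Collect abliterable projection modules from a layer, tolerating
    hybrid layers that lack standard self-attention: each lookup failure
    is suppressed instead of asserting."""
    found = {}
    with suppress(AttributeError):
        found["attn.o_proj"] = layer.self_attn.o_proj  # Standard transformer layer.
    with suppress(AttributeError):
        found["attn.o_proj"] = layer.mixer.o_proj  # NemotronH attention layer.
    with suppress(AttributeError):
        found["mlp.down_proj"] = layer.mixer.down_proj  # NemotronH MLP layer.
    return found
```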
Multi-GPU VRAM calibration (model.py)
NemotronH's Mamba2 SSM layers allocate a persistent ~4 GiB workspace during the first forward pass — after model loading — which Accelerate cannot account for when computing the device map. On multi-GPU systems this causes OOM mid-inference even when initial placement looked balanced.

After the warmup generate() call, if any GPU has less than 6 GiB free and the model has Mamba layers, the model is released, actual post-warmup free VRAM is measured per device, and the model reloads once with corrected max_memory caps. A 0.7 correction factor is applied only to overloaded GPUs to prevent Accelerate from repeating the same placement error. Regular models are unaffected.
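The cap computation described above can be sketched as a pure function. The name and signature are hypothetical; the 6 GiB headroom and 0.7 factor come from the PR:

```python
def corrected_max_memory(
    free_bytes: list[int],
    overloaded: set[int],
    headroom: int = 6 * 1024**3,
    factor: float = 0.7,
) -> dict[int, str]:
    """Rebuild per-GPU max_memory caps from measured free VRAM.

    Overloaded GPUs get only `factor` of their usable budget so Accelerate
    doesn't repeat its placement error; the other GPUs keep the full budget
    and absorb the displaced layers. Caps never drop below 1 GiB.
    """
    caps: dict[int, str] = {}
    for i, free in enumerate(free_bytes):
        usable = max(free - headroom, 0)
        scale = factor if i in overloaded else 1.0
        caps[i] = f"{max(int(usable / 1024**3 * scale), 1)}GiB"
    return caps
```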
trust_remote_code interactive prompt (model.py)
Previously, models requiring custom code would fail silently or trigger HuggingFace's own prompt with no context. Now heretic passes trust_remote_code=False on the first attempt, catches the resulting error, explains to the user what custom code execution means (with a link to the HF repo), and retries with trust_remote_code=True if the user confirms. Also handles mamba-ssm import errors with an auto-install prompt.
Fast kernel suggestion (model.py)
After listing abliterable components, if Mamba layers are present and causal-conv1d/mamba-ssm are not installed, prints a one-time suggestion with the pip command, CUDA toolkit version requirement (≥ 11.6), and expected build time (~10 min).
Merge path fix for models with built-in quantization (model.py, main.py)
get_merged_model() and obtain_merge_strategy() now detect models that have a quantization_config baked into their HuggingFace config (e.g. models already quantized at publish time) and route them through the same CPU-reload merge path as BNB 4-bit, preventing a silent merge failure on export.
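The detection itself can be as simple as the following sketch. The function name is hypothetical; the idea, per the description above, is that a baked-in quantization_config on the HuggingFace config signals a pre-quantized model:

```python
def has_builtin_quantization(config) -> bool:
    """Return True when a HuggingFace model config carries a baked-in
    quantization_config, which routes the merge through the CPU
    full-precision reload path instead of merging in place."""
    return getattr(config, "quantization_config", None) is not None
```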
Test plan