Conversation
Summary of Changes

Hello @cpagac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the model compatibility of Heretic by introducing support for FP8/NVFP4 pre-quantized models and the NemotronH hybrid architecture. It refines the model loading and processing pipeline to handle diverse quantization schemes and novel layer structures, ensuring stability and accurate abliteration across a broader range of large language models.
Code Review
This pull request introduces significant enhancements to support the NemotronH architecture and FP8/NVFP4 pre-quantized models. The changes are extensive and well-implemented, including robust fallbacks for layer discovery and hidden state extraction, as well as safety checks for multi-GPU and quantized model handling. The addition of detailed documentation in FP8_NVFP4_CHANGES.md is particularly helpful.
My review focuses on a couple of areas for improvement:
- A performance optimization opportunity in the model initialization logic.
- A minor documentation inconsistency.
Overall, this is a high-quality contribution that greatly expands the capabilities of the project.
Force-pushed from c1ec534 to af4384f.
PR body and description have been overhauled to accurately reflect the full diff.
/gemini review
Code Review
This pull request introduces significant new functionality to support hybrid Mamba/SSM architectures like NemotronH, along with several correctness fixes and user experience improvements. The changes are comprehensive and well-structured, particularly the VRAM calibration for multi-GPU setups and the interactive prompts for trust_remote_code and missing dependencies.
My review focuses on improving code maintainability and ensuring adherence to the repository's style guide. I've pointed out a few areas where the code could be refactored for clarity, a magic number that should be a constant, and several comments that need to be updated to match the project's coding conventions. Overall, this is a great contribution that significantly expands the tool's capabilities.
Diff excerpt (new version shown):

```python
for dtype in settings.dtypes:
    if abort:
        break
    print(f"* Trying dtype [bold]{dtype}[/]... ", end="")

    while True:
        try:
            quantization_config = self._get_quantization_config(dtype)

            extra_kwargs = {}
            # Only include quantization_config if it's not None
            # (some models like gpt-oss have issues with explicit None).
            if quantization_config is not None:
                extra_kwargs["quantization_config"] = quantization_config

            # Pass trust_remote_code=False (not None) when trust hasn't been
            # established yet. This prevents HF from showing its own interactive
            # prompt; we handle that ourselves below with clearer context.
            self.model = get_model_class(settings.model).from_pretrained(
                settings.model,
                dtype=dtype,
                device_map=self.device_map,
                max_memory=self.max_memory,
                trust_remote_code=self.trusted_models.get(settings.model)
                or False,
                **extra_kwargs,
            )

            # If we reach this point and the model requires trust_remote_code,
            # either the user accepted, or settings.trust_remote_code is True.
            if self.trusted_models.get(settings.model) is None:
                self.trusted_models[settings.model] = True

            # A test run can reveal dtype-related problems such as the infamous
            # "RuntimeError: probability tensor contains either `inf`, `nan` or element < 0"
            # (https://github.com/meta-llama/llama/issues/380).
            self.generate(
                [
                    Prompt(
                        system=settings.system_prompt,
                        user="What is 1+1?",
                    )
                ],
                max_new_tokens=1,
            )

            # After a successful load and warmup on multi-GPU systems, check
            # whether each GPU has enough free VRAM for batch inference. If not,
            # compute corrected per-GPU caps from the actual measured allocations
            # and reload once. This handles architectures (e.g. NemotronH) where
            # SSM workspace and other one-time allocations during the first
            # forward pass leave insufficient headroom for batched inference.
            # Only applies to hybrid SSM models — regular transformers don't
            # allocate persistent inference workspace on top of model weights.
            if (
                not _vram_calibrated
                and torch.cuda.is_available()
                and torch.cuda.device_count() > 1
                and self._has_mamba_layers()
            ):
                _HEADROOM = 6 * 1024**3  # 6 GiB minimum free per GPU
                gpu_count = torch.cuda.device_count()
                min_free = min(
                    torch.cuda.mem_get_info(i)[0] for i in range(gpu_count)
                )
                if min_free < _HEADROOM:
                    print()
                    print(
                        f"[yellow]Only {min_free / (1024**3):.1f} GiB free on "
                        "most-loaded GPU — recalibrating layout for batch inference...[/]"
                    )
                    # Identify overloaded GPUs before releasing the model.
                    overloaded = {
                        i
                        for i in range(gpu_count)
                        if torch.cuda.mem_get_info(i)[0] < _HEADROOM
                    }

                    # Release model so we can measure true available VRAM.
                    self.model = None  # ty:ignore[invalid-assignment]
                    empty_cache()

                    max_mem: dict[int | str, str] = {}
                    for i in range(gpu_count):
                        free_i, _ = torch.cuda.mem_get_info(i)
                        # Reserve headroom for inference working memory
                        # (SSM workspace, KV cache, activations, etc.).
                        usable = max(free_i - _HEADROOM, 0)
                        if i in overloaded:
                            # Apply correction to prevent Accelerate from
                            # overloading this GPU again due to layer-size
                            # underestimation (~30% on hybrid architectures).
                            stated_gib = max(int(usable / (1024**3) * 0.7), 1)
                        else:
                            # Full budget — this GPU absorbs displaced layers.
                            stated_gib = max(int(usable / (1024**3)), 1)
                        max_mem[i] = f"{stated_gib}GiB"
                    caps = ", ".join(
                        f"GPU {k}: {v}" for k, v in max_mem.items()
                    )
                    print(f"  [dim]Corrected caps: {caps}[/]")
                    self.max_memory = max_mem
                    _vram_calibrated = True
                    print(
                        f"* Retrying dtype [bold]{dtype}[/] with corrected caps... ",
                        end="",
                    )
                    continue  # reload this dtype with corrected max_memory
        except Exception as error:
            self.model = None  # ty:ignore[invalid-assignment]
            empty_cache()
            print(f"[red]Failed[/] ({error})")

            error_str = str(error).lower()

            if "trust_remote_code" in error_str:
                if self.trusted_models.get(settings.model) is None:
                    # Model requires custom code — explain and ask once.
                    print()
                    print(
                        "[yellow](This is expected — the model requires permission to run custom code.)[/]"
                    )
                    print(
                        f"[yellow][bold]{settings.model}[/bold] ships custom architecture "
                        "code that must be executed to load this model. "
                        f"You can inspect the repository at "
                        f"https://huggingface.co/{settings.model}[/]"
                    )
                    print()
                    if questionary.confirm(
                        "Trust and run this model's custom code?",
                        default=True,
                    ).ask():
                        self.trusted_models[settings.model] = True
                        print(f"* Retrying dtype [bold]{dtype}[/]... ", end="")
                        continue  # retry this dtype with trust granted
                    else:
                        self.trusted_models[settings.model] = False
                        abort = True
                break  # trust already decided; move to next dtype or abort

            if "mamba-ssm" in error_str:
                # Missing dependency — retrying other dtypes won't help.
                print()
                print(
                    f"[bold red]mamba-ssm is required to load [cyan]{settings.model}[/cyan].[/]"
                )
                print()
                if questionary.confirm(
                    "Install mamba-ssm now? (this may take several minutes)",
                    default=True,
                ).ask():
                    try:
                        subprocess.check_call(
                            [
                                sys.executable,
                                "-m",
                                "pip",
                                "install",
                                "mamba-ssm",
                            ]
                        )
                    except subprocess.CalledProcessError:
                        print()
                        print("[bold red]Auto-install failed.[/]")
                        print(
                            "[yellow]mamba-ssm requires the CUDA toolkit (nvcc) to build. "
                            "Install nvcc, then run:[/] pip install mamba-ssm"
                        )
                        print("[yellow]To install nvcc:[/]")
                        print(
                            "  sudo apt install nvidia-cuda-toolkit  [dim]# Ubuntu/Debian[/]"
                        )
                        print(
                            "  conda install -c nvidia cuda-nvcc  [dim]# Conda[/]"
                        )
                        raise SystemExit(1)
                    print()
                    print(
                        "[green]Installation complete. Retrying model load...[/]"
                    )
                    print()
                    continue  # retry this dtype after install
                abort = True
                break

            # For all other errors, update trust cache if needed and try next dtype.
            if self.trusted_models.get(settings.model) is None:
                self.trusted_models[settings.model] = True
            break
        else:
            # Load and test generate succeeded — exit the retry loop.
            break
```
The __init__ method has grown quite large and complex. To improve maintainability and readability, consider refactoring parts of this logic into smaller, dedicated helper methods. For example, the VRAM calibration logic (lines 145-206), the trust_remote_code prompt handling, and the mamba-ssm installation could each be extracted into their own methods.
Additionally, there are a few other areas for improvement:
- Magic Number: The correction factor `0.7` on line 191 is a magic number. It should be defined as a named constant with a descriptive name (e.g., `_VRAM_CORRECTION_FACTOR`).
- Style Guide: Several inline comments do not adhere to the repository's style guide (Rule 4), which requires comments to start with a capital letter and end with a period. Please update the following comments to be compliant:
  - Line 159: `# 6 GiB minimum free per GPU`
  - Line 206: `# reload this dtype with corrected max_memory`
  - Line 234: `# retry this dtype with trust granted`
  - Line 238: `# trust already decided; move to next dtype or abort`
  - Line 281: `# retry this dtype after install`
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
Agreed on the style violations, I'll fix those.
On the 0.7 correction factor: this was tuned empirically against observed NemotronH Mamba2 SSM workspace allocation. Accelerate tends to underestimate layer sizes by around 30% on hybrid architectures, so the factor compensates for this. Determining how to distribute layers across GPUs without overcommitting VRAM is genuinely one of the harder problems here, since the only reliable signal is a live OOM.
To add a bit more context on why there is 6 GiB of headroom: it was similarly derived from observed post-warmup free VRAM on the test system — enough to cover the SSM workspace plus KV cache and activations for batch inference.
Both could be worth promoting to named constants with a comment explaining the rationale, which I can do if that would be helpful.
The broader refactoring into helper methods is tied to the hybrid.py question below.
This comment does not end with a period, which violates the repository's style guide (Rule 4).

```python
# Quantized models need special handling - we must reload the base model
# in full precision to merge the LoRA adapters
```

Suggested change:

```python
# Quantized models need special handling - we must reload the base model
# in full precision to merge the LoRA adapters.
```
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
```python
# NemotronH hybrid layers - all use a unified `mixer` attribute.
# Attention layers have mixer.o_proj.
with suppress(Exception):
    try_add("attn.o_proj", layer.mixer.o_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH simple MLP layers have mixer.down_proj.
with suppress(Exception):
    try_add("mlp.down_proj", layer.mixer.down_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH MoE layers have mixer.experts (per-expert) and mixer.shared_experts.
# Following heretic's standard pattern for MoE models (Qwen3, Phi-3.5-MoE, Granite):
# include all expert down_proj modules. Optuna will optimize the weight.
with suppress(Exception):
    for expert in layer.mixer.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
        try_add("mlp.down_proj", expert.down_proj)  # ty:ignore[possibly-missing-attribute]
```
Some comments in this section violate the repository's style guide (Rule 4):
- Line 572: Does not end with a period.
- Line 582: Ends with a colon instead of a period.
- Line 583: Starts with a lowercase letter.
Please correct these comments to follow the style guide.
```python
# NemotronH hybrid layers - all use a unified `mixer` attribute.
# Attention layers have mixer.o_proj.
with suppress(Exception):
    try_add("attn.o_proj", layer.mixer.o_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH simple MLP layers have mixer.down_proj.
with suppress(Exception):
    try_add("mlp.down_proj", layer.mixer.down_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH MoE layers have mixer.experts (per-expert) and mixer.shared_experts.
# Following heretic's standard pattern for MoE models (Qwen3, Phi-3.5-MoE, Granite):
# include all expert down_proj modules. Optuna will optimize the weight.
with suppress(Exception):
    for expert in layer.mixer.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
        try_add("mlp.down_proj", expert.down_proj)  # ty:ignore[possibly-missing-attribute]
```

Suggested change:

```python
# NemotronH hybrid layers - all use a unified `mixer` attribute.
# Attention layers have mixer.o_proj.
with suppress(Exception):
    try_add("attn.o_proj", layer.mixer.o_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH simple MLP layers have mixer.down_proj.
with suppress(Exception):
    try_add("mlp.down_proj", layer.mixer.down_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH MoE layers have mixer.experts (per-expert) and mixer.shared_experts.
# Following heretic's standard pattern for MoE models (Qwen3, Phi-3.5-MoE, Granite).
# Include all expert down_proj modules. Optuna will optimize the weight.
with suppress(Exception):
    for expert in layer.mixer.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
        try_add("mlp.down_proj", expert.down_proj)  # ty:ignore[possibly-missing-attribute]
```
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
@p-e-w |
Force-pushed from af4384f to f2d66c8.
When testing with a model that would not fit within VRAM constraints, the calibration would still trigger: it released the model, computed tiny caps from whatever free VRAM remained, and reloaded with those caps, making disk offloading worse rather than better. The fix checks whether any model parameters are on the meta device before deciding to recalibrate. If disk offloading is detected, calibration is skipped entirely, since rebalancing the GPU distribution cannot address a capacity problem. This was fixed in the most recent commit.
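The meta-device check described here might look roughly like the following duck-typed sketch. The helper name is hypothetical; the real code presumably iterates the model's `named_parameters()`:

```python
def any_params_on_meta(named_params) -> bool:
    """Return True if any parameter sits on the meta device, meaning
    Accelerate spilled part of the model to disk and VRAM recalibration
    should be skipped (it cannot fix a capacity problem)."""
    return any(p.device.type == "meta" for _, p in named_params)
```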
NemotronH (Mamba2 SSM + MoE + Attention) requires several changes to load and abliterate correctly on multi-GPU systems.

Architecture support (model.py):
- Add backbone.layers fallback in get_layers() for NemotronH's model.backbone.layers structure
- Add get_layer_modules() patterns for NemotronH's unified mixer attribute: mixer.out_proj (Mamba2), mixer.o_proj (attention), mixer.down_proj / mixer.experts[*].down_proj / mixer.shared_experts.down_proj (MoE)
- Scan all layers in get_abliterable_components() instead of only layer 0, to discover the full union of component types in hybrid architectures
- Add _get_hidden_states_via_hooks() fallback for models that don't return hidden_states through generate() (NemotronH returns a tuple of Nones); use forward hooks on each layer with device-aware stacking for multi-GPU compatibility
- Skip meta-device and NaN-weight modules in abliterate() to prevent NaN corruption when layers are CPU-offloaded by Accelerate
- Add _has_mamba_layers() to detect hybrid SSM architectures

Multi-GPU VRAM calibration (model.py):
- After inference warmup on multi-GPU systems, check if any GPU has less than 6 GiB free; if so, release the model, measure actual free VRAM per GPU, and reload once with corrected per-GPU caps
- Overloaded GPUs get a 0.7 correction factor for Accelerate's layer-size underestimation; other GPUs get the full budget to absorb displaced layers; gated to hybrid SSM models via _has_mamba_layers() so regular transformers are unaffected

User experience:
- Show a trust_remote_code explanation with the model repo link before prompting, replacing the bare HuggingFace error message
- Auto-install mamba-ssm when required, with clear nvcc/CUDA toolkit guidance on build failure
- Suggest installing causal-conv1d and mamba-ssm after loading any model with Mamba layers when fast kernels are missing

Other fixes:
- Sum VRAM across all GPUs in print_memory_usage() (utils.py)
- Show total and per-GPU VRAM in startup output (main.py)
- Fix division by zero in evaluator when base_refusals is 0
- Add mamba optional dependency group to pyproject.toml
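The division-by-zero fix in the evaluator can be illustrated with a guard along these lines. The function and parameter names are hypothetical; the evaluator's actual code isn't shown in this thread:

```python
def refusal_ratio(base_refusals: int, new_refusals: int) -> float:
    """Fraction of baseline refusals remaining after abliteration.

    Returns 0.0 when the baseline produced no refusals at all, which
    would otherwise raise ZeroDivisionError.
    """
    if base_refusals == 0:
        return 0.0
    return new_refusals / base_refusals
```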
Force-pushed from f2d66c8 to ecaf645.
It's cool that you implemented this, but that's way more code than I'm willing to support in the future for a single model architecture.
I don't quite understand what the problem with the built-in Transformers prompt is. It's already working, as far as I can tell. This logic is obviously extremely security-critical, and I'd rather not home-cook this. Either way, if there is indeed a problem here this needs to be a separate PR. I can't review monster PRs that mix multiple issues.
Is there a reason not to simply add those packages to the default dependencies?
Thanks for the feedback.

VRAM calibration — the underlying problem isn't NemotronH-specific: any model that allocates significant runtime memory on top of its weights (SSM workspace, KV cache pre-allocation, etc.) will OOM on multi-GPU setups even when Accelerate's device map looks balanced, because Accelerate can only account for static weight size at load time. NemotronH is just the model that exposed the gap. My thought is that as hybrid architectures become more common, this will only come up more frequently. The _has_mamba_layers() gate is the only NemotronH-specific piece; the calibration logic itself is general. That said, extracting it to a hybrid.py module might be the right way to keep it isolated if code size is the concern.

The built-in Transformers prompt works fine, you are 100% correct; it was more of a UX improvement in my eyes. NemotronH requires trust_remote_code=True, and the default HuggingFace prompt gives no context about why, which I thought could be confusing to some. Happy to revert it, and if there's a genuine case for improved UX around it, that can be a separate PR.

Fast kernels — causal-conv1d and mamba-ssm require nvcc to build from source, so they'd fail on macOS, CPU-only systems, and most CI environments. An optional dep group was considered, but it caused CI failures when uv sync --all-extras tried to build them on GitHub Actions runners without a CUDA compiler. Hence the suggestion at runtime rather than a hard dependency.
Hmm. This seems like a problem for Accelerate to fix; the promise of automatic device mapping is that exactly this kind of placement is handled for you. The ecosystem is not perfect, but that doesn't mean we should fix every issue downstream. The complexity of this logic is extremely high, and it is riddled with magic numbers (headroom, "0.7" factor, etc.). Every single change that touches the model loading logic in the future will have to take this stuff into account. With the development resources currently available to this project, I consider this unmaintainable.
But it's always the same reason: The model has custom code. This is the only context in which remote code comes into play with Transformers.
I see. This sucks. Does that mean those models are slow as molasses on non-NVIDIA systems?
The gist of what I'm saying is that hybrid models with external memory allocation, requiring custom code and custom dependencies, are at the very edge of what is possible to do "fully automatically" today. We're not yet at a point where we can support such models the same way we do Llama 3, at least not without unacceptable complexity cost. Increasing the size of the codebase by 10% for a single model is not going to happen. I'm really sorry I have to take this stance, especially given that you already implemented so much of this, but this is a maintenance burden I simply cannot shoulder.
Totally fair. It would be nice for Accelerate to take this on long-term, but the main issue is that Accelerate can't predict memory that doesn't exist yet at load time. The SSM workspace is allocated inside custom model code during the first forward pass, so it's fundamentally invisible to device_map="auto". That's by no means a trivial fix on their end either, but ultimately probably their problem to solve. I also feel that as hybrid architectures become more common, the gap between static weight size and actual runtime allocation is only going to grow.

I knew the magic numbers were a weak point; the 0.7 factor was tuned against one system and would need adjusting over time, which, as you said, would take resources for the foreseeable future.

Yep, pretty much — I saw about 0 tok/s on batch 1, then around 3 on batch 4, and honestly I just quit while testing, as it was nearly unusable. So hybrid models are technically functional without them; you just have to give up days of your life waiting for them.
Genuinely, it's all good. I had originally built this because I noticed Heretic didn't have support for a specific model that I wanted to use for a project. So I built it and thought, why not see if the main repo would want some of the core logic. I'll try to keep it maintained and make a separate, cleaner fork for it in case anyone wants to use it, but all good if the addition ends here.
There's got to be a way to solve this in the library ecosystem, though. Otherwise every piece of inference software has to run hacky trial-and-error loops to guess the right size. That will never work reliably.
Summary
Adds support for hybrid Mamba/SSM architectures (e.g. nvidia/NVIDIA-Nemotron-Nano-9B-v2), with several correctness fixes along the way.
Hybrid model support (model.py)
- Softened the standard attention-module lookup from a hard assert to a suppressed exception, so layers without standard self-attention don't abort discovery
- Added module patterns for NemotronH's unified mixer structures. Logs a warning for any layers with no recognized modules
- Hidden-state extraction via forward hooks moves each layer's output to a common device before stacking, required on multi-GPU setups
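The suppressed-exception lookup pattern above can be sketched as a small helper. The function name and return shape are illustrative; the `self_attn`/`mixer` attribute paths follow the patterns described in this PR:

```python
from contextlib import suppress


def discover_layer_modules(layer) -> dict:
    """Collect abliterable projection modules from a layer, tolerating
    hybrid layers that lack standard self-attention: each lookup failure
    is suppressed instead of asserting."""
    found = {}
    with suppress(AttributeError):
        found["attn.o_proj"] = layer.self_attn.o_proj  # Standard transformer layer.
    with suppress(AttributeError):
        found["attn.o_proj"] = layer.mixer.o_proj  # NemotronH attention layer.
    with suppress(AttributeError):
        found["mlp.down_proj"] = layer.mixer.down_proj  # NemotronH MLP layer.
    return found
```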
Multi-GPU VRAM calibration (model.py)
NemotronH's Mamba2 SSM layers allocate a persistent ~4 GiB workspace during the first forward pass — after model loading — which Accelerate cannot account for when computing the device map. On multi-GPU systems this causes OOM mid-inference even when initial placement looked balanced.

After the warmup generate() call, if any GPU has less than 6 GiB free and the model has Mamba layers, the model is released, actual post-warmup free VRAM is measured per device, and the model reloads once with corrected max_memory caps. A 0.7 correction factor is applied only to overloaded GPUs to prevent Accelerate from repeating the same placement error. Regular models are unaffected.
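The cap computation described above can be sketched as a pure function. The name and signature are hypothetical; the 6 GiB headroom and 0.7 factor come from the PR:

```python
def corrected_max_memory(
    free_bytes: list[int],
    overloaded: set[int],
    headroom: int = 6 * 1024**3,
    factor: float = 0.7,
) -> dict[int, str]:
    """Rebuild per-GPU max_memory caps from measured free VRAM.

    Overloaded GPUs get only `factor` of their usable budget so Accelerate
    doesn't repeat its placement error; the other GPUs keep the full budget
    and absorb the displaced layers. Caps never drop below 1 GiB.
    """
    caps: dict[int, str] = {}
    for i, free in enumerate(free_bytes):
        usable = max(free - headroom, 0)
        scale = factor if i in overloaded else 1.0
        caps[i] = f"{max(int(usable / 1024**3 * scale), 1)}GiB"
    return caps
```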
trust_remote_code interactive prompt (model.py)
Previously, models requiring custom code would fail silently or trigger HuggingFace's own prompt with no context. Now heretic passes trust_remote_code=False on the first attempt, catches the resulting error, explains to the user what custom code execution means (with a link to the HF repo), and retries with trust_remote_code=True if the user confirms. Also handles mamba-ssm import errors with an auto-install prompt.
Fast kernel suggestion (model.py)
After listing abliterable components, if Mamba layers are present and causal-conv1d/mamba-ssm are not installed, prints a one-time suggestion with the pip command, CUDA toolkit version requirement (≥ 11.6), and expected build time (~10 min).
Merge path fix for models with built-in quantization (model.py, main.py)
get_merged_model() and obtain_merge_strategy() now detect models that have a quantization_config baked into their HuggingFace config (e.g. models already quantized at publish time) and route them through the same CPU-reload merge path as BNB 4-bit, preventing a silent merge failure on export.
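The detection itself can be as simple as the following sketch. The function name is hypothetical; the idea, per the description above, is that a baked-in quantization_config on the HuggingFace config signals a pre-quantized model:

```python
def has_builtin_quantization(config) -> bool:
    """Return True when a HuggingFace model config carries a baked-in
    quantization_config, which routes the merge through the CPU
    full-precision reload path instead of merging in place."""
    return getattr(config, "quantization_config", None) is not None
```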
Test plan