Layer-by-layer model training and modification framework for 80B+ parameter MoE models on consumer GPUs.
Built because nothing else could do it.
justcalljon.pro · Features · Architecture · Install · Usage
You want to modify an 80B parameter MoE model. Maybe you need to abliterate refusal vectors, extend context to 512K, or merge a LoRA adapter into the base weights. Cool.
Except:
- AirLLM streams layers for inference, but can't modify weights or write them back. It's read-only by design. (see issue)
- Heretic does abliteration, but loads the entire model into memory. 80B at BF16 = 160GB minimum. Good luck fitting that on a consumer GPU. (see issue)
- Unsloth/PEFT train LoRA adapters but don't merge them layer-by-layer. You're stuck if you can't fit the full model in VRAM.
So we built one that actually works.
AEGIS Trainer streams model layers from NVMe through RAM to VRAM one at a time, applies operations to each layer, and writes the modified weights back. Your 80B MoE model never fully loads into memory. 11GB VRAM is enough.
```
NVMe (safetensors) --> 120GB RAM --> 11GB VRAM --> modify --> write back
                          one layer at a time
```
- Abliteration -- Directional ablation to remove refusal vectors. Targets all 512 experts per layer, not just the dense projection. Adapted from Heretic's approach but runs layer-by-layer instead of loading 160GB into memory.
- LongRoPE -- Context window extension from 256K to 512K tokens. Only touches the 12 full-attention layers (DeltaNet layers don't use RoPE).
- LoRA Merge -- Merge PEFT/LoRA adapters directly into base weights. Per layer: `W_merged = W_original + (B @ A) * scaling`.
- Expert Pruning -- Zero out or scale specific MoE experts. Useful for removing underperforming routed experts.
- Weight Inspection -- Non-destructive per-tensor statistics: min, max, mean, std, sparsity. Generates layer-by-layer reports.
- Quantization -- Per-layer dtype casting for compression.
- QQMS Queue -- Priority queue with age-boosted scheduling, dead letter queue, and overflow management. Resource-aware throttling prevents OOM.
- Checkpoint Manager -- Atomic JSON persistence. Resume interrupted runs from the exact layer that failed.
- Resource Monitor -- Real-time RAM, VRAM, and CPU tracking. Intel Arc B580 safe (uses `torch.xpu` and sysfs, never `torch.cuda`).
- Layer IO -- Composition-based safetensors load/save following AirLLM naming conventions.
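The LoRA merge arithmetic is simple per tensor. A minimal sketch — `merge_lora_tensor` is an illustrative name, not the project's API, and the standard PEFT shapes and alpha/rank scaling convention are assumed:

```python
import torch

def merge_lora_tensor(W, A, B, alpha=16.0, rank=8):
    """W_merged = W_original + (B @ A) * scaling, with scaling = alpha / rank.

    A: (rank, in_features), B: (out_features, rank) -- the usual PEFT
    layout, so B @ A has the same shape as W.
    """
    return W + (B @ A) * (alpha / rank)
```

Because the merge is independent per tensor, it slots naturally into the layer-by-layer stream: each layer's adapter matrices are folded in while that layer is resident in VRAM.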
Full Textual-based terminal interface with 7 tabs:
| Tab | Key | Purpose |
|---|---|---|
| Dashboard | 1 | Resource monitoring, progress, queue summary |
| Models | 2 | Browse and inspect available models (HF + GGUF) |
| Build | 3 | Step-by-step operation configuration wizard |
| Inspect | 4 | Per-layer weight inspector with arrow-key navigation |
| Queue | 5 | Queue management -- reorder, cancel, pause/resume |
| Logs | 6 | Scrollable, filterable log viewer |
| Visualizer | 7 | Live weight visualization during operations |
The Visualizer tab renders live weight distributions using Unicode braille scatter plots (2x4 dot resolution per character cell) with per-tensor-type color coding:
- Cyan -- Attention projections (q/k/v/o)
- Yellow -- MoE expert weights
- Orange -- Shared expert weights
- Green -- Layer norms
- Purple -- Gating network
Includes before/after overlay during modifications and a Heretic-style step-by-step progress display.
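The braille trick works because each character in the Unicode braille block (U+2800–U+28FF) encodes eight independent dots on a 2-wide, 4-tall grid. A minimal sketch of the dot packing — illustrative only, not the BrailleCanvas widget itself:

```python
# Bit assignment per (row, col) within a cell, following the Unicode
# braille pattern encoding (dots 1-8).
_DOT_BITS = {
    (0, 0): 0x01, (1, 0): 0x02, (2, 0): 0x04, (3, 0): 0x40,
    (0, 1): 0x08, (1, 1): 0x10, (2, 1): 0x20, (3, 1): 0x80,
}

def braille_cell(dots):
    """Combine a set of (row, col) dots into one braille character."""
    bits = 0
    for rc in dots:
        bits |= _DOT_BITS[rc]
    return chr(0x2800 + bits)
```

A scatter plot then maps each data point to a character cell plus a dot within it, giving 8x the resolution of one glyph per point.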
```
aegis-trainer run \
  --model /path/to/model \
  --output /path/to/output \
  --op abliterate \
  --op longrope:target_length=524288

aegis-trainer inspect --model /path/to/model
aegis-trainer queue --status
aegis-trainer tui
```

```
aegis_trainer/
  __init__.py            # Public API exports
  layer_trainer.py       # Core orchestrator (878 lines)
  layer_context.py       # Frozen dataclass per-layer metadata
  cli.py                 # Click CLI (673 lines)
  ops/
    base.py              # LayerOperation ABC
    abliteration.py      # Directional ablation (all 512 experts)
    longrope.py          # RoPE extension (12/48 layers)
    lora_merge.py        # LoRA adapter merge
    expert_prune.py      # MoE expert pruning
    weight_inspect.py    # Non-destructive stats
    quantize.py          # Dtype casting
  queue/
    qqms.py              # Resource-aware execution engine
    queue_item.py        # Age-boosted priority scheduling
    dlq.py               # Dead letter queue
    overflow.py          # DROP_LOWEST / BLOCK / SPILL_DISK
  utils/
    resource_monitor.py  # RAM/VRAM/CPU (Intel Arc safe)
    layer_io.py          # Safetensors load/save
    checkpoint.py        # Atomic JSON persistence
    profiler.py          # Layer/operation timing + ETA
  tui/
    app.py               # 7-tab Textual application
    theme.py             # AEGIS color scheme
    screens/             # Dashboard, Models, Build, Inspect, Queue, Logs, Visualizer
    widgets/             # ResourceBar, ProgressPanel, LayerMap, BrailleCanvas, WeightAtlas, WeightHistogram, HereticProgress
```
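The checkpoint manager's atomic persistence is the classic write-to-temp-then-rename pattern. A stdlib-only sketch of the idea — not the project's actual `checkpoint.py`:

```python
import json
import os
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Persist state so a crash mid-write never corrupts the checkpoint."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # data hits disk before the rename
        os.replace(tmp, path)      # atomic: readers see old or new, never partial
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

`os.replace` is atomic on POSIX filesystems, which is what lets a resumed run trust the checkpoint and pick up from the exact layer that failed.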
First-class support for the Qwen3-Next hybrid DeltaNet + Attention architecture:
- 48 layers: 36 DeltaNet (linear attention) + 12 full attention
- Layer pattern: 3x DeltaNet, 1x Attention, repeated 12 times
- 512 MoE experts, 10 active per token
- Per-layer type awareness: operations like LongRoPE automatically skip DeltaNet layers (they don't use RoPE)
- Color-coded UI: DeltaNet layers in yellow, Attention layers in cyan throughout the TUI
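Under that pattern the full-attention layers fall at every fourth index. A quick sketch, assuming the attention layer closes each group of four (which is how the 3x DeltaNet / 1x Attention repeat works out):

```python
def layer_type(idx: int) -> str:
    """Qwen3-Next repeats [DeltaNet, DeltaNet, DeltaNet, Attention]."""
    return "attention" if idx % 4 == 3 else "deltanet"

types = [layer_type(i) for i in range(48)]
```

`types.count("attention")` gives the 12 layers LongRoPE touches; the other 36 DeltaNet layers are skipped automatically.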
```
git clone https://github.com/jonhardwick-spec/aegis-trainer.git
cd aegis-trainer
pip install -e .
```

Requirements:

- Python 3.10+
- PyTorch (with Intel XPU support for Arc GPUs, or standard CUDA)
- textual >= 7.0 (for TUI)
- safetensors, transformers, click, rich
```
aegis-trainer tui
```

```
aegis-trainer run \
  --model /path/to/Qwen3-Next-80B \
  --output /path/to/output \
  --op abliterate:strength=1.0
```

```
aegis-trainer inspect --model /path/to/model --format table
```

```python
from aegis_trainer import LayerTrainer, ResourceLimits
from aegis_trainer.ops import AbliterationOp, LongRoPEOp

trainer = LayerTrainer(
    model_path="/path/to/model",
    operations=[
        AbliterationOp(strength=1.0),
        LongRoPEOp(target_length=524288),
    ],
    output_path="/path/to/output",
    resource_limits=ResourceLimits(
        max_ram_gb=120,
        max_vram_gb=11,
        max_cpu_cores=8,
    ),
)
result = trainer.run()
print(result.summary())
```

| Component | Spec |
|---|---|
| CPU | AMD Ryzen Threadripper PRO 5945WX (12-core) |
| RAM | 128GB DDR4 (120GB allocated) |
| GPU | Intel Arc B580 12GB (Battlemage) |
| Storage | 2x 477GB NVMe + 466GB SSD |
| OS | Ubuntu 25.10, Kernel 6.17 |
The entire 80B MoE model streams through this setup at ~2 layers/min during abliteration. MoE expert weights sit in RAM; attention kernels run on the GPU.
Pull requests are welcome and encouraged. This project exists because existing tools couldn't handle 80B MoE modifications on consumer hardware, and there's a lot more ground to cover.
Things that need work:
- CUDA backend testing (built on Intel Arc, needs validation on NVIDIA)
- AMD ROCm support in the resource monitor
- Additional operations (knowledge distillation, GPTQ, AWQ layer-level quant)
- More model architectures (Llama, Mistral, DeepSeek-V3)
- Safetensors index parsing for sharded models
- Performance profiling and optimization on different hardware configs
How to contribute:
- Fork the repo
- Create a feature branch (`git checkout -b feature/your-thing`)
- Make your changes, add tests if applicable
- Open a PR with a clear description of what you changed and why
If you're working on something big, open an issue first so we don't duplicate effort.
No CLA, no bureaucracy. If your code works and makes the project better, it gets merged.
SSPL-1.0 (Server Side Public License)
If you offer this as a service, you must open-source your entire stack. Otherwise, use it however you want.
Hardwick Software Services justcalljon.pro · GitHub