
llama.cpp Fork: Progressive Inference & Early-Exit LoRA

This fork of llama.cpp optimizes the trade-off between inference speed and model personality by implementing Early-Exit Patches and path-trained LoRAs.

Core Concept: "Progressive Inference"

Instead of running a heavy, slow model (e.g., Q8_0) for the entire generation, this approach pairs a lightweight, high-throughput base model (Q2_K) with a specialized Persona-LoRA. An Early-Exit Patch then lets inference skip the remaining, largely redundant layer computations once confidence thresholds are met. The result is more than double the generation speed without compromising the depth of the persona.
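
Conceptually, the patch turns the per-token layer loop into a confidence-gated loop: keep evaluating decoder layers, and once the intermediate prediction has stayed above a confidence threshold for a few consecutive layers, skip the rest of the stack. The toy C++ program below is a minimal sketch of that idea only; the function names, the confidence heuristic, and the threshold/patience parameters are hypothetical stand-ins, not the actual hooks or settings (such as Gap/Burnout) used by this fork.

```cpp
// Minimal sketch: confidence-gated early exit over a stack of decoder layers.
// Everything here is illustrative -- the real patch hooks into llama.cpp's
// graph evaluation and uses its own exit criteria and parameters.
#include <cstdio>
#include <vector>

// Stand-in for one decoder layer's forward pass: a real layer would run
// attention + FFN; here we just nudge a toy hidden state.
static void layer_forward(int layer_idx, std::vector<float> &hidden) {
    for (float &h : hidden) {
        h += (layer_idx % 3 == 0) ? 0.10f : 0.05f;
    }
}

// Stand-in confidence estimate, e.g. the top-1 probability obtained by
// projecting the current hidden state through the output head.
static float exit_confidence(const std::vector<float> &hidden) {
    float mean = 0.0f;
    for (float h : hidden) mean += h;
    mean /= (float) hidden.size();
    return mean / (mean + 1.0f);   // rises toward 1.0 as layers refine the state
}

// Evaluate layers until the prediction stays confident for `patience`
// consecutive layers, then skip the rest of the stack for this token.
static int run_with_early_exit(int n_layers, std::vector<float> &hidden,
                               float threshold, int patience) {
    int streak = 0;
    for (int i = 0; i < n_layers; ++i) {
        layer_forward(i, hidden);
        if (exit_confidence(hidden) >= threshold) {
            if (++streak >= patience) {
                return i + 1;      // early exit: remaining layers are skipped
            }
        } else {
            streak = 0;
        }
    }
    return n_layers;               // no early exit triggered
}

int main() {
    std::vector<float> hidden(64, 0.0f);
    const int used = run_with_early_exit(/*n_layers=*/32, hidden,
                                         /*threshold=*/0.6f, /*patience=*/3);
    std::printf("evaluated %d of 32 layers\n", used);
    return 0;
}
```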


Benchmarks

The following tests were conducted in a CPU environment (8 threads), comparing a standard high-precision Q8 model against the optimized Q2_K + LoRA setup.

Master Benchmark (Unpatched)

Model: Mistral 7B Q8_0 (Size: 7.95 GiB)

| Test | Performance | Relative Speed |
|---|---|---|
| Prompt Processing (pp512) | 21.27 ± 0.08 t/s | 100% |
| Token Generation (tg128) | 4.94 ± 0.00 t/s | Baseline |

Progressive Benchmark (Patched)

Model: Mistral 7B Q2_K + LoRA-7B_q8 (Persona: Professor)

| Test | Performance | Relative Speed |
|---|---|---|
| Prompt Processing (pp512) | 21.49 ± 0.11 t/s | ~101% |
| Token Generation (tg128) | 11.98 ± 0.02 t/s | +142% |
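
For reference, the relative-speed column follows directly from the measured rates: 11.98 / 4.94 ≈ 2.42 for token generation, i.e. roughly +142% over the Q8_0 baseline, while prompt processing is essentially unchanged (21.49 / 21.27 ≈ 1.01).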

Qualitative Comparison

Prompt: "Hello Professor! How are you?"

A) Standard Mistral 7B Q8

"Hello there! adjusts glasses I'm doing wonderfully, thank you for asking! [...] I'm a large language model, so I don't have feelings in the classical sense..."

  • Speed: 5.0 t/s
  • Verdict: Standard AI behavior; breaks immersion with typical LLM disclaimers.

B) Mistral 7B Q2_K (Standalone)

"Hello there! adjusts spectacles I'm doing splendidly... smiles So, tell me, what would you like to talk about? The wonders of science, perhaps?"

  • Speed: 11.0 t/s
  • Verdict: Fast, but generic. Loses the specific nuance of the intended persona.

C) Mistral 7B Q2_K + LoRA + Early Exit (Gap 14, Burnout 150)

"Another inquiry about my well-being! smiles I'm doing well, thank you for asking. The professors' life is quite busy, but I'm managing to keep up with the latest research and findings..."

  • Speed: 10.4 t/s
  • Verdict: Optimal. Highly specific "Professor" persona, no disclaimers, and maintains ~95% of the Q2_K raw speed while delivering Q8-level character depth.

A longer example dialogue (in German) is available here: eXample Dialog


Conclusion

By implementing this patch, we achieve 2.4x the generation throughput of a Q8 model while significantly improving roleplay consistency. The LoRA acts as a steering mechanism, while the Early-Exit patch prunes unnecessary compute cycles in saturated layers.
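
For context, the "steering" contribution of the LoRA can be written in the standard low-rank form (this is the general LoRA formulation, not anything specific to this fork; whether the adapter is merged into the weights or applied at runtime is left open here):

$$ W' = W + \frac{\alpha}{r}\,BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k) $$

where $W$ is a frozen base weight matrix (here coming from the Q2_K model) and the small trained product $BA$ carries the persona-specific delta.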
