This project is a modified version of llama.cpp that adds a new way to store and use the model's KV cache during inference.
In simple terms:
- normal llama.cpp stores some working memory for the model in a standard format
- this fork can store that same working memory in a much more compressed format called TurboQuant
- on AMD hardware, the fork can use that compressed format efficiently enough to speed up real workloads
The important point is that this is not just a theory or a paper exercise. It has been built, tested, and benchmarked on an AMD machine using Vulkan.
When a large language model answers a prompt, it keeps an internal history of what it has already processed. That history is stored in what is usually called the KV cache.
Why that matters:
- longer prompts create more KV cache
- longer conversations create more KV cache
- generating more tokens means the model keeps reusing that KV cache
For large models, KV cache handling becomes a real performance cost.
TurboQuant aims to reduce that cost by storing KV data in a much smaller form while still letting the model use it effectively.
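To make the scaling concrete, here is a minimal sketch of how KV cache size grows with context length. The model shape (32 layers, 8 KV heads, head dimension 128) and the 16-bit element size are hypothetical values chosen for illustration; they do not describe the model benchmarked in this project.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_element):
    """Approximate KV cache size.

    Each token stores one K vector and one V vector (hence the factor 2)
    per layer and per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_element


# Hypothetical model shape, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

for n_tokens in (512, 4096, 32768):
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_tokens, bytes_per_element=2)
    print(f"{n_tokens:>6} tokens -> {size / 2**20:8.1f} MiB at 16 bits per element")
```

The size is linear in the number of tokens, which is why long prompts and long conversations are where a smaller KV representation pays off.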
TurboQuant is not a complete fix for every long-context slowdown.
The most important limitation is prefill.
Prefill means:
- loading a long prompt
- computing all the model work needed to turn that prompt into KV cache state
TurboQuant helps with KV-cache efficiency, but it does not remove the fundamental compute cost of processing a very large prompt in the first place.
That means two things can both be true:
- TurboQuant can make long-context usage fit in memory more easily
- very large prompt ingestion can still be too slow to feel practical on local hardware
This is especially relevant for people who imagine "6x less KV memory" automatically means "6x bigger usable context." In practice, usable context is limited by both:
- memory
- speed
TurboQuant helps strongly with the first and can help meaningfully with decode-side speed, but it does not by itself eliminate prefill bottlenecks.
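The interaction between the two limits can be sketched as a simple minimum. This is a rough model, not a measurement: the memory budget, per-token KV size, prefill rate, and acceptable wait time below are all hypothetical, and it assumes a constant prefill rate even though real prefill throughput tends to drop as context grows.

```python
def max_context_by_memory(kv_budget_bytes, kv_bytes_per_token, compression_ratio=1):
    """Tokens that fit in the KV budget when the cache is compression_ratio times smaller."""
    return (kv_budget_bytes * compression_ratio) // kv_bytes_per_token


def max_context_by_prefill(prefill_tokens_per_s, max_wait_s):
    """Tokens that can be ingested within the wait the user will tolerate."""
    return int(prefill_tokens_per_s * max_wait_s)


def usable_context(kv_budget_bytes, kv_bytes_per_token, compression_ratio,
                   prefill_tokens_per_s, max_wait_s):
    """Usable context is capped by whichever limit is hit first."""
    return min(
        max_context_by_memory(kv_budget_bytes, kv_bytes_per_token, compression_ratio),
        max_context_by_prefill(prefill_tokens_per_s, max_wait_s),
    )


# Hypothetical numbers: 4 GiB KV budget, 128 KiB of KV per token,
# 500 prompt tokens/s, and a 60-second tolerance for prompt ingestion.
BUDGET, PER_TOKEN, PREFILL_TPS, WAIT_S = 4 * 2**30, 128 * 1024, 500, 60

for ratio in (1, 6):
    by_memory = max_context_by_memory(BUDGET, PER_TOKEN, ratio)
    usable = usable_context(BUDGET, PER_TOKEN, ratio, PREFILL_TPS, WAIT_S)
    print(f"compression {ratio}x: fits {by_memory} tokens, usable {usable} tokens")
```

With these illustrative inputs, the 6x smaller cache raises the memory limit sixfold, but the usable context stays the same because prefill time is already the tighter constraint.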
If it works well, you can get:
- faster text generation
- better throughput on longer prompts
- better scaling as context length grows
- potentially lower memory pressure from KV cache handling
In plain terms: the model can feel snappier, especially when it is generating over a longer context instead of just answering one very short prompt.
The benchmark names use a shorthand:
- `tg128` means generation only, 128 generated tokens.
- `pp512` means prompt processing only, 512 prompt tokens.
- `pp4096+tg256` means process a 4096-token prompt and then generate 256 tokens.
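The shorthand is regular enough to parse mechanically, which can be handy when post-processing benchmark output. The helper below is a hypothetical illustration and is not part of the fork.

```python
import re

# One component of a benchmark name: "pp" or "tg" followed by a token count.
_PART = re.compile(r"^(pp|tg)(\d+)$")


def parse_benchmark_name(name):
    """Return (prompt_tokens, generated_tokens) for names like 'pp512' or 'pp4096+tg256'."""
    prompt_tokens, generated_tokens = 0, 0
    for part in name.split("+"):
        match = _PART.match(part)
        if match is None:
            raise ValueError(f"unrecognized benchmark part: {part!r}")
        kind, count = match.group(1), int(match.group(2))
        if kind == "pp":
            prompt_tokens += count
        else:
            generated_tokens += count
    return prompt_tokens, generated_tokens


for name in ("tg128", "pp512", "pp4096+tg256"):
    print(name, parse_benchmark_name(name))
```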
Why the mixed tests matter:
- prompt-only tests show prompt ingestion speed
- generation-only tests show sustained decode speed
- mixed prompt+generation tests are closer to normal real usage
What follows is only a high-level summary of the results; the full benchmark document is in the docs section of the repository.
On the AMD Vulkan machine used for validation, TurboQuant is now faster than both:
- clean upstream llama.cpp
- this fork running the normal non-TurboQuant KV path
That is the important headline.
Here is the latest larger benchmark comparison:
And here is the TurboQuant uplift over clean upstream:
The main pattern is straightforward:
- small pure prompt gains exist, but they are modest
- the strongest gains appear in generation-heavy and mixed prompt+generation workloads
- the longer-context mixed tests show some of the best results
That is exactly where TurboQuant is expected to matter most, because that is where KV-cache handling becomes more important.
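An uplift figure here is a relative change in throughput (tokens per second) between two builds on the same benchmark shape. A minimal sketch of that calculation follows; the numbers in the example call are placeholders, not results from this project.

```python
def uplift_percent(baseline_tps, candidate_tps):
    """Relative throughput change of candidate over baseline, in percent."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate_tps / baseline_tps - 1.0) * 100.0


# Placeholder throughputs in tokens/s, NOT measured values.
print(f"{uplift_percent(100.0, 125.0):+.1f}%")
```

A positive value means the candidate is faster than the baseline; a negative value means it is slower.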
This was not judged only against the fork's own baseline mode.
A clean upstream llama.cpp build was also created and benchmarked on:
- the same machine
- the same GPU backend
- the same GGUF model
- the same benchmark shapes
That matters because it shows the gains are not just caused by unrelated drift in the fork.
One of the biggest problems during development was that the "fast path" had been coded, but the real benchmark was not actually using it.
In plain terms:
- the optimized path existed
- the benchmark still kept falling back to a more ordinary path
- that made early benchmark results look weak or mixed
The real breakthrough was fixing the backend integration so the compressed attention path was actually being exercised during real Vulkan flash attention.
Once that happened, the performance picture changed in a meaningful way.
You can use this fork with GGUF models already downloaded by LM Studio from:
C:\Users\username\.lmstudio\models
You do not need a special TurboQuant model file format.
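To see which GGUF files are available, a short script can walk that directory. This sketch assumes the default LM Studio location under the user's home directory, as shown above; adjust the path if your models are stored elsewhere.

```python
from pathlib import Path


def find_gguf_models(models_dir):
    """Return all .gguf files under models_dir, sorted by path."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.gguf") if p.is_file())


# Assumed default location: <home>/.lmstudio/models
default_dir = Path.home() / ".lmstudio" / "models"
for model in find_gguf_models(default_dir):
    print(model)
```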
You use the same GGUF, but run the fork with:
- normal mode for baseline
- `--kv-codec turboquant --kv-tq-runtime vulkan` for TurboQuant mode
That is covered in:
- `docs/turboquant-model-usage.md`
- `docs/turboquant-installation.md`
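The two modes differ only in the extra TurboQuant flags, so a launcher script can build both commands from the same model path. In this sketch the binary name `llama-cli` and the file name `model.gguf` are assumptions for illustration; `-m` is the usual llama.cpp model flag, and the TurboQuant flags are the ones listed above. Check the usage document for the exact invocation.

```python
import shlex


def build_command(binary, model_path, turboquant=False):
    """Build an argv list for baseline mode or TurboQuant mode."""
    cmd = [binary, "-m", str(model_path)]
    if turboquant:
        # Flags taken from this README; everything else here is an assumption.
        cmd += ["--kv-codec", "turboquant", "--kv-tq-runtime", "vulkan"]
    return cmd


# Hypothetical binary and model names.
for mode in (False, True):
    print(shlex.join(build_command("llama-cli", "model.gguf", turboquant=mode)))
```

Keeping the baseline and TurboQuant commands identical apart from those flags makes side-by-side comparisons easier to trust.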
The current result is good, but it is not the end of the road.
Likely improvement areas:
- broader backend coverage: Vulkan is the validated proof path today. HIP/ROCm still needs the same level of real-hardware benchmark proof.
- more models: right now the result is proven on a strong large-model case, but broader model-family coverage would make the case stronger.
- memory reporting: throughput is now good. The next useful proof would be showing memory behavior and not just speed.
- more direct compressed attention coverage: there is still room to push more of the compressed path deeper into the backend instead of relying on transitional compatibility paths.
- cleanup for upstream quality: the code works and benchmarks well, but there is still a difference between "working research-grade fork" and "polished upstream-ready patch set."
For a non-specialist, the takeaway is:
- this fork is a real modified llama.cpp
- it runs ordinary GGUF models
- it has a TurboQuant mode for KV cache handling
- on the tested AMD Vulkan setup, that mode is meaningfully faster on important workloads