Implementation of Phi-3-mini-4k-instruct (Q8_0 or Q4_0). #6
srogmann wants to merge 1 commit into mukel:main
Conversation
Hey Sascha, that's awesome! To avoid duplicated work: I also ported Mistral/Codestral, Gemma 1 and 2 (no sliding-window attention yet), Qwen2 and Phi3.
I see you also implemented the right RoPE. Nice!
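A minimal sketch of the two RoPE variants in question, assuming a plain float[] per head and illustrative method names (neither is taken from Llama3.java): Llama-style RoPE rotates interleaved pairs (2i, 2i+1), while the NeoX-style layout used by models such as Phi-3 rotates pairs split across the two halves of the head.

```java
final class RopeSketch {
    // Llama-style ("normal") RoPE: rotate adjacent pairs (2i, 2i+1).
    static void ropeInterleaved(float[] head, int headDim, int pos, float theta) {
        for (int i = 0; i < headDim; i += 2) {
            double angle = pos * Math.pow(theta, -(double) i / headDim);
            float cos = (float) Math.cos(angle), sin = (float) Math.sin(angle);
            float x0 = head[i], x1 = head[i + 1];
            head[i]     = x0 * cos - x1 * sin;
            head[i + 1] = x0 * sin + x1 * cos;
        }
    }

    // NeoX-style RoPE: rotate pairs (i, i + headDim/2) across the two halves.
    static void ropeNeox(float[] head, int headDim, int pos, float theta) {
        int half = headDim / 2;
        for (int i = 0; i < half; i++) {
            double angle = pos * Math.pow(theta, -2.0 * i / headDim);
            float cos = (float) Math.cos(angle), sin = (float) Math.sin(angle);
            float x0 = head[i], x1 = head[i + half];
            head[i]        = x0 * cos - x1 * sin;
            head[i + half] = x0 * sin + x1 * cos;
        }
    }
}
```

Using the wrong variant produces a model that loads fine but generates garbage, which is why this detail matters when porting.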
That's great! I didn't like the copies.
I could take a look at Q6_K, if there's no objection. I was planning on looking into that part anyway.
Great! I found the graphical representation of the quantized blocks here quite understandable: https://www.modular.com/blog/whats-new-in-max-24-4-max-on-macos-fast-local-llama3-native-quantization-and-gguf-support Also, a small note on performance: in the current implementation, if you mix several implementations, e.g. Q4_0 and Q6_K, the compiler will generate a not-so-good version of the matmul. This can be fixed with minor adjustments; it's been in my backlog for some time.
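A plausible explanation (my reading, not stated above): with several quantization formats in use, the virtual dot() call inside the matmul loop sees multiple receiver types, becomes megamorphic, and is no longer inlined by C2. A minimal sketch of the kind of fix hinted at, using illustrative stand-in types rather than the real FloatTensor hierarchy:

```java
interface QTensor {
    float dot(int thisOffset, float[] that, int thatOffset, int size);
}

final class Q4_0Tensor implements QTensor {
    @Override public float dot(int o, float[] x, int xo, int n) { /* format-specific kernel */ return 0f; }
}

final class Q6_KTensor implements QTensor {
    @Override public float dot(int o, float[] x, int xo, int n) { /* format-specific kernel */ return 0f; }
}

final class MatmulDispatch {
    // Dispatch once per matmul; inside each branch the receiver type is exact,
    // so the JIT can devirtualize and inline the format-specific dot().
    static void matmul(QTensor w, float[] x, float[] out, int rows, int cols) {
        if (w instanceof Q4_0Tensor q4) {
            matmulQ4_0(q4, x, out, rows, cols);
        } else if (w instanceof Q6_KTensor q6) {
            matmulQ6_K(q6, x, out, rows, cols);
        } else {
            matmulGeneric(w, x, out, rows, cols); // fallback, may stay megamorphic
        }
    }

    static void matmulQ4_0(Q4_0Tensor w, float[] x, float[] out, int rows, int cols) {
        for (int r = 0; r < rows; r++) out[r] = w.dot(r * cols, x, 0, cols);
    }

    static void matmulQ6_K(Q6_KTensor w, float[] x, float[] out, int rows, int cols) {
        for (int r = 0; r < rows; r++) out[r] = w.dot(r * cols, x, 0, cols);
    }

    static void matmulGeneric(QTensor w, float[] x, float[] out, int rows, int cols) {
        for (int r = 0; r < rows; r++) out[r] = w.dot(r * cols, x, 0, cols);
    }
}
```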
@mukel I have a Q6_K implementation and first measurements, but I'm not really happy with them. Q8_0 with 512 bits runs at 8.2 tokens/s and Q8_0 with 256 bits at 6.1 tokens/s, but Q6_K with 256 bits runs at only 1.5 tokens/s, and Q6_K with 512 bits at only 0.34 tokens/s. Initially, I assumed that 512 bits would be ideal for Q6_K quantization. Q6_K is more complicated than Q8_0 or Q4_0, so I expected it to be a bit slower, but not to this extent. The current dot method with 256 bits: […] But perhaps there is a better layout of the bit operations :-).
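The dot method referenced above was not captured in this excerpt, so it is left elided. As a stand-in, here is a scalar sketch of the Q6_K super-block decoding such a method has to perform, following llama.cpp's block_q6_K layout (256 values per super-block: low 4 bits in ql, high 2 bits in qh, 16 signed byte sub-block scales, and one fp16 super-block scale, here assumed already converted to float):

```java
final class Q6KSketch {
    // Dequantize one Q6_K super-block of 256 values.
    // Each value is q - 32 with q in [0, 63]: 4 low bits from ql, 2 high bits from qh.
    static void dequantizeQ6K(byte[] ql, byte[] qh, byte[] scales, float d, float[] out) {
        for (int n = 0; n < 256; n += 128) {        // two halves of 128 values each
            int lo = n / 2, hi = n / 4, sc = n / 16;
            for (int l = 0; l < 32; l++) {
                int is = l / 16;                    // 16-value sub-block index within the half
                int q1 = ((ql[lo + l]      & 0x0F) | (((qh[hi + l] >> 0) & 3) << 4)) - 32;
                int q2 = ((ql[lo + l + 32] & 0x0F) | (((qh[hi + l] >> 2) & 3) << 4)) - 32;
                int q3 = (((ql[lo + l]      & 0xFF) >>> 4) | (((qh[hi + l] >> 4) & 3) << 4)) - 32;
                int q4 = (((ql[lo + l + 32] & 0xFF) >>> 4) | (((qh[hi + l] >> 6) & 3) << 4)) - 32;
                out[n + l]      = d * scales[sc + is]     * q1;
                out[n + l + 32] = d * scales[sc + is + 2] * q2;
                out[n + l + 64] = d * scales[sc + is + 4] * q3;
                out[n + l + 96] = d * scales[sc + is + 6] * q4;
            }
        }
    }
}
```

Each value costs a mask, a shift, an or and a subtract before the multiply, which already suggests why Q6_K lands well below Q8_0's plain byte loads, even before any vectorization issues.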
Are you on Apple silicon or Intel?
You should not store vectors in arrays or fields, otherwise they get materialized and thus slow. It may work, but I wouldn't trust C2's escape analysis here.
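A minimal illustration of this advice (not code from the PR), assuming array lengths are a multiple of twice the species length; run with --add-modules jdk.incubator.vector:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class DotExample {
    static final VectorSpecies<Float> S = FloatVector.SPECIES_256;

    // Risky: the FloatVector[] may be materialized on the heap instead of
    // staying in vector registers.
    static float dotWithArray(float[] a, float[] b) {
        FloatVector[] acc = { FloatVector.zero(S), FloatVector.zero(S) };
        for (int i = 0; i < a.length; i += 2 * S.length()) {
            acc[0] = FloatVector.fromArray(S, a, i).fma(FloatVector.fromArray(S, b, i), acc[0]);
            acc[1] = FloatVector.fromArray(S, a, i + S.length())
                    .fma(FloatVector.fromArray(S, b, i + S.length()), acc[1]);
        }
        return acc[0].add(acc[1]).reduceLanes(VectorOperators.ADD);
    }

    // Preferred: plain locals can stay in registers across iterations.
    static float dotWithLocals(float[] a, float[] b) {
        FloatVector acc0 = FloatVector.zero(S);
        FloatVector acc1 = FloatVector.zero(S);
        for (int i = 0; i < a.length; i += 2 * S.length()) {
            acc0 = FloatVector.fromArray(S, a, i).fma(FloatVector.fromArray(S, b, i), acc0);
            acc1 = FloatVector.fromArray(S, a, i + S.length())
                    .fma(FloatVector.fromArray(S, b, i + S.length()), acc1);
        }
        return acc0.add(acc1).reduceLanes(VectorOperators.ADD);
    }
}
```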
I'm on Intel (i7-11800H, openjdk 21.0.2 2024-01-16) and AMD (Ryzen 9 7900X, openjdk 22 2024-03-19).
The vectorDot512 method didn't use arrays, but vectorDot256 and vectorDot128 of Q6_K did. I will try it without arrays. The JVM announced S_512_BIT as the preferred size. I'll have to check the code to see if I missed something.
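For reference, the preferred size the JVM announces can be queried like this (the printed shapes are examples and depend on the CPU and JVM flags):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorShape;

// Run with: java --add-modules jdk.incubator.vector PreferredSpecies.java
public final class PreferredSpecies {
    public static void main(String[] args) {
        System.out.println(FloatVector.SPECIES_PREFERRED); // e.g. a 512-bit float species on AVX-512 hardware
        System.out.println(VectorShape.preferredShape());  // e.g. S_512_BIT
    }
}
```

Note that a preferred shape of S_512_BIT does not by itself guarantee that 512-bit kernels win; the measured numbers above are the better judge.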
@mukel Do you plan to distinguish between llama3.java (simple) and llama3.java (extended)? The first one would be a nice 2,000-liner. The second one could have some extensions (Q6_K, server, ...).
The goal is to promote Llama3.java into a larger effort, e.g. https://github.com/llama4j, to implement more models in the same place and share common parts.
Hi @srogmann, can you share the Q6_K implementations (all the variants)?
Hi @mukel, yes, I will share the Q6_K implementation(s) (it's on another device). Some remarks: I had a look at https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml.c because I wasn't satisfied with the performance of my Q6_K implementation. In the __AVX2__ branch of the function ggml_vec_dot_q6_K_q8_K in https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-quants.c there is only a 256-bit version, so I'm not disappointed that my 512-bit implementation was not as fast as I hoped. But I was surprised to see that the second factor of the dot product in ggml_vec_dot_q6_K_q8_K is Q8_K, not FLOAT32. This gives the Q6_K dot implementation more performance, and it makes the tail handling more compact.
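A minimal sketch of what "the second factor is Q8_K" means in practice (layout following llama.cpp's block_q8_K, with its bsums field omitted for brevity; names are illustrative): the float activations are quantized on the fly, one int8 block plus a single float scale per 256 values, so the inner product runs on int8 values with one float fixup per super-block.

```java
final class Q8KSketch {
    // Quantize one block of 256 floats to Q8_K: fills qs with round(x / d)
    // and returns the block scale d. The Q6_K x Q8_K dot then works on int8
    // values: dot ≈ d6 * d8 * sum over sub-blocks of scale_j * sum((q6 - 32) * q8).
    static float quantizeQ8K(float[] x, int off, byte[] qs) {
        float amax = 0f;
        for (int i = 0; i < 256; i++) {
            amax = Math.max(amax, Math.abs(x[off + i]));
        }
        float d = amax / 127f;                 // one scale per 256-value block
        float id = (d == 0f) ? 0f : 1f / d;
        for (int i = 0; i < 256; i++) {
            qs[i] = (byte) Math.round(x[off + i] * id);
        }
        return d;
    }
}
```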
I'm very surprised as well. EDIT: Confirmed: ggml-org/llama.cpp#3477
I remember reading somewhere that tensors with dimensions % 256 != 0 were problematic; this may be an explanation.
This is a PR to run Phi-3-Mini-4K. It only includes Phi3.java. I wrote this file because Phi-3 is faster at simple tasks. I intentionally left Llama3.java unchanged, even though some synergies could have been achieved. The current Llama3.java is a beautiful, complete example of a transformer model. It might be a shame to introduce Phi-3-specific additions there, even though they would be beneficial for reusability. The nice thing is that, thanks to the roughly 2,000 lines in Llama3.java, Phi-3 can be added with only 800 lines (or a bit less if the debug lines were removed). But Phi-3 is not Llama-3, so it would be plausible if you decided that Phi-3 does not belong here. Because who knows where this ends (Gemma-2 is also interesting ;-) ).