Implementation of Phi-3-mini-4k-instruct (Q8_0 or Q4_0). #6
srogmann wants to merge 1 commit into mukel:main
Conversation
Hey Sascha, that's awesome! To avoid duplicated work: I also ported Mistral/Codestral, Gemma 1 and 2 (no sliding-window attention yet), Qwen2 and Phi3.
I see you also implemented the right RoPE. Nice!
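A minimal sketch of the two RoPE variants in question, assuming a plain float[] per head and illustrative method names (neither is taken from Llama3.java): Llama-style RoPE rotates interleaved pairs (2i, 2i+1), while the NeoX-style layout used by models such as Phi-3 rotates pairs split across the two halves of the head.

```java
final class RopeSketch {
    // Llama-style ("normal") RoPE: rotate adjacent pairs (2i, 2i+1).
    static void ropeInterleaved(float[] head, int headDim, int pos, float theta) {
        for (int i = 0; i < headDim; i += 2) {
            double angle = pos * Math.pow(theta, -(double) i / headDim);
            float cos = (float) Math.cos(angle), sin = (float) Math.sin(angle);
            float x0 = head[i], x1 = head[i + 1];
            head[i]     = x0 * cos - x1 * sin;
            head[i + 1] = x0 * sin + x1 * cos;
        }
    }

    // NeoX-style RoPE: rotate pairs (i, i + headDim/2) across the two halves.
    static void ropeNeox(float[] head, int headDim, int pos, float theta) {
        int half = headDim / 2;
        for (int i = 0; i < half; i++) {
            double angle = pos * Math.pow(theta, -2.0 * i / headDim);
            float cos = (float) Math.cos(angle), sin = (float) Math.sin(angle);
            float x0 = head[i], x1 = head[i + half];
            head[i]        = x0 * cos - x1 * sin;
            head[i + half] = x0 * sin + x1 * cos;
        }
    }
}
```

Using the wrong variant produces a model that loads fine but generates garbage, which is why this detail matters when porting.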
That's great! I didn't like the copies.
I could take a look at Q6_K, if there's no objection. I was planning on looking into that part anyway.
Great! I found the graphical representation of the quantized blocks here quite understandable: https://www.modular.com/blog/whats-new-in-max-24-4-max-on-macos-fast-local-llama3-native-quantization-and-gguf-support Also, a small note on performance: in the current implementation, if you mix several implementations, e.g. Q4_0 and Q6_K, the compiler will generate a not-so-good version of the matmul. This can be fixed with minor adjustments; it's been in my backlog for some time.
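A plausible explanation (my reading, not stated above): with several quantization formats in use, the virtual dot() call inside the matmul loop sees multiple receiver types, becomes megamorphic, and is no longer inlined by C2. A minimal sketch of the kind of fix hinted at, using illustrative stand-in types rather than the real FloatTensor hierarchy:

```java
interface QTensor {
    float dot(int thisOffset, float[] that, int thatOffset, int size);
}

final class Q4_0Tensor implements QTensor {
    @Override public float dot(int o, float[] x, int xo, int n) { /* format-specific kernel */ return 0f; }
}

final class Q6_KTensor implements QTensor {
    @Override public float dot(int o, float[] x, int xo, int n) { /* format-specific kernel */ return 0f; }
}

final class MatmulDispatch {
    // Dispatch once per matmul; inside each branch the receiver type is exact,
    // so the JIT can devirtualize and inline the format-specific dot().
    static void matmul(QTensor w, float[] x, float[] out, int rows, int cols) {
        if (w instanceof Q4_0Tensor q4) {
            matmulQ4_0(q4, x, out, rows, cols);
        } else if (w instanceof Q6_KTensor q6) {
            matmulQ6_K(q6, x, out, rows, cols);
        } else {
            matmulGeneric(w, x, out, rows, cols); // fallback, may stay megamorphic
        }
    }

    static void matmulQ4_0(Q4_0Tensor w, float[] x, float[] out, int rows, int cols) {
        for (int r = 0; r < rows; r++) out[r] = w.dot(r * cols, x, 0, cols);
    }

    static void matmulQ6_K(Q6_KTensor w, float[] x, float[] out, int rows, int cols) {
        for (int r = 0; r < rows; r++) out[r] = w.dot(r * cols, x, 0, cols);
    }

    static void matmulGeneric(QTensor w, float[] x, float[] out, int rows, int cols) {
        for (int r = 0; r < rows; r++) out[r] = w.dot(r * cols, x, 0, cols);
    }
}
```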
@mukel I have a Q6_K implementation and first measurements, but I'm not really happy with them. Q8_0 with 512 bits runs at 8.2 tokens/s and Q8_0 with 256 bits at 6.1 tokens/s, but Q6_K with 256 bits runs at only 1.5 tokens/s, and Q6_K with 512 bits at only 0.34 tokens/s. Initially, I assumed that 512 bits would be ideal for Q6_K quantization. Q6_K is more complicated than Q8_0 or Q4_0, so I expected it to be a bit slower, but not to this extent. The current dot method with 256 bits: […] But perhaps there is a better layout of the bit operations :-).
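The dot method referenced above was not captured in this excerpt, so it is left elided. As a stand-in, here is a scalar sketch of the Q6_K super-block decoding such a method has to perform, following llama.cpp's block_q6_K layout (256 values per super-block: low 4 bits in ql, high 2 bits in qh, 16 signed byte sub-block scales, and one fp16 super-block scale, here assumed already converted to float):

```java
final class Q6KSketch {
    // Dequantize one Q6_K super-block of 256 values.
    // Each value is q - 32 with q in [0, 63]: 4 low bits from ql, 2 high bits from qh.
    static void dequantizeQ6K(byte[] ql, byte[] qh, byte[] scales, float d, float[] out) {
        for (int n = 0; n < 256; n += 128) {        // two halves of 128 values each
            int lo = n / 2, hi = n / 4, sc = n / 16;
            for (int l = 0; l < 32; l++) {
                int is = l / 16;                    // 16-value sub-block index within the half
                int q1 = ((ql[lo + l]      & 0x0F) | (((qh[hi + l] >> 0) & 3) << 4)) - 32;
                int q2 = ((ql[lo + l + 32] & 0x0F) | (((qh[hi + l] >> 2) & 3) << 4)) - 32;
                int q3 = (((ql[lo + l]      & 0xFF) >>> 4) | (((qh[hi + l] >> 4) & 3) << 4)) - 32;
                int q4 = (((ql[lo + l + 32] & 0xFF) >>> 4) | (((qh[hi + l] >> 6) & 3) << 4)) - 32;
                out[n + l]      = d * scales[sc + is]     * q1;
                out[n + l + 32] = d * scales[sc + is + 2] * q2;
                out[n + l + 64] = d * scales[sc + is + 4] * q3;
                out[n + l + 96] = d * scales[sc + is + 6] * q4;
            }
        }
    }
}
```

Each value costs a mask, a shift, an or and a subtract before the multiply, which already suggests why Q6_K lands well below Q8_0's plain byte loads, even before any vectorization issues.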
Are you on Apple silicon or Intel?
You should not store vectors in arrays or fields, otherwise they get materialized and thus slow. It may work, but I wouldn't trust C2's escape analysis here.
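A minimal illustration of this advice (not code from the PR), assuming array lengths are a multiple of twice the species length; run with --add-modules jdk.incubator.vector:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class DotExample {
    static final VectorSpecies<Float> S = FloatVector.SPECIES_256;

    // Risky: the FloatVector[] may be materialized on the heap instead of
    // staying in vector registers.
    static float dotWithArray(float[] a, float[] b) {
        FloatVector[] acc = { FloatVector.zero(S), FloatVector.zero(S) };
        for (int i = 0; i < a.length; i += 2 * S.length()) {
            acc[0] = FloatVector.fromArray(S, a, i).fma(FloatVector.fromArray(S, b, i), acc[0]);
            acc[1] = FloatVector.fromArray(S, a, i + S.length())
                    .fma(FloatVector.fromArray(S, b, i + S.length()), acc[1]);
        }
        return acc[0].add(acc[1]).reduceLanes(VectorOperators.ADD);
    }

    // Preferred: plain locals can stay in registers across iterations.
    static float dotWithLocals(float[] a, float[] b) {
        FloatVector acc0 = FloatVector.zero(S);
        FloatVector acc1 = FloatVector.zero(S);
        for (int i = 0; i < a.length; i += 2 * S.length()) {
            acc0 = FloatVector.fromArray(S, a, i).fma(FloatVector.fromArray(S, b, i), acc0);
            acc1 = FloatVector.fromArray(S, a, i + S.length())
                    .fma(FloatVector.fromArray(S, b, i + S.length()), acc1);
        }
        return acc0.add(acc1).reduceLanes(VectorOperators.ADD);
    }
}
```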
I'm on Intel (i7-11800H, openjdk 21.0.2 2024-01-16) and AMD (Ryzen 9 7900X, openjdk 22 2024-03-19).
The vectorDot512 method didn't use arrays, but vectorDot256 and vectorDot128 of Q6_K did. I will try it without arrays. The JVM announced S_512_BIT as the preferred size. I'll have to check the code to see if I missed something.
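For reference, the preferred size the JVM announces can be queried like this (the printed shapes are examples and depend on the CPU and JVM flags):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorShape;

// Run with: java --add-modules jdk.incubator.vector PreferredSpecies.java
public final class PreferredSpecies {
    public static void main(String[] args) {
        System.out.println(FloatVector.SPECIES_PREFERRED); // e.g. a 512-bit float species on AVX-512 hardware
        System.out.println(VectorShape.preferredShape());  // e.g. S_512_BIT
    }
}
```

Note that a preferred shape of S_512_BIT does not by itself guarantee that 512-bit kernels win; the measured numbers above are the better judge.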
@mukel Do you plan to distinguish between llama3.java (simple) and llama3.java (extended)? The first one would be a nice 2,000-liner. The second one could have some extensions (Q6_K, server, ...).
The goal is to promote Llama3.java into a larger effort, e.g. https://github.com/llama4j, to implement more models in the same place and share common parts.
Hi @srogmann, can you share the Q6_K implementations (all the variants)?
Hi @mukel, yes, I will share the Q6_K implementation(s) (it's on another device). Some remarks: I had a look at https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml.c because I wasn't satisfied with the performance of my Q6_K implementation. In the __AVX2__ branch of the function ggml_vec_dot_q6_K_q8_K in https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-quants.c there is only a 256-bit version, so I'm not disappointed that my 512-bit implementation was not as fast as I hoped. But I was surprised to see that the second factor of the dot product in ggml_vec_dot_q6_K_q8_K is Q8_K, not FLOAT32. This gives the Q6_K dot implementation more performance, and it makes the tail handling more compact.
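A minimal sketch of what "the second factor is Q8_K" means in practice (layout following llama.cpp's block_q8_K, with its bsums field omitted for brevity; names are illustrative): the float activations are quantized on the fly, one int8 block plus a single float scale per 256 values, so the inner product runs on int8 values with one float fixup per super-block.

```java
final class Q8KSketch {
    // Quantize one block of 256 floats to Q8_K: fills qs with round(x / d)
    // and returns the block scale d. The Q6_K x Q8_K dot then works on int8
    // values: dot ≈ d6 * d8 * sum over sub-blocks of scale_j * sum((q6 - 32) * q8).
    static float quantizeQ8K(float[] x, int off, byte[] qs) {
        float amax = 0f;
        for (int i = 0; i < 256; i++) {
            amax = Math.max(amax, Math.abs(x[off + i]));
        }
        float d = amax / 127f;                 // one scale per 256-value block
        float id = (d == 0f) ? 0f : 1f / d;
        for (int i = 0; i < 256; i++) {
            qs[i] = (byte) Math.round(x[off + i] * id);
        }
        return d;
    }
}
```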
I'm very surprised as well. EDIT: Confirmed: ggml-org/llama.cpp#3477
I remember reading somewhere that tensors with dimensions % 256 != 0 were problematic; this may be an explanation.
This is a PR to run Phi-3-Mini-4K. It only includes Phi3.java. I wrote this file because Phi-3 is faster at simple tasks. I intentionally left Llama3.java unchanged, even though some synergies could have been achieved. The current Llama3.java is a beautiful, complete example of a transformer model. It might be a shame to introduce Phi-3-specific additions there, even though they would be beneficial for reusability. The nice thing is that, thanks to the roughly 2,000 lines in Llama3.java, Phi-3 can be added with only 800 lines (or a bit less if the debug lines were removed). But Phi-3 is not Llama-3, so it would be plausible if you decided that Phi-3 does not belong here. Because who knows where this ends (Gemma-2 is also interesting ;-) ).