
Problems encountered in compiling the F16 precision Qwen3-32B model using 'buddy-mlir' #694

@jiaan-clone

Describe the bug

Bug 1

When using buddy-mlir to compile the F16 precision Qwen3-32B model, the vector.transfer_read operations in the generated subgraph0_prefill_32b.mlir file read data as vector<16xf32>, causing vector.fma to receive operands of mismatched vector types.
The error is as follows:

buddy-mlir/build/examples/BuddyQwen32/subgraph0_decode_32b.mlir:306:24: error: 'vector.fma' op failed to verify that all of {lhs, rhs, acc, result} have same type
              %18686 = "vector.fma"(%18684, %18685, %arg4722) : (vector<16xf32>, vector<16xf32>, vector<16xf16>) -> vector<16xf16>

The cause is in the frontend/Python/ops/tosa.py file: the vector.transfer_read operations are built with a hard-coded f32 element type, so the vectors read there are always f32 regardless of the model precision. I am not sure why f32 is fixed here. After replacing the hard-coded f32 with the model dtype, the model compiles normally. The modification is as follows (a stand-alone sketch of the same idea follows the diff):

diff --git a/frontend/Python/ops/tosa.py b/frontend/Python/ops/tosa.py
index b11d57d..cdaba5f 100644
--- a/frontend/Python/ops/tosa.py
+++ b/frontend/Python/ops/tosa.py
@@ -3844,7 +3844,10 @@ def flash_attention_for_cpu_prefill_op(
     f32 = F32Type.get()
     dtype_qkv = node.tensor_meta["dtype"][0]
     dtype_qkv = mlir_element_type_get(dtype_qkv)
-    dtype = f32
+    # Use the same dtype as Q/K/V to keep IR type-consistent across precisions
+    # (e.g. f16 models). 
+    
+    dtype = dtype_qkv
     vector_width = 16
     v16 = ir.VectorType.get([vector_width], dtype)
     v16_qkv = ir.VectorType.get([vector_width], dtype_qkv)
@@ -11313,13 +11316,13 @@ def gqa_attention_fused_op(node: GQAAttentionFusedOp, symbol_table):
                     with ir.InsertionPoint(vec_loop.body):
                         d = vec_loop.induction_variable
                         va = vec_loop.inner_iter_args[0]
-                        vec_ty = ir.VectorType.get([16], f32)
+                        vec_ty = ir.VectorType.get([16], mlir_dtype)
                         perm_map = ir.AffineMap.get(
                             4, 0, [ir.AffineDimExpr.get(3)]
                         )
                         qv = vector.TransferReadOp(
-                            vec_ty,  # vector<16xf32>
-                            query,  # tensor<1x12x1x128xf32>
+                            vec_ty,
+                            query,
                             [b, h, q, d],  # indices
                             perm_map,  # (d0,d1,d2,d3)->(d3)
                             zero,  # padding
@@ -11327,8 +11330,8 @@ def gqa_attention_fused_op(node: GQAAttentionFusedOp, symbol_table):
                             loc=loc,
                         ).result
                         kv = vector.TransferReadOp(
-                            vec_ty,  # vector<16xf32>
-                            k_cache,  # tensor<1x2x1024x128xf32>
+                            vec_ty,
+                            k_cache,
                             [b, h_kv, k, d],  # indices
                             perm_map,  # (d0,d1,d2,d3)->(d3)
                             zero,  # padding
@@ -11340,7 +11343,7 @@ def gqa_attention_fused_op(node: GQAAttentionFusedOp, symbol_table):
                         scf.YieldOp([va1.result])
 
                     red = vector.ReductionOp(
-                        f32, "add", vec_loop.result, loc=loc
+                        mlir_dtype, "add", vec_loop.result, loc=loc
                     ).result
 
                     acc = arith.AddFOp(prev.result, red)
@@ -11434,9 +11437,9 @@ def gqa_attention_fused_op(node: GQAAttentionFusedOp, symbol_table):
                         ).result
 
                         pv = vector.SplatOp(
-                            ir.VectorType.get([16], f32), p, loc=loc
+                            ir.VectorType.get([16], mlir_dtype), p, loc=loc
                         ).result
-                        vec_ty = ir.VectorType.get([16], f32)
+                        vec_ty = ir.VectorType.get([16], mlir_dtype)
                         perm_map = ir.AffineMap.get(
                             4, 0, [ir.AffineDimExpr.get(3)]
                         )
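
For reference, here is a minimal stand-alone sketch of the principle behind the patch, using the MLIR Python bindings (buddy-mlir vendors its own copy of these; the helper name element_type_for below is hypothetical and stands in for mlir_element_type_get in tosa.py): derive the element type from the model dtype instead of pinning it to f32, so every vector type built from it stays consistent.

# Hedged sketch, not buddy-mlir code: derive the MLIR element type from the
# model dtype instead of hard-coding F32Type. element_type_for is a
# hypothetical stand-in for mlir_element_type_get() in tosa.py.
from mlir import ir

def element_type_for(dtype_name: str) -> ir.Type:
    # Requires an active ir.Context (MLIR types are uniqued per context).
    return {
        "float16": ir.F16Type.get(),
        "float32": ir.F32Type.get(),
    }[dtype_name]

with ir.Context():
    for name in ("float16", "float32"):
        elem = element_type_for(name)
        vec_ty = ir.VectorType.get([16], elem)
        # Prints vector<16xf16> and vector<16xf32>; with the patch, the
        # transfer_read, splat, reduction, and fma all share one such type.
        print(name, "->", vec_ty)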

Bug 2

After compilation succeeds, running the program may produce a segmentation fault during the prefill stage.

To Reproduce

Create a BuddyQwen32 folder under the buddy-mlir/examples path, place the attached files in it, and modify the CMakeLists.txt in the same path by adding the following code:

if(BUDDY_QWEN32_EXAMPLES)
  add_subdirectory(BuddyQwen32)
endif()

Then configure the build with -DBUDDY_QWEN32_EXAMPLES=ON and follow the steps in the BuddyQwen32/README.md file to compile. A parse-and-verify sanity check is sketched after the attachment list.

Attachments:

  • buddy-qwen3-32b-main.cpp
  • CMakeLists.txt
  • import-qwen3.py
  • README.md
  • vocab.txt
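
After rebuilding, a quick sanity check (a sketch, assuming the MLIR Python bindings are importable and the path below, taken from the error message, matches your build tree) is to re-parse the regenerated subgraph: parsing runs the MLIR verifier, so a type mismatch like the vector.fma error above would surface as an exception here.

# Hedged sketch: parsing an MLIR module also runs the verifier, so the
# vector.fma type mismatch reported above would raise an exception here.
# The path comes from the error message; adjust it to your build tree.
from mlir import ir

path = "buddy-mlir/build/examples/BuddyQwen32/subgraph0_decode_32b.mlir"
with ir.Context():
    with open(path) as f:
        module = ir.Module.parse(f.read())
    print("parsed and verified:", module.operation.name)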

Expected behavior

Expected the F16 precision model to compile and run normally.


Desktop (please complete the following information):

  • OS: Ubuntu
  • Version: 24.04.3 LTS (Noble Numbat)

