Describe the bug
bug 1
When using buddy-mlir to compile the f16-precision Qwen3-32B model, the vector.transfer_read ops in the generated subgraph0_prefill_32b.mlir file return vector<16xf32> when reading data, causing the vector.fma op to receive operands of mismatched vector types.
The error is as follows:
buddy-mlir/build/examples/BuddyQwen32/subgraph0_decode_32b.mlir:306:24: error: 'vector.fma' op failed to verify that all of {lhs, rhs, acc, result} have same type
%18686 = "vector.fma"(%18684, %18685, %arg4722) : (vector<16xf32>, vector<16xf32>, vector<16xf16>) -> vector<16xf16>
The cause is in the frontend/Python/ops/tosa.py file: when the vector.transfer_read operation is emitted, its result element type is hard-coded to f32, so the read vector becomes f32 regardless of the model's precision. I am not sure why f32 is fixed here. After replacing the fixed f32, the model compiles normally. The modification is as follows:
diff --git a/frontend/Python/ops/tosa.py b/frontend/Python/ops/tosa.py
index b11d57d..cdaba5f 100644
--- a/frontend/Python/ops/tosa.py
+++ b/frontend/Python/ops/tosa.py
@@ -3844,7 +3844,10 @@ def flash_attention_for_cpu_prefill_op(
f32 = F32Type.get()
dtype_qkv = node.tensor_meta["dtype"][0]
dtype_qkv = mlir_element_type_get(dtype_qkv)
- dtype = f32
+ # Use the same dtype as Q/K/V to keep IR type-consistent across precisions
+ # (e.g. f16 models).
+
+ dtype = dtype_qkv
vector_width = 16
v16 = ir.VectorType.get([vector_width], dtype)
v16_qkv = ir.VectorType.get([vector_width], dtype_qkv)
@@ -11313,13 +11316,13 @@ def gqa_attention_fused_op(node: GQAAttentionFusedOp, symbol_table):
with ir.InsertionPoint(vec_loop.body):
d = vec_loop.induction_variable
va = vec_loop.inner_iter_args[0]
- vec_ty = ir.VectorType.get([16], f32)
+ vec_ty = ir.VectorType.get([16], mlir_dtype)
perm_map = ir.AffineMap.get(
4, 0, [ir.AffineDimExpr.get(3)]
)
qv = vector.TransferReadOp(
- vec_ty, # vector<16xf32>
- query, # tensor<1x12x1x128xf32>
+ vec_ty,
+ query,
[b, h, q, d], # indices
perm_map, # (d0,d1,d2,d3)->(d3)
zero, # padding
@@ -11327,8 +11330,8 @@ def gqa_attention_fused_op(node: GQAAttentionFusedOp, symbol_table):
loc=loc,
).result
kv = vector.TransferReadOp(
- vec_ty, # vector<16xf32>
- k_cache, # tensor<1x2x1024x128xf32>
+ vec_ty,
+ k_cache,
[b, h_kv, k, d], # indices
perm_map, # (d0,d1,d2,d3)->(d3)
zero, # padding
@@ -11340,7 +11343,7 @@ def gqa_attention_fused_op(node: GQAAttentionFusedOp, symbol_table):
scf.YieldOp([va1.result])
red = vector.ReductionOp(
- f32, "add", vec_loop.result, loc=loc
+ mlir_dtype, "add", vec_loop.result, loc=loc
).result
acc = arith.AddFOp(prev.result, red)
@@ -11434,9 +11437,9 @@ def gqa_attention_fused_op(node: GQAAttentionFusedOp, symbol_table):
).result
pv = vector.SplatOp(
- ir.VectorType.get([16], f32), p, loc=loc
+ ir.VectorType.get([16], mlir_dtype), p, loc=loc
).result
- vec_ty = ir.VectorType.get([16], f32)
+ vec_ty = ir.VectorType.get([16], mlir_dtype)
perm_map = ir.AffineMap.get(
4, 0, [ir.AffineDimExpr.get(3)]
)
bug 2
After a successful compilation, running the program may still crash with a segmentation fault during prefill.
To Reproduce
Create a BuddyQwen32 folder under the buddy-mlir/examples path, place the attached files in it, and add the following to the CMakeLists.txt file in the same path:
if(BUDDY_QWEN32_EXAMPLES)
add_subdirectory(BuddyQwen32)
endif()
Then follow the steps in the BuddyQwen32/README.md file to compile.
buddy-qwen3-32b-main.cpp
CMakeLists.txt
import-qwen3.py
README.md
vocab.txt
Expected behavior
The f16-precision model is expected to compile and run normally.
Screenshots
Desktop (please complete the following information):
- OS: Ubuntu
- Version: 24.04.3 LTS (Noble Numbat)
Additional context