
[TASK] Evaluation and Optimization of Transpose and Batch Matmul #612

@zhanghb97

Description

Deliverables

  • A Pull Request (PR) that completes the performance evaluation of Transpose and Batch Matmul, including a measurement of the percentage of total execution time spent in Transpose.
  • A Pull Request (PR) that implements vectorization and parallel optimization for linalg.batch_matmul_transpose_b.

Task Description

  • Run the full end-to-end compilation workflow of DeepSeek R1.

  • Identify representative Transpose + Matmul cases in build/examples/BuddyDeepSeekR1/subgraph0_decode.mlir, for example:

    %2857 = "tosa.const"() <{value = dense<[0, 1, 3, 2]> : tensor<4xi32>}> : () -> tensor<4xi32>
    %2858 = tosa.transpose %2847, %2857 : (tensor<1x12x1024x128xf32>, tensor<4xi32>) -> tensor<1x12x128x1024xf32>
    %2859 = tosa.reshape %2829 {new_shape = array<i64: 12, 1, 128>} : (tensor<1x12x1x128xf32>) -> tensor<12x1x128xf32>
    %2860 = tosa.reshape %2858 {new_shape = array<i64: 12, 128, 1024>} : (tensor<1x12x128x1024xf32>) -> tensor<12x128x1024xf32>
    %2861 = tosa.matmul %2859, %2860 : (tensor<12x1x128xf32>, tensor<12x128x1024xf32>) -> tensor<12x1x1024xf32>
  • Under examples/BuddyNext, write a test case that wraps the Transpose and Matmul operations in a single function. Measure and report the percentage of total kernel execution time taken by the Transpose operation (remember to apply the -batchmatmul-optimize pass).

  • Under examples/BuddyNext, hand-write an MLIR example for linalg.batch_matmul_transpose_b with vectorization and parallelization enabled, and evaluate its performance.
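For intuition, the transpose + reshape + matmul sequence in the excerpt above is exactly the pattern that linalg.batch_matmul_transpose_b folds into a single op (C[b, i, j] = sum_k A[b, i, k] * B[b, j, k], with no materialized transpose). A NumPy sketch checking the equivalence — the array names are hypothetical, but the shapes and the permutation [0, 1, 3, 2] are taken from the excerpt:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((1, 12, 1, 128)).astype(np.float32)     # plays the role of %2829
b = rng.standard_normal((1, 12, 1024, 128)).astype(np.float32)  # plays the role of %2847

# tosa.transpose with perms [0, 1, 3, 2], then the two tosa.reshape ops,
# then tosa.matmul -- the literal op sequence from subgraph0_decode.mlir.
bt = np.transpose(b, (0, 1, 3, 2))            # (1, 12, 128, 1024)
lhs = a.reshape(12, 1, 128)
rhs = bt.reshape(12, 128, 1024)
via_transpose = np.matmul(lhs, rhs)           # (12, 1, 1024)

# batch_matmul_transpose_b semantics: contract over the shared last
# dimension of both operands, so the transpose never materializes.
fused = np.einsum('bik,bjk->bij', lhs, b.reshape(12, 1024, 128))

assert np.allclose(via_transpose, fused, atol=1e-4)
```

This is why the optimized path is attractive: the (1, 12, 1024, 128) tensor is never rewritten in memory, only read in a different order inside the matmul loop nest.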
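The "percentage of time in Transpose" measurement can be prototyped outside MLIR first. A rough Python sketch of the methodology only — the `bench` helper, warmup/iteration counts, and use of NumPy as a stand-in kernel are all assumptions; the actual deliverable times the compiled MLIR kernels:

```python
import time
import numpy as np

def bench(fn, warmup=2, iters=10):
    """Median wall-clock time of fn() in seconds (hypothetical helper)."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return sorted(samples)[len(samples) // 2]

rng = np.random.default_rng(0)
a = rng.standard_normal((12, 1, 128)).astype(np.float32)
b = rng.standard_normal((1, 12, 1024, 128)).astype(np.float32)

# Time the materialized transpose alone, then the whole transpose+matmul
# kernel, and report the transpose's share of the total.
t_transpose = bench(
    lambda: np.ascontiguousarray(np.transpose(b, (0, 1, 3, 2))))
t_total = bench(
    lambda: np.matmul(
        a,
        np.ascontiguousarray(np.transpose(b, (0, 1, 3, 2))).reshape(12, 128, 1024)))

pct = 100.0 * t_transpose / t_total
print(f"transpose share of total kernel time: {pct:.1f}%")
```

The same structure carries over to the MLIR test case: time the full function, time a transpose-only variant, and report the ratio before and after -batchmatmul-optimize.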

Timeline

  • Coding Phase: Nov 7, 2025 – Nov 9, 2025
  • Code Review: Begins on Nov 10, 2025
