
[Performance] --enable-insert-sync on a dynamic GEMM example generates a ~10% slower kernel than the manual-sync version #226

@learning-chip

Description


Summary

This issue records a practical use case where ptoas --enable-insert-sync still has ~10% room for performance improvement compared to a known manual-sync plan.

Background

I wrote a dynamic-shape matmul that is 2x faster than the original pto-isa gemm_performance example and 0.9~1.1x the speed of aclnnMatmul in CANN 8.5.0. See matmul_swizzle/simple_demo to reproduce.

The auto-sync version is only ~100 lines of Python, and reaching 90% of manual-sync performance is quite decent. I just wonder whether the last 10% performance gap can be closed.

Command line

ptoas --enable-insert-sync simple_matmul_auto_sync.pto -o simple_matmul_auto_sync.cpp
ptoas simple_matmul_manual_sync.pto -o simple_matmul_manual_sync.cpp

Reproduction input

pto_matmul.zip

contains both inputs:

  • simple_matmul_auto_sync.pto
  • simple_matmul_manual_sync.pto

and outputs:

  • simple_matmul_auto_sync.cpp
  • simple_matmul_manual_sync.cpp

Expected performance

Auto-sync should ideally be as fast as the manual-sync version (or perhaps even discover faster pipelining?).

Actual performance

Auto-sync is 5~15% slower than manual-sync; see the detailed PRs below (they contain the full code including kernel launch, and on-device performance measurements on a 910B2):
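For clarity on how the gap figures above are expressed, a minimal sketch of the relative-slowdown calculation (the timings below are hypothetical placeholders, not the measured 910B2 numbers):

```python
def relative_slowdown(t_manual_us: float, t_auto_us: float) -> float:
    """Percent by which the auto-sync kernel is slower than the manual-sync one."""
    return (t_auto_us - t_manual_us) / t_manual_us * 100.0

# Hypothetical kernel times for illustration only (not measured values)
print(f"auto-sync is {relative_slowdown(100.0, 110.0):.1f}% slower")  # → auto-sync is 10.0% slower
```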

Git commit

29ed536
