Description
Summary
Record a practical use case where ptoas --enable-insert-sync still has ~10% room for performance improvement, compared to a known manual-sync plan.
Background
I wrote a dynamic-shape matmul that is 2x faster than the original pto-isa gemm_performance example and 0.9~1.1x the speed of aclnnMatmul in CANN 8.5.0. See matmul_swizzle/simple_demo to reproduce.
The auto-sync version is only ~100 lines of Python, and reaching 90% of manual-sync performance is quite decent. I just wonder whether the last 10% of the performance gap can be closed.
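To make the kind of gap I mean concrete, here is a toy cost model (plain Python, not PTO-ISA semantics; the cycle numbers are made up for illustration). It shows how a conservatively placed barrier after every tile load serializes load and compute, while a hand-placed, double-buffered sync hides loads under compute, producing a ~10% difference of the same flavor as what I measure:

```python
# Hypothetical cost model, NOT the actual ptoas scheduling logic:
# compares a fully serialized load->compute loop (conservative sync)
# against a double-buffered loop (manual sync overlapping stages).

def pipeline_cycles(n_tiles, load=10, compute=90, double_buffered=True):
    """Idealized cycle count for a tiled load->compute loop.

    double_buffered=True models manual sync that prefetches the next
    tile's load under the current compute; False models a conservative
    barrier after every load (no overlap across iterations).
    """
    if double_buffered:
        # Only the first load is exposed; later loads hide under compute.
        return load + n_tiles * max(load, compute)
    # Every iteration pays load + compute back to back.
    return n_tiles * (load + compute)

manual = pipeline_cycles(64, double_buffered=True)
auto = pipeline_cycles(64, double_buffered=False)
print(f"manual={manual} cycles, auto={auto} cycles, "
      f"ratio={auto / manual:.3f}")
```

With these (arbitrary) numbers the serialized variant lands about 10% behind the overlapped one, which is why I suspect the remaining gap is in how aggressively the auto-inserted syncs allow cross-iteration overlap.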
Command line
ptoas --enable-insert-sync simple_matmul_auto_sync.pto -o simple_matmul_auto_sync.cpp
ptoas simple_matmul_manual_sync.pto -o simple_matmul_manual_sync.cpp
Reproduction input
The attachment contains both inputs:
- simple_matmul_auto_sync.pto
- simple_matmul_manual_sync.pto
and outputs:
- simple_matmul_auto_sync.cpp
- simple_matmul_manual_sync.cpp
Expected performance
Auto-sync should ideally be as fast as the manual-sync version (or even discover faster pipelining).
Actual performance
Auto-sync is 5~15% slower than manual-sync; see the detailed PRs below (they contain the full code with kernel launch, and on-device performance measurements on a 910B2):
- Directly written in PTO-ISA C++: "PTO-ISA implementation of Matmul with L2 cache locality optimization" (huawei-csl/pto-kernels#26)
- Python, manual-sync, equal performance to the C++ version above: "Matmul swizzle in Python with manual sync (cleaned-up for merge)" (huawei-csl/pto-dsl#72)
- Python, using ptoas auto-sync, 10% slower than manual sync: "Test auto-sync on general-shape matmul with swizzle, compare to manual sync performance." (huawei-csl/pto-dsl#73)