A high-performance implementation of float32 matrix multiplication in x86_64 assembly, optimized step by step (column-major storage, AVX2 SIMD, register blocking, cache blocking, threading) to achieve performance comparable to or exceeding OpenBLAS (~130 GFLOPs).
- CPU: AMD Ryzen 7 5800x
- Memory: 32GB (4x8GB) DDR4-3600 CL14 with tuned subtimings
- Initial implementation for small matrices
- Baseline for optimization
- Handles 1024x1024 matrices
- Performance: ~2.8s (with high variance)
- Comparison with C:
  - `-O0`: 4.3s
  - `-O3`: 3.4s
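The baseline above corresponds to a straightforward triple loop. A minimal C sketch of that starting point (function name hypothetical, not the repository's code):

```c
#include <stddef.h>

// Naive row-major float32 matmul: C = A * B, all N x N.
// The k-loop walks B column-wise, so locality is poor -- this is
// the behavior the later optimizations target.
void matmul_naive(const float *A, const float *B, float *C, size_t N) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; k++)
                acc += A[i * N + k] * B[k * N + j];  // strided reads of B
            C[i * N + j] = acc;
        }
}
```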
- Optimizations:
- Cached matrix dimensions in registers
- Aligned instructions (~2.1s)
- Aligned matrices (~1.85s)
- Stores second matrix in column-major format
- Improved cache hit rate
- Performance: ~0.65s
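The column-major trick can be sketched in C: transpose B once so the inner loop reads both operands sequentially (names hypothetical):

```c
#include <stddef.h>
#include <stdlib.h>

// Multiply with B pre-transposed: BT[j*N + k] == B[k*N + j], so the
// k-loop streams through both A and BT row-wise (cache-friendly).
void matmul_bt(const float *A, const float *B, float *C, size_t N) {
    float *BT = malloc(N * N * sizeof *BT);
    for (size_t k = 0; k < N; k++)
        for (size_t j = 0; j < N; j++)
            BT[j * N + k] = B[k * N + j];        // store B column-major
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; k++)
                acc += A[i * N + k] * BT[j * N + k];  // both streams sequential
            C[i * N + j] = acc;
        }
    free(BT);
}
```

The one-time O(N²) transpose cost is amortized over the O(N³) multiply.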
- Utilizes AVX2 SIMD instructions
- Processes 8 float32 values per instruction
- Inner loop unrolling
- Performance: ~0.085s
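The 8-wide AVX2 step can be illustrated with intrinsics. This is a sketch, not the repository's assembly: the dot product of two float arrays, with a runtime CPU check and scalar fallback so it compiles without special flags (function names hypothetical):

```c
#include <immintrin.h>
#include <stddef.h>

// 8 float32 lanes per iteration via AVX2 + FMA; n assumed % 8 == 0
// on this path.
__attribute__((target("avx2,fma")))
static float dot8_avx2(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t k = 0; k < n; k += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + k),
                              _mm256_loadu_ps(b + k), acc);
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);                  // reduce 8 lanes to a scalar
    return tmp[0] + tmp[1] + tmp[2] + tmp[3]
         + tmp[4] + tmp[5] + tmp[6] + tmp[7];
}

float dot(const float *a, const float *b, size_t n) {
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma")
        && n % 8 == 0)
        return dot8_avx2(a, b, n);
    float acc = 0.0f;                            // portable scalar fallback
    for (size_t k = 0; k < n; k++) acc += a[k] * b[k];
    return acc;
}
```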
- Calculates 4 elements of C per iteration
- Improved memory access efficiency
- Performance: ~0.034s (~62 GFLOPs)
- Expanded kernel to 2x4 submatrix
- 6 memory reads for 8 C elements
- Performance: ~0.0255s (~84 GFLOPs)
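The "6 reads for 8 elements" ratio is visible even in a scalar C sketch of a 2x4 register tile: per k step, 2 loads from A and 4 from B feed 8 multiply-adds held in registers (names hypothetical):

```c
#include <stddef.h>

// Scalar 2x4 tile: accumulates C[i..i+1][j..j+3] for row-major N x N
// matrices. Per k iteration: 2 reads of A + 4 reads of B = 6 loads
// driving 8 multiply-adds, one per accumulator.
void kernel_2x4(const float *A, const float *B, float *C, size_t N,
                size_t i, size_t j) {
    float c00 = 0, c01 = 0, c02 = 0, c03 = 0;    // 8 accumulators stay
    float c10 = 0, c11 = 0, c12 = 0, c13 = 0;    // in registers
    for (size_t k = 0; k < N; k++) {
        float a0 = A[i * N + k], a1 = A[(i + 1) * N + k];      // 2 A reads
        float b0 = B[k * N + j],     b1 = B[k * N + j + 1],
              b2 = B[k * N + j + 2], b3 = B[k * N + j + 3];    // 4 B reads
        c00 += a0 * b0; c01 += a0 * b1; c02 += a0 * b2; c03 += a0 * b3;
        c10 += a1 * b0; c11 += a1 * b1; c12 += a1 * b2; c13 += a1 * b3;
    }
    C[i * N + j] = c00;       C[i * N + j + 1] = c01;
    C[i * N + j + 2] = c02;   C[i * N + j + 3] = c03;
    C[(i + 1) * N + j] = c10;     C[(i + 1) * N + j + 1] = c11;
    C[(i + 1) * N + j + 2] = c12; C[(i + 1) * N + j + 3] = c13;
}
```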
- Optimal register utilization
- All 16 vector FP registers (ymm) utilized
- Performance: ~0.0205s (~105 GFLOPs)
- Matrices stored in blocks
- Sequential memory reads
- Performance: ~0.0183s (~118 GFLOPs)
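The blocked layout amounts to tiling the three loops so each working set fits in cache. A C sketch of the loop structure (block size and function name are illustrative, not the repository's values):

```c
#include <stddef.h>

#define BS 64  // block size; tune so three BS x BS float tiles fit in cache

// Blocked matmul: iterate over BS x BS tiles so each tile of A, B, and C
// stays cache-resident while it is reused; bounds checks handle N not a
// multiple of BS.
void matmul_blocked(const float *A, const float *B, float *C, size_t N) {
    for (size_t i = 0; i < N * N; i++) C[i] = 0.0f;
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t kk = 0; kk < N; kk += BS)
            for (size_t jj = 0; jj < N; jj += BS)        // tile loops
                for (size_t i = ii; i < ii + BS && i < N; i++)
                    for (size_t k = kk; k < kk + BS && k < N; k++) {
                        float a = A[i * N + k];          // reused across j
                        for (size_t j = jj; j < jj + BS && j < N; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```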
- 6x2(x8) kernel implementation
- Uses `vbroadcastss` on A elements
- Eliminated horizontal add operations
- Optimized blocking order for matrix B
- Performance: ~135 GFLOPs
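The broadcast idea behind the kernel can be sketched with intrinsics: splat one A scalar across a vector (`vbroadcastss`) and FMA it with a row-vector of B, so C is accumulated lane-wise and no horizontal reduction is needed. This is an illustrative single-row update, not the repository's 6x2 kernel (names hypothetical):

```c
#include <immintrin.h>
#include <stddef.h>

// C[i,:] += A[i,k] * B[k,:] using a broadcast of the A scalar, so each
// C vector is accumulated lane-by-lane -- no horizontal adds required.
__attribute__((target("avx2,fma")))
static void row_update_avx2(float a, const float *brow, float *crow, size_t n) {
    __m256 va = _mm256_broadcast_ss(&a);          // vbroadcastss
    for (size_t j = 0; j < n; j += 8) {
        __m256 vc = _mm256_loadu_ps(crow + j);
        vc = _mm256_fmadd_ps(va, _mm256_loadu_ps(brow + j), vc);
        _mm256_storeu_ps(crow + j, vc);
    }
}

// Runtime-dispatched wrapper with a scalar fallback for portability.
void row_update(float a, const float *brow, float *crow, size_t n) {
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma")
        && n % 8 == 0) {
        row_update_avx2(a, brow, crow, n);
        return;
    }
    for (size_t j = 0; j < n; j++) crow[j] += a * brow[j];
}
```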
- C implementation available
- Outperforms OpenBLAS (~145 GFLOPs)
- Compile with: `clang -O3 -march=native`
The matmul_lib directory contains:
- AMD64 SysV ABI compliant implementation
- Linkable library format
- Performance comparable to or exceeding OpenBLAS
- Parallel implementation using `matmul_simd_10.asm`
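Rows of C are independent, so the outer loop parallelizes trivially. A minimal C sketch of that row-partitioning idea using OpenMP (not the repository's actual threading code; without `-fopenmp` the pragma is ignored and the loop runs serially with identical results):

```c
#include <stddef.h>

// Each thread takes a chunk of C's rows; no synchronization is needed
// because no two iterations of i write the same memory.
void matmul_parallel(const float *A, const float *B, float *C, size_t N) {
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; k++)
                acc += A[i * N + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}
```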