This project presents a comprehensive study of high-performance matrix multiplication on shared-memory architectures using OpenMP. It focuses on performance optimisation, benchmarking, and strong scaling analysis of parallel matrix multiplication algorithms implemented in C.
- To implement parallel matrix multiplication using OpenMP
- To analyse performance improvements through cache-aware optimisations
- To study strong scaling behaviour on multicore processors
- To understand the impact of memory bandwidth and thread scalability
- Naive OpenMP matrix multiplication implementation
- Cache-optimised (tiled) matrix multiplication
- Automated benchmarking using shell scripts
- Strong scaling and efficiency evaluation
```
hpc-matrix-multiplication/
├── src/
│   ├── matrix_mul_omp.c
│   ├── matrix_mul_tiled.c
│   ├── run_benchmark.sh
│   ├── run_strong_scaling.sh
│   ├── benchmark_results.txt
│   └── strong_scaling_results.txt
├── results/
└── README.md
```
The project includes two matrix multiplication strategies:
- Naive OpenMP Implementation — A straightforward parallelisation of the triple-nested loop matrix multiplication using OpenMP directives.
- Tiled (Cache-Optimised) Implementation — A block-based approach designed to improve cache locality and reduce memory access latency.
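The two strategies can be sketched as follows. This is an illustrative reconstruction, not the project's exact source: the matrix size `N`, tile size `TILE`, and function names are assumptions, and `N` is chosen small and divisible by `TILE` for simplicity.

```c
#include <string.h>

#define N 64      /* illustrative size; benchmarks use much larger matrices */
#define TILE 16   /* illustrative block size; assumed to divide N evenly */

/* Naive version: triple-nested loop, parallelised over rows of C. */
void matmul_naive(const double *A, const double *B, double *C) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

/* Tiled version: operate on TILE x TILE blocks so each block's working
   set stays cache-resident, improving locality. Each thread owns a
   distinct block of C, so the accumulation is race-free. */
void matmul_tiled(const double *A, const double *B, double *C) {
    memset(C, 0, N * N * sizeof(double));
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[i * N + k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

Hoisting `A[i*N + k]` into a scalar and keeping the `j` loop innermost gives unit-stride access to both `B` and `C`, which is the main source of the tiled version's cache friendliness.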
```
cd src
gcc -O3 -fopenmp matrix_mul_omp.c -o matrix_omp
gcc -O3 -fopenmp matrix_mul_tiled.c -o matrix_tiled
```
```
cd src
./run_benchmark.sh
./run_strong_scaling.sh
```
The following table compares execution time between a naive OpenMP implementation and a cache-optimised tiled implementation. The reported speedup is calculated as the ratio of naive execution time to tiled execution time.
| Threads | Naive Time (s) | Tiled Time (s) | Speedup |
|---|---|---|---|
| 1 | 87.9724 | 10.1177 | 8.69 |
| 2 | 44.8283 | 5.5742 | 8.04 |
| 4 | 34.9294 | 3.4970 | 9.99 |
| 8 | 32.6481 | 2.1029 | 15.53 |
These results demonstrate that cache tiling significantly reduces memory access latency and improves data locality, leading to substantial performance gains. The benefits become increasingly pronounced as the number of threads increases.
Strong scaling evaluates how execution time decreases as the number of threads increases for a fixed problem size. The following table reports runtime, speedup, and parallel efficiency for the tiled OpenMP implementation.
| Threads | Time (s) | Speedup | Efficiency (%) |
|---|---|---|---|
| 1 | 8.0001 | 1.00 | 100.00 |
| 2 | 4.5733 | 1.75 | 87.47 |
| 4 | 2.8410 | 2.82 | 70.40 |
| 8 | 2.1627 | 3.70 | 46.24 |
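The speedup and efficiency columns follow directly from the measured times: speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p. A small sketch in C (the helper names are illustrative, not part of the project's sources):

```c
/* Strong-scaling metrics: S(p) = T(1) / T(p), E(p) = S(p) / p. */
static double speedup(double t1, double tp) {
    return t1 / tp;
}

/* Parallel efficiency expressed as a percentage. */
static double efficiency_pct(double t1, double tp, int p) {
    return 100.0 * speedup(t1, tp) / p;
}
```

For example, the 8-thread row gives S(8) = 8.0001 / 2.1627 ≈ 3.70 and E(8) ≈ 46.24%, matching the table.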
The results indicate diminishing parallel efficiency as thread count increases, primarily due to memory bandwidth saturation and synchronisation overheads. Nevertheless, the implementation exhibits good scalability up to moderate thread counts.
- Cache tiling significantly reduces execution time
- Near-linear speedup is observed at lower thread counts
- Parallel efficiency decreases as memory bandwidth becomes a bottleneck
- Performance gains are more prominent for larger matrix sizes
- C Programming Language
- OpenMP
- GCC Compiler
- Bash Scripting
- Linux / WSL Environment
Aman Bashir Sheikh
Email: aman.b.sheikh119@gmail.com
LinkedIn: www.linkedin.com/in/aman-sheikh-aa701b253