High-Performance Matrix Multiplication Using OpenMP

This project presents a comprehensive study of high-performance matrix multiplication on shared-memory architectures using OpenMP. It focuses on performance optimisation, benchmarking, and strong scaling analysis of parallel matrix multiplication algorithms implemented in C.


Project Objectives

  • To implement parallel matrix multiplication using OpenMP
  • To analyse performance improvements through cache-aware optimisations
  • To study strong scaling behaviour on multicore processors
  • To understand the impact of memory bandwidth and thread scalability

Features

  • Naive OpenMP matrix multiplication implementation
  • Cache-optimised (tiled) matrix multiplication
  • Automated benchmarking using shell scripts
  • Strong scaling and efficiency evaluation

Project Structure

hpc-matrix-multiplication/
├── src/
│   ├── matrix_mul_omp.c
│   ├── matrix_mul_tiled.c
│   ├── run_benchmark.sh
│   ├── run_strong_scaling.sh
│   ├── benchmark_results.txt
│   └── strong_scaling_results.txt
├── results/
└── README.md

Implementation Overview

The project includes two matrix multiplication strategies:

  • Naive OpenMP Implementation — A straightforward parallelisation of the triple-nested loop matrix multiplication using OpenMP directives.
  • Tiled (Cache-Optimised) Implementation — A block-based approach designed to improve cache locality and reduce memory access latency.

How to Compile

cd src
gcc -O3 -fopenmp matrix_mul_omp.c -o matrix_omp
gcc -O3 -fopenmp matrix_mul_tiled.c -o matrix_tiled

How to Run

cd src
./run_benchmark.sh
./run_strong_scaling.sh
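The binaries can also be run directly with an explicit thread count via the standard `OMP_NUM_THREADS` OpenMP environment variable. The loop below sketches what a strong-scaling sweep presumably looks like inside `run_strong_scaling.sh`; the actual script contents are not shown in this README.

```shell
# Sweep thread counts for a strong-scaling run (illustrative sketch).
for t in 1 2 4 8; do
  export OMP_NUM_THREADS=$t
  echo "running with $OMP_NUM_THREADS threads"
  # ./matrix_tiled   # uncomment after compiling as shown above
done
```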

Benchmark Results (N = 2048)

The following table compares execution time between a naive OpenMP implementation and a cache-optimised tiled implementation. The reported speedup is calculated as the ratio of naive execution time to tiled execution time.

| Threads | Naive Time (s) | Tiled Time (s) | Speedup |
|---------|----------------|----------------|---------|
| 1       | 87.9724        | 10.1177        | 8.69    |
| 2       | 44.8283        | 5.5742         | 8.04    |
| 4       | 34.9294        | 3.4970         | 9.99    |
| 8       | 32.6481        | 2.1029         | 15.53   |

These results show that cache tiling markedly improves data locality, reducing the time lost to cache misses and memory stalls and yielding substantial performance gains. The advantage grows with thread count, as the naive version increasingly saturates memory bandwidth.


Strong Scaling Analysis (Tiled OpenMP, N = 2048)

Strong scaling evaluates how execution time decreases as the number of threads increases for a fixed problem size. The following table reports runtime, speedup, and parallel efficiency for the tiled OpenMP implementation.

| Threads | Time (s) | Speedup | Efficiency (%) |
|---------|----------|---------|----------------|
| 1       | 8.0001   | 1.00    | 100.00         |
| 2       | 4.5733   | 1.75    | 87.47          |
| 4       | 2.8410   | 2.82    | 70.40          |
| 8       | 2.1627   | 3.70    | 46.24          |

The results indicate diminishing parallel efficiency as thread count increases, primarily due to memory bandwidth saturation and synchronisation overheads. Nevertheless, the implementation exhibits good scalability up to moderate thread counts.


Key Observations

  • Cache tiling significantly reduces execution time
  • Near-linear speedup is observed at lower thread counts
  • Parallel efficiency decreases as memory bandwidth becomes a bottleneck
  • Performance gains are more prominent for larger matrix sizes

Technologies Used

  • C Programming Language
  • OpenMP
  • GCC Compiler
  • Bash Scripting
  • Linux / WSL Environment

Author

Aman Bashir Sheikh
Email: aman.b.sheikh119@gmail.com
LinkedIn: www.linkedin.com/in/aman-sheikh-aa701b253
