
UNIWA

UNIVERSITY OF WEST ATTICA
SCHOOL OF ENGINEERING
DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATICS



Parallel Systems

Parallel Computing using CUDA

Vasileios Evangelos Athanasiou
Student ID: 19390005

GitHub · LinkedIn


Supervision

Supervisor: Vasileios Mamalis, Professor

UNIWA Profile

Co-supervisor: Michalis Iordanakis, Academic Scholar

UNIWA Profile · Scholar


Athens, February 2025



INSTALL

Parallel Computing using CUDA

This repository implements parallel computations on 2D integer arrays using CUDA, developed as part of the Parallel Systems course at the University of West Attica. The project demonstrates GPU-accelerated matrix operations using CUDA threads, shared memory, and atomic operations.


1. Prerequisites

1.1 Required Software

  • NVIDIA CUDA Toolkit (≥ 11.0 recommended)
    Download: https://developer.nvidia.com/cuda-downloads
  • NVIDIA GPU with compute capability ≥ 3.0 (tested on NVIDIA TITAN RTX)
  • GCC compiler (Linux/macOS) or compatible compiler on Windows
  • Make / Terminal for compilation and execution

1.2 Optional Software

  • Text editor or IDE (VSCode, CLion, Nsight)
  • Spreadsheet viewer for performance analysis

2. Installation Steps

2.1 Clone the Repository

git clone https://github.com/Parallel-Systems-aka-Uniwa/CUDA.git

Or download the ZIP archive and extract it.

2.2 Navigate to Project Directory

cd CUDA

Folder structure:

assign/
docs/
src/
README.md

src/ contains CUDA source code (cuda1.cu) and input/output directories

docs/ contains theory, exercises, and performance analysis


3. Compilation Instructions

Compile the CUDA program using nvcc:

nvcc -o cuda1 src/cuda1.cu

Explanation:

  • -o cuda1 → output executable named cuda1
  • src/cuda1.cu → CUDA source file

Ensure CUDA Toolkit paths are correctly set ($PATH and $LD_LIBRARY_PATH on Linux).
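On Linux this typically means the following (assuming the default /usr/local/cuda installation prefix; adjust the path for your toolkit version):

```shell
# Make nvcc and the CUDA runtime libraries visible to the shell.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

Add these lines to ~/.bashrc (or equivalent) to make them persistent across sessions.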


4. Execution Instructions

Run the program with input matrix file and output file:

./cuda1 src/A/AtoB.txt src/OutArr/OutArrB.txt

Arguments:

  • Input file → Path to input matrix (e.g., src/A/AtoB.txt)
  • Output file → Path to store result matrix (e.g., src/OutArr/OutArrB.txt)

Example Runs

./cuda1 src/A/AtoB.txt src/OutArr/OutArrB.txt
./cuda1 src/A/AtoC.txt src/OutArr/OutArrC.txt
  • Performs the Matrix B or Matrix C computation, depending on the input file
  • Supports different matrix sizes (e.g., N = 8, 512, 1024, 10000, 20000)

5. Input Files

  • Located in src/A/
  • Files contain integer matrices in text format

Typical names:

  • AtoB.txt → Input for Matrix B computation
  • AtoC.txt → Input for Matrix C computation

6. Output Files

Stored in src/OutArr/ (intermediate arrays) or src/Output/ (final results)

Examples:

  • OutArrB.txt → Matrix B after computation
  • OutArrC.txt → Matrix C after computation
  • Output512.txt → Result for N=512
  • Output20000.txt → Result for N=20000

7. Core Operations

7.1 Average (calcAvg)

Parallel reduction with atomic operations
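The shared-memory-plus-atomics pattern can be sketched as follows; this is a minimal illustration, not the exact kernel in src/cuda1.cu (names, types, and launch geometry are assumptions):

```cuda
// Sketch: each block reduces its tile of the flattened n×n array in shared
// memory, then thread 0 adds the block's partial sum to a global accumulator.
__global__ void calcAvgSketch(const int *A, int n, unsigned long long *sum)
{
    extern __shared__ long long partial[];        // blockDim.x slots
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    partial[tid] = (idx < n * n) ? A[idx] : 0;    // load one element, pad with 0
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        atomicAdd(sum, (unsigned long long)partial[0]);
}
// Host side: average = (double)sumHost / ((double)n * n);
```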

7.2 Maximum (findMax)

Parallel search for largest element
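A minimal version of the parallel maximum search, assuming one thread per element and a device-side result updated with atomicMax (again a sketch, not the repository's exact kernel):

```cuda
// Sketch: every thread compares its element against the running maximum.
// *maxVal should be initialised to INT_MIN (or A[0]) before the launch.
__global__ void findMaxSketch(const int *A, int n, int *maxVal)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n * n)
        atomicMax(maxVal, A[idx]);   // atomicMax on int is a native CUDA atomic
}
```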

7.3 Matrix B (createB)

$$ B_{ij} = \begin{cases} a_{\text{max}} - A_{ij}, & i \ne j \\ a_{\text{max}}, & i = j \end{cases} $$

Also, to find the minimum element of B:

$$ B_{\min} = \min(B_{ij}) $$
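The formula maps directly onto a 2D grid with one thread per element; a hedged sketch (kernel signature and names assumed, not taken from src/cuda1.cu):

```cuda
// Sketch: B[i][j] = aMax - A[i][j] off the diagonal, aMax on it.
// aMax is the maximum of A, as computed by findMax.
__global__ void createBSketch(const int *A, int *B, int n, int aMax)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
    if (i < n && j < n)
        B[i * n + j] = (i == j) ? aMax : aMax - A[i * n + j];
}
```

The minimum of B can then be obtained with the same atomic pattern as findMax, using atomicMin instead of atomicMax.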

7.4 Matrix C (createC)

$$ C_{ij} = 3\,A_{ij} + A_{i,j+1} + A_{i,j-1} $$
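A corresponding per-element sketch; the repository does not state how the j = 0 and j = n−1 boundaries are handled, so treating out-of-range neighbours as 0 here is an assumption:

```cuda
// Sketch: C[i][j] = 3*A[i][j] + right neighbour + left neighbour.
__global__ void createCSketch(const int *A, int *C, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n) {
        int left  = (j > 0)     ? A[i * n + j - 1] : 0;  // boundary: assumed 0
        int right = (j < n - 1) ? A[i * n + j + 1] : 0;  // boundary: assumed 0
        C[i * n + j] = 3 * A[i * n + j] + right + left;
    }
}
```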


8. Performance Analysis

Experiments were conducted on various matrix sizes:

| Matrix Size | calcAvg (ms) | findMax (ms) | createB/C (ms) |
|---|---|---|---|
| 8×8 | 0.204736 | 0.015552 | 0.015040 (B) |
| 512×512 | 0.136576 | 0.016704 | 0.016576 (C) |
| 1024×1024 | 36.310913 | 0.059424 | 0.072832 (B) |
| 20000×20000 | 39.388447 | 0.015104 | 0.011424 (C) |

Observations:

  • calcAvg is the most expensive kernel and its runtime grows markedly with matrix size (from about 0.2 ms at 8×8 to about 39 ms at 20000×20000), since the global reduction must touch every element.
  • findMax and createB/createC stay well under 0.1 ms even for the largest matrices, thanks to the one-thread-per-element parallel kernels.

9. Known Issues & Troubleshooting

  • Memory Allocation: Large arrays may cause cudaMemcpy failures if GPU memory is insufficient.
  • Shared Memory Limits: Over-allocation per block may lead to kernel launch failures.
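Both failure modes above are easier to diagnose when every CUDA API call and kernel launch is checked. A common wrapper macro (a generic sketch, not code from this repository):

```cuda
#include <cstdio>
#include <cstdlib>

// Abort with file/line context whenever a CUDA call returns an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&dA, bytes));
//   kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());      // catches launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize()); // catches runtime errors in the kernel
```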