
UNIWA

UNIVERSITY OF WEST ATTICA
SCHOOL OF ENGINEERING
DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATICS



Parallel Systems

Parallel Computing using CUDA

Vasileios Evangelos Athanasiou
Student ID: 19390005

GitHub · LinkedIn


Supervision

Supervisor: Vasileios Mamalis, Professor

UNIWA Profile

Co-supervisor: Michalis Iordanakis, Academic Scholar

UNIWA Profile · Scholar


Athens, February 2025



INSTALL

Parallel Computing using CUDA

This repository implements parallel computations on 2D integer arrays using CUDA, developed as part of the Parallel Systems course at the University of West Attica. The project demonstrates GPU-accelerated matrix operations using CUDA threads, shared memory, and atomic operations.


1. Prerequisites

1.1 Required Software

  • NVIDIA CUDA Toolkit (≥ 11.0 recommended)
    Download: https://developer.nvidia.com/cuda-downloads
  • NVIDIA GPU with compute capability ≥ 3.0 (tested on NVIDIA TITAN RTX)
  • GCC compiler (Linux/macOS) or compatible compiler on Windows
  • Make / Terminal for compilation and execution

1.2 Optional Software

  • Text editor or IDE (VSCode, CLion, Nsight)
  • Spreadsheet viewer for performance analysis

2. Installation Steps

2.1 Clone the Repository

git clone https://github.com/Parallel-Systems-aka-Uniwa/CUDA.git

Or download the ZIP archive and extract it.

2.2 Navigate to Project Directory

cd CUDA

Folder structure:

assign/
docs/
src/
README.md

src/ contains CUDA source code (cuda1.cu) and input/output directories

docs/ contains theory, exercises, and performance analysis


3. Compilation Instructions

Compile the CUDA program using nvcc:

nvcc -o cuda1 src/cuda1.cu

Explanation:

  • -o cuda1 → output executable named cuda1
  • src/cuda1.cu → CUDA source file

Ensure CUDA Toolkit paths are correctly set ($PATH and $LD_LIBRARY_PATH on Linux).
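On Linux this typically means the following (assuming the default /usr/local/cuda installation prefix; adjust the path for your toolkit version):

```shell
# Make nvcc and the CUDA runtime libraries visible to the shell.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

Add these lines to ~/.bashrc (or equivalent) to make them persistent across sessions.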


4. Execution Instructions

Run the program with input matrix file and output file:

./cuda1 src/A/AtoB.txt src/OutArr/OutArrB.txt

Arguments:

  • Input file → Path to input matrix (e.g., src/A/AtoB.txt)
  • Output file → Path to store result matrix (e.g., src/OutArr/OutArrB.txt)

Example Runs

./cuda1 src/A/AtoB.txt src/OutArr/OutArrB.txt
./cuda1 src/A/AtoC.txt src/OutArr/OutArrC.txt
  • Performs the Matrix B or Matrix C computation, depending on the input file
  • Supports different matrix sizes (e.g., N = 8, 512, 1024, 10000, 20000)

5. Input Files

  • Located in src/A/
  • Files contain integer matrices in text format

Typical names:

  • AtoB.txt → Input for Matrix B computation
  • AtoC.txt → Input for Matrix C computation

6. Output Files

Stored in src/OutArr/ (intermediate arrays) or src/Output/ (final results)

Examples:

  • OutArrB.txt → Matrix B after computation
  • OutArrC.txt → Matrix C after computation
  • Output512.txt → Result for N=512
  • Output20000.txt → Result for N=20000

7. Core Operations

7.1 Average (calcAvg)

Parallel reduction with atomic operations
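The shared-memory-plus-atomics pattern can be sketched as follows; this is a minimal illustration, not the exact kernel in src/cuda1.cu (names, types, and launch geometry are assumptions):

```cuda
// Sketch: each block reduces its tile of the flattened n×n array in shared
// memory, then thread 0 adds the block's partial sum to a global accumulator.
__global__ void calcAvgSketch(const int *A, int n, unsigned long long *sum)
{
    extern __shared__ long long partial[];        // blockDim.x slots
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    partial[tid] = (idx < n * n) ? A[idx] : 0;    // load one element, pad with 0
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        atomicAdd(sum, (unsigned long long)partial[0]);
}
// Host side: average = (double)sumHost / ((double)n * n);
```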

7.2 Maximum (findMax)

Parallel search for largest element
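A minimal version of the parallel maximum search, assuming one thread per element and a device-side result updated with atomicMax (again a sketch, not the repository's exact kernel):

```cuda
// Sketch: every thread compares its element against the running maximum.
// *maxVal should be initialised to INT_MIN (or A[0]) before the launch.
__global__ void findMaxSketch(const int *A, int n, int *maxVal)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n * n)
        atomicMax(maxVal, A[idx]);   // atomicMax on int is a native CUDA atomic
}
```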

7.3 Matrix B (createB)

$$ B_{ij} = \begin{cases} a_{\text{max}} - A_{ij}, & i \ne j \\ a_{\text{max}}, & i = j \end{cases} $$

Also, to find the minimum element of B:

$$ B_{\min} = \min(B_{ij}) $$
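The formula maps directly onto a 2D grid with one thread per element; a hedged sketch (kernel signature and names assumed, not taken from src/cuda1.cu):

```cuda
// Sketch: B[i][j] = aMax - A[i][j] off the diagonal, aMax on it.
// aMax is the maximum of A, as computed by findMax.
__global__ void createBSketch(const int *A, int *B, int n, int aMax)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
    if (i < n && j < n)
        B[i * n + j] = (i == j) ? aMax : aMax - A[i * n + j];
}
```

The minimum of B can then be obtained with the same atomic pattern as findMax, using atomicMin instead of atomicMax.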

7.4 Matrix C (createC)

$$ C_{ij} = 3\,A_{ij} + A_{i,j+1} + A_{i,j-1} $$
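A corresponding per-element sketch; the repository does not state how the j = 0 and j = n−1 boundaries are handled, so treating out-of-range neighbours as 0 here is an assumption:

```cuda
// Sketch: C[i][j] = 3*A[i][j] + right neighbour + left neighbour.
__global__ void createCSketch(const int *A, int *C, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n) {
        int left  = (j > 0)     ? A[i * n + j - 1] : 0;  // boundary: assumed 0
        int right = (j < n - 1) ? A[i * n + j + 1] : 0;  // boundary: assumed 0
        C[i * n + j] = 3 * A[i * n + j] + right + left;
    }
}
```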


8. Performance Analysis

Experiments were conducted on various matrix sizes:

| Matrix Size | calcAvg (ms) | findMax (ms) | createB/C (ms) |
|---|---|---|---|
| 8×8 | 0.204736 | 0.015552 | 0.015040 (B) |
| 512×512 | 0.136576 | 0.016704 | 0.016576 (C) |
| 1024×1024 | 36.310913 | 0.059424 | 0.072832 (B) |
| 20000×20000 | 39.388447 | 0.015104 | 0.011424 (C) |

Observations:

  • calcAvg is the most expensive kernel and its runtime grows markedly with matrix size (from about 0.2 ms at 8×8 to about 39 ms at 20000×20000), since the global reduction must touch every element.
  • findMax and createB/createC stay well under 0.1 ms even for the largest matrices, thanks to the one-thread-per-element parallel kernels.

9. Known Issues & Troubleshooting

  • Memory Allocation: Large arrays may cause cudaMemcpy failures if GPU memory is insufficient.
  • Shared Memory Limits: Over-allocation per block may lead to kernel launch failures.
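Both failure modes above are easier to diagnose when every CUDA API call and kernel launch is checked. A common wrapper macro (a generic sketch, not code from this repository):

```cuda
#include <cstdio>
#include <cstdlib>

// Abort with file/line context whenever a CUDA call returns an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&dA, bytes));
//   kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());      // catches launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize()); // catches runtime errors in the kernel
```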