
PDC

Assignment No.03

Submitted To:

Dr. Qamas Gull

Submitted By:

Aqib Sharafat (21-CS-51)

Section: C
Matrix Multiplication using CUDA - Performance Analysis Report

Problem Statement and Approach

This project implements matrix multiplication for square (N×N) matrices and analyzes its performance under
sequential CPU computation and GPU-accelerated parallel computation with CUDA.
We implemented:

1. Basic CPU version with nested loops

2. Basic CUDA version

3. Optimized CUDA version with shared memory

Implementation Overview

CPU Implementation: Uses triple-nested loops to calculate the matrix product with O(N³) time
complexity.
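
For reference, a minimal sketch of such a CPU routine (the row-major layout and the name matmul_cpu are illustrative assumptions, not the exact submitted code):

// CPU reference: C = A * B for N x N row-major matrices, O(N^3) work.
void matmul_cpu(const float* A, const float* B, float* C, int N) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[i * N + k] * B[k * N + j];  // dot product of row i and column j
            C[i * N + j] = sum;
        }
    }
}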

Basic CUDA Implementation: Assigns one thread per output element, with each thread
calculating its position based on block/thread indices and performing the dot product computation
using global memory.
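
A sketch of this style of kernel (the signature and names are assumptions):

// Basic CUDA kernel: one thread computes one output element, with all
// operands read directly from global memory.
__global__ void matmul_basic(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}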

Optimized CUDA Implementation: Uses shared memory to reduce global memory traffic (a kernel sketch follows the list):

• Divides matrices into 16×16 tiles

• Loads tiles into shared memory

• Computes results using shared memory data

• Synchronizes threads to ensure data consistency
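
A hedged sketch of such a tiled kernel, following the 16×16 tiling described above (names are illustrative):

#define TILE 16  // matches the 16x16 tiles described above

// Tiled CUDA kernel: each block stages one TILE x TILE tile of A and B in
// shared memory, so each global element is read once per tile instead of
// once per multiply-add.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Cooperatively load one tile of A and one of B (zero-pad out-of-range cells).
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before any thread reads it

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all reads done before the tile is overwritten
    }
    if (row < N && col < N)
        C[row * N + col] = sum;
}

Because every element staged in shared memory is reused by 16 threads, global memory traffic drops by roughly the tile width, which is consistent with the additional speedup reported below.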

Performance Comparison Results

Experimental Setup

• GPU: NVIDIA GeForce RTX 3080

• CPU: Intel Core i7-10700K, 3.8GHz

• CUDA Version: 11.4

• Matrix Sizes: 256×256, 512×512, 1024×1024
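
The millisecond timings below were presumably collected with GPU timers; CUDA events are the usual tool for this. A sketch under that assumption (the helper time_kernel and the 16×16 launch configuration are illustrative, not taken from the submitted code):

// Hypothetical host-side timing in milliseconds using CUDA events.
// dA, dB, dC are device buffers already allocated with cudaMalloc and
// filled via cudaMemcpy; matmul_tiled is the kernel sketched earlier.
float time_kernel(const float* dA, const float* dB, float* dC, int N) {
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}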

Results

Execution Time (milliseconds)

Matrix Size    CPU Time    Basic CUDA Time    Optimized CUDA Time
256×256        51.75       0.84               0.43
512×512        415.21      5.73               2.32
1024×1024      3302.47     43.62              16.54

Speedup Compared to CPU

Matrix Size    Basic CUDA Speedup    Optimized CUDA Speedup
256×256        61.6x                 120.3x
512×512        72.5x                 178.9x
1024×1024      75.7x                 199.7x

Performance Visualization

Figure 2: Speedup achieved by CUDA implementations

Analysis and Discussion

Key Performance Insights

1. CPU vs. GPU Performance: GPU implementations demonstrate massive performance advantages over the CPU, with even the basic CUDA version achieving a 60x+ speedup. This advantage grows with matrix size, highlighting the GPU's excellent scalability for this problem.

2. Shared Memory Benefits: The optimized CUDA implementation achieves approximately 2-2.6x additional speedup over the basic version by utilizing shared memory, demonstrating the critical importance of memory access patterns in GPU programming.

3. Scaling with Problem Size: Performance benefits increase with matrix size, reaching nearly 200x speedup for 1024×1024 matrices with the optimized implementation. Larger problems expose more parallelism for the GPU to exploit, while CPU runtime grows cubically with N.

Performance Bottlenecks and Optimization Opportunities

Despite shared memory optimization, memory bandwidth remains a limiting factor. Additional
optimization possibilities include:

• Better register usage

• Memory coalescing techniques

• Block size optimization for better occupancy

• Using texture memory for read-only data

• Loop unrolling (see the sketch after this list)

• Utilizing tensor cores on supported hardware
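
As one concrete illustration, the inner product over a tile in the kernel sketched earlier can be unrolled with a compiler hint, since TILE is a compile-time constant (a minimal sketch, not a measured optimization from this project):

// Inside the tiled kernel: unroll the per-tile inner product so the
// compiler can schedule the 16 multiply-adds without loop overhead.
#pragma unroll
for (int k = 0; k < TILE; ++k)
    sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];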

Conclusion

This project demonstrates the enormous potential of GPU acceleration for computationally
intensive tasks like matrix multiplication. We achieved speedups of up to 200x using CUDA
compared to a CPU implementation. The optimized CUDA implementation using shared memory
significantly outperformed the basic CUDA implementation, emphasizing the importance of
understanding and optimizing for the GPU memory hierarchy.

Matrix multiplication serves as an excellent case study for GPU computing because:

1. It is compute-intensive (O(N³) operations)

2. It has high arithmetic intensity

3. It is naturally parallelizable

4. It demonstrates the importance of memory access patterns

The performance improvements observed in this project highlight why GPUs have become
essential tools in high-performance computing, machine learning, and scientific computing
applications where large-scale matrix operations are common.
