
CSNB594/CSNB4423 Parallel Computing (2022)

Lab 5 – CUDA
Student ID: SW0104101 Student Name: HARVEEN A/L VELAN

Instruction: Follow all instructions given. Write your answer into the table located at the
final page of this document. DO NOT REMOVE ANY PAGE.

This course uses an online IDE, Google Colaboratory (Colab)


https://colab.research.google.com/

Minimum Hardware / System Requirements:


- A web browser (best with Chrome)
- Google account
- Operating system independent

Minimum Programming Language Requirement:


- C Language

Note:
We are not going to use Microsoft Visual Studio/Code or Xcode on Mac to avoid any
unnecessary configuration.

QUICK NOTES FROM LAB 0


1. Launch the web browser. If you are using the computer lab, it is recommended to use
incognito mode.
2. Go to Google Colaboratory (Colab) https://colab.research.google.com/
3. Login to your Google account.
4. Click New notebook.
5. Google Colab uses a Python-based environment. We are going to compile C programs from
within it.
6. The %%writefile magic creates and writes the source code into a .c file. E.g.: %%writefile
hello.c
7. Click Run or press Ctrl+Enter. A file hello.c will be generated with the given code.
8. Enable shell commands using %%shell. Compile the .c file using the gcc compiler to
generate an executable file. To execute the output file, use ./ followed by the output file
name.

ACTIVITY 1. CREATE NEW NOTEBOOK AND ACTIVATE GPU


1. Click New Notebook
2. Double click the title, rename the Notebook as CUDA.ipynb
3. Switch the runtime type to GPU. Runtime – Change runtime type - Notebook settings:
a. Hardware accelerator: GPU
b. Save
4. Check the CUDA installation.
!nvcc --version
5. Observe the output.


6. Now your notebook should be ready to execute CUDA.


7. Step 3 of Activity 1 needs to be repeated every time you start the notebook.
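Beyond !nvcc --version, you can optionally verify that the runtime actually sees a GPU from a small program. A minimal sketch (the file name devquery.cu and the chosen fields are illustrative, not part of the lab):

```cuda
#include <stdio.h>
#include "cuda_runtime.h"

int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        // No GPU runtime available (e.g. GPU accelerator not enabled in Colab).
        printf("No CUDA device found: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device 0: %s\n", prop.name);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    return 0;
}
```

Write it with %%writefile devquery.cu, then compile and run in a %%shell cell with nvcc devquery.cu -o devquery and ./devquery.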

ACTIVITY 2: CREATE A NEW PROJECT WITHOUT DEVICE CODE


1. Add a new cell for text.
2. Set the text to Activity 2.
3. Move up the text cell to the first cell.
4. In the Code cell, create a CUDA C program named Activity2.cu:

%%writefile Activity2.cu
5. Within the same cell, continue on the next line with the following code:
#include<stdio.h>
int main(void) {
printf("Hello, World!\n");
return 0;
}
6. Click Run or Ctrl+Enter
7. We are going to compile this program using the nvcc compiler, then execute it.
%%shell
nvcc Activity2.cu -o outputActivity2
./outputActivity2

8. Click Run or Ctrl+Enter

ACTIVITY 3: ADD DEVICE CODE INTO THE SOURCE CODE

1. Insert the following device code into the source code.


%%writefile Activity2.cu
#include<stdio.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
__global__ void mykernel(void) {
}
int main(void) {
mykernel<<<1,1>>>();
printf("Hello, World!\n");
return 0;
}
2. Recompile and rerun this program.

The CUDA C keyword __global__ indicates that a function runs on the GPU.


The triple angle brackets in mykernel<<<1,1>>>() mark a call from host code to device code.
In the kernel launch configuration, the left value indicates the number of blocks, while the
right value indicates the number of threads per block.
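To see how the launch configuration maps onto blocks and threads, a small sketch (the kernel name whoami is a made-up example; device-side printf requires a synchronize before the program exits so the output is flushed):

```cuda
#include <stdio.h>

// Each thread reports its own block and thread index.
__global__ void whoami(void) {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    whoami<<<2, 3>>>();       // 2 blocks x 3 threads per block = 6 lines of output
    cudaDeviceSynchronize();  // wait for the kernel and flush device printf
    return 0;
}
```

Changing the configuration to <<<1,1>>> would produce a single line, matching the single-block, single-thread grid used in this activity.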


The resulting grid contains a single block (Block 0) with a single thread (Thread 0).

ACTIVITY 4: PASSING DATA TO THE GPU AND APPLYING THREADIDX

ACTIVITY 4A: WORKING WITH ARRAY SIZE 5


1. The program concept for Activity 4 is as follows:


Figure 1
2. The grid implementation concept works as follows:
blockIdx.x 0
threadIdx.x 0 threadIdx.x 1 threadIdx.x 2 threadIdx.x 3 threadIdx.x 4
c=a+b c=a+b c=a+b c=a+b c=a+b

3. Add a new cell for text.


4. Set the text to Activity 4a.
5. Add a new cell for code.
6. In the Code cell, create a CUDA C program named Activity4a.cu:

%%writefile Activity4a.cu
7. Within the same cell, continue on the next line with the following code:

#include <stdio.h>
#define ARRAYSIZE 5

__global__ void addition(int *X, int *Y, int *Z)
{
int i = threadIdx.x;
Z[i] = X[i] + Y[i];
}

int main()
{
int a[ARRAYSIZE] = { 1, 2, 3, 4, 5 };
int b[ARRAYSIZE] = { 10, 20, 30, 40, 50 };
int c[ARRAYSIZE] = { 0 };
int *dev_a = 0;
int *dev_b = 0;
int *dev_c = 0;

// Allocate GPU buffers for three vectors (two input, one output).
cudaMalloc((void**)&dev_a, ARRAYSIZE * sizeof(int));
cudaMalloc((void**)&dev_b, ARRAYSIZE * sizeof(int));
cudaMalloc((void**)&dev_c, ARRAYSIZE * sizeof(int));

// Copy input vectors from host memory to GPU buffers.


cudaMemcpy(dev_a, a, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);

// Launch a kernel on the GPU with one thread for each element.
addition<<<1, ARRAYSIZE>>>(dev_a, dev_b, dev_c);


// cudaDeviceSynchronize waits for the kernel to finish, and returns


// any errors encountered during the launch.
cudaDeviceSynchronize();
// Copy output vector from GPU buffer to host memory.
cudaMemcpy(c, dev_c, ARRAYSIZE * sizeof(int), cudaMemcpyDeviceToHost);

printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
c[0], c[1], c[2], c[3], c[4]);

return 0;
}

8. Identify the important code here that allows memory copies from host to device (GPU),
and from device to host.
9. Run the program and observe the output.
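To help with step 8, here is a minimal sketch of the host-to-device / device-to-host transfer pattern with basic error checking. The CHECK macro is an illustration added for this sketch, not part of the lab code:

```cuda
#include <stdio.h>
#include "cuda_runtime.h"

// Hypothetical helper: print a message and bail out if a CUDA call fails.
#define CHECK(call)                                             \
    do {                                                        \
        cudaError_t e = (call);                                 \
        if (e != cudaSuccess) {                                 \
            printf("CUDA error: %s\n", cudaGetErrorString(e));  \
            return 1;                                           \
        }                                                       \
    } while (0)

int main(void) {
    int host[5] = { 1, 2, 3, 4, 5 };
    int back[5] = { 0 };
    int *dev = 0;

    CHECK(cudaMalloc((void**)&dev, 5 * sizeof(int)));
    // Host -> device: last argument is cudaMemcpyHostToDevice.
    CHECK(cudaMemcpy(dev, host, 5 * sizeof(int), cudaMemcpyHostToDevice));
    // Device -> host: last argument is cudaMemcpyDeviceToHost.
    CHECK(cudaMemcpy(back, dev, 5 * sizeof(int), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(dev));

    printf("round trip: %d %d %d %d %d\n",
           back[0], back[1], back[2], back[3], back[4]);
    return 0;
}
```

The direction constant in the last cudaMemcpy argument is exactly what step 8 asks you to identify in Activity4a.cu.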

ACTIVITY 4B: INCREASING ARRAY SIZE AND AUTOMATING THE INITIALIZATION


1. Add a new cell for text.
2. Set the text to Activity 4b.
3. Add a new cell for code.
4. In the Code cell, create a CUDA C program named Activity4b.cu:
%%writefile Activity4b.cu

5. Within the same cell, continue on the next line with the following code:
#include <stdio.h>
#define ARRAYSIZE 5

__global__ void addition(int *X, int *Y, int *Z)
{
int i = threadIdx.x;
Z[i] = X[i] + Y[i];
}

int main()
{
int i;
int a[ARRAYSIZE];
int b[ARRAYSIZE];
int c[ARRAYSIZE];
int *dev_a = 0;
int *dev_b = 0;


int *dev_c = 0;

for(i=0;i<ARRAYSIZE;i++){
a[i]=(i+1)*10;
}

for(i=0;i<ARRAYSIZE;i++){
b[i]=(i+1)*100;
}

// Allocate GPU buffers for three vectors (two input, one output).
cudaMalloc((void**)&dev_a, ARRAYSIZE * sizeof(int));
cudaMalloc((void**)&dev_b, ARRAYSIZE * sizeof(int));
cudaMalloc((void**)&dev_c, ARRAYSIZE * sizeof(int));

// Copy input vectors from host memory to GPU buffers.


cudaMemcpy(dev_a, a, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);

// Launch a kernel on the GPU with one thread for each element.
addition<<<1, ARRAYSIZE>>>(dev_a, dev_b, dev_c);

// cudaDeviceSynchronize waits for the kernel to finish, and returns


// any errors encountered during the launch.
cudaDeviceSynchronize();

// Copy output vector from GPU buffer to host memory.


cudaMemcpy(c, dev_c, ARRAYSIZE * sizeof(int), cudaMemcpyDeviceToHost);

for(i=0;i<ARRAYSIZE;i++){
printf("%d + %d = %d\n",a[i],b[i],c[i]);
}

return 0;
}

6. Run the program and observe the output.


7. Change the ARRAYSIZE value to 100.
8. Run the program. Screen capture your code and the output.
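Changing ARRAYSIZE to 100 still works with the <<<1, ARRAYSIZE>>> launch because 100 is below the per-block thread limit (1024 on most current GPUs). A hedged sketch of how a launch that exceeds the limit shows up, using cudaGetLastError (the kernel name noop is illustrative):

```cuda
#include <stdio.h>
#include "cuda_runtime.h"

__global__ void noop(void) { }

int main(void) {
    // 100 threads in one block is within the limit on typical GPUs.
    noop<<<1, 100>>>();
    cudaError_t launchErr = cudaGetLastError();       // errors from the launch itself
    cudaError_t syncErr   = cudaDeviceSynchronize();  // errors during execution
    printf("launch: %s, sync: %s\n",
           cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));

    // An oversized block (e.g. 2048 threads) fails at launch time;
    // the kernel silently does not run unless you check for the error.
    noop<<<1, 2048>>>();
    printf("oversized launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```

This is why a kernel launch that "does nothing" with no visible error message is worth checking with cudaGetLastError.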

ACTIVITY 5: APPLY blockIdx.x

1. Add a new cell for text.


2. Set the text to Activity 5.
3. Add a new cell for code.


4. In the Code cell, create a CUDA C program named Activity5.cu


5. Copy the source code of Activity4b.cu. Change the threadIdx.x in the addition function
definition to blockIdx.x.

%%writefile Activity5.cu
#include <stdio.h>
#define ARRAYSIZE 5

__global__ void addition(int *X, int *Y, int *Z)
{
int i = blockIdx.x;
Z[i] = X[i] + Y[i];
}

int main()
{
int i;
int a[ARRAYSIZE];
int b[ARRAYSIZE];
int c[ARRAYSIZE];
int *dev_a = 0;
int *dev_b = 0;
int *dev_c = 0;

for(i=0;i<ARRAYSIZE;i++){
a[i]=(i+1)*10;
}

for(i=0;i<ARRAYSIZE;i++){
b[i]=(i+1)*100;
}

// Allocate GPU buffers for three vectors (two input, one output).
cudaMalloc((void**)&dev_a, ARRAYSIZE * sizeof(int));
cudaMalloc((void**)&dev_b, ARRAYSIZE * sizeof(int));
cudaMalloc((void**)&dev_c, ARRAYSIZE * sizeof(int));

// Copy input vectors from host memory to GPU buffers.


cudaMemcpy(dev_a, a, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);

// Launch a kernel on the GPU with one thread for each element.
addition<<<1, ARRAYSIZE>>>(dev_a, dev_b, dev_c);


// cudaDeviceSynchronize waits for the kernel to finish, and returns


// any errors encountered during the launch.
cudaDeviceSynchronize();

// Copy output vector from GPU buffer to host memory.


cudaMemcpy(c, dev_c, ARRAYSIZE * sizeof(int), cudaMemcpyDeviceToHost);

for(i=0;i<ARRAYSIZE;i++){
printf("%d + %d = %d\n",a[i],b[i],c[i]);
}

return 0;
}

Run the program.

You should now see that only the first index has the correct result; the others are zero.
With the launch <<<1, ARRAYSIZE>>>, every thread belongs to block 0, so blockIdx.x is 0 for
all of them and only element 0 is computed.
6. The grid implementation concept works as follows:

blockIdx.x 0
threadIdx.x 0
c=a+b

7. Now, modify the kernel launch code. By default, the configuration is <<<1, ARRAYSIZE>>>:
the number of blocks is 1 and the number of threads per block is set according to the array
size. Now change it to the following:
addition<<<ARRAYSIZE, 1>>>(dev_a, dev_b, dev_c);
8. The grid implementation concept works as follows:
blockIdx.x 0 blockIdx.x 1 blockIdx.x 2 blockIdx.x 3 blockIdx.x 4
threadIdx.x 0 threadIdx.x 0 threadIdx.x 0 threadIdx.x 0 threadIdx.x 0
c=a+b c=a+b c=a+b c=a+b c=a+b

9. Run the program. Screen capture your code and the output.
10. Increase the array size to 100. Observe the output.
11. Increase the array size to 1000. Observe the output.
12. Increase the array size to 10000. Observe the output.

ACTIVITY 6: COMBINATION OF blockIdx.x and threadIdx.x

1. Add a new cell for text.


2. Set the text to Activity 6.
3. Add a new cell for code.
4. In the Code cell, create a CUDA C program named Activity6.cu
5. Copy the source code of Activity5.cu.


6. We are going to work with 10 data elements. The data is going to be divided into 2 blocks.
Each block consists of 5 threads.
7. The grid implementation concept works as follows:
blockIdx.x 0                blockIdx.x 1
threadIdx.x 0: c = a + b    threadIdx.x 0: c = a + b
threadIdx.x 1: c = a + b    threadIdx.x 1: c = a + b
threadIdx.x 2: c = a + b    threadIdx.x 2: c = a + b
threadIdx.x 3: c = a + b    threadIdx.x 3: c = a + b
threadIdx.x 4: c = a + b    threadIdx.x 4: c = a + b
8. Change the ARRAYSIZE to 10.
9. Set the thread number to 5.

#define NUMTHREAD 5
10. Set the number of blocks by dividing the total data size by the thread number.
#define BLOCKSIZE ARRAYSIZE/NUMTHREAD
11. Replace this code int i = blockIdx.x; with the following code.

int i = threadIdx.x + blockIdx.x * NUMTHREAD;

12. Now, modify the kernel launch code. The total array size is 10 and there are 5 threads
per block, so the program expects to have 2 blocks with 5 threads per block.

addition<<<BLOCKSIZE, NUMTHREAD>>>(dev_a, dev_b, dev_c);

13. Run the program. Screen capture your code and the output.
14. Increase the array size to 100. Observe the output.
15. Increase the array size to 1000. Observe the output.
16. Increase the array size to 10000. Observe the output.
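One caveat with BLOCKSIZE defined as ARRAYSIZE/NUMTHREAD: integer division only covers every element when ARRAYSIZE is a multiple of NUMTHREAD (true for 10, 100, 1000, and 10000 here). A hedged variant showing the usual general-purpose fix, rounding the block count up and guarding the index (the rounding macro and the extra n parameter are additions for this sketch):

```cuda
#include <stdio.h>
#define ARRAYSIZE 12      // deliberately NOT a multiple of NUMTHREAD
#define NUMTHREAD 5
// Round up so every element gets a thread: (12 + 5 - 1) / 5 = 3 blocks.
#define BLOCKSIZE ((ARRAYSIZE + NUMTHREAD - 1) / NUMTHREAD)

__global__ void addition(int *X, int *Y, int *Z, int n)
{
    int i = threadIdx.x + blockIdx.x * NUMTHREAD;
    if (i < n)            // guard: the last block has surplus threads
        Z[i] = X[i] + Y[i];
}

int main(void)
{
    int a[ARRAYSIZE], b[ARRAYSIZE], c[ARRAYSIZE];
    int *dev_a, *dev_b, *dev_c;
    for (int i = 0; i < ARRAYSIZE; i++) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc((void**)&dev_a, ARRAYSIZE * sizeof(int));
    cudaMalloc((void**)&dev_b, ARRAYSIZE * sizeof(int));
    cudaMalloc((void**)&dev_c, ARRAYSIZE * sizeof(int));
    cudaMemcpy(dev_a, a, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);

    addition<<<BLOCKSIZE, NUMTHREAD>>>(dev_a, dev_b, dev_c, ARRAYSIZE);
    cudaDeviceSynchronize();

    cudaMemcpy(c, dev_c, ARRAYSIZE * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < ARRAYSIZE; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
```

Without the if (i < n) guard, the surplus threads in the last block would read and write past the end of the arrays.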

ACTIVITY 7: MULTIPLE KERNEL LAUNCHERS IN A PROGRAM

1. Add a new cell for text.


2. Set the text to Activity 7.
3. Add a new cell for code.
4. In the Code cell, create a CUDA C program named Activity7.cu
5. Copy the source code of Activity6.cu.
6. Set the ARRAYSIZE to 10, and thread number to 5.
7. Add a new user defined function, multiplication. This function calculates the multiplication
between array a and b.


__global__ void multiplication(int *L, int *M, int *N)
{
int i = threadIdx.x + blockIdx.x * NUMTHREAD;
N[i] = L[i] * M[i];
}

8. Declare an array to handle the multiplication results, and its pointer.


9. Add the kernel launcher to call the multiplication function. Apply the same total blocks
and threads as the addition function call. Save the result into the array created in step 8.
10. Create a new for loop to display the multiplication results.
11. Run the program. Screen capture your code and the output.
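Steps 6–10 can be sketched as follows. This is one possible arrangement, not the required answer; in particular the names d and dev_d for the product array and its device pointer are illustrative choices:

```cuda
#include <stdio.h>
#define ARRAYSIZE 10
#define NUMTHREAD 5
#define BLOCKSIZE (ARRAYSIZE / NUMTHREAD)

__global__ void addition(int *X, int *Y, int *Z) {
    int i = threadIdx.x + blockIdx.x * NUMTHREAD;
    Z[i] = X[i] + Y[i];
}

__global__ void multiplication(int *L, int *M, int *N) {
    int i = threadIdx.x + blockIdx.x * NUMTHREAD;
    N[i] = L[i] * M[i];
}

int main(void) {
    int a[ARRAYSIZE], b[ARRAYSIZE], c[ARRAYSIZE], d[ARRAYSIZE];
    int *dev_a, *dev_b, *dev_c, *dev_d;   // dev_d holds the products
    for (int i = 0; i < ARRAYSIZE; i++) { a[i] = (i + 1) * 10; b[i] = (i + 1) * 100; }

    cudaMalloc((void**)&dev_a, ARRAYSIZE * sizeof(int));
    cudaMalloc((void**)&dev_b, ARRAYSIZE * sizeof(int));
    cudaMalloc((void**)&dev_c, ARRAYSIZE * sizeof(int));
    cudaMalloc((void**)&dev_d, ARRAYSIZE * sizeof(int));
    cudaMemcpy(dev_a, a, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);

    // Two kernel launches in one program, same grid configuration.
    addition<<<BLOCKSIZE, NUMTHREAD>>>(dev_a, dev_b, dev_c);
    multiplication<<<BLOCKSIZE, NUMTHREAD>>>(dev_a, dev_b, dev_d);
    cudaDeviceSynchronize();

    cudaMemcpy(c, dev_c, ARRAYSIZE * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(d, dev_d, ARRAYSIZE * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < ARRAYSIZE; i++)
        printf("%d + %d = %d, %d * %d = %d\n", a[i], b[i], c[i], a[i], b[i], d[i]);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c); cudaFree(dev_d);
    return 0;
}
```

Both kernels are queued on the same stream, so they execute in order; a single cudaDeviceSynchronize after the second launch waits for both.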


Instruction: Write/place your answer in the specified column given.


Marking Scheme
Marks:      0      1-4                  5
Completion: None   Partially complete   Complete

QUESTION                                                 MARKS    ANSWER

Activity 1
Activity 2
Activity 3
Activity 4 – Activity 4b: Increasing array size and
automating the initialization. Run the program.
Screen capture your code and the output.
Activity 5 – Apply blockIdx.x. Run the program.
Screen capture your code and the output.
Activity 6 – Combination of blockIdx.x and
threadIdx.x. Run the program. Screen capture your
code and the output.
Activity 7 – Multiple kernel launchers in a program.
Run the program. Screen capture your code and the
output.

TOTAL


Convert this Word document into PDF and rename the file to:
CSNB594CSNB4423 Lab 5 <section><student Name>.pdf before the submission.

Submission type: Online
