CSNB594/CSNB4423 Parallel Computing (2022)
Lab 5 – CUDA
Student ID: SW0104101 Student Name: HARVEEN A/L VELAN
Instruction: Follow all instructions given. Write your answers into the table on the
final page of this document. DO NOT REMOVE ANY PAGE.
Note:
We are not going to use Microsoft Visual Studio/Code or Xcode on Mac to avoid any
unnecessary configuration.
%%writefile Activity2.cu
5. Within the same cell, continue in the next line with the following code:
#include <stdio.h>
int main(void) {
    printf("Hello, World!\n");
    return 0;
}
6. Click Run or press Ctrl+Enter.
7. We are going to compile this program using the nvcc compiler, and then execute it.
%%shell
nvcc Activity2.cu -o outputActivity2
./outputActivity2
#include <stdio.h>
__global__ void mykernel(void) {
}
int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello, World!\n");
    return 0;
}
2. Recompile and rerun this program.
The <<<1,1>>> launch creates a grid with a single block (Block 0) containing a single thread (Thread 0).
Figure 1
2. The grid implementation concept works as follows: a single block (blockIdx.x 0) contains five threads, and each thread computes one element.
blockIdx.x 0
threadIdx.x 0   threadIdx.x 1   threadIdx.x 2   threadIdx.x 3   threadIdx.x 4
c=a+b           c=a+b           c=a+b           c=a+b           c=a+b
%%writefile Activity4a.cu
7. Within the same cell, continue in the next line with the following code:
#include <stdio.h>
#define ARRAYSIZE 5
__global__ void addition(int *a, int *b, int *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
int main()
{
    int a[ARRAYSIZE] = { 1, 2, 3, 4, 5 };
    int b[ARRAYSIZE] = { 10, 20, 30, 40, 50 };
    int c[ARRAYSIZE] = { 0 };
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    // Allocate GPU buffers for three vectors (two input, one output).
    cudaMalloc((void**)&dev_a, ARRAYSIZE * sizeof(int));
    cudaMalloc((void**)&dev_b, ARRAYSIZE * sizeof(int));
    cudaMalloc((void**)&dev_c, ARRAYSIZE * sizeof(int));
    // Copy input vectors from host memory to GPU buffers.
    cudaMemcpy(dev_a, a, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);
    // Launch a kernel on the GPU with one thread for each element.
    addition<<<1, ARRAYSIZE>>>(dev_a, dev_b, dev_c);
    // Copy the output vector from the GPU buffer back to host memory.
    cudaMemcpy(c, dev_c, ARRAYSIZE * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < ARRAYSIZE; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
8. Identify the important code here that allows memory copy from host to device (GPU), and
from device to host.
9. Run the program and observe the output.
5. Within the same cell, continue in the next line with the following code:
%%writefile Activity4b.cu
#include <stdio.h>
#define ARRAYSIZE 5
__global__ void addition(int *a, int *b, int *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
int main()
{
    int i;
    int a[ARRAYSIZE];
    int b[ARRAYSIZE];
    int c[ARRAYSIZE];
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    for(i=0;i<ARRAYSIZE;i++){
        a[i]=(i+1)*10;
    }
    for(i=0;i<ARRAYSIZE;i++){
        b[i]=(i+1)*100;
    }
    // Allocate GPU buffers for three vectors (two input, one output).
    cudaMalloc((void**)&dev_a, ARRAYSIZE * sizeof(int));
    cudaMalloc((void**)&dev_b, ARRAYSIZE * sizeof(int));
    cudaMalloc((void**)&dev_c, ARRAYSIZE * sizeof(int));
    // Copy input vectors from host memory to GPU buffers.
    cudaMemcpy(dev_a, a, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);
    // Launch a kernel on the GPU with one thread for each element.
    addition<<<1, ARRAYSIZE>>>(dev_a, dev_b, dev_c);
    // Copy the output vector from the GPU buffer back to host memory.
    cudaMemcpy(c, dev_c, ARRAYSIZE * sizeof(int), cudaMemcpyDeviceToHost);
    for(i=0;i<ARRAYSIZE;i++){
        printf("%d + %d = %d\n",a[i],b[i],c[i]);
    }
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
%%writefile Activity5.cu
#include <stdio.h>
#define ARRAYSIZE 5
__global__ void addition(int *a, int *b, int *c)
{
    int i = blockIdx.x;
    c[i] = a[i] + b[i];
}
int main()
{
    int i;
    int a[ARRAYSIZE];
    int b[ARRAYSIZE];
    int c[ARRAYSIZE];
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    for(i=0;i<ARRAYSIZE;i++){
        a[i]=(i+1)*10;
    }
    for(i=0;i<ARRAYSIZE;i++){
        b[i]=(i+1)*100;
    }
    // Allocate GPU buffers for three vectors (two input, one output).
    cudaMalloc((void**)&dev_a, ARRAYSIZE * sizeof(int));
    cudaMalloc((void**)&dev_b, ARRAYSIZE * sizeof(int));
    cudaMalloc((void**)&dev_c, ARRAYSIZE * sizeof(int));
    // Copy input vectors from host memory to GPU buffers.
    cudaMemcpy(dev_a, a, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, ARRAYSIZE * sizeof(int), cudaMemcpyHostToDevice);
    // Launch a kernel on the GPU with one thread for each element.
    addition<<<1, ARRAYSIZE>>>(dev_a, dev_b, dev_c);
    // Copy the output vector from the GPU buffer back to host memory.
    cudaMemcpy(c, dev_c, ARRAYSIZE * sizeof(int), cudaMemcpyDeviceToHost);
    for(i=0;i<ARRAYSIZE;i++){
        printf("%d + %d = %d\n",a[i],b[i],c[i]);
    }
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
blockIdx.x 0
threadIdx.x 0
c=a+b
7. Now, modify the kernel launch code. By default, the value is <<<1, ARRAYSIZE>>>: the
number of blocks is 1 and the number of threads per block is set by the array size. Now
change the value to the code below.
addition<<<ARRAYSIZE, 1>>>(dev_a, dev_b, dev_c);
8. The grid implementation concept works as follows:
blockIdx.x 0 blockIdx.x 1 blockIdx.x 2 blockIdx.x 3 blockIdx.x 4
threadIdx.x 0 threadIdx.x 0 threadIdx.x 0 threadIdx.x 0 threadIdx.x 0
c=a+b c=a+b c=a+b c=a+b c=a+b
9. Run the program. Screen capture your code and the output.
10. Increase the array size to 100. Observe the output.
11. Increase the array size to 1000. Observe the output.
12. Increase the array size to 10000. Observe the output.
6. We are going to work with 10 data elements. The data is going to be divided into 2 blocks.
Each block consists of 5 threads.
7. The grid implementation concept works as follows:
blockIdx.x 0           blockIdx.x 1
threadIdx.x 0  c=a+b   threadIdx.x 0  c=a+b
threadIdx.x 1  c=a+b   threadIdx.x 1  c=a+b
threadIdx.x 2  c=a+b   threadIdx.x 2  c=a+b
threadIdx.x 3  c=a+b   threadIdx.x 3  c=a+b
threadIdx.x 4  c=a+b   threadIdx.x 4  c=a+b
8. Change the ARRAYSIZE to 10.
9. Set the thread number to 5.
#define NUMTHREAD 5
10. Set the number of blocks by dividing the total data count by the thread number (note that,
despite its name, BLOCKSIZE here is the number of blocks in the grid, not the size of a block).
#define BLOCKSIZE ARRAYSIZE/NUMTHREAD
11. Replace this code int i = blockIdx.x; with the following code, which combines the block
and thread indices into one global index.
int i = blockIdx.x * blockDim.x + threadIdx.x;
12. Now, modify the kernel launch code. The total array size is 10 and there are 5 threads per
block, so the program expects 2 blocks with 5 threads per block.
addition<<<BLOCKSIZE, NUMTHREAD>>>(dev_a, dev_b, dev_c);
13. Run the program. Screen capture your code and the output.
14. Increase the array size to 100. Observe the output.
15. Increase the array size to 1000. Observe the output.
16. Increase the array size to 10000. Observe the output.
Activity 5
APPLY blockIdx.x: Run the program. Screen capture your code and the output.
Activity 6
COMBINATION OF blockIdx.x and threadIdx.x: Run the program. Screen capture your code and the output.
Activity 7
MULTIPLE KERNEL LAUNCHERS IN A PROGRAM: Run the program. Screen capture your code and the output.
TOTAL
Convert this Word document into PDF and rename the file to:
CSNB594CSNB4423 Lab 5 <section><student Name>.pdf before submission.