/* Lesson 1 --Code from Quiz */
#include <stdio.h>
// each thread cubes one element of the input array
__global__ void cube(float * d_out, float * d_in){
    int thid = threadIdx.x;
    float num = d_in[thid];
    d_out[thid] = num * num * num;
}
int main(int argc, char ** argv) {
    const int ARRAY_SIZE = 96;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    // generate the input array on the host
    float h_in[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) {
        h_in[i] = float(i);
    }
    float h_out[ARRAY_SIZE];

    // declare GPU memory pointers
    float * d_in;
    float * d_out;

    // allocate GPU memory
    cudaMalloc((void**) &d_in, ARRAY_BYTES);
    cudaMalloc((void**) &d_out, ARRAY_BYTES);

    // transfer the array to the GPU
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // launch the kernel
    cube<<<1, ARRAY_SIZE>>>(d_out, d_in);

    // copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    // print out the resulting array
    for (int i = 0; i < ARRAY_SIZE; i++) {
        printf("%f", h_out[i]);
        printf(((i % 4) != 3) ? "\t" : "\n");
    }

    cudaFree(d_in);
    cudaFree(d_out);

    return 0;
}
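The launch cube<<<1, ARRAY_SIZE>>> runs everything in a single block of 96 threads, which only works because the whole array fits in one block (at most 1024 threads on current GPUs). A minimal sketch of the multi-block version, assuming a hypothetical cube_multiblock kernel and a size parameter n that are not part of the quiz code:
// Sketch (not from the quiz): the same work spread over several blocks.
// Each thread computes a global index and guards against running off the end.
__global__ void cube_multiblock(float * d_out, float * d_in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float num = d_in[idx];
        d_out[idx] = num * num * num;
    }
}
// launch with enough 128-thread blocks to cover n elements:
// cube_multiblock<<<(n + 127) / 128, 128>>>(d_out, d_in, n);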
/* LESSON 2 */
many threads solving a problem by working together
parallel communication patterns (sketch of three of them below):
map        one to one
transpose  one to one
gather     many to one
scatter    one to many
stencil    several to one
reduce     all to one
scan/sort  all to all
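A rough sketch of three of these patterns as kernels (my own illustration, not lesson code; kernel and array names are made up):
// map: one to one -- each output element comes from exactly one input element
__global__ void map_k(float *out, const float *in) {
    int i = threadIdx.x;
    out[i] = 2.0f * in[i];
}
// gather: many to one -- each output element reads several input elements
__global__ void gather_k(float *out, const float *in) {
    int i = threadIdx.x;
    out[i] = 0.5f * (in[2 * i] + in[2 * i + 1]);   // each thread reads two inputs
}
// scatter: one to many -- each thread computes where its input contributes
__global__ void scatter_k(float *out, const float *in) {
    int i = threadIdx.x;
    out[2 * i]     += 0.5f * in[i];                // each thread writes two outputs
    out[2 * i + 1] += 0.5f * in[i];
}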
shared memory    shared by the threads within a block
global memory    shared by all threads on the GPU
local memory     private to each thread
__syncthreads(); is crucial when you separate read and write phases --
must ensure all values are written before any thread starts reading them
maximize arithmetic intensity (math operations per memory access)
-minimize the time spent on memory per thread
-local memory > shared memory >> global memory
__shared__ float sh_array[128];
sh_array[index] = array[index];   // copies from global to shared
__syncthreads();                  // ensures the copy is complete before anyone reads it
make sure to coalesce global memory accesses -- adjacent threads should read adjacent memory locations
shared memory has the lifetime of the thread block
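Putting the last few notes together, a sketch of a kernel that stages its block's slice of global memory into shared memory, syncs, then reads it (names and the 128-thread block size are my assumptions, not lesson code):
// averages each element with its right neighbour, reading from shared memory
__global__ void smooth(float *out, const float *in) {
    __shared__ float sh_array[128];      // lives only as long as the thread block
    int index = threadIdx.x;
    sh_array[index] = in[index];         // coalesced copy: global -> shared
    __syncthreads();                     // every write must finish before any read
    if (index < 127)                     // last thread has no right neighbour
        out[index] = 0.5f * (sh_array[index] + sh_array[index + 1]);
}
// launched as smooth<<<1, 128>>>(d_out, d_in);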
when many threads write to the same array elements you get race conditions
CUDA has a class of functions called atomics for this
e.g.)
atomicAdd(&g[i], 1);   // atomically adds 1 to g[i]
atomicCAS() can be used as a building block for any other atomic operation (sketch after the kernels below)
// (ARRAY_SIZE is a constant defined elsewhere in the program)
__global__ void increment_naive(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread increments consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;
    g[i] = g[i] + 1;        // read-modify-write is not atomic: threads can race
}

__global__ void increment_atomic(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread increments consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;
    atomicAdd(&g[i], 1);    // the same increment, done atomically
}
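To back up the atomicCAS() note above: any read-modify-write can be built from a compare-and-swap retry loop. A sketch (my own, not lesson code) of an atomic multiply for ints, which CUDA does not provide directly:
// keep retrying until the value we read is still the value in memory at swap time
__device__ int atomicMul(int *addr, int val) {
    int old = *addr, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr, assumed, assumed * val);
    } while (assumed != old);
    return old;     // returns the value seen just before our multiply landed
}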
//Summary
gather, scatter, stencil, transpose
SMs, threads, blocks, ordering
local, global, shared, atomics
//Efficient GPU programming
high arithmetic intensity -- move data to faster memory if you need to
local > shared > global
use coalesced global memory if you need global
avoid diverging threads (badly designed if statements and loops)
loops force syncing too: every thread in the warp effectively goes through the loop n times, where n is the largest trip count (see the sketch below)
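A small illustration (mine, not from the lesson) of why branch shape matters: threads in a warp that take different paths get serialized, while a branch that is uniform across each 32-thread warp does not diverge:
// bad: even and odd threads of the same warp take different paths,
// so the hardware runs both paths back to back
__global__ void diverges(float *out) {
    int i = threadIdx.x;
    if (i % 2 == 0) out[i] = 1.0f;
    else            out[i] = 2.0f;
}
// better: the branch is uniform within each 32-thread warp,
// so every warp executes only one of the two paths
__global__ void no_divergence(float *out) {
    int i = threadIdx.x;
    if (i / 32 == 0) out[i] = 1.0f;
    else             out[i] = 2.0f;
}
// loops behave the same way: if the trip count depends on threadIdx.x,
// the warp keeps looping until its slowest thread finishes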