
***************************

Understanding the device better


***************************

In the previous section we learned about supercomputing, parallel computing, and their relationship to CUDA. In the later part of that section we covered the basic steps of a CUDA program and how to launch a kernel. We then implemented a simple kernel to sum two arrays and, as you may remember, launched our sum kernel with different launch configurations. By observing the change in performance, or execution time, for each configuration, we found the best configuration for that particular kernel by trial and error. But if we had a better understanding of what our device, the GPU, looks like and which parts of the device affect the execution of certain types of kernels, we could come to a logical conclusion beforehand about how kernel performance varies with different launch configurations. In this section we are going to learn just that. We are going to unveil what hides behind the fancy cover of NVIDIA GPUs, and we are going to learn how to use the details of a specific GPU to get maximum performance out of our kernels.

Here you can see NVIDIA GeForce GTX 970 and Titan V GPUs. Your GPU unit probably looks more or less like this. But as CUDA developers we are not interested in this fancy look. If we remove the cover, you can see the hardware inside the GPU unit. We are not interested in that view either. We are more interested in the architectural design of these GPUs, which looks more or less like this, depending on which microarchitecture you are looking at. This diagram shows the layout of the microarchitecture called Maxwell, which the GTX 900 family of products belongs to. The GPU architecture is built around a scalable array of streaming multiprocessors, or SMs, and GPU hardware parallelism is achieved by replicating this architectural building block. In this diagram, a streaming multiprocessor is boxed with a red line. We can count 16 such streaming multiprocessors in this GPU. These streaming multiprocessors have access to the device's built-in global memory and L2 cache, and there are also some memory controllers and caches in this architectural diagram.

Let's now take a closer look at a streaming multiprocessor. The key components of a streaming multiprocessor are listed in this diagram. Notice that this logical diagram is for an SM in the Fermi microarchitecture. Modern SM units are somewhat different, but we can clearly identify the basic components from this diagram. We have CUDA cores, shared memory and L1 cache, registers, load and store units, warp schedulers, and special function units. A CUDA core is like a CPU core: it executes the instructions fetched into it. Shared memory and L1 cache are there to facilitate faster memory access. Registers are a limited amount of on-chip memory that stores on-the-fly values during execution. Load-store units perform memory load and store requests. The warp scheduler determines which warp, that is, which set of threads, is going to execute next.
Notice that these components are on-chip components, so we have to be especially careful about how we use resources like registers, shared memory, and L1 cache, as they are very limited resources on an SM. The warp scheduler takes resource consumption into account when it decides which warp is going to execute next. If you look at a more modern SM, you may realize it is a bit more complicated than this. This diagram shows the SM of the Volta microarchitecture. All these green squares are cores, but this SM has different types of cores: tensor cores, integer execution cores, FP32 cores optimized to execute 32-bit floating point operations faster, and FP64 cores built to execute 64-bit floating point operations faster. We will look at each of these in upcoming videos, but for now we can focus on the simple set of components we saw in the previous slide.

Computer architectures can be classified into four categories according to Flynn's taxonomy. SISD stands for single instruction, single data, and it represents a sequential computer which exploits no parallelism in either the instruction or the data stream. A single control unit fetches a single instruction stream from memory and generates the control signals that direct a single processing element to operate on a single data stream, one operation at a time. SIMD stands for single instruction, multiple data. In this model a single instruction operates on multiple data elements at once: multiple processing elements carry out the same operation on different pieces of data, as in vector and array processors. MISD stands for multiple instruction, single data. Here multiple instruction streams operate on one data stream. This is an uncommon architecture which is generally used for fault tolerance: heterogeneous systems operate on the same data stream and must agree on the result. Examples of this model include fault-tolerant systems like the space shuttle flight control computers. MIMD stands for multiple instruction, multiple data, and it represents multiple autonomous processors simultaneously executing different instructions on different data. MIMD architectures include multicore and superscalar processors and distributed systems using either a shared memory space or a distributed memory space.

CUDA follows a different flavor of these computer architectures called SIMT. SIMT is an extension of the SIMD model with multithreading, which means that a single instruction runs on multiple threads. In CUDA, a thread block executes on a single streaming multiprocessor. Multiple thread blocks can execute simultaneously on the same streaming multiprocessor, depending on the resource limitations of that SM, but one thread block cannot execute across multiple SMs. If the device cannot run a single thread block on one SM, for example if the shared memory needed by a particular block is greater than the shared memory available on an SM, then an error is returned for that kernel launch. When a block is scheduled to execute on an SM, it executes in such a way that a single instruction is run by multiple threads. We will discuss this more in upcoming videos.

In the previous section we arranged a set of threads into a set of blocks on a grid. At least, that is what we were able to observe from the software point of view. At the hardware level we can map this execution as shown in this diagram. The grid for a kernel launch maps to the device. The thread blocks in the grid are divided across the SMs in the device. A single thread in a thread block is executed by one single core, and since we have multiple cores in a single SM, the same instruction can execute for multiple threads in the SM, adhering to the SIMT convention. When programming in CUDA you have to be vigilant about what kind of device your application is going to run on, as different CUDA microarchitectures have different and enhanced capabilities. Throughout this course you will learn about these capabilities and how to utilize them to get maximum performance.

***************************
All About Warps
***************************

When launching a kernel it seems that all the threads in the kernel run in parallel. From a logical point of view this is true, but from the hardware point of view not all threads physically execute in parallel at the same time. For example, let's say we have a grid with one million threads, with each thread block having 512 threads. From the software point of view we assume that all one million threads in the grid execute in parallel. But suppose we run this kernel on a device with 13 SMs, each having 128 cores; altogether we have only 13 * 128 = 1664 cores, so this device can only execute 1664 threads in parallel. There is no way to run one million threads in parallel on this device. Also, one block runs on a single SM, and here we have only 128 cores per SM. Since we have 512 threads per thread block, there is no way to execute all the threads of even one thread block in parallel either. So what happens is that, in the device, thread blocks are divided into smaller units called warps, each having 32 consecutive threads, and all the warps of a given thread block execute on a single SM. For example, in the scenario we considered before we had 512 threads in a block; these threads are divided into 16 warps, and since we have 128 cores per SM, only 4 warps can execute at a given time. There is no particular order in which warps execute next; that depends on factors like the readiness of memory for the threads in those warps. So we can define a warp as the basic unit of execution in an SM. When we launch a grid, the thread blocks are distributed among the SMs. Once a thread block is scheduled to an SM, the threads in that thread block are further partitioned into warps. A warp consists of 32 consecutive threads, and all threads in a warp are executed in SIMT fashion: all threads execute the same instruction and each thread carries out that operation on its own private data. A thread block can be configured as one, two, or three dimensional; however, from the hardware point of view all threads are arranged in a single dimension, and each thread has a unique id.
For example, if we have a 1D block with 128 threads in the X dimension, those threads will be divided into four warps: warp 0 will have the threads with threadIdx.x values 0 to 31, warp 1 the threads with threadIdx.x values 32 to 63, warp 2 the threads with threadIdx.x values 64 to 95, and warp 3 the threads with threadIdx.x values 96 to 127. If you have a 2D thread block with 64 threads in the X dimension and 2 threads in the Y dimension, from the hardware point of view the threads will still be divided into 4 warps in a single dimension. This slide shows a 2D thread block with 40 threads in the X dimension and 2 threads in the Y dimension. From the application perspective there are 80 threads laid out in two dimensions. However, from the hardware point of view we have 80 threads laid out in one dimension. So one might suggest that the hardware allocates 3 warps for this thread block, resulting in a total of 96 hardware threads to support 80 software threads, in which case only 16 threads would be active in the final warp. But this is not true. Warps exist within a thread block, and if we partitioned the threads in that way, the second warp would end up with threads belonging to two thread blocks, which would be an inaccurate arrangement. What really happens here is that, to handle the 80 software threads in this case, the CUDA runtime allocates 4 warps, or 128 hardware threads, as shown in this diagram. In this case only 8 threads in the second and fourth warps are in the active state; all the other threads of those warps are in the inactive state while the device executes this thread block. The warp size is 32, and even to run a thread block with a single thread, the CUDA runtime will still assign a whole warp, which means 32 threads. In that case only one thread will be in the active state and the other 31 threads will be inactive. The problem is that even though inactive threads do not execute instructions, resource allocation, such as shared memory for the block, is still done based on the number of warps. So having inactive threads in a warp is a great waste of SM resources. Therefore we tend to use a multiple of 32 as the block size in the X dimension, so that we can guarantee that all the threads in a warp are in the active state.

OK, let's do a simple exercise to demonstrate the facts mentioned above. In this example we are going to launch a kernel with a two dimensional grid, and each thread is going to print out its threadIdx, blockIdx, and warp index values. There is no built-in variable that indicates the warp index, but we can easily calculate this value by dividing the threadIdx value by the warp size, which is 32. This value represents the index of a warp within a block, so for a single thread block we will have a different warp index for each 32 consecutive threads, but different thread blocks will contain warps with the same warp index. Here our grid has 2 blocks in the X dimension and 2 blocks in the Y dimension, and each block has 42 threads in the X dimension, so altogether we are going to launch this kernel with 168 threads. Now, 42 is far from an ideal block size, but I am using this value for demonstration purposes. Let's look at how the warp allocation may look. Again, remember that a warp cannot have threads from two thread blocks. So to handle the 168 software threads in this case, we are going to have 256 hardware threads, or 8 warps, and here we have 4 warps with only 10 active threads each. The other threads in these warps will be in the inactive state.
Ok, let's print out the above mentioned values for this grid. Let's name our kernel print_details_of_warp. First we need to calculate the global thread index. Remember that here we have a two dimensional grid, so we have to consider the Y dimension in our calculation as well. Let me calculate this first. Then we need the warp index, which we can calculate by dividing the threadIdx.x value by 32. Since we have a 2D grid here, let me also calculate a unique block index to identify each block separately in the output. We can do this by multiplying the block index in the Y dimension by the number of blocks in the X dimension and then adding the block index in the X dimension. Then let me print out the threadIdx.x, blockIdx.x, blockIdx.y, gid, and warp id values. Now, in our main function we are going to launch our kernel, so let me set the block and grid variables here. Our block size will be 42 in the X dimension, and our grid is a two dimensional one with two thread blocks in each dimension. With these parameters we can launch our kernel. Ok, let me run this example now. If you carefully investigate the output, you will see that the second warp of each thread block has only 10 outputs, as we discussed in the presentation, but the first warp of each thread block has 32 outputs. For example, the second warp of the block with global block id 3 has only 10 outputs, as highlighted before. So run this example on your machine and verify with its output the facts mentioned in the presentation.
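A minimal sketch of what this kernel and launch could look like is given below. The identifiers (print_details_of_warp, warp_id, gbid) follow the narration above but are assumed names; the exact code used in the course may differ slightly.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Prints per-thread details: threadIdx.x, block indices, global thread id and warp index.
    __global__ void print_details_of_warp()
    {
        // Warp index within the block (threads 0-31 -> warp 0, 32-63 -> warp 1, ...).
        int warp_id = threadIdx.x / 32;

        // Unique block index for a 2D grid.
        int gbid = blockIdx.y * gridDim.x + blockIdx.x;

        // Global thread index, taking the Y dimension of the grid into account.
        int gid = gbid * blockDim.x + threadIdx.x;

        printf("tid: %d, bid.x: %d, bid.y: %d, gid: %d, warp_id: %d, gbid: %d\n",
               threadIdx.x, blockIdx.x, blockIdx.y, gid, warp_id, gbid);
    }

    int main()
    {
        dim3 block(42);     // 42 threads in X: deliberately not a multiple of 32
        dim3 grid(2, 2);    // 2 x 2 thread blocks

        print_details_of_warp<<<grid, block>>>();
        cudaDeviceSynchronize();
        return 0;
    }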

***************************
Warp Divergence
***************************

Let's discuss an important concept associated with warp execution called warp divergence
As we discussed, threads in a warp execute the same instruction
But what if we write our program in a way that forces some threads in the warp to execute different instructions
For example, let's say we have a grid with blocks of 32 threads
So there will be only one warp to execute
And this grid is going to execute the following condition check in the kernel
Due to this condition check, threads with threadIdx.x less than 16 will execute the statements in the if block and the remaining threads will execute the else block
But we know warps execute in SIMT fashion
So every thread in this warp has to execute the same instruction
What happens is that, if the threads in a warp diverge, the warp serially executes each branch path while disabling the threads that do not take that path
So the warp will execute the if block while disabling the second half of the threads in the warp, and then it will execute the instructions in the else block while disabling the first half of the warp
So if we have multiple condition checks which cause a warp to diverge, there will be a significant performance penalty
Let's examine another example now
Let's say we are going to execute a kernel with the code shown in this slide
Here we have a condition check which forces threads with an odd thread id to follow a different path than threads with an even thread id
So all the threads in the warp with an even thread id will execute the if statement and all the threads with an odd thread id will execute the else statement
To adhere to the SIMT execution convention, each thread has to sequentially go through the statements in both the if and else blocks while the threads that do not take a given path are logically disabled
Such an execution for just 4 threads of that particular warp is shown here
The blue colored boxes represent coherent code which is executed by all the threads in the warp
Only threads with an even thread id, in this case the threads with thread ids 0 and 2, execute the green colored instructions, the if clause
During that time the threads with thread ids 1 and 3 have to stall their execution
And when the threads with thread ids 1 and 3 execute the else block, the threads with thread ids 0 and 2 have to stall their execution
In this example, the amount of parallelism in the warp execution was cut in half
Only 16 threads were actively executing at a time while the other 16 were disabled
With more conditional branches the loss of parallelism would be even greater
So be aware of the presence of control flow statements like if-else statements and switch statements, which give us a hint of divergent code
But remember, the presence of a control flow statement does not always result in a divergent warp
A warp diverges only when there are multiple paths of execution within the same warp
So condition checks which result in all the threads in a warp taking the same path will not induce any warp divergence
Consider this example
Here "tid" represents the threadIdx.x value for a single dimensional block
If we take 64 as our block size, the CUDA runtime will allocate 2 warps to execute this thread block
The first 32 threads in our thread block will have tid values from 0 to 31
So tid / 32 will be 0, and the first 32 threads, the first warp of our thread block, will execute the if block while none of them execute the else block
For the second 32 threads the tid values will be 32 to 63, so for these threads tid / 32 will be 1 and all the threads in the second warp will execute the else block
So here the first warp executes the if block and the second warp executes the else block
Even though we have a condition check here, there is no warp divergence
But if we change the condition check as shown here, then the first half of the first warp will execute the if statement and the second half of the first warp will execute the else statement
And the same is true for the second warp
So if we use the condition in this way, there is warp divergence and performance will be degraded
So be vigilant when you are using condition checks in a kernel
And always try to find a condition check which does not induce any warp divergence
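For illustration, the two variants discussed above could look like the following sketch; the exact divergent condition used on the slide is not shown in the transcript, so the second check is just one condition that behaves as described.

    // tid is threadIdx.x for a one dimensional block of 64 threads (2 warps)

    // No divergence: every thread of a warp evaluates the condition the same way,
    // so warp 0 takes the if path and warp 1 takes the else path.
    if (tid / 32 == 0) { /* ... */ } else { /* ... */ }

    // Divergence: within each warp the first 16 threads take the if path and the
    // last 16 take the else path, so both paths are executed serially per warp.
    if (tid % 32 < 16) { /* ... */ } else { /* ... */ }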
You can measure whether your kernel has divergent code or not by using performance metrics with the nvprof profiling tool
Performance metrics are measurements of different properties of your kernel
And there is a metric called branch efficiency which indicates the presence of divergent code in a kernel
Let's first look at what this branch efficiency metric is
A branch is a path of execution for the threads in a warp
For example, if a code has a condition which forces threads with odd thread ids to execute the if block and threads with even thread ids to execute the else block, then there are two branches here: the if branch and the else branch
If we have code which forces the threads within the same warp to execute along 3 different paths, then we have 3 branches
And keep in mind, the threads within a single warp have to follow different paths
Otherwise there is no branch divergence, as we already saw
Branch efficiency is defined as the ratio of non-divergent branches to total branches, and it can be calculated using the following equation:

Branch Efficiency = 100% * (#Branches - #Divergent Branches) / #Branches

For example, if we consider this condition check, half of the warp will execute the if block and the other half will execute the else block, so we have two branches; taking one of them as the main execution path, we have one divergent branch, so the branch efficiency will be (2 - 1) / 2 * 100%, which is 50 percent
Branch efficiency can be measured for a kernel using the nvprof profiling tool
Let's see how this is done
Here I am going to launch two kernels
Both kernels perform similar functionality, but one of them does it using divergent code
So let's add our kernels here
Our first kernel is not going to have any divergent code, so let's name it code_without_divergence
Here I am going to need the global thread id value, so let me calculate it first
Then all I'm going to do is set values for a couple of float variables based on whether the warp id of the thread is even or odd
So let me define two new float variables
Now we need the warp id to perform our task
So let's quickly add a variable called global warp id and calculate its value by dividing the global thread id by the warp size
Now, this warp id calculation is a very artificial one; there is no real use for a global warp id value in other cases, but I am using it here for simplicity
Ok, now if the warp id is an even value I am going to assign 100 to the A variable and 50 to the B variable
Otherwise let's assign 200 to the A variable and 75 to the B variable
That's it for this kernel
Our second kernel is almost identical, so let me copy and paste it here and change its name to divergent_code
Here I'm going to assign values to the A and B variables based on whether the thread id is even or not
Remember, in the previous kernel we assigned values based on the warp id, not the thread id
But in this case we assign values based on whether the thread id is even or not
So we don't need the warp id here
So let me get rid of that code and change the condition check in the if statement accordingly
As we discussed in the presentation, this condition check will force consecutive threads to take different paths of execution
In our main function I have already defined the block and grid variables, so we can launch our kernels here
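A minimal sketch of the two kernels described above might look like the following; the identifiers (code_without_divergence, divergent_code, gid, warp_id) are assumed names based on the narration, not necessarily the exact code from the course.

    __global__ void code_without_divergence()
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        float a = 0.0f;
        float b = 0.0f;

        // Condition depends on the warp id, so every thread in a warp takes the same path.
        int warp_id = gid / 32;
        if (warp_id % 2 == 0)
        {
            a = 100.0f;
            b = 50.0f;
        }
        else
        {
            a = 200.0f;
            b = 75.0f;
        }
    }

    __global__ void divergent_code()
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        float a = 0.0f;
        float b = 0.0f;

        // Condition depends on the thread id, so consecutive threads in the
        // same warp take different paths and the warp diverges.
        if (gid % 2 == 0)
        {
            a = 100.0f;
            b = 50.0f;
        }
        else
        {
            a = 200.0f;
            b = 75.0f;
        }
    }

Note that a and b are intentionally left unused, which is why the compiler warnings mentioned below appear.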

Now I'm going to compile this code with the nvcc compiler


In the command prompt you can specify this command; the -o option specifies the output file where the executable code will be stored, followed by the input source file
So let's compile this
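For example, assuming the source file is called divergence_test.cu (the file name here is just an assumption), the command would be along the lines of:

    nvcc -o divergence_test divergence_test.cu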
In the compiler output there are some warnings about unused variables; this is because we have not performed any effective work in our kernels
We have only assigned some values
So for now you can ignore those warnings
Then I am going to profile our program with the nvprof profiling tool using this command
Here we specify the performance metric using the --metrics option, and since we want to observe the branch efficiency metric I have specified that as well
Notice that for nvprof you have to give the name of the executable program which we created with the compiler before
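With the executable name assumed above, the command would look something like:

    nvprof --metrics branch_efficiency ./divergence_test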
Ok, let's run the profiler
In the output you can see that there is no difference in the branch efficiency metric between the two kernels
This is because, when we compile with nvcc, if the compiler finds inefficient code like the divergent code we have here, it optimizes that code segment with predication to avoid branch divergence
And that is why we cannot see any difference between these kernels
Now let me recompile our program with optimizations disabled
If you specify the -G option, the compiler will disable device code optimization
The -G option generates debug information for device code and also disables all optimizations for the device code
So let me compile our program again with this option
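Again assuming the file names from before, the command would be something like:

    nvcc -G -o divergence_test divergence_test.cu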
Now let's run the nvprof tool again with the same command
In the output you can see that our kernel without divergence has 100% branch efficiency, because all the threads in a warp execute along the same path
But the kernel with divergent code has only 83% branch efficiency
Now, when you calculate the branch efficiency manually you will most probably not end up with a value like this
There are a couple of reasons for that
The first is that in our divergent kernel the if-else statement occupies only part of the code; there are a few other statements which are run by both the threads with even thread ids and those with odd thread ids, and when nvprof calculates branch efficiency it takes those portions of code into account as well
The second reason is that, even though we turned off compiler optimizations, the CUDA compiler still performs some basic optimizations on our code
Due to these reasons we ended up with 83% branch efficiency rather than 50%

***************************
Resource Partitioning and Latency Hiding 1
***************************

In the previous videos I told you that blocks are scheduled on an SM depending on the resource limitations of that SM
In this video we are going to look at those resource limitations and how exactly this scheduling works
The local execution context of a warp mainly consists of the following resources: program counters, registers, and shared memory
The execution context of each warp processed by an SM is maintained on-chip during the entire lifetime of the warp
Therefore switching from one execution context to another has no cost
This is expected, as GPUs are designed to execute thousands of threads, and if context switching took more time the overall execution would be delayed
However, this zero-cost context switching is achieved by making GPU threads very lightweight and limiting their capabilities
Out of the above mentioned resources, the number of registers and the amount of shared memory can be directly controlled by the programmer
Each SM has a set of 32-bit registers stored in a register file that are partitioned among threads, and a fixed amount of shared memory that is partitioned among thread blocks
The number of thread blocks and warps that can simultaneously reside on an SM for a given kernel depends on the number of registers and the amount of shared memory available on the SM and required by the kernel
For example, as shown in this diagram, if each thread of a given kernel consumes more registers, fewer warps can reside on an SM
If you can reduce the number of registers a kernel consumes, more warps will be processed simultaneously
Likewise, if a thread block consumes more shared memory, fewer thread blocks are processed simultaneously by an SM
If you can reduce the amount of shared memory used by each block, then more thread blocks can be processed simultaneously
What this means is that resource availability generally limits the number of resident thread blocks per SM
The maximum number of registers and the amount of shared memory per SM vary for devices of different compute capabilities
If there are insufficient registers or shared memory on each SM to process at least one thread block, the kernel launch will fail
This table lists some of the technical limitations, such as the maximum number of concurrent blocks that can reside on an SM, the maximum concurrent warps per SM, the number of 32-bit registers per SM, the maximum number of 32-bit registers that can be used by a single thread, and the shared memory per SM, for several GPU microarchitectures
Some of these values can be retrieved with the cudaGetDeviceProperties function, and some of them are listed in the documentation of the particular microarchitecture
For example, on a device with compute capability 3 or higher we have 65536 32-bit registers per SM
These registers are shared across all the thread blocks running on that particular SM
If you look further you can see that one thread can use at most 255 registers
If we use more than that in our program, the excess registers spill over to a memory region called local memory, and that reduces the performance of the program greatly
So depending on the device compute capability, the amount of resources we can consume is going to change
Also, we can optimize the performance of a kernel depending on the device it is going to execute on
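As a quick illustration, here is a minimal host-side sketch (not taken from the course) that queries some of these per-SM limits with cudaGetDeviceProperties:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0

        printf("Compute capability      : %d.%d\n", prop.major, prop.minor);
        printf("SM count                : %d\n", prop.multiProcessorCount);
        printf("32-bit registers per SM : %d\n", prop.regsPerMultiprocessor);
        printf("Shared memory per SM    : %zu bytes\n", prop.sharedMemPerMultiprocessor);
        printf("Max threads per SM      : %d\n", prop.maxThreadsPerMultiProcessor);
        return 0;
    }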
Let's now look at the categorization of blocks and warps
A thread block is called an active block when compute resources such as registers and shared memory have been allocated to it
The warps in such an active block are called active warps
Active warps can be further classified into the following three types: selected warps, stalled warps, and eligible warps
The warp scheduler on an SM selects active warps on each execution cycle and dispatches them to the execution units
A warp that is actively executing is called a selected warp
If an active warp is ready for execution but not currently executing, it is an eligible warp
If a warp is not ready for execution, it is a stalled warp
A warp is eligible for execution if both of the following conditions are met: 32 CUDA cores are available for the execution, and all arguments to the current instruction are ready
The number of active warps at any time from launch to completion, on microarchitectures with compute capability 3 or above, must be less than or equal to the architectural limit of 64 concurrent warps
If a warp stalls, the warp scheduler picks an eligible warp to execute in its place
Because compute resources are partitioned among warps and kept on-chip during the entire lifetime of a warp, switching warp contexts does not cost any execution cycles.

***************************
Resource Partitioning and Latency Hiding 2
***************************

As we discussed in the previous video, the SM is a resource-constrained environment
Excessive usage of resources like registers and shared memory by a single thread causes the SM to have only a small number of active warps residing on it at a given time, resulting in underutilization of compute resources
In this video we are going to look at another concept called latency hiding, which highlights the need for enough active warps to utilize the computational power optimally
So what is latency, or instruction latency
Latency is the number of clock cycles between an instruction being issued and being completed
When the control unit in the core issues an arithmetic instruction to the ALU, or arithmetic and logic unit, the ALU might take some clock cycles to complete that instruction, and in the case of memory operations this is the number of clock cycles between the memory instruction being issued and the data arriving at its destination
The table shown here is part of an analysis of GPGPU pipeline latency done by Michael Anderson and John Lucas for an embedded system microarchitecture symposium in Berlin
It highlights the latencies of integer, logic, and floating point operations for 4 NVIDIA microarchitectures
Here you can see that for Maxwell GM107 chips, add and sub operations have an instruction latency of 6 clock cycles
Max, min, and mad (multiply-add) and multiplication operations have an instruction latency of 12 to 13 clock cycles, whereas operations like division and remainder have an instruction latency of more than 200 clock cycles
And logical operators like AND, OR, and XOR have an instruction latency of 6 clock cycles
This table shows the instruction latency of memory load and store operations for different types of memory, like global memory, local memory, shared memory, texture memory, and constant memory
We haven't discussed these memory types yet, but for now focus on the latency values only
As you can see here, for Maxwell GM107 chips the L1 cache has a memory latency of 194 clock cycles
DRAM access has a latency of around 350 clock cycles
But shared memory access has only 28 clock cycles of latency, which highlights the importance of shared memory
So when a warp is executing, if it performs an arithmetic operation, that warp may have to wait for 15 or so clock cycles, and this is a performance penalty
And in the case of memory operations this waiting time can rise to hundreds of clock cycles, which would degrade the performance of our application greatly
But fortunately, with CUDA's latency hiding mechanism, if we have a large enough number of eligible warps residing on the SM, we can hide these instruction latencies
Ok, let's look at how this latency hiding mechanism works
As I already mentioned, the execution context of each warp processed by an SM is maintained on-chip during the entire lifetime of the warp
Therefore switching from one execution context to another has no cost
So to hide the above mentioned latency we want a large number of eligible warps, so that when one warp stalls the warp scheduler can dispatch another eligible warp to the cores
For example, consider the diagram below
This diagram demonstrates the execution of a warp on an SM and the current execution cycle we are at
Let's say this warp is going to execute an arithmetic operation which takes 20 clock cycles to complete
As you can see, our warp executes the arithmetic operation in the first execution cycle and then it has to wait for 20 clock cycles
So in the next 20 clock cycles there will not be any execution on these cores, and in the 21st clock cycle, after the results arrive, execution continues
But if we have more eligible warps residing on the SM, then we can switch the execution context to a new warp, so that the stalling cores can continue execution with that new warp
If that new warp also belongs to the same kernel as the first warp, the new warp will likewise wait for 20 clock cycles after issuing its instruction
So if we have 20 such warps, we can switch context among all of them and completely occupy these compute cores while the results become ready for the first warp
And when the results for the first warp are ready, we can switch the context back to it
So there will be no wasted cycles, as we filled the execution cycles during which the first warp had to stall with other eligible warps
So the question is, for a given kernel, how many warps do we need to hide the latencies of the instructions
Let me answer that question with a small calculation
If our SM has 128 cores, that means we can run 4 warps concurrently on one SM
As we saw earlier, we need 20 eligible warps to hide the instruction latency of one warp
So to hide the latency of 4 parallel warps we need 4 * 20 = 80 eligible warps per SM ready in the kernel to hide the latency of the discussed arithmetic operation
If we have 13 such SMs in our device, then overall we need 80 * 13 = 1040 warps in the eligible state to hide the latency across the device
Now how about memory instruction latency
To calculate how many warps are needed to hide memory instruction latency, let's consider a device with the Maxwell architecture, which has 350 cycles of DRAM latency
Our approach is to find how many bytes of data can be transferred within a 350 clock cycle period, and then we can calculate the number of warps a particular kernel needs to hide this memory latency
Let's consider the GTX 970, which belongs to the Maxwell microarchitecture, and calculate specific values
This device has 196 GB/s of DRAM bandwidth
First we have to convert this bandwidth figure into a bytes-per-cycle value, since we are measuring latency in clock cycles
For that you have to know the memory clock speed of the given device
You can find that value by running the NVIDIA SMI tool with the following command
So open up the command window and go to the folder where the NVIDIA SMI tool is located
On my PC it is in the C:\Program Files\NVIDIA Corporation\NVSMI folder
Then execute the command
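A command that reports this clock information is, for example (this is an illustration; the exact command shown in the video may differ):

    nvidia-smi -q -d CLOCK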
You will end up with output like this
In this output, under the max clocks section, you can see the maximum memory clock
The max clock speed represents the frequency of the memory clock, that is, how many clock cycles tick per second
On my device I have a 3.6 GHz max clock speed for memory transfers, which means 3.6 billion clock cycles per second
To get the maximum number of bytes which can be transferred per cycle, we can divide the bandwidth by this max clock speed
For my device that will be 196 / 3.6, which can be rounded up to 54 bytes per cycle
As shown in this diagram, this value represents how many bytes of data can be transferred between all the SMs in my device and DRAM through the GDDR5 memory bus
Now we have a memory latency of 350 clock cycles
So within these 350 clock cycles we can transfer 350 * 54 = 18900 bytes between DRAM and the SMs
If the kernel uses 4 bytes of memory per thread, for example one integer loaded from memory, then we need 18900 / 4 = 4725 threads to hide the memory latency
That means about 148 eligible warps should reside on the device to hide the memory transfer latency
If I have 13 SMs, then we can calculate the needed warps per SM as 148 / 13, which is about 12 warps per SM
So each SM should have at least 12 eligible warps to hide the memory latency
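The same back-of-the-envelope calculation can be written down as a tiny helper program; the numbers below are the example values from this video (350-cycle DRAM latency, 54 bytes per cycle, 4 bytes loaded per thread, 13 SMs) and are only illustrative.

    #include <cstdio>

    int main()
    {
        // Example values used in this video; substitute the figures for your own device.
        const double memory_latency_cycles = 350.0;   // DRAM latency in clock cycles
        const double bytes_per_cycle       = 54.0;    // bandwidth / memory clock
        const double bytes_per_thread      = 4.0;     // e.g. one int loaded per thread
        const int    sm_count              = 13;

        double bytes_in_flight = memory_latency_cycles * bytes_per_cycle;   // 18900
        double threads_needed  = bytes_in_flight / bytes_per_thread;        // 4725
        double warps_needed    = threads_needed / 32.0;                     // ~148
        double warps_per_sm    = warps_needed / sm_count;                   // ~11.4, rounded to 12 in the video

        printf("threads needed : %.0f\n", threads_needed);
        printf("warps needed   : %.0f\n", warps_needed);
        printf("warps per SM   : %.1f\n", warps_per_sm);
        return 0;
    }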
In this way you can calculate how many warps you need in order to hide the arithmetic and memory latencies, and the higher of those values indicates which latency type affects the application most
Depending on which instruction latency affects the application most, we can categorize any CUDA application into two categories
If memory latencies are the ones that contribute most, then we refer to such an application as bandwidth bound or memory bound
If the latency of arithmetic operations affects the execution of the application most, then we say the application is computation bound
Once we categorize our application into one of those categories, we can focus on optimizing it by following the steps specific to that category of applications.

***************************
Occupancy
***************************

In this video we are going to discuss one of the key concepts in CUDA programming called occupancy
Occupancy is the ratio of active warps to the maximum number of warps allowed per SM, and it can be calculated using the following equation:

Occupancy = Active Warps per SM / Maximum Warps per SM

On a given CUDA core, instructions are executed sequentially
When one warp stalls because of the latency of an arithmetic or memory operation, the SM switches context to another eligible warp
So ideally, if one warp stalls, we want to have enough warps to keep the cores occupied
If the occupancy of a particular kernel is a high value, that means even when one warp stalls there will be other eligible warps, so the cores will always be busy
In this equation, the maximum number of warps per SM can be found either in the documentation of the device microarchitecture or by querying the device for the max threads per SM property and dividing it by the warp size, which is 32
On the other hand, the active warp count per SM depends on the resources available on the device and the resource consumption of your kernel
So let me show you an example calculation of this value
Let's assume a kernel uses 48 registers per thread and 4096 bytes of shared memory per block
There are two factors which affect the number of active warps on an SM: register usage and shared memory usage
So here is what we are going to do: I am going to calculate the allowed warp count for each of these resources separately, and the minimum of those two values will be the one that limits the active warps per SM
Let's first find the allowed active warps considering the register usage of our threads
Here, one thread uses 48 registers
So for a given warp, the register usage will be 48 multiplied by the 32 threads per warp, which equals 1536
If I take the GTX 970 as the device running this kernel, then we have a total of 65536 registers per SM
So we can divide the total register count per SM by the register consumption of a single warp to find the allowed active warps: 65536 / 1536 = 42.6
But this is not our final value
There is a concept called warp allocation granularity
On an SM, usually depending on the available number of warp schedulers, warp allocation happens in groups
For example, if our device has a warp allocation granularity of 4, this means warps are allocated in groups of 4
We might only need 1 warp to execute a particular kernel, but the SM will still allocate 4 warps because its granularity is 4
Now, I have a GTX 970 device here, and it has a warp allocation granularity of 4
So the SM can only allocate warp counts which are multiples of 4
Therefore, in this case the allowed warp count based on register usage will be 40
Now let's calculate the allowed warp count based on shared memory usage
The GTX 970 has 98304 bytes of shared memory per SM
This memory is shared by all the thread blocks on the SM
In our kernel we use 4096 bytes of shared memory per block
So based on these values the allowed thread block count will be 98304 / 4096 = 24
If we take 128 as our thread block size, which means 4 warps per block, then based on shared memory we could have 96 warps per SM
But remember, for a device with compute capability 5.2 we can have at most 64 warps per SM
So this means our active warp count per SM is not limited by shared memory usage
So here the register usage limits the allowed active warps, and that value is 40
So our occupancy will be 40 / 64, which is about 63% occupancy
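As a side note (not covered in this video), the CUDA runtime also exposes an occupancy API that performs a similar calculation for a compiled kernel; a minimal sketch, assuming a kernel named occupancy_test_kernel, might look like this:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void occupancy_test_kernel(int *out)
    {
        out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x;
    }

    int main()
    {
        int block_size = 128;
        int max_active_blocks = 0;

        // Ask the runtime how many blocks of this kernel can be resident per SM.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_active_blocks,
                                                      occupancy_test_kernel,
                                                      block_size,
                                                      0 /* dynamic shared memory */);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        double active_warps = max_active_blocks * (block_size / 32.0);
        double max_warps    = prop.maxThreadsPerMultiProcessor / 32.0;

        printf("Occupancy: %.0f%%\n", 100.0 * active_warps / max_warps);
        return 0;
    }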
Knowing how to calculate the occupancy value is very important for getting a good grip on the occupancy concept
But the CUDA toolkit provides a tool called the CUDA occupancy calculator which does this calculation quite easily
So let me show you how to use the CUDA occupancy calculator now
This tool is actually an Excel sheet
The sheet has 4 tabs: the calculator tab, which we are going to use in a minute; a help tab, which explains how to use the occupancy calculator; a GPU data tab, which contains device query information for devices with different compute capabilities; and of course a copyright tab
In the calculator tab there are different sections highlighted in different colors, and there are some graphs which show the correlation between occupancy and varying block size, shared memory usage per block, and registers per thread for the given configuration
In the green section we set the details of our device
So first select the compute capability of your device
When you select the compute capability, the gray area below will be populated with device information
There you can see the physical limits of a device with the selected compute capability
The second piece of information you have to enter is the shared memory size, which you can select from the drop-down box or fill in manually by looking at the device limitation section
As the third parameter you can provide the cache mode for execution
For now you can leave it as it is and move on to the next section
We will learn about this parameter when we discuss caching in CUDA
Then, in the orange section, we have to specify the execution configuration and memory usage of our kernel
You can fill in the threads per block based on your kernel launch parameters
To find out the registers per thread and shared memory per block, you can compile your program with nvcc using the ptxas options, which output information about register, local memory, shared memory, and constant memory usage for each kernel in the given CUDA file
For example, here I have a program called occupancy_test; in this occupancy test kernel I am using 8 integer variables X1 to X8, adding them together and storing the result in global memory
Now let me compile this program with the above options to the nvcc compiler
So go to the location where you have your CUDA file and run the command shown here
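For example, a command along these lines (the file name occupancy_test.cu is assumed) prints the per-kernel resource usage:

    nvcc --ptxas-options=-v -o occupancy_test occupancy_test.cu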
The output will look like this
In this output, on the last line, you can see the register usage of each thread for our kernel
If there were any shared memory usage, it would appear here as well
As you will see in upcoming videos, we can also allocate shared memory dynamically
If so, you have to explicitly add those values to the shared memory figure as well
Now, if I add these values to the occupancy calculator, you can see the GPU occupancy data in the blue colored section
Change these values a bit and observe how the outcome changes when we change the device configuration input or the resource usage of the kernel
Also, down below there is a yellow section which highlights what we obtained from our manual calculation
Here you can see that the maximum thread blocks per multiprocessor is limited by the registers per multiprocessor, and that warp value is 40, as we calculated before
You can also see the calculated occupancy value, which in this case is 63%
For more information on how to use the occupancy calculator you can refer to the help tab; it includes all the information I just presented to you
But keep in mind that higher occupancy does not necessarily mean higher performance
If the kernel is not bandwidth bound or computation bound, then an increase in occupancy will not necessarily increase performance
In fact, making changes to increase occupancy can have other effects, such as additional instructions, more register spills to local memory, which is an off-chip memory, and more divergent branches
As with any optimization, you should experiment to see how changes affect the wall clock time of the kernel execution
For bandwidth bound applications, an increase in occupancy will most probably help to hide the latency of memory accesses and therefore improve performance
Ok, with all this knowledge, let me present a few basic guidelines that will help your application scale better on the GPU
Keep the number of threads per block a multiple of the warp size
Avoid small block sizes, as too few threads per block causes the hardware limit on the number of warps per SM to be reached before all the resources are fully utilized, so start with at least 128 threads per block
Adjust the block size up and down according to the kernel resource requirements
Keep the number of blocks much greater than the number of SMs to expose sufficient parallelism to your device
Conduct experiments to discover the best execution configuration and resource usage
Ok, now you know how important occupancy is for the performance of our kernels
But keep in mind that occupancy is not the only factor affecting performance, and after a certain level of occupancy is reached, further increases may not lead to performance improvements
As you progress with this course you will learn about these other factors as well

***************************
Profile Driven Optimization with NVPROF
***************************

Profiling is a very important aspect of CUDA programming which gives us insight into our program's execution based on different grid configurations and different resource usages
In this video we are going to discuss the standard profiling tool that comes with the CUDA toolkit, called nvprof
In the upcoming CUDA tools section we will give you in-depth insight into how to use nvprof and a couple of other tools as well
The focus of this video is to give you a head start with nvprof, so that you can use the tool throughout the course
Different usages of nvprof will be discussed throughout this course, but in this video we will only cover what we need to understand profile driven optimization concepts
In profile driven optimization we change our program based on the profiling output
Concretely, in this video we are going to profile an array summation on the GPU
We will provide different implementations based on different grid configurations, and we will observe the performance variations of these configurations via the nvprof tool
Before explaining the tool, let me first show you the program we are going to profile
The program we are going to use here is the sum_array program which we implemented in the previous section, with some modifications
I have modified the program to take command line arguments
This is quite a common approach when profiling, as we need to change grid configurations and resource usages like shared memory dynamically
If we did not use this type of mechanism, then to change grid configurations or resource usages we would have to manually change the code and compile it again and again before each profiling run
In this program, the first argument specifies whether we want our grid to be a 1D or a 2D grid
As the second argument we provide the input size of the array
The arrays will be randomly initialized
In the case of a two dimensional grid, we can provide the number of threads in the X dimension as the third argument
The fourth argument is the block size in the X dimension, and the fifth argument is the block size in the Y dimension in the case of two dimensional blocks
These size values are given as powers of 2
For example, if we need 1 megabyte of data, which is 2 to the power 20 bytes, we can give 20 as the second argument
The implementation is pretty simple
The difference here is the addition of condition checks in the main program to decide which type of kernel it is going to execute
So I encourage you to go through the code for this video after watching it
I have also used result validation against the CPU, so we can validate our results
But I have not included any timing related code here, since we are using this code for profiling purposes
Note that when profiling, additional information about the execution is recorded by the profiler
So it is not recommended to time your kernel while profiling, since profiling adds a fairly large overhead to the instruction execution
In particular, nvprof should not be used to compare execution times with CPU implementations
But if you have a couple of kernels and you need to compare those kernels against each other, then you can use nvprof
With these things in mind, let's move on to the main focus of this video, which is profiling with nvprof
Nvprof has several modes of operation: summary mode, GPU and API trace mode, event/metric summary mode, and event/metric trace mode
Out of these, right now we will focus on summary mode and event/metric summary mode only
But throughout the span of this course, depending on our needs, we will use the other modes of nvprof as well
Our first step is to compile the CUDA program we are going to profile, so let me do that now
Since we are going to look at timings as well, keep in mind not to include any flags which add overhead to the executable code
For example, don't compile the code with the -g or -G flags, which disable compiler optimizations
After that you can use the syntax shown in the presentation to profile the application
For the nvprof tool, we first give options such as metrics, events and so on, then we specify the application and the application arguments
Let me run nvprof on our program without any options
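For example, assuming the executable is named sum_array, the plain summary-mode invocation would be:

    nvprof ./sum_array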
The outcome is in summary mode
Summary mode is the default operating mode for nvprof
In this mode nvprof outputs a single result line for each kernel function and each type of CUDA memory copy performed by the application
Let me explain this output
We have not specified any options to nvprof, and that is why we got the default output, the summary mode output
We also have not specified any arguments to our program, so here our application used the default size and our kernel launched with a 1D grid
Also, you can see that the results from the GPU and the CPU are the same, which verifies the validity of the execution we just did
In the summary mode you see the GPU activities section first
Kernel executions and memory copy functions are categorized under this section
We transferred two arrays from the host to the device, and you can see in the output that there are two CUDA memory copy calls in the host to device direction
Those operations took about 54 percent of the total execution time
Since there are 2 CUDA memory copy calls in the host to device direction, nvprof records the minimum, maximum, and average values for those calls
But we did only a single memory copy in the device to host direction, so its min, max, and average values are all the same
And we have our kernel execution, which took only about 2% of the total execution time
Here you can see the significance of the memory transfer overhead
Then you can see the API calls and their profiling details as well
Some of these functions happen underneath to accommodate what we did in the code
To get output in the event/metric summary mode we have to specify additional options to the nvprof tool
For example, let's say I need to look at some metrics for my application
Metrics are different measurements of your CUDA application, like warp divergence, SM efficiency, global memory load efficiency, and so on
This slide highlights a few of the important metrics we are going to use throughout this course
We will discuss events in an upcoming section
Ok, let's look at global memory load efficiency, SM efficiency, and achieved occupancy for different configurations of our application
First, let's run the application with the default configuration, which is a one dimensional grid with blocks of 128 threads
In the command prompt you have to specify the metrics you want to measure using the double dash metrics option, and the metrics should be comma separated
Ok, let me run this command now
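The command would look something like this (the executable name sum_array is assumed; gld_efficiency, sm_efficiency, and achieved_occupancy are the nvprof metric names for the three measurements above):

    nvprof --metrics gld_efficiency,sm_efficiency,achieved_occupancy ./sum_array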
As you can see from the output, it lists the metric names along with a description and the value of each metric
In this case we have 100% efficiency in global memory loads, the SM efficiency is almost 99%, and we have 0.89, or 89%, achieved occupancy for the default configuration
Now here is what we are going to do: we are going to increase our array size to 32 megabytes, which is 2 to the power 25 bytes, and then profile the sum array application using the following grid configurations while recording the profiling details
In the first configuration we arrange all the threads into a 1D grid with a block size of 128 threads in the X dimension
So the arguments to our application will be 0, 25, 0, 7
The first 0 specifies that we want a 1D grid; 32 megabytes means 2 to the power 25, so we have 25 as the second argument; since this is a one dimensional grid we do not worry about the size in the X dimension, as it is the same as the input size; and the block size in the X dimension, 128, equals 2 to the power 7, so 7 is the last argument
In the second configuration we are going to have a 2D grid with the dimensions shown in this diagram
Here we also have 32 megabytes of data, and we arrange our grid to have 2 to the power 20 threads in the X dimension and 32, or 2 to the power 5, threads in the Y dimension
And our block size will be 128 threads in the X dimension and 4 threads in the Y dimension
Keep in mind that our maximum block size is 1024, so you have to make sure your block size does not exceed the 1024 threads per block restriction
So in this case our arguments will be 1, 25, 20, 7 and 2
The next configuration is almost identical to the second one; the only difference is that I have changed the block size in the X dimension to 256
So in the third configuration we have 256 * 4 threads, which is 1024 threads per block
This is the maximum size a thread block can have; if you increase this value further, your kernel launch will fail
Now let me run the profiler with each of these argument sets
Let's run the one dimensional grid first
As you can see, nvprof has given us the information about the metrics
Now let's run the two dimensional configuration
And finally let's run the 2D configuration with 256 threads in the thread block's X dimension
If you compare these outcomes, you can see that the occupancy values are different for each of these launches
Ok, my purpose in this video was to show you how to use the nvprof tool for profiling, so you can gain insight into your program; we are going to use this tool extensively in upcoming lectures for profiling purposes
And if you want to learn more about nvprof you can run the nvprof --help command
If you run that command you can see from the output that it has all the details about commands, options, metrics, and events.

***************************
Parallel Reduction as a Synchronization Example
***************************

In any parallel programming paradigm synchronization between threads is vital


aspect of programming
Here, we are working with more than 1 thread, and often we need to order the
operations perform in these threads
So, in this video, we are going to discuss about couple of synchronization
functions in CUDA, cudaDeviceSynchronize and syncthreads() functions
The cudaDeviceSynchronize function provides global synchronization between the host and the
device
In CUDA, asynchronous calls like kernel launches are often made from the host
Using the cudaDeviceSynchronize function we can block the host application's
execution until all the device operations, like kernel executions, are finished
And if you look at the programs we have executed so far, you can see this
synchronization in almost every program
On the other hand, the __syncthreads() function provides synchronization within a
block
This function should be called only from device code
It will force the threads within a block to wait until all the threads in that
particular block reach that particular point in the code
And the global and shared memory accesses made by all the threads in the block
prior to the synchronization point, or __syncthreads() function call, will be visible
to the other threads in the block after the synchronization point
And remember, this synchronization happens between threads within a block
For example, if we launch a kernel with a thread block of 128 threads, there
will be 4 warps executing the instructions in this kernel
In this diagram, each of those warps is represented by an arrow
As we discussed earlier, CUDA follows the SIMT execution paradigm
Within a warp, all 32 threads execute the same instruction
But between warps there are no such guarantees
So let's say warp 2 starts execution first
This execution will be independent of the other warps
But if our kernel has a __syncthreads() call in the code, then warp 2 has to
wait at that statement until all the other warps reach that point
So all the threads in the block will reach that synchronization point, and only then
can any warp of this block execute further instructions
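As a minimal sketch, the two functions are used like this (the kernel name, its argument and the launch parameters here are just illustrative):

    // device code: a block-wide barrier inside the kernel
    __global__ void my_kernel(int * data)
    {
        // ... work on data ...
        __syncthreads();   // every thread in this block waits here
        // ... work that relies on what the other threads in the block wrote ...
    }

    // host code: wait for all outstanding device work to finish
    my_kernel<<<grid, block>>>(d_data);
    cudaDeviceSynchronize();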
Ok, let's see a usage of the __syncthreads() function with a parallel reduction
algorithm implementation
The general problem of performing a commutative and associative operation across a
vector is known as the reduction problem
In our case we are going to implement the accumulation of an array of integers as a
parallel reduction algorithm
The sequential implementation of this algorithm is quite easy
You just have to iterate through the array while adding each element to a global
variable
Even though this is a straightforward operation, if we have thousands of megabytes
of data in the array, then a parallel implementation will outperform the sequential
implementation
But remember, usually when we convert this into parallel code, it will
be much more complex than the straightforward sequential implementation
Addition is a commutative and associative operation
So we can sum the array in any order
So here is what we are going to do
We are going to first partition the input vector into smaller data chunks,
each data chunk will be summed up separately, and finally we will add these
partial sums to get the total sum
This diagram demonstrates exactly the steps we are going to follow
In the implementation we will use the thread block size as our data chunk size
So each thread block is responsible for adding up a given data chunk
After a thread block finishes its accumulation, the result will be stored into a
separate array
Each slot in this new array will be populated with the partial sum of one thread block
So the size of this new array will be the number of thread blocks in our grid,
if we use a 1D grid
Which means the grid size in the X dimension is the size of this new array
After we calculate the partial sums of all the thread blocks, we can transfer
this partial sum list to the host side to do the final summation, or, if this partial sum
array is big enough, we can again perform a reduction on the GPU for that array as
well
Now let's see how to sum up a particular data chunk
There are multiple approaches to summing up a data chunk
We will look at a few of those in upcoming videos
But here we are going to look at the approach called neighbored pairs
accumulation
In this neighbored pairs approach we are going to calculate the sum of a block
in an iterative manner, and in each iteration selected elements are paired with their
neighbor at a given offset
For the first iteration we are going to set the offset value to 1, and in each
iteration this offset value will be multiplied by two
And the number of threads which do any effective work will be halved
in each iteration
Let's say we select our data chunk size as 8 elements
So we need a thread block with 8 threads as well
In the first iteration we set the offset to 1
And in this iteration only the T0, T2, T4 and T6 threads are going to perform the
summation
The T0 thread is going to sum up the element at index 0 in the input array with the
element 1 offset away from index 0, which is the element at index 1, and
store the result back to index 0 in the original array
T2 is going to sum up the next two elements and store the result back to index 2 in
the array
T4 is going to sum up the next two elements and store the result back to index 4 in
the same array
And T6 is going to sum up the final two elements in the data chunk and store the
result back to index 6 in the input array
So, after the first iteration, our data chunk will look like this
Notice that only half of the threads in the thread block did any effective work in
the first iteration, and also that we store the summation back to the memory we
loaded from
This kind of accumulation is referred to as in-place reduction, where we store
the result to the same location we loaded the data from
The input to the second iteration will be the result calculated in the first
iteration
So the first element in the array now contains the summation of the first and second
elements in the original array
The 3rd element in the array for the second iteration has the summation of the 3rd and
4th elements in the original array, and so on
In the second iteration we are going to sum elements which are twice as far
apart as in the first iteration
So the offset value will be 2
And only 2 threads are going to perform any effective work in this iteration:
the T0 and T4 threads
T0 will sum up the 1st and 3rd elements in the array (indices 0 and 2) and store the
result back to the 1st element again
Notice that after this calculation the first element's value is the summation of the first
four elements in the original array
In the original array we have 3, 5, 1, 2 as the first four elements,
which sum up to 11, and that is what we get as the first element after the
second iteration
And T4 will sum up the 5th and 7th elements in the array and store the result back to the 5th
element
And this value corresponds to the summation of the second 4 elements in the original
array
In the next iteration our offset value will be twice that of the previous iteration
So we have 4 as our offset, and only one thread, T0, is going to perform the
summation
Now the input to this iteration is the output from the previous iteration, so when
T0 sums up the 1st and 5th elements in the input array, the first element of the output
array becomes the summation of all the elements in the original data chunk
So after the 3rd iteration we have the summation of all 8 elements in the data chunk
as the first element of the data chunk
Then we will store this value in the separate array which contains the partial sums
like this for all the data chunks we made
The code segment for summing up a data chunk in the neighbored pairs approach is shown in
this slide
As I mentioned earlier, we are going to start the iteration with the offset as
one, and then we are going to multiply the offset value by two in each iteration
And within each iteration we have to limit which threads are going to perform the
summation
For example, in the first iteration the T0, T2, T4 and T6 threads did the summation
And in the second iteration only the T0 and T4 threads performed the summation
So to accommodate this kind of limitation, we have to check the thread id before
performing the summation
This condition check guarantees that only the necessary threads in each iteration
execute the summation
For example, in the 1st iteration our offset value is one, so only the
threads with even thread ids will be able to execute the summation,
and in the second iteration the offset value is 2, so only the threads whose thread id
is a multiple of 4 will be able to execute the summation, and so on
If you notice, in this implementation we have to make sure that all the threads in
the block finish executing one iteration before any of the threads move
on to the next iteration
For example, if we have a thread block of size 128, then we will have 4 warps, and all
4 warps have to finish the 1st iteration before moving on to the next iteration
But warps can execute in any order
So when we have a thread block with multiple warps we don't have any way of
guaranteeing this restriction on our own
This is where the __syncthreads() function comes into action
After each iteration, we are going to use the __syncthreads() function
So every thread in the thread block has to wait until all the threads in
that particular thread block have finished the iteration
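To make the idea concrete, the loop on the slide might look roughly like this sketch, where tid is the thread's index within its block, gid is its global index and input is the device array (these names are assumptions at this point):

    for (int offset = 1; offset < blockDim.x; offset *= 2)
    {
        if (tid % (2 * offset) == 0)               // only every (2 * offset)-th thread works
            input[gid] += input[gid + offset];     // in-place partial sum
        __syncthreads();                           // finish this iteration block-wide
    }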
Ok, let's see the implementation now
As a practice, we are going to check the validity of the results calculated by the
GPU implementation by comparing them with the value we get from a CPU implementation
So here we need one function to perform the reduction on the CPU and one function to
compare the results of the CPU and GPU implementations
And as I told you, neighbored pairs is only one way of implementing parallel
reduction, and in upcoming videos we are going to discuss more ways of
implementing it
So if I put these functions in a common place, then in the upcoming implementations
we can access them again and again
So here I have a common header which includes such common definitions
So let's add these functions to the common header
First we have to declare the reduction_cpu function in the common header
Then we can define the function in the common.cpp file
Here, all you have to do is iterate through the array while adding each element to
an accumulating variable in this way
The compare_results function will compare two given integers for
equality
So let's declare this function in the common header file as well
Then we can define it in the common.cpp file
We can first print out both values being compared; then, if the values
are equal, we print out to the console that the values are equal, and if the values are
different, we print out that the values are different
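A sketch of those two helpers might look like this (the exact signatures in the course code may differ):

    #include <cstdio>

    // common.cpp : CPU reference reduction
    int reduction_cpu(int * input, const int size)
    {
        int sum = 0;
        for (int i = 0; i < size; i++)
            sum += input[i];
        return sum;
    }

    // common.cpp : compare the CPU and GPU results
    void compare_results(int gpu_result, int cpu_result)
    {
        printf("GPU result : %d , CPU result : %d \n", gpu_result, cpu_result);
        if (gpu_result == cpu_result)
            printf("GPU and CPU results are same \n");
        else
            printf("GPU and CPU results are different \n");
    }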
Now on to our reduction file
Let's name our kernel reduction_neighbored_pairs, and it will take 3 arguments:
the input array, the array to store the partial sums, and the size of the input array
We will get back to the kernel implementation shortly, but for now let's move on to our
main function
Now, in most of our implementations our main function will look similar
So here I am going to explain completely what we are going to do in the main
function, but in upcoming videos we will not focus on the main function, so we can
simply focus on our kernel implementations
In the main function, first, we are going to define our size variable and a
byte size variable, which is the actual number of bytes taken by our input array,
in this case 128 megabytes of data, and our block size will be 128
threads
And then we need to declare a pointer to hold the input array and a host side pointer
to the partial sum array, and allocate memory for those pointers
Note that here we are going to initialize our input array randomly with values from 0
to 10
For that I have used the initialize function defined in the common header
And then we can calculate the CPU result and store it in a variable called
cpu_results
Then we need our kernel launch parameters, the block and grid variables
So let me calculate those values as well
And then, as usual, I am printing out the grid and block size parameters, so in
the output we can clearly identify our grid configuration
If you notice, I have not yet allocated memory for the partial sum array
The size of the partial sum array is equal to the number of blocks in our grid, and only
now do we know that
So here let me calculate the number of bytes needed to hold the partial sum array and
then we can allocate memory for it
Now we need device pointers
So let me declare those pointers here
And then we can allocate memory for those pointers using the cudaMalloc() function,
passing each pointer and the size of memory we need
Next we have to set the initial values of the partial sum array to 0
So here we can use the cudaMemset() function for that
And then we need to transfer the input array from the host to the device
So here I am going to use the cudaMemcpy() function with the necessary pointers, and our
memory copy direction is host to device in this statement
Then we can launch the kernel with the given arguments, and after that we have to
wait for our kernel to finish, so here we can use the cudaDeviceSynchronize()
function we discussed in the presentation
And then we can transfer the partial sum array back to the host
Now, to calculate the final result from the partial sum array, we have to
iterate through the partial sum array and add each element to an accumulating variable
So here I am going to iterate through all the elements and perform that
summation
And then we can check the validity of the results by passing both the CPU calculated
result and the GPU calculated result to our previously defined compare_results
function
And finally we can clean up the memory we allocated on the host and device by calling
the cudaFree and free functions
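Putting those steps together, the host code might look roughly like the following condensed sketch; the header name common.h, the initialize helper and the lack of error checking are assumptions made for illustration:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>
    #include "common.h"   // assumed name for the common header described above

    int main()
    {
        int size = 1 << 25;                        // 2^25 ints, 128 megabytes of data
        int byte_size = size * sizeof(int);
        int block_size = 128;

        int * h_input = (int *)malloc(byte_size);
        initialize(h_input, size);                 // random values from 0 to 10 (common header)

        int cpu_result = reduction_cpu(h_input, size);

        dim3 block(block_size);
        dim3 grid(size / block.x);
        printf("Kernel launch parameters : grid.x : %d, block.x : %d \n", grid.x, block.x);

        int partial_sum_byte_size = grid.x * sizeof(int);
        int * h_partial_sum = (int *)malloc(partial_sum_byte_size);

        int * d_input, * d_partial_sum;
        cudaMalloc((void **)&d_input, byte_size);
        cudaMalloc((void **)&d_partial_sum, partial_sum_byte_size);

        cudaMemset(d_partial_sum, 0, partial_sum_byte_size);
        cudaMemcpy(d_input, h_input, byte_size, cudaMemcpyHostToDevice);

        reduction_neighbored_pairs<<<grid, block>>>(d_input, d_partial_sum, size);
        cudaDeviceSynchronize();

        cudaMemcpy(h_partial_sum, d_partial_sum, partial_sum_byte_size, cudaMemcpyDeviceToHost);

        int gpu_result = 0;
        for (int i = 0; i < grid.x; i++)           // final accumulation on the host
            gpu_result += h_partial_sum[i];

        compare_results(gpu_result, cpu_result);

        cudaFree(d_partial_sum);
        cudaFree(d_input);
        free(h_partial_sum);
        free(h_input);
        return 0;
    }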
Ok, that's it for the main function, and in the upcoming implementations our
main function will look much like this, so I am not going to explain it there
Now let's look at our kernel implementation
In the kernel we need two variables, tid and gid, to hold the thread id within the block and
the global id of the thread in the grid, so let me define those and assign their values
And then we need to perform a boundary check to avoid invalid memory accesses from
the threads
And then we have the for loop I showed in the pseudo code
Here we set the initial offset value to 1, and then we multiply it
by 2 in each iteration while the offset is less than the block size in the X
dimension
And then we have an if condition to filter the threads which can perform the
summation in a given iteration
And after we perform the summation, we need a __syncthreads() function call
So all the threads in a block have to finish one iteration before any of the threads
in that particular thread block execute the next iteration
And after all the iterations end, in every data block the first element has the
summation of that block
So we can assign that value to our partial sum array
Now the index into this partial sum array is equal to the block id of the particular
block
So the zeroth thread in each block will store the sum of that
block into the partial sum array in this way
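Based on those steps, the kernel might look roughly like this sketch (the real course code may differ in small details):

    __global__ void reduction_neighbored_pairs(int * input, int * partial_sum, int size)
    {
        int tid = threadIdx.x;                             // thread id within the block
        int gid = blockDim.x * blockIdx.x + threadIdx.x;   // global thread id in the grid

        if (gid >= size)                                   // boundary check
            return;

        for (int offset = 1; offset < blockDim.x; offset *= 2)
        {
            if (tid % (2 * offset) == 0)                   // filter the working threads
                input[gid] += input[gid + offset];

            __syncthreads();                               // finish the iteration block-wide
        }

        if (tid == 0)                                      // thread 0 stores the block's sum
            partial_sum[blockIdx.x] = input[gid];
    }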
That's it for our kernel implementation
Now let's run this program
Ok, in the output it printed that the results are the same
This means both the CPU and GPU calculations yield the same result,
which means our implementation of the reduction algorithm with CUDA is correct
Ok, now you are aware of how to use the __syncthreads() function to synchronize the
threads within a block
But you have to be careful when you are using the __syncthreads() function,
especially if your code contains condition checks
We perform the __syncthreads() function call to make sure that all the threads in a
block reach one point before any thread in that particular block moves forward
But if we use __syncthreads() within a condition check, then that condition
check might not allow some of the threads in that particular thread block to
reach the __syncthreads() statement
In that case the threads that did reach the barrier may wait forever, and the outcome is undefined
So be mindful when you are using the __syncthreads() function, especially in the
presence of condition checks

***************************
Parallel reduction as warp divergence example
***************************

As I already explained to you, if the threads within a warp execute
different paths, then there is warp divergence in the kernel
Our device will serialize these paths and execute the code
path by path, which results in heavy performance penalties
Unfortunately, our previous parallel reduction implementation also has warp
divergence
For example, in this diagram you can see 128 threads belonging to a single thread
block; each slot in this table belongs to one thread, and I have written the
thread ids in each slot as well
In our previous implementation, in the first iteration only the threads with an even
thread id perform any effective work
So only 50% of the threads perform any work, and these threads are spread across all
four warps of this thread block
In the second iteration only the threads whose thread id is a multiple of
4 perform any summation
So only a quarter of the threads perform the summation, and those threads are also spread
across all the warps
So in every iteration other than the last couple, we have warp divergence in
every warp of the thread block
In the final couple of iterations, only a couple of warps will show warp divergence,
since the threads performing the sum are spread across only a couple of warps
So our previous reduction implementation has a lot of divergent code
Luckily, there is more than one way of avoiding warp divergence in this type of
kernel
Here, we are going to implement the reduction algorithm in two new ways to avoid
warp divergence
As the first solution, we will make neighboring threads perform the effective work
in each iteration
In the second method, we are going to use the interleaved pairs approach as a solution
for the divergence in the reduction algorithm
Let's look at each of these approaches now
In the first approach we are going to make sure that neighboring
threads are the ones performing the summation
For example, in the previous neighbored pairs approach, when we performed the algorithm on an 8
element block, in the first iteration only the threads with even thread ids
performed the summation
The threads with odd thread ids did not add any values
However, in our new approach, we make sure that neighboring threads perform the
summation
So in the corresponding first iteration, the first four consecutive threads will
perform the summation
Ok, let's see an example of this approach
Let's say we have a data block with 8 elements
In the first iteration, the first four threads are going to perform the summation
So the T0 thread will add the first two elements and store the result back to
index 0, the T1 thread will add the next 2 elements and store the result back
to index 2, and so on
Then, when it comes to the next iteration, consecutive threads will still be the ones
executing the summation
In this case T0 and T1
In our previous implementation, T0 and T4 were the ones
executing the summation in the second iteration
And in the third iteration only T0 will execute the summation
If we use 128 as the data block size, then our thread block size will also be 128
So in the first iteration the first 64 threads, or first two warps, will perform the
summation, and the second two warps will not do anything
But still there is no warp divergence, since within each warp all the threads follow
the same path
In the next iteration only the first 32 threads, or the first warp, will perform
the summation, and the other three warps will not do anything
In the 3rd iteration, however, only the first 16 threads, or half of the first warp,
will be performing the summation, so from this iteration onwards there will be
warp divergence
But notice that from this iteration onwards we have only one warp with divergence per
iteration
And if you remember, in our initial neighbored pairs approach with a
128 element data block, all four warps had warp divergence until the last two iterations
Ok, let's see the implementation now
The main function for this implementation is almost identical to what we had in
the previous neighbored pairs implementation
So I will not go through that
So let's look at our kernel now
In the kernel, let's set the local thread id value and then calculate the global thread id value first
Then we have to set a local data pointer with the proper offset for the corresponding
thread block, so that the threads can access global memory for this thread block
through this local pointer
Then we can add our boundary check
And then we are going to perform the usual iteration while multiplying the offset by
two in each iteration
Now we need consecutive threads to perform the summation
For that we calculate an index value for each thread in the block based on the
thread id and the offset value
We can use this condition check to limit the threads which are going to perform
the summation
And we need all the threads in the block to finish executing one iteration
before any of the threads in that block move on to the next one, so here we are
going to have a __syncthreads() function call as well
After all the iterations end, the first element in the data block will have the
partial sum for this thread block, so we have to store it to our partial sum
array
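A sketch of that kernel, following the steps just described (the kernel name is only illustrative):

    __global__ void reduction_neighbored_pairs_improved(int * input, int * partial_sum, int size)
    {
        int tid = threadIdx.x;
        int gid = blockDim.x * blockIdx.x + threadIdx.x;

        int * i_data = input + blockDim.x * blockIdx.x;    // local pointer to this block's chunk

        if (gid >= size)
            return;

        for (int offset = 1; offset < blockDim.x; offset *= 2)
        {
            int index = 2 * offset * tid;                  // consecutive threads map to 0, 2*offset, ...

            if (index < blockDim.x)
                i_data[index] += i_data[index + offset];

            __syncthreads();
        }

        if (tid == 0)
            partial_sum[blockIdx.x] = i_data[0];
    }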
Now let's run this implementation and check its validity
Ok, in the output you can see that it printed out that the GPU and CPU results are
the same
So our implementation is a valid one
Ok, let's move on to the next way of solving warp divergence, using the interleaved
pairs approach
In this approach also we are forcing the summation of elements to happen in
consecutive threads
But we are going to step through the offsets between the elements that get added
together in an iteration in the reverse order compared to the previous approaches
For example, in the first iteration we will set our offset value to half of the
block size
So if we consider a data block with 8 elements, the offset value will be 4, hence in
the first iteration the first thread will accumulate the first element and the fifth element
in the data block
The second thread will accumulate the second and sixth elements in the data block, and
so on
In the second iteration we will divide the offset by two
So for the second iteration the offset value will be 2, hence the first thread will
accumulate the first and third elements, and the second thread will accumulate the second and
fourth elements in the current data block
Notice that here we are performing an in-place reduction, so the output of one iteration
will be the input to the next iteration; the first element for the second
iteration contains the summation of the first and fifth elements in the original
array, and the third element for the second iteration has the sum of the third and seventh elements in
the original array
And after the second iteration, the first element in the array now has the summation
of the first, third, fifth and seventh elements in the original array
In the last iteration our offset will be one, so the first thread will accumulate the
first and second elements in the current data block, and after that we stop iterating
Now, after this step, the first element in our array contains the summation of
all 8 elements in our array
Ok, let's see the implementation now
This is almost identical to what we did in the previous implementation
The difference is that we initialize the offset value to half of the block size and
then keep dividing it by 2 in each iteration
And in the condition check, since we need consecutive threads to execute the
accumulation steps, we check whether the thread id is less than the offset
So in the first iteration only the first half of the thread block will perform the
summation
In the next iteration only the first quarter of the threads in the block perform the
summation, and so on
And here also we need all the threads in one thread block to execute one
iteration before moving on to the next iteration, hence I have used the __syncthreads()
function here as well
Then we have to store the summation to the partial sum array as well
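As a sketch, the interleaved pairs kernel might look like this:

    __global__ void reduction_interleaved_pairs(int * input, int * partial_sum, int size)
    {
        int tid = threadIdx.x;
        int gid = blockDim.x * blockIdx.x + threadIdx.x;

        if (gid >= size)
            return;

        // start at half the block size and halve the offset in every iteration
        for (int offset = blockDim.x / 2; offset > 0; offset /= 2)
        {
            if (tid < offset)                              // only the first 'offset' threads work
                input[gid] += input[gid + offset];

            __syncthreads();
        }

        if (tid == 0)
            partial_sum[blockIdx.x] = input[gid];
    }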
Now if you run this example, you will see that the GPU and CPU results are the same,
so this implementation is also a valid one.

***************************
Parallel reduction with loop unrolling
***************************

In this video we will look at an important concept called unrolling in CUDA programming
Unrolling in CUDA can mean a variety of things
However, the goal of any kind of unrolling is to improve performance by reducing
instruction overhead and creating more independent instructions to schedule
Let's first look at a simple unrolling concept called loop unrolling
In loop unrolling, rather than writing the body of a loop once and using the loop
to execute it repeatedly, the body is written out in the code multiple times
By doing so we can reduce, or entirely remove, the iterations of the enclosing loop,
and by reducing or removing iterations of an enclosing loop we reduce loop
maintenance instructions, like the condition checks in the loop, and the number of
dependent instructions, since one iteration has to be performed after another
We call the number of copies made of the loop body the loop unrolling factor
For example, here we have the loop we used to accumulate an array on the CPU
If the array has 100 elements, then we need to iterate 100 times with this
implementation
However, if we replace this loop with the following one, then within the loop
body we accumulate 2 elements, the i-th element and the (i+1)-th element, so the loop will
iterate over only half of the original size, which in this case is 50
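As a sketch, the two versions of that CPU loop might look like this (the array and variable names are illustrative):

    // original loop : one element per iteration
    int sum = 0;
    for (int i = 0; i < size; i++)
        sum += array[i];

    // unrolled by a factor of 2 : two elements per iteration (assumes size is even)
    int sum_unrolled = 0;
    for (int i = 0; i < size; i += 2)
        sum_unrolled += array[i] + array[i + 1];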
And remember, this kind of loop unrolling is most effective at improving
performance for sequential array processing loops, where the number of
iterations is known prior to the execution of the loop
The unrolling we are going to apply here is somewhat different from our loop
unrolling example in that sequential code
You can refer to the unrolling we are going to perform here as thread block unrolling
But the main purpose, reducing instruction overhead, remains the same
In all the previous implementations, we created a number of threads equal to the number of
elements in the array
But we can actually reduce the number of thread blocks we need to perform our parallel
reduction
Here is what we are going to do
Before the partial sum loop in the interleaved pairs kernel, we are going to
manually sum up two data blocks into one block
So the number of threads needed for the reduction will be reduced by half
With this simple manual addition, we can reduce the device workload almost by half
So after we make this change, the first and second data blocks will be summed up
together by the first thread block, the next two data blocks will be summed up together by the
second thread block, and so on
Ok, let's implement this approach now
Keep in mind that all these reduction implementations are done assuming we are
going to launch a one dimensional grid
Let me name our kernel reduction_unrolling_blocks2
It has the same argument list as our previous kernels
To access each single element in each data block I need the local tid value here
So let me assign threadIdx.x to a variable called tid
Now, this step is optional
You can access this value directly through the threadIdx.x variable
But I like to use the tid variable
So that's about it
Since we are unrolling two consecutive data blocks using one thread block, each
of our thread blocks has to access a data block at twice the distance of its thread
block id
So to calculate the element index for each thread based on the data block location,
we need a data block offset based on the thread block id
So let me first calculate the data block offset here
Now we can simply add the tid value to the block offset to get the global data element index
And then let me calculate a local data pointer, so that our threads can access the
elements for this thread block using that pointer as well
Now we can unroll the two data blocks
So here, if the index is within the size of our array, we can sum up two
elements from the two consecutive data blocks

Now, before doing anything else, we have to make sure that all of the threads in
our thread block have finished the thread block unrolling part
We can achieve that by using a __syncthreads() function call here
Then we can perform the partial accumulation step for our unrolled data block in an
iterative manner
So here I am going to use the interleaved pairs approach to sum up the data block, so
let me copy that code segment from our previous interleaved pairs implementation
And that's it for our kernel
The difference here is that we have manually unrolled two data blocks at the
beginning of our code
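The whole kernel might look roughly like this sketch:

    __global__ void reduction_unrolling_blocks2(int * input, int * partial_sum, int size)
    {
        int tid = threadIdx.x;

        int block_offset = blockDim.x * blockIdx.x * 2;    // each thread block covers two data blocks
        int index = block_offset + tid;

        int * i_data = input + block_offset;               // local pointer for this thread block

        if ((index + blockDim.x) < size)                   // fold the second data block into the first
            input[index] += input[index + blockDim.x];

        __syncthreads();

        // interleaved pairs reduction of the unrolled data block
        for (int offset = blockDim.x / 2; offset > 0; offset /= 2)
        {
            if (tid < offset)
                i_data[tid] += i_data[tid + offset];

            __syncthreads();
        }

        if (tid == 0)
            partial_sum[blockIdx.x] = i_data[0];
    }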
Now, what if we want to unroll 4 data blocks instead of 2?
Actually, the implementation is almost identical to this one
All you have to do is change the block offset calculation here to compute the offset
for four blocks, so instead of multiplying by 2 we will multiply by 4
So let me quickly add the kernel for that unrolling as well
Let me copy the unrolled two blocks kernel, paste it here, and change the
name to reduction_unrolling_blocks4
Then we can change the block offset variable in this way
Then, in the unrolling statement, we have to perform 4 unrollings
So here our boundary check will be the index plus another 3 blocks, because we are
going to unroll 4 blocks, so we have to check that those indices are within the size
of our array
Then we can assign the values from each of the consecutive data blocks to variables and
finally add those together
And that's it for the 4 block unrolling kernel
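For reference, the unrolling part of that 4 block kernel might look like this sketch, where block_offset is now blockDim.x * blockIdx.x * 4 and index is block_offset + tid:

    // manual unrolling of four consecutive data blocks
    if ((index + 3 * blockDim.x) < size)
    {
        int a1 = input[index];
        int a2 = input[index + blockDim.x];
        int a3 = input[index + 2 * blockDim.x];
        int a4 = input[index + 3 * blockDim.x];
        input[index] = a1 + a2 + a3 + a4;
    }

    __syncthreads();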
If you want to unroll 8 data blocks, all you have to do is change the
block offset variable and perform the unrolling for 8 elements
Then you will have the 8 block unrolled kernel
Now, when you launch these kernels in the main function, make sure that you launch
the kernel with the correct number of thread blocks
For example, for the 2 block unrolling kernel, we should have half the number of thread blocks
compared to the previous implementations
So you have to divide the grid size in the X dimension by two
And for the 4 block unrolled kernel, you should have a quarter of the thread blocks of the
usual grid, and so on.
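On the host side that adjustment might look like this sketch (it assumes the input size is a multiple of the block size, and that each kernel is run on its own fresh copy of the data, since the reduction works in place):

    dim3 block(block_size);

    // 2 block unrolling : half the thread blocks
    dim3 grid2(size / block_size / 2);
    reduction_unrolling_blocks2<<<grid2, block>>>(d_input, d_partial_sum, size);

    // 4 block unrolling : a quarter of the thread blocks
    dim3 grid4(size / block_size / 4);
    reduction_unrolling_blocks4<<<grid4, block>>>(d_input, d_partial_sum, size);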

***************************
Parallel reduction with warp unrolling
***************************

***************************
Parallel reduction with complete unrolling
***************************

***************************
Performance comparison of reduction kernels
***************************

***************************
CUDA Dynamic Parallelism
***************************

***************************
Reduction with dynamic parallelism
***************************

***************************
Summary
***************************
