Section 2 TR
In the previous section we learned about supercomputing, parallel computing, and their relationship to CUDA. In the later part of that section we learned the basic steps of a CUDA program and how to launch a kernel. Then we moved on to implement a simple kernel to sum up two arrays, and if you remember we launched our sum_kernel using different launch configurations. Then we observed the change in performance, or execution time, when launching the kernel with different launch configurations, and we found the best configuration for that particular kernel by trial and error. But if we had a better understanding of what our device or GPU looks like, and which parts of the device affect the execution of certain types of kernels, then we could come to a logical conclusion beforehand about how the performance of a kernel varies with different launch configurations. In this section we are going to learn just that. We are going to unveil what hides behind the fancy cover of NVIDIA GPUs, and we are going to learn how to use certain details of specific GPUs to get the maximum performance out of our kernels. Here you can see NVIDIA GeForce GTX 970 and Titan V GPUs. Your GPU unit will also look more or less like this. But as CUDA developers we are not interested in this fancy look. If we remove the fancy cover you can see the hardware inside the GPU unit. But we are not interested in that look either. We are more interested in the architectural design of these GPUs, which will look more or less like this, depending on which microarchitecture you are looking at. This diagram shows you the layout of a microarchitecture called Maxwell, which the GTX 900 family of products belongs to. The GPU architecture is built around a scalable array of streaming multiprocessors, or SMs, and GPU hardware parallelism is achieved through replication of this architectural building block. In this diagram, a streaming multiprocessor is boxed with a red line. We can count 16 such streaming multiprocessors in this GPU. Those streaming multiprocessors have access to the device's built-in global memory and L2 cache, and other than that, there are some memory controllers and caches in this architectural diagram as well.
OK, let's now take a closer look at a streaming multiprocessor. The key components of a streaming multiprocessor are listed in this diagram. Notice this SM logical diagram is for an SM in the Fermi microarchitecture. Modern SM units are somewhat different from this, but we can clearly identify the basic components from this diagram. We have CUDA cores, shared memory and L1 cache, registers, load and store units, warp schedulers, and special function units. A CUDA core is like a CPU core, and it will execute the instructions fetched into it. There are shared memory and L1 cache to facilitate faster memory access. Registers are a limited amount of on-chip memory to store on-the-fly values for execution. Load/store units are there to perform memory load and store requests. The warp scheduler determines which warp, or which set of threads, is going to execute next. Notice these components are on-chip components. So we have to be especially careful about how we use resources like registers, shared memory, and L1 cache, as those are very limited resources on an SM. And the warp scheduler takes resource consumption into account when deciding which warp is going to execute next. If you look at a more modern SM, you may realize it is a bit more complicated than this. Here this diagram shows you the SM for the Volta microarchitecture. All these green squares are cores. But these SMs have different types of cores, like tensor cores, integer operation execution cores, FP32 cores which are optimized to execute 32-bit floating point operations faster, and FP64 cores which are built to execute 64-bit floating point operations much faster. We will look at each of these in upcoming videos. But for now we can focus on the simple set of components we saw in the previous slide. Computer architectures can be classified into four categories according to Flynn's taxonomy.
SISD stands for single instruction, single data, and it represents a sequential computer which exploits no parallelism in either the instruction stream or the data stream. Here a single control unit fetches a single instruction stream from memory. The control unit then generates the appropriate control signals to direct a single processing element to operate on a single data stream, that is, one operation at a time. SIMD stands for single instruction, multiple data. In this model a single instruction stream is applied to multiple data streams at once: one control unit issues the same instruction to multiple processing elements, each operating on its own data item. Vector processors are a classic example of this organization. MISD stands for multiple instruction, single data. Here multiple instructions operate on one data stream. This is an uncommon architecture which is generally used for fault tolerance. Here heterogeneous systems operate on the same data stream and must agree on the result. Examples of this model include fault-tolerant applications like the Space Shuttle flight control computers. MIMD stands for multiple instruction, multiple data, and it represents multiple autonomous processors simultaneously executing different instructions on different data. MIMD architectures include multicore and superscalar processors and distributed systems, using either one shared memory space or a distributed memory space. CUDA follows a different flavor of computer architecture than these, called SIMT.
SIMT is an extension of the SIMD model with multithreading, which means that a single instruction is going to run on multiple threads. In CUDA, a thread block is going to execute on a single streaming multiprocessor. Multiple thread blocks can execute simultaneously on the same streaming multiprocessor, depending on the resource limitations of that SM. But one thread block cannot execute on multiple SMs. If the device cannot run a single thread block on one SM, for example if the shared memory needed for a particular block is greater than the shared memory available on the SM, then an error will be returned for that kernel launch. When a block is scheduled to execute on an SM, it is executed in such a way that a single instruction is run by multiple threads. We will discuss this more in upcoming videos. In the previous section, we arranged a set of threads into a set of blocks in a grid. At least that's what we were able to observe from the software point of view. At the hardware level we can map this execution as shown in this diagram. The grid of a kernel launch maps to the device. The thread blocks in this grid are divided across the SMs in the device. A single thread in a thread block is going to execute on one single core. And since we have multiple cores in a single SM, the same instruction for multiple threads can execute on the SM, adhering to the SIMT convention. When programming in CUDA you have to be vigilant about what kind of device you are going to run your application on, as different CUDA microarchitectures contain different and enhanced capabilities. Throughout this course you will learn about these capabilities and how to utilize them to get the maximum performance output.
***************************
All About Warps
***************************
When launching a kernel, it seems that all the threads in the kernel run in parallel. From a logical point of view, this is true. But from the hardware point of view, not all the threads physically execute in parallel at the same time. For example, let's say we have a grid with 1 million threads, with each thread block having 512 threads. From the software point of view we assume that all one million threads in the grid execute in parallel. But suppose we ran this kernel on a device with 13 SMs, each having 128 cores, so altogether we have only 13 * 128 = 1664 cores. This device can only execute 1664 threads in parallel, so there is no way of running 1 million threads in parallel on this device. Also, one block is going to run on a single SM, and here we have only 128 cores per SM. But since we have 512 threads per thread block, there is no way of executing in parallel even all the threads belonging to one thread block. So what happens here is that, inside the device, thread blocks are divided into smaller units called warps, each having 32 consecutive threads. And all the warps for a given thread block are going to execute on a single SM. For example, in the scenario we considered before we had 512 threads in a block. These threads are divided into 16 warps. And since we have 128 cores per SM here, only 4 warps can execute at a given time. Now there is no particular order in which warps execute; that depends on factors like the readiness of memory for the threads in those warps. So we can define a warp as the basic unit of execution in an SM. When we launch a grid, the thread blocks are distributed among the SMs. Once a thread block is scheduled to an SM, the threads in that thread block are further partitioned into warps. A warp consists of 32 consecutive threads, and all threads in a warp are executed in SIMT fashion. That is, all the threads execute the same instruction, and each thread carries out that operation on its own private data. A thread block can be configured as 1, 2, or 3 dimensional; however, from the hardware point of view, all threads are arranged in a single dimension. Here, each thread is going to have a unique id. For example, if we have a 1D block with 128 threads in the X dimension, those threads will be divided into four warps. Warp 0 will have the threads with threadIdx.x values 0 to 31, warp 1 will have the threads with threadIdx.x values 32 to 63, warp 2 will have the threads with threadIdx.x values 64 to 95, and warp 3 will have the threads with threadIdx.x values 96 to 127. If you have a 2D thread block with 64 threads in the X dimension and 2 threads in the Y dimension, still from the hardware point of view the threads will be divided into 4 warps in a single dimension. This slide shows you a 2D thread block with 40 threads in the X dimension and 2 threads in the Y dimension. From the application perspective there are 80 threads laid out in a two-dimensional arrangement. However, from the hardware point of view we have 80 threads in the X dimension. So one could suggest that the hardware allocates 3 warps for this thread block, resulting in a total of 96 hardware threads to support the 80 software threads. If so, only 16 threads would be in the active state in the final warp. But this is not true. Warps exist within a thread block. If we allocated warps in this way, the second warp would end up with threads belonging to two thread blocks, and we would be suggesting that threads belonging to two thread blocks reside in one warp, which is an inaccurate statement. So what really happens here is that, to handle the 80 software threads in this case, the CUDA runtime will allocate 4 warps, or 128 hardware threads, as shown in this diagram. In this case only 8 threads in each of the second and fourth warps are going to be in the active state. All the other threads in these warps are going to be in the inactive state when our device executes this thread block. The warp size is 32, and even to run a thread block with a single thread, the CUDA runtime will still assign a single warp, which means 32 threads. In this case only one thread will be in the active state and the other 31 threads will be in the inactive state. The problem here is that even though inactive threads do not execute instructions, resource allocation, such as shared memory for the block, is still done based on the number of warps. So having inactive threads in a warp is a great waste of resources on the SM. Therefore we tend to use block sizes in the X dimension that are multiples of 32, so that we can guarantee that all the threads in a warp are going to be in the active state.
OK, let's do a simple exercise to demonstrate the facts mentioned above. In this example, we are going to launch a kernel with a two-dimensional grid, and each thread is going to print out its threadIdx, blockIdx, and warp index values. There is no built-in variable to indicate the warp index, but we can easily calculate this value by dividing the threadIdx value by the warp size, which is 32. This value represents the index of a warp within a block. So for a single thread block we are going to have a different warp index for each 32 consecutive threads, but different thread blocks will have warps with the same warp index. Here our grid has 2 blocks in the X dimension and 2 blocks in the Y dimension, and each block is going to have 42 threads in the X dimension. So altogether we are going to launch this kernel with 168 threads. Now, 42 is far from the ideal value we should have as a block size, but I am using this value here for demonstration purposes. Let's look at how our warp allocation may turn out. Again, remember a warp cannot have threads from two thread blocks. So to handle 168 software threads in this case, we are going to have 256 hardware threads, or 8 warps. And here we have 4 warps with only 10 active threads in each; the other threads in these warps are going to be in the inactive state. OK, let's print out the above mentioned values for this grid. Let's name our kernel print_details_of_warp. First we need to calculate the global thread index. But remember, here we are going to have a two-dimensional grid, so we have to consider the Y dimension in our calculation as well. So let me calculate this first. Then we need the warp index, and we can calculate it by dividing the threadIdx.x value by 32. Since we have a 2D grid here, let me also calculate a unique block index to identify each block separately in the output. We can do this by multiplying the block index in the Y dimension by the number of blocks in the X dimension and then adding the block index in the X dimension. Then let me print out the threadIdx.x, blockIdx.x, blockIdx.y, gid, and warp id values. Now in our main function we are going to launch our kernel, so let me set the block and grid variables here. Our block size will be 42 in the X dimension, and our grid is a two-dimensional one with two thread blocks in each dimension. With these parameters we can launch our kernel now.
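As a reference, here is a minimal sketch of the whole example based on the narration above (the kernel name follows the video, but details such as the exact printf format string are my own choices):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void print_details_of_warp()
    {
        // Global thread index for a 2D grid of 1D blocks
        int gid = blockIdx.y * gridDim.x * blockDim.x
                + blockIdx.x * blockDim.x
                + threadIdx.x;

        // Warp index within the block (warp size is 32)
        int warp_id = threadIdx.x / 32;

        // Unique block index across the 2D grid
        int gbid = blockIdx.y * gridDim.x + blockIdx.x;

        printf("tid : %d, bid.x : %d, bid.y : %d, gid : %d, warp_id : %d, gbid : %d\n",
               threadIdx.x, blockIdx.x, blockIdx.y, gid, warp_id, gbid);
    }

    int main()
    {
        dim3 block(42);   // deliberately not a multiple of 32
        dim3 grid(2, 2);  // 2 x 2 thread blocks

        print_details_of_warp<<<grid, block>>>();
        cudaDeviceSynchronize();
        cudaDeviceReset();
        return 0;
    }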
Let me run this example now. If you carefully investigate the output, you will see that the second warp for each thread block has only 10 outputs, as we discussed in the presentation, while the first warp in each thread block has 32 outputs. For example, the second warp of the block with global block id 3 has only 10 outputs, as highlighted here. So run this example on your machine and verify with its output the facts I mentioned in the presentation.
***************************
Warp Divergence
***************************
Let's discuss an important concept associated with warp execution called warp divergence
As we discussed, threads in a warp are going to execute the same instruction
But what if we code our program in a way that forces some threads in the warp to execute different instructions
For example, let's say we have a grid with blocks which have 32 threads each
So there will be only one warp to execute
And this grid is going to execute the following condition check in the kernel
Due to this condition check, threads with threadIdx.x less than 16 will execute the statements in the if block, and threads with threadIdx.x greater than or equal to 16 will execute the else block
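A minimal sketch of the kind of condition check being described (the branch bodies here are placeholders of my own; in real code the results would be used):

    __global__ void divergence_intro()
    {
        float a = 0.f;

        // Threads 0-15 of the warp take the if path, threads 16-31 take the else path
        if (threadIdx.x < 16)
            a = 100.f;
        else
            a = 200.f;
    }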
But we know warps execute in SIMT fashion
So every thread in this warp has to execute the same instruction
So what happens here is that, if the threads in a warp diverge, the warp serially executes each branch path while disabling the threads that do not take that path
So the warp will execute the if block while disabling the second half of the threads in the warp, and then it will execute the instructions in the else block while disabling the first half of the warp
So if we have multiple condition checks which cause a warp to diverge, there will be a significant performance penalty
Let's examine another example now
Let's say we are going to execute a kernel with the code shown in this slide
Here we have a condition check which forces threads with an odd thread id to follow a different path than the threads with an even thread id
So all the threads in the warp with an even thread id will execute the if statement, and all the threads in the warp with an odd thread id will execute the else statement
To adhere to the SIMT execution convention, the warp will sequentially go through all the statements in both the if and else blocks while logically disabling the threads that do not take that path
Such an execution for just 4 threads of that particular warp is shown here
Here the blue colored boxes represent coherent code which is going to be executed by all the threads in the warp
Only the threads with an even thread id, which in this case are the threads with thread ids 0 and 2, are going to execute the green colored instructions, or the if clause
During that time the threads with thread ids 1 and 3 have to stall their execution
And when the threads with thread ids 1 and 3 execute the else block, the threads with thread ids 0 and 2 have to stall their execution
In this example, the amount of parallelism in the warp execution was cut by half
Only 16 threads were actively executing at a time while the other 16 were disabled
With more conditional branches the loss of parallelism would be even greater
So be aware of the presence of control flow statements like if-else statements and switch statements, which give us a hint of divergent code
But remember, the presence of a control flow statement does not always result in a divergent warp
A warp diverges only when there are multiple paths of execution within the same warp
So a condition check which results in all the threads in a warp executing the same path will not induce any warp divergence
Consider this example
Here "tid" represents the threadIdx.x value for a single-dimensional block
If we take 64 as our block size, the CUDA runtime will allocate 2 warps to execute this thread block
The first 32 threads in our thread block will have tid values from 0 to 31
So tid / 32 will be 0, so the first 32 threads, or the first warp of our thread block, will execute the if block, and none of the threads in the first warp will execute the else block
And for the second 32 threads the tid values will be 32 to 63, so for these threads tid / 32 will be 1, so all the threads in the second warp are going to execute the else block
So here the first warp is going to execute the if block and the second warp is going to execute the else block
So even though we have a condition check here, there is no warp divergence
But if we change the condition check as shown here, then the first half of the first warp will execute the if statement and the second half of the first warp will execute the else statement
And this is true for the second warp as well
So if we use the condition in this way, there is warp divergence and performance will be degraded
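The exact conditions on the slides are not reproduced here, but as a hedged sketch, the two kinds of condition check could look like this:

    __global__ void condition_check_examples(float *out)
    {
        int tid = threadIdx.x;
        float a;

        // Decided per warp: tid / 32 is the warp index, so every thread of a
        // given warp takes the same path -- no warp divergence
        if (tid / 32 == 0)
            a = 1.f;
        else
            a = 2.f;

        // Decided within the warp: the first half of each warp takes the if path
        // and the second half takes the else path -- warp divergence
        if (tid % 32 < 16)
            a += 10.f;
        else
            a += 20.f;

        out[blockIdx.x * blockDim.x + tid] = a;
    }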
So be vigilant when you are using condition checks in a kernel
And always try to find a proper condition check which does not induce any warp
divergence
You can measure whether your kernel has divergent code or not by using performance metrics with the nvprof profiling tool
Performance metrics are measurements of different properties of your kernel
And there is a metric called branch efficiency which indicates the presence of divergent code in a kernel
Let's first look at what this branch efficiency metric is
A branch is a path of execution for the threads in a warp
For example, if in the code we have a condition which forces threads with odd thread ids to execute the if block and threads with even thread ids to execute the else block, then there are two branches here:
the if branch and the else branch
If we have code which forces threads within the same warp to execute 3 different paths, then we have 3 branches
And keep in mind, here the threads within a single warp have to follow different paths
Otherwise there is no branch divergence, as we already saw
Branch efficiency is defined as the ratio of non-divergent branches to total branches, and it can be calculated using the following equation:
branch efficiency = 100 * (number of branches - number of divergent branches) / number of branches
For example, if we consider this condition check, half of the warp is going to execute the if block and the other half is going to execute the else block, so we have two branches, and since we consider one of them as the main execution path we have one divergent branch, so the branch efficiency will be (2 - 1) / 2 * 100%, which is 50 percent
Branch efficiency can be measured for a kernel using the nvprof profiling tool
Let's see how this is done
Here, I am going to launch two kernels
Both these kernels will perform similar functionality
But one of them does it using divergent code
So let's add our kernels here
Our first kernel is not going to have any divergent code
So let's name it code_without_divergence
Here, I am going to need the global thread id value, so let me calculate it first
Then all I'm going to do is set values to a couple of float variables based on whether the warp id of that thread is an even value or an odd value
So here let me define two new float variables
Now we need the warp id to perform our task
So let's quickly add a variable called global warp id and calculate its value by dividing the global thread id by the warp size
Now this warp id calculation is a very contrived one
There is no real use for a global warp id value in other cases
But here, for simplicity, I am just using it
OK, now if the warp id is an even value I am going to assign 100 to the A variable and 50 to the B variable
Otherwise let's assign 200 to the A variable and 75 to the B variable
That's it for this kernel
Now, our second kernel is almost similar to this one, so let me copy and paste it here and change its name to divergence_code
OK, here I'm going to assign values to the A and B variables based on whether the thread id is even or not
Remember, in the previous kernel we assigned the values based on the warp id, not the thread id
But in this case we are going to assign the values based on whether the thread id is an even value or not
So we don't need the warp id here
So let me get rid of that code and change the condition check in the "if" statement in this way
As we discussed in the presentation, this condition check will force consecutive threads to take different paths of execution
In our main function, I have already defined the block and grid variables, so we can launch our kernels here
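A rough sketch of what these two kernels might look like, following the narration (the names and assigned values are as described above, but the surrounding details such as the grid size are my own assumptions):

    #include <cuda_runtime.h>

    __global__ void code_without_divergence()
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        float a = 0.f;
        float b = 0.f;

        // Branch decided per warp: all 32 threads of a warp take the same path
        int warp_id = gid / 32;
        if (warp_id % 2 == 0)
        {
            a = 100.f;
            b = 50.f;
        }
        else
        {
            a = 200.f;
            b = 75.f;
        }
    }

    __global__ void divergence_code()
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        float a = 0.f;
        float b = 0.f;

        // Branch decided per thread: even and odd threads of the same warp diverge
        if (gid % 2 == 0)
        {
            a = 100.f;
            b = 50.f;
        }
        else
        {
            a = 200.f;
            b = 75.f;
        }
    }

    int main()
    {
        int size = 1 << 22;
        dim3 block(128);
        dim3 grid((size + block.x - 1) / block.x);

        code_without_divergence<<<grid, block>>>();
        cudaDeviceSynchronize();

        divergence_code<<<grid, block>>>();
        cudaDeviceSynchronize();
        return 0;
    }

Since the results are never stored, an optimizing build may remove the branches entirely, so a test like this is usually compiled with device optimizations disabled (for example with the -G flag) or with the values written out to global memory. The branch efficiency of both kernels can then be measured with something like nvprof --metrics branch_efficiency ./app, where the divergent kernel is expected to report a lower value unless the compiler replaces the branch with predicated instructions.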
***************************
Resource Partitioning and Latency Hiding 1
***************************
In the previous videos I told you that blocks are scheduled in the SM depending on
the resource limitations of that SM
In this video we are going to look at those resource limitations and how exactly
this scheduling works
The local execution context of a warp mainly consists of the following resources: program counters, registers, and shared memory
The execution context of each warp processed by an SM is maintained on-chip during the entire lifetime of the warp
Therefore switching from one execution context to another has no cost
Now this is expected, as GPUs are designed to execute thousands of threads, and if context switching took more time then the overall execution would be delayed
However, this zero-cost context switching is achieved by making GPU threads more lightweight and limiting their capabilities
Out of the above mentioned resources, the number of registers and the amount of shared memory can be directly controlled by the programmer
Each SM has a set of 32-bit registers stored in a register file that are partitioned among threads, and a fixed amount of shared memory that is partitioned among thread blocks
The number of thread blocks and warps that can simultaneously reside on an SM for a given kernel depends on the number of registers and the amount of shared memory available on the SM and required by the kernel
For example, as shown in this diagram, if each thread of a given kernel consumes more registers, fewer warps can reside on an SM
If you can reduce the number of registers a kernel consumes, more warps will be processed simultaneously
Likewise, if a thread block consumes more shared memory, fewer thread blocks are processed simultaneously by an SM
If you can reduce the amount of shared memory used by each block, then more thread blocks can be processed simultaneously
What this means is that resource availability generally limits the number of resident thread blocks per SM
The maximum number of registers and the amount of shared memory per SM vary for devices of different compute capabilities
If there are insufficient registers or shared memory on an SM to process at least one thread block, the kernel launch will fail
This table lists some of the technical limitations, like the maximum number of concurrent blocks that can reside on an SM, the maximum concurrent warps per SM, the number of 32-bit registers per SM, the maximum number of 32-bit registers that can be used by a single thread, and the shared memory per SM, for several GPU microarchitectures
Some of these values can be retrieved with the cudaGetDeviceProperties function, and some of them are listed in the particular microarchitecture's documentation
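A small query like the following (a sketch; which cudaDeviceProp fields you print is up to you) will show several of these per-SM limits for the installed device:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("Device                  : %s\n", prop.name);
        printf("Multiprocessor count    : %d\n", prop.multiProcessorCount);
        printf("Warp size               : %d\n", prop.warpSize);
        printf("Max threads per SM      : %d\n", prop.maxThreadsPerMultiProcessor);
        printf("32-bit registers per SM : %d\n", prop.regsPerMultiprocessor);
        printf("Shared memory per SM    : %zu bytes\n", prop.sharedMemPerMultiprocessor);
        return 0;
    }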
For example, on a device with compute capability 3.0 or higher we have 65536 (64 K) 32-bit registers per SM
And these registers are shared across all the thread blocks running on that particular SM
If you look further you can see that one thread can use at most 255 registers
If we use more than that in our program, the excess registers are going to spill over to a memory called local memory, and that will greatly reduce the performance of the program
So depending on the device compute capability, the amount of resources we can consume is going to change
Also, we can optimize the performance of a kernel depending on the device it is going to execute on
Let's now look at the categorization of blocks and warps
A thread block is called an active block when compute resources, such as registers and shared memory, have been allocated to it
The warps in such an active block are called active warps
Active warps can be further classified into the following three types:
selected warps, stalled warps, and eligible warps
The warp scheduler on an SM selects active warps on each execution cycle and dispatches them to the execution units
A warp that is actively executing is called a selected warp
If an active warp is ready for execution but not currently executing, it is an eligible warp
If a warp is not ready for execution, it is a stalled warp
A warp is eligible for execution if both of the following conditions are met:
32 CUDA cores are available for the execution, and all arguments to the current instruction are ready
The number of active warps at any time from launch to completion, on microarchitectures with compute capability 3.0 or above, must be less than or equal to the architectural limit of 64 concurrent warps
If a warp stalls, the warp scheduler picks an eligible warp to execute in its place
Because compute resources are partitioned among warps and kept on-chip during the entire lifetime of a warp, switching warp contexts does not cost any execution cycles.
***************************
Resource Partitioning and Latency Hiding 2
***************************
***************************
Occupancy
***************************
In this video we are going to discuss one of the key concepts in CUDA programming called occupancy
Occupancy is the ratio of active warps to the allowed maximum number of warps per SM, and it can be calculated using the following equation:
occupancy = active warps per SM / maximum number of warps per SM
On a given CUDA core, instructions are executed sequentially
When one warp stalls because of the latency of an arithmetic or memory operation, the SM switches context to another eligible warp
So ideally, if one warp stalls, we want to have enough warps available to keep the cores occupied
So if the occupancy of a particular kernel is a high value, that means even when one warp stalls its execution there will be other eligible warps, so the cores will always be busy
In this equation, the maximum number of warps per SM can be found either by looking at the device microarchitecture's documentation or by querying the device for the max threads per SM property and dividing it by the warp size, which is 32
On the other hand, the active warp count per SM depends on the available resources of the device and the resource consumption of your kernel
So let me show you an example calculation of this value
Let's assume a kernel uses 48 registers per thread and 4096 bytes of shared memory per block
Here there are two factors which affect the number of active warps on an SM:
register usage and shared memory usage
So here is what we are going to do now
I am going to calculate the allowed warp count considering these resources separately, and the minimum of those two values will be the one which limits the active warps per SM
So let's first find the allowed active warps considering the register usage of our threads
Now here, one thread uses 48 registers
So for a given warp, the register usage will be 48 multiplied by the threads per warp, which is 32, which equals 1536
And if I consider the GTX 970 as the device I run this kernel on, then we have a total of 65536 registers per SM
So we can divide the total register count per SM by the register consumption of a single warp to find the allowed active warps, so that value will be 65536 / 1536 = 42.6
But this is not our final value
There is a concept called the warp allocation granularity
In an SM, usually depending on the available number of warp schedulers, warp allocation happens in groups
For example, if our device has a warp allocation granularity of 4, this means warp allocation happens in groups of 4 warps
We might only need 1 warp to execute a particular kernel
Still, the SM will allocate 4 warps, as its granularity is 4
Now, I have a GTX 970 device here
And it has 4 as its warp allocation granularity
So the SM can only allocate warp counts which are multiples of 4
Therefore in this case the allowed warp count based on register usage will be 40
Now let's calculate the allowed warp count based on the shared memory usage
The GTX 970 has 98304 bytes of shared memory per SM
And this memory will be shared by all the thread blocks on the SM
Here our kernel uses 4096 bytes of shared memory per block
So based on this value, the allowed thread block count will be 98304 / 4096 = 24
And if we consider 128 as our thread block size, which means 4 warps per block, then based on shared memory we can have 96 warps per SM
But remember, for a device with compute capability 5.2 we can have only 64 warps at most per SM
So this means our active warp count per SM is not limited by shared memory usage
So here register usage limits the allowed active warps, and that value is 40
So our occupancy will be 40 / 64, which is about 63% occupancy
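The same arithmetic, written out as a small host-side sketch (the GTX 970 figures are the ones quoted above; check your own device's documentation before reusing them):

    #include <cstdio>
    #include <algorithm>

    int main()
    {
        // Kernel resource usage from the example above
        const int regs_per_thread   = 48;
        const int smem_per_block    = 4096;   // bytes
        const int threads_per_block = 128;

        // Device limits quoted above for a GTX 970 (compute capability 5.2)
        const int regs_per_sm      = 65536;
        const int smem_per_sm      = 98304;   // bytes
        const int max_warps_per_sm = 64;
        const int warp_granularity = 4;

        // Register limit: 65536 / (48 * 32) = 42.6, rounded down to a multiple of 4 -> 40
        int warps_by_regs = regs_per_sm / (regs_per_thread * 32);
        warps_by_regs -= warps_by_regs % warp_granularity;

        // Shared memory limit: (98304 / 4096) blocks * 4 warps per block = 96 -> capped at 64
        int warps_by_smem = std::min((smem_per_sm / smem_per_block) * (threads_per_block / 32),
                                     max_warps_per_sm);

        int active_warps = std::min(warps_by_regs, warps_by_smem);
        printf("occupancy = %d / %d = %.1f%%\n",
               active_warps, max_warps_per_sm, 100.0 * active_warps / max_warps_per_sm);
        return 0;
    }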
Knowing how to calculate the occupancy value is very important to get a good grip on the occupancy concept
But the CUDA toolkit provides a tool called the CUDA occupancy calculator which does this calculation quite easily
So let me show you how to use the CUDA occupancy calculator now
This tool is actually an Excel sheet
This Excel sheet has 4 tabs:
the calculator tab, which we are going to use in a minute; the help tab, which explains how to use the occupancy calculator; the GPU data tab, which contains device query information for devices with different compute capabilities; and of course the copyright tab
In the calculator tab there are different sections highlighted in different colors, and there are some graphs which highlight the correlation between occupancy and varying block size, shared memory usage per block, and registers per thread for a given configuration
In the green section we can set the details of our device
So first select the compute capability of your device
When you select the compute capability, the gray area below will be populated with device information
There you can see the physical limits for a device with the selected compute capability
The second piece of information you have to put here is the shared memory size, which you can select from the drop-down box, or you can fill it in manually by looking at the device limitation section
As the 3rd parameter you can provide the cache mode for execution
For now you can leave it as it is and move on to the next section
We will learn about this parameter when we discuss caching in CUDA
Then in the orange section we have to specify the execution configuration and memory usage of our kernel
You can fill in the threads per block based on your kernel launch parameters
To find out the registers per thread and the shared memory per block, you can compile your program with nvcc using the ptxas options, which will output information about the register, local memory, shared memory, and constant memory usage of each kernel in the given CUDA file
For example, here I have a program called occupancy_test; in this occupancy_test kernel, I am using 8 integer variables, x1 to x8, adding those together and storing the result in global memory
Now let me compile this program with the above given options to the nvcc compiler
So go to the location where you have your CUDA file and run the command shown here
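As a rough sketch, the test kernel and the compile command could look like this (the kernel body follows the description above; the file name, output array, and launch parameters are my own assumptions):

    // occupancy_test.cu
    #include <cuda_runtime.h>

    __global__ void occupancy_test(int *results)
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        int x1 = 1, x2 = 2, x3 = 3, x4 = 4;
        int x5 = 5, x6 = 6, x7 = 7, x8 = 8;

        // Add the eight local variables together and store the result in global memory
        results[gid] = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8;
    }

    int main()
    {
        int *results;
        cudaMalloc(&results, 128 * sizeof(int));
        occupancy_test<<<1, 128>>>(results);
        cudaDeviceSynchronize();
        cudaFree(results);
        return 0;
    }

Compiled, for example, with nvcc --ptxas-options=-v occupancy_test.cu -o occupancy_test (the -Xptxas -v form is equivalent), which makes ptxas report the register, shared memory, and constant memory usage for each kernel.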
So the output will look like this
In this output, on the last line, you can see the register usage of each thread for our kernel
And if there is shared memory usage, that will appear here as well
As you will see in upcoming videos, we can allocate shared memory dynamically as well
If so, you have to explicitly add those values to the shared memory figure as well
So, if I add these values to the occupancy calculator, you can see the GPU occupancy data in the blue colored section
Now let me change these values a bit and observe the change in the outcome when we change the device configuration input or the resource usage of the kernel
Also, down below there is a yellow section which highlights what we obtained from our manual calculation
Here you can see that the maximum thread blocks per multiprocessor has been limited by the registers per multiprocessor, and that the warp value is 40, as we calculated before
Also here you can see the calculated occupancy value, which in this case is 63%
For more information on how to use the occupancy calculator you can refer to the help tab, which includes all the information I just presented to you
But keep in mind that higher occupancy does not necessarily mean higher performance
If a kernel is not bandwidth bound or computation bound, then an increase in occupancy will not necessarily increase the performance
In fact, making changes to increase occupancy can have other effects, such as additional instructions, more register spills to local memory, which is an off-chip memory, and more divergent branches
As with any optimization, you should experiment to see how changes affect the wall clock time of the kernel execution
For bandwidth-bound applications, an increase in occupancy can most probably help to hide the latency of memory accesses and therefore improve performance
OK, with all this knowledge, let me present a few basic points that will help your application to scale better on a GPU
Keep the number of threads per block a multiple of the warp size
Avoid small block sizes, as too few threads per block causes the hardware limit on the number of warps per SM to be reached before all the resources are fully utilized
So start with at least 128 threads per block
Adjust the block size up and down according to the kernel resource requirements
Keep the number of blocks much greater than the number of SMs to expose sufficient parallelism to your device
Conduct experiments to discover the best execution configuration and resource usage
OK, now you know how important occupancy is for the performance of our kernels
But keep in mind that occupancy is not the only factor affecting performance, and after a certain level of occupancy is reached, further increases may not lead to performance improvement
As you progress with this course you will learn about those other factors as well
***************************
Profile Driven Optimization with NVPROF
***************************
***************************
Parallel reduction as synchronization example
***************************
***************************
Parallel reduction as warp divergence example
***************************
And our device will arrange all these paths together and execute the code sequentially, which will result in heavy performance penalties
Unfortunately, our previous parallel reduction implementation also has warp divergence
For example, in this diagram you can see 128 threads belonging to a single thread block; each slot in this table belongs to one thread, and I have written down the thread id for each slot as well
In our previous implementation, in the first iteration only the threads with an even thread id perform any effective work
So only 50% of the threads perform any work, and these threads are spread across all four warps of this thread block as well
In the second iteration only the threads with a thread id which is a multiple of 4 perform any summation
So only a quarter of the threads perform the summation, and those threads are also spread across all the warps
So in every iteration other than the last couple, we have warp divergence in every warp of the thread block
In the final two iterations, only a couple of warps will show warp divergence, since the threads performing the sum are spread across only a couple of warps
So our previous reduction implementation has a lot of divergent code
Luckily, there is more than one way of avoiding warp divergence in this type of kernel
Here, we are going to implement the reduction algorithm in two new ways to avoid warp divergence
As the first solution we will make neighboring threads perform the effective work in each iteration
In the second method we are going to use the interleaved pair approach as a solution for the divergence in the reduction algorithm
Let's look at each of these approaches now
In the first approach we are going to make sure that neighboring threads are the ones performing the summation
For example, in the previous neighbored pair approach, when we performed the algorithm on an 8-element block, in the first iteration only the threads with an even thread id performed the summation
The threads with an odd thread id did not add any values
However, in our new approach, we make sure that neighboring threads perform the summation
So in the corresponding first iteration, the first four consecutive threads will perform the summation
Ok let's see an example of this approach
Let's say we have a data block with 8 elements
And in the first iteration, the first four threads are going to perform the summation
So thread T0 will add the first two elements and store the result back to the first index, thread T1 will add the next 2 elements and store the result back to the second index, and so on
Then, when it comes to the next iteration, consecutive threads will still be the ones executing the summation
In this case T0 and T1
In our previous implementation, T0 and T4 were the ones executing the summation in the second iteration
And in the third iteration only T0 will be the one executing the summation
If we use 128 as the data block size, then our thread block size will also be 128
So in the first iteration the first 64 threads, or the first two warps, will perform the summation and the second two warps will not do anything
But still there is no warp divergence, since within a warp all the threads follow the same path
In the next iteration only the first 32 threads, or the first warp, will perform the summation and the other three warps will not do anything
In the 3rd iteration, however, only the first 16 threads, or half of the first warp, will be performing the summation, so from this iteration onwards there will be warp divergence
But notice, from this iteration onwards we have only one warp with divergence per iteration
But if you remember, in our initial neighbored pair approach, when we considered a 128-element data block, all four warps had warp divergence until the last two iterations
Ok, let's see the implementation now
The main function for this implementation is almost similar to what we have in
the previous neighbored pairs implementation
So I will not go through that
So let's look at our kernel now
In the kernel,
let's set the local thread id value and then calculate the global thread id value first
Then we have to set a local memory pointer with the given offset for the corresponding thread block, so that we can access global memory using this local pointer for this thread block
Then we can have our boundary check
And then we are going to perform the usual iteration while multiplying the offset by two in each iteration
Now we need consecutive threads to perform the summation
For that we calculate an index value for each thread in the block based on the thread id and the offset value
We can use this condition check to limit the threads which are going to perform the summation
And we need all the threads in the block to finish executing one iteration before any of the threads in that block move on to the next one, so here we are going to have a __syncthreads() function call as well
After all the iterations end, the first element in the thread block will hold the partial sum for this thread block, so we have to store it in our partial sum array
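A sketch of this kernel, following the steps just described (the kernel name and the assumption that the array size is a multiple of the block size are mine):

    __global__ void reduction_neighbored_pairs_improved(int *input, int *partial_sum, int size)
    {
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        // Local pointer to this thread block's portion of the input array
        int *i_data = input + blockIdx.x * blockDim.x;

        // Boundary check (assumes size is a multiple of the block size,
        // so all threads of a block exit together)
        if (gid >= size)
            return;

        for (int offset = 1; offset < blockDim.x; offset *= 2)
        {
            // Consecutive thread ids are mapped onto the strided element indices
            int index = 2 * offset * tid;

            if (index < blockDim.x)
                i_data[index] += i_data[index + offset];

            __syncthreads();
        }

        // The first thread stores this block's partial sum
        if (tid == 0)
            partial_sum[blockIdx.x] = i_data[0];
    }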
Now let's run this implementation and check its validity
OK, in the output you can see that it printed out that the GPU and CPU results are the same
So our implementation is a valid one
OK, let's move on to the next way of solving warp divergence, using the interleaved pair approach
In this approach also, we force the summation of elements to happen in consecutive threads
But we are going to start the offset between the elements which are added together in an iteration in reverse order compared to the previous approaches
For example, in the first iteration we will set our offset value to half of the block size
So if we consider a data block with 8 elements, the offset value will be 4, hence in the first iteration the first thread will accumulate the first element and the fifth element of the data block
The second thread will accumulate the second and sixth elements of the data block, and so on
In the second iteration we will divide the offset by half
So for the second iteration the offset value will be 2, hence the first thread will accumulate the first and third elements, and the second thread will accumulate the second and fourth elements of the current data block
Notice here we are performing an in-place reduction, so the output of one iteration will be the input to the next iteration; the first element for the second iteration therefore contains the summation of the first and fifth elements of the original array, and the third element for the second iteration holds the third and seventh elements of the original array
And after the second iteration the first element of the array holds the summation of the first, third, fifth, and seventh elements of the original array
In the last iteration our offset will be one, and the first thread will accumulate the first and second elements of the current data block; after that the iteration stops
Now after this step, the first element of our array contains the summation of all 8 elements of the array
OK, let's see the implementation now
This is almost similar to what we did in the previous implementation
The difference is that we initialize the offset value to half of the block size and then keep dividing it by 2 in each iteration
And in the condition check, since we need consecutive threads to execute the accumulation steps, we check whether the thread id is less than the offset
So in the first iteration only the first half of the thread block will perform the summation
In the next iteration only the first quarter of the threads in the block perform the summation, and so on
And here also we need all the threads in one thread block to execute one iteration before moving on to the next iteration, hence I have used the __syncthreads() function here as well
Then we have to store the summation into the partial sum array as well
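A sketch of the interleaved pair kernel described above (again, the kernel name and the boundary-check assumption are mine):

    __global__ void reduction_interleaved_pairs(int *input, int *partial_sum, int size)
    {
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        // Boundary check (assumes size is a multiple of the block size)
        if (gid >= size)
            return;

        // Start with an offset of half the block size and halve it each iteration
        for (int offset = blockDim.x / 2; offset > 0; offset /= 2)
        {
            // Only the first 'offset' consecutive threads accumulate
            if (tid < offset)
                input[gid] += input[gid + offset];

            __syncthreads();
        }

        // For thread 0, gid points at this block's first element
        if (tid == 0)
            partial_sum[blockIdx.x] = input[gid];
    }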
Now if you run this example, you will see that the GPU and CPU results are the same, so this implementation is also a valid one.
***************************
Parallel reduction with loop unrolling
***************************
In this video we will look at an important concept called unrolling in CUDA programming
Unrolling in CUDA can mean a variety of things
However, the goal of any kind of unrolling is to improve performance by reducing instruction overhead and creating more independent instructions to schedule
Let's first look at a simple unrolling concept called loop unrolling
In loop unrolling, rather than writing the body of a loop once and using the loop to execute it repeatedly, the body is written in the code multiple times
By doing so we can reduce or entirely remove the enclosing loop's iterations
And by reducing or removing iterations of an enclosing loop, we reduce loop maintenance instructions, like the condition checks of the loop, and the number of dependent instructions, since one iteration has to be performed after another
We call the number of copies made of the loop body the loop unrolling factor
For example, here we have a loop we used to accumulate an array on the CPU
If the array has 100 elements, then we need to iterate 100 times with this implementation
However, if we replace this loop with the following one, then within the loop body we accumulate 2 elements, the i-th element and the (i+1)-th element, so the loop will iterate over half of the original size, which in this case is 50
And remember, this kind of loop unrolling is most effective at improving performance for sequential array processing loops where the number of iterations is known prior to the execution of the loop
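A minimal sketch of the two CPU loops being compared (assuming, as in the narration, that the element count is even):

    // Plain loop: one element per iteration, n iterations
    int sum_plain(const int *a, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    // Unrolled by a factor of 2: two elements per iteration, n / 2 iterations
    int sum_unrolled_by_2(const int *a, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i += 2)
            sum += a[i] + a[i + 1];
        return sum;
    }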
The unrolling we are going to apply here is somewhat different from the loop unrolling example in that sequential code
You can refer to the unrolling we are going to perform here as thread block unrolling
But the main purpose, the reduction of instruction overhead, remains the same
In all the previous implementations, we created threads equal to the number of elements in the array
But here we are going to reduce the number of thread blocks we need to perform our parallel reduction
Now here is what we are going to do
Before the partial sum loop in the interleaved pairs kernel, we are going to manually sum up two data blocks into one block
So the number of thread blocks needed for the reduction will be reduced by half
With this simple manual addition, we can reduce the device workload almost by half
So after we make this change, the first and second data blocks will be summed up together by the first thread block, the next two data blocks will be summed up together by the second thread block, and so on
OK, let's implement this approach now
Keep in mind all these reduction implementations are done assuming that we are going to launch a one-dimensional grid
Let me name our kernel reduction_unrolling_blocks2
And it has the same argument list as our previous kernels
To access each single element in each data block I need the local tid value here
So let me assign threadIdx.x to a variable called tid
Now this step is optional,
as you can access this directly using the threadIdx.x variable
But I like to use the "tid" variable
So that's about it
Since we are unrolling two consecutive data blocks using one thread block, each of our thread blocks has to access a data block at twice the distance of the thread block id
So to calculate the element index for each thread based on the data block location, we need a data block offset based on the thread block id
So let me first calculate the data block offset here
Now we can simply add the tid value to the block offset to get the global data element index
And then let me calculate a local data pointer, so that our threads can access the elements for this thread block using that pointer as well
Now we can unroll the two data blocks
So here, if the index is within the size of our array, we can sum up the two elements from the two consecutive data blocks
Now, before doing anything else, we have to make sure that all of the threads in our thread block have finished the thread block unrolling part
We can achieve that by using a __syncthreads() function call here
Then we can perform the partial accumulation step for our unrolled data block in an iterative manner
So here I'm going to use the interleaved pair approach to sum up the data block, so let me copy that code segment from our previous interleaved pair implementation
And that's it for our kernel
The difference here is that we have manually unrolled two data blocks at the beginning of our code
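Putting those steps together, a sketch of the 2-block unrolling kernel could look like this (details may differ slightly from the original source code):

    __global__ void reduction_unrolling_blocks2(int *input, int *partial_sum, int size)
    {
        int tid = threadIdx.x;

        // Each thread block now covers two consecutive data blocks
        int block_offset = blockIdx.x * blockDim.x * 2;
        int index = block_offset + tid;

        // Local pointer to this thread block's (unrolled) portion of the input
        int *i_data = input + block_offset;

        // Manual unrolling: fold the element one data block ahead into this one
        if (index + blockDim.x < size)
            input[index] += input[index + blockDim.x];

        __syncthreads();

        // Interleaved pair accumulation over the unrolled data block
        for (int offset = blockDim.x / 2; offset > 0; offset /= 2)
        {
            if (tid < offset)
                i_data[tid] += i_data[tid + offset];

            __syncthreads();
        }

        if (tid == 0)
            partial_sum[blockIdx.x] = i_data[0];
    }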
Now, what if we want to unroll 4 data blocks instead of 2?
Actually, the implementation is almost similar to this one
All you have to do is change the block offset value here to calculate the offset for four blocks, so here instead of multiplying by 2 we will multiply by 4
So let me quickly add the kernel for that unrolling as well
Let me copy the unrolled two blocks kernel, paste it here, and change the name to reduction_unrolling_blocks4
Then we can change the block offset variable in this way
Then in the unrolling statement, we have to perform 4 unrollings
So here our boundary check will be index + another 3 blocks, because we are going to unroll 4 blocks, so we have to check that those indices are within the size of our array
And then we can assign the values from each consecutive data block to variables and finally add those together
And that's it for the 4-block unrolling kernel
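For reference, a sketch of the 4-block variant, which differs only in the offset and the manual unrolling step (again, details may differ from the original source):

    __global__ void reduction_unrolling_blocks4(int *input, int *partial_sum, int size)
    {
        int tid = threadIdx.x;

        // Each thread block now covers four consecutive data blocks
        int block_offset = blockIdx.x * blockDim.x * 4;
        int index = block_offset + tid;

        int *i_data = input + block_offset;

        // Manual unrolling of four data blocks
        if (index + 3 * blockDim.x < size)
        {
            int a1 = input[index];
            int a2 = input[index + blockDim.x];
            int a3 = input[index + 2 * blockDim.x];
            int a4 = input[index + 3 * blockDim.x];
            input[index] = a1 + a2 + a3 + a4;
        }

        __syncthreads();

        // Interleaved pair accumulation over the unrolled data block
        for (int offset = blockDim.x / 2; offset > 0; offset /= 2)
        {
            if (tid < offset)
                i_data[tid] += i_data[tid + offset];

            __syncthreads();
        }

        if (tid == 0)
            partial_sum[blockIdx.x] = i_data[0];
    }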
If you want to unroll 8 thread blocks, all you have to do is change the block offset variable and perform the unrolling for 8 elements
Then you will have the 8-block unrolled kernel
Now when you launch these kernels in the main function, make sure that you launch the kernel with the correct number of thread blocks
For example, for the 2-block unrolling kernel, we should have half the thread blocks of the previous implementations
So you have to divide the grid size in the X dimension by two
And for the 4-block unrolled kernel, you should have a quarter of the thread blocks of the usual grid, and so on.
***************************
Parallel reduction with warp unrolling
***************************
***************************
Parallel reduction with complete unrolling
***************************
***************************
Performance comparison of reduction kernels
***************************
***************************
CUDA Dynamic Parallelism
***************************
***************************
Reduction with dynamic parallelism
***************************
***************************
Summary
***************************