BCS702 Module 2 Textbook
MODULE 2
or, at best, unpredictable performance, since each time a running process accessed
memory, it might access local memory—that is, memory belonging to the core on
which it was executing—or remote memory, memory belonging to another core. Ac-
cessing remote memory can take hundreds or even thousands of times longer than
accessing local memory. As an example, consider the following pseudocode for a
shared-memory vector addition:
shared int n = . . . ;
shared double x[n], y[n];
private int i, my_first_element, my_last_element;

my_first_element = . . . ;
my_last_element = . . . ;

/* Initialize x and y */
. . .

/* Each process adds its assigned elements */
for (i = my_first_element; i <= my_last_element; i++)
   x[i] += y[i];
We first declare two shared arrays. Then, on the basis of the process’s rank, we de-
termine which elements of the array “belong” to which process. After initializing the
arrays, each process adds its assigned elements. If the assigned elements of x and y
have been allocated so that the elements assigned to each process are in the memory
attached to the core the process is running on, then this code should be very fast.
However, if, for example, all of x is assigned to core 0 and all of y is assigned to
core 1, then the performance is likely to be terrible, since each time the assignment
x[i] += y[i] is executed, the process will need to refer to remote memory.
Partitioned global address space, or PGAS, languages provide some of the
mechanisms of shared-memory programs. However, they provide the programmer
with tools to avoid the problem we just discussed. Private variables are allocated in
the local memory of the core on which the process is executing, and the distribution
of the data in shared data structures is controlled by the programmer. So, for example,
she knows which elements of a shared array are in which process’s local memory.
There are several projects currently working on the development of PGAS lan-
guages. See, for example, [8,54].
The GPU itself will have one or more processors. Each of these processors is
capable of running hundreds or thousands of threads. In the systems we’ll be using,
the processors share a large block of memory, but each individual processor has a
small block of much faster memory that can only be accessed by threads running on
that processor. These blocks of faster memory can be thought of as a programmer-
managed cache.
The threads running on a processor are typically divided into groups: the threads
within a group use the SIMD model, and two threads in different groups can run
independently. The threads in a SIMD group may not run in lockstep. That is, they
may not all execute the same instruction at the same time. However, no thread in
the group will execute the next instruction until all the threads in the group have
completed executing the current instruction. If the threads in a group are executing a
branch, it may be necessary to idle some of the threads. For example, suppose there
are 32 threads in a SIMD group, and each thread has a private variable rank_in_gp
that ranges from 0 to 31. Suppose also that the threads are executing the following
code:
// Thread private variables
int rank_in_gp, my_x;
...
if (rank_in_gp < 16)
   my_x += 1;
else
   my_x -= 1;
Then the threads with rank < 16 will execute the first assignment, while the threads
with rank ≥ 16 are idle. After the threads with rank < 16 are done, the roles will be re-
versed: the threads with rank < 16 will be idle, while the threads with rank ≥ 16 will
execute the second assignment. Of course, idling half the threads for two instructions
isn’t a very efficient use of the available resources. So it’s up to the programmer to
minimize branching, where the threads within a SIMD group take different branches.
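When the branch is this simple, one common workaround (our own sketch, not something prescribed by the text) is to replace the if/else with arithmetic on the result of the comparison, so that every thread in the group executes the same instruction stream and none has to idle:

// One possible branch-free rewrite of the if/else above. In C the
// comparison (rank_in_gp < 16) evaluates to 1 for the lower half of the
// group and to 0 for the upper half, so the update is +1 or -1 without
// a branch.
my_x += 2 * (rank_in_gp < 16) - 1;

Whether this actually pays off depends on the compiler and the hardware, but it illustrates the kind of restructuring that reduces divergence.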
Another issue in GPU programming that’s different from CPU programming is
how the threads are scheduled to execute. GPUs use a hardware scheduler (unlike
CPUs, which use software to schedule threads and processes), and this hardware
scheduler incurs very little overhead. However, the scheduler will only issue an
instruction when all of the threads in the SIMD group are ready to execute it. In the preceding
example, each thread must have loaded its value of rank_in_gp into a register before
the group can execute the test. So, to maximize use of the hardware, we usually create
a large number of SIMD groups. When this is the case, groups that aren’t ready to
execute (e.g., they’re waiting for data from memory, or waiting for the completion of
a previous instruction) can be idled, and the scheduler can choose a SIMD group that
is ready.
To partially address these issues, we’ll be making these assumptions and follow-
ing these rules when our parallel programs need to do I/O:
• In distributed-memory programs, only process 0 will access stdin. In shared-
memory programs, only the master thread or thread 0 will access stdin.
• In both distributed-memory and shared-memory programs, all the
processes/threads can access stdout and stderr.
• However, because of the nondeterministic order of output to stdout, in most cases
only a single process/thread will be used for all output to stdout. The principal
exception will be output for debugging a program. In this situation, we’ll often
have multiple processes/threads writing to stdout.
• Only a single process/thread will attempt to access any single file other than stdin,
stdout, or stderr. So, for example, each process/thread can open its own, private
file for reading or writing, but no two processes/threads will open the same file.
• Debug output should always include the rank or ID of the process/thread that's
generating the output (see the sketch that follows this list).
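As an illustration of the first and last of these rules, here is a minimal sketch in C using MPI (our own example, not code from the text): process 0 reads a value from stdin and broadcasts it to the other processes, and every process tags its debug output with its rank.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
   int my_rank, comm_sz, n = 0;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

   /* Only process 0 accesses stdin ... */
   if (my_rank == 0) {
      printf("Enter n:\n");
      scanf("%d", &n);
   }
   /* ... and then sends the value to the other processes. */
   MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

   /* Debug output includes the rank of the process that generated it. */
   fprintf(stderr, "Proc %d of %d: n = %d\n", my_rank, comm_sz, n);

   MPI_Finalize();
   return 0;
}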
2.5.2 GPUs
In most cases, the host code in our GPU programs will carry out all I/O. Since we’ll
only be running one process/thread on the host, the standard C I/O functions should
behave as they do in ordinary serial C programs.
The exception to the rule that we use the host for I/O is that when we are debug-
ging our GPU code, we’ll want to be able to write to stdout and/or stderr. In the
systems we use, each thread can write to stdout, and, as with MIMD programs, the
order of the output is nondeterministic. Also in the systems we use, no GPU thread
has access to stderr, stdin, or secondary storage.
2.6 Performance
Of course our main purpose in writing parallel programs is usually increased per-
formance. So what can we expect? And how can we evaluate our programs? In this
section, we’ll start by looking at the performance of homogeneous MIMD systems.
So we’ll assume that all of the cores have the same architecture. Since this is not the
case for GPUs, we’ll talk about the performance of GPUs in a separate subsection.
If we run our parallel program with p cores, one process or thread on each core, and we
denote the serial run-time by Tserial and the parallel run-time by Tparallel, then the best
we can hope for is that the run-time of our parallel program is Tparallel = Tserial/p. When this happens, we say that our
parallel program has linear speedup.
In practice, we usually don’t get perfect linear speedup, because the use of mul-
tiple processes/threads almost invariably introduces some overhead. For example,
shared-memory programs will almost always have critical sections, which will re-
quire that we use some mutual exclusion mechanism, such as a mutex. The calls to
the mutex functions are the overhead that’s not present in the serial program, and
the use of the mutex forces the parallel program to serialize execution of the criti-
cal section. Distributed-memory programs will almost always need to transmit data
across the network, which is usually much slower than local memory access. Serial
programs, on the other hand, won’t have these overheads. Thus it will be unusual
for us to find that our parallel programs get linear speedup. Furthermore, it’s likely
that the overheads will increase as we increase the number of processes or threads.
For example, more threads will probably mean more threads need to access a critical
section, and more processes will probably mean more data needs to be transmitted
across the network.
So if we define the speedup of a parallel program to be
$$ S = \frac{T_{\text{serial}}}{T_{\text{parallel}}}, $$
then linear speedup has S = p. Furthermore, since as p increases we expect the par-
allel overhead to increase, we also expect S to become a smaller and smaller fraction
of the ideal, linear speedup p. Another way of saying this is that S/p will probably
get smaller and smaller as p increases. Table 2.4 shows an example of the changes in
S and S/p as p increases (these data are taken from Chapter 3; see Tables 3.6 and 3.7). This value, S/p, is sometimes called the efficiency of the
parallel program. If we substitute the formula for S, we see that the efficiency is
$$ E = \frac{S}{p} = \frac{T_{\text{serial}}/T_{\text{parallel}}}{p} = \frac{T_{\text{serial}}}{p \cdot T_{\text{parallel}}}. $$
If the serial run-time has been taken on the same type of core that the parallel
system is using, we can think of efficiency as the average utilization of the parallel
cores on solving the problem. That is, the efficiency can be thought of as the fraction
of the parallel run-time that’s spent, on average, by each core working on solving
the original problem. The remainder of the parallel run-time is the parallel overhead.
This can be seen by simply multiplying the efficiency and the parallel run-time:
$$ E \cdot T_{\text{parallel}} = \frac{T_{\text{serial}}}{p \cdot T_{\text{parallel}}} \cdot T_{\text{parallel}} = \frac{T_{\text{serial}}}{p}. $$
For example, suppose we have Tserial = 24 ms, p = 8, and Tparallel = 4 ms. Then
$$ E = \frac{24}{8 \cdot 4} = \frac{3}{4}, $$
and, on average, each process/thread spends 3/4 · 4 = 3 ms on solving the original
problem, and 4 − 3 = 1 ms in parallel overhead.
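The arithmetic here is easy to check by hand, but the following small C program (our own sketch, with the example's numbers hard-coded) evaluates the speedup and efficiency formulas and splits the parallel run-time into useful work and overhead:

#include <stdio.h>

int main(void) {
   double t_serial   = 24.0;  /* serial run-time in ms   */
   double t_parallel =  4.0;  /* parallel run-time in ms */
   int    p          =  8;    /* number of cores         */

   double speedup    = t_serial / t_parallel;    /* S = 6                        */
   double efficiency = speedup / p;              /* E = S/p = 0.75               */
   double useful     = efficiency * t_parallel;  /* 3 ms on the original problem */
   double overhead   = t_parallel - useful;      /* 1 ms of parallel overhead    */

   printf("S = %.2f, E = %.2f\n", speedup, efficiency);
   printf("Per core: %.1f ms useful work, %.1f ms overhead\n",
          useful, overhead);
   return 0;
}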
Many parallel programs are developed by explicitly dividing the work of the serial
program among the processes/threads and adding in the necessary “parallel over-
head,” such as mutual exclusion or communication. Therefore if Toverhead denotes
this parallel overhead, it's often the case that
$$ T_{\text{parallel}} = \frac{T_{\text{serial}}}{p} + T_{\text{overhead}}. $$
When this formula applies, it’s clear that the parallel efficiency is just the fraction
of the parallel run-time spent by the parallel program in the original problem, be-
cause the formula divides the parallel run-time into the time on the original problem,
Tserial /p, and the time spent in parallel overhead, Toverhead .
We’ve already seen that Tparallel , S, and E depend on p, the number of processes or
threads. We also need to keep in mind that Tparallel , S, E, and Tserial all depend on the
problem size. For example, if we halve and double the problem size of the program,
whose speedups are shown in Table 2.4, we get the speedups and efficiencies shown
in Table 2.5. The speedups are plotted in Fig. 2.18, and the efficiencies are plotted in
Fig. 2.19.
We see that in this example, when we increase the problem size, the speedups and
the efficiencies increase, while they decrease when we decrease the problem size.
FIGURE 2.18 Speedups of parallel program on different problem sizes.
FIGURE 2.19 Efficiencies of parallel program on different problem sizes.
This behavior is quite common, because in many parallel programs, as the problem
size is increased but the number of processes/threads is fixed, the parallel overhead
grows much more slowly than the time spent in solving the original problem. That is,
if we think of Tserial and Toverhead as functions of the problem size, Tserial grows much
faster as the problem size is increased. Exercise 2.15 goes into more detail.
A final issue to consider is what values of Tserial should be used when reporting
speedups and efficiencies. Some authors say that Tserial should be the run-time of the
fastest program on the fastest processor available. However, since it’s often the case
that we think of efficiency as the utilization of the cores on the parallel system, in
practice, most authors use the serial program on which the parallel program was based,
and run it on a single processor of the parallel system. So if we were studying the
performance of a parallel shell sort program, authors in the first group might use a
serial radix sort or quicksort on a single core of the fastest system available, while
authors in the second group would use a serial shell sort on a single processor of the
parallel system. We’ll generally use the second approach.
Suppose, for example, that we're able to parallelize 90% of a serial program and that the
parallelization is “perfect,” so that the parallelized part runs p times faster on p cores. If
the serial run-time is Tserial = 20, then no matter how many cores we use, the unparallelized
part still takes 0.1 × Tserial = 2, the total parallel run-time is
Tparallel = 0.9 × Tserial/p + 0.1 × Tserial = 18/p + 2, and the speedup is
$$ S = \frac{T_{\text{serial}}}{0.9 \times T_{\text{serial}}/p + 0.1 \times T_{\text{serial}}} = \frac{20}{18/p + 2}. $$
Now as p gets larger and larger, 0.9 × Tserial /p = 18/p gets closer and closer to 0,
so the total parallel run-time can’t be smaller than 0.1 × Tserial = 2. That is, the de-
nominator in S can’t be smaller than 0.1 × Tserial = 2. The fraction S must therefore
satisfy the inequality
Tserial 20
S≤ = = 10.
0.1 × Tserial 2
That is, S ≤ 10. This is saying that even though we’ve done a perfect job in paral-
lelizing 90% of the program, and even if we have, say, 1000 cores, we’ll never get a
speedup better than 10.
More generally, if a fraction r of our serial program remains unparallelized,
then Amdahl’s law says we can’t get a speedup better than 1/r. In our example,
r = 1 − 0.9 = 1/10, so we couldn’t get a speedup better than 10. Therefore if a
fraction r of our serial program is “inherently serial,” that is, cannot possibly be par-
allelized, then we can’t possibly get a speedup better than 1/r. Thus even if r is quite
small—say, 1/100—and we have a system with thousands of cores, we can’t possibly
get a speedup better than 100.
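To see where the bound comes from, here is the one-line derivation, using the same notation as above and assuming the parallelizable fraction 1 − r is parallelized perfectly:

$$ T_{\text{parallel}} = (1 - r)\,\frac{T_{\text{serial}}}{p} + r\,T_{\text{serial}} \;\ge\; r\,T_{\text{serial}}, \qquad\text{so}\qquad S = \frac{T_{\text{serial}}}{T_{\text{parallel}}} \le \frac{T_{\text{serial}}}{r\,T_{\text{serial}}} = \frac{1}{r}. $$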
This is pretty daunting. Should we give up and go home? Well, no. There are
several reasons not to be too worried by Amdahl’s law. First, it doesn’t take into con-
sideration the problem size. For many problems, as we increase the problem size,
the “inherently serial” fraction of the program decreases in size; a more mathemat-
ical version of this statement is known as Gustafson’s law [25]. Second, there are
thousands of programs used by scientists and engineers that routinely obtain huge
speedups on large distributed-memory systems. Finally, is a small speedup so awful?
In many cases, obtaining a speedup of 5 or 10 is more than adequate, especially if the
effort involved in developing the parallel program wasn’t very large.
Very roughly, a parallel program is scalable if, when we increase the number of
processes/threads, we can increase the problem size at some rate so that the efficiency
stays fixed. Suppose, for example, that Tserial = n, where n is also the problem size,
and that Tparallel = n/p + 1, so that the parallel overhead is 1. Then the efficiency is
$$ E = \frac{n}{p\,(n/p + 1)} = \frac{n}{n + p}. $$
To see whether the program is scalable, suppose we increase the number of
processes/threads by a factor of k and the problem size by a factor of x, so that the
efficiency becomes xn/(xn + kp). If we take x = k, then
$$ \frac{xn}{xn + kp} = \frac{kn}{k(n + p)} = \frac{n}{n + p}, $$
and the efficiency is unchanged.
In other words, if we increase the problem size at the same rate that we increase the
number of processes/threads, then the efficiency will be unchanged, and our program
is scalable.
There are a couple of cases that have special names. If, when we increase the
number of processes/threads, we can keep the efficiency fixed without increasing the
problem size, the program is said to be strongly scalable. If we can keep the efficiency
fixed by increasing the problem size at the same rate as we increase the number of
processes/threads, then the program is said to be weakly scalable. The program in our
example would be weakly scalable.
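As a concrete check, here is a small C sketch (our own, with arbitrarily chosen values of n) that tabulates E = n/(n + p) for this example. Reading down a column (fixed n, increasing p) shows that the program is not strongly scalable, while reading down the diagonal (doubling n whenever p doubles) shows the constant efficiency that makes it weakly scalable:

#include <stdio.h>

int main(void) {
   int ps[] = {1, 2, 4, 8, 16};           /* numbers of processes/threads */
   int ns[] = {100, 200, 400, 800, 1600}; /* problem sizes */

   /* Header row: problem sizes */
   printf("      ");
   for (int j = 0; j < 5; j++) printf("%7d", ns[j]);
   printf("\n");

   /* One row per value of p, with E = n/(n + p) in each column */
   for (int i = 0; i < 5; i++) {
      printf("p=%-4d", ps[i]);
      for (int j = 0; j < 5; j++)
         printf("%7.3f", (double) ns[j] / (ns[j] + ps[i]));
      printf("\n");
   }
   return 0;
}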
wouldn’t be counted as CPU time, since no function that’s been called by the process
is active. However, it should count in our evaluation of the overall run-time, since it
may be a real cost in our program. If each time the program is run, the process has
to wait, ignoring the time it spends waiting would give a misleading picture of the
actual run-time of the program.
Thus when you see an article reporting the run-time of a parallel program, the
reported time is usually “wall clock” time. That is, the authors of the article report
the time that has elapsed between the start and finish of execution of the code that the
user is interested in. If the user could see the execution of the program, she would
hit the start button on her stopwatch when it begins execution and hit the stop button
when it stops execution. Of course, she can’t see her code executing, but she can
modify the source code so that it looks something like this:
double start, finish;
. . .
start = Get_current_time();
/* Code that we want to time */
. . .
finish = Get_current_time();
printf("The elapsed time = %e seconds\n", finish - start);
The function Get_current_time() is a hypothetical function that’s supposed to return
the number of seconds that have elapsed since some fixed time in the past. It’s just
a placeholder. The actual function that is used will depend on the API. For example,
MPI has a function MPI_Wtime that could be used here, and the OpenMP API for
shared-memory programming has a function omp_get_wtime. Both functions return
wall clock time instead of CPU time.
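For example, with MPI the timing idiom above might be written as in the following sketch (the code being timed is elided, as in the text):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
   double start, finish;

   MPI_Init(&argc, &argv);

   start = MPI_Wtime();   /* wall clock time, in seconds */
   /* Code that we want to time */
   finish = MPI_Wtime();

   printf("The elapsed time = %e seconds\n", finish - start);

   MPI_Finalize();
   return 0;
}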
There may be an issue with the resolution of the timer function. The resolu-
tion is the unit of measurement on the timer. It’s the duration of the shortest event
that can have a nonzero time. Some timer functions have resolutions in milliseconds
(10−3 seconds), and when instructions can take times that are less than a nanosecond
(10−9 seconds), a program may have to execute millions of instructions before the
timer reports a nonzero time. Many APIs provide a function that reports the resolu-
tion of the timer. Other APIs specify that a timer must have a given resolution. In
either case we, as the programmers, need to check these values.
When we’re timing parallel programs, we need to be a little more careful about
how the timings are taken. In our example, the code that we want to time is probably
being executed by multiple processes or threads, and our original timing will result
in the output of p elapsed times:
private double start, finish;
. . .
start = Get_current_time();
/* Code that we want to time */
. . .
finish = Get_current_time();
printf("The elapsed time = %e seconds\n", finish - start);
However, what we’re usually interested in is a single time: the time that has elapsed
from when the first process/thread began execution of the code to the time the last
process/thread finished execution of the code. We often can’t obtain this exactly, since
there may not be any correspondence between the clock on one node and the clock
on another node. We usually settle for a compromise that looks something like this:
shared double global_elapsed;
private double my_start, my_finish, my_elapsed;
. . .
/* Synchronize all processes/threads */
Barrier();
my_start = Get_current_time();
/* Code that we want to time */
. . .
my_finish = Get_current_time();
my_elapsed = my_finish - my_start;
/* Find the max across all processes/threads */
global_elapsed = Global_max(my_elapsed);
if (my_rank == 0)
   printf("The elapsed time = %e seconds\n", global_elapsed);
Here, we first execute a barrier function that approximately synchronizes all of the
processes/threads. We would like for all the processes/threads to return from the
call simultaneously, but such a function usually can only guarantee that all the pro-
cesses/threads have started the call when the first process/thread returns. We then
execute the code as before, and each process/thread finds the time it took. Then all
the processes/threads call a global maximum function, which returns the largest of
the elapsed times, and process/thread 0 prints it out.
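With MPI, for example, this compromise might be realized roughly as follows (our own sketch: MPI_Barrier plays the role of the barrier function, and MPI_Reduce with the MPI_MAX operator plays the role of Global_max):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
   int my_rank;
   double my_start, my_finish, my_elapsed, global_elapsed;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   MPI_Barrier(MPI_COMM_WORLD);   /* approximately synchronize the processes */
   my_start = MPI_Wtime();
   /* Code that we want to time */
   my_finish = MPI_Wtime();
   my_elapsed = my_finish - my_start;

   /* Find the max elapsed time across all processes; process 0 gets the result */
   MPI_Reduce(&my_elapsed, &global_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0,
              MPI_COMM_WORLD);
   if (my_rank == 0)
      printf("The elapsed time = %e seconds\n", global_elapsed);

   MPI_Finalize();
   return 0;
}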
We also need to be aware of the variability in timings. When we run a program
several times, it’s extremely likely that the elapsed time will be different for each run.
This will be true, even if each time we run the program we use the same input and
the same systems. It might seem that the best way to deal with this would be to report
either a mean or a median run-time. However, outside events, such as other jobs running
on the system, can only slow our program down; it's very unlikely that an outside event
could make our program run faster than its best possible run-time. So instead
of reporting the mean or median time, we usually report the minimum time.
Running more than one thread per core can cause dramatic increases in the vari-
ability of timings. More importantly, if we run more than one thread per core, the
system will have to take extra time to schedule and deschedule the threads on the cores, and this will
add to the overall run-time. Therefore we rarely run more than one thread per core.
Finally, as a practical matter, since our programs won’t be designed for high-
performance I/O, we’ll usually not include I/O in our reported run-times.