Lecture 08
Majd F. Sakr
Lecture Motivation
Limits and performance considerations
Which parallel architecture/model is best for your
application
Lecture Outline
Parallel Computing Design Considerations
Limits and Costs of Parallel Computing
Parallel Computing Performance Analysis
Examples of Problems Solved By Parallel Computing
How to Parallelize
Automatic vs. Manual Parallelization
Design Considerations
1. Can the Problem be parallelized?
2. Program’s hotspots & bottlenecks?
3. Partitioning
4. Communications
5. Synchronization
6. Data Dependencies
7. Load Balancing
8. Granularity
9. Input/Output
Programmer Directed:
The programmer uses compiler directives or flags to explicitly tell the compiler how
to parallelize the code.
Parallelism Inhibitors:
Control vs. data dependencies
Examples:
Parallelizable Problem: Multiply each element
of the array by 2
Non‐parallelizable Problem: Fibonacci sequence
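To make the contrast concrete, here is a minimal C sketch (not from the slides; the OpenMP directive is assumed only for illustration). Doubling each element involves no dependence between iterations, so that loop can be split across threads; each Fibonacci value depends on the two previous ones, so that loop must stay serial.

#include <stdio.h>

int main(void) {
    double a[50];
    long fib[50];

    for (int i = 0; i < 50; i++) a[i] = (double)i;

    /* Parallelizable: every iteration is independent, so the work
       can be divided among threads (here via an OpenMP directive). */
    #pragma omp parallel for
    for (int i = 0; i < 50; i++)
        a[i] = a[i] * 2.0;

    /* Not parallelizable: fib[i] depends on fib[i-1] and fib[i-2],
       a loop-carried data dependency that forces serial execution. */
    fib[0] = 0;
    fib[1] = 1;
    for (int i = 2; i < 50; i++)
        fib[i] = fib[i - 1] + fib[i - 2];

    printf("a[49] = %.1f, fib(49) = %ld\n", a[49], fib[49]);
    return 0;
}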
Handling Data Dependencies
Parallelism Slow‐down:
Communications bottleneck
Bottlenecks:
What are they? Slow areas of the program that limit overall performance.
Can we redesign the algorithm to reduce or eliminate bottlenecks?
(3) Partitioning/Decomposition
Dividing the problem into chunks/parts of work that can
be distributed to multiple tasks.
Ways to Partition?
Domain Decomposition
Functional Decomposition
(3) Partitioning/Decomposition
Functional Decomposition
The problem is decomposed according to the work that must be done; each task performs a portion of the overall work.
[Figure: different functions (F1, a loop, ...) assigned to different tasks (Task 1 ... Task 5).]
(3) Partitioning/Decomposition
Domain Decomposition
The data given to the problem is divided into chunks.
Each chunk is given to a task that performs the operation on it.
Tasks run in parallel.
[Figure: A, an array of 50 elements, is split into chunks that are distributed across the tasks.]
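As a sketch of how such a decomposition might be computed (the 50-element array is from the slide; the block distribution and the number of tasks below are assumptions for illustration):

#include <stdio.h>

#define N 50          /* size of array A from the slide */
#define NUM_TASKS 4   /* assumed number of tasks, for illustration */

int main(void) {
    /* Block distribution: each task gets a contiguous chunk; the first
       N % NUM_TASKS tasks get one extra element so all N are covered. */
    int base = N / NUM_TASKS, extra = N % NUM_TASKS;
    for (int t = 0; t < NUM_TASKS; t++) {
        int start = t * base + (t < extra ? t : extra);
        int count = base + (t < extra ? 1 : 0);
        printf("task %d works on A[%d..%d]\n", t, start, start + count - 1);
    }
    return 0;
}

With 4 tasks this prints A[0..12], A[13..25], A[26..37], A[38..49]; other distributions (cyclic, block-cyclic) are equally valid ways to partition the same data.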
(4) Communication
Do we need inter‐task communication?
(4) Communication
When is task communication needed?
Complex problems
The problem is complex; it can be partitioned into tasks, but the tasks need to share data
with each other to complete the computation.
The tasks are tightly coupled.
1. Costs:
Time Costs:
– Time spent waiting to synchronize
– Bandwidth saturation
– Bandwidth vs. latency
Resource Costs:
– Machine cycles and resources
Bandwidth saturation
Competing communication traffic fills the network bandwidth.
Bandwidth vs. Latency
Latency is the time it takes to send a minimum-size message.
Bandwidth is the amount of data that can be transferred per unit of time.
Small messages pay the latency cost every time, so it is better to package many
small messages into one big message.
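As a rough illustration (the numbers are assumed, not from the slides): with a latency of 1 microsecond and a bandwidth of 1 GB/s, sending 1 MB as 1,000 separate 1 KB messages costs about 1,000 x 1 us of latency plus roughly 1,000 us of transfer time, about 2,000 us in total; sending it as a single 1 MB message costs about 1 us + 1,000 us, roughly half as long. This is why packaging small messages into larger ones pays off.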
2. Visibility
Task communication is more visible in some models (e.g., the Message Passing
Model) and is under the programmer's control.
5. Efficiency
Efficiency varies with the application's requirements and the programmer's
decisions. The programmer must decide:
Which factors should have more impact on task communication.
Which implementation to use for the chosen model.
What type of communication to use.
The type and number of network(s).
[Figure: barrier synchronization, tasks (B, C, D, ...) reach the barrier at different times and wait there until all tasks have arrived.]
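A minimal C/OpenMP sketch of the barrier idea (illustrative only; the slides do not prescribe a particular API): each thread finishes its first phase at its own pace, and none starts the second phase until all have reached the barrier.

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel            /* each thread plays the role of one task */
    {
        int id = omp_get_thread_num();
        printf("task %d reached the barrier\n", id);   /* arrival times differ */

        #pragma omp barrier         /* no task continues until all have arrived */

        printf("task %d continues past the barrier\n", id);
    }
    return 0;
}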
[Figure: lock example, process A holds the lock on shared variable x while computing x = a * 2; process B, which needs x for z = x - f (with f = 5), cannot access x until process A unlocks it.]
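A minimal C/pthreads sketch of the same idea (illustrative, not the slide's code; the variable names follow the figure). The mutex plays the role of the lock on x: whichever thread holds it has exclusive access and the other must wait. Note that a plain mutex guarantees only exclusive access, not ordering; ensuring that B reads x only after A has written it would additionally need, for example, a condition variable.

#include <pthread.h>
#include <stdio.h>

static double a = 3.0, f = 5.0, x, z;
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

static void *process_a(void *arg) {
    pthread_mutex_lock(&x_lock);     /* "process A" locks x ...           */
    x = a * 2;
    pthread_mutex_unlock(&x_lock);   /* ... and unblocks it when finished */
    return NULL;
}

static void *process_b(void *arg) {
    pthread_mutex_lock(&x_lock);     /* "process B" waits while A holds x */
    z = x - f;
    pthread_mutex_unlock(&x_lock);
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, process_a, NULL);
    pthread_create(&tb, NULL, process_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("x = %.1f, z = %.1f\n", x, z);
    return 0;
}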
[Figure: synchronous communication between Task A and Task B, coordinated either by a three-way handshake (Syn, Syn-Ack, Ack) or a two-way handshake (Syn, Ack) before the data transfer proceeds.]
[Figure: data dependency in shared memory, two processors (each with its own PC, IR, register file, and control unit) operate on variables held in a shared memory; one computes l = b - a while the other computes m = 2 + d.]
[Figure: load balancing, Functions 1-4 are mapped onto tasks T1-T4.]
(8) Granularity
Computation : Communication ratio
Fine-grain Parallelism:
Small amounts of computation are done between communication events.
Makes load balancing easier.
High communication overhead.
Small opportunity to enhance performance.
If the grain is too fine, communication overhead takes longer than the computation itself.
Coarse-grain Parallelism:
Large amounts of computation are done between communication events.
Large opportunity to enhance performance.
Harder to load balance efficiently.
Which is better?
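A quick worked example with assumed numbers: if each task computes for 10 ms between communication events that cost 1 ms, the computation-to-communication ratio is 10:1 and only about 9% of the time is overhead, fairly coarse-grained. If each task computes for only 0.5 ms between the same 1 ms communication events, roughly two thirds of the time is spent communicating and the parallelism is too fine-grained to pay off. In most cases coarse-grained decomposition is preferable, as long as the load can still be balanced.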
Lecture Outline
Parallel Computing Design Considerations
Limits and Costs of Parallel Computing
Parallel Computing Performance Analysis
Examples of Problems Solved By Parallel Computing
Portability
Thanks to standardization, portability issues in parallel programs are not as serious
as they used to be. Still:
All of the usual portability issues in serial programs apply to parallel ones.
Even with standardized APIs, implementation differences exist that require code modifications.
Operating systems cause many portability issues.
Hardware architectures are highly variable, which affects portability.
Memory Requirements
The memory required for parallel code is much greater than for the corresponding serial code, due to:
Data replication.
Overheads caused by using support libraries and subsystems.
Communication overhead.
Lecture Outline
Parallel Computing Design Considerations
Limits and Costs of Parallel Computing
Parallel Computing Performance Analysis
Examples of Problems Solved By Parallel Computing
Amdahl’s Law
Used to find the maximum expected improvement in an
entire application when only one part of it is improved.
https://computing.llnl.gov/tutorials/parallel_comp/
Amdahl's Law is used to predict the maximum speedup achievable when using
multiple processors in parallel.
https://computing.llnl.gov/tutorials/parallel_comp/
If only part of a program can be run in parallel, the program's speedup is limited
by the time required by the sequential part of the program:

speedup = 1 / (P/N + S)

where
P = parallel fraction
N = number of processors
S = serial fraction (S = 1 - P)
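A worked example of the formula: with P = 0.9 (90% of the work parallelizable, so S = 0.1), N = 10 processors give a speedup of 1 / (0.9/10 + 0.1) = about 5.3, N = 100 give 1 / (0.9/100 + 0.1) = about 9.2, and no matter how many processors are added the speedup can never exceed 1/S = 10.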
Scalability Limitation:
Problems in which the parallel fraction increases as the problem size increases are more
scalable than those with a fixed parallel fraction.
Function F1 F2 F3 F4
Parallelizing which part of the application would most improve the performance?
Assume that parts F1, F3, and F4 can all be parallelized, but F2 must be done serially.
What is the best performance speedup that could be reached by parallelizing those
parts?
Function F1 F2 F3 F4
Parallelizing which part of the application would most improve the performance?
F3
Assume that parts F1, F3, and F4 can all be parallelized, but F2 must be done serially.
What is the best performance speedup that could be reached by parallelizing those
parts? 10 times faster
Function F1 F2 F3 F4
Suppose that running the whole application requires 50 ms. From the last question,
what is the best running time that we can reach?
Function F1 F2 F3 F4
Suppose that running the whole application requires 50 ms. From the last question,
what is the best running time that we can reach?
Max speedup is 10, best running time is 5 ms
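Working backwards with Amdahl's Law: a maximum speedup of 10 implies a serial fraction S = 1/10, so the serial part F2 accounts for 5 ms of the 50 ms run. Even with F1, F3, and F4 spread over arbitrarily many processors, the program can never finish faster than that 5 ms serial portion, which is the best running time quoted above.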
Lecture Outline
Parallel Computing Design Considerations
Limits and Costs of Parallel Computing
Parallel Computing Performance Analysis
Examples of Problems Solved By Parallel Computing
Parallelization Examples
Array Processing
Apply the same operation or function to each element of the array.
The elements are independent; no communication is needed.
Embarrassingly parallel!
Divide the array into smaller chunks and distribute them over
processors so the operations can be done in parallel.
PI Calculation
Discussed previously in the Demo!
Parallelization Examples
Array Processing
Apply the same operation or function to each element of the array.
The elements are independent; no communication is needed.
Embarrassingly parallel!
Divide the array into smaller chunks and distribute them over
processors so the operations can be done in parallel.
do j = 1,n
  do i = 1,n
    a(i,j) = fcn(i,j)
  end do
end do
if I am MASTER
  initialize the array
  send each WORKER info on the part of the array it owns
  send each WORKER its portion of the initial array
  receive results from each WORKER
else if I am WORKER
  receive from MASTER info and my portion of the initial array
  apply the function to each element of my portion
  send MASTER my results
endif
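The pseudocode above could be realized in several ways; below is a minimal C sketch using MPI collectives (an illustration under assumptions, not the slide's implementation: it uses a 1-D array of 50 elements, a stand-in fcn that doubles each element, and a scatter/gather instead of explicit per-worker sends).

#include <mpi.h>
#include <stdio.h>

#define N 50                        /* total array size, as in the earlier example */

static double fcn(double v) { return 2.0 * v; }   /* hypothetical per-element operation */

int main(int argc, char **argv) {
    int rank, size;
    double a[N], chunk[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk_size = N / size;      /* assumes N is divisible by the number of processes */

    if (rank == 0)                  /* MASTER initializes the array */
        for (int i = 0; i < N; i++) a[i] = (double)i;

    /* distribute equal chunks, compute locally, then gather the results */
    MPI_Scatter(a, chunk_size, MPI_DOUBLE, chunk, chunk_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    for (int i = 0; i < chunk_size; i++)
        chunk[i] = fcn(chunk[i]);
    MPI_Gather(chunk, chunk_size, MPI_DOUBLE, a, chunk_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("a[%d] = %.1f after the update\n", N - 1, a[N - 1]);

    MPI_Finalize();
    return 0;
}

Run it with a process count that divides 50 evenly (for example, mpirun -np 5 ./a.out).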
The initial temperature is zero on the boundaries and high in the middle. Boundary
temperature is held at zero.
Serial Code:
do iy = 2, ny - 1
  do ix = 2, nx - 1
    u2(ix,iy) = u1(ix,iy) + cx*(u1(ix+1,iy) + u1(ix-1,iy) - 2.0*u1(ix,iy)) &
                          + cy*(u1(ix,iy+1) + u1(ix,iy-1) - 2.0*u1(ix,iy))
  end do
end do
Master process sends initial info to worker processes, checks whether convergence
is reached, and collects results.
if I am MASTER
  initialize array
  send each WORKER starting info and subarray
  check for convergence and collect results from the WORKERS
else if I am WORKER
  receive from MASTER starting info and subarray
  update my portion of the array, exchanging border data with neighboring WORKERS, until the solution converges
  send MASTER my results
endif
References
http://en.wikipedia.org/wiki/Parallel_programming_model
http://www.buyya.com/cluster/v2chap1.pdf
https://computing.llnl.gov/tutorials/parallel_comp/
http://www.acm.org/crossroads/xrds8-3/programming.html
http://researchcomp.stanford.edu/hpc/archives/HPCparallel.pdf
http://www.mcs.anl.gov/~itf/dbpp/text/node9.html
http://en.wikipedia.org/wiki/Parallel_computing
http://www.azalisaudi.com/para/Para-Week2-TypesOfPara.pdf